Embracing Risk Responsibly: Moving beyond inflexible SLAs and exception hell

Name: Embracing Risk Responsibly: Moving beyond inflexible SLAs and exception hell
Uploaded: 2022-07-06
Duration: 47 min 24 s

BSidesSF 2022 · 47:24 · 2.5K views · Published 2022-07

Speakers

Eric Ellett

Tags

Category: Technical

Topic: DevSecOps

Style: Talk

About this talk

Eric Ellett - Embracing Risk Responsibly: Moving beyond inflexible SLAs and exception hell by treating security vulnerabilities and risk like actual debt. At Segment, we were sick of having breached SLAs; we were tired of a junk drawer of exceptions that continued to grow without bound. Two years ago we decided to move beyond inflexible SLAs and permanent exceptions to enable our business to “Embrace Risk Responsibly” by treating vulnerabilities like debt. Sched: https://bsidessf2022.sched.com/event/rjpD/embracing-risk-responsibly-moving-beyond-inflexible-slas-and-exception-hell-by-treating-security-vulnerabilities-and-risk-like-actual-debt

Transcript [en]

Next up we have Eric Ellett, Senior Director of R&D Security at Twilio. He's going to talk to us about embracing risk responsibly. [Music] Hello, thank you. All right, we got the slides up here. Thank you for joining me on my talk on embracing risk responsibly. Here's my intro slide. My name is Eric Ellett. I have a dog named Marty Mart, short for Marty McFly; this is him and his dopey smile here. I was born in Indiana, worked and lived in DC, worked and lived in San Francisco, and now I still work in San Francisco but live in Chicago. I lead effectively what's called a security engineering function over at Twilio, and what that means is I'm

responsible for the technical security outcomes, with the exception of CERT and incident response. I want to tell you my intent up front, just so you know what I want you to get out of this talk. The goal I have is to share a pattern that I've used over the years to develop a capability from a "this is fine" kind of dumpster fire to something that I feel we can be proud of as an org, give talks about, and also blog about. And I'm actually going to do that today, specifically with a technique that we've developed for vulnerability management using debt. So you might ask, okay, well, why is that

in your intent? I want you to enjoy this talk, and I realized when I look at the talks that I've really enjoyed, they're the ones that show how the sausage is made. You see the cool parts, but also how you got there. They also teach by real-world examples, and I learn best by real-world examples. I also want to note that everyone has a different environment that they're operating in. We're solving similar problems, and it's great to see what your current state is, where you ended up, and how you navigated that environment. And selfishly, I want to start a conversation with you all

about how you handle these hard problems. So please feel free to ask questions, or if you see me at the conference and you have a cool way of doing vuln management, please talk to me, or anything really. I always love to hear how other people are doing things. So, how I'm going to do it: I'm going to break this pattern down into three stages, so you'll see the emoji one, two, three. For each stage I'll tell you what it is, why it's important, and what the goals are, and then I'll tell you a story of what this stage looks like in practice. I have a little mini map up here in the

top right just to keep track of where we are, so depending on what stage we're in, you'll see that up there. And when we're in story time, you'll see the little book emoji here. So there are three stages for this pattern: stop the bleeding and breathe, rebuild the foundation with data, and experiment strategically. Let's start off with a little story time. It's 2018. I just joined a company named Segment. I have a co-worker who has a nifty name, Leif; you might have seen him earlier on this stage. And literally my first day, I think it was two hours into my first day, I got my laptop, I got into Okta, and Leif says,

"Hey, let's go into a room real quick." I'm like, great, he's going to tell me about the program, he's going to give me a brain dump, and I'm going to learn and figure out how I can start contributing. No, he actually said, "Hey, I just checked the bug bounty queue and we have some P1s." I'm like, "Okay, one P1?" He said, "No, actually three that just came in, and we just paid 16K in bounties. You might want to go make friends with the finance people, because we're going to need to re-up our pool." No problem. This is actually a recreation of the conversation I had with Leif. He said,

"Welcome aboard, Eric. Rolled into town last night, and we have three P1s." And my response was, "We can just use the existing vuln management process, right, Leif?" He looked at me vacantly, and to make sure he heard me, I repeated myself. Okay, we don't have a robust management process. I get it; the program just started, and you have to start from somewhere. But maybe they had bits and pieces of their program in place, so I asked some questions. Well, were all vulnerabilities ticketed, even the ones we weren't going to fix, just so we have that documented? Nope, we don't do that yet, but that's a good idea, we should start that. Okay, no problem. Do we have a standard

workflow for vulnerabilities? I know that there are handoffs: you triage, and you hand it off to the people who are going to fix it. No, we don't do that currently. Do we at least have consistent severity and SLAs, like just fields in Jira? Nope, we don't have that yet. And I feel like I knew the answer to this last question: well, what about metrics? Yeah, nope, we don't do that either. So this is a little bit how I was feeling: this is fine. And again, we have to start from somewhere, so let's start with our pattern. That's enough story for now. Let's go to stage one: stop the bleeding and breathe.

So there are two goals for the stage. The first one is to buy time with "good enough". And what does good enough mean? Well, you need to know who your customers are. You might say, "Well, I'm a security team, I don't have customers, I'm not customer facing." That's not true. You do have customers. You develop a product, right? Vuln management, your capabilities: they should be versioned products that your internal customers, such as your developers, your PMs or product managers, your EMs, and your executives, consume. And so you need to figure out who they are and engage them to figure out what good enough is. So remember, we want to go from this

to this. We want to breathe. So let's get back to the story. We asked ourselves, who are our customers? There are the obvious ones. Engineering: they fix the vulns. GRC: they're the ones who need to ensure that we have a valid process for external customers and auditors to vet us, so we can get ISO 27001 and all of our certifications, and so the big paying customers can talk to us. And then there were the not-so-obvious ones. Product managers: they actually owned the products that these vulnerabilities existed in, and Leif's talk earlier today was really about the partnership there and why it's important. And then board members. This one's kind

of obvious in retrospect, but at the time it was something I didn't think about as a customer. They typically know what good looks like, and they want to ensure their startup doesn't fall into the "move fast and breach stuff" mode, which I've seen a few times. Going back to why PMs: this is a conversation I'm sure you've all had if you've worked in appsec or had to triage any vulnerability. Maybe there's a vulnerability in activation service, so you go into Slack and say, "Hey, who owns activation service?" Engineering might say, "Hey, we do, what's up?" "There is a critical P1 vulnerability that we need to fix. Can you help?" "Sure thing." And that fix could be one

hour, four hours, one day; it could be really systemic and take a week. And the PMs might be like, "Where are my engineering resources?" So we wanted to make sure that they were part of this process from the beginning. And then our other customers. GRC: for good enough, what they wanted was a risk exception workflow. We're not going to fix every vulnerability, and for the ones that we do not fix, let's make sure that we document that and that we have a plan to remediate it eventually. Then we have engineering managers: "I'd like to see vulnerabilities related to my team." We always talk about how we want that single

pane of glass when we talk to vendors; our customers internally want that too. They don't necessarily care to see every vulnerability that exists within Segment. They want to see what they own and what they have to do so that they can bake it into their sprint. And then board members, just like all executives, love metrics and reporting, so we wanted to make sure that we had that. So this is where we ended up for v1. You can see here something that's pretty unique: we added product triage as the first step. That's because our product teams manage bugs for their product, and when we find vulnerabilities, and hopefully I don't get in trouble for saying this, but we

kind of look at them as a class of bug. So when we find something, let's say in Personas, we have it go to the Personas PM to help make a decision on whether they're going to fix it or not, and if they don't, it goes down the risk acceptance flow. Here you can see the GRC person's happy. And then metrics. Nothing fancy; we didn't need to knock it out of the park. Jira, I mean, it's not the coolest tool in the world. I disagree, but for a lot of people it's not the most interesting, and people's eyes might glaze over when they

look at Jira dashboards. But they're actually pretty flexible, and you can get pretty far and get to good enough. So here we have, for example, vulnerabilities by VRT. If you're not familiar with VRT, it's the Bugcrowd Vulnerability Rating Taxonomy, something that we adopted pretty early so that we could see how our vulnerabilities were broken down and see where we might need to spend some time and energy in the appsec program to eliminate a vulnerability class. You can't see it here, and we'll talk about it a little bit more later, but we also had breached SLAs by product manager, and that's that single pane of glass for them.

All right, enough story time for now. How did we do for stage one? Buy time with good enough; identify and engage our customers. We did both of those, so we're feeling pretty chill. Now that we have time, we can maybe take that extra Peloton class and relax a little bit, and move on to stage two. Stage two is rebuilding the foundation with data. The first thing you want to do is decide if good enough is good enough for now. The reason that's important is that, especially in a startup, you have a lot of things to do to get started. Our vulnerability management program v1

was good enough. We had an SDLC to build, we had design reviews to figure out, we had general security metrics to figure out, and so that's why that's very important. And then also, when we do design v2, let's make sure that we do it with our customers, transparently, and using data from v1. So let's go back to the story. Like I said earlier, good enough was good enough for now, and so what we did was actually revisit it in a couple of years. So what year is it? It's 2020 now, so 2018 to 2020. Looking at the data from v1, we had our customer stories. That is data; it's maybe

a little bit more qualitative, but that's fine. We had our metrics, more on the quantitative side. And then, importantly, your team: you should always consider inputs from your team. No one wants to work on a process that has a ton of toil. It could burn out your engineers, and that puts you in the worst place from a security perspective, because now you have to hire, with all the opportunity costs that go with that. So what we did is we wrote our learnings in a doc, and I want to go through the top three learnings right now. Key learning number one was that people need the system's workflow to be

intuitive. The number of states should be minimized. Technical jargon, even security technical jargon, can be distracting and confusing, especially for folks who aren't in the security space. So what we did for v2: this is our new workflow to the right, and you can see there's one less state. Also, our risk acceptance process got translated into a due date extension, because people understand "due date extension" better than they understand "temporary risk acceptance". This was just a lot easier for people to navigate: if they have a vulnerability that's due and they can't make that due date, they can extend it. Obviously there are approvals and things like that required.

Key learning number two: notifications need to be improved and actionable. Surprise factors should be minimized at all costs. Our customers are the ones who need to navigate this process, and if they don't know how or why they have vulnerabilities, and those show up in some report to their boss, it's not going to be great; they're going to potentially get defensive. Also, actionable instructions should be readily present in the process itself. What this looks like for us: we leveraged email quite a bit. When you get a new vulnerability, you get an email notification that tells you what the due date is, the details, the metadata that you need, and then simple instructions are right

here. They don't need to go read a primer or a standard; it's right in the process itself, and ideally very few steps. And then if you have a vuln due soon, we sometimes bring in the department leads, depending on how close the vulnerability is to being due. Again, a consistent format where possible, and simple instructions. And when there's a due date extension, a.k.a. accepting risk, this is what the approver sees. They see all the information in one spot. Sometimes this is executives, right? They live in email, so they want to see what the justification is and how they can feel comfortable accepting this risk. And if they have questions, they can always reply, and it brings in

the security team and also the person who asked for the exception. And of course this gets reflected back into Jira, which is what we were using, for auditing purposes. Key learning three of three: attribution should be flexibly tied to the org chart. Accountable attribution should not be hard to maintain. I'll go into more detail about this, but the team field was very hard to maintain. Teams were disappearing, teams were appearing, and we had vulnerabilities that had no owner because that team had disappeared, so there was a lot of toil there. And attribution should allow you to roll up metrics, and I'll go into more detail here in a second.

So let's talk about org structure. The root of your org is where the CEO typically sits, and then you have a one-to-many relationship with your departments. It could be departments or divisions, it's all semantics, but for us it's G&A, GTM, R&D. This is where your CTO sits; this is where your CEO might sit, or your people officer, or your revenue officer. Then there's a one-to-many relationship with maybe macro teams or divisions underneath, and then a one-to-many relationship with teams at the bottom. Our v1 team field was anchored at the bottom, and like I said, there was a ton of toil. It's obvious in retrospect, but the rate

of change increases as you move down this org chart. The team changes all the time, whereas the org and department are, ideally, pretty much stable. So this was the sweet spot for us: we ended up anchoring all of our vulnerabilities to the division level, and we tied them to the department as well. That was required, and teams were optional. If, let's say, the division lead wanted that last-mile attribution, to say, okay, which team owns this, they could do it themselves. It seemed to work pretty well that way. And then the other thing that we can do

when you start tying the vulnerabilities to the org chart: you can start doing interesting things such as roll-ups, and you can enable competition. So imagine Personas at the division level; they want to see how their vulnerabilities break down. Oh, sorry, jumped ahead there. This is v1, and this is what we were limited to: we could only see team-level breached SLA metrics, which was fine, but as I mentioned it introduced a lot of toil and you couldn't really roll these up cleanly. When you anchor it to the organization, you can start to roll them up. So in this case you can say Personas has six breached SLAs, and you can see

Protocols has 25 and Data Plane has zero. So the department lead, in this case the CTO, who might be responsible to the CEO, says, "We have 31 breached SLAs in our org. I want to get that down to zero. Who do I need to go talk to? Let me look at the metrics." And you can see six were from Personas, zero were from Data Plane, an A-plus student here, and 25 were from Protocols. Shame, shame, shame. What this does is enable competition. You don't want to be the director or the team that has 25 when your peer down the hall has zero. It's kind of inexcusable, or hard to

explain to your boss when your peers are doing it well. So this is what the dashboard looked like. You'd see that we have the department, and then if you wanted to zoom in, you could go into the individual pillars or divisions. Going back to one of the things that we need to do, which is design this with our customers: with those learnings that I put together, we actually brought the PMs in, we brought the executives in. They were engaged, they were interacting, they were getting hyped, which was great to see. We also created user stories. For example, the executive, what do you want to see: I want to be able to see all breached

vulnerabilities by function, and furthermore I want commentary from owners on breach status and why. And why do you want that? So I can ensure vulnerabilities within my org are being remediated in a timely fashion. And then you have your flow. We did this for all the different customers or actors in our org for the vulnerability management system, and created mock-ups so that we could get approval from them and make sure that this made sense and was easy for them to consume. We had strong reviews at the end of the day. And when the process did launch, about two years, or a year and a half, later, Tido, who is our chief development

product officer, mentioned that this was one of the best operational processes that we have at Segment. And we actually started rolling this out to betterments, like SEV betterments. With SEV1, SEV2, SEV3, you will have an incident and a postmortem, and you'll have betterments that come out of that; they were actually leveraging a lot of the system as well. All right, enough story for now. How did we do on stage two? All right, let's get to the fun part: experimenting strategically. There are three parts here: identify a good problem space using v2 data; leverage prior art in this industry or adjacent ones, like SRE for example; and experiment and share your

results early to see if there's traction. So let's go back to the story, back to our customers. We're looking at all this again. Some problems we observed: parts of the org had vastly different risk appetites. Personas may be very thoughtful; anytime they needed to do an extension or accept a risk, they would make sure that they would only come back to the well once, and ensure that they had allocation for that particular work in their roadmap. Whereas maybe Data Processing (again, these are fake pillars, I'm not trying to call them out) was a lot more rubber-stamp happy to extend risks. And then we also realized that as extensions

grew, the breached SLA metric became less meaningful. Sure, Data Processing might have zero breached SLAs, but they might have a thousand open vulns. So what does a breached SLA mean at this point, if they have a ton of exceptions? And then there was lots of toil in helping teams with prioritization. We love to work with our engineering partners, but they were always asking us, okay, how do we prioritize these? It's not always as simple as "do a P1, P2, P3 burn-down for your vulnerabilities". They were looking for a little bit more handcrafted, artisanal prioritization, which is great, but it takes a lot of time. So we wanted to figure out what our problem

space was, and we needed to define it. Could we come up with a metric that was more meaningful than a breached SLA, that enabled different parts of the org to have different risk appetites, that ensured those risk appetites were in a healthy range, and that could provide stronger prioritization beyond severity? So we wanted to start looking at different industries and leveraging prior art. I don't know if you folks are familiar, but there's this amazing book from Google that's free, the Site Reliability Engineering book. You might ask, okay, why do you care about SRE? What I say is that they are very similar to what we do. They take customer-facing

risks and impact, they translate them into tools and concepts like SLIs, SLOs, and SLAs, and they hold the business accountable. They speak to the engineers; they typically are the most advanced engineers in the org that I've seen. So I thought they had a lot of great learnings, specifically chapter three, which you can see is where the title of this talk is mostly inspired from. It's about embracing risk, and the intro talks about how extreme reliability comes at a cost. That makes sense, right? We can't have a perfectly secure system, just like in SRE you can't have a perfectly reliable system, because it costs too much money. And Gene Spafford from Purdue

mentioned, "The only system which is truly secure is one which is switched off and unplugged, locked in a titanium-lined safe, buried in a concrete bunker, and surrounded by nerve gas and very highly paid armed guards. Even then, I wouldn't stake my life on it." So we have to manage risk. We have to have it within our business, and that's fine, because if we didn't, we wouldn't have a product. Going back to the SRE book: in this chapter they also mention that their focus is to seek a balance between the risk of unavailability and the goals of rapid innovation. So how I looked at that was, can we take security risk and balance it

with the need to ship product? And then, I'm sure a lot of you are familiar with this part, they talk about error budgets. Basically, you have two opposing incentives: you have the product team that wants to ship quicker, and you have the reliability team that wants to make sure that services are reliable. Sometimes those incentives aren't aligned, so they wanted to come up with a metric that reduced the friction between those two competing actors. What this looks like in practice: the product manager defines an SLO, so think of this like uptime per quarter. The SLO is measured by a neutral third party,

the monitoring system, to try to eliminate bias. The service goes down for a period of time, and that eats up that uptime-per-quarter budget. If that budget is exceeded, new releases are halted and reliability work is prioritized. So some takeaways: again, perfectly reliable systems are not the goal, we need to find a balance; and error budgets allow teams to align on a clear and objective metric per product. So I was thinking about this, well, we were thinking about this, sorry, it's not just me, and with the due date extension, the temporary risk acceptance, we were trying to figure out if there was a way to do something similar. With budgets, we were thinking

about currency, and we were like, okay, maybe we can think about debt. Debt is a form of currency, right? And so we were thinking, maybe every time you do an extension you're telling the business, "Hey, I can't fix this right now, but I promise to fix it in the future," and therefore you're taking on technical debt or security debt. And we were wondering, okay, can we make this debt, can we manage it, can we budget it per department? And if each extension is debt, what are the interest rates? The requirements for that: we need to make sure that we discourage people from taking out egregious loans or egregious debts, so interest rates

should be higher for more severe vulnerabilities. The highest interest rates, or loans, should be pre-approved at the right level of authority in the business. And interest rates should be intuitive. We actually realized that our SLA-to-approval mapping and our SLA time and severity all came together into something that we could leverage. So we sat down and figured out, okay, what would a debt metric look like? And here it is; this is what we use today. It's the current date minus the original due date, divided by the SLA. I know this is a very simple formula, but that's by design, because people can look at this as elapsed SLA periods. That's easy to grok,

executives can understand that, and that's who you need to convince. I'm not saying it's perfect; we're still working on how to make sure people can fully understand it. But we didn't want to add a bunch of opaque factors into this debt that people can't understand and might immediately push back on. So this is what an example would look like. Let's say we want a debt value for a vulnerability that's six months overdue; keep in mind this is after the SLA has elapsed. Let's say it's June 30th and it was due on Jan 1st, so it's 180 days overdue. The SLA for a P3 is 30 days, so it's six elapsed SLA periods. That's just our unit. I

don't know, forgive me, physicists in the room, hopefully I didn't offend. Okay, so let's go through an example. A vulnerability is identified in Personas: P3 severity, 30 days to fix, and the due date is 1/01/2022. The team believes there is a more important risk to address first in Q1, so they extend the due date until 4/15/2022. The director of that department is notified and signs off; you saw the notification earlier, same system. Then the vulnerability begins to generate technical security debt. So in this case, 104 days over 30 days gives us about 3.5 elapsed SLA periods by the time that vulnerability is now due.
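
As a rough illustration of the formula described above (current date minus original due date, divided by the SLA), the two spoken examples can be reproduced in a few lines of Python. This is a minimal sketch, not the system used at Segment or Twilio: the function name `debt_in_sla_periods`, the `SLA_DAYS` table, the choice of 2022 as the year for the June 30th example, and the rule that a vulnerability not yet past its original due date carries zero debt are all assumptions made for the example. Only the 30-day P3 SLA comes from the talk.

```python
from datetime import date

# Assumed SLA lengths in days per severity; only P3 = 30 is stated in the talk.
SLA_DAYS = {"P3": 30}


def debt_in_sla_periods(current: date, original_due: date, severity: str) -> float:
    """Debt = (current date - original due date) / SLA, in elapsed SLA periods."""
    overdue_days = (current - original_due).days
    # Assumption: nothing accrues before the original due date has passed.
    return max(overdue_days, 0) / SLA_DAYS[severity]


# Example 1: due Jan 1st, evaluated June 30th (year assumed to be 2022).
six_months = debt_in_sla_periods(date(2022, 6, 30), date(2022, 1, 1), "P3")

# Example 2: due 1/01/2022, extended until 4/15/2022.
extension = debt_in_sla_periods(date(2022, 4, 15), date(2022, 1, 1), "P3")

print(six_months)           # 6.0  (180 days / 30)
print(round(extension, 1))  # 3.5  (104 days / 30)
```

Keeping the unit as plain elapsed SLA periods, with no weighting factors, mirrors the design choice described in the talk: the number should be something an executive can recompute by hand.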

So what can you do with this metric? Okay, cool, you have a number. But we can start doing org-wide roll-ups. We can say, hey, what does the debt look like for Segment as a whole? How does that break down within departments? In this case we're looking at the median rather than the average, because the outliers were throwing things off, so we ended up looking at just the median debt. Again, these numbers are fake, but this is a real system that we leverage. You can also look at how debt is managed at the division level. Here, on the left side, you're seeing the total debt that Personas has, which is quite high.

However, the median pillar debt on the right side is actually fairly low. That, to me, could signify that yes, they have a lot of open vulnerabilities, but they are paying them down fairly regularly. Whereas Stream Processing here maybe has a smaller surface area, so less debt overall, but they aren't paying it down. That, to me, is an area that I would probably spike into, especially if you look at the peers here that have like 18, 16, 14. So it might be an area of interest. You can also see how debt breaks down by severity. Maybe certain teams are more interested in P3s, P2s, or P1s and they

don't care about P4s and P5s, and that might be fine. You could also see, in this case, that Stream Processing has a lot of vulnerabilities at the P2 level that are accruing a lot of debt, so again, we might want to look closer here. Total debt: I haven't found an interesting use case for this one, so I'm just going to skip it. I could not get the dashboard; I was literally editing the HTML in Snowflake, and for some reason I can't do that for list views there, so I just copied it over to a spreadsheet. But what this is showing is that we have the ability to look at the highest loans

taken in the last day, week, or month. And we realized that it's not always the P1s or P2s you have to worry about; sometimes it's things like P3s. If you push a P3 for a year, that's a lot more egregious, in my opinion, than maybe a P1 for a couple of days. So when you have these vulnerabilities listed or ordered this way, you can adjust your focus to the ones that are more concerning. We also, within all of Twilio (or Segment, soon to be Twilio), have all our vulnerabilities in Jira, and we have the elapsed SLA periods associated with them, so we can actually generate a

list of our most egregious, our top 10, security debt items. It actually goes through all of our vulnerabilities, so it could be the top 250 or whatever; it's arbitrary. We also have a top 10 debt review. This is something where we invite executives, pillar leads or department leads, and security leadership, and we get in a room and talk about the top 10 vulnerabilities. We try to get an understanding of: will the due date be met? This is how much debt it already has; are there blockers? How can we eliminate this? And sometimes they actually push back. They say, "Hey, this severity is too high," and that's fine.
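
The division-level roll-ups (total versus median debt) and the ranked debt list described above could be sketched roughly as follows. Everything here is illustrative: the `Vuln` record, the `top_debt_items` and `rollup_by_division` helpers, the ticket keys, and all the numbers are made up for the example, and the real system described in the talk is built on Jira and dashboards rather than this code.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median


@dataclass
class Vuln:
    key: str         # hypothetical ticket key
    division: str
    severity: str
    debt: float      # elapsed SLA periods, as defined in the talk


def top_debt_items(vulns, n=10):
    """Rank vulnerabilities by debt, highest first, regardless of severity."""
    return sorted(vulns, key=lambda v: v.debt, reverse=True)[:n]


def rollup_by_division(vulns):
    """Total and median debt per division; the median damps outliers."""
    by_division = defaultdict(list)
    for v in vulns:
        by_division[v.division].append(v.debt)
    return {
        division: {"total": sum(debts), "median": median(debts)}
        for division, debts in by_division.items()
    }


# Fake data: Personas has one long-deferred P3 but otherwise pays debt down;
# Stream Processing has fewer items that are all accruing.
vulns = [
    Vuln("VULN-1", "Personas", "P1", 0.3),
    Vuln("VULN-2", "Personas", "P3", 12.0),  # a P3 pushed for roughly a year
    Vuln("VULN-3", "Personas", "P2", 0.5),
    Vuln("VULN-4", "Personas", "P2", 0.4),
    Vuln("VULN-5", "Personas", "P3", 0.2),
    Vuln("VULN-6", "Stream Processing", "P2", 4.0),
    Vuln("VULN-7", "Stream Processing", "P2", 5.0),
]

top3 = [v.key for v in top_debt_items(vulns, n=3)]
rollup = rollup_by_division(vulns)
print(top3)                                   # ['VULN-2', 'VULN-7', 'VULN-6']
print(rollup["Personas"]["median"])           # 0.4
print(rollup["Stream Processing"]["median"])  # 4.5
```

In this fake data the long-deferred P3 outranks the recent P1, and Personas shows a high total but a low median while Stream Processing shows the opposite, which is the pattern the talk suggests looking into.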

That's a good outcome. We know that sometimes there's a bit of art and science to severity, and while I wish we could be very scientific there, it's not reality. So we sometimes litigate and go back and forth until we come up with a severity that we feel is about correct. Down here in the bottom right is a little bit of a promotion from Albert, who is a really great VP of Eng at Twilio. He mentioned it was one of his favorite meetings of the year, so it was great to see that our VPs of Eng are engaged. But what's the value of this metric? It's easy to understand. There are no

hidden factors: it's elapsed SLAs, directly tied to severity, so it's not too much of a jump from where you are with your current vulnerability program. It allows for aggregation and, importantly, budgets, which we'll talk about in a second, and it gives better prioritization than severity alone. So, looking forward: we're scaling up, and we are actually doing this at Twilio. Twilio is a very large company relative to Segment, and this is one of the top five metrics we share with executives. We're iterating on it to get it to where we feel it's actionable and where it needs to be. But we have a spreadsheet that we share, and you can see here there is a budget

per BU, so a target. In this case, let's say this was Segment and we have a target of 12. Well, if you recall, the dashboard earlier was like 12.5, so in this case maybe it's not in such a good spot, and we would want to push on them to maybe halt or slow down their development of new features and work on security betterments. And because we have that hierarchy, we can also attribute to products individually as well: product one, product two, product three, how do they contribute to that 4.8 score? And then we're also thinking about budget enforcement in CI. So if we know a repo belongs to a particular

team they let's say it's a activation service and we want to help kind of bake this and push it left as far as possible maybe we can have a github check that says hey you're approaching your division budget you're currently at 12 budget is 13. maybe you should start fixing some of your vulnerabilities because otherwise we might block you soon and again this budget is set by the the pm or product manager and so or the the the gm uh the general manager of that particular part of the business so they are uh incentivized to keep this down this isn't just security saying this um and then this this idea is a little uh bonkers

but uh we are trying to figure out maybe calibrating costs with incidents and calibrating um that cost to debt budget so or to debt so i'll go through an example but for each vulnerability that materializes into a security incident could you calculate the cost of that incident maybe engineering time to resolve business loss and then divide by debt score very very simple um so in this case let's say vulnerability with a debt score of 30 resulted in an incident that cost the business 2 million dollars can you divide this and say oh well maybe one debt point is 100k and so let's say a new division has an aggregate debt of 50 then they're carrying 5 million of potential security

debt again this idea might be bonkers but i do feel like there's something here to help us do more of an apples to apples comparison between feature work that it's really easy to understand what revenue that can bring in versus the security work which is always a a couple steps away from revenue so and then some miscellaneous ideas we're thinking about credit scores so maybe teams are really good at paying off their debt or keeping their debt low maybe we have less restrictions on them taking on debt and then also this can be applied to any tech debt that has severity and sla so sev betterments and so we did present this to our sev team there was a lot of

interest follow up on that so how do we do on stage three goals uh we hit all three of these and um that's it we're hiring come work with me um thank you and dm me on linkedin also check out jeevan's talk 3 30 in this room now i'll take your questions how do i do on time okay good yes uh so within segment the oh oh yeah good idea um how was it received across the org so at segment uh the chief development product officer tito thought it was great he thought it was um you know one of the most most mature uh processes he's seen for operational work and betterments and and burning down

technical debt on the the twilio side we're still like iterating on it um i'll be honest like the first iteration was a little they were like but if i'm fixing my vulnerabilities then i have a low debt like how are we accounting for maybe like burn down of debt like uh so we're we're iterating on different types of ways of looking at the debt itself versus just the median so like almost like a cash flow like yeah maybe you have a lot of debt coming in but you have a lot of debt being burned down can we like kind of show that type of metric as well it's it's going to be a process right it's going to be

v4 right or v3 so that's fine overall yes
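The "cash flow" view mentioned in this answer (debt coming in versus debt being burned down) could be sketched roughly as follows. This is only an illustration: the event shape, the `DebtFlow` name, and the sample numbers are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass
class DebtFlow:
    """Debt points added and retired during one reporting period."""
    incurred: float
    burned_down: float

    @property
    def net(self) -> float:
        # Positive means debt grew over the period, negative means it shrank.
        return self.incurred - self.burned_down


def flow_for_period(events) -> DebtFlow:
    """Summarize (kind, points) events, where kind is 'incurred' or 'burned_down'."""
    events = list(events)  # allow generators, since we iterate twice
    incurred = sum(points for kind, points in events if kind == "incurred")
    burned = sum(points for kind, points in events if kind == "burned_down")
    return DebtFlow(incurred=incurred, burned_down=burned)


# Hypothetical quarter: a team takes on 15 points but retires 9.5.
quarter = flow_for_period([
    ("incurred", 12),
    ("burned_down", 9.5),
    ("incurred", 3),
])
print(quarter.incurred, quarter.burned_down, quarter.net)
```

Reporting the two flows side by side lets a team that is actively fixing vulnerabilities show that work even when its point-in-time debt is still high.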

uh yeah the process of basically rolling this out well if you have severity and that's actually why i did this talk is just kind of show you where we started where we had nothing but and like that whole process was have severity um if you have vulnerabilities and what we do is actually we take um for us we really start debt with p3 or higher because that's where most people care for now and then as we get more mature we can start including p4 p5 and that that formula that i put up there you could throw that into a spreadsheet or you can use snowflake or however you etl your jira data it's a pretty pretty trivial like query

that you can make and and you can always tweak um that formula to whatever resonates within your org but yeah so we're doing that at twilio right now and how we're starting is we're starting with our top vulnerabilities that we've seen in that business we're tagging them and those get counted towards that that debt metric we're not looking at p5s or p4s just because it's too noisy right now we need to get people used to this metric before we start really going all in on it yes
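The formula itself is on a slide that this transcript does not capture. A minimal sketch, assuming debt is the number of SLA periods that have elapsed since a vulnerability was opened (which fits "elapsed slas directly tied to severity" and the later remark that a p1 accrues a point a day while a p3 accrues a point every 30 days). The P2 value, the `Vuln` fields, and the function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date

# SLA length in days per severity. P1 = 1 and P3 = 30 follow the talk;
# P2 = 7 is a made-up placeholder.
SLA_DAYS = {"P1": 1, "P2": 7, "P3": 30}


@dataclass
class Vuln:
    key: str        # e.g. a Jira ticket key (hypothetical field)
    severity: str   # "P1" .. "P5"
    opened: date


def debt(v: Vuln, today: date) -> float:
    """Assumed formula: SLA periods elapsed since the ticket was opened."""
    sla = SLA_DAYS.get(v.severity)
    if sla is None:
        # P4/P5 are left out for now, as described in the talk.
        return 0.0
    return max((today - v.opened).days, 0) / sla


def top_debt(vulns, today: date, n: int = 10):
    """Return the n vulnerabilities carrying the most debt."""
    return sorted(vulns, key=lambda v: debt(v, today), reverse=True)[:n]
```

The same calculation fits in a spreadsheet column or a single warehouse query; the Python form is only meant to make the arithmetic explicit.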

yeah so the question is um if teams push back on the severity um how do you how do you manage that um and so what we do is i mean we discuss right we try to come out of the room like on agreement we actually lean a lot on cwss score like it's a calculator that we have that you even created that allows us to generate kind of a ideally an unbiased metric but some things that we do actually is say hey is there any compensating controls that will lower this debt which will lower the interest rate so if you drop it down from a p1 to p3 with a compensating control now you're not accruing a point every day

you're accruing a point every 30 days and so when that debt for that particular vulnerability gets recalculated typically drops off like especially the top 10 but maybe the top 20 top 30 so it's not as much of a big deal and that's that's kind of the outcome we want right we want them to put compensating controls in place and if we can't fix the vulnerability right away then we should we should um i guess reward them for for putting that compensating control in place yes
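To make the "interest rate" point concrete, here is a small sketch of what a downgrade from p1 to p3 does to one vulnerability. It assumes the debt is recalculated from the new severity as days open divided by SLA days; the 60-day figure and the function names are hypothetical.

```python
# SLA length in days: a P1 accrues one point per day, a P3 one point per 30 days.
SLA_DAYS = {"P1": 1, "P3": 30}


def points_per_day(severity: str) -> float:
    """The 'interest rate': debt points accrued per day at this severity."""
    return 1 / SLA_DAYS[severity]


def recalculated_debt(days_open: int, severity: str) -> float:
    """Debt when the whole open period is scored at the given severity."""
    return days_open / SLA_DAYS[severity]


days_open = 60  # hypothetical vulnerability, open for 60 days
print(recalculated_debt(days_open, "P1"))  # scored as a P1
print(recalculated_debt(days_open, "P3"))  # after a compensating control
```

Under these assumptions the same vulnerability goes from 60 points to 2, which is why it tends to drop out of the top 10 once a compensating control is in place.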

yeah um that is something i hope maybe in a year or two oh i'm sorry yeah so the question is combining with other types of technical debt um and yes it is something that i'm very interested in because if you can i'm pretty big in like dx or ux for our engineers because the processes that are usable or ones that people are going in to to to actually have the most impact on the business and so if we can take um and we want that single pane of glass if they have incident betterments they have vulnerability betterments they have i don't know any other type of betterment if we can put that all into one spot and ideally even have like

similar potentially denominators and interest rates so it just makes things a lot easier you can start to rationalize and plan better as an organization on what you should be burning down and and why so yes definitely something that i would hope to see in maybe a year or two and that's why i presented it to we have like an sre meeting um that has like everyone in twilio that's sre and ops related and i presented this and um there was a decent amount of interest and follow-up from that so hopefully yes

so the question is how do these vulnerabilities actually repeat it one more time

yeah that's a really great question so the question is um if you have a vulnerability in like a common component so let's say like your infrastructure team has a vulnerability and that debt impacts and i guess a team higher up on the stack how does that translate we don't really have a good answer for that today like one thing that we were thinking about there which is um so let's say we have a a systemic issue like secrets management like it's not great um hypothetically speaking at twilio and that that's like part of the um the uh platform right um what we were thinking is okay that should be attributed to the platform level or like that big part of

the business maybe even r d and it's kind of the cto that needs to figure out how to burn that down so it gets attributed at you know their particular level but it's not i don't know if i can give you a better answer than that but happy to to brainstorm some ideas offline but yeah it's a good question yes

uh does the process flow still hold true for zero days yeah we push it through into the same system as much as possible um so yeah zero days typically uh it's people aren't going to extend those for too long but yeah i mean it's possible they could extend them and and we want to make sure that uh it's not too egregious of an extension we can't fix everything in one day right sometimes it takes 10 days 20 days so those extensions are they show up in that that top loan like a list for that day or for that week or that month so um but yeah it all ideally gets pushed into the same system just so it makes it easier for us

yes uh yeah actually you have two two hands over here yeah either or uh

yeah so the question is how do you uh maybe another way of saying is like how do you prevent people from kind of sandbagging their budget or putting like a high budget in uh relative uh to their peers well that that competition uh roll up that i was talking about earlier when you were able to compete compare peers to other peers um is one way to kind of combat that right if one team is is really sandbagging the other team or the other part of the division isn't um you would want to question that so why why is this part of the business um have such a lower budget whereas your part of the business has a

higher budget and maybe maybe there's good answers that maybe it's like hey we don't have that many customers there's no pii there's no sensitive data here and we're in product market fit so we just want to move fast okay that's the business to decide like as long as there's an appropriate level of sign off on that and it's not yeah we just let that we let that happen it's their decision right at the end of the day but we we do kind of push that uh compare yourself to your peers because you don't want to be that person that has a ridiculous budget while everyone else has a pretty low one it's like okay what's going on here

yes

um so the question is do we account differently for vulnerabilities that have a patch versus vulnerabilities that do not have a patch um no we don't right now um could be something interesting like that would be maybe an additional factor that we could add but like i mentioned earlier i'm a little hesitant to do additional factors in that metric until people start to adopt it i think one way to to deal with that is like for the risk justifications there's no patch here i think that's a lot easier to to say okay that makes sense like we we can't fix this right now so let's push it off um that would just go into that risk

justification that's how that would appear today that's a good question all right five minutes yes

yeah so we don't so the question is um actually repeat it one more time make sure i grok it oh how are you getting the vulns to the right teams yeah um we it's an asset management problem at the end of the day so um one thing that that we built at segment and it's actually becoming a thing over at twilio um we had this concept of like ownership so if we can attribute every repo to a team ideally at like at least the division level sometimes at the team level that's where we we hook off of there's a tool called backstage that we're looking at for twilio wide that can help with that type of asset

inventory and so yeah it's not there's just another fundamental component that you need to have in your security program which is asset management that's not going to be solved by this necessarily but like i said it is a lot easier to figure out uh hey we don't who owns this we don't know i mean eventually the cto is the one that owns it so you can always move up and they have to figure out that attribution ideally we don't want to get to that level but um you know when you're hooking your vulnerabilities into the org you can escalate up if you need to any other questions out of time or two minutes okay cool

anything else really appreciate the questions so yes appreciate the enthusiasm say that one more time how long does it take for bud um to be optimized so the the whole budgeting part i i think you might have let me go back here real quick it's actually future looking like we are doing it right now um twilio for this year um and so this is where we have the budgets my gut tells me that this is going to take probably a couple more quarters before we feel better about it um if what this is doing is starting the conversation like we can't get below 12. well why is that let's go into there and like okay we understand why this is

maybe harder for you let's say you don't have like an infrastructure that runs on something like kubernetes you have vms and it's very manual it's very painful okay maybe that's um one of the things for next year is let's let's move you to kubernetes let's move you to something that's a little bit more operations friendly and allow us to to fix these vulnerabilities uh more uh more quickly and and lower your debt so cool all right i think unless there's no other questions thank you
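The budget check described earlier in the talk ("you're currently at 12 budget is 13") could be sketched as a small function that a CI job calls. This is an illustration only: the `warn_ratio` threshold and the three status names are hypothetical, and nothing here talks to GitHub.

```python
def budget_status(current_debt: float, budget: float, warn_ratio: float = 0.9) -> str:
    """Classify a division's aggregate debt against its budget.

    Returns "block" when the budget is exceeded, "warn" when debt is within
    warn_ratio of the budget, and "ok" otherwise. The budget is assumed to be
    set by the business owner (PM or GM), not by security.
    """
    if current_debt > budget:
        return "block"
    if current_debt >= warn_ratio * budget:
        return "warn"
    return "ok"


# The talk's example: debt of 12 against a budget of 13 is close to the limit.
print(budget_status(12, 13))    # warn
# The earlier dashboard example: 12.5 against a target of 12 is over budget.
print(budget_status(12.5, 12))  # block
```

A CI check could surface "warn" as a non-blocking message and reserve "block" for a failed check, which matches the "we might block you soon" framing.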

all right thanks eric for speaking at uh bsides san francisco and uh on behalf of the conference and our uh gift sponsor montego you get a undisclosed bag of goodies thanks
