
Advancing DevSecOps Metrics

BSides SATX · 2022 · 41:43 · Published 2023-03
Speaker: Mark Peters
Category: Technical
Style: Talk
About this talk
2022-06-18, 11:00–11:45, Track 2 (Moody Rm 101)

Everyone wants better metrics to make better decisions. Do you know how to advance your metrics to create success? Integrate a clear path to improved performance. Every company wants to accelerate: deliver more, faster, increase customer value, and return profit to stakeholders. Every DevOps business understands the core metrics, but advancing them requires resolving the known unknowns behind the initial numbers. Accomplishing the initial metrics leads to some understanding and allows deriving additional details. Subsequent metrics should be small, focused on acceleration, and provide clarity to feedback. The examples provided demonstrate how time to change, deployment frequency, restoration time, and change failure rate can be deconstructed to provide advanced solutions. Software tools exist to easily manage telemetry everywhere and increase overall value through advanced metrics.

Key takeaway: understanding why metrics matter and how they apply to your company.

Mark Peters: I work for BrainGu as a Product Manager providing Senior Engineering Technical Assistance on a System Coordination Team on a US Air Force cyber weapon system program in San Antonio, TX. Since retiring from a US Air Force intelligence career, I have worked on four different major programs associated with DevOps. As a cybersecurity expert, I hold several certifications including the CISSP. During a practical doctorate in strategic studies, I authored "Cashing In on Cyberpower," analyzing ten years of cyber-attacks from an economic perspective. I have a BS in English, an MS in Management, a DSS in Security, and am a doctoral candidate at Capella in Information Technology/Cybersecurity and Information Assurance.
Transcript

I thought I had tested it, I thought I had my metric correct, but apparently I didn't. I have a doctorate in strategic security; in my spare time I do judo, I write, I speak, sometimes in that order, and I have two Great Danes which take up a lot of the time, especially with the heat, since they like swimming in the pool. So we're going to talk about the basics, we're going to talk about advanced metrics, and about accelerating that advancement for you. The slides, I'll say, are completely, one hundred percent safe, no virus, no malware, if you're getting the distributed version. But just knowing the metrics isn't the end of it; this is how we get you there.

Also, if you've made a DevOps transition in the last twelve months, this is probably not for you yet, and the reason we caveat that is that it takes some time to get to those baseline metrics, it takes some time to understand what you're doing, and then to think about it and make the advanced changes. Starting with the basics is going to get you that first take, and once you've got those basics you start seeing where they don't do the job for you, where they don't get you where you need to be. In order to get further, you really need to think through that process and through the understanding you have of what you're doing and why you're doing it.

Metrics, logs and traces are great, and you will get them automatically from any number of installed programs and APIs. However, they're not going to fix your problems; they're going to tell you where your problems are, if you understand the metrics well enough to get there. So, looking at those numbers and those processes, like we talked about in some of the other talks: take your vulnerability checks. Knowing that there are 79 vulnerabilities is great, and some people put that out there as a metric, but what matters is that metric over time, how it changes, how it either decreases or increases. Beyond that, when you talk about advancing a vulnerability metric, you're looking at what those categories are.

If they're all critical vulnerabilities, you've got to fix them right now; if they're lower-category vulnerabilities, you can wait. I've seen some DoD programs where, when they were working with me, they said, hey, we're going to release every three months, which is not a great timeline, but our CAT 1 vulnerabilities need to be fixed in 48 hours, so we'll address all of those; for the CAT 2 vulnerabilities we have 30 days; and for the CAT 3 vulnerabilities, down at the bottom, we've got six months, so you know what, we're never going to fix those, because we're releasing every three months and we're just going to carry them as accepted risk.
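
To make that severity policy concrete, here is a minimal Python sketch of checking findings against those remediation windows. The 48-hour, 30-day, and six-month values follow the CAT 1/2/3 timelines described above; the finding records and field names are hypothetical.

    from datetime import datetime, timedelta

    # Remediation windows per severity category, per the policy described above.
    SLA = {
        "CAT1": timedelta(hours=48),
        "CAT2": timedelta(days=30),
        "CAT3": timedelta(days=180),  # roughly six months
    }
    RELEASE_CADENCE = timedelta(days=90)  # releasing every three months

    def overdue_findings(findings, now):
        """Return findings whose remediation window has already expired."""
        return [f for f in findings if now - f["opened"] > SLA[f["category"]]]

    def never_forced_into_a_release():
        """Categories whose window is longer than the release cadence never get
        forced into a release -- the 'accepted risk' trap described above."""
        return [cat for cat, window in SLA.items() if window > RELEASE_CADENCE]

    findings = [
        {"id": "V-101", "category": "CAT1", "opened": datetime(2022, 6, 1)},
        {"id": "V-202", "category": "CAT3", "opened": datetime(2022, 1, 15)},
    ]
    print(overdue_findings(findings, now=datetime(2022, 6, 18)))
    print(never_forced_into_a_release())  # ['CAT3']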

So is that good? Maybe not, but at least it's a metric for them to move forward with. Advancing your DevSecOps practice fine-tunes your ability to create those solutions, but it's not the metrics that fix the problem; it's the humans that fix the problems. You have to have something to talk about, a ready number and a steady number, so you can quantify what the problem is when you go into management, or when you go in with that diagram Cat mentioned, with paragraphs and paragraphs and paragraphs, and you need to be able to simplify it to an easy answer.

So, starting with some of the basics: why do you use metrics, and why is that important? We'll use a bit of five-whys structure to ground the discussion. Well, metrics give us a quantifiable and repeatable source of observability. We don't just have the observability; we can move forward with it, measuring multiple processes and creating things that drive actions, because ultimately that's what you want from a metric. You want to drive an action, you want to create some type of number or some type of analysis that helps you get to the next step and helps you move forward.

So if I ask why again, my next why is: why do I need a quantifiable number, why is it important for me to quantify the situation so I can move forward? Well, people make a lot of decisions on sentiment. Somebody says, I think we're doing good, I think we're doing bad, I feel like we're delivering as often as we're asked, but they don't actually have a quantification. There have been studies in a lot of these directions; there's a book out there called Noise, by Kahneman, Sibony and Sunstein, where they did a lot of analysis of the things that affect your analysis, and they found that 77 percent of the time a randomly generated linear model, something that makes decisions based on random linear weights, is better than a human making decisions against the same data.

So just about any model is better than somebody coming in and making the decision on gut feel. That says, well, if I have it quantifiable, at least I can have some type of linear model applied, which is going to help me make a better decision. So why don't people understand the numbers? Well, people tend to overestimate their own ability to make good judgments from them. Most people think they can make good predictions; if you ask them, they'll say, oh yeah, that's absolutely the way we're going. You're confident in it: I'm confident in my abilities, I'm confident the next threat is coming from here, we're going to deliver at this rate, the product is going to be done at this time. But they don't get the right answer.

In the book The Wisdom of Crowds, Surowiecki found that large groups get better answers, but only when it's a fairly random group. When they're assessing the weight of a cow at a state fair, people looking at the cow put in wildly different guesses based on their own experience of what they think it weighs, and yet the average guess turns out to be strikingly close to the right number.

It gets very, very close when you take that average. So the average is good, but the people have still made wrong guesses on an individual basis. We need something that takes in that aggregate and builds it back out for us to move forward, because individually we don't get the right answers. As a side note, when people get the opportunity to talk to each other and have to put in one guess together, they're still wildly wrong; it's not the group conversation, it's the aggregate of that group's information, because of the numbers. So how do numbers improve my options? Numbers provide that repeatable comparison for linear judgments.

They give you a measure over time that you can show: you're improving, or you're failing to hit the target you want, and by understanding the metric you understand where to go from there. A great book for this one is Measure What Matters by John Doerr. He talks about key performance indicators and how to set them, and more importantly he says, with reference to metrics, that you don't always need to succeed; you want some targets you're not going to hit, so you can target those for improvement, so you can target where you need to make changes and where you need to take the actions that drive you forward. And then finally: why do I need to guide my actions?

Well, if we go back to agile and DevOps, the point was accelerated delivery, and if we don't know what's going on, we can't accelerate; we just keep running the same process, the same thing over and over again, and get the same answers. That's not what we want. We want continuous improvement, that flow, the feedback, and the continual learning. So how do you know what you know? If you're looking at metrics and I've decided I need some metrics, here's a model I used a lot in intelligence, and I find it still applies. It's a basic critical-thinking model for working through a problem.

The first question is: how do you know? Somebody's given you a number and said, hey, this is what we're doing. You say, okay, how do you know? In one actual discussion the answer was, well, we took all the numbers off the system, we got the user rates, we got the percentages of downloads, we added that in, and then we wanted it in a format from one to five, so we multiplied it by this number. Okay, what is that number actually doing? Well, it converts it into the range you want. That's fine, but you need to know, in the math, what manipulations you're making and how they affect the correlation of those numbers as you go forward.
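
As a toy version of that rescaling conversation: a linear min-max rescale into a one-to-five range preserves the ordering of the underlying numbers, but you still have to document the transformation so people know where the scaled value came from. The data and function below are illustrative only, not the team's actual formula.

    def rescale_to_range(values, lo=1.0, hi=5.0):
        """Linearly map raw values onto [lo, hi]; the ordering is unchanged."""
        vmin, vmax = min(values), max(values)
        if vmax == vmin:                  # avoid divide-by-zero on flat data
            return [lo for _ in values]
        return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

    downloads_per_day = [12, 80, 45, 200, 5]
    print(rescale_to_range(downloads_per_day))
    # e.g. [1.14, 2.54, 1.82, 5.0, 1.0] -- same ranking, different scale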

So after "how do I know" — I know where I'm getting the data from, maybe Prometheus is what's bringing me the information, and I know it's bringing me the right information — the next question is: what should happen next? If I get that number, should I see it go up, should I see it go down, should I see it continue at that same level? What do I expect to happen once I get that initial number? And once you have that, and you're reasonably sure, the next question is: what else has to happen to make that true?

In the case of my edge scanners, I have to make sure they're working correctly, I have to make sure they're connected back to my central service and communicating the information correctly, and I have to make sure I'm getting it in a timely manner, so that I don't have one edge scanner reporting every 24 hours and another reporting every 15 minutes and end up with wildly different variations in the information. Then, following that through: do the conclusions I have, and the assumptions I've made about what should happen, actually follow from my premises? Do they follow that path? Then I look at it again from my own perspective.

Are there any other arguments I need for those premises to be true? Is there anything else that has to happen that I haven't thought about? Maybe site X resets on a regular basis, so I have to account for that reset and how it's going to change my numbers, because those sensors are going to drop offline and now I'm only getting information from sites Y and Z — but Y and Z don't get much business, and as it happens site X is your primary customer. So I have to know how those factors play out. And then finally, I want to compare apples to apples, not oranges to elephants.

They both have a name, so that makes them relatively comparable, right? Well, if we look a little closer, they're not quite the same thing at all. Okay, but they are, because, you know, they're both juicy on the inside. All right, we're stretching it; they're not comparable. You see this when people try to take a number and compare experience; you see it when kanban teams get compared against scrum teams and they're not measuring the same rates, not measuring the same details. One of the stories from back in my intelligence days really illustrates this example.

They were measuring power rates in Baghdad during the reconstruction, when they were trying to build Baghdad back up and wanted to make sure there was power for all the citizens. They were measuring it by the average megawatts supplied over the course of the day, and they said, you know what, we're supplying 15 megawatts of power on average every hour, every day, that's great, we can keep everything running. Well, that's great, but then why are people so unhappy? They were averaging high-production hours against low-production hours.

The low production, where people weren't getting enough power or were getting brownouts, happened between about ten o'clock in the morning and four o'clock in the afternoon, when it was really hot outside and all the air conditioning was shutting off, which meant that instead of staying in the house people were going outside, talking to each other, and talking about how frustrated they were — which was not what anyone wanted at all. You can see where the metric wasn't comparing the right apples to apples, because the average amount of power didn't actually describe the problem. And finally, a note on this: dashboards do not equal awareness. They create observability, but your dashboard should be the start of your discussion; it should be a track of the metrics, of those items, that you can then look at and drive further.

So if you're looking at this, you need to start where you are. You need to recognize where your organization, your team, your company is, and figure out how the metrics apply to you, what you need to do to move forward, and what your process is. Do the initial surveys, do the assessment, talk to people; find out what you're getting and what those premises are by actually talking to people, or by collecting those numbers, the data attributes for the process. You want to make sure the metrics are repeatable and contextual, data with context that is appropriate for you. Maybe that means some sensing sessions.

Not just an initial survey, but getting a group of people together and saying, you know what, these are some of the things we found on the survey, and this is what we need to talk about. A lot of that goes to building trust in the assessment, right? People have to trust it, because it's the humans who are going to fix those problems, and that trust comes from credibility and reliability — that's probably a whole other talk in itself — but otherwise nobody is going to go with your metrics. Create a shared vision, objectives that say what I want to deliver and what I don't want to deliver, and maybe use value stream techniques, if you're using them, to align it to your pipeline and figure out how all those things act together.

I'm probably talking way too fast. But if we go back to DevOps and think about a basic DevOps model and why we're looking at DevOps: we want to create flow. We want to create a structure for knowing where our understanding is, knowing where our constraints are, and making sure that what we have moves forward over time. The reason we want a smooth flow is that it allows quick feedback; it allows us to get the information we want, in terms of the metric; and it allows that contextual, linear data that provides the right answers to get to the next step.

It allows our individuals to fix things in series, to figure out where the flow slows or stops, and it lets us maximize our value by having those quick answers. That then drives the continual experimentation and learning, which is where a lot of new metrics come from. A new metric is continual experimentation: it's saying the metric I have doesn't answer the question I want. It answers a question, but I need a different question, a different way of approaching it, in order to get the right answer, or at least an answer that means something to me. We'll talk about this a little more in the later slides.

You know you have a change failure rate, but you don't really know what that means; you've just got a rough number, a rough assessment. You've got a time to deploy, but you've only modeled it. There's a great quote out there that says once you start gaming the metrics — once you figure out that if I hit a good metric I get a promotion — those metrics no longer matter, because instead of getting a true metric, a true analysis of what's out there, you've started gaming and doing things that make the metrics look better without actually having a better system. So alongside the three ways, we have four kinds of data, four things we normally look at to create metrics.

The first is logs. Logs are the track of everything that goes forward in the system; they're based on time and a complete description, so they include all those calls, all those APIs, all those references back, all that registration, and it all goes into one big, huge thing. That can be good, and we can build metrics off of it, but not always. Sometimes we use traces instead, which are a specialized type of log that tracks the input and output for a specific program — one API or one set of calls, one server — tracking everything that's going on there. And then, if you have the right answers, there are metrics.

A metric is a standard measure of the degree to which a system, property or process possesses some attribute; it's a comparison or an analysis. Maybe it's how many users you have over time, maybe it's the power supplied to the city by hour, maybe it's lead time to deploy — how long does it take to deploy — or deployment frequency — how often am I deploying. Along the way we also have measurements, which some people forget. A measurement is just looking at something and saying this is what it is: this is how often I'm deploying, this is how often I'm doing this. It doesn't have that same sense of a metric that's going to drive something; it's just a number.

And we see a lot of people who want to pass off measurements as metrics. They want to say, I deployed so many things, and that's a lot. We had a team that looked at it this way. We were running on a scrum basis, we had these epics, these huge stories, that we figured should take six to twelve weeks, and one of the team owners decided that every time they installed a new version they were going to count it against that metric. So you looked across the teams and it was: you know what, this team is doing 70 epics every week and that team is doing one epic every three months.

So what's the difference, why is this team working so much better? When you look at it, you find out that, in context, all they were actually doing was installs, and each install took them about 15 to 30 minutes. All of a sudden they had changed the reference for the metric, changed how they were looking at it, and changed how they were moving forward. So, the four DevOps metrics — and we'll concentrate on these for the rest of the talk — are the basic DORA metrics from Accelerate, the Nicole Forsgren book, and they tell everybody what you have.

What we see, because that recommendation is everywhere, is that every organization starts up, says I'm going to be agile, I'm going to do DevOps, and starts with these four, but they're not really sure what they mean and not really sure how to apply them, other than the sense that if they have them they must be good, because the book said they're good. So: lead time to change is the time it takes a bug fix, a feature, or any other change to go from development to production. When somebody brings me a new idea, or a user comes back and says they need something, how long does it take for that idea in the backlog to go all the way into production, where the thing is actually fixed?

Now, one of the things I saw really recently with one team: they said, okay, we're going to measure this, but we're only going to measure it on things we've completed. Any item that's done has a lead time to change, and they were feeling really good about themselves. Then we went back into the backlog and looked at the issues, and they weren't counting anything that had been started but not finished — things that had been put in as ideas, that they'd agreed to work, that they'd prioritized and broken down, but that weren't done. And that's a big change to how you look at that metric.
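
A minimal sketch of that gap, using hypothetical work-item records: lead time computed only over finished items can look healthy while the age of started-but-unfinished work tells a very different story.

    from datetime import date

    # Hypothetical work items: 'done' is None for work that was started but never finished.
    items = [
        {"id": 1, "started": date(2022, 5, 2), "done": date(2022, 5, 6)},
        {"id": 2, "started": date(2022, 5, 9), "done": date(2022, 5, 12)},
        {"id": 3, "started": date(2022, 3, 1), "done": None},   # still open
        {"id": 4, "started": date(2022, 2, 14), "done": None},  # still open
    ]
    today = date(2022, 6, 18)

    completed = [(i["done"] - i["started"]).days for i in items if i["done"]]
    in_flight = [(today - i["started"]).days for i in items if not i["done"]]

    print("avg lead time, completed only:", sum(completed) / len(completed), "days")
    print("avg age of unfinished work:   ", sum(in_flight) / len(in_flight), "days")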

So that top metric is good, but it's not necessarily the best on its own. The second one is deployment frequency, the measure of how frequently your team deploys. If we're deploying, we must be in good shape, right? We're deploying new stuff, new items, we're doing great. What you see with this one is that people tend to gamify it, especially in large organizations. They look at it and say, well, every time I deploy to production I'm getting a point, because I have deployments. That's intended to keep deployments small, which it does, because smaller things get done faster and we work on them faster.

But they take it even further: they say, well, every time I write a line of code I'm going to deploy. I'm going to make changes that don't fix anything — maybe I just comment something out — but it's going to deploy, and now I've got this huge number of deployments for things that don't really deliver functional software. Then we get to the opposite side of it: the mean time to recovery, how long it takes on average to recover from those failures, and the change failure rate.

The change failure rate is the number of deployments in which something goes wrong over the total number of deployments at any given time — how often am I deploying bad things? Same thing, you can gamify it: if I'm only deploying one line of code and I've commented it out, it's probably not going to break the overall program, so I don't have to worry about it as much. So, some quick examples. If you start with DevOps — and I like to look at security as sitting right in the middle — you have those four basic metrics out there: lead time to change, deployment frequency, change failure rate, and mean time to recovery. Those are great, but there are probably some other answers, some other metrics, we want in order to get us there.
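
To keep the four together, here is a minimal sketch that computes them from a flat list of deployment records. The record shape (committed and deployed timestamps, a failed flag, a restored timestamp) is an assumption for illustration, not a standard schema.

    from datetime import datetime
    from statistics import mean

    # Hypothetical deployment records over a 30-day observation window.
    deploys = [
        {"committed": datetime(2022, 6, 1, 9),  "deployed": datetime(2022, 6, 2, 15),
         "failed": False, "restored": None},
        {"committed": datetime(2022, 6, 6, 10), "deployed": datetime(2022, 6, 6, 16),
         "failed": True,  "restored": datetime(2022, 6, 6, 19)},
        {"committed": datetime(2022, 6, 13, 8), "deployed": datetime(2022, 6, 15, 11),
         "failed": False, "restored": None},
    ]
    window_days = 30

    lead_h = [(d["deployed"] - d["committed"]).total_seconds() / 3600 for d in deploys]
    failures = [d for d in deploys if d["failed"]]
    restore_h = [(d["restored"] - d["deployed"]).total_seconds() / 3600 for d in failures]

    print("lead time to change (h, avg):", round(mean(lead_h), 1))
    print("deployment frequency (/day): ", round(len(deploys) / window_days, 2))
    print("change failure rate:         ", round(len(failures) / len(deploys), 2))
    print("mean time to restore (h):    ", round(mean(restore_h), 1))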

First of all, we don't have anything that solely measures security. We don't have anything for that security champion or that security team to look at and ask, where is my security in these DevOps metrics? Are you calling it a change failure because we released a vulnerability? Are you counting it in time to restore, in how long it takes to fix? If we want to add that security piece, it doesn't really sit there. So if we break out and use that same model — how do I know, what do I know, how do I get there — these are some of the things I quickly sketched out. Maybe we want to talk about pipeline success.

Maybe we want to talk about how often those pipelines fail: if we've got a good pipeline, we should actually see a healthy number of failures, because we're finding things and fixing them on the left side of the equation, identifying the things to fix early in the process. Maybe we want to talk about languages; maybe we've got a bunch of teams all using different languages to get their systems through. On the security side, maybe we don't just want to talk about the failures; maybe we want to talk about how many misses get released. I mentioned earlier those categorized findings, the CAT 1s, 2s and 3s: maybe we want to talk about how many of those were actually released as we run the process forward.

Maybe there's scan coverage in some of these as well. A lot of times we talk about test coverage — how much is actually tested, what percentage of the code is actually getting tested before we feel safe to release — and we get that in the pipeline. Maybe there's a scan coverage number too: how much are we actually scanning, and do we have a set percentage where we feel safe to move forward? Do we remediate those vulnerabilities in a set amount of time — how often do we fix them, and how long does it take to fix them?

If we can fix a CAT 1 vulnerability in three days when it's ours, but it takes us six months when we just downloaded it into the system, or it takes a separate policy statement because we have to get it signed and prove that we need to wait on somebody else, maybe we want to know that. How many of those vulnerabilities, how many tools are we using for scans, how many scans do we need for a release, how long do they take? We had a team that was doing their DevSecOps work; they had built the scan into the pipeline, they were using a static analysis system to do their scans, and it was great, everything was going well, they knew what the risks were.

But then they decided it was taking too long, because every run took four hours regardless of how much it was scanning, so they turned it off during the week and only ran it on the weekend. Great idea, right? Saves time. Except they kept working and coding more releases, so everything had been deployed into production before the scans ran on the weekend, and the weekend scans weren't catching everything new. Even though they had shifted it left to start with, they had shifted it back right and removed it from the process, and now they'd changed that timing. So maybe we want to know how many scans are actually running. On the ops side, maybe we want to know how long operations take.

How much time do you lose, and how many analysts does it take to find the failures? When I'm talking about changes, how many people do I have to take off their normal job and put onto looking at that failure in the process, and how many hours does it take them to do that? How much time do I have available, and how much time does it take me to respond once it goes into the system? When do I start the clock on that mean time to restore, and when do I know that it's become a failure? Is it from my own internal sensing, or is it when somebody calls me up and says, you know what, I can't get into my account, I can't get to my checking, I need to pay bills, what's going on?

Now, that's one way to look at it. The other way is to look at CALMS, which is our other DevOps model — culture, automation, lean, measurement, sharing — and break it down the same way, looking at some of those things, and that drives us to some different types of metric ideas as we move forward. Maybe in culture we want to talk about happiness; maybe we want to talk about the number of events, the tech talks, the retrospectives — how much we're talking to each other, and how often we actually go back and talk about the failures.

One of the other metrics there is the dashboards creating the awareness to drive that discussion. Maybe we want to talk about how many automation tools we're using. A lot of our teams have a tendency to bring in new automations, new tools, all the time; we see the new shiny object and we bring that in as well, and then we bring in the next one, and the next one, and now we've got multiple tools all doing the same thing, and we don't know where the overlap is or how those things interact. Again, we get things like test coverage, test duration, and path coverage — how many paths get exercised.

There are the VMs we're using, and how they're sized — some of them small, maybe some of them larger. On the lean side we talk about quality, about review blockers, and about the boards the team sets up for looking at metrics. And sharing: transparency — how much can we see when we're talking to each other? Do we have a lot of chat channels?

Do we have a common channel for everybody, or one for bugs fixed, one for bugs released, one for user feedback, so we can't correlate the information between all those things to get to the sense we want? So that gives us a great number of ideas, a great number of things to talk about and think about. But how do we do it individually? How do we get back to those things, and what are the steps you want to think about in advancing your metrics and getting started with them? We'll start with the four DORA metrics and talk about some of the questions we have as we go forward.

Maybe the first one we talk about is lead time: how long does it take to complete an action, and what kind of measurements, what kind of units, do we want to use? Do I talk about story points, do I talk about hours, do I talk about days? If you use tools where your users actually put in their real hours, then instead of "it took me one day" whether it took three hours or eight hours, you can see the difference. Does that affect my lead time to change? If I can only measure in whole days, I can't see a 1.37-day difference; I don't necessarily get that granularity all the time.
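
A small illustration of that granularity point, with made-up durations: if the tooling only records whole days, a three-hour fix and an eight-hour fix look identical, and the averaged lead time loses the detail you wanted.

    import math

    actual_hours = [3, 8, 3, 26, 5]   # hypothetical real task durations

    avg_hours = sum(actual_hours) / len(actual_hours)
    # Same work, but recorded in whole 8-hour workdays, rounded up.
    avg_days = sum(math.ceil(h / 8) for h in actual_hours) / len(actual_hours)

    print(f"average lead time: {avg_hours:.1f} hours, or {avg_days:.1f} days when rounded to days")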

Maybe it's work in progress: how much work in progress do I have compared to the other team?

Maybe the other team has sixteen items in flight, but I'm still changing those times, changing those factors, to identify the constraints. Advancing this means asking what slows things down, what factors affect my lead time. Is it that I have to wait for the end-of-sprint demo before I can release, that I need to get approval before I step forward, or can I pull it back and do something else? Who builds the task when we get it? Is the customer building the task, or is it going onto a backlog that I'm handing off to a developer with no interaction? And can I compare tasks of varying complexity?

I have no perfect way to measure between those complexities, but we've seen historically how much this matters. We see it in managing the team, in structuring work to compare assigned versus completed, in knowing what the various parameters are and then comparing the different features: am I focusing on the priority I have, the priority for scaling as we go forward, and how much work do I actually assign to that team? We see teams too often, in that sense of being overconfident and regularly making the wrong judgment, take on more work than they can actually handle in the time they're supposed to handle it. You talk to the teams and they say, well, I've done 38 story points last time and 38 the time before, so this time I'm going to take on 60 story points, because I know we can accelerate.

Then you go to the metric over time and ask, how much did you actually deliver? Well, you've never done more than 41, and that's over the past six months, so what makes you think you can do this? An important caveat here — and this goes back to why I said twelve months at the start — is that if you don't have that baseline, if you don't have that structure, you don't know how to build forward; you don't have enough of a baseline to drive from.
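
A minimal sketch of that sanity check, assuming a hypothetical history of completed story points per sprint: compare a proposed commitment against what the team has actually demonstrated.

    # Story points completed in recent sprints (hypothetical history).
    completed = [38, 35, 41, 38, 36, 39]
    proposed = 60

    baseline_avg = sum(completed) / len(completed)
    baseline_max = max(completed)

    if proposed > baseline_max:
        print(f"Proposed {proposed} points exceeds anything this team has delivered "
              f"(max {baseline_max}, average {baseline_avg:.1f}) -- likely overcommitment.")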

And every time you make changes to that team, every time you start swapping out people who are integral to that team, your metrics are going to change, and they're going to change a lot — to the point where you need a new baseline to go forward. Treating people as replaceable parts, as if they were Legos — well, if he's a yellow Lego with eight little notches on top and she's a blue Lego, it doesn't matter, because I'm swapping eight studs for eight studs — except they're not. They make a change in that overall structure, and they make a change in the metrics. And then, of course, we've got to talk about the security approach: how does the change affect the vulnerabilities we have, and how does what you put in lead to the next change?

Did I comment out the MFA because it was causing problems, and I knew single sign-on wasn't coming until later, so I could just comment it out until I need it? I'm still submitting the change, but it's going to take me a long time to get that SSO working as I integrate multiple tools. Are there impacts on code delivery? Do I have the time and the right teams to sign off on security? Do I have the champions in place who are actually looking at it, or do I have a bunch of developers who are just taking an initial look?

Not that they don't know security, but they may be taking a different look than a more integrated threat model, a more integrated process, would give. Then there's advancing the deployment frequency. How long from delivery to production, and do you know the constraints you have between delivery and production? Does production mean it's actually delivered to the customer? This is one of the ones we see mis-measured: the team says it's done and hands it off, the product organization says it's done, they do the demo, and then it doesn't actually make it to the customer, even though it's in production, because they're still supplying it to a remote site, supplying it somewhere else where someone has to go and download it and make use of the product.

How do you know it's made it into production? Are you getting the feedback, do you have the metrics and the network analysis showing that people are actually downloading and using it? We see this a lot on teams: they put it out there, but nobody's using the new version, everybody's still on the old version, they haven't done the patch, they haven't done the update. How do you make sure you have the awareness that the deployment is actually landing? And when you run your testing tools, what counts as success in your tests — is it just a check? This is another good story.

There was a team we had briefed on using DevSecOps; they were doing their initial transition, and they called me in the next day and said, hey, we've got this DevSecOps thing answered: we put the scanner in, and every time we finish the code and we're ready for deployment, it runs the scan. I said, great, let's talk a little harder about vulnerabilities; let's talk about what's actually wrong in those deployments and how you're fixing it. The individual with me said, what do you mean? I said, when you scan, there's a log, and it talks about your vulnerabilities. He kind of said, oh. I said, well, let's pull it up, show me the scanner.

So we looked at the pipeline, and the scanner was in there, shifted to the left: every time they ran the pipeline it ran the scan, and everything looked fine as far as the scan passing went — except the scan wasn't actually producing a log anyone could act on. We did help them in the end. We went back and got it set up so it actually produced the log, and once we had the log — and there's a lot of information in it, it shows everything — we ran a diff, so they compared the scan against their baseline and identified anything different. Any new risk, anything that was changing as they did the deployment, came back to them, and then they could fix those things instead of trying to fix everything and digging into all the CAT 3s that everybody already knew were going to be accepted risks and that they didn't have to spend their time on.
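
A minimal sketch of that baseline comparison, using hypothetical finding IDs: only the findings that are new relative to the accepted baseline come back for triage, instead of the full list of already-accepted CAT 3s.

    # Findings the program has already reviewed and accepted (hypothetical IDs).
    baseline = {"CVE-2021-1111", "CVE-2021-2222", "RULE-cat3-017"}

    # Findings from the scan that just ran in the pipeline.
    current_scan = {"CVE-2021-1111", "CVE-2022-3333", "RULE-cat3-017", "RULE-cat1-002"}

    new_findings = current_scan - baseline   # only what changed since the baseline
    resolved = baseline - current_scan       # accepted findings that no longer appear

    print("new findings to triage:", sorted(new_findings))
    print("no longer detected:    ", sorted(resolved))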

So: we make assumptions about change when we're doing deployment frequency. We should identify those assumptions about change, identify the assumptions we have, and write them down. These are things we want to constantly be talking about, and this is where the dashboard helps to start those discussions, because we have that measurement over time that shows us where all the processes are and how they move through. Then, time to restore: do you have a process to roll back, and does it figure into your time to restore?

Do you have to redeploy everything into production, and are the devs the only people who can do that? Are you running a blue-green system, with two different systems you can gradually shift between? Do you have a canary deployment that tells you something's wrong so you can shift back to the other one? Do you have good site reliability engineering tools in place? If you're depending on something else, if you're going out to that edge and now you've got to get a field engineer to go out there and reset things manually, do you have the agreement with them on their time, on how long they're going to take to fix it, and on what your indicators are?

Those indicators become a whole other set of metrics — service level agreements, service level indicators. When you start talking about mean time to restore, you want to talk about the root cause, because if you don't have a root cause you're just going to keep deploying failures and the numbers just continue, and nobody likes to see that happening. You want to get your system back online as soon as possible and get people working as soon as possible, or, if you can't get the new version working, roll back to the old version so it's still working for the customer and they're seeing the same things. How long does it take to discover the failure?

How long did it take before you started your restore? How long was that failure in production, and what did it actually cost? How long does it take you to fix it once you've identified what it will cost? On the security view, what's the root cause for that threat model — are they looking at attribution, are they actually looking at who's out there? An interesting note here: on the commercial side we often don't really care who did it to us. Attribution doesn't mean much unless I want to take action against them; I don't necessarily need to know who, all I need to know is what they did and what happened, so I can fix it and we can't be hit the same way again.
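
A minimal sketch of pulling that restore time apart, with hypothetical incident timestamps: time to detect, time to respond, and time to restore each answer a different question, and a single averaged total hides which one is hurting you.

    from datetime import datetime

    # Hypothetical timeline for a single failure in production.
    incident = {
        "failed":     datetime(2022, 6, 10, 2, 0),   # change actually broke production
        "detected":   datetime(2022, 6, 10, 5, 30),  # first alert or customer call
        "work_began": datetime(2022, 6, 10, 6, 15),  # someone started restoring service
        "restored":   datetime(2022, 6, 10, 8, 0),   # service back for customers
    }

    def hours(start, end):
        return (incident[end] - incident[start]).total_seconds() / 3600

    print("time to detect: ", hours("failed", "detected"), "h")
    print("time to respond:", hours("detected", "work_began"), "h")
    print("time to restore:", hours("work_began", "restored"), "h")
    print("total outage:   ", hours("failed", "restored"), "h")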

How does that convert to implementation lead time? How does it affect my lead time to change — does it change the rate at which I'm getting new answers in? If I was expecting to have a new version in production and expecting to start getting bugs back, so I blocked out time with my developers, and now none of those new bugs are coming in, that changes my planning.

What are some of the factors that affect those time items and slow recovery and time to restore? Management approval: how many of you have been in a system where, in order to make an improvement, in order to fix it, you've got to go back — and it's two o'clock in the morning, so you've got to get the manager up, make them aware of the decision, tell them what that decision is, and they've got to call someone else to approve the change, because you can't just release another build? Is there an SRE you need to bring in so we can fix these? Do we have the psychological safety, and the leadership with the authority and the accountability, to make those changes?

We can automate some of those fixes, like we talked about: hey, as soon as I detect it, I go back to the old version, service is restored for the customers, I'm good, and now I can work on fixing the new changes I've made to move forward. A help desk can tier those requests and find out who can actually solve the problem: is it a quick fix at the help desk console, something that's in the documentation the user didn't see? Is it a slightly longer fix that's going to require a little bit of configuration or a little bit of code from an engineer? Or is it a whole new bug that we need to send back to development as a change to get through?

And how long does it take to get through those notification chains? So, advancing the change failure rate: we don't want to deploy failures, so find out why — it still starts with the root cause, with the whys. Two great books here, whichever one you want: Simon Sinek's Start with Why, which is more of a leadership book than a DevOps book but is great for that kind of analysis — how do you know what you know, and what are those factors — and, as I mentioned before, John Doerr's Measure What Matters, which talks about setting those key measures: what are the things that make us successful, what are the things that drive success and drive us to those abilities? Then we compare those whys.

When you think about it: are we getting the same root cause over and over again for that change failure? Maybe it's time for a company-wide training on that new version of Kubernetes that was supposed to be used but keeps changing. Maybe, if we understood what those processes were, if we understood that common why across everything, we'd have fewer failures, because we'd fix it earlier in the process, we'd build the knowledge, we'd build the identification. And partial success — what made it a failure? At what point do we declare that it's failed? Is it when it brings the whole system down? Is it when it decreases network performance by 10 percent, or by 90 percent?

Where have we set the standard for deciding that this change failed when it deployed? Deciding that it deployed is actually pretty easy: okay, it made it through the pipelines, we've run the tests, we've had everything that's supposed to check it — but it still failed. Why is that happening? How do we measure when that failure happens so we can compare across teams? So we identify the premises, just as I was doing before, around the failure: what are the standards around a failure as it moves forward, and what are the root causes, for early identification? You want to fail early — we want to fail in the pipeline, not when it's actually deployed to production — and we want to gather the group together and learn quickly.

Failure is an opportunity to learn. The more we fail, and the earlier we fail, the more we should be finding answers together to accelerate our processes and accelerate our delivery. It creates the opportunity to succeed in the future, because we've had the small failure early, and that's what we want: small failures that move us forward. So where do you go from here? Well, first: non-decisions create errors, and decisions create errors. Any time you make a decision, you're creating an error somewhere in the process, because even if you don't think there's an error, there's going to be one; there's going to be something your decision does that creates a problem. Even go as far back as hiring.

Hey, I've got this new hire, he's great, he's got all the skills — okay, at some point he's going to make a mistake, and that's going to create an error. But a non-decision, not making that decision, is closer to a disaster. You need to have the information at hand, you need to have the metrics at hand, so you can look at what's happening and make an effective decision, a useful decision, even if it creates an error later, because the errors create our opportunity for success; they create an opportunity to learn and to move forward. Both of these go back to the second law of thermodynamics — way too many engineering courses along the way.

The second law says the entropy of any static, closed system is going to increase: if you have a closed system and you're not making any changes to it, it's eventually going to fail over time. Nothing runs forever, not even code. There's going to be friction in development, particularly in running the process, and there's going to be friction just from doing things; eventually there's going to be some entropy, some chaos, and you're going to have to make a decision to fix it, because if you don't, we're back to the disaster side of things. It comes back to identification.

And with that, we're back to our metrics. You have to know the base rate. You can't know what that base rate is, what that baseline metric is, unless you have it, unless you have the metric geared the way you want, the way that makes the most sense for your team, so it creates value and creates success. One of the key moments for making new metrics is when our growth stops. When growth stops, we want to model the best tools, we want to find out why it stopped. If we know why the growth stopped, is there a metric we can use that will show us the growth is slowing before we actually get to that stopping point?

Before we get to that failure, how do I identify the various things around that metric that I need to know? Maybe I need new sensors, maybe I just need a new metric, maybe I need a new process along the way that does something different and shows me a different analysis. We see those four DORA metrics and we see the teams that go with them: they say, well, we have lead times and I'm going to focus on those. That only makes sense if you check whether those are vanity metrics — metrics you're collecting only because they make you look good.

They may not be helping your case, or helping you improve, even if they help your case with management for now, because at some point it's going to get to something that matters, something you have to do differently, and you're not going to have enough collaboration, you're not going to have enough of a basis. I had a recent discussion — and this goes back to people making assessments and being overconfident — where I was talking to an individual about how many items they had had blocked over the previous period: how many items were blocked, and what blocked them. She said, well, we've been blocked this often, and this is what happened.

Okay. She said, it's this organization, they are blocking us, they are stopping us from doing what we need to do; I need you to go and prove that in the numbers. I said, I am happy to go and look at the numbers, I am happy to go in and do all the research, do the analysis, look at your tickets, look at the notes, look at the code, and figure out why you're blocked. I came back to her and said, hey, you know, the reason you were blocked is that your people were submitting the wrong information about your organization to get the document approved; if they had submitted the right information, it would have been approved the same day.

Because they submitted the wrong information, the other organization took a week to review it, took another week to look at it again, and sent it back saying, you sent us the wrong information, we really need this — and then they went back and forth again. And she said, well, that's great, but that's still not the information I want; I want to know how many days that team actually blocked us. I said, all right, they're not your primary cause, but for the sake of the argument we'll go back and look at the metric. So we went back in, we looked at the metric, and we found that their various code items, across the various teams they were responsible for, were blocked for a total of 176 days.

Of those, 43 days were blocked because that organization delayed, didn't have the right information, or didn't give them an answer. Unfortunately for her, the rest of those 176 days were blocked internally, because of her own folks. So you've got to know the numbers, you've got to know why they're important and what the questions are, because that's what enables you to find that baseline, find that standard rate you know how to change from — and you do that by using metrics, by understanding your metrics, and by getting to the next steps. Thank you very much; I appreciate your attention.

Let me pause there — if you have any questions, I'm happy to take them.

All right, thank you for your time.