
All right everyone, we're getting ready to start our next talk. Our next speaker is Carla Sun, and she's going to talk to us about security incident response. Thank you so much.
There we go. Hello everyone, my name is Carla Sun. I'm a security partner on the product security team at Gusto, and today I'm going to talk to you about security incident response: in particular, what it takes to develop and scale an effective security incident response program. Rather than being like cave people relying on blunt force trauma, or attempting to reinvent the wheel from scratch, we'll examine incident response and how it works in other, non-software fields. We can beg, borrow, and steal their techniques and hard-earned lessons, and make our computer security incident response a little bit easier.

I want to be explicit: this talk comes with a fundamental content warning. We will be taking a look at aviation and medicine and the hard lessons they have learned, and their incidents tend to come with a cost in human lives. We've had a crazy three years, and there's absolutely no judgment if you don't want to hear about this stuff. Consent matters to me.

So why am I qualified to talk about incident response and how to improve it? At present I'm a security partner at Gusto, but I was previously a security incident response lead. That experience greatly motivates my work as a security partner; I joke to my peers that I use all my trauma from security incident response to try to stop incidents from repeating at Gusto. I'm also a contributor to a recently published book, Reinventing Cybersecurity. My chapter is called "Confronting Inherently Flawed Systems" (hi, hello, I am an inherently flawed system); in it I compare confronting my own undiagnosed ADHD symptoms to confronting inherently flawed systems in computing and incident response. Thanks to everybody who came to the book signing yesterday, and we'll be giving away more books on Tuesday at RSA.

So let's get started. Dealing with incident response can feel like an endless and repetitive cycle: something bad happens, you deal with it in the moment, and then the problem returns in another context. Maybe some details changed; maybe it's the exact same thing.
Infosec Twitter is rife with examples of people lamenting this state of affairs. The high-level goal of incident response, in addition to mitigating a particular incident, is to help prevent similar incidents from happening in the future. To do that, we need to build an effective program that holistically tackles all the incidents that occur over time and integrates the experience and learnings from each one. This is key so that we don't have to read the same sad stories over and over again. It's not an easy task; I may as well be asking how we boil the ocean.
As we know, that's not possible, which brings me to the first historical lesson I believe we can apply to these problems: engineering has always been about the ability to problem-solve, and the benefit of computers has always been to scale that ability. Computing solves problems by recognizing patterns of issues and breaking each issue down into smaller questions that can be answered. However, the scale of the problems solved by computers also scales the security issues we see. So we, too, must rely on pattern recognition and on breaking huge problems down into smaller yes-or-no questions.

Returning to the definition of an incident (an event or an occurrence), we face thousands of little incidents every day. So where the heck do we start when first putting together an incident response program? We are like babies who don't even know how to crawl: we need to learn how to crawl, then walk, then run. For any complex task or hurdle, and security being the huge mess that it is, it's good to draw inspiration from people who know what they're doing when faced with a huge mess. That's Marie Kondo. Marie Kondo famously enabled people to tackle a large, anxiety-inducing situation, organizing their home, like an engineer.
She has a script that breaks things down into smaller problems. Marie Kondo's method: put everything in a pile; decide and label whether each item is keep, donate, or trash; organize the items in a way that can be easily cataloged; and then repeat for the next category: clothing, kitchen items, books, sentimental items. If we match Marie Kondo's method to crawling, walking, and running, we can apply the same pattern when problem solving: we must scope, quantify, and recognize patterns before we can scale problem resolution to meet the supply and scale of problems to solve. We can apply Marie Kondo's method directly to security incidents by putting all the incidents in one place, prioritizing vulnerabilities, risks, and efforts, and recognizing patterns by organizing this information in vulnerability management systems and risk registers, all before we can effectively leverage the wealth of data available to iterate and scale our processes and responses to any given security incident.

Now that I've defined the pile, the scope of the problem I'm trying to demonstrate in building security incident response, let's focus on prioritizing vulnerabilities, risks, and efforts. Act one: let's learn how to crawl. To crawl, we must figure out how to prioritize vulnerabilities, risks, and efforts. What's a vulnerability? It's a flaw in a system that weakens the overall security of that system, and in general a confirmed vulnerability usually has a fairly clear root cause and a mitigation plan. Which brings me to my first aviation example: landing an aircraft. This is a diagram depicting the necessary components of an airport for landing any aircraft. How does air traffic control prioritize which plane to land first? In the best case, there's a schedule. In the rare case that something does happen and the schedule has to be disrupted, thanks to experience, ATC generally knows what to do. When systems are all normal and aircraft happen to be at the same altitude, priority goes to the aircraft with the least maneuverability: a hot air balloon doesn't have many ways to maneuver, so it has right-of-way over gliders, gliders over airships, and airships over airplanes. But what if one aircraft is vulnerable, or multiple aircraft are in distress at the same time within the vicinity of the same airport? In a landing emergency, the scope of the problem is the number of expected planes in that airspace, as well as the airport itself. To quantify the situation, step one, you have to figure out what the situation is and gather all that data: is the plane on fire? Is there a medical emergency? Is there an equipment failure?
ATC will always ask a pilot how many souls are on board and how much fuel is left, and the number of nearby planes that were planning to land there also needs to be tabulated. Step two, organizing, means making decisions: planes with enough fuel that are close by can hold at different altitudes, and planes with enough fuel that are far enough away can divert to other airports. More importantly, the goal is to clear the runway so that the plane in distress can land as safely as possible. The third step, iterating and scaling: everybody writes a report.
Investigators gather data from the black boxes, which are our equivalent of logs, and if something unfortunate happens, the FAA and the NTSB investigate.

So how do we apply this to vulnerabilities? When I was a baby application security engineer, I had one pile of things to manage: bug bounty reports. Quantifying the severity of each report was one of the first things I did with my bug bounty program, and I used various tools, like the CVSS calculator, the Bugcrowd VRT, and additional data I thought was interesting. My early journey with application security and bug bounties lacked consistency and communication in the art of prioritization, and Bugcrowd resolved this for me with the VRT, which stands for Vulnerability Rating Taxonomy. It was introduced in 2016, and I first used it as a baseline. The nice thing is that the VRT itself has evolved to cover even more context than it originally did, and the early versions of the VRT served me best as an appsec engineer until I switched to security incident response.

The CVSS calculator, if you've never encountered it before (CVSS stands for Common Vulnerability Scoring System), lets you calculate, across various selections, the priority you should assign to a given vulnerability. What I loved about CVSS is that it covered nearly every situation related to computers, and hey, so does security incident response. The base metric group accounts for exploitability and directly incorporates impact to the CIA triad, which you can see right there: confidentiality, integrity, and availability. The temporal metric group lets you adjust the score for questions like: was it a zero-day? Is there no patch, or is a patch incoming? The environmental metric group accounts for special circumstances related to your environment, because we are all special snowflakes. Every CVE has a CVSS score, and when everybody speaks the same language, communication is faster and fewer mistakes and misunderstandings are made. The beauty of the CVSS calculator in practice was that my colleagues and I were never more than about a full point apart when comparing our CVSS evaluations, which meant we were all aligned on how severe the situation was. These are examples of metrics I found useful to track alongside what I saw in CVSS and what I got from the VRT: priority, how much we rewarded, date reported and resolved, SLAs, who owned the fix, and who owned the affected service. Most of these stats were inspired by a talk I watched here at BSides in 2018 by Arkiti (I probably butchered your name, I'm so sorry), but it inspired a lot of my work in bug bounty.
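For anyone curious what's under the hood of that calculator: the base-score arithmetic is simple enough to sketch in a few lines. This is a simplified sketch covering only scope-unchanged vectors, using the metric weights published in the FIRST.org CVSS v3.1 specification; it's an illustration, not a full implementation.

```python
import math

# CVSS v3.1 metric weights for scope-unchanged vectors, per the FIRST.org spec.
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.20}   # Attack Vector
AC = {"L": 0.77, "H": 0.44}                          # Attack Complexity
PR = {"N": 0.85, "L": 0.62, "H": 0.27}               # Privileges Required
UI = {"N": 0.85, "R": 0.62}                          # User Interaction
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}               # Confidentiality/Integrity/Availability

def roundup(x: float) -> float:
    """Round up to one decimal place. The spec uses integer arithmetic to
    dodge floating-point edge cases; this is the simple version."""
    return math.ceil(x * 10) / 10

def base_score(av, ac, pr, ui, c, i, a):
    """Base score for a scope-unchanged CVSS v3.1 vector."""
    iss = 1 - (1 - CIA[c]) * (1 - CIA[i]) * (1 - CIA[a])
    impact = 6.42 * iss
    exploitability = 8.22 * AV[av] * AC[ac] * PR[pr] * UI[ui]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))

# CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H, e.g. a classic unauthenticated RCE
print(base_score("N", "L", "N", "N", "H", "H", "H"))  # 9.8
```

Seeing the formula also explains why two trained raters rarely land far apart: most of the spread comes from a handful of discrete metric choices, not free-form judgment.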
Once you've organized all this data, you can quickly sort: what are the completed issues, what are the old and overdue issues, what's missing an SLA, and which vulnerability categories are trending in our bug bounty metrics. Iterating on all this data allowed me to see which of the most common vulnerabilities were tied to which teams, and to prioritize which secure-code trainings I wanted to focus on next. I eventually applied this methodology to the way we handled internally reported findings, and then later used the same methodology to evaluate our security incident response vulnerabilities. It was thanks to all this aggregate bug bounty data that I knew, on a monthly, quarterly, and yearly basis, what our most common vulnerabilities were, and, like I said, I could tailor secure-code training and content to promote awareness, mitigations, and best practices, making data-based decisions. I also put out a yearly state-of-the-union report on the bug bounty, which allowed our security engineering teams to prioritize their next projects as well. With the additional data, I could point out which teams had the most overdue issues and how many days overdue they were, and when I presented this information to my CTO, my CTO could request that each team prioritize those issues. This helped my developers reset their priorities, especially when something was overdue and they didn't know where to fit it among all their pre-existing priorities.

The most interesting trend is actually the chart on the right. Sometime in 2017 I had built an automatic reminder bot for overdue issues, and in 2018, because I'm a terrible programmer, the bot broke. I checked, and the reminders had actually stopped back in April. Sure enough, the number of overdue issues spiked until I figured out that the bot was broken. In Q4 I turned it back on, and the count continued to trend downwards well into 2019. I wouldn't have noticed this if I had not gathered the data and tracked the behavior.
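A reminder bot like that doesn't need to be fancy. Here's a minimal sketch; the finding records and field names are hypothetical, and the `print` stands in for whatever chat or ticketing integration you'd actually post to.

```python
from datetime import date

# Hypothetical finding records; in practice these would come from your tracker.
findings = [
    {"id": "BB-101", "owner": "payments", "severity": "P1", "due": date(2019, 3, 1)},
    {"id": "BB-117", "owner": "payments", "severity": "P3", "due": date(2019, 8, 1)},
    {"id": "BB-120", "owner": "identity", "severity": "P2", "due": date(2019, 4, 15)},
]

def overdue(findings, today):
    """Return findings past their SLA due date, most overdue first."""
    late = [f for f in findings if f["due"] < today]
    return sorted(late, key=lambda f: f["due"])

def remind(findings, today):
    """Format one reminder line per overdue finding (stand-in for a chat post)."""
    return [
        f"{f['id']} ({f['severity']}, owner: {f['owner']}) is "
        f"{(today - f['due']).days} days overdue"
        for f in overdue(findings, today)
    ]

for line in remind(findings, date(2019, 6, 1)):
    print(line)
```

The point of the story stands either way: the value wasn't the bot itself but the tracked data that made its failure, and its effect, visible.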
I go into more detail about these findings and my accidental experiment in another talk; the link will be in my slides, and the slides will be on my Twitter account no later than tomorrow. The thing to know is that prioritizing when and how to resolve vulnerabilities ultimately reduces risk. When I approach teams with known vulnerabilities, the conversation is usually about illustrating the impact so they can compare it with their pre-existing priorities, and once we actually agree on that, something actually gets prioritized and done. So what about everything in security that doesn't have clear success criteria or a really clear root cause? How do we prioritize a risk?
A risk is a potential loss, or a chance or situation that leads to such losses. Looking to how other industries handle risk, we can turn to medicine, where the unfortunate risk is usually death. When you walk into an ER, how do they prioritize patients when it's not clear what the root cause is, and therefore what the potential treatment might actually be? Take me, for example: in April 2021 I was in a super bad way, but I was really worried that I'd be taking a bed from a severe COVID patient. When everything was said and done, I researched how hospitals actually triage patients.

This is an example of an ER triage priority chart. The triage nurse checks things like: are you choking, or is your throat closing up from an allergy? What's your blood pressure, your oxygen level, your heart rate, your coherence? They take blood and a urine test. I was solidly between a three and a four on this particular chart. Everyone who landed in red, orange, or yellow ahead of me: yes, please process them first; all things considered, they probably would have died in seconds rather than after months of neglect. In the context of prioritizing risk in the ER, the scope is everybody in the ER waiting room, plus whatever is calling ahead and coming in by ambulance; quantifying is the stats I mentioned on that chart; and then there's treatment and monitoring, which for me turned out to be IV antibiotics and watching my heart rate come back down as I was feeling better. Ideally we see these situations as either confirmation of a checklist that exists or, in the worst case, an unfortunate post-mortem.

So how can we apply this to security risks? Okay, Marie Kondo, right? We have to compile all the risks, threat models, audits, and review findings into one pile. Then we crawl, therefore quantify: we need to be able to calculate the possibility of a risk happening and what the impact would be. And you can't really calculate possibility at scale; there are thousands of factors to be aware of, and computing the compound possibilities of millions of potential outcomes is freaking impossible. But there are ways to tackle this, and the easiest way to prioritize a risk is to quantify it. This is an impact-versus-likelihood matrix, and it does exactly that: you take a risk, consider how likely it is to happen, and evaluate what the damage would be if it were exploited or left unmitigated. I think this is at least useful for mapping your priorities and getting past decision paralysis, because when you have a lot of stuff it can be really hard to start. A few downsides of this method: it doesn't account for effort, and if different people are evaluating different risks, every person's version of "unlikely" or "extremely common" or "rare" is slightly different, and it turns out that, mathematically, those errors can lead to worse-than-random outcomes. So it's not my preference to rely solely on impact versus likelihood in the first instance; it's an okay stopgap, not the worst thing. The book How to Measure Anything in Cybersecurity Risk specifically says there's no need for cybersecurity to reinvent well-established quantitative methods used in many equally complex problems. My proposal is that the CVSS score already accounts for likelihood by quantifying exploitability, as well as impact: exploitability is your likelihood, impact is your impact. FIRST.org documents the history and evolution of the calculator to demonstrate its improvements in precision, and that understanding improves my confidence in using it as a consistent means to compare risks to one another. After evaluating the priority of a particular risk, we compile all these risks into a register, and ideally we track the state of all collective risks as we make improvements or grow in scope. From there, we can prioritize efforts, and possibly mitigations, based on what we see trending in our risk registers.
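To make the register idea concrete, here's a toy sketch that ranks a mixed register: risks with CVSS scores sort directly, and anything that doesn't fit CVSS falls back to an impact-times-likelihood product. The entries, field names, and 1-5 scales are all invented for illustration.

```python
# A sketch of a risk register: each risk carries either a CVSS base score or,
# when CVSS doesn't fit, a qualitative impact x likelihood pair on 1-5 scales.
# Field names and scales here are illustrative, not a standard.

def priority(risk):
    """Normalize both scoring styles onto a 0-10 scale for sorting."""
    if "cvss" in risk:
        return risk["cvss"]
    # 1-5 x 1-5 matrix gives 1..25; rescale so the max lands at 10.
    return risk["impact"] * risk["likelihood"] * 10 / 25

register = [
    {"name": "stale offboarding process", "impact": 4, "likelihood": 3},
    {"name": "unpatched RCE in edge service", "cvss": 9.8},
    {"name": "laptop theft", "impact": 3, "likelihood": 2},
]

for risk in sorted(register, key=priority, reverse=True):
    print(f"{priority(risk):4.1f}  {risk['name']}")
```

The design choice worth noting is the single normalized scale: trends across the register only mean something if every entry can be compared on the same axis.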
I will reiterate: if you really believe a risk cannot be captured by the CVSS calculator, because maybe something exists that I don't know about, then you should use the impact-versus-likelihood matrix to mitigate hesitation, because doing something is better than doing nothing. To reduce inconsistency in measuring the priority of a risk across people, there's a technique I learned in eighth-grade science class. Does this look familiar to anyone? Show of hands? Yeah, okay. It's a graduated cylinder, used to measure the volume of liquids. The unfortunate thing is that everyone in your lab group measures slightly differently, so the technique is to have the same person measure every time. If somebody is off by ten, at least they're consistently off by ten, compared to having five different people with five different variances making five different deltas. The benefit of data is being able to observe changes between states or conditions, and using the same person reduces the overall evaluation variance compared to different people trying to make precisely the same measurement. I theorize that a risk register with one administrator reduces qualitative variance compared to every contributor to the register giving their own version of "extremely likely" or "rare."
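The same-measurer point can be shown with a toy calculation: a single rater with a constant bias produces errors with zero variance, so deltas between their measurements are trustworthy, while rotating raters with different personal biases add spread to every comparison. All the numbers below are invented for illustration.

```python
from statistics import pvariance

true_scores = [2.0, 4.0, 6.0, 8.0]           # hypothetical "true" risk scores

# One rater who always over-scores by 0.5:
single = [t + 0.5 for t in true_scores]

# Four raters, each with a different personal bias, one measurement each:
biases = [0.5, -1.0, 1.5, 0.0]
rotating = [t + b for t, b in zip(true_scores, biases)]

single_err = [m - t for m, t in zip(single, true_scores)]
rotating_err = [m - t for m, t in zip(rotating, true_scores)]

print(pvariance(single_err))    # constant offset: zero variance in the errors
print(pvariance(rotating_err))  # rater mix: nonzero variance in every comparison
```

Note that the single rater is still wrong in absolute terms; the claim is only that their wrongness is consistent, which is what matters when you're watching a register for changes over time.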
In conclusion: the ER clearly knows that stats like oxygen, coherence, blood pressure, and heart rate are the best stats for triaging ER patients, and the same must happen if we are to move in any direction on mitigating risks.

So, I've identified a ton of problems, and I'm sure you've all identified a ton of problems. How are we going to start fixing them? Prioritizing risks and vulnerabilities doesn't account for the effort it would take to remediate or mitigate them, so how do we prioritize effort? I think there are two types of effort out there. One has an absolute order: you're solving an ongoing emergency, and you already have the luxury of experience, of knowing what order things should go in. These are closer analogs to playbooks, but playbooks require maturity. Then there are efforts without a clear order, like a pile of vulnerabilities and risks that are maybe all the same priority. I theorize that the COVID-19 vaccination authorization was analogous to how we can prioritize effort. We had 300 million people in the U.S., and by the end of 2020 vaccine supply was still limited due to manufacturing constraints. Vaccine authorization was sorted by age and health status, and there was also the status of the vaccines being available under emergency use authorization. So with those known figures, what's the best order to prioritize distribution?

This chart is a very unscientific assumption of how much effort it would take to recover from COVID while unvaccinated; we knew that elderly, disabled, and immunocompromised people generally had worse outcomes. This is compared to the CDC vaccination schedule. The thing to notice in this column is that the number of vaccines you're eligible for only grows as you hit certain age milestones (3, 5, 17, and so on, until you're past 65), and at the very top of the list, the highest number of vaccinations available is usually for immunocompromised individuals. I noticed that the Department of Health and Human Services authorized the vaccine in the order of how many vaccines each age group was eligible for, versus the likely impact of complications from COVID-19. The age groups were based on how we had studied vaccines in the past and at which ages they're considered necessary. So those populations were authorized, but vaccine supply was still limited, so there was another consideration. I know it feels like so long ago, but this was the resulting order of priority in COVID-19 vaccine distribution: immunocompromised individuals were able to get theirs, even if they were younger than 65, well before the elderly or non-disabled people. The CDC continued to distribute and monitor the effectiveness of the COVID-19 vaccine, and they were able to conclude that all approved vaccinations would require boosters; currently the fourth shot is recommended for elderly and severely immunocompromised individuals. So they figured that out.

How do we sort security efforts that may not have an obvious order of operations? First we pile all the ideas together: resolutions, mitigations, and whatever your resource constraints are. Let's say this is a list of projects your team wishes to tackle. How do you choose one over another, and which efforts are actually worth doing? I believe this question can be resolved by a product design exercise that helps you prioritize and organize your project ideas, called an impact-versus-effort matrix. You have your team put all of their project ideas on sticky notes, and the first exercise is to order all the ideas in a row from lowest to highest impact. The only rule for this step is that no two projects can occupy the same x-coordinate. This is also a nice exercise to see whether your team can communicate why they believe certain projects will have the impact that justifies their chosen order; I like to do it at yearly planning. Finally, once you've ordered everything from low to high impact, you have a nice row.
You repeat the exercise again vertically: order your project ideas from low to high effort, keeping the same x-coordinate for each sticky note. The same rule applies, so no two project ideas can occupy the same y-coordinate. Let's say your team has already done this exercise; we'll draw some interesting lines, and I'm going to tell you what all the lines mean. The upper-left quadrant describes projects with low impact and high effort. These ideas should essentially be reprioritized; it would be considered a luxury to complete them, because there's a pretty low payoff for what's being proposed. The upper-right quadrant contains all the projects that are considered strategic: a significant amount of effort is needed to complete them, but the impact can be seen as extremely beneficial. The lower-left quadrant is full of easy wins: low impact, low effort, probably easy to prioritize. And the lower-right quadrant holds all the must-dos, because they are low effort and high impact. The parallel red lines are meant to be a gauge of how practical your team's project ideas are; ideally you're within or close to the red lines. This brings up some questions: does your team accidentally end up executing a lot of the luxurious ideas in the upper-left quadrant? Is the team accidentally leaving must-do projects behind? You don't know until you put it all together, compare the ideas to each other, and understand what the team sees as beneficial to do versus what seems like a bit of a waste of time. Even with the impact-versus-effort matrix, there's limited time and resources, but at least it can make it more obvious which projects you should prioritize and which you shouldn't.
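The quadrant logic itself fits in a few lines. Here's a sketch using rank positions from the two sorting passes, split at the midpoint; the project names are hypothetical.

```python
# Each project idea gets an impact rank and an effort rank from the two
# sticky-note sorting passes (higher rank = more impact / more effort).
projects = {
    "rewrite auth service":  {"impact": 4, "effort": 4},
    "enable MFA everywhere": {"impact": 3, "effort": 1},
    "new logo for the team": {"impact": 1, "effort": 3},
    "patch Tuesday script":  {"impact": 2, "effort": 2},
}

def quadrant(impact, effort, midpoint=2.5):
    """Map a (impact, effort) position to its quadrant label."""
    if impact >= midpoint:
        return "strategic" if effort >= midpoint else "must-do"
    return "luxury" if effort >= midpoint else "easy win"

for name, p in projects.items():
    print(f"{quadrant(p['impact'], p['effort']):9}  {name}")
```

In practice the value is in the team argument over the ordering, not the classification step, but writing the classification down keeps everyone honest about what the quadrants mean.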
When you retrospect your year, quarter, or sprint, you can see what was done and what was chosen, and then decide what's best to do next. Utilize the wealth of data to know that you at least did something, and did good things to move the needle, because you chose the most impactful efforts thanks to the matrix. As security professionals we know we will never win forever, so at least use that.

So: we know that computing relies on sorted information, which lets us recognize patterns and scale the way we solve smaller, broken-down problems, like any old algorithm. We know that for vulnerabilities, we need to quantify their severities, spot trends, and strategically deploy effort. For vulnerabilities that are nebulous in terms of their outcomes or root causes, there's still value in quantifying and prioritizing them as risks. When it's all compiled in your risk register, you can use it to communicate potential proactive efforts that may need leadership justification in terms of resources or even funding. And when you're ready with your solutions and mitigations, you either already know what to do from experience, or you can put all the candidate solutions into an impact-versus-effort matrix. These are obviously oversimplifications of evaluation trade-offs, but they need to be simplified, because we're going to have to apply thousands of these little decisions to resolve incidents. You're going to get a lot of reports, you're going to have limited resources, and you're going to have to make tough calls quickly. Time is of the essence when resolving incidents, so you want to reduce the amount of time spent.

Act two: how do you build security incident response? Beginning with our core principles, to scope, quantify, and recognize patterns to scale, let's look at how pilots are meant to resolve aviation incidents. Now, to make sure I'm not spreading unnecessary fear, uncertainty, and doubt about aviation emergencies, let's look at the available data on U.S. air carriers. Flying in the U.S. is actually incredibly safe: we've had a 95% decrease in U.S. air carrier fatalities in the last 20 years (sorry, I tripped over that), and, you know, the pandemic didn't really help. I believe this is because pilots know to follow the same pattern in an aviation emergency. When something goes wrong in the air, the pilot's scope is the plane, any planes nearby, the passengers, where they are, and what the plane is telling them; and let me tell you, a modern airliner tells you a lot of different things at the same time. That's essentially their Marie Kondo pile of information. All that data allows pilots to organize what they're receiving, understand their instruments and what they're telling them, and use previously recognized patterns to figure out what has happened.
pilots usually have thousands of hours of training before they actually go up in the air and fly you around and previous experience it makes them faster at choosing the right checklist because there's still so much to manage in the cockpit for example a pilot must holistically decide how to handle all the tasks in the air they must perform task initiation monitoring prioritization allocation interruption resumption termination for every task there is probably a checklist affiliated with it up in the air um so much so that pilots actually strap whole computers and checklists to their thighs these are called knee boards and even with the references task management requires mastery in knowing which checklist to use and what data to
surmise from all of the different telemetry on top of the telemetry that is available on the aircraft so thanks to the recognized patterns in past emergencies pilots are now able to prioritize their efforts in an emergency because uh in an emergency i'm sorry because of the checklists following the specified uh prioritize and that allows them to prioritize what's happening when you know that first alarm comes on which is uh okay okay pilots when faced with an immediate emergency have to follow three things in this order which is to aviate navigate then communicate basically that means don't crash the plane don't crash into a mountain while you're trying to get in control of the plane and then after only after you've done
those things you can talk things through and try to debug the situation they have to organize the incoming tsunami of information and process it and then prioritize their efforts for the best possible outcome the third step in scaling the benefits of identifying problems in the best case scenario there are no crashes or fatalities and actually if all things go well in a bad situation pilots can confidentially and voluntarily file a report of the issue they encountered this leads to safer skies for everyone but in the worst case scenario there's a lot of oversight so ntsb comes in does a post-mortem and the faa also will observe incidents in aggregate prioritize passenger safety above all else and then they do this to minimize
the occurrence of future accidents and they actually start rewriting and fixing playbooks and checklists for padlets let's apply this framework to the 737 max 8 aviation accident in indonesia october 2018 lion air flight 610 uh flew into the java sea 13 minutes after take off takeoff people on the ground saw that the aircraft was visibly pointing downwards and when they recovered the black box telemetry which is again the equivalent system of application or system logging for us they could tell that the plane was trying to force itself downwards pilots were receiving confusing warming warnings and indications from their instruments and recall aviate navigate communicate the incident was severe enough that they never completed the 88 stage of
this priority list because there was too much going on the post crash investigation determined that there was a newly added system called mcas it was discovered that it was not documented in flight training at the time for this new aircraft model and the investigation side of the lack of training with the new system the new system created unexpected behavior in the cockpit that the pilot pilots were not trained on the manufacturer announced that it would update the aircraft software to fix the issue and train pilots with an additional playbook until the software fix was distributed upon discovery of the issue caused by mcas the playbook was to disable the system when it was suspected of
misoperation in our scope we have a stricken airplane with confusing uh warnings and um to and to begin dealing with that new plane design the mcas system was not unknown to pilots at the time uh and it was in fact emitted from training mod manuals so there were no relevant checklists and trainings to deal with the new mcas system so step one here is unfortunately completely omitted pilots were made to walk before they could even crawl in step two they never really really made it out of 88 before they hit the ground and step three of course was the post crash investigation and recommendations it issues which was to disable the system if it was suspect
By now we can see some of the similarities between a security incident response for a single incident and an aviation incident. The scope is the singular incident report, which is all the things that are going wrong. Step one is to crawl: we have to quantify and then prioritize the relevant vulnerabilities, risks, and efforts. Step two is to walk: in the air, pilots need to make sense of the data they have and figure out what they're going to do about it; in security incident response, we have to figure out which efforts are going to fix the problem and prioritize them. This enables us to come up with a cohesive mitigation plan, and only with these plans can we effectively execute those prioritized efforts. Step three would be to iterate after the incident is over, so effective postmortems are key. Ideally we are looking to these postmortems to make future incidents smoother and less painful.

About six months into incident response leadership, pattern recognition allowed me to make a simplified workflow. "Simplified": I'm sorry, it looks really complicated. But essentially this was how to handle incidents consistently, and it was built from the data I had in incident reports: specifically, documented timelines and all of the detailed action items for every accepted incident in my scope. For example, every square rectangle was actually a task, some of them with their own checklists; every yellow rectangle was which subject matter experts we were calling in (one of the ones I used to call in is right there); and diamonds were if-then statements to trigger the next task or checklist.

Another pattern I recognized, because I had all this aggregate data, was that my fellow incident manager on-calls (IMOCs) were experiencing incident fatigue, and the data demonstrated that we were taking on too many low-level incidents. In aggregating that data and performing a postmortem, we figured out that we weren't declaring incident severity soon enough at the very beginning. We changed our protocols to ensure that we determined incident severity before we decided whether the incident would be accepted for incident management, which prevented burnout for my IMOCs. We used the data available with the CVSS calculator to decide which incidents were going to go to actual incident management and which were going to go to vulnerability management, where they could be self-managed. This allowed us to accept only the highest-priority incidents, and we created rules to push lower-priority incidents to vulnerability management, complete with their own postmortem worksheet. I also figured out that most of this work fell into three main workstreams: remediation, investigation, and our reporting obligations, because all of these things have to start at the same time when your SLAs are less than 72 hours. That was another thing that I found.

Sometimes, however, even when we have followed these steps, something bad happens again. In March 2019, five months after Lion Air Flight 610's accident, another 737 MAX 8 goes down: Ethiopian Airlines Flight 302 crashes in Ethiopia. So we have to look at how and why this happened again. Ethiopian Airlines Flight 302 will now require its own investigation and postmortem, and when we see the same thing happening over and over again, we suspect that something might be wrong in the way we're holistically investigating incidents. The scope now includes two individual crashes, and this is a very strong signal to audit the entire system used to run investigations. You have to scrutinize multiple incidents in aggregate, and not just the specifics of each incident but also the effectiveness of postmortem action items.

First, the individual incident: what happened with Flight 302? They find the black box of the aircraft, and they see in the logs that MCAS was triggered and the plane starts to tilt itself down. The pilots turn off the MCAS system, as trained and advised in the playbook, but the outcome is not as expected. Pilots around the world are incensed. They begin to investigate, and thanks to the logs from the black box of the aircraft, they are actually able to simulate the exact conditions of the aircraft in training simulators. In recreating Flight 302's situation, MCAS tilts the plane down, and it turns out that in Flight 302's case there were only 10 seconds to turn off MCAS. We've seen how many considerations pilots must make for individual checklists, let alone tasks; thousands of things can go wrong in the air. How can anyone responsibly force a pilot to react to something in less than 10 seconds? For some perspective, pilots are literally trained worldwide to safely land airplanes when all engines have failed, and let me tell you, that does not tend to take 10 seconds to do.
Remediations from Lion Air Flight 610's postmortem put more pilots in an impossible situation.
So the prevailing theory is that MCAS is governed by a single sensor on the MAX 8, whereas other planes usually have redundant sensor inputs. The MAX 8 lacked sensor-disagree logic: there is no way to signal or correct for a faulty sensor if a single sensor is feeding that information. Unfortunately, MCAS put these planes in an unrecoverable state if the pilots were not fast enough to react within 10 seconds. As an aside, the OWASP Top 10 vulnerabilities include something called insufficient logging and monitoring. How would you know a single sensor is failing if you don't have other sensors to compare and correlate it with? How are investigators and pilots supposed to correct for this with insufficient logging and monitoring? The result of all this was that the MAX 8 was grounded for almost two years, and there are other theories that potentially contribute to the root cause; as of today, the final report for Flight 302 is actually still pending. Lion Air Flight 610 and Ethiopian Airlines Flight 302 were repeat incidents that clearly required much more scrutiny. When you see repeated computer or web application security incidents, you end up needing to audit your entire program holistically. So now the scope is all reports, especially those that are duplicates or repeats of incidents that you're seeing.
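The single-sensor failure mode described above maps neatly onto a redundancy check. As a minimal sketch (the function, thresholds, and readings are hypothetical illustrations, not real avionics code), comparing redundant readings against their median lets you flag a disagreeing sensor instead of trusting one input blindly:

```python
from statistics import median

def check_sensors(readings, max_disagreement):
    """Compare redundant sensor readings against their median.

    Returns (consensus_value, list_of_outlier_indices). With a single
    reading there is nothing to compare against, so its value must be
    trusted blindly, which is the MAX 8 problem in miniature.
    """
    if len(readings) < 2:
        # No redundancy: no disagree logic is even possible.
        return readings[0], []
    consensus = median(readings)
    outliers = [
        i for i, value in enumerate(readings)
        if abs(value - consensus) > max_disagreement
    ]
    return consensus, outliers

# Three redundant angle-of-attack readings; the third sensor is clearly off.
value, bad = check_sensors([2.1, 2.3, 14.8], max_disagreement=1.0)
# value is the median (2.3); bad flags index 2 as disagreeing.
```

The same idea applies to security telemetry: correlating multiple independent log sources is what lets you notice that one of them is lying or broken.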
To crawl, we have to quantify our metrics in aggregate: what we did, how we did it, why we did it, who did it, and the effectiveness of what we did about it after the fact. What were the similarities and differences between the two investigations in the case of the MAX 8? This kind of data, across multiple security incidents, would allow you to better ascertain which factors make your incident response program better or worse. To walk, we have to revisit all of our checklists and playbooks for how we handle incidents. And to run, we must always be improving our incident response plans. It is essential to retrospect and introspect all the reviews that encapsulate prior incidents and similar incidents; we must postmortem our entire methodology as well. In the case of Lion Air, it was really easy to point fingers at insufficient training. The assumption was that if the pilots were aware of the system, then the system could be turned off to prevent another crash. In isolation, it was reasonable to add some training, add another playbook, add another checklist, and call it a day. But the Ethiopian crash demonstrated that blaming the pilots was not enough; additional training was not enough to save the day. The pilots successfully performed the proposed playbook and did all the right things, and MCAS still put the plane into the ground anyway.
In hindsight, Boeing's plan to require more training and keep the plane flying while waiting for the software update was all actually approved by the FAA, and it had disastrous results. Boeing was fined 20 billion dollars, lost 60 billion dollars in sales, and came under a lot more federal oversight as a result. If you are curious about these reports, they're actually pretty fascinating; I'll make them available on my Twitter. All of this additional oversight is a clear example of the meta-analysis of incident response, where a repeat incident forces the people in charge to hold a mirror to the process and strive to change it for the better. One can only imagine that similar mirror-gazing exercises are being conducted at Boeing as well. To quote Rory Kennedy, who recently put out the Netflix documentary Downfall: "they kept the planes in the air, banking on the hope that they'd create a fix before another plane crashed."

So in summary, we've covered a lot of ground today. The high-level takeaway I leave you with: we don't need to reinvent the wheel, we do not need to repeat the same sad stories, and we need to be able to crawl, walk, and run. To crawl, at a bare minimum, we need to know what the data is, organize it, and make it quantifiable; we need to be able to prioritize our vulnerabilities, risks, and efforts, lest we be caught without a playbook when push comes to shove. In walking, we need to be able to track all the data. These are the individual building blocks, so that we can build the context and history: everything we have quantified goes into the format of timelines, triage metrics, SLAs, and everything else relevant to an incident, and we should organize all of that into playbooks and workflow diagrams. We should review each incident holistically, with a critical eye, as it ends, and this should also improve what we provide to incident responders. Finally, we need to be able to run, to scale out our incident response program's effectiveness as a whole so that we can plan for the future. Tabletop exercises that simulate incidents with the people who have the most power make a huge difference in the smoothness of an incident response; this is a callback to pilots recreating the situation in training simulators. We have to learn when to say no as well, and saying no allows us to prioritize only the highest-priority incidents with the resources available. With all of this, we must continue to update our playbooks, build templates for incidents, and automate what we can from a holistic perspective. Reading the same sad story over and over again can make us really unhappy, and perhaps studying history should make us feel bad.
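The "learning to say no" triage rule from earlier, accepting only high-severity incidents into full incident management and routing the rest to vulnerability management, could be sketched roughly like this (the CVSS threshold and queue names are hypothetical illustrations, not any team's actual protocol):

```python
def route_incident(cvss_score):
    """Decide which queue a new report goes to, based on its CVSS score.

    Hypothetical policy: only High/Critical findings (CVSS >= 7.0) are
    accepted for full incident management; everything else goes to
    vulnerability management, where it can be self-managed, complete
    with its own postmortem worksheet.
    """
    if not 0.0 <= cvss_score <= 10.0:
        raise ValueError("CVSS scores range from 0.0 to 10.0")
    if cvss_score >= 7.0:
        # High or Critical: page the incident manager on-call (IMOC).
        return "incident-management"
    # None/Low/Medium: tracked and prioritized, but self-managed.
    return "vulnerability-management"

queue = route_incident(9.8)   # Critical finding goes to the IMOCs
```

The point of making the rule explicit and automatic is that severity gets decided before an incident is accepted, which is what protected the on-call rotation from drowning in low-level incidents.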
History gives us the context we need to learn from it, the wisdom to prevent it from happening again, and the motivation to say "never again." While there is a benefit in learning how to build the real wheel, there's no need to actually reinvent it. Always remember: those who do not learn from history are doomed to repeat it. These are the books, the documentary, and the BSides talks that were formative to the perspectives provided today. I want to briefly thank BSides; I've been coming since 2017, and thank you for hosting my first major 50-minute talk. I'd like to thank every teammate, co-worker, and product counsel, every CSO, every security conference mentor, every history lesson: thank you for empowering me to do things better, and let's all do better. Thank you for your time. [Applause]