
The Fault in Our Metrics: Rethinking How We Measure Detection & Response by Allyn Stott

BSides Tampa · 41:58 · 350 views · Published 2024-05 · Watch on YouTube ↗
About this talk
Allyn Stott, Senior Staff Engineer, Airbnb

Your metrics are boring and dangerous. Recycled slides with meaningless counts of alerts, incidents, true and false positives… SNOOZE. Even worse, it's motivating your team to distort the truth and subvert progress. This talk is your wake-up call to rethink your detection and response metrics.

Metrics tell a story. But before we can describe the effectiveness of our capabilities, our audience first needs to grasp what modern detection and response is and its value. So, how do we tell that story, especially to leadership with a limited amount of time?

Measurements help us get results. But if you're advocating for faster response times, you might be encouraging your team to make hasty decisions that lead to increased risk. So, how do we find a set of measurements, both qualitative and quantitative, that incentivizes progress and serves as a north star to modern detection and response?

Metrics help shape decisions. But legacy methods of evaluating and reporting are preventing you from getting the support and funding you need to succeed.

At the end of this talk, you'll walk away with a practical framework for developing your own metrics, a new maturity model for measuring detection and response capabilities, data gathering techniques that tell a convincing story using micro-purple testing, and lots of visual examples of metrics that won't put your audience to sleep.
Transcript [en]

[Music] Well hey y'all, thanks for coming to my talk. I've worked in detection and response for about the last decade, and I've made a lot of mistakes, especially when it comes to metrics. This is the talk I wish I had seen. Today you'll get three things: a maturity model that I've been using to describe and measure detection and response capabilities, a framework to guide you to hopefully make much better metrics, and lots of examples to give you a starting point.

My story with metrics starts on a Monday morning. I'm only a few months into a new job, and I get a message from the boss. He says that the board of directors meeting is coming up and he's looking for updated program metrics. And you can tell by my response that I'm new to senior leadership: I don't ask any questions, I'm eager to please. So I send a message to my new team and I ask them, hey, what have we presented in the past? And what's the response? Oh yeah, bad news: last guy, he made those up. But good news is, you're going to do so much better. How many times has this happened to you, where you inherit somebody else's amazingly shitty metrics? Yeah, yeah. This is often our starting place: metrics that haven't been well thought out, and maybe even worse, fudged to avoid questions or more work. And so I did what you probably did: I Googled it, and then I ended up just copying the metrics I used at my last job. Unfortunately, that's led me to using a lot of bad metrics.

But so what? Why do I care about metrics, and why should I care about metrics? Why do you care about metrics? You decided to attend a talk that had metrics in the title. Why do you care? Money? Honest answer, that's right. Yeah, proving we're doing stuff. What else? Why else do we have metrics? Yeah, so we can make intelligent decisions, so we can drive improvement. Yes, trends, yeah. There's a good quote here about metrics driving improvement. Karl Pearson, he's a late-1800s, early-1900s guy, widely viewed as the founder of modern statistics, and he's got this quote he's famous for; if you're ever writing a talk about metrics, it'll come up in your Google search: that which is measured improves. Which sounds like a great plug for metrics, but there's an implied warning in that message: what if you're measuring the wrong thing?

There's this paper written by two guys out of MIT, Hauser and Katz, called "Metrics: You Are What You Measure!" and they talk about how metrics affect actions and decisions. Those could be your technical engineering decisions or your senior leadership strategy decisions, but as you pay more attention to your metrics, you start to make decisions and take action to improve those metrics. The metrics you choose are the ones that you'll improve, and over time you'll become what you measure.

Metrics also help us communicate what we do and why people should care. I took this really great two-day course with Edward Tufte called Presenting Data and Information, and if you ever need to visualize data, I really recommend it. It has nothing to do with security, which is just a nice change of pace, and he has this entire section where he makes fun of bad PowerPoints, which is a really fun highlight. He's got this quote that says

metrics reveal data. Metrics are a tool that enables us to present the greatest number of ideas in the shortest amount of time, with the least amount of ink, in the smallest space. And why? Because, honestly, we need budget, we need headcount, and metrics are usually the tool we use to communicate that. Metrics help us prove our value: that money we gave you, are you putting it to good use? We hired you, are you doing a good job?

So why are security metrics hard? It's always changing, yeah, the landscape's always changing. Bad guys want us to measure the wrong thing, yeah. Data matching, yeah, disparate data sets. Leadership doesn't understand what's important, yes. Also, I've heard, we're trying to prove a negative, right? Like, nothing happening is good in our world. In my own personal experience, metrics are hard because I'm a security person, and I don't care that much about metrics, and I'm not super good at talking about them either. Here's a much less famous quote: metrics are an annoying PowerPoint I need to update every month. That one's mine.

A bit about me: I'm a senior staff engineer at Airbnb. I work on fun things like enterprise security, threat detection, and incident response, and I really love my job. I live in Austin, Texas with my wife and my three-year-old son Liam, and I really love being a dad and a husband. And there's one thing that I'm really good at, as a husband, as a dad, and as a security engineer: I'm really good at making mistakes. This is the point of the talk where I'm supposed to gain credibility with all of you and tell you about my accolades, my 15 years of experience, but the reality of it is that I've made a lot of mistakes. Let me tell you about five of them, and based on the metrics I've inherited over the years, you're making the same ones. Thank you very much.

The first terrible mistake I make is losing sight of the goal. How many of you have worked, or do work, the alert queue? All right, yes, a little bit, we all kind of dabble in there from time to time. The tired people in the room, that's how you know. This year marks my 10-year anniversary of being on an operational team and being on call, and it's not for everyone. It can be really tiring, a lot of people burn out, and if you're in a relationship it certainly causes a lot of strain. But I do really love it. I love investigating alerts, I love the thrill of being first at an incident, I love the chaos. But for those of us that spend a lot of our time triaging alerts and responding to fires, it can be really easy to lose sight of the goal.

So we end up describing our operational work with metrics like this one. Here's a metric that shows the number of security alerts per month. You've seen this metric. You might have this metric today. I might have this metric today. And if we take a closer look, we can see that in the past year, March and April had the most alerts; my boss will probably ask a question about that. And if we keep looking at it, generally speaking, alerts are trending down. Did we do that? Are we more or less secure now? Or, uh, in December and January the new IPS rules came out, and then I turned those off.

Alert count has become the heartbeat metric for security operations. Instead of rooting back to our goal of detecting the threats that matter the most and responding quickly and effectively, we've reduced ourselves to cries for help. I've come to call this metric "the operational burden we've inflicted on ourselves." Another title might be "we're doing things, it's crazy out there." Maybe it's fear-driven: scare leadership with a bunch of alerts. And sometimes we try to make it a little better: we break it down by true and false positives. I've actually been proud of myself for doing this one, but really it only shows how much I've lost sight of the goal. This metric assumes that there's a

direct correlation between reducing false positives and reducing operational load. And you might be thinking, wait, doesn't it? This graph also assumes fewer false positives mean higher quality alert analysis. Or is it the opposite? Do more alerts mean we have better visibility? Because I live in the operations world, I find it's really easy to lose sight of the goal, and I don't even know where to start when I'm creating metrics. So to help you think about your own metrics, I thought about all the different measurable activities in detection and response that could help us make decisions and see if we're improving, and then I made an acronym for it.

The first category is streamline, which is where a lot of our operations metrics will live, and these are usually focused on efficiency, accuracy, and automation. Awareness is where we take our threat intelligence and turn it into our top threat lists and trends. Vigilance is where we describe our visibility and detection coverage for known threats. Exploration is for the results of our threat hunting and other proactive investigations. And readiness is our measurement of whether we're ready for the next big incident. So when you're thinking about your own metrics, think about which SAVER category the metric would fall under; this can help tie you back to your goal or your outcome. And to figure out which category a metric should fall under, we can ask: what question does this metric answer?

So what question were we trying to answer with this metric? Productivity, maybe? Are false positives taking too much of our time? Do I have enough time to investigate true positives? So how do we control this

metric? Yeah, alert tuning. So how's that going for you? It does not go great. It hasn't been for a really long time. So this is a streamline metric; streamline metrics are usually interested in efficiency, accuracy, and automation. And I have two big problems with this metric. First, it doesn't tell us where we're spending most of our time, and second, the only control I have for it is tuning or turning off alerts.

So how can we make it better? Here's a graph of time spent on false positives. I removed the true positives, because for now, whatever time we spend on true positives is fine by me. We might come back to this, but for now we're saying it's fine. So instead of tracking how many false positives there are, I'm tracking how much time is spent on them. Now, how much time you spend on an alert manually could be as simple as measuring from the time the alert gets assigned to the time that somebody marks it as a false positive. Now, if your team's anything like mine, we have this amazing habit where, when we're working the alert queue, we select all the new alerts and then assign them immediately to ourselves. Why do we do that? Metrics. We do it because of metrics. What metric? Time to assign: maybe the silliest metric we could have possibly invented, because it measures, what, when somebody went in there, highlighted them all, and clicked "assign to me"? So stop measuring it, and then this metric becomes more accurate.

So how do we control this metric? What can we do to improve it? Have your data scientists go on call; lots of data scientists are just bored, right? Yeah, having a good breakdown, an understanding of your false positives. Yeah, a baseline understanding of how long things take. Yeah, automation. And as we get more automation tools, the number of events may not equate to how much time we're spending on false positives. As you automate, the time you spent manually can move into your automated value. Unless you're adding a lot of sleep statements, automation time should essentially be zero; you could add sleep statements in there that impact different metrics. But by carrying the time you spent manually over to your automated values, you can do something really cool: you can speak to the amount of human hours your automation is saving you, because a lot of times we're not that motivated to automate, since it doesn't impact the metrics that we're showing today. So now we're not just incentivized to tune our alerts, we're incentivized to find where the most manual time is being spent, so we can move that over to automated.
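To make that concrete, here's a minimal sketch, not from the talk, of one way to compute both numbers. The alert fields (`assigned_at`, `closed_at`, `disposition`, `automated`) and the baseline figure are hypothetical; substitute whatever your case management tool actually records.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Alert:
    assigned_at: datetime  # when a human actually started working the alert
    closed_at: datetime    # when it was dispositioned
    disposition: str       # "false_positive" or "true_positive"
    automated: bool        # True if automation closed it with no human time

def manual_fp_hours(alerts: list[Alert]) -> float:
    """Human hours spent manually closing false positives."""
    return sum(
        (a.closed_at - a.assigned_at).total_seconds() / 3600
        for a in alerts
        if a.disposition == "false_positive" and not a.automated
    )

def hours_saved_by_automation(alerts: list[Alert],
                              baseline_hours_per_fp: float) -> float:
    """Credit each automated false-positive closure with the manual time
    it used to cost (your baseline from before automating), so automation
    shows up as human hours saved instead of vanishing from the chart."""
    automated_fps = sum(1 for a in alerts
                        if a.disposition == "false_positive" and a.automated)
    return automated_fps * baseline_hours_per_fp
```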

The second mistake is using quantities that lack controls, or, more simply said, measuring the things you can't change. Mean time to recover is a classic incident response metric; it'll be in your Google search. In this example, we can see that recovery was as low as four hours in September and October, and it grew all the way up to 16 hours in December. And then the team pulled together, we worked really hard, and we brought recovery time back down again. Or maybe we just had a little bit too much holiday cheer.

It's funny, I've spent the last year researching metrics for detection and response, and I've learned something: we're obsessed with speed in incident response. The vast majority of metrics that come up when I search for detection and response are about mean time: time to detect, time to respond, time to contain, time to recover. I'm certainly not going to argue that speed's not important, but it can't be our sole measurement across incident phases, because that completely ignores quality and effectiveness. But my big problem with this metric is that security incidents have a lot of variability, especially the further downstream you get in the response process. There are a lot of dependencies from start to recovery, and not all of them can be controlled, at least not by your teams. So a graph like this doesn't tell me how to make better decisions, and it doesn't reveal what's controllable. And so you'll be like me: you'll stop caring about this metric, and you'll fudge it so that nobody asks any questions.

So instead, break it out by response time across the different phases. Here I've filtered out the built-in time that we need for quality. I like to do this: every response playbook that you have has some built-in time that you will need. Sure, as you mature your capabilities that time may come down, but that's not the focus of this graph; here we're looking at what we can control today. There's a really great talk on YouTube from AWS re:Invent called "The Tension Between Absolutes and Ambiguity in Security" by Eric Brandwine, and he says when you look at a graph, it should immediately answer: what do you want from me, what do you want me to do? And one of the easiest ways to do that is to make the answer zero. Here I filtered out all the time we can't reduce today; it's built in, so there's nothing for us to do. Make it zero.
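As a minimal sketch of the "make it zero" idea, assuming each incident records a timestamp per phase and each playbook declares its built-in time, something like this would chart only the controllable hours (the phase names and built-in values here are made up):

```python
from datetime import datetime

# Response phases in order, and the built-in hours per phase that your
# playbooks deliberately require for quality. Both are illustrative.
PHASES = ["detected", "triaged", "contained", "eradicated", "recovered"]
BUILTIN_HOURS = {"triaged": 1.0, "contained": 2.0,
                 "eradicated": 4.0, "recovered": 8.0}

def controllable_hours(timestamps: dict[str, datetime]) -> dict[str, float]:
    """Per-phase duration with the built-in time filtered out.

    A zero means there's nothing for us to do in that phase today;
    anything above zero is time you can actually work to reduce.
    """
    result = {}
    for start, end in zip(PHASES, PHASES[1:]):
        hours = (timestamps[end] - timestamps[start]).total_seconds() / 3600
        result[end] = max(0.0, hours - BUILTIN_HOURS[end])
    return result

# Example incident: containment reads as zero (within its built-in time),
# while eradication and recovery still have controllable hours left.
ts = {phase: datetime(2024, 1, 1, hour)
      for phase, hour in zip(PHASES, [0, 1, 2, 8, 22])}
print(controllable_hours(ts))
# {'triaged': 0.0, 'contained': 0.0, 'eradicated': 2.0, 'recovered': 6.0}
```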

So now when I look at this graph, I know where I need to do work. I say, oh wow, hey, we're actually getting through containment as quickly as we would expect today, but after containment, everything gets a little harder. And maybe that's because there's a lot more variability, more teams involved, more coordination. So we can think about: what can I control here, and what else might I need to filter out in this phase? So then you can actually focus on the things you can control.

Mistake number three: thinking proxy metrics are bad, or, in simpler terms, thinking like an engineer when it comes to metrics, and building something amazing and beautiful with vast, detailed data pipelines so you can build this amazing metric, when the reality of it is you should have just taken samples and built something that was good enough. Here's a great example. Eight years ago, my team and I determined that we wanted to see what our MITRE ATT&CK coverage was, so we could determine what types of activities we could see and not see. And this was before MITRE ATT&CK coverage was really cool and every vendor did it. Side note: I saw a great tweet the other day that said we need to do a better job of mocking vendors that claim 100% MITRE ATT&CK coverage, for many reasons, the first being that I see the carnage that 100% coverage brings, and hint: it's alert fatigue like you wouldn't believe. So we decided, hey, we're going to build tests across all of MITRE ATT&CK. And then we figured out, oh, actually, we don't need just one test for every technique; we need a lot of tests, because every technique can be done a lot of different ways, with a lot of variability. Oh, and then we have Mac, Windows, and Linux, and, you know, different versions of Windows. And so we spent all this time building tests across the entire ATT&CK framework. It took us years, and it's cool, but at the end of the day, all we really wanted to know was: where do we prioritize our detection building? What detections do we need that we don't have today? And MITRE ATT&CK is great, but it doesn't answer that for you.

So do this instead. Rather than trying to measure your detection coverage across all of ATT&CK, start by thinking about: what are the top five threats I care about the most? Don't overthink it. Look at your external threat intel and think about the industry you're in and what type of environment you have. Then look at your incident trends: what kinds of incidents are reoccurring? And then link those back to your organization's risks: what would be a really bad day for your company? What data, if exfiltrated, would make your chief legal officer just weep? And then, once you've got your top five, prioritize your detection development from there. We like to workshop these as a team, where everyone takes one of the top threats, and then we'll use ATT&CK to derive the techniques and sub-techniques for that threat, and then we come up with all the different, very environment-specific ways that we could simulate those techniques. So if data exfiltration is in your top five, we would write these purple tests for how we would simulate that. And then, as you write your tests and your detections, you will slowly, over time, build yourself a prioritized ATT&CK coverage map, but without all the alert fatigue and without a super costly metric.
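A minimal sketch of what that prioritized map might look like as data, assuming you track purple tests and detections per technique. The threat, technique picks, and names below are illustrative, not a recommended top five:

```python
# Top threats derived from threat intel, incident trends, and org risk,
# each mapped to ATT&CK techniques with environment-specific purple
# tests and the detections those tests exercise.
top_threats = {
    "data exfiltration": {
        "T1567.002": {  # Exfiltration to Cloud Storage
            "tests": ["upload_to_personal_cloud_drive"],
            "detections": ["dlp_cloud_upload_alert"],
        },
        "T1048": {      # Exfiltration Over Alternative Protocol
            "tests": ["dns_tunnel_simulation"],
            "detections": [],  # a gap: tested, but nothing fires yet
        },
    },
    # ...four more threats
}

def coverage_report(threats: dict) -> None:
    """Per-threat coverage: how many mapped techniques have a detection."""
    for threat, techniques in threats.items():
        covered = sum(1 for t in techniques.values() if t["detections"])
        print(f"{threat}: {covered}/{len(techniques)} techniques covered")

coverage_report(top_threats)  # data exfiltration: 1/2 techniques covered
```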

Mistake number four: not adjusting to the altitude. As someone who has floated between management and individual contributor, I'm very guilty of this one. How many of you have tried to explain all the different attack phases to a board of directors? I have. Sure, why not, let's do it. I think detection coverage is actually one of the better new metrics we've come up with, but wow, we've done a bad job at explaining it at the leadership level. I have seen one of those MITRE ATT&CK heat maps just slapped into a board of directors deck, as if it meant anything to them. So we need metrics at every altitude; the higher the altitude, the less it becomes about detection and response and the more it becomes about the impact to the business.

It's helpful to think about it like a pyramid. At the top of your pyramid are your north star metrics: how long does it take us to detect a threat, and then how long does it take us to get back to business as usual? Under that top layer is our coverage and effectiveness: can we detect the top threats to the business, do we have playbooks for the most critical response scenarios, do we have visibility? And then under that layer: how well do our tools perform, how much time do we spend trying to figure out what logs we need to search, and then how long does it take for us to search them? Organizing your metrics in a pyramid can help you connect those lower layers to your top layer, and also speak at the altitude that's appropriate to your audience.
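One way to picture the pyramid, as a rough sketch with made-up metric names: each lower layer points at the layer above it, so you can pull the right slice for the audience.

```python
# Three altitudes: north star metrics at the top, coverage and
# effectiveness in the middle, tool performance at the bottom.
PYRAMID = {
    "time to detect a threat": {
        "top-threat detection coverage": [
            "log search latency (p95)",
            "time to identify relevant logs",
        ],
    },
    "time back to business as usual": {
        "critical-scenario playbook coverage": [
            "containment automation rate",
        ],
    },
}

def metrics_for_altitude(level: int) -> list[str]:
    """0 = board, 1 = leadership, 2 = engineering."""
    if level == 0:
        return list(PYRAMID)
    if level == 1:
        return [mid for top in PYRAMID.values() for mid in top]
    return [low for top in PYRAMID.values()
            for lows in top.values() for low in lows]
```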

And finally, mistake number five: asking why instead of how. My natural inclination is to ask why. Why didn't we detect that malware sooner? Why am I still missing those firewall logs from the staging environment? And as a dad, I have a lot of questions. Why did we bring the car seat when we only took one taxi ride the entire trip? Why did we need four suitcases? Why didn't we bring the stroller? Why can't Liam walk by himself? But in all of these examples, why is not helping. So instead, I've learned to move straight to the how and figure out what actually needs to be done. Often, answering how allows you to identify the underlying problems much faster and with a much more positive perspective, especially for your spouse, I mean, coworker. How can I get Liam, a car seat, and many suitcases through the airport? How can I detect these types of threats sooner? How can we respond faster?

When I interviewed with my current VP, she asked me: how do we build a modern detection and response program? How do we get there? Simple question, not a simple answer. So how do we describe where we are today and where we're going? It made me think about maturity models. My first exposure to maturity models was the Hunting Maturity Model, HMM, and it was really helpful in describing the maturity of our threat hunting and what we needed to do to get to the next level. Maturity models are a tool that help us answer these questions. Where are we now? What tools and processes do we have today, what's the current situation, what are our challenges? Where are we going? What should the future look like, where do we want to be in two years? And how do we get there? What are we trying to achieve?

And so I created this threat detection and response maturity model. The TDR maturity model builds off of the Hunting Maturity Model, but it expands across all the different areas of detection and response. There's a lot to it, so I'll give you a link at the end with the full maturity model and all the descriptions, but here are the pillars of the TDR maturity model. The first is observability: the foundation that we build detection and response capabilities on. It's having the tools and logs that give us visibility into our entities and user activity, and then enriching it so we can contextualize the data and search it quickly. The second pillar is proactive threat detection, where we focus on collecting threat intel so we can prioritize the detections we build and buy and the hunts we perform. And the third is rapid response, where we prepare by having complete playbooks, enrichments, and automations, so we can move from triage to analysis as quickly and effectively as possible. And then we use these pillars, and the 14 capabilities underneath them, to describe and measure where you are today and where you want to go next.

The first question we need to answer is: where are we today? For each capability in the framework, you'll rate the maturity across four different areas: process, tools, docs, and testing. And you'll rate each of those from initial all the way up to leading. Within the framework, I've provided general guidance on how to rate the maturity of each area, and guidance specific to each capability. So, for example, if we rate our detection engineering capability and think about the processes we have: do we have a process for creating a detection that looks at first-time occurrences? Do we have a process that defines the most optimal way to find thresholds? Then we rate our tools: are detections centralized and managed? And then documentation; that's right, I put documentation in there, or, in my case and most of my life, the lack of documentation. And finally, there's testing, and because none of us test any portion of our detection engineering, I've gone ahead and given us a rating of initial. You're welcome.

As you go through the capabilities and rate them, I find it's really helpful to do these alone first: write down your ratings for each capability, and then, as a team, rate all 14 capabilities as a group, because once everyone starts talking about these capabilities, you might hear things that change your mind or confirm your own rating. And once you have agreement on all of your capabilities, you can use that to calculate what your current maturity is.
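A minimal sketch of turning those ratings into a number. The intermediate level names between initial and leading are an assumption here; the full TDR maturity model linked at the end defines its own levels and guidance.

```python
# Maturity levels mapped to scores; only "initial" and "leading" are
# named in the talk, the middle levels are placeholders.
LEVELS = {"initial": 1, "repeatable": 2, "defined": 3,
          "managed": 4, "leading": 5}
AREAS = ("process", "tools", "docs", "testing")

def capability_score(ratings: dict[str, str]) -> float:
    """Average the four area ratings for one capability."""
    return sum(LEVELS[ratings[area]] for area in AREAS) / len(AREAS)

def program_maturity(capabilities: dict[str, dict[str, str]]) -> float:
    """Average across all (e.g., 14) capabilities."""
    return (sum(capability_score(r) for r in capabilities.values())
            / len(capabilities))

# Example: the detection engineering rating from above, agreed as a team.
ratings = {"detection engineering": {"process": "defined", "tools": "repeatable",
                                     "docs": "initial", "testing": "initial"}}
print(program_maturity(ratings))  # 1.75
```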

Here's an example of how you could use this model to visualize your current program's maturity. You can describe your projects and initiatives, map those to where your maturity is today, and show how those initiatives and projects move the bar, so you have a comparison between the beginning of the year and the end of the year. I like using this tool at the leadership level because it's a simple message with a lot of good underlying details to back it up. But I also like to use it because it really shows the impact of your individual contributors' work. As an engineer, there's nothing more satisfying than knowing that the project you're working on has real impact on the overall maturity of the program.

So now, with this maturity model, you have a way to describe where you are today and what your plan is to get to the next level of target maturity. But as you do that work, you'll need metrics that show the results. Are you getting better? Are we still on track? Do we need to adjust our strategy? And that's where extending the SAVER framework comes in. For each metric you create, you'll put it into this structure. You want to avoid mistake number one, losing sight of the goal, so ask: what question does this metric answer? What outcome are we looking to achieve? And then, what category: streamline, awareness, vigilance, exploration, or readiness? So then you can tie it back to your outcome more easily, and connect that question to the top of our pyramid: what is the outcome, what is the goal of measuring this metric, how does it change our north star metric?

Then avoid mistake number two, using quantities that lack controls. Make sure that it's a metric you can actually control, and don't forget to make it zero: filter out what you can't control today, so that when you look at the metric, you know exactly what it's telling you to do. And if you do have control of the metric, what risks could this measurement reward? I was talking to a buddy of mine who runs one of those really big SOCs, you know, the ones with the big monitors on the wall and the pew-pew maps going back and forth. They still call it the pew-pew map, right? Okay, shaking heads, yes, okay. We were talking about metrics, and he brought up the time-to-analyze metric; it was one of their biggest pain points at this SOC. Overall, analysis was taking way longer than they had expected, so they brought the metric up to the team. They said, hey, time to analyze, we've got to bring it down, it's way too big, it's taking way too long. So guess what? You won't believe it: it went down. And guess what else went down? Quality of analysis. And guess what went up? True positives missed. So when you introduce a new metric, think about what potentially risky behavior it could be rewarding. It might not be a bad metric, but you might want to think about what companion metrics need to go along with it, because, remember, you become what you measure.

Then there's metric expiration: when is this metric not needed anymore? When our only lever was alert tuning, it might have made more sense to track the number of false positives, but now that we have much more automation tooling, maybe it's time to expire alert count metrics, or at least remove them from our leadership decks. The next three fields are data requirements, effort, and cost. Or, simply: how much data does the metric require? How much new effort are we going to need to improve this metric? Because, guess what, when you create a new metric, it doesn't immediately mean that people are just going to magically appear to help you improve it. And then, how much will it cost to collect this metric?
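Here's a minimal sketch of that structure as a record you could keep next to each metric; the field names paraphrase the talk, and the example values are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class MetricDefinition:
    name: str
    question: str        # what question does this metric answer?
    category: str        # streamline | awareness | vigilance | exploration | readiness
    outcome: str         # the north star outcome it ties back to
    controls: list[str]  # levers you actually have to move it
    risks_rewarded: list[str] = field(default_factory=list)  # behavior it might incentivize
    companion_metrics: list[str] = field(default_factory=list)
    expires_when: str = ""  # when this metric stops making sense
    data_requirements: str = ""
    effort: str = ""        # new effort needed to improve it
    cost: str = ""          # cost to collect it

fp_time = MetricDefinition(
    name="time spent on false positives",
    question="Are false positives taking too much of our time?",
    category="streamline",
    outcome="detect threats that matter and respond quickly",
    controls=["alert tuning", "automation"],
    risks_rewarded=["closing alerts hastily to look efficient"],
    companion_metrics=["quality of analysis", "true positives missed"],
    expires_when="automation closes most false positives",
)
```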

Don't come away from this talk saying Allyn said that we should spend all our time on metrics. First of all, no one will like you, and I don't want that coming back to me. But remember mistake number three, thinking proxy metrics are bad: testing 100% across the ATT&CK framework sounds really great, but you might not need to.

And anytime I talk about metrics, the question comes up of: so how do I change the bad metrics that I'm presenting today, especially when my audience is leadership? And I get it, change is hard. Leadership doesn't like surprises, and they often have expectations that you'll be updating last month's slide deck. So for some, it might take time to transition, but I have a tip that works well for me: changing your metrics is like a cold plunge. Here I have convinced my friend, and still friend, Dexter, much to the delight of my toddler, to get into near-freezing water. His body's first reaction was shock. His heart rate spiked when his body hit the water, he gasped, he had to work to not hyperventilate. And then, suddenly, clarity. It's the same thing when you change your metrics. It's not going to be fun immediately. Some people are going to go into a state of shock, especially when those bad metrics have been around for a while; they've gotten nice, warm, and cozy with them. But my tip is to embrace it. Push through the change, and you too will soon have clarity.

And then bring it all together. Here's an example that I put together that highlights each of the SAVER categories in a dashboard, and then, up front and center, is our program's maturity using the TDR maturity model. This way we can use the model to answer the how and the SAVER framework to answer the what, and then we can tell the story of our program. We're streamlining our operations by automating the work we do to investigate false positives, so we have more time to investigate the true positives. We looked at our threat intel and incident trends, and we're raising awareness about these top five threats, we're focusing our time on building detections for those threats, and here's where we're tracking. We've been exploring our gaps in our security controls as they pertain to those top five threats. And from a readiness perspective, we're really excelling in analysis and recovery, and we have some work to do after we've gotten initial containment.

So now you have some tools to help you rethink your metrics. Instead of making wild guesses about whether you're improving, you can use a maturity model to measure your own capabilities. Instead of buying tools and hoping for the best, because more tools means more security, right, you can now measure how each tool contributes to your overall maturity. Instead of using volume counts, fear tactics, and sad emojis, you can use SAVER to get to the core of your metric, ask better questions, and map those to something you can control. And instead of focusing on 100% MITRE ATT&CK coverage, you can focus on the threats that matter the most: find your top five threats and work on detection coverage that actually provides real impact today. So hopefully this talk is your wake-up call that it's time to take the cold plunge and rethink your detection and response metrics. Thanks so much for having me. [Applause]

And these are my links. I've got my LinkedIn and Twitter there, a copy of the slide deck with additional content, and a link to the full detection and response maturity model. I also write a very infrequent newsletter called Meoward; it has an adorable cat that people love, and the security info is decent. Anyway, I've got time for one or two questions. Do we do that now?

Yes? Oh, good question. Goodhart's law is, uh, your metrics are only good for so long, which is why I mentioned expiration in the framework, right? At some point, the thing that you're measuring might not make sense. Actually, in one of the papers that heavily references that quote, they talk about measuring the time that a call center employee is on the phone. They determined that people were spending too much time on the phone, and that's how they would figure out whether or not the call center was doing well: long calls were bad, short calls were good. But the problem was, that became the focus, instead of thinking, oh, actually, we have lots of tools we could provide; maybe the customer just needs a list of frequently asked questions, and then they stop making the call altogether. So, expiring a metric: think about whether we're still thinking about the problem the right way, or whether we just need to rethink it altogether. Maybe when we made this metric it made sense, but now it doesn't. Constantly do that; constantly think about whether this metric still makes sense in the world we live in. This is especially true as we get access to things like large language models and machine learning; some of the questions and ways that we approach problems are drastically changing, and so our metrics may drastically change.

Three questions? I'll do one more question. Yes?

Yeah, I don't want to take credit for metric pyramids; it's a thing in the literature. But I will say, it's really helpful, when you create a metric, to figure out where in your pyramid it is. Because, you know, one of the metrics that I had at a previous job was the speed of our Splunk searches, and I'm like, yeah, I bet our CEO really cares about that. He's like, I'm all about that, cool, man, awesome. All right, cool. Thanks everyone, I'll be up here for a couple more minutes. [Music]
