
Prevent Broken Detection with a Red Teamer Turned Detection (QA) Engineer

BSides Augusta · 2025 · 53:32 · 158 views · Published 2025-10 · Watch on YouTube ↗
Category: Technical
Style: Talk
About this talk
When you ship new detection, are you crossing your fingers in hopes you did everything correctly? When there is a subtle problem in your detection pipelines, will you learn about it before a major incident occurs? A former Red Teamer tells the story of the years-long project undertaken to cut down on the amount of broken detection shipped by his Blue Team, and the career move that followed: starting from the shared pain of seeing a Blue Team come up short in Red Team operations, and moving to bringing highly integrated offensive engineering principles to Detection Engineering. This talk walks through how to read intelligence reports critically, the process of building realistic adversary tests based on intelligence sources, establishing a test environment, collecting and analyzing test performance, automating retest activity to detect drift in the environment, and integrating this process into an end-to-end Test-Driven Detection workflow for your team. This talk is most suitable for detection engineers, their leaders, and those who would benefit from an elevated detection engineering practice.
Transcript [en]

Yes. >> All right. Good. Awesome. Okay. I just want to remember the great Rusty Shackleford, who once said, "Computers don't make errors. What they do, they do on purpose." I want you to keep that in mind at all times through this presentation. All right. Welcome to Prevent Broken Detection with me, your friendly neighborhood former red teamer. Please hold all your questions to the end, and do take notes so you don't forget. I'm sure not a single person in this room who has written detection has ever tried to ship it without validating it in some way. But today my goal is to change the story in your head about how detection actually gets built, and how maybe you can worry less about a small mistake leading to a major incident, because that's not something you as an engineer should have to carry on your shoulders alone.

So, I was part of the red team where I worked starting in 2017. I was party to hundreds of operations, and if you haven't seen my talk 100 Red Team Operations a Year, and that sounds like a joke to you, that talk is on YouTube; I suggest you watch it. Anyway, the red team gives feedback to the blue team in three primary ways, at least from my perspective: it exercises incident response, identifies unknown risk in the environment, and identifies threat actor behaviors that we can't yet detect. Make sense? Yeah. Okay. But often, as a red team, we'd run into things that should have been detected but weren't. We knew there was some sort of detection in place, but an alert never actually materialized. We were seeing an unusual amount of this missed alerting on activity we thought we could detect, and we didn't have the bandwidth to test every rule or tactic we wanted covered in our organization. That's just not the role of the red team, right? And this missing middle part, where people normally think purple teams should operate, I thought was well suited for breach and attack simulation tools. That could be the TL;DR of this talk. But hold tight, there's more to this, because I wanted to find out exactly how many of our rules were just plain broken. For that, I wanted to use my experience as a red teamer analyzing intel and replicating threat actor tactics. I needed to build and run real tests in the same environment that we as a red team were trying to protect. Once I'd set up all the necessary infrastructure and run a few hundred tests, which by the way took months, I concluded that somewhere between 20 and 25% of all of our custom detection was somehow broken. And we had thousands of rules.

As the scope of this problem became more evident and the path forward a little clearer, I eventually, though reluctantly, joined the detection engineering team, and I started training our detection engineers to do the test work I was doing and to help them think more like an adversary so they could build their own tests. That experience is why I'm talking to you today. As I was reviewing all these broken rules, some broad categories of failure emerged. Engineers misunderstanding how a piece of software works, or how adversaries work, was a common theme. There were errors in the application of intelligence, including problems with the source intelligence itself, or just literal copy-and-paste errors. Unexpected or unintended changes in the environment itself were causing problems, as were unexpected or unintended changes in the detection content, such as a vendor removing a YARA rule from a repo unexpectedly when you were depending on it to detect a specific thing, or someone introducing a typo while updating a rule to add an exception.

All right, let's define the problem for this talk. When detection works perfectly, it's indistinguishable from detection that doesn't work perfectly. Think on this for a second.

It's very hard to know when detection doesn't work, because your first indication could be an actual incident that you should have caught at the moment it was delivered. This is because of the signals we get when our rules are put into production. Most of that feedback comes from actual cases generated from alerts: we get alerts, and those alerts are triaged by analysts. So those cases emit either a true positive or a false positive signal. Either we got an alert we should have done something about, or we got an alert we shouldn't have done anything about. There is no negative signal in this kind of system. By testing the detection, we scope in the false negative signal: if the known behavior occurs and we see no alert, that is a false negative. That's really useful for our detection engineering and validation purposes. There is, of course, a little nuance to this, but this is how we should think about how our detection performs.

All right. There are two critical points in our detection stack, and neither of them is really a technology. There's the origin of the behavior we want to detect, somebody doing something funny, and then there's the SOAR, or really the analyst on the other side, performing incident response activities. If an alert arrives in case management, then we've done our jobs as detection engineers. We'll call that good; let them handle everything else. Following our problem statement, the complication is that these two points are very far apart technically. I'll give you a little illustration of that, because the behavior at the origin point is converted and shipped, and converted and shipped, and converted and shipped many times before it reaches an analyst for review. I'm showing you here a bunch of real layers of complexity that may sit between those two critical points: the endpoint security tools installed on your hosts; the log collectors designed to aggregate logs from all those different tools; the transformers and parsers designed to make your logs more readable or useful in some way; and even the SIEM, which can be different from the rule engine in some cases. These all have different roles to play in this event pipeline. Any of these components can be managed by separate teams, and on top of all of them we get to layer the deployment process, maintenance, and configuration of each component. This is a multi-dimensional matrix of complexity.

And this matrix only really applies to one single environment. By environment, I don't mean your company. I mean, say, the environment that consists of people with Mac computers who work from home in a specific role. That's one environment. A host in a warehouse in Pasadena may not have the same configuration or plumbing as that remote worker. So we have all these different versions of our detection stack throughout the organization. The fourth dimension of complexity is drift: changes in the environment over time. If you could magically study your environment in a single day, the next day you come in, grab your coffee, and things are different. Every day things get upgraded, installed, decommissioned, changed, broken, fixed. Your mental model of that complex system has to be kept up to date, and that's basically an impossible task; I'll spoil that for you.

So in order to effectively execute on your job as a detection engineer, you have to hold this entire complex mental model in your head, or at least share it among your team, if you're lucky enough to have a team of detection engineers where you work. This is not something you should be expected to do perfectly, and you need shortcuts to work around these limitations. Let's add yet another dimension of complexity: any computer processing events never processes 100% of the events all of the time. Over a long enough period, there's always some measurable amount of loss in that system. All right, I baked in some breaks here because this is a dense talk.

>> [Audience comment about the] network. >> Yep. Yep. Network sensors are a great example of... oh, it just disappeared. These slides are for me, too; I need to take a break every once in a while so I don't rush through this. All right. Let's talk about driving, like in a car. Beep beep. When you're operating a motor vehicle, there's a surprising amount happening all at once. If you've been driving for years, you're probably not cognizant of how much work you're actually doing, and it's the perception of those things feeding back into your responses to the driving task itself that keeps you safe. Driving is actually a very dangerous activity that we as human beings have accepted, with a lot of mitigations put in place to make that level of safety acceptable to the general public. On the road you have all these layers of defenses: how your vehicle is designed, how the roads are designed, and all of the rules that govern how you're supposed to drive. Some of these relate to your personal ability to process information and make accurate predictions about the future, because when you're coming up to a stop sign and there's a car to your right that looks like it's not going to stop, you have to predict whether it will stop before you take your next action. Maybe the driver is a little distracted. This is your cognitive load: your ability to process this information and keep yourself safe. Say, for example, you're using a cell phone while driving, adding complexity to the already difficult driving task. Your cognitive load increases, your ability to respond to important information goes down, and you're more likely to have a collision. That's cognitive load. Detection engineering is a lot less time-sensitive than driving, but let's think about what we need to keep in mind in order to write one good rule, just one.

What is the threat actor actually doing? What is the computer doing when this behavior occurs? What related events get created based on the behavior? The computer is doing a thing; what events are generated from it? What capabilities and visibility do you have in this specific environment? Say your remote worker is working but isn't currently connected to the VPN: how does that change your visibility? Is it even possible to collect the specific events you need? And what are those events going to look like when they reach the SIEM? So, all you detection engineers and incident responders: think about what an event looks like inside your environment, then imagine a very specific one, to detect a very specific behavior. Imagine how much detail you can recreate in your mind, and then think about how accurate that is. You have to be able to put all of that together in your head and accurately predict what an event is going to look like. This way of working is actually really hard. If we just get some intel, try to imagine events in our mind, and then create a rule based on that imagined event, how good are we at actually creating a rule that detects what we want to detect, doesn't detect what we don't want to detect, and also isn't just fundamentally broken in some way?

One way we can reduce cognitive load while doing detection engineering is by reducing the need to make accurate predictions. We want to do less of this. In my opinion, the best shortcut around the complex system is the performance of an end-to-end test. We can demonstrate what's going to happen instead of predicting what's going to happen. Make sense? Yeah. Okay. Cool. Let's take stock of where we are right now. Writing good detection is hard. Systems are in a constant state of change. Tests give you the false negative feedback you desperately need. Using tests eliminates the dependency on mental models for engineers, and end-to-end tests can catch issues between integrated systems. A unit test asks: does this YARA rule conform to the syntax? That unit test only tells you whether the rule makes sense according to the syntax of the rule language itself. When you put those systems together, when you put that rule into a rule engine, the test landscape changes. You don't understand how those systems integrate without an end-to-end test, and end-to-end testing in prod is the only test that actually matters.
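To make that distinction concrete, here is a minimal sketch of the YARA "unit test," assuming the yara-python bindings; the rule itself is a toy stand-in, and passing proves syntax only, nothing about your pipeline:

```python
import yara  # pip install yara-python

# Toy stand-in rule; real rules come from your repo.
RULE = r"""
rule hta_cscript_example {
    strings:
        $s = "cscript" nocase
    condition:
        $s
}
"""

try:
    yara.compile(source=RULE)   # the whole "unit test": does it parse?
    print("rule compiles")
except yara.SyntaxError as err:
    print(f"broken rule: {err}")
# Compiling says nothing about whether the rule engine, parsers, or SIEM
# ever feed this rule an event -- that is what the end-to-end test covers.
```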

Tests can also help with some other tricky parts of the detection engineering process. I typically run into three questions when deciding whether I want to write new detection. Is it worth my time? That's prioritization on my team. Can it be done? That's feasibility. And has it already been done? That's deconfliction. Now, when I say "has it already been done," I don't mean someone at Palo Alto wrote this rule and I can go get a copy of it. I mean: have we already written detection, or do we have detection on prem, that will catch this activity? That's what I mean by deconfliction. Prioritization is typically a matter of policy or practice at your organization; once you see the intel, you'll usually know how to prioritize it in your queue of things to write rules for. That should be pretty easy. Not so much for discovering feasibility or going through the deconfliction process. This is another thing you'd otherwise need a complex mental model for: imagine knowing everything you can currently detect, and being able to accurately tell an auditor whether you can detect, say, an HTA being executed with cscript. None of us has the ability to hold all of that in our heads at once. So the first thing I recommend you do when you get an intel tasking is run a test, because it makes the feasibility and deconfliction processes much, much easier. The test will either surface events at the SIEM (feasible) or it won't, and it will either fire an existing detection (we already had coverage) or it won't. If you run the test and you don't get an alert, then you need to write detection. No more needing to make predictions based on our mental models.

Also, if you go down the path of writing your own tests for detection, I don't recommend running them once and then forgetting about them, because we need to look into that fourth dimension where all the drift lives. The ability to store and repeatably execute a known-good test gives you the opportunity to detect that drift, and any kind of breaking change, as it happens, rather than waiting days or months, or maybe never. Maybe you never figure out a rule is broken. Say you want to validate your detection for a task you're currently working on, and you execute a benign command on your own workstation just to see whether your rule is actually working. You now have some level of confidence that the rule works, but only on your machine, at that point in time. That rule is going to stay in production for years, and you don't know how the world will change around it. So maybe it makes sense to schedule your tests to run regularly and identify drift that would otherwise be really hard to build metrics for. Imagine you're in an environment where you're getting millions of events per second on your Windows index, and somehow Sysmon Event ID 11 just starts going missing. Would you be able to tell, among all those millions of events, that just the 11s weren't there anymore? Could you confidently say the volume would change enough for you to spot it? Maybe somebody who manages Sysmon on another team shipped a new configuration and accidentally configured it not to ship Event ID 11. I've been in a similar situation. You'd want to create monitoring for every type of event you could ever imagine writing detection on, but that's just not feasible, especially given the complexity of some of the schemas coming from vendors these days. So maybe it makes more sense to write tests for all of your individual rules, so you're only monitoring the events you're actually using in your detection.

Maybe you want to do this in an event-driven way: identifying critical change points in your engineering process and executing the relevant tests at those points. For things like rule query updates or schema changes, you can rerun all the tests that apply to those specific events once the change is complete, and give yourself confidence that the changes you made didn't break any of those rules. Say a vendor changes the schema of one of your feeds. You'll know very quickly after it happens, even though they're not actively communicating those changes to you, which unfortunately happens quite a bit. You will know very quickly that your rules are broken.
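A scheduled drift check could look something like this sketch; run_test, siem_query, and open_ticket are hypothetical hooks, and the Event ID 11 query string is illustrative only:

```python
import datetime

def check_for_drift(run_test, siem_query, open_ticket, test):
    """Re-run a known-good test, then confirm the event type the detection
    depends on (e.g. Sysmon Event ID 11, FileCreate) still reaches the SIEM."""
    run_test(test["id"])
    hits = siem_query(
        query=test["expected_event_query"],  # e.g. 'event_id:11 AND host:testagent-win-01'
        earliest=datetime.datetime.utcnow() - datetime.timedelta(minutes=10),
    )
    if not hits:
        # The behavior ran but the event vanished in the pipeline: drift.
        open_ticket(f"drift: test {test['id']} no longer produces expected events")
```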

So you might ask: if you're creating a test for every detection opportunity and then using automation to execute those tests regularly, won't you have hundreds of tests running a month? The answer is yes. You're going to have a lot of tests running if you go all in on this. But this exercise will make you better at building detection, better at building tests, and better at diagnosing and preventing failures, because you'll get familiar with how your rules break. It's good practice. You'll spot patterns in your workflows that are introducing defects. Like: somebody is copying and pasting hashes from a CrowdStrike report, and it turns out the report unexpectedly puts a newline in the hashes, so there's always a stray space in the hashes that get pulled into your detection, and you don't realize it. You can address that earlier in the process if you have these tests running. When things fail, you start to figure out how your team is introducing defects into detection, and then you can change the underlying methodology you use to create detection so that you can prevent these kinds of failures.
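A small sanitizer at intake is one possible guard against exactly those copy-paste defects; this sketch handles the stray-whitespace and smart-punctuation cases mentioned above:

```python
import re

def sanitize_indicator(raw: str) -> str:
    """Normalize a command line copied from a report before it lands in a rule."""
    cleaned = raw.strip()
    cleaned = cleaned.replace("\u2013", "-").replace("\u2014", "-")  # en/em dashes -> hyphen
    cleaned = cleaned.replace("\u201c", '"').replace("\u201d", '"')  # smart double quotes
    cleaned = cleaned.replace("\u2018", "'").replace("\u2019", "'")  # smart single quotes
    return re.sub(r"\s+", " ", cleaned)  # collapse stray newlines/spaces

def sanitize_hash(raw: str) -> str:
    """Hashes should contain no whitespace at all, so strip it entirely."""
    return re.sub(r"\s+", "", raw)

assert sanitize_hash("d41d8cd9\n8f00b204") == "d41d8cd98f00b204"
```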

I'm really comforted by the fact that I'm the last talk of the day, and if I go over, it doesn't matter. >> [Audience question]

>> No, thank you for that. Appreciate it. All right. Cyber threat intelligence comes in all different levels of quality, from all different sources, and all different incentives drive who creates intelligence and why. I think the best way to think about cyber threat intelligence is that it's a start, not an end, to understanding threats. So I recommend we approach detection engineering assuming that all intel reports are flawed or incomplete in some way, some in a very meaningful way, not just that it looks kind of ugly or has bad grammar. For example, some vendors will explicitly omit important information because it would reveal something sensitive about a customer, and you as a detection engineer just don't get access to that context, whether or not it's important. And if we're just copying command lines and hashes out of PDFs or blogs, there's every opportunity in the world to make very simple but very impactful mistakes. Maybe, like I mentioned, CrowdStrike adds spaces to the hashes for no sensible reason, or dashes inside command lines get converted to em dashes by WordPress or something like that. I've had that happen a couple of times. These small problems can cause your detection to be flawed in ways that are very hard to spot before you ship it. So it's essential that we look at intel with a critical lens when doing detection engineering, and then put that intel into a larger context, instead of just accepting, "oh, this command line is bad, I'm going to write a rule for it."

All right, let's take a look at an intel report. You may be asking: is that actually an intel report? It doesn't look like one. Here's my argument for why it is. Emerging threats are often tracked fastest by individual contributors posting things on social media and GitHub, and if you want to know about a campaign that kicked off last night and have a ticket open in the morning for it, you're going to take your cues from a vague social media post every once in a while. I can also say a bit about this being a legitimate intel report because we took action on it in 2022: my team looked at it and wrote rules based on it. So it's an intel report. Am I picking on this report because it's a kind of throwaway update on a Qakbot campaign? Maybe. But it provides very clear examples of how intel reports can be flawed, and I can tell you with some confidence that even reports that have been reviewed and copy-edited by vendors include some of these types of mistakes.

Just because you pay for them doesn't mean they're going to be perfect. The first issue with this report is the origin of the URL provided at the top. Of course, the infection chain doesn't start with a URL; it has to come from somewhere, and we don't get that context here. We'll only know where this URL came from if we know something else about Qakbot. To resolve this for myself, I went and looked at other intel reports around this particular Qakbot campaign, and this screenshot comes from malware-traffic-analysis.net. It seems to match pretty well with the structure of the URL and the password that's included. So I think we can pretty confidently replicate this email as part of a test and write a rule based on it, because there's a screenshot of it. It looks pretty good: it's just a password and a link. Seems pretty straightforward to put together a test and write a rule for this. But of course, we wouldn't have known any of that if we'd just read the report. This is also important because what's going to happen is the user clicks the URL, a browser opens, and the browser downloads a file to a very specific location. Normally this all just kind of makes sense to you as a detection engineer, but something you could miss is that the name of the file that gets downloaded is different from the name of the file in the URL. You wouldn't know what the downloaded zip file's name looked like unless you actually downloaded it, because it's not provided in the report. Oops.

Yeah. Okay. Made my point there. Okay. The second issue with this report is the content of the shortcut file. The researcher asserts that the shortcut contains an absolute path to this VBS file, and since there's no application specified, it's going to run with the default, probably wscript or cscript. But because we're good analysts in our role as detection engineers, we downloaded the original ISO file and analyzed the contents of the shortcut with a program called lnkinfo. Now we know the execution path provided inside the shortcut is actually relative, not absolute. The threat actor has to do this because they can't predict what drive letter they're going to start from when Explorer mounts the ISO file. This is one of those things where you have to think about what the operating system is going to do when a user actually interacts with this, and how the threat actor designed around it, because they can't predict what drive they'll end up on when this thing loads. This all means that all of the paths and all the command lines in this intel report are just incorrect. If we actually go through the process of the end user downloading the zip file, extracting the ISO, and mounting it, we can see that it gets mounted to drive E: on this particular machine, and that ends up being important if part of your detection strategy is the directory in which these malicious files are being manipulated.

Okay, so up until now, the mistakes the researcher made in communicating their findings have mostly been the result of working in a sandbox environment, VMRay for example. These sandboxes make a bunch of assumptions about the payload being run, and they don't really detonate payloads the same way a user would. The other mistake is transcriptional: Pebbles.dll, as listed in the intel report, is actually pebbles.dat when you open the ISO and look at what's inside it.

This is just a straight-up typo; they got a little confused when writing up the report. I don't see anyone really wanting to write detection on rundll32.exe executing a DLL, because that's just a normal thing to happen. But if the threat actor is using a .dat, that's actually something interesting to me, and I might want to write detection on it. If it's not correct in the report, I'm going to miss out on that detection opportunity. And why is wermgr here in the report? The structure of the report suggests that wermgr is just the next step in the infection chain, process-wise, but how are they actually using it? Is it a LOLBin? Is it a renamed malicious process? Is it an injection target? They don't explain that in the intel report. To understand this without actually running the malicious payload, I had to go look at several other blog posts about this campaign. One of the complicating factors about campaigns like this, and how researchers analyze them, is that different researchers get different samples. There are mutating elements: different file extensions, different encapsulations, different archives. Every researcher gets a slightly different sample to analyze, and that can affect the analysis. So if you're looking at a specific campaign and googling around for different blog posts, they might be describing very similar but slightly different campaigns, which can make it difficult to suss out what to actually write detection on.

Okay, so we've run through this intel report, and now we have an intel tasking in the form of this report. Here's the report; write detection. That's where we are right now. So let's step through the test-driven detection methodology I'm proposing, using this particular report.

First, we take that intel report and check whether we already have a test built for this. If you subscribe to a vendor tool or use Atomic Red Team, there may already be a test for this weird Qakbot thing we're looking at. In this case, we don't have one, so we're going to create one. Then we need to do what we just did with this intel report: critically analyze it to make sure we understand exactly what the threat actor's intent is and how that intersects with what's being said inside the report. We need a deep understanding of what's going on here. The goal is to accurately replicate these tactics in a test, so it pays to walk carefully through the report and think about each of the steps and what's going to happen in each of them. Then we create the test, using shell scripts or by executing a sample somewhere, depending on how we have our test infrastructure set up. And then we run the new test in the actual environment, using whatever tools or infrastructure we've set up in production. If, after we run our new test, we get an alert, we're done. We don't need to write any new detection, because we've discovered that something we wrote two years ago already covers this tactic, and we can move on with our day to the next intel tasking. If there was no alert, then we either need to create new detection or update existing detection. And since we have a test that we can run over and over again, we can use it as a validation process while we're developing our detection rule. We have an event, and until our rule matches that particular event, we're not quite done with our detection engineering task.

All right, penguin time. Love this. Hold on a second. There we go. Somebody wanted to go back one slide. Sorry. Okay, penguin time over. Let's replicate that behavior using a test, so we can actually see what this activity looks like when it hits the SIEM. Say you happen to have an isolated VM, and let me stress this, running a golden image of one of your workstation configurations; this thing should be running exactly how all of your other workstations are running. The process is actually very simple. You start by injecting the ISO onto the machine. We'll take a little shortcut here and mount that ISO using PowerShell, because in this particular case we don't have interactive access to the machine; we're doing all of this through scripting, and we put the ISO in a temporary directory. Once the ISO gets mounted, we run the shortcut using the start command, which ends up being very similar to how Explorer would execute that shortcut. So we're getting, not perfectly, but really close to replicating the original tactics of the threat actor.
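Here is a minimal sketch of those two steps, assuming a Python test agent that shells out to PowerShell on the isolated VM; the ISO path and shortcut name are invented for illustration:

```python
import subprocess

def run_ps(command: str) -> str:
    """Run one PowerShell command on the test VM and return its output."""
    result = subprocess.run(
        ["powershell.exe", "-NoProfile", "-Command", command],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

ISO = r"C:\Temp\replicated_sample.iso"   # the delivery artifact we injected
LNK = "document.lnk"                     # shortcut name inside the ISO

# Mount the ISO, then resolve whichever drive letter Windows assigned it.
run_ps(f"Mount-DiskImage -ImagePath '{ISO}'")
drive = run_ps(f"(Get-DiskImage -ImagePath '{ISO}' | Get-Volume).DriveLetter")

# `start` launches the shortcut much like a user double-click in Explorer.
subprocess.run(["cmd", "/c", "start", "", f"{drive}:\\{LNK}"], check=True)
```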

But if we don't have an isolated VM, which maybe you don't right now, but hopefully you will soon, we can attempt to replicate those tactics without the actual malicious component. What I do is go in and replace the DLL that gets executed with something more benign, like a hello-world DLL. In this case that's adequate to replicate the threat actor tactics, because what we're really looking at is the infection chain leading up to the execution of the DLL, not the DLL itself. Making a safe version of these infection chains can be useful, but it's also a little difficult, and it involves trade-offs in how accurately your SIEM events represent what's going on. So, we've covered the initial execution on the host, but we're missing a big part of the delivery, and we'll need to take a little bit of a different approach. Maybe we generate a pcap that contains a copy of that phishing email, or an EML version that we can send around using an email client to an internal test inbox. If we don't have an actual sample of the phishing email, this one in particular is pretty easy to replicate. You don't only have the opportunity to replicate host-based tactics. You also want to think about whether, when the phishing email comes in, any of your detection opportunities are there inside your email server or Proofpoint or whatever you're working with, so you can catch it before it actually lands and executes. And if you've got SOAR playbooks running, you can use these tests to validate those playbooks, as well as any remediation tasks your SOAR may perform, and that's a big benefit.
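For the email side of that delivery, a sketch like this could hand a replicated lure to an internal test inbox; every address, host, and the lure content are assumptions to swap for your own:

```python
import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "test-agent@corp.example"
msg["To"] = "phish-test-inbox@corp.example"
msg["Subject"] = "Scan_Report_1234"               # mimic the campaign's lure
msg.set_content(
    "Password: abc123\n"
    "hxxps://delivery.example/files/report.zip"   # defanged stand-in URL
)

# Route through the production relay so mail-layer detections see it too.
with smtplib.SMTP("mailrelay.corp.example") as smtp:
    smtp.send_message(msg)
```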

All right, I'm going to talk about platforms now. I'm not going to recommend a particular tool for this; I'm not even going to tell you which tool I use, because I don't want to give anyone any business. There are a lot of products that fit into this category, some open, some closed. What I will say is that most of them don't do all of the things I'd recommend you have in order to do end-to-end testing, so you may need to augment even the most mature products in this space, which is a little unfortunate, but let's call this practice nascent and maturing at this point in time. I will, however, talk generically about the structure of these test platforms and how they integrate into your environment. So consider an environment that looks like this simplified diagram. What we're illustrating here is a piece of malicious network traffic egressing from the network through our corporate proxy. This environment is production and is correctly plumbed into the SIEM, so we're getting all the events we would normally expect from it. And there's a rule sitting in the rule engine that can detect this particular malicious network traffic, which in turn sends an alert to case management, and the analysts now know the bad guys are through the gates. I'm going to zoom in on three key elements required to receive the signals we're looking for in testing: the SIEM, the rule engine, and the case management system. This is where we want to integrate our test platform. There's an orchestrator, the piece of the platform that holds all of the test content and is responsible for managing test jobs. It also integrates with the SIEM so we can pull alerts back in and associate them with the individual test jobs we're running. There's a reporting component as well, responsible for cataloging the results of our tests and for ingesting triggers if we want to do this in an event-driven way. It's connected to the SIEM or the rule engine, so if a rule changes, we get a notification from the rule engine and can execute a test. That's the reporting and analysis component.

And of course there's a scheduler that uses configurations and historical data to drive the test automation. If we want to run all of our tests for, say, Sneaky Snake once a month, we have that in the scheduler and the automation occurs on the calendar. And since we've run those tests before, we know when they go from a passing to a failing state. All right: we've installed test agents in two key representative environments here, in this case an internal Windows host and an external Linux host representing servers on the internet. Between them is the production corporate proxy, which does things like TLS interception. Now we're going to test this new tactic. We have a test built, and the orchestrator communicates with the test agents to execute it by running traffic between the two agents. And when we execute that test, the traffic actually crosses the network. It's in production, it's real, it's feeding into our SIEM, and the associated events are easily collected because of that. It's just running on production hardware; the only difference is that our agents are on those machines. So now we have one or more events that we can use to develop detection for this new tactic we're working with.

And when we run the test again, we can validate the detection we just created. So we have a feedback loop that we can close after receiving the correct alert for that rule. And here's our baseline: it indicates that this particular test should fire this particular alert. That's something we can keep track of over time and use in subsequent tests to determine whether things are still behaving like we'd expect. Fast forward in time: we're updating some of our detection because of a schema change, and during that change a typo is introduced into the rule. Now our rule is broken. Oh no. But because we're using event-driven test automation, that test is going to run automatically once we publish the rule, the broken rule anyway. And very quickly after that change, we're going to know about it, because when the test fails, one of our engineers gets a Jira ticket, and they know the thing is broken and that something has to get fixed.
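That event-driven path can be sketched as a handler like the following, where test_catalog, run_test, wait_for_alert, and create_jira all stand in for your platform's real integrations:

```python
def on_rule_published(rule_id, test_catalog, run_test, wait_for_alert, create_jira):
    """Rule engine notifies us that a rule was published; re-run every test
    mapped to that rule and turn any failure into an engineer's ticket."""
    for test in test_catalog.tests_for_rule(rule_id):
        run_test(test.id)
        if not wait_for_alert(test.id, timeout_minutes=15):
            create_jira(
                summary=f"{rule_id} no longer fires for test {test.id}",
                description="Passed at last baseline; failed after rule publish.",
            )
```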

All right, let's discuss some problems you may encounter if you introduce this process into your workflow. First of all: don't make incident responders work test cases. I hope I shouldn't have to say that, but it's going to make you unpopular really quickly if you start running hundreds of threat actor tests in the environment and don't clear the plate for your incident responders. You have to figure out a way to filter that activity on some level. Either do it through your log pipelines, with Logstash and tagging, or use your SOAR to retroactively remove those events from your case management system. You can do this based on agent host names, IP addresses, or email addresses. It doesn't really matter how you do it, as long as you preserve the health and sanity of your incident responders through this process.
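One possible shape for that suppression, sketched as a SOAR-style intake step; the Case model and the agent inventory are stand-ins for whatever your case management system actually exposes:

```python
from dataclasses import dataclass, field

TEST_AGENTS = {"testagent-win-01", "testagent-lin-01"}  # assumed inventory

@dataclass
class Case:
    source_host: str
    tags: list = field(default_factory=list)
    closed: bool = False

def suppress_synthetic(case: Case) -> bool:
    """Tag and close cases produced by known test agents so they never
    reach an analyst's queue; return True if the case was filtered."""
    if case.source_host in TEST_AGENTS:
        case.tags.append("synthetic-test")
        case.closed = True
        return True
    return False  # real case: route to an analyst as usual
```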

There is an organizational risk here that I'll acknowledge as a trade-off, but one that in my opinion is worth making. There's a small risk that something malicious you run in a test will escape containment and you'll have an incident on your hands. The best way to avoid this is to build good tests: be conscientious about what you're working with and how you're executing things, and add adequate guardrails to your test system. If you have no place where it's safe to detonate malicious payloads, then just don't replicate those tactics; remove the malicious elements before building them into your tests. But also trust the other people and the other systems in your organization to help backstop you if something like this happens. There's really no situation in which something that already has an intel report written about it detonates on a system you didn't want it to and completely ruins your day, because there are all sorts of other things in the environment helping protect the rest of the company. You already expect people to click on things and run things they shouldn't. If you're running an automated system to run tests, you're much less likely to actually encounter something bad happening. But it's similar: something is going to backstop you in this process, and I think it's a good trade-off.

Okay, so you're running all these tests on a schedule, but you have to examine the results, especially the failed ones. I recommend running each test at least twice, so that an intermittent failure doesn't send you into an investigation you don't need to do. Keep your test infrastructure up to date and healthy; you don't want your hosts going down or falling out of date. Say a new policy gets pushed, and it doesn't get automatically applied to your test infrastructure because it's some bespoke setup where you have to manually run gpupdate every time you want a new policy. When your test runs and the new policy isn't there, you're not going to get an alert, and you're going to have to go fix it. That's one of the important reasons to run your agents on real hosts: they get maintained and updated just like other hosts in the environment, and you have less to maintain yourself. But what if you have all these tests and you're spending way too much time diagnosing the failures? You'll really get into the weeds of why an alert didn't show up, or why an event didn't come out the way you expected. In my experience, troubleshooting these things is a really good way to exercise your incident handling skills and do root cause analysis, because I end up investigating failed tests like actual incidents. And with that root cause analysis, I'm improving communication between teams, and I'm also improving the engineering capabilities of those other teams, because often they're involved in these failures. I'm helping them better monitor their processes and improve their change management, because I'm feeding back the problems they're creating and helping them triage and fix them. The same way we're improving our detection engineering team and our ability to deliver quality detection, we're improving other teams' ability to support our detection mission. And over time, once you've done this a whole bunch, it gets a lot easier and your failures occur less often, so you'll have less to worry about. Just put up with it for a little while; I encourage you to do that.

quality is low? Uh and by low test quality, I mean the test that was put in doesn't really represent what was in the intel. And this will happen more if you're working with a team and collaborate on test building because if you become an expert in writing tests, right? You're going to have to help the rest of your team write good quality tests. Lowquality tests omit the the context that we wanted to bring to the intel, right? Low quality tests are often written to the rule that the engineer had in mind rather than to the actual intelligence or the actual payloads or malicious content that trigger the the development of the intel. Right? So what they're doing is

they're doing the same thing that they were doing there with their events. They're imagining the test in their mind before they actually go through and critically analyze the intel. And then they build that test to match the rule that they imagined which is based on the event that they imagined, right? We don't want them to do those kinds of shortcuts, right? The quality of the test is actually really important. My coworker Paul Hutmire did a great uh talk called bridging the gap at SNS blue team summit 2023. In this talk, he describes the tooling that we use internally to map thread actors and to detection and validation that we do. Uh for example, you can look at our top

thread actors in this tool that he built and you can see what our detection coverages for that thread actor. You can see all those tactics mapped to MITER attack and you can see all of the test coverage that we have for all of those rules. Right? This detection validation process helps our leadership feel more confident that not only are we doing good intel work and good detection work, but like we're actually protecting our organization against these top threat actors, right? Because who knows if it's actually working if we don't actually test these things. So you too should have some place where you can surface the results of the detection validation work that you're doing so that other

So you, too, should have some place where you can surface the results of your detection validation work, so that people on other teams and your leadership can see them. All right, these slides are going to get a little wordy from this point on. Here are some features you might want in a detection validation platform, and forgive me for reading. It needs to cover all of your detection use cases: you need to be able to run host-based events, and run pcaps, DNS requests, and raw sockets across the network. You need to be able to send emails, which is actually a bit more complicated than just replaying a pcap on the network. You want agents for all of your platforms, which not all of the open source ones have, especially when it comes to cloud and Docker. You need a place to put all your tests, and you need to be able to search those tests, including their content; if you want to find all your tests that use rundll32, you want to be able to do that inside the tool. You do want some pre-made test content, even if it's just Atomic Red Team, so that you have a starting point for running these tests, getting into a cadence, and building practice. You also need to be able to ingest alerts from your SIEM and match them to test jobs; it's very difficult to build any kind of automation, or gain much confidence, if you're manually going into your tooling and looking for alerts every time you run a test job. It needs a scheduler at a minimum, because you want to run things on a schedule, and hopefully there's a way to feed your event-driven strategy back into it as well. I've not really seen that in any of the commercial products, but we do it with custom software where I work. Also super helpful: if whoever provides your tooling has also provided a way to sandbox your production images and execute malicious payloads safely.
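That alert-to-test-job matching can be sketched as a join on time window and agent host; the record fields here are assumptions about your job and alert data:

```python
from datetime import timedelta

def match_alerts_to_job(job, alerts):
    """Return the SIEM alerts attributable to one test job."""
    window_end = job["started"] + timedelta(minutes=30)  # assumed settle time
    return [
        alert for alert in alerts
        if job["started"] <= alert["timestamp"] <= window_end
        and alert["host"] in job["agent_hosts"]
    ]
```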

There are a couple of vendors that do the sandboxing piece pretty well. None of them offer macOS, of course, because Apple, and I think we all know that pain. All right. What now? What should everyone who has any kind of scale in their detection engineering process do right now? Find a tool that allows you to write tests and get feedback after executing them. It doesn't matter what the platform is: an open project, something off the shelf, something you pay for or don't, just something you can create and manage tests with. Set up the agents, and use the existing tests that are available to get familiar with the process. Write tests for your most important detection first: if you have a rubric of threat actors you care about, look at those threat actors, look at the rules that apply to them, or at least to their tactics through MITRE ATT&CK, and write tests for those rules first. Get them into a scheduler with an automation process. And never use simulated events. Never create a unit test where you inject an event that looks like what you think the real one will look like, run validation on that, and call it a detection validation program. That's not what you're doing; you're using the same imagination for writing your tests as you were for writing your rules, and you don't want that.

This is literally just a list of breach and attack simulation tools. Some of them do these things, some of them don't. I have not reviewed any of them. Is the one I use on there? I'm not going to tell you. There is one big benefit to using a commercial solution, though: you get to offload some of the liability of running potentially malicious things onto the vendor. So think about that before writing your own test tool. You can take that on, but just imagine, if something bad happens, having somebody else to blame for it.

What have we learned? Some of your detection is currently broken. Detection systems are complex. Detection engineers rely heavily on mental models. Tests reduce the cognitive load associated with those mental models. Intel reports need critical analysis, and tests help us, or more to the point, force us, to do that. Writing tests improves the efficacy of our detection engineering work and reduces the amount of work we have to do. Tests boost your confidence, your team's, and your leadership's in the detection you're shipping. Tests are also useful throughout the entire lifetime of a detection, and all those tests need some structure and care over time to maintain. All right, thank you very much for your time and attention. But before I let you go, I'd like to ask you all a question.

place where you're you feel like your detection validation process works really well? You can you give me like two sentences on how you feel about it? >> We don't have sections for every other simulate uh at least step through all the process. >> Okay. Do you have any kind of scheduling or like automation around those tests? >> Okay. So, you run those on a regular basis to make sure that your rules are still working. >> Wonderful. Anyone else feel like they do a good job of testing their detection?

>> Do not let your customers beta test your detection. All right. Um, well, I'm a little concerned that I'm like way out in outer space in this process. So, if anyone else wants to chat about detection validation, please do come up and talk to me about it. I have stuff to give away. All right, let's see here. What is the best band of all time? All right. >> You get a land turtle. >> All right. >> All right. What should you do with synthetic events? >> Trash them. >> Who said that? >> Trash them. All right. Yeah. >> You get to pick. >> I get to pick. Um, sure.

>> All right. Who here has never been to a security conference before? All right. You in the book. Thank you so much, everyone. I appreciate your time and attention.