
I'll get started. This talk was formerly known as "So You Think You Can Detect: Detection Testing in Production," and it's now called "Spite-Driven Detections." The reason is that the service didn't have a name until recently, because, as we know, naming things is the hardest problem in engineering, and I needed something funny enough to draw people to this talk. But SPITE does have a real purpose: it stands for Security Production Integration Testing Environment, and in this talk we'll learn all about how to use Spite to drive your detections. A little bit about me: my name is Lisa, and I'm a security engineer
based in San Francisco. I'm currently at Scale AI, but previously I was at Twitch and in AWS Security. My security background is primarily in incident response and detection engineering, and I majored in computer science at Cal Poly SLO. I'm also a first-time conference speaker, so I'm hoping everyone will be really nice to me. And I have a blog that is infrequently updated, and a LinkedIn if you'd like to connect.

Let's start by exploring an early-stage detection engineering scenario and why we need spite-driven detections. This is very similar to the scenario I encountered when I joined Scale a year ago as one of the first detection hires: we didn't really have a program in place, and a lot of it was very greenfield. If you write a detection in the form of a simple query, you run it against some logs and it looks good, because you can validate the logic yourself, or hopefully by matching a log that already exists. That's all fine, but how do you know if it works in practice? Unit tests can only get you so far, because they validate the logic of the detection itself; they can't tell you whether alerts are properly configured or enabled in the end application.
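To make that concrete, here's a minimal sketch in the style of the Python detections-as-code we'll meet later in this talk; the rule, event fields, and test values are hypothetical, but notice that the unit test exercises only the logic:

```python
# Minimal detection-as-code sketch; field names and events are illustrative.
def rule(event: dict) -> bool:
    # Fire when CloudTrail records a manual RDS snapshot being created.
    return event.get("eventName") == "CreateDBSnapshot"


# This unit test can pass forever, even if the log source is disabled,
# the schema drifts, or the alert destination is misconfigured.
def test_rule():
    assert rule({"eventName": "CreateDBSnapshot"})
    assert not rule({"eventName": "DescribeDBInstances"})
```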
There are several points of failure in a detection pipeline beyond the detection logic being incorrect, and these points can be difficult to monitor: log schema changes, and failures with ingestion or delivery. Even with alerting set up for these potential points of failure, such methods are prone to introducing alert fatigue through bursty alert volume.

As I mentioned, when I joined Scale we were in the process of migrating SIEMs and building the detection program from ground zero, so Spite didn't yet exist. That migration actually uncovered a few stale detections, ones that were no longer valid because of configuration, syntax, or other forms of drift. We had no idea this was the case, because nothing was visibly broken; the logic in the detection just wasn't valid anymore. In addition, our previous platform didn't include a way to write unit tests, so we had to manually create a log, or search for an example log, to match our detection query. Because of all this eyeballing and guesstimating, we weren't actually sure whether our detections were operating as we intended them to. After Spite was implemented, we created a PR that included a typo in a string; that one line of code broke alerting, and we discovered it within hours, on our next testing run, because we had Spite. And now, as we rewrite rules from a V1 to a V2 implementation, we get signal very quickly on whether the migration completed successfully.

Spite-driven development was a term I coined out of frustration.
Sometimes I'd be stuck on a bug or a feature for so long that my primary motivation to complete the task was spite: I couldn't let the task win, and I had to defeat it by being done with it. We've now retrofitted that idea to name this service, whose original name caused confusion by overloading terms like "alert," "detection," and "test," and was also a mouthful to say. So: Spite. Now, I've been throwing this term around and still haven't really explained what it means, which is probably what you're all wondering. What is spite-driven detections, beyond the motto "code until you hate it, hate it until you've coded it"?

I'd like to introduce you to a more mature detection engineering scenario, the one we have arrived at today: detections driven by, and incorporating, live testing on production infrastructure. This is the core of Spite. Some examples of what this entails: performing privilege escalation on a user account in our production Okta organization, or taking a snapshot of a database in a production AWS account. These are all actions we can script via API calls, and they produce alerts using the same logging and detection systems we use to identify anomalous behavior. This means that instead of using a duplicate testing pipeline, which doesn't see the same throughput as prod and is also subject to drift, we perform these actions in, and on, production systems.
While live testing is crucial for continually maintaining and validating the detection pipeline, we can also use it to drive our detection development. Spite is a platform for automated adversary emulation in the form of containerized tests. With this setup, we can discover errors in the detection pipeline within hours instead of months or years. We are taking the guesswork out of detections: this time, we don't just set them and forget them. We don't just write a detection to match a log; we write a detection to match a behavior that is repeatedly being performed, caught, and validated on our production systems.

Now we'll take a quick detour to cover a bit of our internal detection stack. The most important platform I'll be referring to is a SIEM called Panther, which we use to ingest logs from various sources and to write our detections as code. This SIEM, like many traditional SIEM offerings, alerts when log pipelines stop or when schema parsing fails, but not when alerts fail to fire, because you need throughput through these pipelines to be able to see that. This is one of the many gaps Spite covers, because we're constantly detonating tests at every step in the pipeline. We use Jira to surface our alerts and enrich them for review, and with Spite we can drive down alert fatigue and alert-processing errors overall by getting these alerts right from the start. NAN is a webhook-based platform we use to add post-alert enrichments and automate pieces of our alerting pipeline.

Now, this is a system diagram of the Spite architecture, which I'll break down and walk through over the coming slides. Our infrastructure is based primarily in AWS. The core of Spite as a service is an ECS parent service that orchestrates containerized tests, each packaged as a Fargate task, that run in parallel.
The ECS parent service is blind to the logic within the tests and just sees tasks it has to run. Go's built-in concurrency features (channels, goroutines, and mutexes) let us run these tests in parallel. The tests themselves are containerized commands that execute actions that should trigger an alert, and the system uses the existing detections infrastructure to process the logs generated by the testing activity. In this way, we stress test the actual detection infrastructure, not a clone of prod.
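The real parent service is written in Go and fans tests out with goroutines and channels. As a rough illustration of the same fan-out pattern, here's a minimal Python sketch that launches several Fargate task definitions in parallel with boto3; the cluster, subnet, and task names are all assumptions, not our real values:

```python
# Hypothetical sketch of the fan-out: run each containerized test as a
# Fargate task in parallel, then collect the launch results.
from concurrent.futures import ThreadPoolExecutor
import boto3

ecs = boto3.client("ecs")
TESTS = ["spite-rds-snapshot", "spite-okta-admin-grant"]  # illustrative names

def run_test(task_definition: str) -> str:
    resp = ecs.run_task(
        cluster="spite-cluster",          # assumed cluster name
        taskDefinition=task_definition,
        launchType="FARGATE",
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-0123456789abcdef0"],  # assumed subnet
                "assignPublicIp": "DISABLED",
            }
        },
    )
    return resp["tasks"][0]["taskArn"]

with ThreadPoolExecutor() as pool:
    for arn in pool.map(run_test, TESTS):
        print("launched", arn)
```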
Expanding outside of the ECS container core, we have a lot of other pieces of infrastructure that help us run this smoothly. We write logic to reroute each alert once it fires in the SIEM: even though we're reusing the same SIEM and the same logging and detection infrastructure, we don't populate the main alert queue with these alerts. We redirect them to land in an SQS queue that the parent service polls once it has finished kicking off test containers. This is useful not only because the parent service can both run and validate tests, but also because it reduces alert fatigue: we're not cluttering up the main queue with test results. The setup also includes a Lambda function that processes the service's dead-letter queue, which holds alerts received while the service is not actively testing, an edge case I'll expand on later. We use GitHub Actions to build and store the container images in ECR during the PR stage of our development cycle, and EventBridge to schedule the parent service at predefined intervals, which also introduces the potential to schedule tests out of band, some future work we'll touch on later.

A quick overview of the tech and tools used in this project: Terraform for infrastructure as code, with deploys managed via Atlantis; Go for the main service code; AWS for the parent service and test infrastructure; Python for most tests, though some are shell scripts; Dockerfiles for our container configuration; and GitHub Actions to build and deploy images. And again, we
rely on existing infrastructure for accuracy and simplicity in the system.

Now we'll take a closer look at what each test is made of. When I refer to a test, there are three components: a Dockerfile, which sets the environment and entry point for the script; the main test code file, which defines the test logic and describes the actions for the test to take; and finally the task definition, which lets us set up the container in AWS and build the environment to run the script in. We must also set up or enable the corresponding testing resources the test will interface with. This could be an AWS, G Suite, or Okta resource, or just another configured endpoint, depending on the type of detection. To enable a test, in addition to deploying this infrastructure in prod, we add the name of the test to a YAML config map; the service reads this file to determine which tests should be run. However, you can still run tests without them being in the config file, because once you create a task definition in AWS, you can go and run that container directly, which is really handy for ad hoc testing.

This is a simplified visualization of what I've been discussing and am about to walk through in more detail. We have our testing components that, when run, generate logs
that populate in our SIEM and trigger alerts. Spite receives these alerts, validates them, and logs the result. This entire system is fully hands-off and autonomous once tests are productionized, so we get to run these tests repeatedly against our environment without additional lift. As we speak, tests are being scheduled, run, and logged without us having to do anything extra.

Now I want to show you just how simple setting up a test can be and exactly what it looks like. I'm walking through one of our more basic tests, one of the first we wrote, which corresponds to a detection for taking manual snapshots of RDS databases. I've set up an empty database and put it behind our testing VPC and security groups that only allow our testing resources to access it. I've marked it as a testing resource so we can clearly identify it during test validation and won't confuse it with other resources. This is the Dockerfile: we're using a basic Python image and boto3 to interface with the AWS API; here we also install dependencies and copy in the script containing the test actions. And here we're scripting the setup. I'm going to show a lot of code, but I don't expect anyone to read it because it's really tiny, so here are the important parts: we pull environment variables and make sure they exist in the runtime environment, then connect to the database and take a snapshot. After a time, the snapshot is deleted so that we aren't collecting a bunch of public snapshots; that's the cleanup portion of the test.
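As a rough sketch of what that script does (this isn't the actual Spite test; the environment variable names, snapshot identifier, and use of a waiter are assumptions):

```python
# Hypothetical sketch of the RDS snapshot test: create a manual snapshot
# on the dedicated testing database, wait, then clean up after ourselves.
import os
import boto3

# Fail fast if the task definition didn't inject the env vars we expect.
DB_INSTANCE = os.environ["SPITE_TEST_DB_INSTANCE"]  # assumed variable name
SNAPSHOT_ID = os.environ.get("SPITE_SNAPSHOT_ID", "spite-test-snapshot")

rds = boto3.client("rds")

# The action under test: this API call is what the detection should catch.
rds.create_db_snapshot(
    DBSnapshotIdentifier=SNAPSHOT_ID,
    DBInstanceIdentifier=DB_INSTANCE,
)

# Cleanup: wait until the snapshot is available, then delete it so test
# snapshots don't accumulate.
waiter = rds.get_waiter("db_snapshot_available")
waiter.wait(DBSnapshotIdentifier=SNAPSHOT_ID)
rds.delete_db_snapshot(DBSnapshotIdentifier=SNAPSHOT_ID)
```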
This is the task definition, where we set the compute for the container, set up a logging destination, and assign roles for resource management. So, welcome back: those are the three components required for a test. And enabling a test is as simple as adding a block of text to our YAML config file, which includes the name of the test, the names of the relevant files, and a tag.
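The exact schema is Spite's own, but a hypothetical entry, and how a service might read it, could look like this (every field name here is an assumption):

```python
# Hypothetical shape of a config map entry and how a service might read it.
import yaml  # pip install pyyaml

EXAMPLE_CONFIG = """
tests:
  - name: rds-manual-snapshot
    task_definition: spite-rds-snapshot
    alert_rule: AWS.RDS.ManualSnapshotCreated
    tag: aws
"""

config = yaml.safe_load(EXAMPLE_CONFIG)
enabled = [t["name"] for t in config["tests"]]
print(enabled)  # ['rds-manual-snapshot']
```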
And here's what the service looks like in execution. It's a bit boring, because everything just happens behind the scenes, but we log everything from the service to CloudWatch: information about each task, which ones are running, what the results are, and the status of the containers. After the actions are kicked off, we can also see them populate in our SIEM very shortly. This is what one of our detections firing looks like in Panther. We've also written logic to redirect this alert back to the Spite service for validation, so even though it's an alert in our production SIEM, it won't end up in our regular alerting queues. Back in the Spite service, we're continually polling a queue for those alerts. At this point, all the tests have been run, all the containers have been kicked off, and all of these little action packages are running in parallel; now we're polling the queue that we reroute the alerts to. Once we receive an alert, we parse the log for details: is this the log we expect based on the config map? Is it one of the tests we ran? Do we expect it to match one of our test cases? In a successful run, we simply log the result of the test. If we experience any errors, we both log the result and surface an alert to our errors queue, which is named "sec-oopsie."
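A minimal sketch of that poll-and-validate loop, assuming a redirected SQS queue and a config map like the one above (the queue URL and the alert's message shape are hypothetical):

```python
# Hypothetical sketch: poll the redirected alert queue and check each
# alert against the set of tests we expect to have fired.
import json
import boto3

QUEUE_URL = "https://sqs.us-west-2.amazonaws.com/123456789012/spite-alerts"  # assumed
EXPECTED = {"AWS.RDS.ManualSnapshotCreated"}  # rule IDs from the config map

sqs = boto3.client("sqs")
resp = sqs.receive_message(
    QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
)

for msg in resp.get("Messages", []):
    alert = json.loads(msg["Body"])
    rule_id = alert.get("ruleId")  # assumed field in the alert payload
    if rule_id in EXPECTED:
        print("validated:", rule_id)          # successful run: just log it
    else:
        print("unexpected alert:", rule_id)   # would go to the errors queue
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```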
(The screenshot is old, which is why it looks like we only have one test in the service.) Now we'll take a look at some of our other testing platforms. This is another example of a testing platform, on Okta: the code for an Okta test where an admin role is assigned to a user. When we aren't working primarily in AWS, we have resources defined outside this code repository to track. That introduces a minor level of complexity, but it's manageable, provided you track and label your resources. Similar to the AWS testing platform, we have the same testing components; however, in this code we trigger an Okta workflow via its invoke URL instead of performing an action on an infrastructure resource. This is a deliberate design choice: it abstracts away the high-level admin permissions required to perform the test, so we're not giving the service all-powerful access. A quick click-through shows the setup for the Dockerfile and task definition; you'll note it differs only slightly from the AWS files, in the types of packages installed and the secrets imported into the environment. And this is the actual workflow in Okta, with details redacted; it's what runs when we hit the endpoint. A user is added to an admin group, we wait 30 seconds, and then we remove the user from that group as part of the test's cleanup.
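As a sketch of what triggering such a workflow might look like with the requests library (the invoke URL, header name, and payload here are all assumptions, not our real values):

```python
# Hypothetical sketch: trigger the privilege-escalation workflow through
# its invoke URL rather than calling Okta's admin APIs directly.
import os
import requests

INVOKE_URL = "https://example.workflows.okta.com/api/flo/abc123/invoke"  # assumed
TOKEN = os.environ["SPITE_OKTA_INVOKE_TOKEN"]  # assumed secret name

resp = requests.post(
    INVOKE_URL,
    headers={"x-api-client-token": TOKEN},         # header name is an assumption
    json={"user": "spite-test-user@example.com"},  # illustrative payload
    timeout=30,
)
resp.raise_for_status()
print("workflow invoked:", resp.status_code)
```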
Another testing platform example is G Suite, one of the new platforms we're working on right now; this is proof-of-concept code, as we don't have a test fully onboarded yet. The main difference to note is that in this Dockerfile, instead of a very basic setup, we install GAMADV-XTD3, an extended version of GAM, a command-line tool for administering G Suite. What's cool about this setup is that we can install the tool in headless mode, bail on the interactive setup, and later import the credentials and settings we need to run the test. Again, apologies for this slide being very tiny; I'll break down the highlights for you. As I just mentioned, we bail on the setup and import the credentials. This is the shell script that actually runs in the container: the required OAuth files are stored as base64 in AWS Secrets Manager, then downloaded and saved at the expected locations at runtime to grant the service the permissions it needs to perform the test. These files are also deleted after the test completes. Once we're authenticated, we can administer the actions we require.
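The real bootstrap is a shell script; here's the same flow sketched in Python under stated assumptions (the secret IDs, file paths, and GAM command are all illustrative):

```python
# Hypothetical sketch of the credential bootstrap: pull base64-encoded
# OAuth files from Secrets Manager, write them where GAM expects them,
# run the test action, then delete the files.
import base64
import os
import subprocess
import boto3

secrets = boto3.client("secretsmanager")
GAM_DIR = os.path.expanduser("~/.gam")  # assumed config location
os.makedirs(GAM_DIR, exist_ok=True)

FILES = {"spite/gam/oauth2": "oauth2.txt"}  # assumed secret-to-file mapping
paths = []
for secret_id, filename in FILES.items():
    blob = secrets.get_secret_value(SecretId=secret_id)["SecretString"]
    path = os.path.join(GAM_DIR, filename)
    with open(path, "wb") as f:
        f.write(base64.b64decode(blob))
    paths.append(path)

try:
    # The action under test; the exact GAM command is illustrative, and
    # the gam binary is assumed to be on PATH inside the container.
    subprocess.run(["gam", "info", "domain"], check=True)
finally:
    for path in paths:  # cleanup: never leave credentials on disk
        os.remove(path)
```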
Just like in Smash, when you get spiked off the stage, we can recover gracefully. When there are fewer alerts than expected, something in testing failed: perhaps the test failed to run, or it ran and the detection didn't fire, or it fired and the alert was routed incorrectly. When there are more alerts than expected, it could be a strange coincidence in which someone is interacting with our testing resources in a way that trips our alarms, potentially indicating an adversary. If this happens while Spite is not active, the queue will still receive the alert; it's sent to the DLQ because nothing picked it up, and a Lambda processes it and generates an error report. The logging we have throughout the service and each test container, in addition to what we collect in our SIEM, has so far been sufficient to investigate these cases as they arise.
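A minimal sketch of what that DLQ processor could look like as a Lambda handler (the report format and alert fields are assumptions; the SQS-triggered event shape is standard):

```python
# Hypothetical sketch of the dead-letter-queue Lambda: any alert landing
# here arrived while Spite wasn't running, so every record becomes an
# error report rather than a validated test result.
import json

def handler(event, context):
    for record in event["Records"]:  # standard SQS-triggered event shape
        alert = json.loads(record["body"])
        # No test was active, so treat this as unexpected activity on a
        # testing resource and surface it for review.
        print(json.dumps({
            "report": "alert received outside an active test run",
            "rule_id": alert.get("ruleId"),  # assumed field
        }))
    return {"processed": len(event["Records"])}
```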
If you're at this point in the presentation thinking, "Wow, this is great, how can I duplicate this in my own workflow?", I would caution that this was custom-built for our internal detection stack. The assumptions we make about test validation and alert redirection hinge greatly on how our SIEM works and how extensible it is as a detections-as-code platform. Implementations for tests may also vary across stacks, but I do believe the idea of Spite is generic enough to be used widely; it's very much like integration testing in traditional software engineering. A potential issue I've been thinking about is that resources we don't maintain via AWS can be difficult to track, such as the Okta workflows, because they require working with an external tracker and knowing where these resources are referenced in the testing code. It could be helpful to introduce a dependencies file for each detection. Additionally, Spite has only been live for a little over half a year, so we're still playing catch-up with testing the detections we wrote previously. I anticipate needing some clever solutions to temper the effects of scaling tests into the 50-to-100 range; at that stage, we'd be introducing several more testing platforms, managing a lot of resources, and probably adjusting run times and resource
allocations. The monthly cost of testing also rises with the more resources we create for more tests. It's about $276 a month right now, with the majority of our tests skewing toward AWS, and as you can see, the cost of the EC2 instances is minimal; the highest cost is actually associated with standing up the resources to perform the tests on. So the test containers themselves aren't the expensive part, just having any infrastructure in AWS at all.

We haven't fully explored Spite's potential as a purple-teaming platform, mostly because we're still playing catch-up with the detections we'd already written, but it is intended to be used as a tool to drive detections work from that standpoint. If you think about it, all of these tests are containerized adversary emulation, because we're conducting exactly the behavior you'd expect an adversary to perform to trip one of our alerts. This is really cool because you can start with a test instead of starting with the detection. What would the behavior look like in your own environment? Well, you can prove out that technique with a tool and generate the log yourself in your production environment, mimicking attacker behavior from the beginning, instead of trying to search for a log or a blog post, or guessing at what kind of indicators of compromise
a certain behavior or pattern of behaviors might look like. There's a lot of potential for what you can containerize, too. Since the orchestration component of Spite doesn't need to know what's inside the containers it runs, we could run scanners and exploitation frameworks, and emulate the real tools and techniques adversaries use, in our production environments. Something we'd love to work up to is chaos testing: emulating adversarial action on live production resources at random, instead of on resources we stand up for the purpose of testing. Right now, all our tests are scheduled at the same time and take under an hour to complete, but with more complex tests we'd need to account for variable execution time. This is due in part to the nature of some tests, such as brute force or password spraying, that require an extended time span to detect. We'd also want to introduce more randomness into our tests so they can be even better signals; changing the timing of these tests would more closely mimic adversarial patterns and widen our testing capabilities. There's also a lot of potential for setting up new platforms, and one we're excited to introduce is endpoint emulation: something like setting up EC2 instances of Mac or Windows machines, installing CrowdStrike, Umbrella, and other endpoint telemetry so an instance acts like one of your own users, and then performing tests on those.
In summary, detection engineering is an ever-evolving process requiring constant evaluation to ensure fidelity, especially given that detections are often the last line of defense. We need to create environments where we can run controlled tests to prove the reliability of our systems. The need for continuous validation in fast-paced security environments exists because we are unaware of the drift that can occur as parts of the pipeline change over time. Detection engineering is not only an art in defining what we are looking for, but a science in how we look for indicators of compromise. Spite helps you write better detections by providing a platform for repeated adversary emulation, the simulation of real-world attacks, and the constant validation of your alerts. I hope to open-source the code for this platform someday, but it's currently still in development and not public. Thank you, everyone, for coming to my talk. I'd like to thank the BSides SLC organizers, the whole community, and the security team at Scale. If you'd like to connect with me, I'm on LinkedIn. Other than that, I'll have some time for questions. Thank you.
Yes.
Yeah. So, we don't look at our alerts in the SIEM; we look at them when they're surfaced in Jira. Instead of directing these testing alerts to that main queue, we send them to an SQS queue that the service itself ingests, and we don't create any tickets for ourselves unless there's an actual error. You can always go back in the logs and look at what happened during a specific run, but we won't get a notification every time a run has completed unless there's a reason to look into it. Yes.
So the question is, what would I do differently? I think I'd build this when I worked at Twitch, because everything in AWS was subsidized there, so it was basically free. Now that I have to think about how much AWS costs, it really sucks. Yeah.
Yeah, the question is how do I prioritize what kinds of behaviors we want to detect, and where do we get that kind of information? A multitude of sources. It all ties into the whole detection engineering program, the whole cycle: what attack surfaces are we seeing right now, in the industry and in our own environment? What are we prioritizing based on the needs of the business or of our team? Are we seeing any specific gaps? Is there anything in particular we're concerned about? For us right now, I think it would be cloud security. Yes.
Yeah, absolutely. The reason we didn't implement that straight away is that it's a little complex, though it's something we've definitely thought about. Having a signature in the test when you initiate it, saying "this exact log is the one we intended to create," is difficult, because how do you pass along a specific tag or identifier every single time you create an action? Not a lot of actions allow you to do that. But it's something we'd definitely want to do and look into. Right now, we have some edge-case catching around that concept. While the service is running, if we receive more alerts than we expect alongside the one we triggered intentionally, say, an adversary trying to hide behind these actions by performing the same one, that would produce a duplicate that we'd alert on. In addition, if the service is not running and an adversary uses any of these testing resources, that generates an alert that gets sent to the queue and then surfaced to us, because the queue is continually processed even when the service isn't running.
Yes.
Yeah. Like I said, I joined the team as one of the first detection engineering hires; it was very, very new. We were on ELK, which, if anyone's worked with it, they'll understand why something had to be done very quickly. We started onboarding to a new SIEM, and the opportunity was basically laid out for me to determine how detection engineering at Scale would look in its entirety. So, starting from zero: what would you do, the right way, based on the programs you've been in before or the experiences you've had? What bothered you, and what would you like to see, now that you have the field completely clear of technical debt? If you could come in and say, "this is how I'd like the program to work," what would you set up? This is one of the things I came up with, and there actually was quite a lot of pushback on the idea, because of the argument for prioritizing other, lower-hanging fruit, such as just setting up detections in general. To that I said: I don't think we want to incur additional tech debt by setting up so many detections that we're continually playing a game of catch-up. Something else we were trying to prioritize was proof-of-concepting and onboarding a SOAR, instead of just having our pretty basic webhooks platform, so another engineer took that on while I did this work. Another weird constraint that came up was that I was going on vacation: I had PTO scheduled for a month during the summer, so I had a month and a half to finish building everything. When it comes down to that kind of time, you just have to hunker down and do it, I think.
Yeah, thanks. I won't hold everyone too long. If anyone has other questions, I'll be hanging around to answer them. Otherwise, thanks so much for coming to this talk.