
Do you want to learn how to cut threat detection development and maintenance time by over 80%? I know that I do. Please welcome Raphael and Chen, who are going to tell us all about Detection Allegro: Composing Detections with Agentic Workflows. Please make them welcome.

Oh, thank you, everyone. It's a very big room with a giant screen, and the entrance over there makes me feel like I'm entering a boxing match or something like that. >> [laughter] >> So, let's see how it goes. Thank you, everyone, for joining. We're going to talk about Detection Allegro, and hopefully by the end of this talk everyone will know how to effortlessly create and maintain detection rules. My name is Chen, and I'm presenting together with my coworker, Raphael. We're both on the detection and response team at Verkada.

Here is an outline of the talk today. We'll introduce the background: the current situation of the detection and response team at Verkada, our logging pipeline, and so on. Then we'll go through a quick demo and discuss why we believe this is a good problem for AI to solve. Then we'll go through the architecture and the orchestration, and show some results at the end.

So, first, background. On the detection and response team at Verkada, we try to catch security threats within the company and prevent them from escalating into security incidents.
We respond to them quickly to minimize the business impact. We're a group of software engineers building detection and response capabilities and tooling that try to scale at the pace of the company's growth.

Here is the pipeline we built in-house. We don't pay millions of dollars to vendors for log analysis, log collection, and all of that automation; we build everything in-house. Before I dive into this, I want to give a huge shout-out to all the open source contributors behind Grove, Substation, StreamAlert, and Tracecat. I love your projects, and you make our lives much easier. We built the entire pipeline on top of these open source projects.
We use Grove for log pulling and gathering, and Substation for log normalization and enrichment. We use StreamAlert for log analysis. Then we have our own in-house alerting pipeline, and we use Tracecat for AI response and alert enrichment. It all works well. Our team is very familiar with the whole thing: we know where to go to make a change or add a new rule, and it's fantastic for us. We're processing hundreds of terabytes of logs every month, we have hundreds of rules running on this platform, and we keep the cost well under control. Until one day we got a new hire. His name is Raphael, and he's going to tell you about his experience after joining the team.

All right. So, I'm Raphael. This is me working, and this is me presenting here. My first task when I joined, given to me by Chen, was, "Okay, let's improve our GitHub coverage." That means we want better insight into whether anything malicious happens on GitHub. So I start thinking, and I have an idea: why not detect disabled branch protections? Because that should never happen. So I get going. I read the GitHub docs, because I need to know what data I have available. Then I read our actual logs to see whether the documentation is accurate. Then I start writing code. I realize I need an SQL query, so I write that. Then I write tests to see whether it actually works. I run the tests, and maybe I'm not as good as I thought I was, because they all fail. So let's start again. Review my code. Read the docs again. Go back to the logs. Back to my SQL query. Run the tests again. And this time all the tests pass. That's amazing, right? You know, it's me, I know how to code. Or so I think, until I realize that in production it doesn't really work.

Maybe some engineers quit here, but not me. I start thinking, and I think hard: what happened here? Let's review CloudWatch. Let's review the logs. Let's review the code. Let's review the docs. And at some point I start wondering where it all went wrong, because if you remember, at the start of the talk Chen promised you that this should be effortless, and this does not feel effortless at all. If we think through the scenario, after I had my original idea, that the detection should be disabled branch protections, everything else I did was super boring, right? And I don't want to do that. So let's take a look at a quick demo of how we solved that and made it effortless.

Here we have a Linear ticket for Kubernetes API logs. I just wrote down three detections and said, "Okay, we need to add more." Now I'm going to add a label here. The label says, "I want to generate some ideas." And now we get some feedback that the process has started. Yep, new ideas are being generated. Let's take a look at the workflow.
This is the detection idea generation workflow. We'll go through it. It's sped up a little, because nobody wants to watch time go by. It creates more Linear tickets for us, which we can see here, with all the detections the AI thinks we should create. You can go through and review which ones you think are actually reasonable. We're going to deep dive on one of them, so we'll click on one. Here's what the AI thought we should do. You can see it thinks this should be a scheduled detection that runs every 30 minutes, and it has some MITRE ATT&CK labels. Now we kick off the next workflow, also with a Linear label, and this is the detection PR generation. We'll head over to that workflow. It's also sped up. Okay, it gets generated. All great. Now we go back to the ticket, and here we see the PR linked to the ticket. So all I have to do now is head over there and review. You can see the AI put some reasoning into the comment. And we always generate both the tests and the actual detection. In this case there are three files. This is the test file, which is basically just a log. This is the actual Python file that runs the detection. And this is the SQL query, because a scheduled detection is split in two.

Okay. So that was pretty effortless. Let's dive into how exactly all of this works. But before I go into the actual architecture, I think it's always worth asking: why is this a good problem for AI? Should I use AI here? So let's go through that first.

First of all, this is a very clearly defined problem. The journey I went through at the beginning, you literally go through every single time. And we can verify that the code at the end is actually correct, because if you have a log of a disabled branch protection, you know for sure the alert should trigger. Now you might think, "Oh, this is a lot like an interview question," and a lot of companies don't like it if you use AI for your interview questions. So how is this different? In this case, if somebody on our team creates these detections every single day, they're never going to get any better as an engineer. They're not learning anything; there's nothing else you're trying to gain from the process. And lastly, this is just a text comprehension problem, which AI is really good at, right?

Okay, so now let's get into the details of the architecture. There are two distinct phases in that original walk-through. There's creating the detection, and then afterwards there's maintaining it: fixing it, allowlisting people, whatever. They're different types of problems, so we handle them separately, and we'll start with creation. We'll spend a little bit of time on this slide so you can get comfortable with it.

When you look at this diagram, the first thing I want you to take away is that on the left side there's creating the prompt, and then there's handing it off to a coding agent. That is the big idea here: we do a lot to build a prompt, and then let a coding agent actually do the work. So how do we build this prompt? We gather context from all the different information sources we have. In Notion we have a log source wiki that covers all of our log sources, what kind of information you can get from each, and so on. We get some log samples from our S3 bucket, and we get the log schema from Glue. Then we ask the AI, given all this and, obviously, the idea: can you name the detection? What is the severity? How often should it run? Those kinds of decisions. Once we have that, we make a separate call to get the MITRE ATT&CK framework labels. By now the AI has also decided whether this should be a scheduled detection or a streaming detection. The big difference is that scheduled detections run every 30 minutes, every hour, and so on; they analyze logs over a time period and usually use SQL. A streaming detection is literally just, "I have this one log and no other context. Was this activity malicious or not? Yes or no." They need different prompts, so they're handled separately here, with different code examples and a different prompt. So you make this decision at the beginning.

Now that you've gathered all this context, you can build one huge prompt and hand it off to your coding agent. The coding agent runs in a Lambda in this case. It has the repository and all of the tooling we have for that repository available, so it can lint, it can test, all those things. And it has a specialized Athena skill to make sure all the SQL queries work. Then the PR gets created, which is exactly what we saw at the very end of the demo. So that's how creation works. I want you to take away three things from this slide, so let's go over them.
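
The prompt-building step described above (wiki notes, log samples, and schema gathered up front, then a prompt that depends on whether the detection is scheduled or streaming) can be sketched roughly as follows. This is a minimal illustration, not the speakers' implementation: `DetectionContext`, `build_prompt`, the task wording, and the few-shot examples are all hypothetical names and placeholders, and the real pipeline fetches its inputs from Notion, S3, and Glue rather than taking them as arguments.

```python
import json
from dataclasses import dataclass

# Hypothetical task descriptions, one per detection type.
TASKS = {
    "streaming": "Write a Python rule that decides from a single log record whether to alert.",
    "scheduled": "Write an SQL query that runs on a schedule over a time window of logs.",
}

# Hypothetical few-shot examples, one per detection type.
FEW_SHOT_EXAMPLES = {
    "streaming": "def rule(record):\n    return record.get('action') == 'example.event'",
    "scheduled": "SELECT actor, COUNT(*) AS n FROM example_table GROUP BY actor HAVING COUNT(*) > 10",
}


@dataclass
class DetectionContext:
    idea: str            # the detection idea from the ticket
    wiki_notes: str      # notes about the log source (assumed pre-fetched)
    log_samples: list    # recent sample records (assumed pre-fetched)
    schema: dict         # column name -> type (assumed pre-fetched)
    detection_type: str  # "streaming" or "scheduled"


def build_prompt(ctx: DetectionContext) -> str:
    """Assemble one prompt from pre-gathered context, varying by detection type."""
    if ctx.detection_type not in FEW_SHOT_EXAMPLES:
        raise ValueError(f"unknown detection type: {ctx.detection_type!r}")
    sections = [
        ("Task", TASKS[ctx.detection_type]),
        ("Detection idea", ctx.idea),
        ("Log source notes", ctx.wiki_notes),
        ("Log schema", "\n".join(f"{name}: {typ}" for name, typ in ctx.schema.items())),
        ("Recent log samples", "\n".join(json.dumps(s, sort_keys=True) for s in ctx.log_samples)),
        ("Example detection", FEW_SHOT_EXAMPLES[ctx.detection_type]),
    ]
    return "\n\n".join(f"## {title}\n{body}" for title, body in sections)


example = DetectionContext(
    idea="Detect disabled branch protections",
    wiki_notes="GitHub audit log, one record per event.",
    log_samples=[{"action": "protected_branch.destroy", "actor": "someone"}],
    schema={"action": "string", "actor": "string"},
    detection_type="scheduled",
)
prompt = build_prompt(example)
```

The point of the sketch is the design choice from the talk: the context is fetched deterministically before the agent runs, so the agent starts with everything it needs instead of having to discover it.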
Of those three takeaways, the first is that here we did what I would call manual context management, because we realized that every time you run this workflow you need the same kind of information: what log sources we have available, what the log samples look like, and so on. So we can just fetch it for the agent right away. We don't need to ask the agent to get it itself, because we think we can do that initial part better in this case. Second, if you ever do need a normal LLM call, just use the standard techniques, like chain of thought and few-shot prompting. And lastly, all of this runs on some sort of workflow scheduler. I don't think exactly which one matters that much, as long as you have durable execution or something like that. All right. And then there's the second part of this whole pipeline.

Yes. As Raphael explained, when we build new detections we follow a certain procedure and thought process. But there's another type of issue: very small, straightforward bug fixes. It can be anything. It can be, "We need to add a new user to the allowlist." It can be, "There's a bug in the detection rule written in Python, and we need to fix it." It can also be that the upstream log source got a new field, or the schema changed. These are all very repetitive work. Let's use the example Raphael used earlier. We have a branch protection rule detection, and it works fine until one day a dev tools team, building a new workflow, needs to disable branch protection again and again. We don't want to be alerted again and again, because that causes alert fatigue and wastes time. What we usually do is add either the username or the repo being used for testing to the allowlist. In the original workflow, we need to locate the code, make the change, create a test case, create a PR, and wait for approval. Once it's approved, we merge the PR and wait for it to be deployed. It's not hard. It's very straightforward. But it's also very repetitive, and nobody learns anything from it. I'm not becoming a better engineer after adding five people to an allowlist, and I've never heard anybody get super excited and say, "If I don't add these five repos to the allowlist, I can't sleep tonight. That's what I live for. I just want to add one more person to the allowlist." I've never heard a comment like that. That's why we believe it's a good problem for AI to solve: AI is always excited, no matter what you give it.

To solve this, we bundle the coding agent and our detection rule repository into an image, put it on ECR, and run it through a Lambda function, which can be triggered from the upstream Linear tickets. That makes the workflow much easier. Basically, somebody just needs to say, "Add this to the allowlist."
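
To make the allowlist example concrete, here is a rough sketch of what a streaming rule with an actor and repo allowlist might look like, together with the kind of known-log checks a generated test file would contain. Everything here is an assumption for illustration: the function name, the allowlist contents, and the field names (`action`, `actor`, `repo`) are hypothetical, and the `protected_branch.destroy` action name should be checked against your own GitHub audit logs.

```python
# Hypothetical allowlists; an "add this to the allowlist" request would
# amount to a one-line change here plus a matching test case.
ALLOWED_ACTORS = {"devtools-bot"}
ALLOWED_REPOS = {"example-org/branch-protection-sandbox"}


def branch_protection_disabled(record: dict) -> bool:
    """Return True when a single audit log record should raise an alert."""
    # Assumed action name for a removed branch protection rule.
    if record.get("action") != "protected_branch.destroy":
        return False
    # Known, approved sources of this activity do not alert.
    if record.get("actor") in ALLOWED_ACTORS:
        return False
    if record.get("repo") in ALLOWED_REPOS:
        return False
    return True


# Sample records standing in for the log fixtures in a test file.
malicious = {"action": "protected_branch.destroy", "actor": "mallory", "repo": "example-org/prod"}
allowed_actor = {"action": "protected_branch.destroy", "actor": "devtools-bot", "repo": "example-org/prod"}
allowed_repo = {"action": "protected_branch.destroy", "actor": "mallory",
                "repo": "example-org/branch-protection-sandbox"}
unrelated = {"action": "repo.create", "actor": "mallory", "repo": "example-org/new"}

results = [branch_protection_disabled(r) for r in (malicious, allowed_actor, allowed_repo, unrelated)]
```

Because a log with a known outcome either triggers the rule or does not, a change like this is easy to verify mechanically, which is part of why the talk treats it as a good fit for a coding agent.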
Once the request is in, the PR is made, you just merge it, and it's applied.

Now, the last piece of the picture is idea generation. We've spent a lot of time talking about how to build detections given a task. But how do we come up with ideas about what should be alerted on? I'd like to break this into two parts. The first part is the common SaaS applications, for example Okta, GitHub, or CloudTrail. For these you can easily find a lot of public research on the internet: GitHub repos, blog posts. All you need to do is pass those to the AI and ask it to generate what should be alerted on. It's very straightforward, and then we trigger the downstream workflow to create PRs. But there's another part: at Verkada we have a lot of internal tooling. We have our customer support tooling. We have millions of IoT devices all over the world. Nobody is going to tell us what should be alerted on for those. In the past, we always depended on security engineers holding partner meetings with their product teams. They got ideas through threat modeling and threat research, and they would create tickets and assign them to us. Then we would find the resources to build those detections and maintain them.
But now, in the new workflow, they can just write up the idea and trigger the workflow, without understanding the entire codebase or what D&R is. They can get the detection set up and maintain it easily themselves.

The last piece is: once we bring a new log source into the pipeline, how can we quickly gather ideas for what we should alert on? To achieve that, we built a vector DB that pulls in the company's risk registry, the log inventory with sample logs, and the MITRE ATT&CK detection strategies. Then we basically just tell the AI, "Hey, this is the new log source we're going to onboard, and this is what the tooling does. Please let me know what should be alerted on." That's what we saw earlier in the demo. The AI goes to the knowledge base, searches all of these, and writes up some reasoning about why we should alert on each one. Then we can take those ideas and trigger the downstream workflow to generate the detection code automatically.

All right. We've been iterating on this pipeline for many months now, so I want to share some of the pitfalls we ran into and what we learned. Before we switched to a coding agent, we just had a sequence of LLM calls, basically. But now that coding agents are really good, having something that can actually lint, format, and run the tests has really improved the output. So that has been great. Then, specifically for detection generation: maybe one day somebody at GitHub wakes up and decides, "Okay, we should change where the email field is," and you need to adapt to that. So we always fetch the latest logs and add them to our prompt, so that we have up-to-date information about what's happening. And lastly, AI is actually pretty bad at getting capitalization right. If you write a scheduled query, Athena requires everything to be lowercase, but a lot of logs come in camel case or something else. So how do you get the AI to work through that? We built an Athena query skill specifically for this. The agent can call the skill, and it runs this little loop here. First it creates an SQL query from the same inputs we went through earlier: our database, the log schema, some logs. Then it runs the query through our own linters and public linters. And then it actually tests the query against Athena. If it works, we're done. If it doesn't, it tries again. This has really helped fix the capitalization issue and made the SQL queries a lot more accurate.

Okay, the last piece. We've gone through the architecture and how it all works, but what really brings it together for our team is the orchestration through our ticketing platform. If you look at this diagram, we've gone through all the pieces already: the ticket origin; the idea generation, which Chen just went through; the bug fix Lambda; and the agentic workflow that creates new detections. The only really manual work left is reviewing these Linear tickets and then reviewing the PR at the end. If you agree with the detection, you add a label, it goes to a webhook and gets routed to the corresponding workflow, and then you just have the PR at the end.

So let's go through why this has been so good for us. First of all, you don't need a local dev environment anymore. It's a classic benefit, so I don't want to spend too much time on it. Second, all of the information is stored in one place. Linear is now the source of truth for all these detections, and you can build nice metrics around that. And lastly, and this is the one I really want to focus on, it's a lot easier for other teams to contribute. We have a lot of really smart engineers who are not on the detection and response team. Some of them are in the crowd. They don't understand how our pipeline works, but they still want to contribute their ideas. Now all they have to do is write it up in a Linear ticket and kick off the workflow, and then our team just has to approve. We've onboarded a lot of other teams onto the system, and it's been really good at creating new detections and keeping old detections accurate and maintained. I think that has been the biggest benefit of this whole "let's keep everything as Linear tickets" pipeline.

Lastly, let's go through the results. The original flow I went through used to take me about two hours; now I can do it, with proper review, in maybe 20 minutes. We've used this to create over 150 new detections, and the bug-fixing Lambda that Chen mentioned we use every week, basically. So it's been great. And if you love security and have a strong software engineering background, we're always hiring at Verkada.
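
The Athena query skill loop described earlier in the talk (generate a query, lint it, run it against the engine, retry with feedback on failure) can be sketched along these lines. This is a hedged illustration only: `refine_query` and `lint_column_case` are hypothetical names, the generator and executor are offline stand-ins for the LLM call and the real Athena run, and the lint is a deliberately naive check for column names that are not lowercase.

```python
import re
from typing import Callable, List, Optional, Set

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def lint_column_case(sql: str, columns: Set[str]) -> List[str]:
    """Flag identifiers that match a known lowercase column only after lowercasing."""
    problems = []
    for token in IDENTIFIER.findall(sql):
        if token != token.lower() and token.lower() in columns:
            problems.append(f"use '{token.lower()}' instead of '{token}'")
    return problems


def refine_query(
    generate: Callable[[Optional[str]], str],     # feedback -> SQL text
    lint: Callable[[str], List[str]],             # SQL text -> list of problems
    execute: Callable[[str], Optional[str]],      # SQL text -> error message or None
    max_attempts: int = 3,
) -> str:
    """Regenerate a query until it passes lint and runs without error."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        sql = generate(feedback)
        problems = lint(sql)
        if problems:
            feedback = "lint: " + "; ".join(problems)
            continue
        error = execute(sql)
        if error is None:
            return sql
        feedback = "engine: " + error
    raise RuntimeError(f"no working query after {max_attempts} attempts ({feedback})")


# Offline stand-ins so the sketch runs without an LLM or a query engine.
def fake_generate(feedback: Optional[str]) -> str:
    if feedback is None:
        return "SELECT actorLogin FROM github_audit WHERE action = 'protected_branch.destroy'"
    return "SELECT actorlogin FROM github_audit WHERE action = 'protected_branch.destroy'"


def fake_execute(sql: str) -> Optional[str]:
    return None  # pretend the engine accepted the query


query = refine_query(
    fake_generate,
    lambda sql: lint_column_case(sql, {"actorlogin", "action"}),
    fake_execute,
)
```

Feeding the lint or engine error back into the next generation attempt is the part that matters: the model gets a concrete reason for the failure instead of being asked to guess again.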
Oh. Okay, now it's questions.
Thanks. I guess, knowing the setup of the infrastructure now, how long do you think it would take you to build this initial system? And how long do you think it would take you to test?

If I had to re-implement it now?

Yes, already knowing how you would build it, just taking these bits and putting them together.

Yeah. So I would say it's not like we built everything at once, in one shot, and waited until everything was built to start using it. It's more like we build a small piece and see if it works. For example, with bundling Claude Code into the image: when Claude Code was just released, I think that was May-ish, we built that in early June, very early, when not a lot of people even knew what Claude Code was. We checked whether it worked, saw that it did, and then put in more effort to keep building on it. So if we were building everything from scratch, knowing what we know now, we could probably finish within one quarter, I would say, with one engineer. But it's never just build it and let it go. You collect feedback, you see how it works, and you keep improving it. That's how we do it.
There are a lot of questions here.

Yeah, in my experience, you build something and then you realize, "Oh, what I built is kind of out of date again, so let me take out this middle piece and replace it with what people are doing now." That's what I mean when I say we've been iterating on it.

Okay, the most upvoted question: "How do you validate the detections are correct, where correct is roughly equal to what a skilled expert would write?" In this case, I think the easiest case is when you have a log that you know contains a malicious event; then it's always very easy to verify. If you don't have that, and you're coming up with something completely new and trying to identify that, the code review phase does not magically go away. You need to actually go through what the agent wrote.

I would also like to add that we're building a detection evaluation framework, basically to answer questions like: if we change the prompts, are the results getting better or worse? It's still at an early stage. We noticed the same issue: how do we tell whether a detection is good enough to be deployed right away? So we're still early in rolling this out. And the engineer working on this is actually sitting here as well. We use AI to answer whether the detection is good enough and whether it matches what it's expected to do. That's what we're building. We didn't mention it in the talk because it's still very early.

The next one is very similar. The question is, "How do you know that it's pulling the right event type/log fields if public documentation is poor and the logs have never been generated in your environment?" I would say the answer is going to be very similar. >> [laughter] >> Ideally, you feed logs through your testing pipeline to see whether the tests are correct. And then we do the same thing of evaluating our detections with AI. There is always a little bit of this problem: maybe you will never actually see one specific field in a log, because it only gets added when something really bad happens. In that case, if the documentation doesn't have it and the logs never have it, it's going to be hard for anybody to figure out.
Okay, next question: "Do you build composite detections, threat models with the agentic workflows, or just atomics, behavior signals?" If I understand correctly, you're asking whether we correlate events. Is that the question? Okay. This workflow does not currently correlate multiple events. What we do is enrich the context of the alert at the end. Let me think of an example. Let's stick with the branch-protection-disabled theme. Say that triggers. We're building a system that takes this alert and checks our Slack history and so on to figure out whether anyone has talked about doing this somewhere. Is this something we know is happening? So we do alert triaging with AI as well. But the correlation you meant, we currently don't do with this AI system.

Yeah, but we do use Substation, which does log enrichment in the pipeline. So we don't just see, "Okay, somebody disabled a branch protection rule." We also see who they are, which department they're in, and what they do. If, say, someone in marketing disabled branch protection, that's probably very suspicious. But if an engineer on the dev tools team disabled a branch protection rule, that's likely legit. I don't know if that covers the correlation part, but that's how we correlate with other information in our data center to build detections.

All right. Thank you, everyone, for all those fantastic questions, and thank you to our presenters for this fabulous talk. Let's give them a round of applause. >> [applause]