
Um, so hi, I'm Leo, from Graphistry. You are hopefully in the right room: we're talking about breaking Splunk's Boss of the SOC, using AI to speedrun CTFs. So, first of all, thank you for the patience as we got set up, and thank you for inviting me. I'll be the one talking, but in reality this type of work takes a team, so I want to call out other folks like Alex Maurice, Alex Warren, Tanoy, Synindra, and Thomas Cook. You'll see them on slides later; this work takes a few people over time.

While we get set up: I'm guessing most people here haven't necessarily heard of Graphistry or our team, but there's a good chance you've used technology we've built. We've done a lot of open source and free work over the last decade. For example, you might be familiar with Apache Arrow or NVIDIA RAPIDS, which make it a lot easier to do things on GPUs nowadays; we helped a lot of that happen. Oh my gosh, this is so much better now. Thank you. >> Yeah, but you'll have to sit over here because I don't have my clicker.
>> Awesome. Thank you. Thank you, Cat. >> Here, I don't trust people with my laptop. >> Great. And I don't know if I can get rid of this thing; it's okay. All right, so I'll park over here, but just really briefly, do check out some of that stuff. I also want to call out that we run a free, one-day annual AI security event. If you like this kind of material, it has a lot of great talks. We do it both virtually and, if you're at RSA, as a full in-person day. It's a lot of fun, I highly encourage it, and this year we're adding trainings.

Actually, quick question: who here has ever done a CTF? Red team, blue team? All right, cool, we're in the right room. Thank you.

So something funny started happening in the last year or so, which is this weird pro-human, anti-AI thing going on. I understand it for many reasons, but one of the funnier ones (and this is really a point about empathy) is to imagine you're the organizer, trying to figure out how to run a good community event. I originally wanted to show a live version of this, but in the interest of time I won't. What you see here is from the beginning of last year: we published a little video that went viral showing us taking Splunk's Boss of the SOC and basically solving three-quarters of it. All that's going on in it is that we took each question, applied one little generic prompt that we reused across all of them, and let the AI run. It wasn't just ChatGPT answering a question; it was a modern agentic setup that actually called tools and worked through the problem, and the video showed it running not just one question but all the questions in parallel. If you're organizing a CTF and somebody shows up with one of these, that's frustrating; you're left asking, what are we doing here? The videos are online if you want to find them. To be honest about the timeline: we published it last year, but it was really the work of about two years ago. It solved about three-quarters of the competition, so instead of a team spending an entire day, the AI spent about two hours, and each question, even the hard ones, took just two or three minutes.

But that's actually not what this talk is about, because that's the state of a couple of years ago, and honestly that was a C-minus given where we are now. What I really want to talk about today is a few things. First: why are we doing this? Why attack these competitions? The broad goal is that, as defenders, we want to bring AI into our practice and have AIs investigate. Think of it at a communal level: imagine everybody had Bruce Schneier in their pocket. The next time a relative is getting phished, what if Bruce Schneier were helping out? That would be pretty amazing. To get there, the AI needs somewhere to learn: an AI gym. These CTFs are actually pretty incredible for that, because some of the smartest security people in the world have spent years tuning them, and that is a perfect learning environment. (Oops, maybe I did not want to press that one. Okay.) So these AI gyms are very useful if you're an AI person. But at a more human level, we like to play, and play is how you learn. It's a safe place where you can run as fast as you can and see what works and what doesn't. So as individuals (and this will be a big focus of the first half of the talk) we need places to play, and we need everyone to be able to play. That was true of CTFs before AI, and with AI I think it's even more true. And my background is actually more that of a scientist; as scientists and engineers we also like to play, just in slightly different ways.
We'll talk about all of that. So here's what I want to go through today. The first half is a bit about benchmarking and AI evals, and then a bit about how, honestly, after you have fun today and tomorrow, you could maybe spend Sunday trying to recreate some of this. Then the last bit is for people doing this a little more professionally: how do we get an A-plus, and where is this going? (I'm not going to press the wrong button this time. I'm learning too.)

Okay, so, a bit of framing: you'll see AI people talk about benchmarks and evals a lot. First, I like this quote from Lord Kelvin, of Kelvin-scale fame: if you cannot measure it, you cannot improve it. As we try to work through these things methodically, we need a way to make steady progress. If you've seen a lot of vibe-coded projects, they're exciting in the vibes phase, and that is how you start; but once you get things rolling, you do need to switch to something more engineered and methodical. You need a way to measure AI investigations. Then there's a dark side when you flip it: you get what you measure. That's Goodhart's law. A nicer way to put it is that benchmarks encode communal values. When someone puts out a benchmark saying "this is what we think is important," the various teams can rally around it: this is the thing to beat, the system to hack, the problem to solve. People who instead come up with their own private benchmarks are in slightly dangerous territory, because all of a sudden they're writing their own tests and grading their own homework. As an industry, one of the things I love about CTFs is that we're not defining the benchmark; these are community-defined benchmarks, and I really love that. And then (I don't know if it comes through on the slide, but there's a gentleman, maybe a little less famous, holding up two lobsters on a boat, Cinder Ba) he said this in more of a detection engineering context, but internally it's how we think: if you can't reproduce it, did it even happen? What does it mean to have a real result? Internally we have a less pretty version of this, "eval or GTFO," and increasingly, when I see results, that's how we talk about them.
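To make "eval or GTFO" concrete, here is a minimal, hedged sketch in Python (the question IDs and answers are invented for illustration; this is not our actual harness): grade an agent's CTF answers against a gold key, and compute how per-step accuracy compounds across a multi-step investigation.

```python
# Hedged sketch of the eval mindset: grade answers against a gold key, and
# see how per-step accuracy compounds across a multi-step investigation.
# Question IDs and answer values below are made up for illustration.

def score(answers: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of gold questions the agent answered exactly right."""
    hits = sum(answers.get(q, "").strip().lower() == a.strip().lower()
               for q, a in gold.items())
    return hits / len(gold)

def compound_success(per_step: float, steps: int = 10) -> float:
    """P(all steps right) if each of `steps` findings must independently hold."""
    return per_step ** steps

gold = {"q1": "splunk", "q2": "40.80.148.42"}
answers = {"q1": "Splunk", "q2": "10.0.0.1"}

print(score(answers, gold))              # 0.5 (one of two correct)
print(round(compound_success(0.70), 3))  # 0.028: a 70% agent wins ~3% of cases
print(round(compound_success(0.95), 3))  # 0.599: even 95% per step is unreliable
```

The compounding function is the whole argument for obsessive measurement: a per-step accuracy that sounds fine produces a case-level reliability that isn't.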
And just to give you a simplified model of how important this is: let's say you use an AI coding tool, do the stuff I talk about in the first half, and get a C-minus. I think that's great; that's where you want to start. But if you then want to use these for real and put them into production, think about what a C-minus on every task means. In a real investigation, say ten things have to go right. Order doesn't matter, but we have to be sure of all ten facts; we can't miss any of them. If my AI is a coin with a 70% hit rate and I need all ten right, I have only about a 3% chance of the investigation working. Imagine that fully automated: who's going to use an automation like that? It's terrible. Now say I improve my coin. I keep putting in effort to close the distance to 100%: 85%, 90%, 95%, the same amount of effort each time, and it gets harder and harder to make less and less progress. At 95% per step I'm still at only about 59% on the ten things holding my case together, which is still an unreliable tool. So I keep going: 98%, 99.5%, 99.75% (I think I dropped a number somewhere on the slide), and only then am I finally near that 99% confidence. And that's just for ten steps. A lot of the second-half focus will be on this, and part of the point is: a C-minus isn't bad. It just means we're at the beginning, not the end.

Maybe one more point, since we are humans in the audience; I don't think we have any AIs here. This also tells you why human-in-the-loop is so important. Say an AI takes ten steps at 99%, or 90%, or 50%, wherever it's at. With a human in the loop, it can take another ten steps, and another ten, and maybe we get to a hundred steps or a thousand. So human-in-the-loop, what I call vibes investigating, is something I'm actually very bullish on for making deeper progress.

I didn't go too deep into the opening example, just because of time and the setup, but here's why I like Boss of the SOC as the CTF specifically, and I encourage you, if you want to start, to start here. A few things. One: it's free and open, and as a scientist, you can't have closed gates and call something a community thing. Second: the example at the beginning (you'll see it in the slides afterwards) was a cloud incident, some S3/instance-startup thing, and underneath it required scoping: what happened, what was the timeline. All of those things take time and are hard, so even though the individual questions look small, as a characterization of what actual incident response does at around a tier-2 level, it's pretty good stuff. And it's not just cloud logs: they run the same incident across multiple log types. You'll have the cloud version, but the person doing it will also have Windows security logs on their device, so you get overlapping, multi-source datasets. Hats off to the Boss of the SOC team; they did a really good job.

Now, the rules of the game. It's basically the CTF, but we change things to make them a little more interesting, and I want to make two points. One: we're not going to do the OpenAI trick. It's not like the Math Olympiad, where you spend $10,000 to answer two questions. I can't spend $10,000 and five minutes on every single ticket that comes into an inbox; that just doesn't work. So for this we impose a time budget. When we did it last year it was five minutes; I should really update the slide to say three minutes, which is the more interesting number for us nowadays. I'd also like to switch from time to dollars: what's the token burn? The second thing, which was kind of fun, is contamination. We know the AI model companies are basically training on the internet, as much data as they can get, and every new model has more data. So one of the first things we do when we try a new model is run it on the CTF without connecting it to the database at all: how does it do with only what it already knows, just guessing answers? And it doesn't get a zero. None of them get a zero, which is kind of funny. We looked into it: a few of the questions are really just knowledge-base questions. The first one is a freebie, "what is the name of this tool?"; oh, it's probably Splunk, that's common knowledge. Some of it comes from how running an AI CTF differs from a human CTF. In a human CTF you have a list of questions and the team works through it like a puzzle hunt. In an AI CTF, going one question at a time is really boring; we want to run all of them. So we do run all of them, but some questions are path-dependent, where one question unlocks the next, so we inline some of those answers. What we hadn't realized is that sometimes the inlined answer to the previous question makes the next one fully answerable without doing anything. So there's a bit of that too. What this told us is not that the AI companies are trying to cheat; these are still valid CTFs, but you have to do the work to be sure, and you have to redo it for every new model. Okay, first result. We're going to do the simplest and dumbest thing possible.
Imagine (I'm actually showing it here; it's probably invisible with the red on black on this screen) that at the bottom we're just calling Claude Code and connecting it to Splunk. No MCPs or anything; we're going full caveman: you have the CLI, figure it out, here are your credentials. And it actually does pretty well. As a team that builds tools here, it did depressingly well; it was a fail, but a really impressive failure. Opus 4.5 (we ran this a couple of months ago; 4.6 isn't enabled and fully available for us yet, so we couldn't run it) got 56%. That's still below where we were a year ago, but with no prompt engineering: what you saw was the entire prompt. I thought that was ridiculous, and amazing in an amazing way. Out of curiosity we also ran Sonnet 4.5, which is much cheaper. It's still failing, at about 30%, but that's a very interesting number for two reasons. One: it means we can do prompt engineering to get more mileage out of it, and for the many cheap tickets where I don't want to spend a really pricey model, there's a lot of good we can get out of these. Two: does anyone here run their own models? We've got a few people. The top open-source models are roughly where Sonnet is, so take the Sonnet raw score, add your prompt engineering, and you can probably get a C-minus. If you're into that, let's talk after. To me that was also very encouraging.

So let's talk about prompting. Some of what we did in the very first version, like plans with to-do lists, you don't even have to prompt for anymore; the AI does it now, so a big chunk of our initial prompt just went away. To-do lists: check them out, you'll see them happen automatically. But we found two additions around planning that were pretty impactful: methodology and validation. This is a good example of where AI investigation is very different from AI coding. In AI coding you might have a specification (what it means to be good software, here's your feature list) plus unit tests to make sure it does the right thing, and then you go into Ralph mode: if the tests fail, you just ask it to run again, and keep running and running until all the unit tests pass. In security, we do not have unit tests. We often don't even know whether something is a real hack; sometimes you'll never know. You're not going to go full FBI on it. So we do two things. First, methodology: there are a lot of different investigative methodologies, and I'm not going to get religious about it; two popular ones are OODA and OSCAR. A crazy thing with LLMs is that I don't have to explain what those are. The model already knows, so you just tell it to use them, and a lot of the important investigative structure takes care of itself. You can tweak it more, but just starting there works, and again, the full prompt fits on a slide. That part is intuitive: we need an investigative methodology because this isn't coding. The other part, which was less obvious to me, is that we need cross-validation. The simple version: say the LLM does badly and you want it to do better. You could just intuitively ask it, "hey, be more certain," and maybe it figures out what being more certain means and does it. That's a good intuition, and cross-validation is a formalization of it. For example, in a real incident, if something happened in one event in one log source and it's real, there should be an overlapping, alternate log source that corroborates that behavior at the same time; or maybe it's in that same initial source, but phrased a little differently. Cross-validation is a way to force that in and help fight hallucination. You can see the prompt in the bottom right; it's very short, but when you see a 10 or 20 percent lift from something that simple, those are the kinds of wins you get at the beginning, in this C-minus phase. I thought that was really cool. I'll be sharing all these prompts. I don't know if the slides go public, but I'll post the prompts on Twitter and LinkedIn; they fit on slides, and I'll try to wrap them up.

Okay. So this kind of stuff gets you to a C-minus fast. A year ago it would have been really hard, and now you can just try it yourself. But from here we have to start working for our gains. One of my favorite quotes is from the beginning of Anna Karenina: all happy families are alike; each unhappy family is unhappy in its own way. If you deal with data and databases, this should feel a little familiar. Yes, it's SQL; yes, it's Splunk; yes, it's Kusto. But every team has different tables, and when it's logs, what's in those tables is also very different, and this is really hard on the AI.
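The cross-validation idea from a moment ago can be sketched in a few lines. This is a hedged illustration, not our production prompt or tooling; the event shape and field names are invented:

```python
from datetime import datetime

# Hedged sketch of cross-validation: only keep a finding if an independent
# log source shows corroborating activity for the same host inside a small
# time window. The event shape ({"host", "time"}) is invented for illustration.

def corroborated(finding, other_source, window_s=300):
    """True if `other_source` has an event for the same host within ±window_s."""
    t = finding["time"]
    return any(
        ev["host"] == finding["host"]
        and abs((ev["time"] - t).total_seconds()) <= window_s
        for ev in other_source
    )

cloud_finding = {"host": "web-01", "time": datetime(2024, 8, 1, 12, 0, 0)}
windows_logs = [
    {"host": "web-01", "time": datetime(2024, 8, 1, 12, 3, 0)},   # corroborates
    {"host": "db-02",  "time": datetime(2024, 8, 1, 12, 0, 30)},  # wrong host
]

print(corroborated(cloud_finding, windows_logs))  # True: same host, within 5 min
```

In the prompt version, the model is asked to apply exactly this discipline itself: no finding counts unless a second, overlapping source backs it up.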
The first hope is that you go to a vendor and they just solve it for you. We're on Splunk for Boss of the SOC, but everyone, Kusto and the rest, has their own version, and generally what you get is something I call a thin MCP. An MCP, if you're not familiar, is basically a wrapper around things like REST APIs so that the AI can more easily talk to whatever the tool is; in this case, a database. A thin MCP, to me, is just that wrapper: it only tells you how to call things. We do actually see a benefit from MCPs just from going from curl to MCP, and the intuition is a little funny: when the AI has to figure out curl, it has to work out the right song and dance to call the underlying REST API correctly, and we burn a lot of AI time on that. With the MCP you can skip the ceremony, and because we're in a cost- and time-gated activity, that's a big deal. So an MCP helps, but at the same time, Splunk's, like most of these, is a thin MCP. From an AI perspective, it doesn't really give us a map of how to work with the data. So what would a map mean?
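As one hedged sketch of what assembling such a map could look like: Splunk can enumerate sourcetypes with SPL like `| metadata type=sourcetypes` and summarize fields with `| fieldsummary`, and a small pure function can fold those results into a data dictionary the agent reads before querying. The fetching side is elided here since endpoint and auth details vary; the row shape and all names below are invented.

```python
# Hedged sketch: fold per-sourcetype field summaries (e.g. rows you might
# derive from Splunk's `| fieldsummary`; the shape here is invented) into a
# small data map the agent can read before it starts querying.

def build_data_map(field_rows):
    """rows: [{"sourcetype": ..., "field": ..., "sample": ...}, ...] -> map."""
    data_map = {}
    for row in field_rows:
        st = data_map.setdefault(row["sourcetype"], {})
        st[row["field"]] = {"sample": row["sample"]}
    # Invert: which sourcetypes share a field name? Those are pivot candidates.
    pivots = {}
    for st_name, fields in data_map.items():
        for f in fields:
            pivots.setdefault(f, []).append(st_name)
    return {"sourcetypes": data_map,
            "pivot_fields": {f: sts for f, sts in pivots.items() if len(sts) > 1}}

rows = [
    {"sourcetype": "aws:cloudtrail", "field": "src_ip", "sample": "1.2.3.4"},
    {"sourcetype": "WinEventLog",    "field": "src_ip", "sample": "1.2.3.4"},
    {"sourcetype": "WinEventLog",    "field": "EventCode", "sample": "4624"},
]
m = build_data_map(rows)
print(sorted(m["pivot_fields"]))  # ['src_ip']: a cross-source pivot field
```

The `pivot_fields` output is the part a thin MCP never gives you: which columns let you jump between sources.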
Well, I want to know things like: Splunk will tell you which indexes exist, but it won't tell you which log sources or which tables are in each index. For each log source, it won't tell you which columns are in there. And for each column, say there's a column like host: host can mean so many different things across so many different sources. How do you jump across and pivot across? It doesn't give you a map or a data dictionary or anything like that, and this can go very deep. Adding that map helps. Then one more thing. Does anyone here vibe code? Okay, awesome. When you run a loop, do you have unit tests in it? And does anybody have something in the loop that's not a unit test but a performance benchmark? No hands. Okay, so we had some hands for tests but no hands for benchmarks. In our world, if we're trying to actually go up the leaderboard, the thing we're fixing isn't just our investigator (the Python code for all of our custom tools) but also our prompts and our skills, and all of those are evaluated against our eval sets. So when we do AI coding loops, we're not trying to pass the unit tests. We're trying to do an error analysis of what went wrong. Are we wasting tokens? Are we getting questions wrong?
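Here is a hedged sketch of what that in-loop error analysis can look like: aggregate each run's outcome and token spend instead of reducing everything to a pass/fail bit. The run-record shape, budget, and values are invented for illustration.

```python
from collections import Counter

# Hedged sketch: instead of a pass/fail unit-test bit, the coding loop gets an
# error analysis: what failed, and what it cost. Run-record shape is invented.

def error_analysis(runs, token_budget=50_000):
    """Summarize eval runs: accuracy, failure modes, and total token burn."""
    modes = Counter()
    tokens = 0
    for r in runs:
        tokens += r["tokens"]
        if r["answer"] != r["gold"]:
            modes["wrong_answer"] += 1
        elif r["tokens"] > token_budget:
            modes["over_budget"] += 1   # right answer, but too expensive
        else:
            modes["ok"] += 1
    return {"accuracy": sum(r["answer"] == r["gold"] for r in runs) / len(runs),
            "modes": dict(modes), "total_tokens": tokens}

runs = [
    {"answer": "40.80.148.42", "gold": "40.80.148.42", "tokens": 12_000},
    {"answer": "10.0.0.1",     "gold": "40.80.148.42", "tokens": 9_000},
    {"answer": "40.80.148.42", "gold": "40.80.148.42", "tokens": 80_000},
]
print(error_analysis(runs))
# accuracy ≈ 0.67; modes: 1 ok, 1 wrong_answer, 1 over_budget
```

Feeding a summary like this back into the loop is what turns "run until green" into "figure out which failure mode to attack next."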
And so we're building out and optimizing our eval set as we go. I don't really call it vibe coding anymore; I call it eval-driven coding loops, because it's a very different mentality once you start thinking about not just passing but doing well. As a trick for the at-home version: one of the most useful things we did, maybe half a year ago, was adopt OpenTelemetry, so all the logs, and actually not just the logs but the prompts too, feed back and the AI gets feedback on itself. In the last few months both OpenAI and Claude exposed that as well, so you can now do this at home. You might not write your own tools, but you're probably writing your own prompts, and this is a very different way of thinking about them.

One more thing I want to share that was kind of fun: we started thinking about the problem differently. Before, it was a cold start: you have a question, you answer it, and each question is answered independently. But in a real scenario, if you actually watch how somebody does the competition, they spin up the database, take a look at what's in there, and start figuring things out. So we wondered: what if the AI jumps in and does that, and not only that, but goes full IR mode? It runs a full IR flow without even knowing what the incidents are: it figures out what the incidents are, creates a bunch of timelines, creates a bunch of IR reports; we even had it create graph visualizations mapping them all out. It spends time dumping the competition data without seeing any questions. Just think about that: it just has a database, and it investigates. Then it has a crib sheet, and then it answers the questions. And what we get, and this is a result from this week, is that we are now at 100%. It's not that we're beating the competition; we killed it. This thing is dead. Also worth noting: because of that, our solve time per question is now cut in half. It still has to investigate, but it already knows so much about what's going on; it's burned in and ready to go. We're still figuring out all the ramifications of this, but it's changing how we think about a lot of it.

So with that, just a couple of points to wrap it together. Hopefully you leave with a bit of confidence here. A lot of people in this room are using vibe coding tools: just get one, go to GitHub, get your favorite CTF, get your favorite MCP, and try it. See how far you can get. Try some of the tricks I gave here; see if you can go from an F to a C-minus, and then iterate from there. We're trying to make this easier, and again, the goal is to get this into everyone's pocket, so you have a CERT team in your pocket, always available. We're getting ready to release that stuff for free; for now we're giving it out privately for free, so sign up on the website, tell us what database you have, and we'll get you early access to the free version. And it's bring-your-own-keys: all private AI.

The other point is conceptual, and I just want to go back to the beginning, on benchmarks. Hopefully you're seeing why I care so much about CTFs as benchmarks; it's adding up to something like this being the core of how the community needs to work. My recommendation for the CTF community is: don't ban AI, make an AI track. People need to learn this stuff anyway, and it's actually our chance to shape where the industry goes, because we get to say what we believe is important. And with that, I just want to say thank you.