
You know what they're talking about downstairs? What? AI? [ __ ] I don't know about that currently, but that's a lot of AI everywhere. It's not just your track. >> Nope. >> All righty. We'll go ahead and get started.

So my goal is to talk about what we've learned using LLMs on pen tests. This isn't really an AI hype talk, and it's not an AI doom talk. It's somewhere in between: realistically, how at least I see it today. As Aaron mentioned, I'm Alex Lauerman. I founded the pen testing firm TrustFoundry, I founded SecKC, and I like hacking stuff. I do lots of
tool building and automation, so using AI is right up my alley. I'm really enjoying it and running with it as best I can.

Okay. So, the thing we're all wondering, especially if you're hearing some of this hype: there's all this autonomous AI pen testing hype. Mythos, Mythos, Mythos, GPT 5.5. There are always crazy new models coming out, and they're always getting better. There are tons of great models and tons of marketing around them. And they are finding really cool bugs, or at least kind of cool bugs: real
zero days, really impressive bugs, such as the 23-year-old FreeBSD zero day that Mythos found recently. So there is some validation behind all the hype, for sure. Then we'll talk about what actually works today, on real engagements and real targets. What will we see? We'll talk about where it helps, where it fails, and the patterns that work, and we'll end with Q&A.

So we'll start with the wins, basically all the good stuff AI does. It does amazingly at code review. It has a huge context window. It can look at tons and
tons of code. The coding agents have more or less invented a way of sorting through a codebase and figuring out which code is relevant. They follow that and are able to produce a lot of really good context for the LLM, which produces good bugs. It catches lots of weird, esoteric stuff that I see. >> I think we might be low on battery on this one. >> Lots of bypasses, postMessage handlers, just weird JavaScript things like that that are >> around, still around here somewhere. >> He took it with him.
All right,
I'm okay to keep talking, but I only get recorded. But >> I'll give you one. >> Over here. So it finds all sorts of crazy, weird stuff that is often very hard for a human to see. AI is also great at writing code, which creates new attack surface. The code is usually pretty good, but there are definitely bugs in the code that gets written. So there are new apps and new things like that to test. In Anthropic's research, which was based on code review, they found something like 500 bugs using Opus 4.6. I don't think these were all validated, but they're potential bugs at least. And in the Mythos research I
mentioned earlier, they found a 23-year-old FreeBSD bug, which is really impressive in a really hardened operating system. That screenshot there also shows how simple the prompt was that they used to find those bugs. This isn't fancy, complicated, crazy stuff. It's basically: you're playing in a CTF, find a vulnerability, and write the most serious one to report.txt. So, super simple prompts, running that in a loop over a bunch of files. It doesn't take a ton of crazy prompting and orchestration to make that work. So the bottleneck, as I mentioned, is not just finding the bugs, but it's
actually validating them. I mentioned those 500 bugs that Opus found, but they're not all validated. Maybe a quarter of those are valid. Maybe 50 were valid. There's just a lot of work to do after that.

AI is also good at recon coverage, like sorting through huge, you know, 15-megabyte JavaScript files, as well as chaining assistance, payload generation, and format conversion. There are just lots of super helpful things that it's good at. Yeah, the battery's out. We've tried a lot of AI pen testing tools. There's tons of stuff on GitHub, tools for everything, including testing. I haven't found one that really shines. They're all just good
at different things. But Claude Code, or really just having a harness, is really helpful, and it's our favorite way to do things. You can customize it and do whatever you want. I guess that's how the very smart researchers at Anthropic are doing it, too. So if very smart AI researchers are doing it that way, we should be able to do something extremely complex combin
where you take that thing >> I was getting scavenger hunt. >> It is down there. I told you, just walk that way and yell "scavenger hunt" and start. >> Cool. So where were we here? We like Claude Code a lot, or Codex. Any of those agentic, flexible interfaces are great. The way you configure them is also flexible: you can have MCPs hook up different tools. The sky's the limit on what you want to do. You can hook up different subagents and configure them any way you want, depending on what you want to optimize for. I think generally,
just making sure they're very thorough is a challenge. The newer models, especially Opus 4.7, tend to be a little lazy from what I've seen. They tend to think they've solved problems that are still in progress, so you just have to encourage them to be really thorough. Here's an example of a simple skill for finding pen testing findings. You probably can't see that from there, but it's really easy to write skills on the fly, or just tell Claude what you want it to do, and it will find potential bugs. Validation is another step, but it will find lots
of potential stuff to look into, which is great.

AI can also be super helpful with reporting and all the boring stuff that goes into pen testing. This is probably really boring if you've never been a pen tester, but it's probably super exciting if you have been. All the reporting monotony is really tedious. There's tons of boilerplate text that we're constantly writing, tons of little formatting errors, and tons of things that you're customizing depending on what TLS cipher suite is in use or something, which is super tedious. It's hard to make that all
consistent across a team and still customized, and AI helps a lot with that. Here's an example, which is going to be too small for you to see, of sending it up to our pen testing reporting platform. You can use AI to document these things, as well as to review human-written or machine-written output to help improve it. So AI is super helpful across the entire back-end process. I think that's the biggest win for us: making everything as efficient as possible. That's one of the things it really excels at.
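As a rough picture of the boilerplate being described, here is a minimal sketch of one deterministic way to keep TLS cipher suite finding text consistent across a team. Everything in it is an assumption made for illustration: the `WEAK_MARKERS` list, the wording of the template, and the `render_finding` helper are hypothetical and are not from the talk or from any reporting platform.

```python
from string import Template

# Illustrative, incomplete list of substrings that mark a suite as weak (assumption).
WEAK_MARKERS = ("RC4", "3DES", "NULL", "EXPORT", "MD5")

# Shared wording, so every report phrases this finding the same way (hypothetical text).
FINDING = Template(
    "The server at $host accepts $count weak TLS cipher suite(s): $suites. "
    "Disable these suites and prefer modern AEAD suites."
)


def weak_suites(offered):
    """Return the offered suites that match any weak marker, sorted for stable output."""
    return sorted(s for s in offered if any(m in s for m in WEAK_MARKERS))


def render_finding(host, offered):
    """Render the finding text for a host, or None when no weak suite is offered."""
    weak = weak_suites(offered)
    if not weak:
        return None
    return FINDING.substitute(host=host, count=len(weak), suites=", ".join(weak))
```

Text rendered this way could then be handed to a model for review or tailoring, with a human making the final call on what goes into the report.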
Let's see. It also helps write code. So if you're writing tooling like this, you can get further, faster than ever, and that's a great thing. What's shown here is an AI peer review: it reviews report content and suggests improvements for a human to go through and decide how to make the report as good as possible. The irony is I'm not sure how much this will continue to matter. It's always been a point of pride for every pen tester to write reports as cleanly as possible and make things as thorough, detailed, and reproducible as possible. But I'm not sure how many of
our customers are just throwing things into Claude Code and saying "fix this" without even reading it. If that's the case, it probably doesn't need to be super clear. But our goal is to make it as clear as possible for the time being. I think that stack between pen test report, Jira ticket, months in the queue, and then code deployment is just going to get thinner over time. I'm not exactly sure what piece of that is going to be eliminated. Yeah? >> Have you thought about basically generating a customer? >> Yeah, the question, just to repeat it for the mic, is: have you thought about generating
a prompt for customers to use? Yeah, we have an API that agents can communicate with. My goal is to make it as agent-friendly and human-friendly as possible, but it's mostly focused on humans right now. I don't know how many of our customers would want that, but I'm going to guess that the more cutting-edge and bleeding-edge startups would. I think a lot of people just copy and paste stuff and throw it into Claude, which is fine. But yeah, you can do everything at scale and just say "fix all the bugs in this report" if you want to be kind of lazy
about it and let it rip through everything. So yeah, there's an API that facilitates that, and I think making things agent-friendly is definitely important.

Another thing, and this is kind of related to being able to write code, is that it enables a ton of data analysis. Being able to get crazy analytics that we've never been able to get before is now possible. I think it's kind of ridiculous. Pen testers are probably some of the worst at this, but I'm sure a lot of other professions have had this too, where it's just really hard to answer: how many cross-site scripting
findings did we identify last year? What bug class do we see the most of, or the biggest increase in? Questions like that have been really hard to answer across a bunch of different engagements, and now we can collect and analyze that data in a nice way. I like that. I think those trends and that analysis are super useful.

I guess the first place we saw AI being really good at pen testing was CTFs. They're nice, confined problems with a clear win condition. There's no
blast radius and no concerns about production servers or anything like that. And they're usually moderately hard problems that are solvable within a reasonable time frame. So AI does really well at those and continues to do better and better. I think it's solving a lot of CTF problems, but not some of the hard and really hard ones. It depends on the model and the problem. Anyway, CTFs came first, and that was the first place we saw the models doing really well. They've gotten better and better at different tasks since then. So a
year or so ago we started hearing that they could be used for pen testing, and now we're seeing everybody using them across the board for finding bugs.

So we'll talk a little bit about where AI fails. One place we've seen it fail, and this is my personal failure, is getting confused, because the attack surface is weird. This is the only slide about testing AI rather than testing with AI, but it's a story I think of for this. I was testing an LLM chatbot, and I convinced it to give
me... I'd been chatting with it for a while, and I finally convinced it to do bad things. I got /etc/shadow out of it and all this stuff. But what I didn't realize is that the thing I was communicating with was just hallucinating the whole time and tricking me. It was making up everything I asked for in some reasonable format. It had me fooled. I thought I'd found something cool, but I hadn't, and I'm still embarrassed about it.

So, I mentioned earlier that it's easy to generate tons and tons of bugs. Ryan talked about this in his first talk, about bug bounty
programs and that third challenge there. In this chart you can see the number of bugs in open source projects going through the roof. Some of those are true positives. Actually, probably all of those are true positives. The false positives are going to be even higher, just because it's so easy to find potential bugs that look seemingly valid. And this is, again, a problem even for the smartest researchers at Anthropic. They can find tons of bugs, but they still have to get through them. He had, say, 500 bugs sorted by severity. He picked the top one, and it was severe, but he didn't have time to go through the rest at the time
of his presentation, or at least many of them. So the human bottleneck is real. Also, there are a lot of cool bugs and research coming out, but the false positive rates are never published. On their OpenBSD bug, they mentioned that Mythos found the FreeBSD, or OpenBSD I think, bug three out of 10 times, which is an okay rate, I guess. I mean, it's better than zero out of 10, but it's hard to know. You really need a human in there to validate whether that's true, because if seven out of 10 times it says
it's secure and not exploitable, that seems like a believable case too, and the bug might not be there once a human steps in and actually tries to exploit it. And token cost is real. They spent quite a bit on tokens, but I bet the human cost is larger. They didn't publish the human cost, I would guess, but they have a really talented team working on this stuff. They spent $20,000 on tokens, but I'm going to guess the human time they put into this research cost a lot more than that. So, yeah, validation is a huge challenge, even for the best models,
the best researchers, and that will continue to be the problem, or a problem, for the near future.

Another challenge is hallucinations and issues with severity. It will make a non-issue look like a real issue, and it's really convincing, but it's not real, and you just have to go in there, analyze it, and figure out what the case is. You can use a really good validation pipeline to minimize that as much as possible, but the farther you take things with automation, the more valid bugs you might be throwing
away, and the more false positives you might be including, I guess. You're going to have to lean in one direction or the other. So it's a challenge to get great validation that's automated as far as possible. And there are lots of issues with severity. AI is so good at certain things, but it really struggles with interpreting severity. It thinks everything is critical. You can kind of convince it otherwise and put a lot of standards around it, but it still has a really hard time interpreting specific situations and whether they matter or not.

When agents meet the real world,
let's see, they can run into problems where they don't know when to stop. They keep looking at a target that's down, or they waste tokens on a bad path. There are also concerns around whether they're really going to respect the guardrails I put in about rate limiting and things like that. It's hard to know exactly what AI will or won't do all the time, although with good prompting you should be able to limit that quite a bit. It's just hard to know for sure. I would say humans have the same challenge,
like you can ask a human not to run Nmap, or, well, a port scan, over a certain rate or something like that, and there are limits to that. Humans may not follow those instructions either. There could also be prompt injection issues, where agents follow instructions from the target and get confused. Agents just do weird things sometimes.

So, patterns that work. I think this is changing, and this is the conclusion I've come to here. The big variables I see are the prompt, the model, the token usage, the orchestration, and the tools
involved. A few years ago, I would say orchestration was a huge part of this. You had to work really hard to get good vulnerabilities out of, you know, GPT-3 or something like that. It was not agentic and not very smart compared to today's models. Things have changed. Models are really good, and with some decent prompting to create some orchestration around them, they're super powerful. Orchestration is getting less and less important, or at least it's super easy. I guess it's important; it's just easy. Claude Code just does that for you. It just does what you ask it to do
without any fancy coding on the back end. Token usage: balancing that is a challenge, but getting good coverage on everything is a priority, so it's about keeping usage high. The token cost is relatively minimal for the time being, at least for finding a lot of potential bugs. And the tools: I have them as the smallest variable there today. If you swap out a certain tool, say you're using Playwright instead of Chrome MCP or something like that, you're going to see different bugs. The same goes for how they're configured. All these small changes will
yield different bugs, and that's actually something humans run into as well. If you ask three different pen testers to look at the same attack surface, they're going to find different things. It's just a little more visible with an LLM, when you're getting different sets of bugs out of different configurations. So it's the same challenge we've always had, but it's a little hard to know what tool set is going to be best for a certain target. So there are definitely challenges on the tooling side, and in balancing coverage with token use. You know,
there are tons of variables in play when you're trying to maximize coverage in the time you have. And Claude Code just came out less than a year ago. Things have changed so much since then. Agentic models, or thinking models actually, came out around February, I think, just a little over a year ago. So things are changing quickly. What was hard a few years ago is now a lot easier. With all this orchestration, and writing any code on the back end, you can move a lot more quickly now.
So I guess the takeaways are: AI is finding bugs faster than ever, validation is the new bottleneck, and the moat has moved. Expertise is where it lives. This overlaps a little with Ryan's talk earlier: you still need a lot of human expertise, both to finely tune these models with a lot of validation to get them as far as possible, and to perform human validation on top of that if you want to get something like 100% accuracy. So there's definitely still a place for a human element in finding bugs. I think one piece of evidence of this
is the HackerOne leaderboards and the various bug bounty leaderboards: it's all humans at the top. I mean, they're using AI to help them along the way. But let's say there's a billion dollars of unfound bugs in bug bounty programs. If someone had something that would just print valid bugs with 100% accuracy, they could kill it and make hundreds of millions of dollars. The fact that nobody is able to really dominate shows that there are a lot of those small tweaks to variables that matter a lot, and a lot of human validation that takes time.
Once we see someone just killing it with tons of valid bugs, that might change. Yeah, I think that's it. My contact information is there. Any questions?
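The simple prompt loop described earlier in the talk (one short prompt, run once per file, with validation as a separate step) can be sketched roughly as follows. This is only an illustration under stated assumptions: `ask_model` is a hypothetical callable standing in for whatever model or harness is used, the prompt wording is a paraphrase of what was read out rather than the exact slide text, and the `Candidate` structure is invented here to keep unvalidated output clearly labeled.

```python
from dataclasses import dataclass
from typing import Callable, Mapping

# Paraphrase of the prompt quoted in the talk; the exact slide wording is not known.
PROMPT_TEMPLATE = (
    "You are playing in a CTF. Find a vulnerability in the file below and "
    "write up the most serious one.\n\n"
    "File: {name}\n\n{source}"
)


@dataclass
class Candidate:
    """A potential bug reported by the model (hypothetical structure)."""
    name: str
    report: str
    validated: bool = False  # stays False until a human reproduces the issue


def scan_sources(sources: Mapping[str, str], ask_model: Callable[[str], str]) -> list[Candidate]:
    """Run the same simple prompt once per file and collect unvalidated candidates."""
    candidates = []
    for name, source in sources.items():
        prompt = PROMPT_TEMPLATE.format(name=name, source=source)
        report = ask_model(prompt).strip()
        if report:  # an empty reply is treated as "nothing found"
            candidates.append(Candidate(name=name, report=report))
    return candidates
```

The loop itself is trivial, which matches the point made in the talk: every `Candidate` starts with `validated=False`, and the expensive part is the human work needed to flip that flag.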