BSidesSF 2026 - One Thousand and One AI-Prevented CVEs: Vibe Coding a Whole New... (Brandon Wu)

Name: BSidesSF 2026 - One Thousand and One AI-Prevented CVEs: Vibe Coding a Whole New... (Brandon Wu)
Uploaded: 2026-05-12
Duration: 44 min 25 s
Description: One Thousand and One AI-Prevented CVEs: Vibe Coding a Whole New Supply Chain Defense Brandon Wu Supply chain attacks are everywhere with over 20,000 CVEs filed in Q2 2025. We wrote a non-agentic, internal tool which combines deterministic callgraph-based methods with selective LLM reasoning to “vi

BSidesSF · 44:25 · 31 views · Published 2026-05

Mentioned in this talk

Tools used

Semgrep

Platforms

GitHub

About this talk

One Thousand and One AI-Prevented CVEs: Vibe Coding a Whole New Supply Chain Defense Brandon Wu Supply chain attacks are everywhere with over 20,000 CVEs filed in Q2 2025. We wrote a non-agentic, internal tool which combines deterministic callgraph-based methods with selective LLM reasoning to “vibe code” Semgrep rules, colossally decreasing the time security researchers spend processing CVEs. https://bsidessf2026.sched.com/event/d320ea99f09a52d8b8b44030d6a4bff8


next talk. I'm just going to read it cuz, you know, it's a long title: One Thousand and One AI-Prevented CVEs: Vibe Coding a Whole New Supply Chain Defense, with Brandon Wu. Just a quick reminder, please use Slido. It makes our life so much easier. To use Slido, just type bsidessf.org/q&a. Brandon, the stage is yours. Thank you. I think there's too much money in this industry. No, that's not a woo. That's a too much. This is like the light, you know, this theater, this gigantic space. It's crazy. I feel like I'm just here to talk to people. Okay, who here knows about the concept of the fourth wall? You know,

when someone breaks the fourth wall, where does that originate from? Does anyone know? The stage. When you have what's called a proscenium, right? When you're in an auditorium, the fourth wall is the wall that you face when you go to see your show. That's the wall that they're facing. And when you lift that fourth wall, that's when the actor or whatever is talking to you. I would like to lift that fourth wall. I would like to dispense with the artifice of this presentation dynamic because I don't much like it. I don't like this whole thing where I'm just up here talking and you guys are over there. So, this is a wordy way of

me saying that at any point if you have a question, if you're curious about something, if I explained something that didn't make sense, please holler at me, okay? Just say something, you know, raise your hand Well, I won't see your hand if you raise it, but please um any questions at any time will be very much appreciated. But um let's do away with the majesty and the cast sellers here. Um okay, uh one thing I wanted to say. So, first of all, hi, my name is Brandon. Um I I think am the most fitting person to be giving a talk right now at BSides. Why is that? It's because I am a gigantic musical theater

nerd. This theme could not have been picked better for anyone but me. In fact, I am not supposed to be here right now. I am supposed to be at rehearsal for a musical I'm in. So, if you have any questions after this, I know that's I'm just letting you know, ask it quickly cuz I got to leave in the next like 15 minutes after this show is done after this this uh talk. So, it's like you know, I also have um a a Spotify playlist for my for musicals and it's 12 hours long. So, I just want you to know, I'm not messing around. I'm serious, okay? Like talk to me about musicals too. It's

I will have a good conversation for 15 minutes. All right. So, let's talk about what we're here for. One Thousand and One AI-Prevented CVEs: Vibe Coding a Whole New Supply Chain Defense. Why is this talk called that, Brandon? Well, if you ask me that question, I'd have to tell you it's cuz I rang up my CTO Drew with this brand new talk idea and he said, "Can you put vibe coding somewhere in the title?" It's partly a joke, but he gave me this subtitle and I think it's a better, catchier one. I got all of you in the room anyways. All right, if you're in this room, you're probably concerned with security.

And in particular, you might be concerned with supply chain security. That is to say, the security of our third-party packages, our libraries that we make use of, the, you know, endless domino, elephants-upon-elephants tower of software that we depend on every single day. And in particular, I will tell you, as you probably know, that there are a lot of problems with supply chain security, some of which we're here to solve, right? Let me give you some concrete numbers, just so that I can say that I gave you figures: 50,000 estimated supply chain vulns in 2025, up from 40,000 in 2024. That's some percentage increase, you can do the

do the math. And we've got Shai-Hulud, we've got React2Shell, we've got Langflow, we've got, you know, all sorts of random, really bad supply chain vulnerabilities happening. What do we do about them? Oh my gosh. Well, hopefully something, right? So, from our perspective, or from the perspective of any company that is building some kind of supply chain solution, some kind of product, some scanner that allows you to secure your third-party libraries, the question is raised: what can we do about it? How do we provide a good defense? How do we defend against the incoming flood of CVEs, of supply chain attacks? So, that question in blue will be our

question that we are concerned with today. Okay. If you ask me or if you ask somebody on the street, they might tell you that at the heart uh if you go up to someone on the street and you ask them, "What do you think about vulnerability detection products?" Uh they'll tell you, of course, that it's the detection engine. That is to say, it is like the the main thing that's analyzing the code, that's figuring out what's wrong with it, that is the heart of it, that is the most important piece. Um having worked on a detection engine for the past 4 years of my life, I can tell you that this is not precisely the case. Um it is kind of so,

and you know, to make my head bigger, I would like to think that it is the most important thing, but there are actually a lot more things that go into securing against supply chain threats. For instance, a mature software supply chain solution has to be able to keep up with, as I said, the incoming flood. There are just so many CVEs, there are so many different things we have to defend against. It's not enough to just be really good at finding stuff. The question is, what do you want to find? I can give you, you know, the world's best bloodhound that can detect microns of small particles. And if you don't tell

it what a bomb smells like, how's it going to find anything? All right. I came up with that analogy just now. So, meet Max. Max probably isn't here, but Max is a co-worker of mine. He's a security researcher, and let me walk you through his routine. He eats breakfast, he wakes up. That's actually not there. But he checks the queue, the queue being tickets that have been filed overnight for incoming CVEs. So, these are software supply chain vulnerabilities. I haven't defined the term for you. I'm sorry. But some vulnerabilities may occur overnight and Max checks his queue for what may have been filed. For every single one of those CVEs, um

for myself, I work at a company called Semgrep. Basically, the idea is that we would like to write a rule, a query that can find vulnerable uses of that CVE. So, ways that you might actually be vulnerable to this attack. So, for instance, one CVE might be that if you use the function foo inside of your package, boom, big bad. That's a bomb, because my foo function might be compromised. That is in its essence what a Semgrep rule is trying to find. So, he has to do that for every single one of those CVEs at vulnerable versions of the package. I

will define these again later also. So, to do this, he might have to go trawl the internet to read through the description of this vulnerability. What is the vulnerability about? What package is it concerning? When did it occur? Who do we think did it? What kind of attack is it? Is it an XSS? Is it a prototype pollution? What are we talking here, right? And we'll typically also find in it the commit to that open source repository that fixed the issue. So, we open the repository, or Max does, okay? He and I, the royal we, we open the repository and we click through the source code and we're

checking, we're looking, and we're trying to see what in the source code could have gone wrong so that we know how you use this package to be vulnerable. So, we have to write a rule. The problem, Max does not spend every single second of his life looking through open source libraries. There are in fact some that he has not seen before as as as good of a security researcher as he is. So, he searches GitHub to figure out how people use it. When you use TensorFlow, what do you what do you use? Do you use tensorflow.bar? Do you use tensorflow.createMatrix? Et cetera. And let's suppose that this particular CVE is a singular utility function. It's like a doAuth function.

Okay, this is just an example inside of this library. Well, if it's private, you're not going to be using it as an outside consumer of this third-party library, right? And this function might be used widely across the package. Let's suppose that your authentication mechanism is somehow vulnerable and this is used everywhere inside of this library. Well, then does it stand to reason that every single user of this package is vulnerable? If I so much as look at it, if I import it, am I going to run into some kind of vulnerability? After 30 minutes of clicking through source, he's just guessing. He's looking around. He's like, "Okay, this is like

this looks vulnerable. So, A calls B and B is vulnerable, so A must be vulnerable." And then he rinses and repeats to death. This is the year 2026. We do not need to do that anymore, okay? So, he thinks he's finished it and he writes a rule for this CVE, for this vulnerability. And boom, 20 more vulnerabilities to go. I don't know about you. That doesn't sound like how I want to be spending my 8 hours for my 9-to-5. So, the point is that this is a lot of work. It's a lot of work for Max. He wants to go eat lunch with us and he's too

busy reading through source code and trying to figure out what CVEs are doing. This is a pain point particularly for a security researcher at Semgrep. But this is an overall universal pain point for any of us who would seek to protect ourselves, guard ourselves from supply chain vulnerabilities. There's just a lot of them and we don't have the time to dedicate our human eyes and our human brains to it all the time. Man, it's so sad that I don't have a clip-on mic. I really want to just walk around. So, this workflow becomes more untenable as there are more CVEs, okay? If you take it to its logical

extreme, the limit of n as n approaches a very large number of CVEs makes security researcher face go unhappy, okay? That's the equation. The surface of languages and frameworks we want to support expands. Let's suppose we want to talk about Let's secure Rust tool chain. Let's talk about PHP. Let's talk about Composer. Let's talk about all of these different ways of distributing packages. And don't even get me started on C++. Um it it's so many different ways that you could be vulnerable and so many different aspects and avenues along which you could be uh infected. And every single rule that we write or every single, you know, new detection, the rule part is abstract. It could be

whatever. It costs you human resources and costs you time and costs you possible mistakes, okay? We want to do it in a way that scales with the number of supply chain vulnerabilities out there, so it costs us less in terms of researcher time. Max still works at the company. We're not laying him off, okay? If you're concerned. Like any good programmers, there's a workflow and we are trying to automate it, without causing people to lose their jobs, importantly. So let's do that. How about that? Hello. My name's Brandon. This is a slide I ripped from a different slide about me. I am a program analysis engineer working

in a company called Semgrep. We do application security. Um, I've been working at Semgrep That's where I That's That's where I I ripped it. Um, I've been working at Semgrep for about 4 years now. This is a This is an old slide. Uh, and I studied computer science at Carnegie Mellon and I am really big functional programming kind of guy. So, if you like functional programming, come talk to me about that, too. Um, we are a software security startup, an application security platform, and we would like to find and fix software security vulnerabilities, as opposed to not finding and not fixing. Um, and our mission is to profoundly improve software security. If you'd like to

stalk me, there are ways. You can also follow me out of this auditorium and no one will stop you, probably. I don't think I have bodyguards, right? No? Okay. Sorry. So, lifecycle of a vulnerability. Well, a vulnerability starts and ends in the code, right? It's all code, to use some vernacular. So, we have an introducing commit. All right. Let me try to concretize a lot of the things I've been saying to you. We're talking about third-party libraries. We're talking about your Django, your LangChain, your TensorFlow. Stuff that you import into your library, right? And people make open-source contributions to it all the time. And sometimes those contributions mess stuff

up. And that's okay. That's how open-source works. We just got to work past that, but it is the reality of our industry, right? So, the introducing commit, I'll say, is the commit that introduces some kind of insecure behavior. It could have been in the source since its inception, which would be really bad. Or it could be when some functionality is later modified. Okay? You've heard about this. I mean, XZ and stuff, all this stuff recently. That's basically how that typically goes. The disclosure is when we find the vulnerability and we're like, "Hey, that doesn't look right." And then we have the CVE system for giving it a kind

of ID, and a unique identifier, and then, you know, collecting information about what that vulnerability entails, so that people know, so people can upgrade, so people can hopefully not be vulnerable. Um, I'll say that the patch commit is the commit that counters the introducing commit. So, if we introduce the vuln in one commit, we have another commit later that hopefully undoes it. Or if we don't have a patch commit, then that sucks and, you know, that's our fault. Um, and then the vulnerable range is all versions released between the introducing and patch commits. I think that should probably make sense. Um, although I will stop if anyone thinks that anything has been confusing so far.
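As a minimal sketch of the vulnerable range just described, the check below treats the range as every release from the first one containing the introducing commit up to, but not including, the first one containing the patch commit. The function names, the dotted numeric version format, and the sample release list are illustrative assumptions, not anything shown in the talk.

```python
from typing import List, Optional, Tuple

Version = Tuple[int, ...]


def parse_version(text: str) -> Version:
    """Parse a dotted numeric version such as '2.4.1' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def vulnerable_versions(
    releases: List[str],
    introduced_in: str,
    patched_in: Optional[str],
) -> List[str]:
    """Return the releases in the half-open range [introduced_in, patched_in).

    If patched_in is None there is no patch commit yet, so every release from
    introduced_in onward is treated as vulnerable.
    """
    low = parse_version(introduced_in)
    high = parse_version(patched_in) if patched_in is not None else None
    result = []
    for release in releases:
        version = parse_version(release)
        if version >= low and (high is None or version < high):
            result.append(release)
    return result


# Hypothetical release history for an imaginary package.
releases = ["1.0.0", "1.1.0", "1.2.0", "1.2.1", "2.0.0"]
affected = vulnerable_versions(releases, introduced_in="1.1.0", patched_in="1.2.1")
print(affected)  # ['1.1.0', '1.2.0']
```

Real ecosystems have pre-release tags and backported fixes, so a production check would need the package manager's own version ordering rather than plain integer tuples.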

See, I wanted to break this artifice. You see, if I was in the movie theater audience, I'd be sleeping right now. It's so dark in here. Oh my goodness. Okay. When a package with a CVE is used by a consumer, it's our job, as Semgrep Supply Chain, or, you know, any other supply chain product out there, to determine if the customer is affected, if the user of this tool is affected. Cuz not all vulnerabilities are made the same. Sometimes you could be using a malicious package and not be vulnerable to that particular vulnerability in that package. There's a difference between using a

malicious package and using a malicious package in a malicious way. Uh, so depending on the nature of the library, this could be complex. This could be hard to determine. Um, so for instance, how does the vulnerability present itself? There's some lingo we have, um, internally, but let me tell you walk you through what this basically means. So, uh, well, actually, let me let me first Has anyone here Has anyone here used Semgrep? Like, hands up? Hands up? Understand anything about it? No? No? Yes? Yes? Okay, cool. Um, uh, I just said no and yes at you bunch of times and some people moved. That was it. Uh, cool. All right. Well, that's not going to affect anything I say right now. But

basically, the point is that Semgrep is a technology for finding particular patterns in code using things called patterns. Sometimes we can write a Semgrep query that says, "Are you using foo from this package?" We cannot so easily write a Semgrep query for something like, "Does this program halt?" Cuz that would be impossible, for those of you keeping track at home. So for the kinds of vulnerabilities we can query for, we say that you can write what's called a reachable rule. So, you can flag, "Oh, an object's save method might have a path traversal vuln," where we can write a Semgrep rule to determine that. A library's foo

function might also be vulnerable if it's called with sensitive inputs. These are all things we can usually write a Semgrep query, or CodeQL query, whichever, to find. Um, so these are things we call reachable. These are things that we can very clearly identify against non-instances of that vulnerability. Um, and typically, in the case where we want to write such a rule, a subset of the, um, package's API is vulnerable. There are some, uh, non-cases of this, but, uh, for instance, like if I had a package that had foo, bar, and baz, it may be the case that only foo is vulnerable. In which case, I would tell you, "I'm going to detect uses of foo and not care so

much about bar and baz." All right? Um, that is basically the extent to which we write a reachable rule. Um, there's also things we call upgrade only. So, this is the case where, um, some packages are built differently. Uh, some packages are not like things you import a single symbol from. Some packages are like web frameworks that are laced throughout your entire authorization layer. Some packages are more fundamental to the way that your that your app runs. Uh, maybe they're even running your app. In which case, uh, this code that you've used might actually be upgrade only in the sense that there there's no way for you to not be vulnerable unless you just upgrade.

All right? You're out of luck. Sorry. Just go upgrade the package and you'll be fine. Um, there are differing extents and different circumstances in which this is true. But, uh, basically, I'm telling you that there are these two cases. Um, does that not make sense to anyone? Ah, see, I got you. I got I got some people nodding their heads. And what you meant to say is, "No, it doesn't not make sense to me." See, double neg double negative. Tricky. Um, so for instance, a rule might really basically look something like this. Um, we have a sample rule and the pattern says, "If you call library.save on any arguments, then you're out of luck. You

got a vulnerability. Sorry. Go next." Um, that's the idea of a vulner a reachable rule. I keep saying vulnerable. Wow. Okay. Um, so we want a better way for writing these rules that cuts out manual work. Um, and some things should be true. Check time. Okay. Uh, it should be autonomous. It should be discriminating. Actually, this is not so important, but the point is that, uh, we want to be able to do it at low cost. We want to be able to do it without, um, expending too many resources, and we want to be able to do it in a way that we can tell that it's correct. I'm actually realizing I pontificated too much, so I'm going to move on a bit.
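The sample rule described above (flag any call to library.save, whatever the arguments) would be written in Semgrep with a pattern along the lines of `library.save(...)`. The sketch below is not Semgrep; it is a rough stand-in using Python's standard `ast` module to show what such a reachable rule matches. The helper name `find_calls` and the sample source are assumptions made for illustration.

```python
import ast
from typing import List


def find_calls(source: str, obj: str, method: str) -> List[int]:
    """Return sorted line numbers of calls shaped like obj.method(...)."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == method
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == obj
        ):
            hits.append(node.lineno)
    return sorted(hits)


# Hypothetical consumer code; it is only parsed, never executed.
sample = '''import library

library.save("report.txt", data)
library.load("report.txt")
result = library.save(path=user_path)
'''

print(find_calls(sample, "library", "save"))  # [3, 5]
```

This stand-in only recognizes the literal name `library`; it would miss aliased imports such as `import library as lib`, which is one of the things a real analysis engine handles.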

All right. So, the problem. Let's write Semgrep rules from CVEs. If any of you were thinking, "Man, this sounds like I can just Claude Code this. I can vibe code this." Wow. You would be right, in the year of our Lord 2026. However, the cool thing about this talk, I think, is that we are not going to do that, because the tool I'm describing was developed in 2024, before agentic coding assistants existed. But I think it has a few useful lessons, and this is what I really want you to get out of the talk. Non-agentic code is deliberate code. When I write code that does something,

some workflow, and I I ground it in an actual script, I'm being very deliberate. I'm not saying, "Figure it out," and then leaving the implementation of that abstract. I'm telling you precisely how to do it. So, that's very transparent. Um, agentic workflows, I'll say, I'm sure most of you all of you know, are unpredictable. Sometimes you tell it to go do A and it does B. Or you tell it to do B and it says, "I did B, boss." And then you check and it didn't do B. And you're like, "Why didn't you do B?" And it's like, "You're absolutely right. I didn't do B." Like, it's it's not so fun when your when your workflow goes back to you

and says, "I don't know. Like, I thought I did it." Like, I didn't I didn't realize that we wanted to employ adolescents, okay, as our as our agents. Like, I'm not a big fan. Neither should you be. Um, I would say that structured code is also easier to test, reason about, and improve upon. Um, if your entire process of doing this rule-writing thing is just a black box, it's Claude code, how are you supposed to tell whether or not any given part did better this time or worse this time? It's all It's all grounded inside the quote-unquote brain of, uh, the LLM, and I'll not say more on that in fear of starting a revolution. Um,

but it's harder to targetedly improve parts when the parts don't really exist. Okay. So, let's talk about a workflow that today could be agentic, but it's not. And why I think that's great. That's not a copyright violation, is it? No, never mind. Okay. All right. Mostly deterministic code, though. All right. Um, and very importantly, uh, one other thing is different since 2024. Um, in 2024, I'm a big pop pop music guy. We had Chappell Roan and Sabrina Carpenter in the Eras Tour. But real ones know, in 2024, what we really had was Brat Summer. We had Brat Summer. And that's the name of this tool we're talking about today. It's called Brat. Why? Cuz in 2024, I thought

that'd be funny. And it still remains so, but less so with every passing year. Okay. All right. Well, All right. 2024 Brandon New Win. Let's talk about Brat. Uh, the idea of a tool to help researchers write rules is not new. Um, but the last one was called the Rule Authorship Tool, RAT. And then I decided to make a new one called Better RAT. Brat. You can clap now, please. Thank you. Man, I can tell you guys to do anything. That's awesome. Wow. Can we do that again? Can we Can I make you clap again? No? Ah, my god. This is crazy. Okay. Um, three steps to Brat. We're going to do research automatically. We're going to

do test generation and rule generation. Brandon, what does that mean? I'm going to tell you, dear watcher. Cool. So, I'm going to walk you through the highlights at a high level, and then we'll talk about specifically what that entails. The research portion uses LLMs, which did exist in 2024, just not Claude Code and Codex and stuff, to determine what we can about a CVE and package. If you recall in the Max example, Max was reading through the README, reading the API documentation, trying to figure out what this package had in it or was doing. Well, turns out we can actually use an LLM to do some of that, in a

less good way than we can do today with agents. So, we might want to scrape the README. We might want to scrape the online API documentation, the readthedocs.io. We might want to search GitHub for real examples of how people use this package. That's actually really important. And we're going to do what's called building a code graph, a call graph, a graph where the nodes are functions and the edges are given by which functions call which other functions. I'm a program analysis guy. This is kind of my bread and butter. So, if this is confusing, ask away. And for test generation, what we're going to do is, basically, we want to

create a test file that vets that we can actually find uses of the vulnerable function. So, for instance, we might write a test file that has synthetic uses of the vulnerable functions, and then we can test whether or not our rule that we generate uh matches against it. So, uh the important thing here is that we can do each function independently. I'll come back to this when I talk about composability and the strengths of that uh as opposed to agentic workflows. Um this part's not super important. Uh but it will be composable. And then finally, rule generation will be using the test file from the previous step. Can we generate Semgrep patterns, queries that can match against the test

code? And we can verify that deterministically and consistently, which is very, very important. And for any of you working with agentic workflows, I highly recommend you find your holy grail of a repeatable kind of verification step that you can always stay within because, you know, the edges of the solid are hazy, but if you can contain it to a verifiable kind of polytope, then you're right as rain. Good as new. Okay. And we can always mechanically verify whether or not rule generation succeeds. I'll show you an example of this in a bit. Cool. All right. Here's my thesis

to you. I believe that pure, simple, and verifiable parts make a stronger product. It makes stronger software. I think that you can definitely use LLMs and you definitely should use um AI to do things. But I think that there's a difference between creating a solid foundation where the parts are are well-defined and the tasks are well uh scoped to nondeterminism and just like saying, "Okay, go do the thing." You know? Um uh that would be a lot different. Yes. I'm not quite using the word pure in terms of pure functions. I was more using it in terms of vibes, but in general I do like pure functions. Also, I think both are good. Um uh some of the

functions I'm talking about are not pure, but I would like all most functions to be pure. So, good question though. Uh thank you. See, uh so can I get a clap for clap for like a he he asked a question. Again, let the artifices fade. It's It's one of us has a very bright light shining into our eyes, okay? The rest Otherwise, we're all the same. Oop. And a microphone. Uh okay. Not all CVEs are made the same. Uh I'm going to decide I This is isn't isn't important. Basically, some libraries are libraries and some things are like uh FFmpeg and you don't actually like import it into your code. It's not really that important.
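The generate-then-verify step described earlier (produce a test file, then keep only a candidate pattern that matches it) can be sketched as below. This is a loose illustration under stated assumptions: the candidates are plain regular expressions standing in for generated Semgrep patterns, the `# ruleid:` and `# ok:` comments are modeled on Semgrep's rule-test annotations, and the rule name, test file, and helper names are made up.

```python
import re
from typing import Iterable, Optional, Set

# Hypothetical synthetic test file: the line after "# ruleid:" must match,
# the line after "# ok:" must not.
TEST_FILE = '''import library

# ruleid: demo-rule
library.save("out.txt", data)

# ok: demo-rule
library.load("out.txt")

# ruleid: demo-rule
library.save(path)
'''


def expected_lines(test_source: str) -> Set[int]:
    """Lines that must match: the line following each '# ruleid:' annotation."""
    lines = test_source.splitlines()
    return {
        index + 2  # index is 0-based; the annotated code sits on the next line
        for index, line in enumerate(lines)
        if line.strip().startswith("# ruleid:")
    }


def matched_lines(pattern: str, test_source: str) -> Set[int]:
    """1-based numbers of non-comment lines where the candidate regex matches."""
    return {
        number
        for number, line in enumerate(test_source.splitlines(), start=1)
        if not line.strip().startswith("#") and re.search(pattern, line)
    }


def first_verified(candidates: Iterable[str], test_source: str) -> Optional[str]:
    """Return the first candidate whose matches equal the annotated lines exactly."""
    want = expected_lines(test_source)
    for candidate in candidates:
        if matched_lines(candidate, test_source) == want:
            return candidate
    return None


# Stand-ins for generated candidates: the first is too broad, the second fits.
candidates = [r"library\.", r"library\.save\("]
print(first_verified(candidates, TEST_FILE))  # library\.save\(
```

The generator can be as nondeterministic as it likes; the accept or reject decision depends only on the candidate and the test file, so it gives the same answer on every run.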

All right. So, da da da da da, I'm not interested in this. I want to talk to you about code graphs. I like code graphs. Do you like code graphs? I like code graphs. Let's talk about code graphs. All right. Insight: vulnerable code is made vulnerable by calling vulnerable code. Brandon, you just said vulnerable three times in quick succession. You can't just do that and make people think you have a point. Okay, the point I'm trying to make is that if I call vulnerable code, I am now probably vulnerable in the most, you know, realistic cases of open source software, right? So, remember Max's CVE? I said there was a

single private utility function that was changed by the patching commit. That implies the vulnerable behavior probably had something to do with that function that was changed, right? If I fix something by changing this one function, probably it was that function's fault. Now, I'm not pointing fingers, but I'm definitely pointing fingers. Any caller of that function, transitively, right, through other functions, may be compromised. That is totally possible. And if we traverse the call graph backwards, we should be able to eventually find parts of the library's API that are outward-facing, that are public, that people actually consume. So, for instance, this deeply nested private function might eventually be

transitively called by, I don't know, uh my dot save method. And then I'll know that anyone using that dot save is out of luck. So, that's the basic idea of the algorithm we're going to do. Oh Oh, I have a nice graphic for you. That's crazy. Okay. I am not creating the code graph at runtime cuz I don't mess with runtime. That's That's That's That's no man's land. It's hard to do. It's hard to set up and it's hard to use. Uh you don't want that. Um I'm a big static analysis guy. We are We are writing programs that analyze programs. Good point though.
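The backward traversal described above can be sketched as a breadth-first search over inverted call edges: start from the functions the patch commit changed, walk to their callers transitively, and keep only the ones that are public. The call graph here is a hand-written dictionary with hypothetical function names; in the talk the graph comes from static analysis, which this sketch does not attempt.

```python
from collections import deque
from typing import Dict, List, Set


def vulnerable_public_api(
    calls: Dict[str, List[str]],
    patched: Set[str],
    public: Set[str],
) -> Set[str]:
    """Walk caller edges backward from patched functions; return public ones reached."""
    # Invert caller -> callees into callee -> callers.
    callers: Dict[str, Set[str]] = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    # Standard BFS; the visited set also protects against recursive call cycles.
    vulnerable = set(patched)
    queue = deque(patched)
    while queue:
        function = queue.popleft()
        for caller in callers.get(function, ()):
            if caller not in vulnerable:
                vulnerable.add(caller)
                queue.append(caller)
    return vulnerable & public


# Hypothetical library: save -> _write -> _do_auth, and load -> _read.
calls = {
    "save": ["_write"],
    "_write": ["_do_auth"],
    "load": ["_read"],
    "_read": [],
    "_do_auth": [],
}
print(vulnerable_public_api(calls, patched={"_do_auth"}, public={"save", "load"}))  # {'save'}
```

Because the result depends only on the graph and the set of patched functions, the same commit always yields the same set of flagged public functions.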

So, if you have a lambda in a in a in a function, it's usually not so different than if you deal with like a regular function, right? So, you can We We do like an incremental analysis where you can look at each lambda and then at the consumption site of the lambda say something like, "If the first argument to this lambda were to be tainted, then I know that for sure this thing goes to some kind of sensitive sink." It's the same as a regular function. It's just that a lambda is harder because you can pass it around as true data. You There is more cases that you can't catch with a lambda. Let's suppose I pass it like through my my

code graph, like very, very far, like five function levels deep. But usually, for most lambdas, I would say, you know, it's like your list.map(fn x => x). The use site is right where the lambda is.

Dunder calls are usually just syntactic sugar though, right? If I were to do x + y, that's implicitly dunder add. So, I don't see that as too different. It's just that your analysis engine has to be smart enough to be like, "Plus in Python means dunder add." It's just kind of a special casing. It's annoying more than it is unrealistic. But yeah. Yeah, good question. Cool. So, let's suppose we had a database call and we had a helper, and that helper was edited by a patching commit. This guy, bad. This means this purple guy is probably bad and vulnerable. So, if I had a

function that called that helper, and then another exposed API function that called that function, I would say that both of these are potentially vulnerable, right? Because transitively they they might pass data that goes to the sensitive database call. Not always, but largely, I would say. Um if that makes Does that general gist make sense? Cuz this is the core of the whole thing. Yes. Oh my god, I got responses back. I I was I was beginning to think that you guys weren't alive. All right. Uh cool. So, we got building the code graph. Um so, I I mentioned this earlier, the call graph is functions as nodes and caller-callee relationships are edges. Um we're going to use Semgrep program

analysis using the Semgrep engine to be able to do this. You could do this with an open source indexer like SCIP too, if you so wished, because it can kind of statically analyze your code. And we mark functions that are changed by the patching commit as vulnerable, right? And then we do everyone's favorite thing that you learned in your first or second year of computer science education: we do what's called a graph search. We do a graph search backwards. And so every node that calls a vulnerable function is also marked as vulnerable. I don't have a fun cozy little, you know, animation for you, but you

know, you can use your brain. So, this makes code graph generation deterministic, which is really nice because it means that if I have the same code changed in a given diff, the nodes that are marked as calling it for a given commit in a repo will always be the same. You'll note that we haven't really factored in AI yet. This is part of my thesis here, where we're going to use a tasteful amount of AI, let's say. AI will be used in small subroutines like scraping API docs, parsing READMEs, etc. And that's because in 2024, you couldn't really dispatch an agent to do these. Okay. Here's an example of a code graph,

which is simple, I think. So, the starting node is the one in gold. This is for a real CVE, by the way. I don't know which one, but it is one. I pulled it the other day. So, we have this remove DTD markup declarations function, which is marked as starting. That means that there was a CVE where that function was changed to fix the CVE. That heavily implies to me that remove DTD markup declarations is a bad guy, to cite the Billie Eilish song. It's a bad guy that causes bad behavior and it should be sent into timeout. And how do we send it into timeout? Well, we find

its transitive callers, of course. So, we see that clean calls that function. That's the way that the edges go: A points at B if B calls A. So clean calls that function, as I just said. And then is_svg calls clean. And you'll note that there are colors on this, unless you're colorblind, in which case I apologize. The top two are red and the bottom one's blue. The reason for that is that the is_svg function is known to me to be a public function. I know that when people use this library, they actually import that function and they actually use it. The other ones, not so

much. This is important because what I could do is generate a rule that says, "Let me look for remove DTD markup declarations. And let me look for clean. And let me look for is_svg." But that would be unnecessary. It would be slow, because it would cost resources to look for something that nobody uses. These are internal, private functions. I'll tell you later how we determine whether a node is public or private. But the idea is that we would like to contain ourselves to the transitive closure, to use a fancy math term, of this vulnerability, and then we want to constrain ourselves to only those public

functions. Here's a big code graph. It's not even that big. >> [laughter] >> I've got way bigger ones. But this at least gets the point across. This is a similar thing. Can you make that out? Yeah, about. This guy right here is gold and it's called whoop. And then we have all these blue guys in the middle. So, what I'm trying to get across to you is that if I wrote a Semgrep rule that tried to look for every single one of these nodes, it would be wasteful. It would be silly. It would be repetitive. It would be repetitive. I wouldn't like it very much. So, let's not do that. But they

get even bigger. They get crazily big, especially in Java. I don't know what Java people are doing, but it's wild. So, the transitive closure, I used that word again, of a project could be hundreds of nodes. We can't realistically write that rule. I kind of just said all this stuff. Oh, here's the part where I was going to tell you how we filter the public nodes. So, Brat uses a filtering pass that uses GitHub's API to search for all of the ways a given library is used. I use heuristics to do that. I look for import TensorFlow. I look for from TensorFlow import, the concrete strings of that. And then we, as a first

pass, get all of the files that seem like they have import TensorFlow. Then we can use a program analysis tool to pare that down and find real uses: is this tensorflow.foo, is this tensorflow.bar, etc. So if I wanted to check, for instance, whether foo.bar was used in code anywhere, for a library called foo and a method called bar, I could search GitHub for the literal string bar() and the literal string import foo. I might get false positives, right? I might get bar being in a string. But as a first pass I then have my candidates, and I can use program analysis to actually determine

whether or not that's correct. Does that make sense? I also don't have a graphic for this, but I hope you can visualize what I'm saying here. Cool. As a note, I have been rate limited by GitHub literally thousands of times. It's very annoying. I started doing a thing where I cycle between different API keys. They won't let me send thousands upon thousands of requests to GitHub for code search at scale. Who would have guessed? So we can do that to find true usages of foo.bar, and that's how we're going to get our blue nodes in this graph here. Oh, if I do this I can see you guys

better. Can't see you. Can't see you. Cool. So yeah, I have ADHD. So basically what we can do is find the real public uses of the nodes here. All right. Oh my gosh, I do have a graphic. Crazy. All right. So if I wanted to find foo.bar, basically I'd search GitHub for import foo and bar. I'd find both of those code snippets, perhaps. The second one, by the way, is fake, right? Because it has import foo and bar, but not foo.bar. And the first one is an actual usage of foo.bar. And then with Semgrep, good old trusty Semgrep, best friend Semgrep, we can figure out that the top one is true and the bottom one is not.
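The two-pass filter described here can be sketched in Python. This is only a minimal illustration under stated assumptions, not Brat's implementation: the substring check stands in for the GitHub code search, and Python's standard ast module stands in for Semgrep as the program analysis step. The library foo, the function bar, and both snippets are hypothetical.

```python
import ast

def is_candidate(source: str, module: str, func: str) -> bool:
    """First pass: cheap literal-string match, like a code search query."""
    return f"import {module}" in source and f"{func}(" in source

def really_calls(source: str, module: str, func: str) -> bool:
    """Second pass: parse the file and look for an actual module.func(...) call.

    Simplification: only plain `import module` is handled (no aliases,
    no `from module import func`).
    """
    tree = ast.parse(source)
    imported = any(
        isinstance(node, ast.Import)
        and any(a.name == module and a.asname is None for a in node.names)
        for node in ast.walk(tree)
    )
    if not imported:
        return False
    return any(
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Attribute)
        and node.func.attr == func
        and isinstance(node.func.value, ast.Name)
        and node.func.value.id == module
        for node in ast.walk(tree)
    )

# Hypothetical search hits: both contain "import foo" and "bar(".
REAL_USE = "import foo\nresult = foo.bar()\n"
FALSE_POSITIVE = "import foo\nprint('call bar() later')\n"

for snippet in (REAL_USE, FALSE_POSITIVE):
    print(is_candidate(snippet, "foo", "bar"), really_calls(snippet, "foo", "bar"))
```

Both snippets survive the textual pass, but only the first survives the syntactic one, which is the split between cheap candidate gathering and real verification that the talk describes.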

That's how we do that. All right. Why do we split test gen and rule gen? Well, we'd like to just spawn an agent, give it the afflicted functions, and call it a day. Let's suppose that we did this, right? I said, "Agent, foo.bar is vulnerable and foo.baz is vulnerable and foo.qux is vulnerable." And what I found was that usually it would mess up. Who would have guessed? AI can mess up? I know. Crazy. Sometimes it would give me an invalid Semgrep rule. Sometimes a symbol would be missing. Have you ever fed like 20 things to an AI, said do something on all 20 of them, and it gave you back 19

things? I don't know if this is just me, but this happens to me all the time. I'm doing some variation of this for my day job all the time. It's very annoying. Things like that would happen. I don't want that. And sometimes the rule wouldn't semantically work. It wouldn't match the targets. It wouldn't match the functions that I said were vulnerable. Also bad. One of them is bad because it doesn't parse. One of them is bad because it doesn't work, all right? Both are bad. So how do we limit ourselves there? Insight: do separate things separately.

If I have different functions in a package, they are independent. My ability to test whether I can generate one rule for one part of the package's API surface is separate from my ability to generate one for another part. I'm telling you some specific things, but I want you to know this is, I believe, a general guiding principle. If you're doing work with AI, or you're doing work in any kind of way that's composable, where you have separate subproblems, do the subproblems independently and verify the subproblems independently. So that's what we're doing here. We're trying to build a total rule by taking each part individually. So if I wanted to flag usages of

foo.bar and foo.baz, I could independently ask an LLM to generate test code for foo.bar and for foo.baz. I could ask an LLM to generate a pattern, a query, that matches foo.bar, and one that matches foo.baz. And then test them separately. I could test whether bar and baz each worked, and then I would stitch together the results, because textually speaking that's kind of simple. If you haven't seen a Semgrep rule that might not be obvious, but just take my word that if I have a query, I can have a query that's also A or B. Okay? And that's basically how we would do it. We would stitch together our independent queries. But the simple subproblems

are dealt with in parallel, separately, right? And if any subcomponent fails, rerun it by itself. Don't rerun all 10 of your subproblems when only one of them failed, right? Keep the nine that worked. Toss out the one that was wrong. Okay? And then we stitch all of them into a single rule. So let me show you how this works. So for instance, we were just saying this thing about how I want to generate... Can you read that? I hope so. Yes. Let's suppose I have two symbols in this library. These are synthetic, but this works for any example we care to discuss. We might have foo.bar and

foo.baz. And an LLM might spit out these two code snippets. So foo.baz is used in a way where we pass vulnerable equals true. As a side note, if you ever run into a function where there's a vulnerable argument, maybe passing it as true is not the brightest idea you had that day. One is certainly better. That's right. We don't shame, but there is one way that's much better. And then, two, we might have this foo.bar, or baz, I guess. Looks like I mixed up the words, but whatever. This foo.bar down here, which is unilaterally vulnerable. If you call it whatsoever, you've got a vuln on your

hands. And so if I generate a pattern for each of these, if you don't read Semgrep, just trust me that this works. We might write... Ooh, what the heck? I didn't know I could do that. We might write this one foo.baz pattern, which flags vulnerable equals true, because it's probably okay if you call it with vulnerable equals false. For real. And then we might have this one pattern that just checks: if you use foo.bar whatsoever, you're out of luck. Sorry. Bad. Go back to start. And then we can use AI to generate a test file, which is two parts. Sorry, we can take these two code

snippets and then literally just stitch them together into one code file, right? Because if I have a Python file and I have another Python file, I can usually concatenate them, textually speaking, and it's still a Python file. It's gross, but, you know, yell at me. It works. And then we can concatenate the two patterns into a Semgrep rule also. And this is why I keep talking about separation. It's possible you could one shot it. It's also possible you'd mess it up. In which case what we can do is verify that this pattern works on this test file and this pattern works on this test file. And if any given

component fails, we can rerun it. Or we might be able to proceed and discard that one component. But the point is not to one shot everything. It's to decompose things into subproblems. Does this whole kind of flow make sense to people?
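The generate, verify, retry, and stitch flow above might look roughly like the following Python sketch. Everything here is an assumption for illustration: fake_generate stands in for an LLM call, fake_verify stands in for running Semgrep against the per-symbol test code, and the rule id and message are made up. The rule is built as a plain dict using Semgrep's pattern-either key, which is the "A or B" query mentioned in the talk. The talk runs subproblems in parallel; this sketch runs them one after another for simplicity.

```python
from typing import Callable, Dict, List, Optional

def generate_verified(symbol: str,
                      generate: Callable[[str], str],
                      verify: Callable[[str, str], bool],
                      max_attempts: int = 3) -> Optional[str]:
    """Retry one subproblem by itself until its pattern verifies."""
    for _ in range(max_attempts):
        pattern = generate(symbol)
        if verify(symbol, pattern):
            return pattern
    return None  # give up on this symbol only

def stitch_rule(rule_id: str, symbols: List[str],
                generate: Callable[[str], str],
                verify: Callable[[str, str], bool]) -> Dict:
    """Solve each symbol independently, then OR the verified patterns together."""
    patterns, dropped = [], []
    for symbol in symbols:
        pattern = generate_verified(symbol, generate, verify)
        if pattern is None:
            dropped.append(symbol)  # recorded, never silently lost
        else:
            patterns.append({"pattern": pattern})
    rule = {
        "id": rule_id,
        "languages": ["python"],
        "severity": "WARNING",
        "message": "Call to a function affected by a known vulnerability",
        "pattern-either": patterns,
    }
    return {"rule": rule, "dropped": dropped}

# Hypothetical stand-ins so the sketch runs without an LLM or Semgrep.
EXPECTED = {
    "foo.bar": "foo.bar(...)",
    "foo.baz": "foo.baz(..., vulnerable=True, ...)",
}
attempts: Dict[str, int] = {}

def fake_generate(symbol: str) -> str:
    attempts[symbol] = attempts.get(symbol, 0) + 1
    if symbol == "foo.baz" and attempts[symbol] == 1:
        return "foo.baz("  # simulated bad first answer
    return EXPECTED[symbol]

def fake_verify(symbol: str, pattern: str) -> bool:
    return pattern == EXPECTED[symbol]

result = stitch_rule("foo-vulnerable-calls", ["foo.bar", "foo.baz"],
                     fake_generate, fake_verify)
```

In this run only foo.baz is retried, foo.bar's verified pattern is kept untouched, and any symbol that never verifies ends up in the dropped list where a human can see it.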

That's kind of just the way that the test harness works. We could have them separate also. But that's kind of just ergonomics. At the end of the day, all that's important is that we have the separate parts. Yeah. Good question though. Any other questions about this setup? Cool. Thank you. All right. Advantages are numerous. We cannot accidentally drop a symbol, all right? That's one of my biggest things about using AI to do anything: it can just sometimes not write what you want it to write, right? So for instance, if I ask it to full on generate this, it might just not generate the

part of the test file for baz or bar or whatever, which is pretty sucky, and I don't like that very much. We can audit each part of the workflow independently. We can see whether or not it succeeded. And overall we have a more robust, composable system, because each subproblem is examinable and understandable, you know? So our failures are isolated, basically. So for results, Brat has authored nearly 3,500 rules over two years of existence. We have some stats here, given to us by our precious researchers, where we have some rules that are ready to merge, which means that you just go. And we have some rules that are irrelevant.

That's a very small percentage. That's this little blue guy here. Some that are very helpful, some that are helpful, and some that are not helpful. I realize these are words, and I will quantify them for you soon, because you all are quantity people, I assume. I see phones, so yeah, feel free to take my data. Go for it. The first digit of my social security number is two, in case you're wondering. >> [laughter] >> All right. Data from security researchers indicates that researching a CVE and writing a rule takes on the order of an hour, which is not super fun. It depends on the rule, but maybe on average we can say that.

Conservative estimates, so I'm trying to be as hard on myself as possible, tell us that a ready-to-merge rule could take about 10 minutes of researcher time, sometimes less. A very helpful rule might take maybe 20 minutes of researcher time, a third, and a helpful one could take maybe 45 minutes of researcher time. >> [snorts] >> So proportionally we've saved, you know, 40%, or probably more because I'm being conservative, of the time spent writing rules, which is pretty good. I'm happy with that. But wait, we have more, to put on my best Billy Mays impression. We have a tangible decrease in human resources, but we can also use this to backfill things. So for instance,

when I wrote Brat, we didn't have support for PHP or Rust in Semgrep Supply Chain. When you now have an AI plus deterministic code workflow for writing rules, all of those 2,000 Rust rules in your backlog, you can just run on all of them in parallel and be done in two hours. And how long was that going to be otherwise? Thousands of hours of researcher time. Not done done, but, you know, it gets you a large portion of the way. So the fact that this can be parallelized is very, very powerful, I think. We have had nights where, one time, this was a freak thing, it was crazy, 50 CVEs were like

filed into our queue in a single night. I think a single researcher found one thing that was applicable amongst many, or got lazy and decided to publish their homework the night before the deadline. But there were 50 in a night, and I would not like to be the person that had to deal with those 50 CVEs the next day. The point is that Brat can deal with all of those in parallel, because Brat is not a human and Brat can just kind of do all these things. I'm not trying to start an AI versus human thing here. So importantly, we did not replace the

human workflow. We just have a trusted, mostly deterministic pipeline with a human-in-the-loop workflow, right? The human at the end can look at all the artifacts. Let me show you this briefly. So what Brat does is it generates what we call research summaries. Basically, these are the output of the research step. This is all generated. It gives you a description, the commit, some description of how you would use the library, the API that it scraped from online, and then the code graph is in here as well. These things are created because I wanted the researcher, the human in the loop, to be able to look at the artifact and tell

whether or not the thing had done a correct job. And if I go look at this... Da da da da da. I already told you some of this stuff. Let's take a look at a case study, which is a prototype pollution in devalue. That's a pretty simple npm library. I ran an agent on it. I told it, "Hey, go research this and write a Semgrep rule." And it said a few things. It said, "The vulnerability is in devalue.parse because the unflatten function doesn't block __proto__ keys." Sure, sounds legit. LGTM, let's go. The agent states the problem is in unflatten, but it didn't realize that

unflatten, as a helper routine, is also exposed by the library. So the rule that it generated basically says, "If you use devalue.parse, you've got a problem." Only devalue.parse. But if we look at the code graph for this library, we see that on the left here, hydrate, which is the real problem, is called by unflatten, which is called by parse. And both of these are blue nodes; they're public. The case I'm trying to make here is, yes, okay, we can make it so that an agent just goes and one shots the rule, but you'd miss things like this. You'd miss things like this because you had no part of the workflow that you

actually were able to look at and see the reasoning. Why did I put devalue.parse in? Oh, well, because it's called devalue, so might as well. No, but what about unflatten? Why didn't you put unflatten in? And you just didn't catch it. The false negatives that you have are immense when you just shell out to Claude Code or whatever and don't think about the actual work that you're doing. And I think this makes for a much more trustworthy, much better, much more composable kind of workflow that you can actually use and be assured of its success. I'm a big believer in this kind of thing. So, this to me was like the holy grail.
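The case study can be replayed with the backward graph search described earlier in the talk. The following is a rough Python sketch, not the actual tool: the call edges (parse calls unflatten, unflatten calls hydrate) and the public status of parse and unflatten come from the talk, while the extra stringify node is added here only as a hypothetical example of an unaffected public function.

```python
from collections import deque
from typing import Dict, Iterable, List, Set

def transitive_callers(calls: Dict[str, List[str]],
                       patched: Iterable[str]) -> Set[str]:
    """Breadth-first search backwards from the patched functions.

    `calls` maps each caller to the functions it calls. The result is the
    patched functions plus everything that reaches them transitively.
    """
    # Invert caller -> callees into callee -> callers.
    callers: Dict[str, Set[str]] = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    vulnerable = set(patched)
    queue = deque(vulnerable)
    while queue:
        fn = queue.popleft()
        for caller in callers.get(fn, ()):
            if caller not in vulnerable:
                vulnerable.add(caller)
                queue.append(caller)
    return vulnerable

# Edges from the case study; "stringify" is a hypothetical unaffected node.
CALLS = {"parse": ["unflatten"], "unflatten": ["hydrate"], "stringify": []}
PUBLIC = {"parse", "unflatten", "stringify"}

vulnerable = transitive_callers(CALLS, ["hydrate"])
to_flag = sorted(vulnerable & PUBLIC)  # only public functions go in the rule
print(to_flag)
```

Because the search is deterministic, both parse and unflatten land in to_flag every time, which is exactly the symbol the one-shot agent rule left out.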

I was like, "Wow, it actually missed that. That's crazy." So, our problem is that they're both vulnerable, but the rule written by the agent only covered one. So, agentic, okay. Don't get me wrong. The amount I've spent on Claude Code is more than I'd like to admit to any given human being. But it skips steps and it has leaps in logic. That is not what I call bueno. That's not good. I don't like that. It's not auditable, it misses things, and it's hard to improve because all you can do is prompt tune. If I am here at the results, the agent is all the way over here at the end,

over here. Hi. Because it's making a leap in logic. It's not so fun, right? So, sure, use the agent, but be careful. Brat, and also by extension deterministic workflows, are subtask oriented, the parts are verifiable, and they're composable. So, we can do the small subproblems with AI and be relatively assured of success. And it's improvable at each stage. So, okay, you know, I'm paid for by big shareholders. The shareholders want me to say, "Use AI. AI is great." And in many cases they're correct. But I'd like to posit to you that good old-fashioned code land is also pretty cool. All right? And so we should try to be aware of that when possible.

Maybe make a script instead of asking an agent to do something. Or have the agent make a script, even better. Deterministic workflows and non-deterministic LLMs, I think, and some other things, make an effect better than what either could accomplish alone. To me, that's very obvious, but I'd like to posit this to you right now. And I think that at the end of the day, a tasteful amount of AI is the right amount. That is basically the thesis of what I want to tell you, okay? Yeah, that's it. Do I have time for questions? Or did I run it all out? So, we have time for two questions. We

have two questions in Slido. >> questions. Oh my gosh. Two fun questions. >> telling you the other digit. You have to find out. First question, what musical are you in? I am in a production of Legally Blonde the Musical. I have dress rehearsal right after this. So, I'm serious. If you want to ask more questions, I've got to go. So, find me fast. Second question, favorite banger of Brat. Wait, what's our favorite... Oh, oh, the album, not the... You must understand that I've solely been referring to the technology for the past 2 years of my life. I really like 360. I know it's the first track, but I really like

360. It's just that the bass is so good. Dum dum dum dum dum dum. It's so good. Sympathy Is a Knife is pretty good, too. Yeah. I think that's it. Thank you so much, Brandon. Cool. A round of applause for Brandon.
