
[music]
[music] One of my favorite things about BSides Portland and organizing it is finding keynotes. I want to find people who have a connection to the Pacific Northwest, people who do amazing things in the industry, and people who are working at the edge of the industry on things we all might work on someday. And I thought, Perry, you would be an excellent candidate. So without any more distractions and rambling from me: are you ready, Perry?
>> I am ready. Can you hear me?
>> We can hear you. Thank you, Perry.
>> Wonderful. Hello, everyone. I wish desperately I could be there with you today. I was at the airport last night and had some unfortunate issues with the flight. However, I'm going to try to make this as exciting and engaging as possible. I was so excited when Joe asked me to give the keynote because, like he said, I am from the Pacific Northwest. I was born and raised in Seattle, and then my parents moved down to Oregon, actually southern Oregon, about half a decade ago. They now live in Ashland, Oregon, right outside of Medford, and I go there every year for the holidays. But the first time I ever came to Portland was for a Latin competition, actually. It became a yearly endeavor to go down to Portland for the Latin competition, because I was, like many of the folks in the room, a bit of a nerd in middle school and high school. So we'd drive down from Seattle, do the Latin competition, and we'd always do Voodoo Doughnut and go to Powell's. My mother both loved and hated this, because I would go to Powell's and buy maybe five or ten of the largest hardback used books I could find, and then we'd have to truck them back to the car and all the way back up to Seattle. I just have phenomenal memories, and so I'm truly sad I can't be in Portland today, because it is one of my favorite cities. But hopefully I'll make this as engaging as possible. I think Joe is going to do questions at the end, and I'm going to try to leave enough time for those, because this is really fun material, and it's material that is available to everyone to play around with and leverage in your day jobs, your hobbies, et cetera. I'd really encourage you to engage with it. So I'm going to be talking about the AI Cyber Challenge, which is something that I did when I was at DARPA. I've given a little bit of my PNW history, so now from a
professional point of view: I've recently left and am now at Dartmouth as a fellow, but prior to that I spent the last four years at the Defense Advanced Research Projects Agency, an agency in the federal government that focuses on funding the cutting edge, the next great breakthrough in science and technology across the spectrum. I focused on cyber security, and I followed in the footsteps of some really great PMs who focused on cyber security, such as Mudge, who was at the agency in the 2010s. While I was there, I started a challenge called the AI Cyber Challenge, a two-year-long effort to use AI to find and fix vulnerabilities in software. I want to give a little bit of background on the competition, because what we'll be talking about today are the takeaways: what are we walking away with? I pitched it in the spring of 2023 as a two-year-long competition in which we partnered with a number of entities in the cyber security and AI space to develop tools that could automatically find and fix vulnerabilities in software using AI. This was a time when LLMs had really come into their more powerful iterations. In the 2010s you'd seen promise, but it was really around 2022 and 2023 that they became so powerful that it made sense to apply them to the space of software security. The agency had a history of focusing on automatic software security prior to that, but what it had focused on was more traditional methods in what we call program analysis, automatic ways of understanding a computer program, and there hadn't been a significant amount of integration with LLM-type models. So this was going to be a two-year-long competition. We'd run the semifinals after one year at DEF CON, and the top seven teams would each walk away with $2 million in prize money. They would return to DEF CON for the final competition, in which the first-place winner would win $4 million. Combined with some prizes available to small businesses at the beginning, the total prize pool ended up being around $30 million. So we were putting a lot of money down for software security, and we had 42 teams compete. In the semifinals there were five challenge projects. Now, I want to talk a little bit about the way that we designed the competition and the thought that went into that design, to ensure that what came out of it was actual, usable software and discoveries that pushed the state-of-the-art forward.
The first aspect of this was the open-source projects, which I'll talk about in a bit, but also the collaborators that we brought on. We had Google, Anthropic, OpenAI, and Microsoft as collaborators on the challenge, as well as the Linux Foundation, the Open Source Security Foundation, and Black Hat and DEF CON, because we wanted to make sure we were getting the state-of-the-art in LLMs, we were getting perspective from the open-source community, which, at least from my perspective, is one of the most important communities we need to support when it comes to software security, and we were running this in a place where it could get the most visibility and engagement with the cyber security community. In designing an effective competition, like I said, the first aspect was prize money. You need to make this a worthwhile competition for the best and the brightest in both AI and software security. If your software security experts can go make a lot more money doing something else, they're just not going to do your competition. So you do have to make it worthwhile. For a team of seven or eight experts, getting $2 million to spend a year on this ends up roughly equating to what they might get paid doing something else. You also, like I talked about, need access to state-of-the-art technology. The AI tooling we were trying to build relied on frontier foundation models, and those models were changing so quickly that if you ran a competition without getting perspective and insight from the folks making them, you might build tools that simply don't represent the state-of-the-art. So we went to OpenAI, Anthropic, DeepMind, and Microsoft. I approached all of those companies in May of 2023 and said, "Hey, I'm thinking about doing this competition. I'd love to have your engagement with it." And all of those places were really excited about software security. They build software in-house, and they rely on a lot of open-source software. So that was a really impactful aspect of it. The third thing was moving quickly. Like I talked about, I was ideating this in May of 2023, and we had the finals in August of 2025. The US government does not usually move that quickly. This was, I think, a unique opportunity, and it took so many bright, driven folks pushing this forward and committing to the timeline. But moving quickly also allowed us to harness the excitement around LLMs at the time and push that energy into the competition. Now, probably one of the most important aspects of competition design, if not the most important, is the game structure. You can have all of these other things: prize money, collaborators, the ability to move fast. But you also need a game structure that is designed to measure who won, fairly and transparently, and to create an environment that isn't overly gamified, that isn't an overly toy environment, but rather one that's going to produce discoveries and technology with real-world application. That is a very hard balancing act, and something my team and I agonized over as we were putting this together. It is one of, if not the, most important aspects of designing a competition like this. So, as I mentioned earlier, part of what went into that game design was real-world software. If you create a competition with toy software, that sometimes makes sense depending on the maturity of the technology, and DARPA had run a competition like that ten years previously, which really drove the state-of-the-art forward. But now that we had LLMs, and ten more years of research focused on automatic software understanding, it did not make sense to run another competition on synthetic software projects designed for the competition. We had to make sure our tools could run on real-world software.
That was what motivated us partnering with the Linux Foundation and the Open Source Security Foundation, but it was also what motivated us to design all of the challenges in the competition around open-source projects. If I go back a little bit and we look at the semifinals, you can see that there were five challenge projects: the Linux kernel, Nginx, Tika, Jenkins, and SQLite. We actually found real-world vulnerabilities in that software, in addition to the vulnerabilities we inserted into it as part of the competition, and that itself was a difficult balancing act in the game design. We needed to ensure there were a sufficient number of vulnerabilities in the software to really stress-test the tools the competitors were creating. But there were also going to be real-world vulnerabilities, which means that as organizers we don't have the answer key; we don't know all of the vulnerabilities in the software. Automatic vulnerability discovery suffers from a problem of false positives. So in scoring all of the teams, and again, there were 42 teams just in the semifinals, we had to have a way to measure whether a team had found a real vulnerability, whether one we put in ourselves or one that was in the open-source project organically, as opposed to them finding something they think might be a vulnerability but isn't actually one. The software projects that we developed challenges around consisted of these projects. So you have curl, IPF, FreeRDP, et cetera. In some cases these are going to be very familiar, in other cases less familiar, but they are all very commonly used software projects. We decided to focus on two languages, C and Java, and on different kinds of vulnerabilities in each of those languages. And we had different kinds of challenges. Again, going back to this concept of real-world challenges, we wanted game design that would translate into real-world gains, so we designed challenges around the idea of making something that could fit within a real-world software development life cycle. In some cases we had challenges that were diffs, so delta scans: you would receive a commit, a diff of what the code looked like before and after the change, and the AI systems the teams were developing, which we called CRSes, for cyber reasoning systems, would have to determine whether there was a vulnerability in that commit. In some cases, the CRS would just receive a large code base and would have to scan that code base. In other cases, the CRS would receive the output of a static analysis tool, something called a SARIF, a Static Analysis Results Interchange Format report with the vulnerability details, and the CRS would have to determine, based on that static analysis output, whether it was a true vulnerability or a false positive. In the software development ecosystem, when you run static analysis tools, you often run into false positives. So all of these, the commit review, the full source code review, and the static analysis output review, mimicked different aspects of real-world software development. And what teams would submit based on that was a proof of vulnerability: they would take in these different kinds of challenge formats and submit a proof of vulnerability in the form of a triggering input, a user input that would interact with the software in a way that triggered the vulnerability. They would submit a patch for that vulnerability, and they might submit their own static analysis report, or a combination of all three.
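As a rough illustration of what the delta-scan task shape looks like from a CRS's side, here is a minimal, hypothetical sketch: hand a single commit's diff to a model and ask for a structured verdict. The ask_llm helper, the prompt wording, and the JSON schema are invented stand-ins for illustration, not any competitor's actual code.

```python
import json
import subprocess

def ask_llm(prompt: str) -> str:
    """Stand-in for a real model client (OpenAI, Anthropic, etc.).
    Returns the model's raw text response."""
    raise NotImplementedError("wire up your preferred LLM client here")

def review_commit(repo_path: str, commit: str) -> dict:
    """Hypothetical 'delta scan': ask an LLM whether a single commit
    introduces a vulnerability, and return a structured verdict."""
    diff = subprocess.run(
        ["git", "-C", repo_path, "show", "--unified=10", commit],
        capture_output=True, text=True, check=True,
    ).stdout

    prompt = (
        "You are reviewing a single commit to a C or Java project.\n"
        "Decide whether the change introduces a memory-safety or injection "
        "vulnerability that untrusted input could trigger.\n"
        "Respond with JSON only: "
        '{"vulnerable": true|false, "reason": "...", "sink": "function name or null"}\n\n'
        f"--- commit diff ---\n{diff}"
    )
    return json.loads(ask_llm(prompt))
```

A real CRS wraps something like this in retries, validation of the JSON, and, as we'll see, much more careful prompt engineering.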
So the scoring system was the next aspect of game design that had to be considered. How do you weight the different submissions from the teams? How do you weight the different challenges? And how do you weight a proof of vulnerability versus a patch versus a static analysis report? This is how the final team scores ended up looking. Eventually we converged on a weighted equation that combined all of these different categories into one score, with different weights. The thought behind the weights was making sure we could toggle and tune just how much we wanted, for instance, false positives to count against a team. Did we want teams to suffer significant point losses for accuracy failures, or did we want to tune it such that some inaccuracies were okay but too many were not? We also wanted to ensure that patching was more heavily weighted than vulnerability discovery. This is a defensive competition, a competition focused on providing the software community with tools it can actually use. Finding vulnerabilities is truly not enough; all that does is give developers more work to do. What we wanted was something that would actually create tools that could automatically fix software, and that required weighting patching more heavily than simply finding the vulnerability. The weights for this equation were released prior to the competition. And here we have the score breakdown of the teams, which I'll go into in just a second, but I want to finish talking about the game design aspects, because once you look at the outputs of a competition after the fact, it might seem straightforward: oh, great, it produced all of these things, that's exciting. But it's actually so much more complicated than folks realize to produce a well-designed game that will result in these tools. One of the other things we considered was resource constraints. One of our concerns was: what happens when an extremely well-funded team just throws compute at the problem
and beats out all the other teams? Not by innovating, not by thinking creatively, but simply by using an amount of compute that isn't realistic in the real world. So all of the teams had resource constraints. The teams were given a certain budget for their cloud compute and a certain budget for their LLM usage, and they had to think very strategically about how to use that budget. What that did was produce results that are performant and that are constantly considering which agent, which aspect of the CRS, is using these scarce resources and how they can best be applied. And the final requirement for this competition, the one that I am most proud of, is the open-source requirement. This requirement ensured that all of the teams competing, in order to receive prize money, had to open source their CRS at the end of the competition. This is a space in which there's a lot of nuance: oftentimes tools require specialization and tailoring to specific software projects, and in some cases research that's not open-sourced and not immediately commercializable ends up sitting on a shelf. Because there was so much prize money on the line, we were able to put the open-source requirement in place, which made all of the tools and their source code available to the entire community to learn from, to build on top of, and to turn into software security tools that could fit within a CI/CD pipeline. So we're going to talk about a couple of the solutions that have now been open sourced, what takeaways we can learn, and what might be leveraged in other aspects of software security or cyber security more broadly. We're going to start with the team that came in first, with a score that was almost twice that of any other team: Team Atlanta. As you can see, this is a team that came in with a very high accuracy score and high scores across the board in all
categories, but especially in their program repair score. They clearly invested a significant amount in patching. They were also the team that used their resources most efficiently. As you can see, given a budget of $85,000 for Azure compute resources, they used almost that entire budget, and while they used not nearly the total amount they were allowed for LLMs, it was the highest amount of any team. So they thought pretty seriously about making sure things were not left on the cutting-room floor, that they were not letting resources go to waste. And it's really interesting to dig into the work that they did. This is a team made up of researchers at Georgia Tech [laughter] and out of Samsung, as well as a couple of other collaborators. They've released all of their code open source, they've put out a number of blog posts, as well as a 150-page paper on the design of their system, and I'm going to refer to that paper along the way in this presentation. As you can see, they have a very complex design structure. You can dig into each of these components, as we will, in which they're interacting with the LLM providers and with the AIxCC organizers, and they've essentially split their tool, on the right side of the screen, into several different components. They designed one component for finding vulnerabilities in C, one component for finding vulnerabilities in Java, and one that was language agnostic. Those were their three bug-finding modules. One of the really interesting takeaways, though, is that they found their multilang approach was the most successful approach they had, and it accounted for 70% of the vulnerabilities they found. And I'll apologize, because one of my cats is making a surprise appearance, but I figured for BSides that would be a welcome addition.
So this also provides a nice opportunity for a slight aside. As I've been putting together several keynotes on this topic, I've been thinking a lot about the ways in which we use AI to automate tasks. And I thought, well, there's so much information on AIxCC on the internet now. Could I simply ask an LLM to automate this for me? Could I ask it to automatically produce slides for me? So I went and said, "Generate me a slide deck on AIxCC." And it gave me a slide deck that was relatively generic and high level, but it did say some things that were true, you know: we want fair competition between AI systems, realistic complex cyber environments. And so I thought, okay, this is interesting. I could probably have paid an intern who knew nothing about AIxCC but could Google to do the same thing, but this is useful. And there is a point to this analogy. So then I said, well, let's add an image to the slide deck. In this case it was ChatGPT, and ChatGPT said, "Oh yes, let me add a nice image." I believe it said something to the effect of, "Let me spruce up this slide deck for you." And that was the image it added. So then GPT asked me if I wanted it to add some nicer images, and I thought, okay, great, it recognizes that's not really the kind of image I want in a slide deck. And that was its approach to making the images more creative, right? Just right on top of the text. And then it asked if I wanted some additional features for the slide deck, and I said sure, and it added a timeline which, honestly, could be a timeline for quite literally any project that has ever happened. So I decided, okay, what if I went to a provider that isn't just taking an LLM and having it produce slides, but one that has integrated knowledge of how slide decks should work, and a broader structure around the creation of slide decks, with an LLM's ability to generate the text and these other aspects: recognizing that you're not going to have an out-of-the-box approach, and that there's going to need to be an instrumented system here. So I asked a tool that does slide generation to make me this slide deck. And it had much better graphics this time around, really truly phenomenal graphics, and the text was completely nonsensical. So I was glad to see that we've come a little bit farther in the use of AI for slide generation, and I'm sure by the end of the year the tools will be ten times as good as they are now. But what struck me was that you really need to combine different areas of expertise, different aspects of what LLMs are good at, with what humans are good at, with potentially agents trained on specific visuals, et cetera, and combine those into even something as simple as making a slide deck about things that are publicly available on the internet. I wanted to give what seems like a very basic, easy-to-understand example of where LLMs might fall flat in our daily lives, because now I'm going to talk about where they fall flat in something that's much more complicated, where they are really effective, and how those things have been combined together. Because at the end of the day it boils down to the same problem: LLMs are able to solve some subtasks extremely well. And if you can thoughtfully engineer different agents handling those different subtasks, and combine them into an instrumented system that then uses more traditional software techniques or algorithmic techniques for the things LLMs are not good at, like, let's say, discrete reasoning or mathematical reasoning, you can create a system that
is very effective, and this was the approach the teams took. So we're going to focus on the multilang component of the system, because it ended up accounting for 70% of the vulnerabilities this entire system found. And we're going to talk about the boring aspects of this, the ones that have nothing to do directly with AI except for the cost, but that have been a fundamental part of software engineering for years and years: simply engineering, performance, and resource management. The first ten, maybe twenty, pages of the Team Atlanta report on their system are focused just on how they did performance management, because it was so crucial to the problem. Ultimately AIxCC, like many other things in software development and many other things that tie into software security, is an optimization problem. How do we optimize all of these different resources? How do we optimize all of these different agents? They took several key approaches, such as being able to reason over multiple challenges, or challenge projects as they were called, concurrently. They designed a system that was as fail-safe as possible, to ensure that even if there were issues in some components, the entire system would be able to continue moving forward. They, like I talked about, fully utilized their resource budget, and they also collected meaningful logs, which we'll actually get to dive into later, because this was again a requirement of the game design, and it allowed the competitors to better tune their system to utilize all of these resources. So they have their five modules for vulnerability discovery and program repair, and if we look at which modules were using the most resources, they again decided to focus the most resources on their multilang, general-purpose system. That is really interesting if you have a history in vulnerability discovery, because a lot of vulnerability discovery up to this point was heavily tailored to the idiosyncrasies of particular languages. So the fact that they found their multi-language component made the most sense to put resources on was a really interesting takeaway. They also introduced rate limits for their agents to ensure that different parts of their system weren't hogging all of the resources. They spent a ton of time thinking about how to create this system, and they used Kubernetes, a lot of existing resource management frameworks and other multi-node instrumentation frameworks, traditional software engineering, to build it.
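Team Atlanta's real budgeting and rate-limiting logic is in their open-source release; purely as a toy illustration of the idea, imagine every agent drawing LLM spend from one shared budget, with a per-agent rate limit so no single component can monopolize it. The class name, numbers, and agent names below are invented for illustration.

```python
import threading
import time

class SharedLLMBudget:
    """Toy tracker: all agents draw from one dollar budget, and each agent
    is rate-limited so no single component can monopolize the spend."""

    def __init__(self, total_usd: float, per_agent_calls_per_min: int):
        self.remaining = total_usd
        self.per_agent_limit = per_agent_calls_per_min
        self._calls: dict[str, list[float]] = {}
        self._lock = threading.Lock()

    def try_spend(self, agent: str, est_cost_usd: float) -> bool:
        """Return True if the agent may make this LLM call right now."""
        now = time.time()
        with self._lock:
            recent = [t for t in self._calls.get(agent, []) if now - t < 60]
            if len(recent) >= self.per_agent_limit:
                return False              # agent is over its rate limit
            if est_cost_usd > self.remaining:
                return False              # overall budget exhausted
            self.remaining -= est_cost_usd
            recent.append(now)
            self._calls[agent] = recent
            return True

# Illustrative numbers only -- not the real AIxCC budget split.
budget = SharedLLMBudget(total_usd=50_000, per_agent_calls_per_min=30)
if budget.try_spend("java-bug-detector", est_cost_usd=0.80):
    pass  # make the LLM call here
```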
So now we're going to talk a little bit about the multilang system. The proof of vulnerability, that thing I talked about earlier, is an input into a program that triggers the vulnerability. The way we designed the game to work was that you're triggering something like ASan or Jazzer. So you're essentially triggering a sanitizer: little bits of code your program is compiled with, additional instructions that check whether you're writing out of bounds in memory. If you've used anything like Valgrind or Dr. Memory to check for this kind of thing, that's going to be similar to a sanitizer. We built this on top of ASan and UBSan, sanitizers that are looking for out-of-bounds use of memory or uninitialized memory, and things like Jazzer for Java, which focuses on a different class of vulnerabilities, things like command injection. Jazzer will essentially instrument the code to see whether the user has been able to inject commands into, I'm sorry, I have a cat on me again, inject commands into an executable piece of code that the user is not supposed to be able to inject commands into. That actually gets to be very complicated when using sanitizers, because you have to have a recognizable string there, and then the systems, the CRSes, need to know what that string looks like, what that unauthorized command might look like. So this required a lot of effort on the part of the game designers, and it requires specific tailoring by the teams. The reason we used things like ASan and Jazzer is that they are commonly used in open-source projects today; Google's OSS-Fuzz actually uses them in its large vulnerability discovery systems. By building on top of existing work, we could make this as real-world as possible.
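Jazzer's actual hooks are more involved, but the idea the game relied on can be shown with a toy analogue: the harness uses a recognizable marker string as untrusted input, and a hook in front of the dangerous sink reports a finding if that marker ever reaches it. The Python sketch below is purely illustrative; it is not Jazzer and not a competition harness.

```python
# Toy analogue of a command-injection "sanitizer": if an attacker-controlled
# marker string reaches the exec sink, report a finding instead of running it.
MARKER = "INJECTED_CMD_MARKER"   # hypothetical magic string a CRS would aim for

class CommandInjectionFound(Exception):
    pass

def guarded_system(cmd: str) -> None:
    """Stand-in for os.system() with a sanitizer hook in front of it."""
    if MARKER in cmd:
        raise CommandInjectionFound(f"untrusted data reached exec sink: {cmd!r}")
    # os.system(cmd)  # the real sink would run here

def handle_request(user_supplied_filename: str) -> None:
    # Vulnerable pattern: user input concatenated straight into a shell command.
    guarded_system("tar -czf backup.tgz " + user_supplied_filename)

# A proof of vulnerability is just an input that trips the hook:
try:
    handle_request(f"; {MARKER}")
except CommandInjectionFound as finding:
    print("PoV triggers the sanitizer:", finding)
```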
The AIxCC teams, recognizing this, combined traditional approaches that leveraged ASan and these sanitization approaches, plus symbolic execution engines, with existing static analysis engines like CodeQL, Joern, and Infer, and they combined all of that with AI. This ensured that the teams were building on top of the state-of-the-art, not recreating it, and they were also using AI both to fill in the gaps between these tools and as standalone AI systems. So this is what the multilang system looked like for Team Atlanta. As you can see here, it looks like a giant fuzzer. There's a corpus manager, which manages all of the seeds, the different program inputs that are being mutated and fed back into the fuzzer. There are the mutation engines themselves, which take those seeds and mutate them in various ways. And then there's an executor, which runs the code with these inputs and sees whether a crash occurs or a sanitizer has been triggered. What's interesting is that in some cases they have just general-purpose fuzzers, general-purpose mutation engines that have nothing to do with AI, and in other cases they're using AI to actually mutate those seeds. So rather than just use AI as an out-of-the-box vulnerability discovery tool, they're combining it with existing program analysis tools to create something that's more performant than existing tools, something that takes it to that next level.
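The diagram itself isn't reproduced here, but the loop it describes is the classic one: a corpus of seeds, several mutators, most of them cheap byte-level mutators plus one backed by an LLM, and an executor that reports whether a sanitizer fired. The sketch below is a deliberately simplified, hypothetical version of that loop, not Team Atlanta's code; llm_rewrite_input and run_target are stand-ins you would wire up yourself.

```python
import random

def byte_flip(seed: bytes) -> bytes:
    """Ordinary, non-AI mutator: flip one byte."""
    if not seed:
        return b"\x00"
    i = random.randrange(len(seed))
    return seed[:i] + bytes([seed[i] ^ 0xFF]) + seed[i + 1:]

def llm_rewrite_input(seed: bytes) -> bytes:
    """Stand-in for an LLM-aided mutator that understands the input format
    (e.g. rewrites a header so the input reaches deeper parsing code)."""
    raise NotImplementedError("call your model here; return a new input")

def run_target(data: bytes) -> bool:
    """Stand-in executor: run the instrumented target (ASan/Jazzer build)
    on `data` and return True if a sanitizer fired."""
    raise NotImplementedError

def fuzz(corpus, iterations=10_000):
    mutators = [byte_flip, byte_flip, byte_flip, llm_rewrite_input]  # mostly cheap mutators
    for _ in range(iterations):
        seed = random.choice(corpus)
        candidate = random.choice(mutators)(seed)
        if run_target(candidate):
            return candidate          # proof of vulnerability: input that trips a sanitizer
        corpus.append(candidate)      # real corpus managers keep only coverage-increasing seeds
    return None
```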
Now, like I talked about, the multilang system accounted for 70% of the proofs of vulnerability they submitted during the competition, but the general-purpose fuzzer, the one that didn't itself use AI, accounted for 50% of the crashes found. We found this to be a really interesting takeaway, because on first glance it might seem to suggest that general-purpose fuzzing was just as effective, if not more effective, than using AI. But this is why really digging into the approaches matters; there is nuance here. We would have been more than happy to accept that as a conclusion of the competition: if the competition showed that AI really doesn't make a difference in vulnerability discovery, that's completely fine. However, what we found, and what this team found when they went in and looked at the results, is that even though the general-purpose fuzzer might have produced the final input that triggered the sanitizer, that input had likely at some point been mutated by the LLM mutator. So LLMs were absolutely playing a role. The other thing, which we'll talk about in a second, is that the LLM-aided vulnerability discovery tools ended up finding different kinds of vulnerabilities, more complicated ones that required weirder, more complex input structures to trigger. So the LLM was really helping to get deeper into the code and create inputs that would trigger very hard-to-reach code paths. The team concluded that you really need solid engineering, solid fundamentals, and traditional, effective approaches before adding in LLMs to truly harness their potential.
So we're going to talk about that LLM-aided mutator, MLA, and how it worked, because I think it's an excellent example of agentic approaches in vulnerability discovery. Here we have the overall architecture of MLA, and as you can see there are different components all interacting with each other, essentially different agents, and each of these agents is responsible for a different task. Team Atlanta did an excellent job describing what each of these agents was responsible for. There was an agent responsible for parsing the call graph and being the expert on how to navigate through a large code base: what thing is calling what thing, et cetera. For large open-source projects, for large projects in general, this is essential, because you need to understand how all of these different files, all of these different pieces of the software, interact with each other. You then had a bug candidate detection agent, a detective-style agent that looks into potential vulnerabilities and investigates them; they tailored an agent to be an expert in bug candidate detection. They also had an agent that built out the call graph over uncharted territory, the things you couldn't recover with traditional tools. If you've ever used something like IDA or other kinds of tools, even VS Code, it will sometimes fail to connect things, especially indirect function calls. So you have to have an agent that is tailored to fill in those gaps and become an expert on that. Then you have an understanding agent that gets the lay of the land: it understands where all of the entry points are, how the code project and the spec work, and some of the idiosyncrasies of the challenge. And then you had an agent that, instead of just creating payloads, inputs that could trigger vulnerabilities, actually generated Python code that would not just create payloads but mutate and change those payloads live. This is far beyond what any general-purpose fuzzer is able to do. And they were doing an excellent job of asking: what are LLMs really good at? Code generation for Python scripts. Let's apply that here. What Team Atlanta would say is that the real breakthrough in their MLA component is that it generates attack strategies, not just attack payloads. Each generator is essentially a function that condenses the security researcher's playbook, and this allows a generator to be used with very complex pieces of software that have very complex entry formats. The teams also spent a huge amount of time learning how they could best do prompt engineering. They took existing prompt engineering approaches, like system prompting, contextual prompting, role prompting, chain-of-thought reasoning, et cetera, and adapted them for reasoning about code.
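To make the "strategies, not payloads" distinction concrete, here is a hypothetical example of the kind of generator such an agent might emit for a length-prefixed message format. It is illustrative only and not taken from Team Atlanta's MLA; the point is that the model writes a small program that keeps producing and adjusting inputs, rather than returning a single payload.

```python
import random
import struct

def generate_payloads(n: int = 100):
    """A 'strategy', not a single payload: repeatedly build a length-prefixed
    message, then deliberately desynchronize the declared and actual lengths,
    re-deriving the header each time so the input still parses far enough
    to reach the interesting copy."""
    for _ in range(n):
        body = random.randbytes(random.randint(1, 64))
        declared_len = len(body) + random.choice([-8, -1, 0, 1, 0x7FFFFFFF])
        header = struct.pack("<I", declared_len & 0xFFFFFFFF)  # little-endian length field
        yield header + body

# Each yielded input is fed to the executor just like any other mutated seed.
for payload in generate_payloads(5):
    print(payload.hex())
```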
The prompt engineering is something I'd really encourage everyone to go check out in the open-source code projects, because you can get a great sense of how the teams were able to prompt different LLMs to produce effective results for code understanding and software understanding. And I want to use this opportunity to compare some of the prompt engineering techniques between two of the teams, to see how they did different kinds of prompt engineering. I'm hoping everyone can see this. This is an example of Theori's approach to prompt engineering; Theori is the third-place team. In this case, they give the LLM a set of instructions that's very well structured. They say: "A static analysis tool has identified a potential vulnerability in a modified version of FreeRDP. Your task is to analyze the code snippet and consider if the claimed vulnerability can be reached and triggered via malicious user input. You may need to make reasonable assumptions about a malicious input's control over the relevant data in practice." This is basically saying: you're going to need to assume that you don't know everything about reachability over a large code project. If you'll recall, Team Atlanta actually outsourced call graph analysis to several different agents. The prompt then goes on to give several examples. For instance, one of these examples is that a potential integer overflow computed from the actual length of the user input is unlikely to be triggerable in practice; however, if it is computed from decoded user input, it is likely to be triggerable. That is an example this prompt gives, along with details like "the reported vulnerability can likely be triggered via user input." And then there are specific instructions, like: you must respond with one of the following options, with no other output before or after it: likely or unlikely.
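Theori's actual prompts are in their published logs; the template below is a paraphrased reconstruction of the structure just described, fixed instructions, a worked "likely" versus "unlikely" example, the finding and snippet, and a hard constraint on the output, so treat the exact wording as illustrative rather than verbatim.

```python
TRIAGE_PROMPT = """\
A static analysis tool has identified a potential vulnerability in a modified
version of FreeRDP. Your task is to analyze the code snippet and decide whether
the claimed vulnerability can be reached and triggered via malicious user input.
You may need to make reasonable assumptions about a malicious input's control
over the relevant data in practice.

Example: an integer overflow computed from the actual length of the user input
is unlikely to be triggerable in practice; one computed from decoded user input
is likely to be triggerable.

Finding:
{finding}

Code snippet:
{snippet}

You must respond with exactly one of the following options and no other output
before or after it: likely | unlikely
"""

def build_triage_prompt(finding: str, snippet: str) -> str:
    """Fill the reconstructed template with a specific SARIF finding and code."""
    return TRIAGE_PROMPT.format(finding=finding, snippet=snippet)
```

Constraining the output to a single token like this is what makes the response cheap to parse and easy to use as a triage filter.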
All of these interactions are available online; you can scroll through and see the kinds of responses the assistants gave. Theori, for instance, used this as an initial pass on all of the challenges to triage for resources. If we go through the FreeRDP example, this was a case where we had a backdoor, a vulnerability that was obfuscated within the code base, and it required certain kinds of formatting, certain kinds of message formats in the inputs, to actually be triggerable. It came in the form of a diff that looked like this. The initial LLM pass produces an assessment that there is a likely vulnerability, and the reason given is that the function contains obfuscated code that allows execution to be triggered. What I thought was the most interesting aspect of this interaction is that you can see exactly how much money it costs to do one of these messages back and forth. For instance, in this case it cost 80 cents for one of Theori's additional agents to analyze and then respond with a specific structured input that they expected could trigger the backdoor. Their agentic approach was to decompose the tasks, structure complex outputs, and adapt to the models.
Because Theori took an LLM-forward approach, which we unfortunately don't have enough time to talk about, they adapted to the models as the models were coming out; they had two years in which they were constantly changing the different ways they would interact with and prompt the model. So, I'm going to go really quickly through the rest of this to be sure we have enough time. A couple of things I'll hit on very quickly: in some cases the teams ended up designing specific tools, modifying grep or modifying cat or things like that, such that they could be easily used by an LLM. So they created LLM-specific tools.
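The talk doesn't spell out exactly how those utilities were modified, but the general pattern is straightforward to sketch: wrap the tool so its output is bounded and summarized in a way a model can consume reliably. The helper below is a hypothetical example of such a wrapper around grep, not any team's actual tool.

```python
import subprocess

def grep_for_llm(pattern: str, path: str, max_lines: int = 40) -> str:
    """Run grep, but cap and annotate the output so it fits cleanly into an
    LLM context window instead of flooding it with thousands of matches."""
    result = subprocess.run(
        ["grep", "-rn", "--include=*.c", pattern, path],
        capture_output=True, text=True,
    )
    lines = result.stdout.splitlines()
    shown = lines[:max_lines]
    summary = f"{len(lines)} matches for {pattern!r}; showing first {len(shown)}.\n"
    return summary + "\n".join(shown)
```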
The teams also had varying approaches to combining static and dynamic analysis, and they structured this as part of their game strategy for how they wanted to account for the different accuracy weights. With Theori's approach, they actually had pretty low accuracy compared to Atlanta. But if you calculate out their score based on all of these metrics, that ended up not having a significant impact, because the accuracy multiplier had a four in the exponent. So rather than being cubic, it was one beyond that: quartic. (Apologies to all the math people in the audience.) What that essentially meant, if we go all the way back to the beginning and look at how these were being scored, is that by raising it to the fourth power, really bad accuracy was going to really kill you. But if you were getting around 50%, you would lose a significant chunk of points while still scoring relatively highly, as long as you were doing well across the other categories, especially program repair.
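The official scoring equation is published on the AIxCC site; purely to illustrate the shape being described here, patches weighted above proofs of vulnerability and an accuracy multiplier raised to the fourth power, a toy version might look like the following. The weights are invented; only the overall form is the point.

```python
def challenge_score(povs: int, patches: int, sarif_reports: int,
                    accepted: int, submitted: int) -> float:
    """Toy scoring sketch (not the official AIxCC equation): a weighted sum of
    submissions, scaled by a quartic accuracy multiplier."""
    w_pov, w_patch, w_sarif = 1.0, 2.0, 0.5       # patches weighted above PoVs (weights invented)
    base = w_pov * povs + w_patch * patches + w_sarif * sarif_reports
    accuracy = accepted / submitted if submitted else 0.0
    # Quartic falloff: 0.9 accuracy keeps about 66% of base, 0.5 keeps about 6%,
    # 0.1 keeps about 0.01% -- so terrible accuracy is fatal.
    return base * accuracy ** 4

print(challenge_score(povs=10, patches=6, sarif_reports=4, accepted=12, submitted=16))
```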
What I'd say is that all of these tools are open source. You can click through all of those different logs, all of the interactions that have been made available, and see the different ways that prompting these LLMs produces different results. There are really significant takeaways about how these prompts were engineered that can be applied to many kinds of tasks in software engineering, in creating agents that can reason over code and be force multipliers for existing software engineers. There are some really interesting neurosymbolic approaches that combine existing program analysis tools with LLMs, and there are very thoughtful approaches to creating agentic systems for the purposes of program analysis. But the other aspect of it is that, for the purposes of the competition, software vulnerability discovery and remediation, this is immediately applicable to the world of software defense, especially as AI coding becomes very, very prevalent. These kinds of tools can be integrated into the software development life cycle as part of LLM-driven commit review, integration into static and dynamic test frameworks, and specializing code-understanding agents on in-house software. So from a vulnerability researcher point of view, from an exploit development point of view, from a reverse engineering point of view, there's a lot of work here in the open-source tools that you can learn from and adapt to your needs. But just from a software vulnerability discovery and remediation perspective, these projects, these CRSes, are now open source and available for use. And I'd really encourage, this is the point I keep hammering in the talk, I'd really encourage everyone to check out the tools. I will tweet out a link to all of these blog posts, but the primary source you can go to for all of this is aicyberchallenge.com. It's all available on Google. And with that, I think I'm wrapped up. Sorry, Joe, for going a little over.
>> No, no worries. I actually went over and started you late, so I apologize for that. Round of applause. [applause]
[music]