
So thank you all for being here with me late on a Sunday afternoon. I appreciate it. I know this is the last talk of the day and people are going to be filtering out and heading home soon. I'm going to be talking about the past, present, and future of automated remediation. My name is Dan Della. I work for a small startup called Pixee. We're largely based in the Baltimore area; one of our founders is from this area. My role there is lead engineer on our automatic vulnerability remediation service, so we are focused on application security fixes, but this talk is going to take a broader view of the world and talk about remediation in general.
Since we're doing past, present, and future, we'll start with the past. To a security-oriented conference like this one, remediation has a specific meaning, or potentially many specific meanings, very much focused on fixing security problems and making them go away. And there are a bunch of security domains where we care about fixes: application security, obviously, where you fix your own code and fix third-party dependencies that have problems; container security; all kinds of things we want to fix. But what is common to all of these, when you take a broader view of automated remediation, is that we're really talking about code that is able to make changes to other code. And when you look at it that way, it's not really a new concept. In fact, it's kind of surprising how far back this concept goes. In the 1950s, Fortran was introducing what we would call compiler optimizations, and that really is code that changes other code. The focus at that time wasn't security; it was optimization, making sure your code ran as efficiently as possible on the hardware you had. What's really impressive to me as an engineer is how little computing power they had while they were doing these kinds of things. It's just incredible engineering; it puts me in awe and makes me a little ashamed of my own capabilities. But it's cool stuff. Looking ahead a little further, into the 1990s, Smalltalk introduced an IDE that was able to deliver highly directed but automated refactoring.
I think this is the kind of refactoring we all expect from our modern IDEs: you have a method name you want to change, and that change takes effect automatically across your entire codebase. It's something that can be done automatically, but it's fairly tightly scoped, and again not necessarily security focused. Although you may be using it to fix a bug or a vulnerability, it's really only indirectly security focused. Where things start to get more interesting is more recently, in the 2010s, when the tools developers use for linting and code quality start doing more than just reporting the problems they find; they actually start proactively fixing those problems. These tools are capable of identifying security problems, and some of them might fix small, very tightly scoped security issues in some limited cases. This is very useful to developers, and it's probably familiar to most of us today. Around this time we also see the introduction of open-source "codemod" libraries. Codemod is a pretty common term now, but these first-generation libraries, like jscodeshift, codemod for Python, and Spoon, have a somewhat different emphasis than the linters and other code rewriters. They're intended as frameworks that enable developers to write custom code changes. So if you have a particular rule or code quality standard you want to enforce, and you want that change to be made automatically, you can write a plugin using these frameworks. They also tended to have very large open-source libraries where people could contribute and share things that were useful to them. This is where things start to get interesting, because you can imagine people starting to think with a security focus: how could I write a codemod that fixes a common insecure pattern I encounter in my code?
Then in 2020 there's a project called C3PR, a name I wish I had come up with. The idea is that it sends pull requests that fix Sonar issues. This was an academic team building a GitHub bot; I assume many of you are familiar with Sonar as a code scanning tool. It finds code quality issues, and it also finds security issues. Their goal was to automate the remediation of those issues: taking the output of Sonar and creating pull requests that developers can merge. It's worth noting this was done very shortly before the explosion of generative AI, so they were doing it with deterministic code transformation, syntax tree transformation, which is really cool and very impressive. And then even more recently we see what we're calling second-generation codemod libraries: frameworks like OpenRewrite, LibCST from Meta, and Codemodder, which was actually developed by Pixee. These expand the capability of automated code changes, of automated remediation. We'll talk about each of these in a little more detail, but the takeaway is that these are very expressive code-changing frameworks capable of fairly sophisticated transformation. And again, they're pluggable: developers can write their own transformations and do things that are useful and interesting to them.
One of the big takeaways for all of these kinds of changes is that, for lack of a better word, they're relatively simple, or relatively easy to describe. What we mean is that they have very low contextual requirements. A lot of these changes are what you might think of as something like a peephole optimization in a compiler: it's really looking at a single line of code, or maybe, optimistically, at a function or method body. It doesn't need to see the entire codebase; it can operate on very limited information and still create a very reliable change, a change that in almost no circumstances will make you look at it and think, "Wow, I don't know what it was doing; I wish it hadn't done that to my code." The trade-off is that it's highly reliable and highly predictable but has very low flexibility: it's only capable of identifying certain shapes of code, certain very specific syntactic elements, and applying itself to those. And that means that, as security practitioners, when we look at the kinds of changes these things tend to make, it's just not that interesting. They can't solve the kinds of problems we feel we need to automatically remediate in order to make our code and our applications safer.
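To make concrete how narrow these peephole-style transformations are, here is a minimal LibCST codemod sketch. The rule it implements, rewriting calls to a hypothetical `insecure_hash()` helper, is invented for this example rather than taken from any of the libraries mentioned:

```python
import libcst as cst

class ReplaceInsecureHash(cst.CSTTransformer):
    """Rewrite calls to insecure_hash(...) into secure_hash(...).

    A peephole-style change: it inspects one call node at a time
    and needs no knowledge of the rest of the codebase.
    """

    def leave_Call(self, original_node: cst.Call, updated_node: cst.Call) -> cst.Call:
        func = updated_node.func
        if isinstance(func, cst.Name) and func.value == "insecure_hash":
            return updated_node.with_changes(func=cst.Name("secure_hash"))
        return updated_node

tree = cst.parse_module("digest = insecure_hash(data)\n")
print(tree.visit(ReplaceInsecureHash()).code)  # digest = secure_hash(data)
```

The transformer fires only on that exact code shape; a call reached through an alias or a wrapper would slip straight past it, which is exactly the inflexibility described above.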
So that brings us to the present. I'm going to show a diagram that I think a lot of us have in our heads about what self-healing software, or truly automated vulnerability remediation, should look like. Depending on your security domain, you're probably using SAST or DAST or other kinds of defect detection tools. There are also lots of other inputs you can imagine here: bug bounties, where people identify bugs that way; pen tests; there are lots of ways we can identify problems with code. In order to have a fully automated solution, we need to take the output of those processes, whether they're automated code scans or human-driven processes, and identify the asset those problems apply to, meaning a particular application, or maybe a particular container or a particular cloud environment. Then we want to go fix that problem, and for the purposes of this talk that's going to mean changing code. We want to change the code, and then we want that change to be automatically reflected in our production environment once it's merged and ready to go. Ideally, in this automated process, once you deploy the generated fix, the next time your scanner runs or the next time somebody does a pen test, they're not going to find that problem, because it's automatically been remediated. I think this is the mental model a lot of people have and aspire to when we think about self-healing software and automatic vulnerability remediation. But there are a couple of problems with this picture that I'm going to point out. One is that if you fix things that weren't actually bugs, you're potentially going to cause even more problems for yourself, or at the very least cause somebody a big headache.
The example I'll show here is a tool that has overzealously identified a reflected XSS vulnerability in your web form or web page, and you end up seeing something like this. This actually happens to me all the time, because I have an apostrophe in my name. I'm not going to name names, but a very large online web store sends me emails that look like this all the time, because they've overzealously sanitized. At best it's annoying for users; it's just wrong. Worse, you're annoying your developers, because they're making changes that shouldn't really need to be made. And at worst, you've potentially created a bigger problem: there are cases where this may actually introduce a security vulnerability where there was none before. So again, fixing things that aren't actually problems is itself a problem, and that's something we need to be aware of as we automate these processes. I'm going to propose a different picture here, which is a funnel rather than a loop. One of the first things we need to ask ourselves before automatically fixing something, or that this automated system needs to ask itself, is:
is this actually a bug? We've found in our own studies and our own experience that the common static analysis tools used in application security report 50 to 80% false positives, depending on the application they're looking at. If there's anybody in the audience using these tools as an application security practitioner, I'd expect that resonates with your lived experience. The other thing we need to ask as part of this process is: what is the real severity of the finding that has been identified? A tool might see a particular code pattern, something that looks like a problem, but if it doesn't have the full context, the business case, and where this fits into your architecture, it can't really tell how big of a problem it actually is. Maybe you've identified a problem in a web form, but it's a purely internal tool that is not expected to be exposed to an external network. That's important information we need in order to determine how badly this thing needs to be fixed. Then we need to determine what the safe and effective fix is. For the things that are real problems, what does the solution look like? We feel we're pretty good at this. Coverage is not 100%, but there are a lot of vulnerability classes that are remediable, where you can propose an actual fix, do it automatically, and have it be acceptable to a developer. So we're funneling down into actually making fixes, and then we need those fixes to be automatically deployed. I think a lot of developers would agree that the right way to do this is as part of CI/CD processes, but there are a lot of teams out there that may not have the maturity needed to completely automate the path from a change being merged to it being deployed and the problem no longer being in production.
There are also compliance requirements where there may be reasons to keep a human in the loop rather than deploying automatically: something needs to be approved, and the person who made the change needs to be different from the person approving it. So that's another part of this process to be aware of. And then, of course, we need to confirm that the problem is gone. In some cases this might rely on your CI/CD processes: if you automatically rescan after deploying, you would expect the finding to go away once the fix is merged. But again, we pointed out some processes that are very manual. If it's a pen test, or if you've relied on a bug bounty, then a human has to come back and confirm whether that bug is gone or not. So these are all things that need to be part of our picture as we move toward our goal of true automated remediation:
which parts of these can we automate, and how can we make things better? Looking at the top of the funnel, at accuracy and severity, I'm going to take a moment to talk about AI-based triage. Again, remember that these scanner tools are identifying a huge volume of false positives. There are a lot of reasons for that. There are some incentives for the tool developers to produce that kind of behavior, and I'm not here to judge that, because I've worked as a tool developer myself. Part of the problem is that when you're evaluating a security tool, you probably run it against some known test set with known problems in it, and if the tool does not find the problems you know about, you're not going to think very highly of it. So the bias is toward detecting true positives, but the effect is that the tool detects many false positives. I think there's also a bigger, or at least different, incentive, which is the tool vendor's "is this thing on?" problem: when you scan something for the first time while evaluating a tool, if it doesn't report anything at all, then again you're not going to think very highly of that tool, even if there really aren't many true problems to identify. What I'm really trying to say is that I don't think the tool vendors have a big incentive to solve this problem themselves. But there are also technical issues: a static analysis tool is looking at source code. It doesn't have access to your business context or your technical context. It doesn't know that this is a microservice talking to five other microservices. It can only see the source code, so its view of the world is limited, and that limits what it's able to tell us.
What we've found is that we can leverage generative AI to triage tool findings very effectively. This is an example of an insecure randomness finding in Apache Roller, an open-source Java application. If a human looks at this code, you're going to see: okay, the tool is telling me there's insecure randomness, and I can see that credentials are involved somehow, so this is probably something I want to pay attention to and potentially remediate. But what's really cool in this particular example is that our tool was able to use broader context. By looking at the actual dependencies used by this application, it determined that even though the code itself is not using a secure random source explicitly, the dependencies are set up such that a true cryptographic random is guaranteed. And so our tool determined that this is a false positive. That's a very powerful thing, because it means you don't need to distract a developer with trying to fix it, and if you're using an automated remediation solution, you won't get a spurious fix from it, where a developer now has to review a change that isn't meaningful or useful, a change that is irrelevant. We think this solves a big part of the problem in terms of filtering true positives from false positives.
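To make the finding class concrete, here is a hedged Python analogue of the pattern (the talk's actual example was Java code in Apache Roller, and the variable name here is invented for illustration):

```python
import random
import secrets

# The pattern a scanner flags: random is a general-purpose PRNG,
# not suitable for security-sensitive values like credentials.
reset_token = "%032x" % random.getrandbits(128)

# The typical remediation: draw from a cryptographically secure source.
reset_token = secrets.token_hex(16)
```

As the talk notes, context can still turn such a finding into a false positive, for instance when the surrounding framework already guarantees a cryptographically secure source.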
But this triage process also applies to severity. We can take a tool finding that is listed as high severity and determine, based on context, that it's actually low severity, and that's going to affect how you prioritize and treat the issue. So we think that solves the top of the funnel, by limiting findings to the fixes that truly are problems needing to be fixed. In terms of current practice, I'm going to shift gears a little and show some of the state of the art in automatic code fixes and automatic remediation. A big part of the picture is modernization, which has huge security implications: you want to make sure you're using the latest frameworks, ones that are still supported, have security support, and are being regularly fixed and improved. In a lot of cases, migrating from one version of a framework to another is not as simple as bumping a version; there are very large API changes that can have huge impacts across your application. There's an open-source project called OpenRewrite, developed by a company called Moderne, and they
have spent a lot of time and very impressive engineering effort building tools capable of these very large-scale refactorings and migrations. They do this mostly, at least when they started, with almost entirely deterministic syntax tree transformation. I believe they have recently started using generative AI, though I'm not sure in exactly what capacity. Again, this solves the problem of making large, sweeping changes to a codebase safely and effectively in order to modernize the frameworks being used. In the more explicitly security-focused domain, there's a framework called Codemodder, which again was developed by Pixee. The original intent of Codemodder was to be both a detection and a fixing framework: the tool used an open-source rules engine to identify potential security problems in the code and then fed those findings into language-specific syntax transformation libraries, JavaParser for Java and LibCST for Python. This was intended to proactively make fixes to codebases to harden them against common security vulnerabilities. What you're seeing here is an example of a SQL injection vulnerability being remediated. This change is entirely deterministic code transformation; there's no generative AI involved in it at all. You can imagine there's a lot of engineering effort that goes into making a fairly sophisticated and safe change like this, but it's really cool to be able to harden and secure against these common security problems.
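As a hedged sketch of what this class of deterministic fix looks like in Python (an illustrative before/after, not Codemodder's actual output):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
name = "alice'; DROP TABLE users; --"  # hostile input

# Before: input concatenated into the SQL string -- the injectable pattern.
# conn.execute("SELECT * FROM users WHERE name = '" + name + "'")

# After: the deterministic fix rewrites it to a bound parameter,
# so the driver escapes the value and the injection is neutralized.
rows = conn.execute("SELECT * FROM users WHERE name = ?", (name,)).fetchall()
```

The shape of the rewrite is mechanical, which is what makes it tractable for a syntax-tree transformation; the engineering cost is in covering every query-building style a real codebase might use.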
In a completely different vein: it's 2025, and we're seeing the proliferation of generative-AI-driven software engineers, copilots, and pair programming tools. One particular example is called Aider, which is open source. The idea is that Aider is a pair programmer: you give it instructions in natural language, much as you might instruct another engineer, and it goes and makes changes for you. Actually, if I could just get a show of hands, I'm not sure how many developers there are in the audience, but how many of you have experimented with tools like this: GitHub Copilot, Cursor, Claude Code? Okay, so these things are really starting to take off. The purpose of this talk is not to argue about how effective or how secure they are, or whether we're all going to be replaced by AI agents next year, but they have really changed the game in terms of what these tools are capable of and the kinds of changes they can make. One interesting observation I'd like to point out about these pair programming tools, at least in my experience, and I think a lot of the community seems to agree: they're really pretty effective at greenfield development. If you just have an idea, you want to create something new, and you can describe it reasonably well, maybe in the terms a product manager would use, it will do a really pretty good, almost scarily impressive, job of writing an app for you that actually kind of works. I think it's amazing how quickly we've lost sight of how incredible that is. Imagine saying to yourself five years ago that there's a computer program you can run on your laptop that can just write a fully-fledged web application and get it up and running. That's very impressive. However, I do think these tools struggle a bit in environments with existing code. They struggle with high-context changes, where they need a lot of understanding of what the code does and how it fits together with other components. They may not be able to fix bugs unless you already roughly know where the bug is and where they should look.
So these tools are very impressive, but not necessarily capable of making very sophisticated, very high-context changes quite yet. That does raise a question, though: as we look at automatic vulnerability remediation and want to make changes to code, how do we decide whether the solution should involve determinism, meaning traditional code transformation libraries and frameworks, versus generative AI? I want to point out a few comparisons. For code transformation, think jscodeshift or LibCST or the other AST/CST transformation libraries: the changes are in general highly predictable and highly reliable. If you run one a hundred times on the same code, it will always do the same thing, and you should know exactly what it's going to do. The downside is that it's highly inflexible. If somebody has not explicitly programmed your transformation for the code shape it's about to encounter, it may not do anything at all, or worse, it may do the wrong thing. It requires engineering to build the right solution, and it becomes especially difficult to support very broad use cases. The effect is high marginal development cost. If I'm working in Python and want to remediate SQL injection automatically, I have to support all the different ORMs I know about, all the different ways of building queries and using APIs, and that's just for one language and its many frameworks. If I want to build a new rule that remediates reflected XSS or something else, that requires a whole new engineering effort to get up and running. On the other hand, we have generative AI, which I think we would all agree is fairly unpredictable in its output, and also fairly unreliable, at least with naive approaches. But the flip side is that it's highly adaptable. I don't need to write five or ten different rules to tell it to remediate SQL injection in Python; the model itself likely has enough context to take whatever it's looking at, extrapolate, determine what framework it is, and come up with a passable change. The effect is relatively low marginal development cost, because I can build something very adaptable that handles a large number of edge cases with a lot less effort than traditional code transformation would need. And in both columns: sometimes it's wrong. Both approaches will be wrong in certain cases. The code transformation will miss edge cases I haven't accounted for and make some wrong transformations if I haven't fully accounted for different code shapes, while generative AI is going to give me garbage sometimes, complete garbage, or sometimes just bad fixes, or it's going to add a bunch of stuff to my change that I don't want. So, to enumerate that a bit further, there are a whole lot of failure states we need to account for if we're going to build automated remediation solutions using large language models.
solutions using large language models. Uh, and we've uh listed a couple of them here, you know, but it's all the way from the from the layer of like, oh, I just like couldn't make this network connection right now or I timed out or I used too many tokens or something like that. Um, if we're building software using generative AI as a component, we really need structured data. And there's oftentimes where the models struggle to give us the structure that we expect. Um, one of the approaches that has been explored a lot in open source and and also with ourselves is asking the models to generate diffs for fixes because if you can tell it to structure its changes
in a particular way and you can get a diff, then that's very easy to take and apply to a codebase. But uh that's a fairly complicated task for a large language model. It's going to do things like give you a diff that is not a valid diff or it's going to give you a diff that like has invalid code or changes that you don't want. Uh you know there's this term hallucinations where it's just going to make stuff up and you don't know what that's going to be. Um and then specifically for code changes we see a lot of things where you know it's trying to be very helpful in a lot of cases but it'll put placeholder code
where a developer needs to come in. it it thinks that a developer needs to come in and finish the job or it's going to add to-dos for you or it's going to remove your comments that you don't want it to remove. Uh it can break APIs. It can do all kinds of things. And so for the failure states that are enumerated just here and this isn't exhaustive. If you assume a reasonably, you know, good success rate of 90% for each of these things, if you were doing just like a singleshot request to to make a fix, when you account for all those things, the probability of that fix being correct is really pretty low. It's 34%.
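As a rough sketch of that arithmetic, assuming roughly ten independent failure modes that each pass with 90% probability (the exact count on the slide isn't reproduced here):

```python
# Probability a single-shot fix survives every step,
# if ~10 independent steps each succeed with probability 0.9.
p_success = 0.9 ** 10
print(round(p_success, 2))  # 0.35, in line with the ~34% quoted
```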
I can speak to this from personal experience: when you try to apply fixes across large codebases or large numbers of findings, you're going to see that it just doesn't do that well. However, there are a lot of opportunities to use really clever software engineering to improve those results. I'm going to pause for a quick aside here. Especially when we talk to security practitioners, there's a pretty large amount of skepticism about generative AI, and I think there are good reasons for that. I'm not going to ask the audience to raise your hands and say whether you agree. But I also think there's this image of somebody just pasting a finding into ChatGPT, saying, "Hey, give me a fix," throwing the result into an editor, creating a pull request, and sending that to production. I'm not saying people don't do that; I'm sure people are doing that. But when you're building a product, building software that uses generative AI as a tool, it actually requires a lot of engineering, and it's not doing anything as naive as "here's the vulnerability, just give me some output and I'll throw it into code." There are a lot of systems, a lot of feedback loops, and a lot of things we've built to constrain the output, direct it toward correct fixes, and then validate that those fixes are actually good, correct, safe, and effective, all before any of that is ever presented to an end user in an automated process. So there very much are good engineering practices and clever things we can do to dramatically improve this rate of success.
For example, in some of our own internal benchmarks, I've got a benchmark with a large number of SQL injection vulnerabilities across a lot of different languages, and I see a 95% success rate based on our own metrics. That's all engineering: using these models as a component, but building a lot of software around them to constrain the solution. Another approach we've explored is using hybrid approaches to code changes, where it's not solely generative AI making the change. The naive approach we've maybe tried in the past is saying: I want the LLM to give me a diff. I'm going to tell it what the problem is and where the problem is, give it some good context, tell it how I think I want it remediated, and ask it for a diff. What happens is the LLM gives you a diff wrapped in JSON, because you need structured output to be able to work with it. And that problem is really pretty difficult for an LLM to solve reliably, because you're asking it to generate JSON, to generate a diff correctly, and to get the fix right; there are a lot of moving parts and a lot of tasks, and doing that as a single-shot prompt is not necessarily going to be very effective. Something we've found to be more effective, and that opens up a lot of possibilities, is that instead of asking for the code, or asking for a diff, we ask the model to generate the instructions a human would use to remediate the vulnerability. We can not only do that in a human-readable way; we can ask it to also generate machine-readable instructions. This might look something like "remove the statement on line 39" or, in this example, "add string parameter foo to line 40." You can do a variety of different things with this output. One thing that turns out to be pretty effective is to feed it back into another model or another prompt and use it as the input for the change, rather than having one giant task to do. But another thing we've actually done is, when it's in a machine-readable format, we can feed it to one of the transformation libraries we talked about before, like jscodeshift or LibCST or JavaParser, and use that context to direct a change that would have been difficult to do otherwise.
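Here is a hedged sketch of what those machine-readable instructions might look like and how they could drive a deterministic apply step; the JSON field names and the tiny line-based applier are invented for illustration, not Pixee's actual schema:

```python
import json

# Hypothetical LLM output: structured edit instructions instead of a diff.
llm_output = """
[
  {"action": "remove_line", "line": 39},
  {"action": "insert_after", "line": 40, "text": "    query = build_query(foo)"}
]
"""

def apply_instructions(source: str, instructions_json: str) -> str:
    """Apply simple line-based edits deterministically (1-indexed lines)."""
    lines = source.splitlines()
    # Apply bottom-up so earlier edits don't shift later line numbers.
    for op in sorted(json.loads(instructions_json), key=lambda o: -o["line"]):
        if op["action"] == "remove_line":
            del lines[op["line"] - 1]
        elif op["action"] == "insert_after":
            lines.insert(op["line"], op["text"])
    return "\n".join(lines) + "\n"
```

In practice you would hand structured instructions like these to a real syntax-tree library rather than doing raw line edits, but the shape of the data is the point: the apply step stays deterministic.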
In these cases, we're relying on the LLM for its problem-solving ability and its context-gathering ability, but not necessarily for its code generation ability. I think this is really pretty cool, and it's a very effective way to make changes that are very reliable and, if not quite deterministic, highly predictable. To shift gears to something else we need to consider when doing automatic vulnerability remediation: different kinds of security fixes have different what we refer to as blast radius. There are changes that may be very simple to describe whose impact is limited to just the code you're changing right there, and there are changes that may break your entire application. For one example of the former: if you disable external entities to prevent XXE or another XML-based vulnerability, there's probably almost no code out there relying on external entity expansion or resolution. This isn't a real wager, but this is a change that can be made very safely and very reliably; even if in theory it could break somebody's code somewhere in the world, it is highly unlikely to do so for most codebases. For a different example: if I take some random line of code, see that it's using MD5 for password hashing, and decide, "I know what I'll do, I'll change that to use bcrypt, because bcrypt is really secure," your app is just never going to work again, right? You're going to have to do a database migration to make that work, and all of your current users are going to be affected until you do. So even though both of these changes are really simple to describe, and both can be done with very simple deterministic transformation, they have very different blast radii.
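A minimal sketch of why the bcrypt change carries that blast radius, and of the kind of migration shim a complete fix would also need; it assumes the stored legacy hashes are MD5 hex digests and uses the third-party `bcrypt` package:

```python
import hashlib
import bcrypt  # third-party: pip install bcrypt

def verify_and_upgrade(password: str, stored: bytes) -> bytes | None:
    """Return the hash to store if the password is valid, else None.

    Simply swapping MD5 for bcrypt breaks every existing user, because
    none of the stored hashes are bcrypt hashes. A real fix has to keep
    verifying legacy hashes and re-hash on the next successful login.
    """
    pw = password.encode()
    if stored.startswith(b"$2"):  # already migrated to bcrypt
        return stored if bcrypt.checkpw(pw, stored) else None
    if hashlib.md5(pw).hexdigest().encode() == stored:  # legacy MD5
        return bcrypt.hashpw(pw, bcrypt.gensalt())  # upgraded hash to store
    return None
```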
We need to be able to discern that when we do automated remediation. I think the second change is something we aspire to do correctly; I think it will be within reach for us sometime soon, but it's not something we're capable of doing today. That raises the question, though, of what percentage of the problem we can address today. In our own case studies, looking at customer and open-source data, we've found that about 50% of findings can be triaged to a false positive or a low severity, something a developer or a security engineer doesn't really care about, where it doesn't matter much whether it gets resolved or not. And we've found that 38% of the findings from an average SAST scan are something we can remediate automatically, whether using deterministic processes or generative AI processing. If you expand beyond the security domain and look at linter-type findings, code-quality-type findings, we can do even better than that. Okay, so that brings us to the future. This is where we get to make some educated guesses about what's going to happen next and what we'll be capable of doing next.
The first prediction I'll make is that generative AI is going to continue to be very important for automated vulnerability remediation. I think we are very near the point where generative-AI-based code changes are almost completely commoditized: there will be all kinds of tools capable of making code changes from reasonably descriptive prompts. Out in the community you'll see a lot of discussion about AI scaling laws: how much better can these models really get? Have we tapped out everything LLMs can do? Do they just not have any more data out there to make them improve? I think there are some good arguments there, and I'm not going to weigh in, because I am not an AI expert. The one prediction I will make is that these models are going to get cheaper and cheaper and faster to run, and that the cost of prompts and tokens, in both money and time, is just going to continue to go down. When you look at it that way, I think what we'll be able to do is build very small, very targeted software components that use generative AI under the hood, and piece a whole lot of them together to build more and more powerful, more context-driven fixes and all kinds of things. Specifically in the remediation domain, that's going to become more and more powerful. Our CTO and co-founder, who has given this talk in a variety of settings, has compared this to a stretch goal: what if you imagined a model scaled down to the size of a transistor, and you had the power of combining these things in very complex and interesting ways?
So I do think that using generative AI as a component of software is going to be more and more of a thing going forward, and that's going to enable more and more interesting things. However, what I think is really going to make automated remediation more effective is adding more context to the kinds of changes we make, and using more and more outside sources of data to inform those changes and how we make them. You can imagine, looking forward a little, that a very intelligent system would be capable of integrating context from a whole bunch of other systems. Maybe it wants to read Slack messages that discuss a particular feature. You have Jira tickets that describe how something was designed or how something should be fixed. We could have other knowledge stores it looks at, with coding standards or security policies. We anticipate getting feedback from the SCMs themselves, like GitHub or Bitbucket: did this build pass, and what was the failure if it didn't? And then tighter and tighter integrations with the SAST tools themselves. Getting more and more context, and bringing all of it to bear on this problem, is going to help us generate more effective fixes. There's also, in the generative AI and LLM development world, the concept of agent tools, which give the models the ability to choose what kind of information they want and where to look for it. A model may choose to look for context in different places, or go out to the internet, or maybe it wants to generate some code that it can run itself and test out. And then we also anticipate interaction, where maybe a developer leaves a PR comment that says, "Hey, this fix looks pretty good, but I need you to change this little thing," and that's automatically taken into account by the system. So we think where we're going is something we're calling the magical fix cognition stack, where we have all kinds of sources of data, and where the really interesting problems for us are going to be in integrating all of those sources. There's going to be a huge amount of competition in this area: who can gather the most context from your organization, from your tools, and from all these other sources, to generate the most effective fixes we can. The more information we have, the more context we can use in remediation.
There's the business case, or the role of the business component that needs the fix. I think all kinds of information will become available that we can feed into these processes, and at the end, the model does whatever it wants with that, which in a lot of cases we hope is fixing a security problem. So, for very high-level predictions, what do we expect to see? In application security, we expect a shift toward using intelligence to drive remediation-centered solutions rather than just detection. We're already seeing that remediation is very important: it's not enough to just find problems; we need to fix them as well. We also expect a move away from fine-tuned models when using generative AI. We feel there's a lot more to be gained from using reasoning models, relying on the context and knowledge that already reside in those models, rather than fine-tuning them ourselves. And we expect a dramatic rise in the percentage of vulnerability classes we can fix, with fixes we'd consider magic today becoming commonplace in the very near future. For that, I'd refer back to the example of changing the algorithm used for hashing passwords: maybe we can't do it today, but maybe next year that will be a very effective and reliable change we can make. So, that's my talk. My timer says I've only been going for 25 minutes, but it's quarter to four, so I'm not sure how to square that circle. I'm happy to take time for questions. I know there's another wrap-up session in just a few minutes, but thank you again for being here today.
[Audience question, inaudible.]
So are you asking whether it's a compliance requirement that you're required to use automated remediation?

That's a good question. I'm not too sure about that. I think the expectation is that as a security practitioner you need some way to remediate these things, whether that's done manually or by another tool. Right now, I don't know that anybody is dictating that. But yeah, good question.
[Audience question, inaudible.]
So you're asking if Sonar itself provides its own remediation solution. Yeah. Oh, I see, for the actual detection part of it. Yeah, I do think that security vendors and SAST tool builders are going to start using generative AI in combination with their traditional static analysis processes. I think they'll use it to fill in places where not enough context is available to a traditional static analysis tool, and that should lead to better and more accurate results. I'm not sure how many are looking into that right now, and I'm not sure whether it will be the existing vendors that do it or some new players that come along and build new things they can sell as more accurate and more reliable. But yeah, that's a good question. Sorry, I saw that one in the back first.
Yeah, and let me clarify, because I think I may have misspoken. Well, no, we do use multiple models from multiple model providers. But what I really mean is having multi-step prompts, where it's not just one giant prompt fed to the model with a nicely packaged answer expected at the end. We need to break that down into different problems: you ask a question here, and depending on the answer there's a decision tree, and you do something different here or there. But to answer your question directly: yes, that increases the cost, because we're spending more input tokens and more output tokens, and depending on the models involved at different steps, some may be more costly than others.

For the models that you use: you mentioned reasoning models over fine-tuned ones, but do you provide particular extra information? I'm not sure whether that counts as a different kind of tuning. And which models have you found useful for which parts of the chain of prompts?

Yeah, good question. We are not using fine-tuning, and just to be clear on terminology, fine-tuning would imply that you have a set of training data that you apply to an existing model to refine it in a particular way, making it more directed toward a particular task. We have not explored that, and, it might be unfair, but we have a bias toward thinking it's not going to be super effective, because the foundational flagship models already know a pretty good amount about security. What we do, which is what you're asking about, is provide many examples within our prompts. This is referred to as few-shot prompting: maybe I have three examples of what a fix for SQL injection would look like.
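A hedged sketch of what that few-shot structure might look like in an API call; the wording and examples are invented for illustration, not Pixee's actual prompts:

```python
# Few-shot prompt: worked vulnerable/fixed pairs precede the real finding.
vulnerable_snippet = "cursor.execute('SELECT * FROM t WHERE id = ' + uid)"

messages = [
    {"role": "system",
     "content": "You remediate SQL injection. Change only the flagged line."},
    # Worked example: vulnerable input and its parameterized fix.
    {"role": "user",
     "content": "cursor.execute('DELETE FROM t WHERE name = ' + name)"},
    {"role": "assistant",
     "content": "cursor.execute('DELETE FROM t WHERE name = %s', (name,))"},
    # ...two or three more worked examples would go here...
    # Finally, the actual finding to fix.
    {"role": "user", "content": vulnerable_snippet},
]
```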
And then yes, we also try to provide as much context as possible: here's what the vulnerability class is, here's what typical remediation examples look like, here's how you want to fix it in this case or that case. We find that to be very effective, so we don't feel the need to do fine-tuning. Did that answer your whole question? If I can share which models? Oh, yeah, sorry. I talked about two different kinds of prompts a couple of slides back, if I can get back to them. When we're building the instructions, which may be fed to another model, we found the reasoning models to be very effective. When it comes to actually making the code changes themselves, I think the reasoning models are overkill, and they don't necessarily produce the best results either. So the code change itself may be made by a model like GPT-4o; for a while it was GPT-4 Turbo. But when you need to come up with a plan and gather as much context as you can, the reasoning models seem to be the best choice. Yes, question there.
Sorry, can you repeat? How much? Oh, okay: how much hallucination? It's sort of a cop-out answer, but it depends. It depends on how you're prompting and how you're building these chains of reasoning and chains of processing. If you were just to go to ChatGPT with a reasonably complicated piece of code and say, "Hey, I think there's a SQL injection vulnerability on line 100, please fix it for me," I'd say there's a 50-50 chance that the model, even if it doesn't get the fix wrong, is going to add something you didn't want it to add. You need to be very specific in constraining what you're asking for, saying things like: don't add or remove comments, don't add or remove whitespace, don't try to fix that other problem over there, I'm only asking you to look at this one thing. It will be very proactive. I don't want to humanize it, but it seems like it's trying to be very helpful, and I appreciate that; when you're trying to build something very automated and very targeted, though, that's not helpful at all. When you really constrain the solution space with good prompting, you do still see hallucinations, I think, but at a much lower rate. And the way we account for that is by building in feedback loops and asking the model to evaluate itself on how well it completed the task. And, sorry, is that my... am I all done? Yep, all done. Sorry. I'd like to take one more question, but I'm getting the sign that we're all done here, so I'd be happy to talk to you afterwards. Thank you very much again for all your time, and enjoy the rest of the conference.

[Applause]