
now please join me in welcoming Joe and Chandrani with their talk.

All right, so welcome everyone to the last session between you and happy hour. Today we're going to talk about a big question: what if your security team could automatically fix your code vulnerabilities? Think about this: AI can go through code faster than any human, potentially finding and fixing your vulnerabilities. But we know that AI fixes are still imperfect, and we need to manage our expectations about their performance. According to the SWE-bench benchmark, even GPT-4 could only fix about 2% of complex bugs. That means we need to use AI with oversight
and still keep humans in the loop. Today we're going to talk about strategies for increasing our chances of finding and fixing correctly, and about where to focus effort. This problem is actually two distinct issues: finding vulnerabilities and fixing them.

First, a little background on finding vulnerabilities the traditional way. Traditional methods are manual code reviews, static application security testing (SAST), DAST, fuzzing, external bug bounty programs, incident reports, and pen testing or red team reports. The issue is that these often generate a lot of false positives, which can erode developer trust when developers see a potential vulnerability that isn't actually there. Additionally, coverage of vulnerabilities can vary depending on the type of detection you're using, the process can be time-consuming, and manual reviews are slow and prone to human error. And incident reports, you know, are the ones you don't want, because by then the vulnerability has already been hit. So we're going to see if we can use AI in this process. The goal state is to increase our coverage, increase our true positives, and decrease our false positives. We're still not going to be perfect in our coverage of all vulnerabilities, and there are at least two ways of
finding vulnerabilities using AI. One is an AI-only approach: just using a prompt and feeding your code into it. Another is some sort of fine-tuned model, like GPT-3.5 fine-tuning; we tried this with a few hundred examples of real vulnerabilities and fixes, and it wasn't that great. You can additionally use a prompt-driven strategy to supplement your results from SAST. And where can you do this in the process? You can do it at the pull request stage, or you can check your code retrospectively: your whole code base, entire repos. That's a harder challenge, but that's what we did. The key is that you can build on your existing frameworks and not throw them to the wind entirely.

Now let's talk about non-AI fixes. Traditional methods of fixing vulnerabilities without AI are just people in seats doing work. There are problems of prioritization: a lot of times you don't know what priority a vulnerability should get. You also need a skilled team, fixes are time-consuming, and humans are prone to error. For example, your team issues a patch for a vulnerability, the patch doesn't work, and your team has to issue another patch. Also, the same type of vulnerability can show up across multiple repositories, say if a bug bounty hunter finds one and runs it across all of your different products. Using AI, we know we can generate fixes for at least simple code vulnerabilities. We would still have a human validate the fixes, and we still need our expert team, but now they have more time to resolve the complex fixes AI can't handle. There are commercial solutions out there, but we're going to talk about why you might want to build your own
solution. You will almost always have proprietary coding languages, styles, and libraries. For instance, your team may have an XSS-fixing library that you use to make sure your data gets sanitized, and you don't want to depend on an open source solution when you're not sure it will be maintained. There's also a large cost associated with commercial solutions: if you have a team of hundreds to thousands of developers checking in code, the cost of a commercial solution can run into millions of dollars a year. One of the most important parts is that you want to be able to customize the measurability of your fixes. You want to be able to measure your historical performance at finding vulnerabilities and fixing them, and a lot of times you already have those resources at your disposal: if you're tracking your fixes through Jira and GitHub, you can extract them programmatically. And finally, you want to grow internal expertise in AI, because you don't want to be dependent on commercial solutions. With this, I will pass it over to Chandrani.

Thanks, Joe. So now we talk about the solution journey. This was a very new area for us; we did not really know what would actually work or what would give us a good result, so there
was a lot of trial and error before we could say, okay, this is an accepted good solution. The work and results we are presenting here were done on GPT-4, but the methodology should be model-agnostic. First we needed to do a PoC: we needed to establish that this very thing we were trying, using an LLM to find vulnerabilities and fix them, would actually work. For this we started with a zero-shot prompt engineering method, where you rely on the intrinsic knowledge the model already has and trust that it will give you the most optimized results. As input we created a dummy repo with some very obvious, straightforward vulnerabilities in it and said: find and fix whatever you can. Obviously the system prompt was not quite that simple; in the talk just before this one, I think by OpenAI, they mentioned that you need to praise your model, so we did the same thing. You say: you are an expert AI security engineer, you are very methodical, and so on. It was able to find all 8 to 10 vulnerabilities we had injected, with very good precision, and it generated great fixes for them. So with high hopes we wanted to apply the same technique to our production repos. It is still zero-shot, but we changed the input strategy: we are not giving any additional examples or context in the system prompt, but we are making it a little more advanced. We looked through our Jira and found
what our top 15 vulnerability types were. We picked those and said: give me only vulnerabilities of these types. We gave a list of languages, so that when the model generates a fix it has that language awareness, and we gave an output response schema: don't just give me a free-form response, give me only these fields, in this particular format. Everything seemed to be going great; it produced a lot of issues in the format we expected. But when we manually went through the results, there were a lot of false positives. And not only that: the fixes it generated were very much hallucinated. For example, it would say, include a sanitize function, but it would not actually include a proper definition of that function.
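A zero-shot setup like the one just described (persona praise, a vulnerability-type allow-list, the repo languages, and a strict response schema) might be sketched as follows. Every field name and prompt string here is illustrative, not the speakers' actual prompt.

```python
import json

# Hypothetical response schema: the fields and their descriptions are
# invented for this sketch, not the team's real schema.
RESPONSE_SCHEMA = {
    "file": "path of the file containing the issue",
    "vulnerability_type": "one of the allowed types",
    "line_range": "start and end lines of the vulnerable code",
    "explanation": "why this code is vulnerable",
    "fix": "complete replacement code, not a placeholder",
}

def build_zero_shot_messages(code, vuln_types, languages):
    """Assemble chat-style messages for a zero-shot find-and-fix request."""
    system = (
        "You are an expert AI security engineer. You are methodical and "
        "precise.\n"
        f"Report only vulnerabilities of these types: {', '.join(vuln_types)}.\n"
        f"The code base uses: {', '.join(languages)}.\n"
        "Respond only with JSON objects using exactly these fields:\n"
        f"{json.dumps(RESPONSE_SCHEMA, indent=2)}"
    )
    user = f"Find and fix vulnerabilities in this code:\n```\n{code}\n```"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

msgs = build_zero_shot_messages(
    "document.write(location.hash)",  # toy DOM XSS sink for illustration
    ["XSS", "SQL injection"],
    ["JavaScript", "Java"],
)
```

The messages would then be sent to whatever chat-completion API is in use; the point of the sketch is only how the constraints are packed into the system prompt.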
So we took it slower and went with the single-shot method, where you give your model a single example of how the output should look and expect it to learn from and mimic that. We also changed our input strategy: instead of going to the production code, we changed over to Jira, because in Jira we already have our vulnerabilities logged, including vulnerable code snippets that have been exploited. We took one of those as the example, along with the related repo where it was found; so we are giving one code example and the related repo. What we saw is that if the code is very similar to the example, the model is able to detect it, but variable tracing within the same file was still a problem. For example, if a variable is defined somewhere, later accepts user input, and is finally used in something like window.URL, the model is still not able to build all of that context. And even though we had given one particular injection example, if the code was a little different, say XHR-related or a stored injection, it was not able to detect it properly. So then we moved on to the few-shot method, where you give not just one but three to five examples, so that the model understands the context and the task better.
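The few-shot assembly can be sketched like this: a handful of worked examples of one vulnerability class are placed before the real request as user/assistant turns. The example snippets below are invented stand-ins for real Jira ticket material.

```python
# Hypothetical example bank standing in for snippets pulled from Jira tickets.
INJECTION_EXAMPLES = [
    {
        "code": 'window.location = params.get("next")',
        "answer": "Open redirect / DOM injection: validate the target "
                  "against an allow-list before assigning to window.location.",
    },
    {
        "code": 'xhr.open("GET", "/api?q=" + userInput)',
        "answer": "Injection via XHR URL: encode userInput with "
                  "encodeURIComponent before concatenating.",
    },
    {
        "code": "element.innerHTML = comment.body",
        "answer": "Stored XSS: sanitize comment.body or use textContent.",
    },
]

def build_few_shot_messages(system_prompt, examples, code):
    """Interleave worked examples as prior turns, then ask about real code."""
    messages = [{"role": "system", "content": system_prompt}]
    for ex in examples:
        messages.append({"role": "user", "content": ex["code"]})
        messages.append({"role": "assistant", "content": ex["answer"]})
    messages.append({"role": "user", "content": code})
    return messages

msgs = build_few_shot_messages(
    "You are an expert AI security engineer.",
    INJECTION_EXAMPLES,
    'div.innerHTML = new URLSearchParams(location.search).get("msg")',
)
```

With three examples this yields eight messages: one system turn, three example pairs, and the real query.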
Here we looked through our Jira: if we say injection, what are the different types of injection examples we can find? Let's give those and see how we do. So you are giving multiple injection examples, and with this we were seeing improvements in detection, in finding the vulnerabilities, but we would say there was still a long way to go before what we could call an accepted fix. At almost the same time, we came across a paper from Boston University about the chain-of-thought method, where they reported vulnerability detection with about a 70% precision rate. This is basically how it works: you have a problem statement and you explain what the problem is; if you are giving the model a decision, you also explain why you are taking that decision. You want the model to think the way a human would think and to mimic that entire thought process. So here we are not only giving the Jira ticket summary and description; we are also giving an explanation. For example: here is my vulnerable code snippet, and next, here is why I think it is vulnerable. I explain: this param input is coming from user input, so if malicious JavaScript is loaded, then when window.URL gets executed, it executes in your environment. And then I give a proper fix: not just "sanitize this," but a proper definition of what that sanitize would look like, and I say why that fix would work. So far we have mainly tried this on injection.
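One chain-of-thought example in that style might be packaged like this. The snippet, the sanitizer name, and the wording are all illustrative, made up for this sketch; the structure (summary, vulnerable code, why it is vulnerable, full fix, why the fix works) is the part that mirrors the method described.

```python
# Illustrative worked example: summary, snippet, reasoning, complete fix.
COT_EXAMPLE = """\
Summary: Reflected injection via the `next` query parameter.
Vulnerable code:
    const next = new URLSearchParams(location.search).get("next");
    window.location.href = next;
Why it is vulnerable: `next` comes straight from user input, so a
malicious value executes in the victim's environment when the
assignment runs.
Fix:
    function sanitizeRedirect(target) {
      // Only allow same-origin relative paths.
      return target && target.startsWith("/") ? target : "/";
    }
    window.location.href = sanitizeRedirect(next);
Why the fix works: the full definition of `sanitizeRedirect` is
included, and it rejects anything that is not a same-origin path.
"""

def build_cot_messages(code):
    """Prepend the reasoning-style worked example to the real request."""
    system = (
        "You are an expert AI security engineer. Reason step by step the "
        "way the worked example does: state why the code is vulnerable, "
        "give a complete fix including any new function definitions, and "
        "explain why the fix works.\n\nWorked example:\n" + COT_EXAMPLE
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"Analyze this code:\n{code}"},
    ]

msgs = build_cot_messages('window.location.href = params.get("next")')
```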
We saw a lot of improvement, both in detection and in the quality of the fix it was generating. But I should definitely mention that this is not for vulnerabilities spread across multiple files; it is for vulnerabilities contained within a single file.

So, a reality check. From the point where we started with the dummy repo, where it found everything perfectly, 10 out of 10 issues, to where we are today, there was a lot of realization. The challenges: as I mentioned, with complex vulnerabilities that are spread across multiple files or multiple repos, GPT does really poorly. Then, when we moved over to production repos, there is no proper test file, because you never know how many vulnerabilities are actually out there in your repo. We have static code checkers, but we all know those also produce false positives. So if my model is producing 10 issues, I do not know if that 10 is out of 100 or out of 20; you never know the denominator. And then the correctness of the fix: this is also very subjective. How deep, how accurate do you want your fix to be? For example, if the fix says you need to include a function, the proper function definition should be present; if there is a place for a constructor, if there is an import statement, all of those should be included as part of the code fix. It should not just say "include this"; there should be proper library imports and proper code for them. How deep you go with the fix is subjective.

Next we move on to metrics and evaluation. Initially, obviously, we started
with manual evaluation: we would go through each of the findings and see whether it was a correct fix, and if so, whether the generated fix was good enough. But obviously manual evaluation cannot scale; that's not acceptable. So we tried an auto-eval framework. This is something LangChain provides: they have a lot of eval APIs and functions, and the particular one we used is the scoring evaluator. Basically, we first created a data set with vulnerable code and the corresponding good fix; we call this the golden data set. Then we would take that same vulnerable code, run it through the LLM, ask it to generate a fix, and compare how close that fix is to the reference fix. This is where the LangChain scoring evaluator comes into the picture: it lets you score accuracy on a scale of 1 to 10, where 10 means the predicted fix and the reference fix are context-wise very similar, 7 means there are minor differences, and so on.
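The eval loop over the golden data set might look roughly like this. In the real pipeline the scorer would be LangChain's scoring evaluator (for example, something like `load_evaluator("labeled_score_string", llm=...)`, which grades a prediction against a reference on a 1-to-10 scale); here a trivial stand-in scorer keeps the sketch runnable without an API key, and the golden-set entry is invented.

```python
# Hypothetical golden data set: vulnerable code paired with a known-good fix.
GOLDEN_SET = [
    {
        "vulnerable": "el.innerHTML = userInput",
        "reference_fix": "el.textContent = userInput",
    },
]

def stub_scorer(prediction, reference):
    """Stand-in for an LLM judge: 10 if identical, 5 otherwise."""
    return 10 if prediction.strip() == reference.strip() else 5

def evaluate(generate_fix, golden_set, scorer, threshold=7):
    """Score each generated fix against its reference; report pass rate."""
    scores = []
    for item in golden_set:
        predicted = generate_fix(item["vulnerable"])
        scores.append(scorer(predicted, item["reference_fix"]))
    passed = sum(1 for s in scores if s >= threshold)
    return {"scores": scores, "pass_rate": passed / len(scores)}

# A fake fix generator standing in for the LLM call.
result = evaluate(lambda code: "el.textContent = userInput",
                  GOLDEN_SET, stub_scorer)
```

Swapping `stub_scorer` for an LLM-judge call is the only change needed to turn the harness into the auto-eval setup described.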
We decided that if the accuracy looked like roughly 70% or better, we could say this is good enough: we accept the prompt and go ahead generating code fixes of a similar type in other repos. For metrics, there are two types we have considered so far. One is precision: if the model reports 10 issues but only four of them are correct, your precision is 40%. The other is accuracy: how good your fix is.
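The precision number just quoted (10 reported, 4 correct, so 40%) is simply:

```python
def precision(reported: int, correct: int) -> float:
    """Fraction of reported issues that are true positives."""
    return correct / reported if reported else 0.0

p = precision(10, 4)  # -> 0.4
```

As noted above, the catch in practice is the denominator problem: precision is measurable from triaged results, but recall is not, because the true number of vulnerabilities in a production repo is unknown.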
With all the prompt engineering methods we tried, we found that chain of thought gave us the most promising results. And now we have a demo. Over to you, Joe.

Yeah, so there were three different ways we implemented this. First, code fixing directly from a Slack-powered AI agent. We also had a CLI-based implementation where we would iterate over entire repos and then select and improve fixes for PRs later. But this one is the simplest: just a simple command line where we issue the PR directly. The clip is very short, about 25 seconds: the fix happens, the tool explains the fix and what was vulnerable, and then there is the PR. And that's about
it. So, conclusions. We had quite the journey, and it's still ongoing. For people who aren't familiar, this slide is representative of the Gartner hype cycle: where your expectations are over time. When I first saw it a few years ago, it really resonated with me. When we started off at the beginning, we had our PoC and a lot of expectations; I thought this project would probably take a couple of weeks and we'd be done with it. Then, as we started moving up with increasing code difficulty and increasing vulnerability difficulty, we still thought we could handle it. We did zero-shot and single-shot on real code, and the same with few-shot. When we got to chain of thought, we were getting the fixes we wanted, fixes that actually aligned with historical fixes using our proprietary code and libraries, not just generic fixes, and we thought this was great. But as the complexity of the vulnerabilities, the coding styles, and the different files kept growing, we were still using a manual evaluation process, and we realized that was not sustainable. When we transitioned over to the auto-eval framework, we realized we were at a place where we would be moving up again, and that there is actually a path forward. So now we are at the point where we have a clear path to full productivity, plus some prompt strategies we learned along the way that ended up being pretty valuable, which we'd like to share.

The first one: you always want to reduce opportunities for LLM laziness. What you see in the upper right-hand corner is an actual function provided as a supposed fix by GPT-4, and that's not what you want. So your prompt should contain something like "placeholder ellipses and other shortcuts will never be used in place of functional code," maybe even more than once. The second strategy is you want
to make sure you have a proper output schema. You want to include your file formats, the schema details, and your output fields; the slide shows just a snippet of a couple of fields you might want. Even with this specified, you may still get syntactically incorrect output from time to time. The final one: you want to reduce opportunities for LLM hallucinations. Give the LLM output fields it can generate into instead of hallucinating, such as a field for library imports or for method creation, or language-specific fields that would only be used in the context of a specific language, like constructor changes.

And some key takeaways. Directed fixes are helpful: the more hints you can provide to your LLM, the better your fix is going to be. Finding and fixing uncomplicated vulnerabilities is a simple task for an LLM. This may seem obvious, but fix things that are 100% known to be vulnerable first; for instance, if you have validated results from a bug bounty program, you can use those as input to your prompt and get a fix quicker that way. Metrics are key: regardless of whether you build or buy, you want to measure the effectiveness of fixes for your code. Traditionally, when you do evals it's pretty simple: you have a question and you have an answer. But when you're trying to do evals with code, you have to think a little differently, because code is not human language: your question is your code, your fix is the answer, and the answer can depend on how near the question is and how much information you want. Fine-tuning your prompts can take you pretty far; chain of thought, as we saw, was the best for performance, and well-designed prompts and output schemas are key. And lastly, humans are still the gatekeepers: you always want to keep humans in the loop for final validation.
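The three prompt strategies above (an anti-laziness clause, an explicit output schema, and dedicated fields the model can generate into instead of hallucinating) can be sketched together like this. The clause wording and all field names are illustrative, not the team's actual prompt.

```python
# Illustrative anti-laziness instruction, per the first strategy.
ANTI_LAZINESS = (
    "Placeholder ellipses, TODO comments, and other shortcuts will never "
    "be used in place of functional code."
)

# Illustrative output fields: giving the model explicit slots for imports,
# new methods, and constructor changes reduces inline hallucination.
OUTPUT_FIELDS = {
    "fixed_code": "the complete fixed code block",
    "library_imports": "any import statements the fix requires",
    "new_methods": "full definitions of any methods the fix introduces",
    "constructor_changes": "constructor changes, if the language has them",
}

def build_fix_system_prompt():
    """Combine persona, anti-laziness clause (twice), and output schema."""
    lines = [
        "You are an expert AI security engineer.",
        ANTI_LAZINESS,
        "Repeat: never replace functional code with placeholders.",
        "Respond with JSON containing exactly these fields:",
    ]
    lines += [f"- {name}: {desc}" for name, desc in OUTPUT_FIELDS.items()]
    return "\n".join(lines)

prompt = build_fix_system_prompt()
```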
That's it. [Applause]

All right, the audience has a few questions for you. First one: nice work; how do you manage the accuracy of the AI code fixes? Do you maintain any eval data set?

Yeah, so we track all of the fixes and all of the results in a table, so that we can measure performance over time.

Second question: why not use Semgrep, which is open source?

So, a lot of times, we have an internal solution where we can take the results from our static analysis and feed them into the LLM, so it's like adding an additional filter on top of your results.

Which was more effective using AI: finding vulnerabilities or fixing them?

Fixing. When you're trying to find something it's a pretty green field; you can look through an entire file, and a lot of times the LLM would generate a lot of false positives. It could be good at finding vulnerabilities, but it would also find things that weren't there. Fixing is more directed: if the LLM has something narrowly focused to look at, it can do better.

What kinds of vulnerabilities have you experimented with?

We have mainly tried XSS and injections.

How does it handle cases that require a complex code fix?

I can take that one. You need to give a lot of details, for example specifics about your company and what standards it follows; sometimes you need to put those details in the system prompt as well. And you need to specify in your output fields, like Joe mentioned, exactly where you want the import statements, constructors, and functions to be placed. It really depends on how fine-grained you can make your system prompt.

When you do few-shot, do you retrain the model or just provide the samples as context?

We have something set up that I called adaptive few-shot: if we knew the vulnerability type ahead of time, it would pull the few-shot examples from a known set for that particular vulnerability type. No retraining.

How do you mitigate the security implications of giving AI access to production?

We always only issue PRs, so nothing ever goes straight to main.
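The adaptive few-shot idea just described can be sketched in a few lines: a bank of examples keyed by vulnerability type, queried at prompt-build time, with no retraining involved. The bank contents here are invented for illustration.

```python
# Hypothetical example bank keyed by vulnerability type.
EXAMPLE_BANK = {
    "xss": [
        {"code": "el.innerHTML = userInput",
         "fix": "el.textContent = userInput"},
        {"code": "document.write(q)",
         "fix": "node.textContent = q"},
    ],
    "sql_injection": [
        {"code": '"SELECT * FROM t WHERE id=" + id',
         "fix": "use a parameterized query with a bound id"},
    ],
}

def select_examples(vuln_type, k=3):
    """Return up to k few-shot examples for a known vulnerability type."""
    return EXAMPLE_BANK.get(vuln_type, [])[:k]

examples = select_examples("xss", k=3)
```

The selected examples would then be spliced into the prompt as worked examples, exactly as in ordinary few-shot prompting.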
Have you considered any semi-automated feedback mechanism, so the LLM responses can be refined over time in a scalable way?

Yeah, that is in our plan. Right now the flow we have built is very much Jira-driven, so probably something like Jira comments that we then incorporate into the LLM on the back end.

Were you worried about sending internal code over to external model APIs?

That's a great question. We did everything through Azure OpenAI, our own company's instance; otherwise, yes, that is something we would have been concerned about.

All right, those are all the questions we have on Slido for now. Please join me in thanking our guest speakers. [Applause]