
Hi everyone, I'm Thomas. I run a cybersecurity software business. My background is in pen testing, and I was doing hands-on security testing until about two years ago, when myself and a couple of my former team members decided there might be a better way to look at how we do our jobs. Then something really interesting happened: ChatGPT came out, and we realised there might be an application for it. Since then we've had loads and loads of conversations with loads of different people. Some were saying this is going to take our jobs and completely put us out of business. Others were saying this thing can't do what I can do, it isn't able to replace a person, it's a complete novelty. But then some more interesting conversations started to emerge, with people talking about the idea that maybe using large language models isn't about replacing a person at all; it's about augmenting what we're already doing and helping to improve it.

So I was really keen to start a conversation about the kinds of ways language models could specifically be applied to application security: things we might already be trying to do and failing at that language models could help with, and things we simply didn't believe were possible before. That's really what this talk is about: a conversation, and I'd like to get your ideas too, about ways we might get language models to help us do a better job at what we're trying to accomplish.

A quick show of hands in the room: who here uses AI in their day-to-day life? Anyone want to give an example? And who here is using AI to benefit security?
So that's exactly the same response I get every single time. Whenever I have these conversations with people, I think it's fairly clear that AI is not a replacement for a person, but it certainly is an opportunity to help a person: to help prevent burnout and exhaustion, and to help deal with large volumes of data.

To talk about how we might use large language models to augment AppSec testing, I first want to define what I mean by AppSec testing. I break it down into four core things that go into a good AppSec programme, or at least a traditional one. The first is scoping: finding out what it is you're supposed to be testing and when you're supposed to be testing it. The second is reporting: communicating your output effectively. The third is automated scanning: picking up a lot of the low-hanging fruit and doing the things that are high volume. And the fourth is manual testing. I'm going to touch on each of these a little through the talk.

To help me along, my brain works in a way where I need to visualise things in a real-life scenario, so I invented a company, Acme Limited; hopefully you can think along with me and imagine how this might work in a typical software development business. Acme build a piece of HR software, and they use large language models to help build it. Their primary goal is not to be a secure piece of software (that's something they'd like), but to ship new features and new functionality all the time. At the moment they just get an annual pen test and run a bit of DAST on the side, and it's working okay for them, but when they get pen tests done it's still finding things: SQL injection, cross-site scripting. They've got no idea where these things come from, and they're just trying to battle them whenever the pen test report comes in.

So let's look at what scoping today looks like for a company like Acme. What would usually happen is somebody comes in, sits down with the development team, and tries to understand what needs to be tested: the features of the application, its functionality, and the motivations behind the test. After maybe a half-hour conversation with four or five people (so about two and a half hours of actual resource time), you come out with some sort of a summary
that you then pass on to the tester. That's fine, but I really felt this might be the first opportunity to start augmenting the AppSec process. I started by looking at things called guided workflows. Effectively, a guided workflow is you challenging the model to present you with a series of questions, dynamically generated based on the answers you give, that can then produce a summary in the format you're looking for; it's just question and answer. What this means is that rather than sitting down for half an hour with people in a team, you can just give them access to the model, they have a conversation with it, and what it spits out is a very high-quality scope. I thought that was brilliant, but the problem is that the only person whose time it saved in the whole process was my own, so to a customer it wasn't actually that valuable.

So then I started wondering: what if we didn't need this scoping conversation at all? The way we came at removing the scoping was to ask where all of this information already exists. Where can we go to collect information about what needs testing, and what's been changed since the last pen test? It's already sitting inside businesses: it's in high-level documentation, it's in ticketing systems, it's already recorded. The problem is that it's usually recorded across 10,000 different Jira tickets, or across 14 or 15 different pages of high-level documentation. It's the kind of material a person doesn't have the capacity to trawl through to quickly generate a scope, but you can feed all of that text into a model, and models are brilliant at trawling through loads and loads of this information to extract the key bits of value and present them back.

That was the first toe dipped into the water of using large language models, and I definitely think there's a real opportunity to expedite the scoping process. So much so that you don't necessarily have to do these scopes once a year and rely on trying to update or redo them each time. What you can actually do is keep the scope updated throughout the year: every single time somebody makes a development change, you feed it in as further context and the model rapidly updates the scope, so that whenever you want to kick a test off, the scope is always an up-to-date summary of exactly what there is and exactly how it might need to be tested.
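As a sketch of that continuous-scoping idea: everything below is hypothetical (the ticket fields, keys, dates, and prompt wording are invented, not a real Jira schema), and the actual model call is deliberately left out. This only shows the cheap, deterministic part of turning tracker data into scoping context.

```python
from datetime import date

# Hypothetical ticket records pulled from a tracker such as Jira;
# field names are illustrative only.
TICKETS = [
    {"key": "ACME-101", "title": "Add SSO login", "closed": date(2024, 3, 1)},
    {"key": "ACME-102", "title": "Refactor CSS", "closed": date(2023, 6, 10)},
    {"key": "ACME-103", "title": "New payroll export API", "closed": date(2024, 5, 2)},
]

def changed_since(tickets, last_pentest):
    """Keep only work completed after the last pen test."""
    return [t for t in tickets if t["closed"] > last_pentest]

def build_scope_prompt(tickets, last_pentest):
    """Assemble the context we would hand to a model to draft the scope.

    The model call itself (hosted or local) is out of scope here.
    """
    lines = [f"- {t['key']}: {t['title']}" for t in changed_since(tickets, last_pentest)]
    return (
        "You are helping scope a penetration test. The following features "
        f"changed since the last test on {last_pentest.isoformat()}:\n"
        + "\n".join(lines)
        + "\nSummarise what needs testing and why."
    )

prompt = build_scope_prompt(TICKETS, date(2024, 1, 1))
```

Every new development ticket just gets appended to the input, so the scope stays current without another half-hour meeting.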
The second thing I looked at was automated scanning. Automated scanning is really interesting because it's such a broad category: you've got SAST, you've got DAST, you've got RASP, all sorts of different ways of going about identifying certain vulnerabilities. The problem is that it's only reliable at identifying certain classes of vulnerability; it depends on the vulnerability. It's brilliant at finding missing security headers, for example, but awful at finding business logic flaws. So I started to wonder whether I could go down the road
of using AI to find the vulnerabilities it wasn't very good at, and very quickly realised that loads and loads of other people are already working on that problem; hopefully they'll solve it. But in the absence of being able to do that myself, where could I use models to make the most of the information these automated tools are already able to produce? There were really two areas where it seemed to make sense. The first was stitching together findings from multiple different sources to get a greater level of confidence. If your SAST tool is telling you something and your DAST tool is telling you the same thing, models are really good at interpreting and identifying that those findings from two different scanners are in fact the same finding. What that means is that while SAST is wrong something like 33% of the time and DAST something like 15% of the time (the exact statistics vary), if your model can determine that a SAST finding and a DAST finding are the same finding, the false positive rate drops to a few per cent. That's an incredibly valuable way of mitigating the problem of feeding your development and engineering teams lots of results that they read and then realise aren't actually accurate or valuable.
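A minimal sketch of that corroboration idea, assuming findings already normalised to a CWE identifier plus a location. Real scanner output is far messier, which is exactly where a model helps with the fuzzy matching; the rates at the end are the illustrative figures from the talk, not measurements.

```python
def norm(finding):
    # Normalise to (vuln class, location). Real scanners disagree on
    # naming and paths, so in practice a model would do this matching.
    return (finding["cwe"], finding["location"].lower())

def corroborate(sast, dast):
    """Mark SAST findings that a DAST scan independently reported."""
    dast_keys = {norm(f) for f in dast}
    return [dict(f, corroborated=norm(f) in dast_keys) for f in sast]

sast = [
    {"cwe": "CWE-89", "location": "/api/users", "tool": "sast"},
    {"cwe": "CWE-79", "location": "/search", "tool": "sast"},
]
dast = [{"cwe": "CWE-89", "location": "/API/users", "tool": "dast"}]

results = corroborate(sast, dast)

# If each tool alone is wrong with probability p1 and p2, and their
# errors are independent, a corroborated finding is wrong ~p1 * p2.
combined_fp = 0.33 * 0.15  # ~0.05 under those illustrative rates
```

The independence assumption is optimistic, but even a rough version of this filter changes what you're comfortable sending straight to a development team.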
The second area is attack path mapping. Attack path mapping, for those of you who aren't familiar with it, is about describing the ways you can chain multiple different vulnerabilities together in order to describe what an attacker might actually do. It's a good way of understanding that a lower-risk vulnerability might actually be the primary vulnerability to focus on, because it's the thing that enables a load of other vulnerabilities to be exploited and abused. This kind of mapping on a small scale is fairly straightforward: if you've got 10 vulnerabilities, as a consultant you can start to explain how those things tie together. But for a lot of organisations, where there are 10,000 different vulnerabilities open at the same time, trying to understand how to connect all of those together and draw it all out is an incredibly challenging process. These are exactly the kinds of things where you can feed the findings in as context for a model and ask it to explain how they might connect; it's very effective at chaining together the vulnerabilities that automated scanners have been able to find.
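At its core, the attack-path idea is a graph problem, which can be sketched without any model at all; the vulnerability names and "enables" edges below are invented for illustration.

```python
# Toy attack-path graph: an edge A -> B means exploiting A enables B.
ENABLES = {
    "verbose-error-messages": ["sql-injection"],
    "sql-injection": ["credential-theft"],
    "credential-theft": ["admin-takeover"],
    "missing-header": [],
}

def downstream(vuln, graph, seen=None):
    """All vulnerabilities transitively enabled by exploiting `vuln`."""
    seen = set() if seen is None else seen
    for nxt in graph.get(vuln, []):
        if nxt not in seen:
            seen.add(nxt)
            downstream(nxt, graph, seen)
    return seen

# The 'low risk' verbose errors turn out to enable the whole chain,
# so they become the priority fix despite their individual severity.
blast = {v: len(downstream(v, ENABLES)) for v in ENABLES}
priority = max(blast, key=blast.get)
```

The hard part at 10,000 findings is building those edges in the first place, and that's the step where a model reading the finding descriptions earns its keep.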
The next thing is manual testing. Manual testing is useful because humans can add value in very specific ways: they are very capable of finding those business logic flaws and those complex identity and access control issues. But in order to do that, they have to turn over every single stone inside an application, because they don't know where they should be looking or exactly what they should be looking for. You end up with humans duplicating a lot of the same tests the automated tools are running, and loads of time gets lost there. So one of the things I started looking at was using data, again from ticketing systems and from automated scanners, to feed into a model and narrow down the scope of what a person actually needs to look at. You don't need to look at every single part of the system, every single time, for all of these vulnerabilities; all you actually need to do is look for the vulnerabilities we know the automation isn't able to find, in the places within the application where somebody has been working and developing since the last pen test. I'll come on to some really cool tools for this in a minute.
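That narrowing logic can be sketched as simple set arithmetic; the component names and the list of what automation covers are invented assumptions, not a real coverage map.

```python
# Vulnerability classes the automated tooling reliably covers
# (illustrative), versus everything we care about.
AUTOMATED_COVERAGE = {"sql-injection", "xss", "missing-headers"}
ALL_CLASSES = AUTOMATED_COVERAGE | {"business-logic", "access-control"}

def manual_backlog(changed_components):
    """(component, vuln class) pairs a human still needs to look at:
    only components touched since the last pen test, and only the
    classes the automation can't reliably find."""
    manual_only = ALL_CLASSES - AUTOMATED_COVERAGE
    return sorted((c, v) for c in changed_components for v in manual_only)

# Only two components changed since the last test.
backlog = manual_backlog(["payroll-export", "sso-login"])
```

Four targeted checks instead of re-testing the whole application for everything: that's the time the tester gets back.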
There are tools that let you do this inside the SDLC, meaning you can introduce gates into the development process, so that before you deploy a piece of code to production you can pick up on whether that code can be tested automatically or needs to be tested manually for certain vulnerabilities.

The final area I looked at was reporting. I don't want to teach anybody to suck eggs in terms of what we do with reporting at the moment, and I don't really need to spell out how you can use a language model to facilitate your reporting; I'm sure anybody who has used ChatGPT can imagine where it has value there. What I did realise is that it's useful to define some rules for where and how you should and shouldn't be using models.
Within our business, models have proven brilliant at writing generic descriptions and boilerplate. They're really good at refining what you write: you write something and feel the information is there but it could do with a bit better structure, and they're really good at that. And they're really good at crafting recommendations. We security people are not developers; we don't necessarily communicate in development languages, and we don't necessarily know how to write the code that fixes the thing we've found. A model is a really useful way of making much more prescriptive recommendations. They're not always right and not infallible, but often they give a lot of useful direction on how to go about fixing things, and I think that's only going to become more relevant as things like Copilot get much better at a lot of these pieces of code.

There are also some don'ts. Sending a bunch of data to a third-party API like ChatGPT is obviously a risk that needs to be considered. It's not necessarily something everybody does consider, but sending vulnerabilities to third parties is generally considered to be a no-no. The second is relying on the output of LLMs to be accurate, which is one of the reasons I don't think an LLM is going to replace a person any time soon. You hear about hallucinations: these things do invent things.
They make statements that are so convincing in how they're said, but so fundamentally inaccurate, that if you're going to use these things you need to be confident in what the right answer is and just lack the time to find the best words to say it. If you can do that, you can use the models; if you can't, it's probably better to get the foundations under your belt before you start using a model to assist you.

The third rule is: don't waste your time trying to bypass guardrails. I've burned many, many hours trying to find ways to trick models. For those of you who aren't familiar, guardrails are effectively the mechanism by which things like ChatGPT will refuse to write payloads for you to exploit various different vulnerabilities. You can sometimes find ways around them; you can tell the model you're an old lady who desperately needs the payload to save her son, or something like that, and lots of people have come up with all sorts of ways to trick models into doing this. But honestly, you'll burn way, way more time doing that than you would just writing these things manually. At a certain point, accept the limitations of these tools and go back to the old ways.
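Coming back to the rule about not sending vulnerability data to third parties: here's a crude sketch of redacting findings before they leave your environment. The patterns and placeholder tokens are illustrative only; this is a best-effort filter, not a guarantee of anonymisation.

```python
import re

# Illustrative redaction rules: IPs, URLs, and obvious secrets.
REDACTIONS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),
    (re.compile(r"https?://\S+", re.I), "<url>"),
    (re.compile(r"(?i)(api[_-]?key|token|password)\s*[:=]\s*\S+"), r"\1=<secret>"),
]

def redact(text):
    """Strip the obviously identifying bits from a finding before
    sending it to an external model. Crude by design: review what
    survives before anything actually leaves your network."""
    for pattern, repl in REDACTIONS:
        text = pattern.sub(repl, text)
    return text

clean = redact("SQLi at http://10.0.0.5/login from 10.0.0.9, api_key=abc123")
```

The vulnerability class survives, which is usually all the model needs to write a description or a fix recommendation; the infrastructure details don't.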
So there are some resources I definitely recommend if you're interested in augmenting your programme. The first: I've mentioned ChatGPT, and ChatGPT is obviously the famous thing that everybody uses, but I've experimented with loads and loads of different models over the last couple of years, and Claude outperforms pretty much everything I've seen in its ability to work with more technology-focused material: its ability to work with code, its ability to understand developers. It's definitely where I'd be going instead of GPT if I were looking for a foundational model to work with.

The second is Hugging Face. For those of you not familiar, Hugging Face hosts loads and loads of open-source models, plus resources to teach you how to build your own model. It's a really powerful place to go if you're starting out and learning about these sorts of LLMs. Then there's Ollama, which is a way around the third-party problem: if you want an offline model, Ollama is effectively Docker for offline models. You no longer have to send vulnerabilities to a third party like Anthropic or OpenAI; you can host a lot of this on your own machine with very, very limited configuration.

Then there's BurpGPT. This is an example of how other people are already implementing this kind of technology into the tools a lot of us already use in application security testing, and I definitely recommend checking out what else is out there. These things are going to continue to be introduced into the tooling we already use, and I think the early adopters are going to find they perform better in the long run.

And the final one is Red Flag, which is what I mentioned earlier. Red Flag is a tool that has been written to sit in front of a release pipeline to make sure certain development changes are not going out to
production until they've been tested for the highest-risk changes and vulnerabilities. It's a really cool piece of technology, and I believe it's open source, so I definitely recommend checking it out.

Finally, I wanted to give a little strengths-and-weaknesses analysis of models. These are some of the key areas I've considered so far, but again, I'm really interested in hearing your opinions on this. Models are very clearly good at automating repetitive tasks: they're really good at processing large volumes of data and helping remove some of that burden from people, so that testers can focus on what testers want to do, which is testing. They are less good at being accurate; they hallucinate. There are also privacy concerns and lots of risks associated with sending data off to third parties, as I've already touched on. And they're not very good at explaining the reasoning behind things: they can often sound very confident, but you've got absolutely no idea how they actually ended up saying what they said, which makes it very difficult, unless you deeply understand the correct answer, to know whether what they're saying is the truth.

Then there are opportunities. Mass adoption by tooling, as I've already said: as people start to adopt this stuff more, I can only foresee the maturity of the capabilities accelerating, and people are going to come up with new ways to solve some of these problems. RAG and semantic search are two perfect examples. People complained about hallucinations, and people complained about the lack of rationale behind the decisions these models were making, but with retrieval-augmented generation you can give a model real-time access to much more up-to-date data that it can actually reference, massively reducing hallucinations and letting it say: I got this data from this place, and I made this decision because of it.
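The RAG pattern just described can be sketched in a few lines: retrieve the most relevant internal document, then force the model to answer from it and cite it. The keyword-overlap scoring here is a naive stand-in for embeddings, and the document names and contents are invented.

```python
def score(query, doc_text):
    """Naive word-overlap relevance; real systems use embeddings."""
    return len(set(query.lower().split()) & set(doc_text.lower().split()))

def retrieve(query, docs):
    """Pick the most relevant document, keeping track of its source."""
    return max(docs, key=lambda d: score(query, d["text"]))

# Hypothetical internal documentation corpus.
DOCS = [
    {"source": "wiki/payments.md",
     "text": "the payments service validates card numbers server side"},
    {"source": "wiki/auth.md",
     "text": "login sessions expire after 30 minutes of inactivity"},
]

hit = retrieve("how long do login sessions last", DOCS)
grounded_prompt = (
    f"Answer using ONLY this excerpt from {hit['source']}:\n{hit['text']}\n"
    "Cite the source in your answer."
)
```

Because the excerpt and its source travel with the prompt, the answer can point at exactly where the data came from, which is the explainability piece models lack on their own.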
Only a few years ago we couldn't do that with a model at all, and it's now a really powerful tool. Another opportunity is SLMs and distilled models. We all talk about large language models, and they're brilliant, but large language models take a massive amount of resource and compute power. SLMs are small language models: effectively the kind of thing you're now seeing operating on your phone, entirely offline. They're very simple and built for a very refined purpose, but they're a great way to avoid some of the limitations and challenges you hit when you're trying to
run a 70-billion-parameter model, for example.

Then there are the threats, and I've left these until the end because that's my security brain doing what security people do and talking about the bad things. First, it's a relatively unknown attack surface. What I mean is that we actually don't know all of the vulnerabilities these models might be exposed to. We hear about things like parameter leakage and prompt injection, and people are talking about the kinds of ways these models might be exploited, but it's still a relatively unknown quantity at the moment. That's definitely somewhat concerning, but hopefully also quite exciting for a lot of people who are interested in finding out how we might exploit, and then how we might secure, these models in the future.

The next is regulatory and contractual restrictions. There is regulation coming in, and some already in place, to limit the way we're using these models. So if you're going to go away and try to automate your AppSec programme, be cognisant of the fact that you might not actually be permitted to train on some of the data you'd like to train on. Be prepared for restrictions on how you use it, for example that you're not using it for life-critical systems; or, if you are doing that, be aware that somebody might come along in the future and tell you it's no longer allowed.

And finally, hype-driven decisions. This goes back to what I said at the start. I really do believe there are applications for models inside the AppSec programme, and I think I've given a few examples of areas where I can definitely see them being used in my programme in the future. But I also think a lot of people are still really excited about using models for the wrong things: about replacing people with them, or just adopting them in places they don't need to be. I don't think we need a model to find SQL injection vulnerabilities; sqlmap has been able to do that for decades. So I think keeping the focus on those areas where we're already struggling in security is really important, to mitigate the risk of us using models in places where they actually underperform compared to what we already have.

And that's my general SWOT analysis of models, and that's kind of the talk. I'm interested to hear from you.
[Audience Q&A, largely inaudible]
Any more questions? Then let's give Thomas a round of applause. [Applause]