
Prompting the Priorities: Evaluating LLMs for Smarter Vulnerability Triage

BSides Sydney 2025 · 37:08 · Published 2026-05
About this talk
Research from Macquarie University's Cyber Security Hub evaluates how well general-purpose LLMs can automate vulnerability prioritization using the SSVC decision framework. The team benchmarks five models (Claude, ChatGPT, DeepSeek, Gemini Flash, and others) across prompting techniques like zero-shot, chain-of-thought, and self-planning, using a VulZoo-derived ground truth. Findings: Gemini consistently outperforms, chain-of-thought with examples is most effective, and LLMs work best as decision-support rather than replacements for human analysts.
Transcript [en]

Muhammad Ikram to the stage. >> Thank you.

>> Welcome, and good morning everyone. How are you doing? Doing good? Okay. So today I will be talking about work that we have done at the Macquarie University Cyber Security Hub with an MRes degree student over about six to nine months, on how we can use LLMs and AI-based systems for prioritizing vulnerabilities. I hope that some of you already know what vulnerabilities are and why prioritization helps in tackling them in the systems that we use, the systems that our finances, our businesses, and many different applications rely on. So here is the outline of my talk.

I will briefly introduce the different frameworks that are being used in industry to prioritize and triage vulnerabilities, and then I will talk about how we can leverage LLMs, how we can use the different techniques that are out there, for instance prompting techniques, pre-training, fine-tuning and so on, and which among the different LLMs out there can help us the most. I will also present our methodology for automating vulnerability prioritization, and how can

we leverage LLMs to reduce the cost and make prioritization much more effective and efficient. Okay. So, a step back: why is it challenging to prioritize vulnerabilities? We all know that software and systems are developing at a very fast pace. There are attackers, and there are developers who may not be security aware, who use third-party libraries and components from other parties in their systems to speed up their development process and to meet the requirements of fast-growing

technologies and demands from the industry. So our thesis is that software and system vulnerabilities appear faster than remediations, and attackers weaponize faster than defenders triage. I think today as well as tomorrow some of the presenters will talk about how LLMs, even underground LLMs, can be used by attackers to weaponize different techniques and procedures against the systems that we use. Last year there were about 29,000 CVEs reported with high severity risk in the major vulnerability databases, which shows more than 200% growth in critical and high vulnerabilities

over the past five years. We are not sure whether this is because of the skill level of the attackers, or the over-reliance of developers on third parties, copied code, and the other methods they use to develop their software quickly, or whether it depends on the LLMs or the other systems that are used to find vulnerabilities and enhance vulnerability search. So we don't know exactly, but there are some statistics that we found in our research that are worthwhile looking at,

and we found out that 60%-plus of vulnerabilities are missing key metadata. This could be, for example, whether the vulnerability appears in a core component or core service of a given enterprise, or on a service desk, and so on. That critical information, or metadata, is missing in the major databases that record or archive vulnerabilities. We also found that 80% of the exploited vulnerabilities are exploited within 20 days of first disclosure. So this is quite fast, and the time to

exploitation has been reduced, and that's why we need a system which can help us detect exploitation of vulnerabilities faster and then propose methods to triage, track, or handle those vulnerabilities in the system. We also know that security analysts face an increasing volume and complexity. This results in a high cognitive load on the very small security teams that organizations normally have to tackle vulnerabilities, and this holds across the different sectors that we have analyzed. Sometimes that cognitive load on security analysts results in inefficient handling of the vulnerabilities reported to the organization.

Now, traditional systems have limitations. Sorry about that. Yeah. Well, the sound, the tone is very nice, so that's why I put it. Okay. So, traditional scoring systems have limitations. I hope that some of you already know how vulnerabilities are prioritized depending on their severity levels. There are different scoring systems which categorize vulnerabilities as high, severe, and so on. They assign specific scores. This could be a categorical score or a numerical score. Depending on that categorization and the score they receive, security analysts normally devise or revise their response in order to tackle those

vulnerabilities. But they have an inherent limitation, because these categories and numbers do not tell you where the vulnerability lies. Is it on a critical system or a critical component of the organization? Suppose there is a high-severity vulnerability that lies in the service desk of a given organization. Should the security analyst prioritize it over a medium-severity vulnerability found in the customer database? Which one should take the lead? A general question for you as well. Normally the security analyst should prioritize protecting the core components of the system, and prioritize vulnerabilities which lie in the core systems and functionality of the

given organization. But the scoring system would not tell you how to prioritize this, namely it provides responses that are independent of organizational context. In a nutshell, the scoring systems and categorizations that we traditionally use do not capture the context of the vulnerability, where the vulnerability lies, and so on. To capture that context, different methods and frameworks have been proposed, and our research question is: how can vulnerability prioritization be automated in a context-aware manner? This is something that we looked into with my MRes degree student over the

last couple of months, and we came up with a system to do this. Now, as I mentioned, there are different categorization and scoring systems out there to prioritize or rank vulnerabilities. Among them are the Common Vulnerability Scoring System (CVSS), EPSS, and SSVC. The most prominent one among these three that captures the context of a given vulnerability is SSVC, which is proposed by America's cyber defense agency, CISA. It proposes a number of steps that a security analyst should take in order to triage, track, or handle vulnerabilities, and all the steps that a given security analyst

normally takes are manual. I will go through one of their systems. If you search for the SSVC calculator, you will find how, when you receive an alert for a vulnerability, the security analyst takes the decision to triage, track, and so on. I will show you a demo as well. But before showing the demo, I would like to walk through the different phases of handling vulnerabilities. So here we have the SSVC calculator, and as you can see, eventually this calculator outputs a decision tree which looks like

this. The decision tree is based on the different decisions that a security analyst should take at these different nodes, and for that they need background knowledge: where to search for exploitation of the vulnerability, where to find the context of the vulnerability, and so on. SSVC was co-developed by CISA and CMU in 2019, and the framework guides vulnerability response actions, meaning how security analysts should act on a security alert. This is context-aware decision making, meaning that they are not only using the scoring system and the availability of an exploit; the analyst also takes into account the context of

that vulnerability within the organization. There are four core decision points. The first is exploitation: given an alert about a CVE or vulnerability, the analyst asks whether an exploit is available for the vulnerability or not. If an exploit is available, is it automatable? Can the attacker automate the exploit in order to attack the systems? Then, what would be the technical impact, that is, what sort of knowledge and methods are required for an attacker to attack the systems, or conversely to counter the vulnerabilities being exploited? And then mission and

well-being. This depends on the organization. The last decision point the analyst considers is: if this vulnerability is exploited and an attack happens on the system, would my organization face, let's suppose, a breakdown of the mission or service that we provide? For instance, sometimes Sydney trains break down because of the weather and things like that. What would happen if there were attacks like this on critical infrastructure, like a dam or maybe nuclear power plants? Would they stop working? So these are some of

the decision points that they take. Here you can see that when we look at exploitation, these are the different options: none available, PoC available for that vulnerability, and active, meaning that an exploit is out there and is actively being used by attackers. Nowadays we know that building PoCs and automating tasks with LLMs is quite easy for most developers and attackers, and it can be really fast. This is what I talked about earlier, that the time to exploitation of a vulnerability has been reduced to 21 days.
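The four decision points described above can be sketched as a small data structure. This is a minimal illustration, not the official SSVC tree: the enum and function names are hypothetical, and the lookup table is deliberately incomplete. Its single entry follows the demo path described in the talk (exploit available, automatable, total technical impact, outcome "act"), under the assumption that exploitation was set to active and mission and well-being to high, which the talk does not state explicitly. All other paths would need to be filled in from the published SSVC decision tree.

```python
from enum import Enum
from typing import NamedTuple, Optional


class Exploitation(Enum):
    NONE = "none"      # no exploit known
    POC = "poc"        # proof of concept available
    ACTIVE = "active"  # exploit in active use


class TechnicalImpact(Enum):
    PARTIAL = "partial"
    TOTAL = "total"    # adversary gains total control


class MissionWellbeing(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


class DecisionPoints(NamedTuple):
    exploitation: Exploitation
    automatable: bool
    technical_impact: TechnicalImpact
    mission_wellbeing: MissionWellbeing


# Hypothetical and incomplete: only the demo path from the talk is filled in,
# and the ACTIVE / HIGH values on that path are assumptions.
KNOWN_OUTCOMES = {
    DecisionPoints(
        Exploitation.ACTIVE, True, TechnicalImpact.TOTAL, MissionWellbeing.HIGH
    ): "act",
}


def decide(points: DecisionPoints) -> Optional[str]:
    """Return the recorded outcome, or None when the path is not in the table."""
    return KNOWN_OUTCOMES.get(points)


demo = DecisionPoints(
    Exploitation.ACTIVE, True, TechnicalImpact.TOTAL, MissionWellbeing.HIGH
)
print(decide(demo))
```

Keeping the four answers in one immutable record makes it straightforward to ask an analyst, or an LLM, for each decision point separately and only then look up the combined outcome.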

This is how LLMs help to reduce that time. Let's assume we choose that the vulnerability is not exploitable, that an exploit is not available. Then the next thing is whether it is automatable, yes or no, and so on. If you proceed further in the decisions, mission criticality comes in, where the security analyst decides whether there would be a severe impact, or some impact, on the mission of the given organization if the vulnerability is exploited, and so on. And at the end there are

these decisions, like track, and I will show you these different decisions in the demo: whether the analyst should track the vulnerability or take action to handle it. This is the framework that they use. Let me go back. So, Stakeholder-Specific Vulnerability Categorization, and if you scroll down you will see the calculator here, and this would be start decision. Okay. So let's say clear

and then start decision. Let's say we have an attack which is alerted on a screen, and the security analyst wants to take an action: work on it, leave it alone, or maybe track it. The first thing they ask is whether exploitation is available for that vulnerability or not. Here you can see the options: none is available, PoC is available, or it is active. Suppose the security analyst found out that there is an exploit available for that vulnerability and it is automatable. Then the next step is the technical impact: the exploit gives the adversary total

control over the systems. Let's say yes for this, and then mission criticality. If this attack happens in the customer service desk, then the mission may not be that much impacted for a given organization. Let's say here at AITD or TAFE or maybe a bank there is a vulnerability in the customer desk: you can still register for a given course using your own phone, browse there, and register yourself. So there is a way that you can work around it. Let's say here we decide that this is essential, and the public impact is

minimal, material, or irreversible. Then, depending on those different decisions, you can see that it tells the analyst to act immediately on this vulnerability and tackle it. These are the different decision points that can be used in decision making. Now, all of these points that I mentioned are manual, and the analyst has to make a decision. They have to search the databases, they have to know the context of the vulnerability, and so on. So this is all a manual task. Now, the question that we ask is: can we use LLMs, and can we use this

system to automate some of these vulnerability decisions? There have been talks about how people have used LLMs to communicate with scammers, to communicate with different systems, and so on, to gather threat intelligence, and here we are basically following a similar pattern in order to take these decisions. Now, there has been a lot of work in this space where people have used machine learning algorithms to detect vulnerabilities, including supervised, unsupervised, and reinforcement learning. They basically look into modeling and detecting vulnerabilities and behavioral analysis of the

vulnerability. There are some models like CVSS-BERT which map vulnerabilities to numbers, and there are GPT-based frameworks that try to build context around the vulnerabilities, but none of them rank and prioritize vulnerabilities. So there is limited use, and organization-context frameworks like SSVC haven't been analyzed by the systems or machine learning algorithms that I mentioned on the previous slide. So our contribution is that we evaluated the LLMs that are out there, the ones we use in our daily lives, to see how we can leverage them in order to automate

these different techniques. We evaluated different prompting techniques and different foundation LLMs in order to prioritize vulnerabilities, and here is the schematic, or bird's-eye view, of the work that we did. We took vulnerabilities from the VulZoo database to have a ground truth for how our system would compare to other systems. So we use the VulZoo vulnerability database as ground truth. We took different prompting techniques from the literature. Those of you who work with machine learning algorithms will know about zero-shot,

few-shot, and so on, meaning different ways of training or fine-tuning the LLMs in order to rank or prioritize vulnerabilities. For the four decision points that I mentioned, exploitation, mission and well-being, and so on, we also use the LLMs with the prompting techniques to make those decisions, in a manner where, let's suppose, a security analyst would use LLMs to ask: is an exploit available for this vulnerability, is it critical to my organization, what is the technical impact of

this vulnerability, and so on. So we not only prioritize vulnerabilities at large, but we also propose a system that analysts can use to answer those decision points individually. We use five different models, like DeepSeek, Claude, ChatGPT, and so on, to benchmark the different LLMs with the prompting techniques and the decision points. We then extracted the decisions from the LLMs and analyzed them using different metrics in order to compare and contrast the prompting techniques, the decision point evaluation, as well as

the LLMs' performance, and this is how we work. We have these different vulnerabilities, and in this case we use the example of this CVE, which is a Windows Common Log File System Driver Elevation of Privilege Vulnerability. We use different prompting techniques here, like zero-shot, chain-of-thought, mimics proxy: these are some of the prompting techniques that help to better extract information from LLMs. Zero-shot means that the LLM is not trained with any prior vulnerability data; it is similar to just using

ChatGPT or Gemini as it is in order to evaluate them. In the next couple of experiments, we trained the LLMs in the sense that, let's suppose, a given organization would be fine-tuning and training LLMs to do this prioritization task. Then we have these three mission and well-being statements: high, medium, and low. These relate to what I showed you, where, depending on the mission of the given organization, we determine whether this is critical or not; these are the decision points that we worked out. That is because this sample vulnerability database does not have these mission

critical data points. So we generated those mission-critical data points from LLMs by giving them scenarios: for example, that we are, let's suppose, a banking system and this vulnerability happens, what could be the possible mission-critical scenarios here? And, let's suppose, a supermarket, and so on. We gave these different scenarios to the LLMs to generate those decisions. These are the LLMs that we analyzed, and these are the trials that we made. The reason for the trials is that these LLMs are non-deterministic: if you prompt an LLM with the same prompt you receive different outputs, and in order to

reach a deterministic decision there could be many trials that you can make, depending on the budget that you have and the tokens that you would like to use. We ran three trials, and from these trials we take a majority vote to decide whether a given vulnerability needs to be triaged, and so on. This technique is used in many other research works as well to reach a conclusion. As for the dataset, as I mentioned, we have the VulZoo dataset, and we chose to take these 384

vulnerabilities, for which we have all the ground truth and metadata that we can use for analysis. This dataset was developed by people in the US and in Singapore. It is high-quality data that they have collected, and they use SSVC to prioritize those vulnerabilities. Here we have this evaluation set, which is maintained by CISA and covers automatable, exploitation, technical impact, and so on. It has approximately 45,000 CVE IDs, and we took a random

sample of 384 different vulnerabilities. We used these different LLMs as they were available at the time of the experiments. We had different costs for the tokens that we used. Our decision was based on budget: since this was an MRes degree student, and as a senior lecturer at a public university we are limited in budget, we chose to work within the money allocated for our research. Among these, Claude and DeepSeek were some of the expensive LLMs that we evaluated in

our study. These are the prompting techniques that I mentioned. The work that we based our methodology on was by Tony et al., who reviewed and studied different prompting techniques, and we use the prompting techniques that they proposed in order to evaluate our methods. Here is a very quick, high-level statistic that we extracted from our evaluation. This figure shows the different LLMs: Claude 3, ChatGPT, DeepSeek,

and Gemini Flash 1.1. These are the different decision points that I mentioned: what is the technical impact, whether it is automatable, and whether exploitation is available. If security analysts use the LLMs as they are, what is their performance? We found that among the four LLMs that we used, Gemini gave consistently high results across the different decision points. As a disclaimer, we are not supported by Google, so we are not vouching for Gemini. It is purely research that we did, and this is what we found

out about Gemini, which gave these results. One speculation is that Google has a vast code base, and it could be that Gemini uses that code base in its training, and that is why you see this high result for Gemini. Another key finding was that the prompting technique chain-of-thought emerged as the most effective technique. Chain-of-thought is a method where you basically walk

through, or guide, the LLM to take a decision. We also found that few-shot prompting with working examples, meaning that you give some examples to the LLM showing how to prioritize vulnerabilities, helps achieve better effectiveness in prioritizing vulnerabilities. Simple techniques like zero-shot and priming yielded weaker results, meaning that an analyst using an LLM as it is, just as we use LLMs for rewriting our emails and so on, would not work here. We also found that ChatGPT has moderate and consistent

responses and excels with chain-of-thought, and DeepSeek with self-planning, which is another prompting method that aligns with its architecture. We found that Claude, which is expensive, is not that effective in vulnerability prioritization. Our key takeaway is that compatibility between the model and the prompting technique is crucial for optimal vulnerability analysis outcomes. So before you use LLMs for prioritizing vulnerabilities, you need to have the best possible prompt. Our study has some limitations as well. We didn't

evaluate premium models that might yield stronger performance, or models that are security focused; we used generally available LLMs in our evaluation. And the dataset that we had is very small. One crucial point is that you won't find a ground truth database which has a diverse set of vulnerabilities. Okay, some of the points that I already talked about: LLMs show moderate efficiency, Gemini consistently outperforms the other models, and prompting with examples significantly improves results. LLMs cannot yet replace expert judgment. This is something that we already know, that there should be, you

know, a human in the loop in order to prioritize and rank vulnerabilities. An LLM is effective as a decision-support tool rather than as a replacement for the human, but it is something that we can use to increase the efficiency of the analysts. Some references are here, and we have this paper available on arXiv. The paper will be presented next month at a workshop in the US. My master's degree student is Osama, who is outside the country, which is why he couldn't present. He did a great job on this whole work. The code is available on

GitHub. We use a very nice tool to automate this whole evaluation, and if you are interested you can check out this tool and use it in your research or in your work. Thank you very much for having me.
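The evaluation loop described in the talk (three trials per prompt, a majority vote over the trial outputs, then comparison against the ground-truth labels) can be sketched roughly as follows. This is a minimal sketch, not the authors' tool: the function names, the placeholder CVE identifiers, and the sample outputs are all hypothetical, and returning no answer when the three trials disagree completely is an assumption, since the talk does not say how ties were handled.

```python
from collections import Counter
from typing import Dict, List, Optional


def majority_vote(answers: List[str]) -> Optional[str]:
    """Return the answer given by more than half of the trials, else None."""
    if not answers:
        return None
    label, count = Counter(answers).most_common(1)[0]
    # Assumption: a three-way disagreement yields no decision.
    return label if count > len(answers) / 2 else None


def accuracy(predicted: Dict[str, Optional[str]], truth: Dict[str, str]) -> float:
    """Fraction of ground-truth entries whose voted answer matches the label."""
    if not truth:
        return 0.0
    hits = sum(1 for cve, label in truth.items() if predicted.get(cve) == label)
    return hits / len(truth)


# Hypothetical trial outputs for one decision point (exploitation),
# three trials per vulnerability; the identifiers are placeholders.
trials = {
    "CVE-A": ["active", "active", "poc"],
    "CVE-B": ["none", "poc", "active"],  # no majority
    "CVE-C": ["poc", "poc", "poc"],
}
ground_truth = {"CVE-A": "active", "CVE-B": "none", "CVE-C": "poc"}

voted = {cve: majority_vote(outputs) for cve, outputs in trials.items()}
score = accuracy(voted, ground_truth)
print(voted, round(score, 3))
```

Running the same comparison once per model, prompting technique, and decision point would give the kind of grid of scores the talk refers to; other metrics can be substituted for plain accuracy.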

Any questions for me? Yes. >> Hi, my name is Jeffrey. I'm from AusCERT, one of the people from Brisbane. Just out of curiosity, were there any other particular features that you used to be able to train from the data? Was it just the vulnerability statement and the hope that Gemini would have known the context of the product, or were there any other features that you put in to be able to train >> Yeah. >> and get those results? And also, leading on from that, why do you think it did so badly with well-being 43? >> Okay. Yeah. So, good question. We use all the features

that these different vulnerability databases yield; whatever features they provided to us, we use those features. There should be some work to extract those features and enrich them with more data so that you can improve the performance of the LLMs, and this is something that we left for future work. We had this scenario: what would an analyst do in order to prioritize a vulnerability? An analyst would go there, read some statement, and get a little bit of context around that vulnerability. So we used that notion of how an analyst would think and work, and that's

why we didn't go deeper, and this could be further enhanced in the tool. Your second question is about the impact on well-being. With this tool, it would help security analysts to efficiently find the answers that an analyst would normally spend hours and days searching for. So that could help not only the analyst but also the organization to triage those vulnerabilities better, faster, and more efficiently. I hope that I answered your question. >> Yeah, thank you. >> Yes please. >> Yeah, Patrick from Dan Secure, just

wondering, are you happy with the results? How did it perform compared to what you expected and compared to what you thought experts would have accomplished? Did you do any at least anecdotal analysis to get a feel for that? >> Since this research work is just the initial phase, a master's degree student who does not have any research experience starting out and doing all this in six months, it is really challenging to cover all the aspects. We did evaluate the performance on some other aspects, such as the overall categorization by the LLMs, and we found that LLMs can help

us in hitting that 80% accuracy in prioritizing vulnerabilities, meaning that they can reduce the workload on a security analyst who might otherwise be searching and prioritizing vulnerabilities and spending a lot of time. There is some work still to do, and we are open to anyone who can test and try our system in their workload and tell us how effective it is. This is something that we are looking forward to, and that's why I'm here: to find help from industry to test and try our system. >> Yeah, I mean we'd be happy to help you with that, and I think that an

evaluation suite on this would be really helpful. VulZoo doesn't tell you what the right answers are, but there are some other evaluation suites around that can tell you >> accurately what the right answer is, and you could test against it. >> Exactly. Yeah. Exactly. This was the data that was freely available, and we would be happy to cooperate and work together. Yeah.

If there are no other questions, let's thank the speaker again. Okay.
