Can LLMs Take Our Jobs? AI-Assisted Detection Research

Name: Can LLMs Take Our Jobs? AI-Assisted Detection Research
Uploaded: 2024-11-24
Duration: 19 min 4 s
Description: Threat detection engineering typically requires significant expertise to write rules, validate them with triggering commands, and devise rule bypasses to stay ahead of adversaries. This talk presents empirical research testing whether large language models can accelerate this labor-intensive process

BSides PDX · 202419:0481 viewsPublished 2024-11Watch on YouTube ↗

Speakers

Maam Khan Darsh Smith Uday Mayapan RC Oric

Tags

CategoryResearch Technical

TopicAI Security Detection Engineering Threat Intel

TeamBlue

ResearchEmpirical Research Technical Deep-dives

StyleTalk

About this talk

Threat detection engineering typically requires significant expertise to write rules, validate them with triggering commands, and devise rule bypasses to stay ahead of adversaries. This talk presents empirical research testing whether large language models can accelerate this labor-intensive process. The authors benchmarked GPT, Claude, and Gemini across rule description accuracy, command generation, and bypass suggestions, finding newer models substantially outperform older generations but with interesting trade-offs in safety guardrails.

Show original YouTube description

The detection research process consists of performing behavior that will trigger existing detection rules, performing behavior that accomplishes the same adversary objective without triggering the rule (rule bypasses), and updating the rule to catch this behavior. This existing process requires significant security expertise and is quite time consuming even for those with the requisite knowledge. On the other hand, Large Language Models (LLMs) have the potential to assist with performing exactly this type of complex, time consuming task. Threat research @ Cisco Talos --- BSides Portland is a tax-exempt charitable 501(c)(3) organization founded with the mission to cultivate the Pacific Northwest information security and hacking community by creating local inclusive opportunities for learning, networking, collaboration, and teaching. bsidespdx.org

Show transcript [en]

[Music] so I'm my name is maam Khan and I am a software engineer at Cisco I am joined by my colleagues uh wonderful colleagues that have helped me with this research are Dar Smith U mayapan and RC oric I'm going to give you a little motivation of why uh we uh pick this up um and uh um then I will move on to the main hypothesis that uh we are going to um uh test out uh and then uh some Rules of Engagement uh that uh we have devised for this um and then uh we will move on to the results uh that we have uh I will conclude it with some future work and uh

uh some key takeaways uh so for uh for uh the problem that we are trying to solve is uh that of the efficacy of our detections um I will take an example here that would be better to explain let's say I have a rule uh that test out these uh conditions for disabling security tools so the engine looks for these conditions uh which are the uh key keywords that you will find in the logs and then um you want to uh test out the efficacy of it in the lab environment so you run a am uh something like uh system CDL uh stop IB table that will uh generate these uh rules and uh then uh once you have

established the uh efficacy of uh uh your rule you want to see if you can generate uh some bypasses that an adversary was you would use so for example you would run a command like pill uh IP table and it will generate uh logs that uh will be not CED by these uh uh these rules so you can update your rule so that next time uh an adversary can not bypass it so this is uh basically what uh we want to uh see if llm can help uh improve and and make this uh process faster um so the the advantages that we will get from this is uh the the threat detection engineering uh process would

become uh faster we will be able to um uh on board new uh research Engineers uh in a in a timely manner and uh uh the the reason uh we want to test out the uh description aspect of our uh work is because specifically for these uh analyst who are lacking in security knowledge uh they don't have years and years of experience and uh we want to see if llm can uh overcome this this bridge um so U as I mentioned that U we want to prove one main hypothesis which is our llm good at speeding up this detection process um we have further divided this main hypothesis into three uh sub hypothesis of first we want to see if

llm provide uh good description of the threat detection rule uh second we want to see if they provide a good uh uh triggering command uh to execute that uh that that rule and third um if LM are good at bypassing that command we have certain um uh Rules of Engagement which were uh basically we did not use any uh customer uh data um we um we we average open rule framework like Falco uh and um and sigma uh if you're not familiar with Falco then Falco is um basically rule engine and Rule set for uh kuet based uh containerized engines uh and sigma is uh for uh any generic uh log based system it could be

um your cloud-based detection or your endpoint or network based detection um we did not train our model we used the uh pre-train model we did not use any um fine tuning on it because the reason was we wanted to see uh what an ordinary um analyst or research engineer has access to and what are the advantages that they can uh benefit from um with that um let's move on to the uh experimental uh uh part of uh the research so um we broke down our main hypothesis which is uh the detection engineering uh can be made faster with llm we wanted to test out first uh if llm uh can be used to uh describe the rule accurately

because this will help our new uh research engineer who does not have years and years of experience in the field um then we want to see if llm are good at triggering the rule and finally uh we want to see if uh llm can bypass a rule mimicking the uh behavior of an adversary uh the process that we followed for uh each of the model is this so first we understand uh the rule what conditions uh it U uh it uses what are the keywords it is looking for in the the in the logs and then we build the context uh that we want to provide to uh llm and uh whatever the suggestions the llm provide um it could

be uh the description of the rule or uh it could be trigger uh command or um it could be bypassed that it suggested we run it in a controll lab environment and uh we document our finding so since we are comparing um um at least three models here we wanted to Benchmark them and the criteria for benchmarking was we assigned a score of one if the llm was uh uh good in uh suggesting a trigger or uh if the uh description was accurate um and if the bypass that was suggested was also accurate so we assigned a score of one otherwise uh if it didn't provide a good uh good explanation then we use a score

of zero for example here for rule one overflow attempts uhgb uh provided us a good um trigger commands and uh uh bypass command so it was assigned one the the other uh two were assigned zero because they were not able to provide good uh results uh so with that our first hypothesis which is uh the accuracy of llm knowledge rules uh we uh tested it on uh following uh matrices we wanted to see if it is uh correct um we wanted to see if the if the description is complete and uh does it uh match with the semantic uh definitions in the official rules repository and uh we performed this experiment uh with two type of uh test

first was we used the zero short uh uh prompting where we did not provide any uh context uh to the LM and we wanted to see how much uh information uh the llm knows about the rule uh with that we found out that uh llm um made things up and uh did not quite know um everything about the rule for example some critical conditions were missing um and um then we tried to uh test it with um one short u prompting where we actually provided the context uh um in the form of the actual rule definition from the repositories and then um all the models were U very accurate in describing it uh because that's what uh LMS are good at at uh uh

natural Lang M processing um for the second hypothesis uh which is the effectiveness of llm for generating uh commands uh to trigger the behavior of the rule uh we measured it on um accuracy and uh the relevance of the of the rule and how complex it was to set it up and the time it took to um run run those commands in the and control life en environment uh for uh our observations were that uh in the older models uh which was GPD 3.5 when we performed it uh uh earlier it hallucinated uh commands uh we had to do a lot of back and forth to uh get the right behavior and uh get the rule uh

triggered um but the latest model which are CLA um son 3.5 and uh J 4 o uh we saw that um the uh the accuracy actually improved from 90 to 100% so here here here are some uh quantifiable results uh that we perform on a rules set from Sigma um and you can see that GPT and uh and Claude are uh the suggestions provided by uh these models are 100% accurate followed by um Gemini which was like 90% And the criteria for accuracy was um we wanted to see if the uh command is correct um how complete it is uh does it provide the functionality or not um for uh the final hypothesis which is um Can it mimic the behavior of an

adversary uh by byp passing the rule uh we measured it on uh the accuracy like the like before and uh how many times we had to iterate the prompting process um because uh there are a lot of guard rails in place for uh these models and they don't uh give you the uh bypass right away we wanted to measure how ethical they are um and U our earlier uh testing with older models um they were uh they were uh giving us very uh impractical and theoretical um instructions which didn't really bypass the behavior but the later model um were able to give us accurate um instructions but one thing that we noticed in the latest model is

specifically uh GP e they have gone down in uh in terms of the uh bypasses they provide so they are too eager to give you um bypass suggestions uh for example if you look at these stats um you notice that um CLA is the most resilient one and in terms of um actually that's not this slide is this slide so if you look into the um average bypass uh iterations that you have to do you can see that that average time average number of uh prompts that you had to give to Claude uh was about three um whereas it only took uh one uh prompt for the other models so there are um too eager to give you uh bypass

suggestions let me go back one slide where um we are showing the accuracy of bypass suggestion um you can see the cloud is actually giving you 100% they were um very resistant in giving you um the bypasses but when they did they were 100% accurate uh followed by uh CH GPD 4 which was about 90% accurate and then uh Gemini U the criteria for U uh determining the accuracy is the same as uh the one we use for uh triggering of command which is correctness uh completeness and functionality uh we already covered this one so here's an example of uh how many iterations I had to do for um this um uh rule to ask

LM to suggest a uh a trigger command so I asked um uh give me please give me a command to trigger this rule uh it says that I I I'm sorry I cannot do it because that's unethical and uh insecure so I tell it that um uh you are a helpful assistant and we are doing it for a good cause we want to improve the security um and uh we are doing it in a safe environment but it didn't listen it says oh I understand I'm a helpful assistant and you are doing it for a good cause but uh sorry I canot do it um then I kept doing it and uh uh unfortunately I couldn't come up with a

better prompt I guess or I couldn't break it this way um um I'm sure that there is a way to make it spit out if you do it right so what I did in my benchmarking or in our bench marking we did not consider these values because that they they were going to offset um the accuracy of uh this model because when it did uh respond it did it respond very accurately so the future uh directions we want to take are we want to suggest um the updates that uh we find out to um like these rules repository for example Sigma we can suggest that these are the bypasses that uh could still be made um

by adversary so we can suggest uh we can open up ERS and and fix them for the those um and uh we also want to be able to use uh llms to write rules from a scratch um or write rules from Reading uh research paper uh research articles where you uh describe a particular exploit and we want to see if they are effective in doing that and then it would be really nice um if we could just automate this process and get us out of our job um in conclusion I I uh we feel that we were able to uh prove out all three hypothesis that um llms are good at describing um the rules so that it is

very beneficial for a new um a threat engineer or researcher or an analyst who does not have years of years of experience they are pretty good at explaining or instructing him about those new rules uh we were able to verify uh that they are good at uh suggesting uh uh trigger commands and they are also good at um bypassing those commands um however the newer models uh more specifically GPD seems to have the red in uh in the security guard rails that we had observed in the earlier iterations um it maybe they are trying to create a balance in the usability and the sa safety of these models but that's our observation that uh whereas the CL

was really good in terms of uh these guard rails which were kind of a hindrance for us because they were not giving us the commands that we wanted to test but overall it's a good because they have a ethical uh point of view in presenting these these uh bypasses to bad guys um that's all I had um um if you guys have any questions uh please yes you do this all manually have you automated the testing that's so we have done it manually but that's the direction we want to take we want to be able to automate this process and I think we can we can do it first we wanted to see if we can do it manually

and what is the efficacy of it like if a newel comes out then then running it again to see if it's improved or degraded yeah that of course yes yes

accuracy what was your sample SIUC so sample size was um so we we did not did it on a scale we wanted to since we were a team of small Engineers we were doing it manually and and that too in our spare time so we used to do it like uh one at a time but in total we did like 20 of them and out of 20 the the hallucination part was actually quite a lot in in the first iteration it would almost hallucinate all the time we had to do a lot of back back and forth but the recent one I only so out of like 10 iterations or 10 rules that I showed here

um CH GPD only hallucinated in one and uh um claw did not hallucinate at all like they restrained from giving you a suggestion but when they did they were damn accurate so uh yeah these are the these are the actual one that we have done in this iteration that we tried with new model but in previous iteration we had already done like 10 10 of those and that's when we saw more hallucination uh and that testing was with ch GPD 3.5 actually we want to expand our our models and uh want to see if uh what what are the outcomes from those that don't have any guard rails yeah so but we haven't done it yet if not thanks for being um

patient and uh uh coming to this talk I really appreciate it [Music] w

Can LLMs Take Our Jobs? AI-Assisted Detection Research

Related talks