
BSidesSF 2024 - Faux Data, Real Defense: ML advancements in data synthesis (Arjun Chakraborty)

BSidesSF · 2024 · 24:50 · 137 views · Published 2024-07 · Watch on YouTube ↗
About this talk
Faux Data, Real Defense: ML advancements in data synthesis (Arjun Chakraborty)

Access to realistic cybersecurity data is difficult to procure and expensive to simulate. Synthetic data generation has seen great advances with LLMs over the last year. Can ML based detection methods benefit from this? We dive into some methodologies and explore a use case.

https://bsidessf2024.sched.com/event/d2ad032125ef1f702552e75b9622e9e7
Transcript [en]

Let me introduce Arjun Chakraborty and his presentation on faux data, and it is Faux Data, Real Defense: ML advancements in data synthesis. I'm just going to say you crammed in every marketer's dream, because I'm a marketer and you got all the right buzzwords in there, my friend. So folks, a round of applause for Arjun, and our time and our attention and our love. Thank you, Arjun.

Thank you. Hey everyone, thank you for coming to my talk. I'm not going to repeat the whole title again, but it's Faux Data, Real Defense: ML advancements in data synthesis.

A quick introduction to me: I work within the detection engineering team at Databricks. Generally my interests lie in security and AI. I've been in the area for a while now, and I was previously at Nvidia, Guidewire, and Home Depot doing very similar things.

Diving straight into the talk itself: I think everyone in this room is aware that threat detection is a very difficult problem, and the way I see it, it breaks down into two specific issues. It's hard for detection engineers to get hold of the right kind of representative data to build out their detections, which in consequence leads to rules and models that have to be constantly tuned.

This is a problem that detection engineers have faced for a very long time and will continue to face as the attack surface changes, with enterprises growing and with AI coming in. So one of the solutions we've been thinking about is: can we use synthetic data to help detection engineers build better pipelines? Synthetic data has seen a lot of success recently in training machine learning models in healthcare, medical research, and software testing. The two questions we're trying to answer through this presentation are: can we efficiently generate adversarial synthetic data to improve detection, and how do we go about actually evaluating the quality of the synthetic data that is generated?

In my view the second question is a bit more important than the first, because with the LLMs that have come in it's actually very easy to generate data, but evaluating it is the hard part. Surprise, surprise, I will be talking about LLMs. The hypothesis is: can we feed an LLM a description of a dataset and a description of certain columns associated with that dataset, and with the power of LLMs generate synthetic data that mimics adversarial data, which consequently can help detection engineers build better pipelines?

Of course, with respect to how LLMs have blown up over the last two years, I can imagine everyone in this room is pretty skeptical of that thought. I picked out three very recent articles on the capabilities of LLMs like ChatGPT, and the goal of this presentation is not to create spectacular headlines like those. The idea is: using LLMs, can we actually give detection engineers the right tools to build out their detection pipelines, so that they are more efficient in the work that they're doing?

So, for the purpose of this talk, what kind of adversarial synthetic data is being generated? There are a lot of candidates you could generate synthetic data for, but for this talk I picked Kubernetes API server audit logs, and that is for a few specific reasons. One is the fact that enterprises around the world are obviously using Kubernetes as the framework for hosting containerized workloads. And API server audit logs are often used by detection teams to build out detections to defend their Kubernetes architecture.

In terms of the data itself, there are a few good things about API server audit logs that really help in evaluating synthetic data. First, they have a large number of columns to generate, which is a good way to test the fidelity of the data being generated. Second, because Kubernetes has been around for a long time, it has a pretty well-defined adversarial setup, so it's a good way to test the accuracy of the data being generated. Third, like I said, detection rules have been built on Kubernetes audit logs for a pretty long time, so it's possible to create precise detection rules. And the last one is that Kubernetes forms a very important part of the kill chain: if you're talking about privilege escalation, persistence, or credential access, it plays a pretty important part there.
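To make the column discussion concrete, here is a minimal sketch of what a single API server audit event can look like, written as a Python dict. The field names follow the audit.k8s.io/v1 Event schema, but the specific values are illustrative and are not taken from the talk.

```python
# Illustrative Kubernetes API server audit event (abridged).
# Field names follow the audit.k8s.io/v1 Event schema; the values are made up.
sample_audit_event = {
    "kind": "Event",
    "apiVersion": "audit.k8s.io/v1",
    "level": "Request",
    "auditID": "3f8a9c2e-0000-0000-0000-000000000000",
    "stage": "ResponseComplete",
    "verb": "list",                    # the API action taken
    "requestURI": "/api/v1/secrets",
    "user": {"username": "system:serviceaccount:dev:app", "groups": ["system:serviceaccounts"]},
    "sourceIPs": ["10.0.0.12"],
    "objectRef": {"resource": "secrets", "namespace": "", "apiVersion": "v1"},
    "requestObject": None,             # populated (e.g. a Pod spec with "kind": "Pod") on create requests
    "responseStatus": {"code": 200},
    "requestReceivedTimestamp": "2024-05-04T17:00:00.000000Z",
}
```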

So what was the experimental setup I used to generate synthetic data? We were testing four different models, and the idea was that we'd generate a prompt which would be fed into the models; each of those models would then be asked to generate synthetic data associated with four different attack techniques, with multiple samples for each, which would then go through an output parser and finally into the dataset being created. So what attack techniques

are being simulated? Credit to Stratus Red Team, which has a list of attack techniques that are generally exercised against Kubernetes and show up in its audit logs; Kubernetes also has a specific MITRE ATT&CK matrix. What I did was pick out four specific attack techniques associated with credential access, privilege escalation, and persistence, in increasing order of difficulty. The idea was that we'd use an LLM to generate synthetic data simulating these attacks, and because these attack techniques have well-defined detection rules, that would later help us test the synthetic data generated pretty easily.
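The talk names the four techniques later (dumping all secrets, stealing a pod service account token, running a privileged pod, and escalating through node proxy permissions), but it does not show how they map to audit-log fields. The sketch below is a hypothetical mapping I'm adding for illustration; the verb and resource values are my assumptions, not taken from the slides.

```python
# Hypothetical mapping of the four simulated techniques to the audit-log fields
# a detection would plausibly key on. The talk associates the set with credential
# access, privilege escalation, and persistence but does not give this mapping;
# every value below is an assumption for illustration only.
ATTACK_TECHNIQUES = {
    "dump_all_secrets": {"verb": "list", "resource": "secrets"},
    "steal_pod_service_account_token": {"verb": "get", "resource": "secrets"},
    "run_privileged_pod": {"verb": "create", "resource": "pods"},
    "escalate_via_node_proxy_permissions": {"verb": "create", "resource": "clusterrolebindings"},
}
```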

I feel like this is the core of the talk itself. Generally, when you're looking at synthetic data in cybersecurity, there is no good benchmark to actually evaluate the quality of the synthetic data being generated. One of the things I mentioned earlier is that it's pretty easy to generate it, but when I was scrounging around I couldn't find a way to actually evaluate it. So, based on work done earlier in different fields, I came to the conclusion that three metrics would be the best way to look at the quality of the data itself. The first one is fidelity: how well do LLMs do in terms of following instructions and generating the columns required?

The second one is reproducibility: for each attack technique and each model we generate multiple samples, so how easy is it to reproduce results over multiple runs of the LLM? And the last one is accuracy. There are two parts to accuracy: is the data generated accurate in itself, and can we actually use the synthetic data generated to build detections on top of it? Because that's the end goal: the end goal is to produce adversarial data to feed into our detection pipelines, so the accuracy part becomes extremely important.

This is an example prompt for synthetic data generation. I don't think there's a lot of mystery in it; it's pretty cookie-cutter in terms of prompts that get fed into LLMs: you have an initial description, you have a description of the columns, and you have an attack description as well. LangChain is your friend here; there are multiple APIs for calling different LLMs, so any of them can be used. There's also the possibility of providing examples within the prompt itself, and of course there are more advanced techniques as well, where you're using something like RAG or you're fine-tuning, but in this case we were basically just doing one-shot prompting of the model itself, which is probably also why we ended up selecting a set of models with very similar capabilities for our evaluation.
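The exact prompt text and code aren't shown in the recording; the snippet below is a minimal sketch of the one-shot setup described, using LangChain with an OpenAI chat model as one of the interchangeable backends the speaker mentions. The package paths, the placeholder descriptions, and the request for YAML output are my assumptions.

```python
# Minimal sketch of one-shot synthetic-log generation with LangChain.
# The descriptions and model choice are placeholders, not the speaker's exact prompt.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI  # any chat-model wrapper could be swapped in

prompt = ChatPromptTemplate.from_template(
    "You generate synthetic Kubernetes API server audit logs as YAML.\n"
    "Dataset description: {dataset_description}\n"
    "Columns to include: {column_descriptions}\n"
    "Attack to simulate: {attack_description}\n"
    "Return {n} audit events, one YAML document per event."
)

chain = prompt | ChatOpenAI(model="gpt-3.5-turbo", temperature=1.0) | StrOutputParser()

raw_samples = chain.invoke({
    "dataset_description": "Kubernetes audit.k8s.io/v1 Event records",
    "column_descriptions": "verb, requestURI, user, objectRef.resource, requestObject.kind, ...",
    "attack_description": "an actor enumerating all secrets across namespaces",
    "n": 5,
})
```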

Going on to the results themselves, the first section deals with the models' performance in terms of fidelity. When we were thinking about fidelity, there were a few things in my mind as I tried to figure out how to look at it. One was: is the model generating all the columns in the first place?

Based on the instructions and the prompt, whatever I did the model would always generate the columns, but in a lot of situations it would simply put null or none values in certain columns, and that was specifically the case with the more powerful models shown here. So for measuring fidelity we calculated the percentage of values that were null or none within each generated sample; the idea is that the lower the percentage, the better.
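The measurement code isn't shown in the talk; a minimal sketch of that null/none percentage, assuming each generated sample has already been parsed into a flat Python dict of column values, could look like this.

```python
# Fidelity sketch: share of null/none values across a generated sample's columns.
# Assumes each sample is already parsed into a flat dict of column -> value.
def null_percentage(sample: dict) -> float:
    if not sample:
        return 100.0
    empty = sum(
        1 for v in sample.values()
        if v is None or str(v).strip().lower() in {"", "null", "none"}
    )
    return 100.0 * empty / len(sample)

# Lower is better; averaged over all samples for one model and one attack technique.
def mean_null_percentage(samples: list[dict]) -> float:
    return sum(null_percentage(s) for s in samples) / max(len(samples), 1)
```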

Like I said, we were evaluating four different models, and as you can see, most of them range somewhere around 10 to 15%. GPT has a couple of higher values for one specific attack technique, and Llama 2 70B tends to do pretty well: it has single-digit null percentages, which means very few values within the synthetic data it generated were null. Now, as we go on in this journey of figuring out evaluation metrics, it's very evident that Llama tends to do the best here.

But when I was looking through the examples themselves, I realized that counting null or none values isn't always the best way to go. This is an example of the log generated by Mixtral, with Llama on the right. As you can see, Mixtral actually follows the instructions to the T and generates the columns and the data as I had instructed it to, while Llama goes off on its own a little bit, based on the knowledge it has of Kubernetes audit logs, and ends up missing a bunch of columns I had specifically asked for and changing the schema as well.

Which means that even if the percentage or count of null and none values is a decent metric, it's not the best way of dealing with fidelity on its own. So whenever you are generating synthetic data, it's very important to look through the samples and make sure they at least follow the schema you instructed the model to use.
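As a companion to the null/none count, a simple schema check in the spirit of that advice might look like the sketch below; the required-column list here is a placeholder, not the talk's actual schema.

```python
# Schema-conformance sketch: flag generated samples that drop required columns
# or invent unexpected ones. REQUIRED_COLUMNS is a placeholder list.
REQUIRED_COLUMNS = {"verb", "requestURI", "user", "objectRef", "responseStatus"}

def schema_issues(sample: dict) -> dict:
    present = set(sample)
    return {
        "missing": sorted(REQUIRED_COLUMNS - present),
        "unexpected": sorted(present - REQUIRED_COLUMNS),
    }
```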

The second metric is how well the model does in terms of reproducibility. The goal for the reproducibility metric was this: when you are building detections based on adversarial data, more diversity in the data is actually better, because you can then build more robust detection rules. But when I was going through the data, I felt there were a few columns that needed to stay the same regardless of the diversity of the data. So the question behind the reproducibility aspect is: if you are prompting the LLM over and over again to generate multiple samples, in what percentage of them are the most important columns repeated?

In this case we are looking at resource, verb, and requestObject.kind. Regardless of the variations in the attacks being generated by the LLM, these three columns should not change. This is just a representation of what we're seeing on Mixtral and GPT-3.5 Turbo: there is some variation, and you'll see a lot of none values on the Mixtral side of things, and on GPT-3.5 as well, even though the idea was that these specific columns should generally be the same across different samples.
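Again, the measurement code isn't shown. One plausible reading of the metric is the share of samples in which each key column carries the expected value for the technique, which a sketch along these lines would capture; the expected values and the flattened "requestObject.kind" key are my assumptions.

```python
# Reproducibility sketch: for each key column, the percentage of generated
# samples that carry the expected value. Assumes samples are flattened dicts.
KEY_COLUMNS = ("resource", "verb", "requestObject.kind")

def reproducibility(samples: list[dict], expected: dict) -> dict:
    scores = {}
    for column in KEY_COLUMNS:
        matches = sum(1 for s in samples if s.get(column) == expected.get(column))
        scores[column] = 100.0 * matches / max(len(samples), 1)
    return scores

# Hypothetical usage for the "dump all secrets" technique:
# reproducibility(samples, {"resource": "secrets", "verb": "list", "requestObject.kind": None})
```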

And these are the results for the reproducibility side of things. As you can see, as a consequence of Llama 2's results from the previous metric, it did not actually generate any of the columns I needed in this case, which is why I couldn't produce results for it. Mixtral does really well across almost all attack sequences, and both versions of GPT-3.5 do pretty decently as well. In this table, the higher the percentage values, the better. I saw some of the models struggle on the run-a-privileged-pod technique, because it was a very complicated attack in terms of the columns and the data that needed to be generated in those cases.

In terms of escalating through node proxy permissions, you'll see most of the models do really well, with Mixtral beating all of them. The last metric we're looking at is the accuracy of the synthetic data being generated for detections. One of the things I mentioned earlier was that we picked these four attack techniques because we already had specific detections built out for them; they were standard attack techniques with standard detection rules already written.

Since the goal of generating synthetic data is ultimately about powering detections, we wanted to see whether any of the standard detections that were available were effective on the data generated. As you can see, for all the models, the simplest attack technique, which is just listing or dumping all the secrets, has very high values: all of the detection rules work on the data generated for it. Llama 2 struggled a little bit on the two attack techniques in the middle, while all the models did really well on privilege escalation through node proxy permissions.

I think some of how the synthetic data turns out here comes down to the complexity of the columns themselves. With an example like node proxy permissions, or dump all secrets, where you're looking at specific verbs or resources, it does make the detection part easier. You'll see most of the models struggling a little more on things like running a privileged pod or stealing a pod service account token. In this chart as well, higher values mean better performance.
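The detection rules themselves aren't shown in the talk; as a minimal sketch, the accuracy check amounts to replaying each synthetic event through an existing rule and counting hits, along the lines below. The dump-all-secrets rule here (verb "list" on the "secrets" resource) is an assumption about what such a standard rule keys on.

```python
# Accuracy sketch: fraction of synthetic events that trigger an existing detection.
# The example rule for "dump all secrets" is an assumed verb/resource match.
def detect_dump_all_secrets(event: dict) -> bool:
    return (
        event.get("verb") == "list"
        and (event.get("objectRef") or {}).get("resource") == "secrets"
    )

def detection_accuracy(events: list[dict], rule) -> float:
    hits = sum(1 for e in events if rule(e))
    return 100.0 * hits / max(len(events), 1)

# e.g. detection_accuracy(generated_events, detect_dump_all_secrets)
```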

When I was working on the research associated with this, one of the things we wanted to show was that at the current moment there is no model X that is simply better than model Y; it genuinely depends on the use case you have and the model you have available. You can see that for fidelity, most of the models had null or none percentages of less than 20%, with a couple of exceptions. A lot of models do very well on the reproducibility side, but Mixtral had the best performance on that end.

On accuracy, GPT-3.5 and its variations did very well, especially on something like escalation through node proxy permissions. But in general, based on the attack technique you have and the model you have available, it's important to figure out the right evaluation metrics for the synthetic data you generate, because too often we charge ahead with these large language models and use the data being generated without really evaluating its quality.

There are some challenges, and some things to look ahead to. One of the goals was to show that, yes, synthetic data can be used for some amount of purple teaming by detection engineering teams, but I do believe that currently synthetic data cannot replace red teaming or sandboxing of adversarial environments. The metrics we're using, and the metrics being developed within this area of research, are still a work in progress. Like I said, I couldn't find a benchmark for synthetic data in cybersecurity.

So it's a lot about trying to figure out your own benchmarks, and I do believe that given the variation across enterprises and security teams, we will see different sets of benchmarks that are applicable to different teams and different organizations. The other part that I have not covered in this presentation is that synthetic data can also get expensive. Even though a lot of the APIs associated with large language models are becoming cheaper, and it will probably be cheaper than having your own red team, I think it'll end up being a mix and match of both: figuring out what kind of data suits you and what kind of attack techniques or sequences you would want to replicate.

Some of the things I'd want to see looking ahead: the evaluation framework I just laid out has definite scope for tweaking. The quality of the data being generated is currently coming from one-shot prompting, so like I mentioned before, it's probably worth using fine-tuning or RAG to improve the results of the model itself, especially since different models have different tweaks associated with them in terms of prompting as well.

One of the reasons I kept the prompt the same for all the models being evaluated was that I wanted to standardize the prompt and the parameters across models. And the last part: I showcased a limited set of models within the evaluation framework. Since I did this work, lots of other new models have come in: DBRX, and new variations of Claude. There were two reasons I did not include GPT-4 in the evaluation.

One was that I wanted to evaluate models that had similar capabilities: if you look at the list of models I had before, whether it's Llama 2, GPT-3.5, or Mixtral, they all have similar capabilities. The other is that there is also scope for using GPT-4 as an LLM-as-a-judge. That was an idea I was thinking about as well, where you could use a stronger model's output to evaluate the synthetic data generated. But those are all, I would say, still works in progress; the idea, of course, is to add more models into this framework: GPT-4, Claude, DBRX, all the new ones that are coming in.

A quick shout out to the Databricks detection and response teams for all the help and feedback they've provided during this whole process, and that's it for my presentation. Are there any questions?

Excellent, there are a couple of questions here for you, Arjun. Where else do you use LLMs in your security operations work?

Right now I think it's pretty limited. It depends on what you'd consider an LLM: if you're talking about BERT-based encoder models, we do use them. But are we using GPT-4, are we using DBRX or all these other bigger models across our ecosystem? Probably not yet.

Not yet, okay. Another question came in here, and if there are any other questions, folks, feel free to add them in Slido: are you using linting rules to filter out syntactically invalid YAML?

I was using an output parser through LangChain, which was dealing with the issue of incorrect YAML being generated. So, the ability of the LLM to generate data in the format you want is a pain, at least in my experience. There are some forcing functions in LangChain that do help you push an LLM to generate data the way you want, but you still face issues. So I would say yes, but not exactly the tool the person is referring to; I do use LangChain's output parsers.
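The parsing code isn't shown in the talk; in the same spirit as that LangChain output-parser answer, a bare-bones sketch of filtering model output down to syntactically valid YAML documents (using PyYAML, which is my assumption, not necessarily what the speaker used) could look like this.

```python
# Sketch: keep only syntactically valid YAML documents from raw model output.
# PyYAML is assumed here; the speaker used LangChain's output parsers instead.
import yaml

def parse_valid_events(raw_output: str) -> list[dict]:
    events = []
    for doc in raw_output.split("---"):
        if not doc.strip():
            continue
        try:
            parsed = yaml.safe_load(doc)
        except yaml.YAMLError:
            continue  # drop syntactically invalid samples
        if isinstance(parsed, dict):
            events.append(parsed)
    return events
```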

They also asked: are there any other pre-processing steps you use before training your model, and could you share any of that?

In this case I wouldn't consider it a trained model, because the one thing we're doing is one-shot prompting: we're basically asking the model, based on the map of the world it has, the data it has already seen, to generate data just from the prompt itself.

We do give it a few examples sometimes: you can have prompts where you provide specific examples and ask it to generate data based on those.

Okay, well, thank you. We've got a little parting gift here for you. Ladies and gentlemen, this is a gift from Socket Security; our friends at Socket have been very gracious with their time and their donations. Thank you to Arjun for coming today, and folks, thank you to all of you for coming.