
GPT-like Pre-Training On Unlabeled System Logs For Malware Detection - Dmitrijs Trizna

BSides Prague · 35:14 · 345 views · Published 2024-04 · Watch on YouTube ↗

First of all, I think it's sensible to say kudos to the organizers — it's so cool to see everyone here, slowly building this network of professionals working in Prague and nearby cities. Really, thank you for organizing this. It's super cold these days, and I like that it makes you understand we are doing this work together — that's cool. Today I will share some research that is more of an independent investigation than something affiliated with a company. As you see, I work as a security researcher at Microsoft, but this is independent research into how we can foster techniques that empower GPT-like systems for our own domain.

A brief bio here, just because psychological studies show that if you trust my authority on this question, you will remember the material better. Basically: more than ten years of experience across blue teams and red teams — I actually did red teaming for several years, then switched back to AI-powered detection engineering. Two master's degrees, one in network security and one in data science, a PhD ongoing at an Italian university, a bunch of certificates, and presentations at different venues. So there is some experience in this domain, and we can start.

I don't want to overburden you with low-level details like terminal demos; I want you to enjoy the talk and come out of here knowing something new. We will go over the concepts that power this kind of multidisciplinary research from various directions, and you will get an intuitive understanding after this talk. We will start with the threat model, which actually defines the problem we are trying to solve. Then I will show what dynamic malware analysis is, for those who are not very familiar with the concept. Then we will touch on the AI topics — basically self-supervised learning and Transformer neural networks — so you will get to know what they are.

Then I'll cover the experiments we did in this research, how it all works, and some ideas for future work. If you are interested, feel free to reach out to me afterwards; maybe we can work together on future ideas. So, the threat model. Speaking about malware detection, a lot of the time the discussion is only about PE files — EXEs and DLLs, how we detect suspicious stuff in this or that executable. But the threat perimeter is actually much wider, and that is only one operating system; if you think about others, there are different formats and different ways to execute code.

Moreover, there are techniques that use legitimate software to do malicious stuff, like living off the land. So there should be a common denominator over all of these things, and that is system logs — basically the behavior of a system. Like with a human: we can look at how a human behaves and infer their intentions. The same happens when we look at system logs. That's why a huge set of industry professionals work in this space: when we speak about raw file formats it's mostly incident response, digital forensics, and the AV vendors, while 90-plus percent of people in conventional SOCs — security operations centers — work in the system-log space.

This stuff is fed into the SIEMs, where detection engineering happens, and then the alerts go to the security analysts. For those not that familiar with system logs and what they consist of: it's machine language. In contrast to natural languages like English, French, or Czech, machine language is fairly uniform across different formats. You have key-value pairs, mostly short machine strings whose values vary between fields — a hash can be completely random, a file path is less random but can still take different values — and often there is a timestamp. So we can use this information to build an AI solution on top of it.

There are challenges in the system-log space, though, and they are hard to solve. It is hard to acquire supervised datasets, and in the system-log space it's even worse. For those who don't know, supervised data is when a human labels the data, and that label goes into the AI model as knowledge. Here it's hard to even know what to label: a single log entry, or maybe all the data from a compromised system? If so, within what time period?

And even if we answer those questions, it's expensive — human time has to be spent on it, and we don't want to spend human hours labeling those entries, because there is a huge amount of them: enterprise businesses generate terabytes of logs per day. There are many more problems; it's intractable with human resources. That's why we can use dynamic malware detection as a proxy task. What I mean by that: in dynamic malware analysis 101, you take a PE file, pass it through a sandbox or an emulator, and you get a representation of what happens in the system — process chains, command lines, what API calls it makes, and what operations it performs in the registry, the network, and the file system.

Passing thousands of samples through a sandbox is hard at deep-learning quantities — these models require huge datasets. For this research I used an emulator, which is basically a replica of the operating system. Luckily, Mandiant maintains one of these emulators, Speakeasy, focused solely on malware tasks: in basically four lines of code you can get a behavioral representation of what malware does on a Windows system. Emulators are not perfect — it's a software-engineered replica of a real system, so there are errors across different malware families — but it's still tractable: 90, maybe 95-plus percent of samples still express some behavior in the emulator.

And we can improve this: for example, while researching this, I contributed some extra functionality to the Mandiant solution that I had seen in malware datasets but that was not yet in Speakeasy. So what do those emulation reports look like? Network activity the malware performed, registry access, file access — even the contents of a file dropped on the system are there — and API calls. This really resembles how actual system logs look, so we can apply AI logic to it, and if it works well on this data, we can presumably transfer it to real system logs.

The process might look like this. I will cover what self-supervised learning is in a moment, but basically: we want to achieve good self-supervised learning for detection on system logs, but there are no benchmark datasets, and the data is sensitive — that's why we don't have public datasets. If I take system logs from your organization, they will contain stuff you don't want to be public, right? That's why we can consider malware detection as a proxy task: there are datasets, there is rich literature, and there is lots of research in this domain, so we can evaluate the relative performance of methods — self-supervised learning algorithms — and then transfer those algorithms to actual system-log analysis.
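As an illustration of the emulation step described above, here is roughly what those "four lines of code" look like with Mandiant's Speakeasy — a minimal sketch, assuming the API from the project's documentation, with a placeholder sample path:

```python
# Minimal sketch of emulating a PE with Mandiant's Speakeasy
# (`pip install speakeasy-emulator`); "sample.exe" is a placeholder path.
import speakeasy

se = speakeasy.Speakeasy()
module = se.load_module("sample.exe")  # parse the PE; nothing runs natively
se.run_module(module)                  # emulate: API calls, file/registry/network ops
report = se.get_report()               # JSON-like behavioral report, as shown above
```

The report is the machine-language blob the rest of the pipeline consumes.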

So what are the methods we want to explore? Self-supervised learning and Transformer neural networks. Let's start with the neural networks. Deep learning is a branch of machine learning that builds neural networks, and in the second decade of the 21st century, convolutional and recurrent networks ruled the world. Convolutions were first introduced at the end of the 20th century by LeCun — basically a method to infer patterns from a two-dimensional structure like images, but it can be applied to one-dimensional structures as well, like sequences of strings.

Then, in 1997, the LSTM — one of the most common manifestations of recurrent neural networks — was introduced, which basically passes data through the same neuron again and again. These methods ruled the world in the second decade of the 21st century, but everything changed when Google released the "Attention Is All You Need" paper, which introduced the Transformer model. Maybe you've seen this image, maybe not — that's what rules the world now and powers the LLMs and basically every other success in AI today. For example, in GPT the T stands for Transformer: Generative Pre-trained Transformer.

What the Transformer has inside is the attention mechanism, which is distinct from recurrence or convolutions: a weighted average over the tokens within a layer. It was first used for machine translation, but then became ubiquitous across all fields. If you look at the state-of-the-art solutions of 2016, each domain had its own fine-tuned approach — computer vision engineers didn't work in the NLP domain and vice versa — but today the Transformer has overtaken all those domains and is basically the best architecture for any of those tasks. Since then there have not really been clever breakthroughs; it's just scaling up and good engineering of the existing technique.

The models keep growing and, surprisingly, give better results — and that is indeed surprising even for experts in the industry. The main contribution of OpenAI, for example, is that they were the first to say: why don't we try spending millions on this and see what comes out? Before that, people thought they needed something else; OpenAI essentially said, let's run this thing for longer than everyone else — and something emerged, and the systems became better and better. Why is that? The main reason is that the Transformer scales up really well; it is an efficient architecture that parallelizes extremely well.

But the idea that enables this scaling is self-supervised learning. You cannot scale to huge quantities with supervised data — you don't have enough humans to label everything — yet GPT is trained on the whole internet, right? How do they do it? Through self-supervised learning, SSL. I will use this abbreviation, although it's overloaded in our domain — yes, HTTPS, thank you. So how does SSL work? You have a huge unlabeled corpus — Wikipedia, Reddit, books, the whole internet — and you do pre-training on it using SSL methods. That's where 95-plus percent of the compute goes; the majority of the work is done in this pre-training.

The model this produces is a so-called foundational language model, which should then be fine-tuned. You can do that with smaller but labeled data, for a specific task: ChatGPT does question answering, basically instruction following, but you can have other tasks, like malware detection, and fine-tune many models out of the same foundational model — which is cool. So how do these techniques work on natural language, which is textual data as well? There are two competing methodologies. One is the GPT-like approach, more scientifically called autoregressive, and it's dumb as anything: it just predicts the next token, that's it. You give it a context and it tries to predict what comes next — given "nothing is", the model should predict the word "impossible", if that phrase is in the data.

The other idea is masked language modeling, for BERT-like systems: you take the whole sequence and corrupt it — you hide specific tokens inside — and the model should predict what is hidden; in this case, instead of the mask it should output "brown". These two competing methodologies worked alongside each other in 2017, '18, '19, but over time the GPT-like autoregressive approach has looked more efficient. In our case I tested both of them, and we will see the results. One interesting notion for our domain is that maybe we don't need those huge quantities of compute.
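The two pre-training objectives above differ only in which training pairs they generate from unlabeled data; a toy illustration in plain Python, with invented log-like tokens:

```python
import random

tokens = ["CreateFileW", "<path>", "WriteFile", "<hash>", "ExitProcess"]

# Autoregressive (GPT-like): from every prefix, predict the next token.
ar_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Masked LM (BERT-like): hide some tokens, predict what was hidden.
random.seed(1)                      # with this seed, position 0 gets masked
masked, targets = list(tokens), {}
for i in range(len(masked)):
    if random.random() < 0.15:      # BERT-style: mask ~15% of positions
        targets[i] = masked[i]
        masked[i] = "[MASK]"
```

Both objectives need no labels at all — the "supervision" is manufactured from the sequence itself.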

Yes, that's cool, but we cannot train at that scale ourselves, right? We would need data centers like OpenAI has. Actually, there is evidence — and some really experienced people in the domain, like the founders of Hugging Face, or LeCun, mentioned earlier, who heads Meta AI right now — think that in narrow domains we may end up using small language models. That might be the case for us as well, because large language models are good when people interact with them, but when you have to make thousands or tens of thousands of queries a day, such a model is not tractable.

So for machine-to-machine communication, where we want to build detections, we have to build more computationally sensible models. Now, how can we use this knowledge in our domain? This is the pipeline I used for malware detection. First there is the extraction of behavior from something — either an operating system or a PE sample. Then there is a data-cleaning step, which I'd say is super crucial for our domain, and that is where the domain knowledge can come in.

That's where you guys are really relevant to these data science projects, because if you take a vanilla data scientist from outside the domain, sit them at a computer, and say "solve me this cybersecurity problem", they really cannot do it. Unlike natural language or computer vision, which are more intuitive, people here face a really steep entry curve, and they get lost in steps like data cleaning. What do I mean by that? For example, filters. You have this blob of machine data, and you could take it as-is and pass it into the model — but look at the number of unique tokens.

We will touch on what a token is in a moment; it's basically the unique unit the model operates on. If you don't filter anything and just take the raw JSON, it produces a huge number of unique elements the model has to understand. But if you apply some filtering — for example, you take the API calls, and not only their names but their arguments as well — it produces a much more tractable vocabulary, and the model can already operate more efficiently in that space. And this is a logarithmic scale: it's not half, it's an exponentially lower number of unique tokens the model has to understand. That's because there are non-essential values, like memory regions.
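The filtering step above can be sketched as a whitelist over event types. The report layout and key names here are simplified assumptions, not the exact Speakeasy schema:

```python
# Keep only the behavior-bearing event types from an emulation report.
# The structure and key names below are illustrative assumptions.
KEEP = {"apis", "file_access", "registry_access", "network_events"}

def filter_report(report):
    out = []
    for ep in report.get("entry_points", []):
        out.append({k: v for k, v in ep.items() if k in KEEP and v})
    return out

report = {
    "sha256": "ab12cd34...",           # metadata: dropped
    "entry_points": [{
        "apis": [{"api_name": "kernel32.CreateFileW",
                  "args": ["C:\\Users\\x\\a.txt", "GENERIC_WRITE"]}],
        "file_access": [{"path": "C:\\Users\\x\\a.txt", "event": "create"}],
        "registry_access": [{"path": "HKCU\\Software\\Run", "event": "open"}],
        "network_events": {"dns": [{"query": "example.test"}]},
        "dropped_files": [],           # everything outside KEEP is discarded
    }],
}
events = filter_report(report)
```

Everything the model never needs to see simply never reaches the tokenizer, which is what collapses the unique-token count.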

Memory regions rarely, if ever, say anything about malicious versus benign behavior — in an emulator, at least — and there are hashes, which are completely arbitrary: the exact value of a hash does not matter at all. If you do not apply any filters, these are the ROC curves you get. It's not fancy, it's not an exploit or anything, but you have to rely on this type of evaluation of your work when you're working at such quantities. For a ROC curve, the higher it is, the better: a higher detection rate at the same false-positive rate. If we don't apply any filters, it's just bad.

First of all it fluctuates a lot depending on the initialization, and it's just plain bad compared to a clever, domain-knowledge-informed filter. What I did here is obvious to field experts: we filter for file access, network events, registry access, and APIs, and everything else goes away — this was the best setup according to my tests. Another thing to do here is normalization. We can, and want to, replace some of the values in the original data with so-called placeholders: for example, instead of going digit by digit, character by character through a hash, the model can grab the information that "this is a hash" and learn a token for just that.

This increases the efficiency of such models significantly, and the same applies to IP addresses and many other fields. It's a never-ending loop where domain knowledge can feed into such models — so if you participate in any AI-related project in your organization, this is the field where you can flourish. Then comes the feature-extraction step. This is more native AI territory, where the data scientists have the credibility: basically, tokenization. So what is a token? It's a unique element — like an atom in the real world that you cannot split further.

You can talk about quarks, but they do not live independently; here, the token is the unique element, and a string can be split into tokens. In this illustration the split is plain and simple, based on words, but for an intuitive understanding it's enough. Then you take a token and encode it with a vocabulary, which maps it to a specific integer. The vocabularies are fixed in GPT-like systems: GPT-2, for example, used about 50,000 tokens; for modern GPT-4 we don't know, but GPT-3.5 uses around 100,000 tokens.

Then you encode and expand each token into a multi-dimensional embedding space. Embeddings are hard to grasp, but you can think of it this way: vectors in this multi-dimensional space group together by semantic similarity — by meaning. Similar API calls will be close together in that space — say, the A and W variants of the same call will be super close, even though they are represented by different tokens. So what do we do here? We take the record — the JSON file — go token by token, and flatten it. We discard some of the stuff we don't want, like the keys: we don't want the model to replicate them all the time.

In this work I used both whitespace tokenization and BPE tokens. BPE — byte-pair encoding — is what GPT actually uses; it's a data-driven tokenization based on co-occurrence in the dataset. If it sees that two characters often appear together, it creates a token for them, and it does this incrementally until it reaches the vocabulary size. And that's basically it: you have a vector representation of the data, and you pass it to the model, which is just a Transformer with little to no modifications.
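The BPE idea — merge the most frequent adjacent pair, repeat until the vocabulary is full — fits in a few lines; a toy sketch over made-up "words":

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy byte-pair encoding: repeatedly merge the most frequent
    adjacent symbol pair. Real tokenizers (e.g. GPT's) do this over a
    large corpus until the target vocabulary size is reached."""
    seqs = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append(a + b)
        for seq in seqs:                 # apply the merge everywhere
            i = 0
            while i < len(seq) - 1:
                if seq[i] == a and seq[i + 1] == b:
                    seq[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges, seqs

merges, pieces = bpe_merges(["regread", "regwrite", "regopen"], 2)
```

With these three invented words, the frequent prefix "re", then "reg", get learned as single tokens — which is exactly why recurring log substrings become cheap for the model.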

For the experiments, I collected a dataset of about 95 thousand samples, split into a training set and a test set — you do not train on the test set, you evaluate your performance there — with several malware families plus cleanware. How does it work? You have training data, about 70,000 samples, and test data, about 25,000 samples. You choose a large part of the training set to act as an unsupervised corpus — in this case I took 80% and just discarded the labels, assuming we don't know the labels for that part. Then we pre-train a model using both techniques, without any labels at all: for MLM you mask tokens, pass them through the Transformer, and ask which tokens were masked;

for the GPT-like objective, you pass a context and ask which token comes next. Then you have this foundational model — let's call it that — and you fine-tune it on the small supervised part, labeled data where we assume humans did the labeling, and you evaluate how it performs on the test data across different scenarios. So here are the results. The blue one is the "perfect" model, let's say: it knows all the labels in the training data, so we can use it as an upper bound — a model with all the knowledge of the labels in the data.
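The 80/20 split described above is just a shuffle-and-cut over the training set; a sketch with an invented stand-in dataset (`pretrain`/`finetune` are placeholders for the real training loops):

```python
import random

# Invented stand-in for the labeled training set (the talk's is ~70k).
data = [(f"report_{i}.json", i % 2) for i in range(1000)]

random.seed(42)
random.shuffle(data)

cut = int(0.8 * len(data))
pretrain_corpus = [x for x, _ in data[:cut]]   # 80%: labels deliberately discarded
finetune_set = data[cut:]                      # 20%: the small labeled portion

# model = pretrain(pretrain_corpus)            # SSL: MLM or autoregressive
# model = finetune(model, finetune_set)        # supervised head on the 20%
```

Discarding the labels on the 80% is the point of the experiment: it simulates the realistic situation where most data is unlabeled.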

Then we have the lower-bound model, which uses only 20% of the data and the knowledge in it. And we have the fine-tuned models — in our case green and yellow, where green is the GPT-like pre-training and yellow is the masked pre-training — and we see that they indeed land somewhere in between. The MLM one actually does worse at the lowest false-positive rates than the lower-bound model, meaning it pushes the model farther away than just fine-tuning would. But the most surprising part is that with the autoregressive, GPT-like pre-training you almost match the performance of the best model — the one that knows everything in the data —

with only 20% of the labels. You can put it this way: in one scenario you have ten reverse engineers sitting and labeling this really large dataset, and you get a model with all the labels; in the other you have self-supervised pre-training on most of the data and just two engineers labeling a small portion — in this case 20% — and you get basically the same results. I believe that's cool. Again, these are somewhat dirty, preliminary results, without huge datasets and without huge computational resources, but they already hint us in the right direction — so this might be interesting stuff to explore and throw more compute at.

One interesting thing that came out of this, especially for domain experts, is the explainability of these models. I took a random sample, without any cherry-picking, and looked at the activations inside the Transformer, and you can see that some tokens are emphasized by the model as corresponding to maliciousness. In this sample it was the "exodos" token and the GetTickCount token. GetTickCount is actually used in some handcrafted rules for anti-debugging — timing-based anti-debugging — and malware uses it to decide whether to execute further logic or not. And "exodos"

is basically a combination of two components that are each common in Windows, but this exact string is never seen on a default Windows system — there is no "exodos" in a file path anywhere. So that's cool: this might serve as a focal point for reverse engineers, who can use it to narrow down the scope of their analysis of a sample and later build more robust manual rules. I've checked this against more principled explainable-AI techniques, and they correspond — you can use several techniques and they roughly match each other — so there are more robust ways to get this knowledge out of the model.

Now, some ideas for future work. I'd say GPT-like pre-training is not the best fit for this type of data. In natural language, when I say a word, it's easy to predict the next one; in system logs it's harder, because values can be arbitrary — the IP address can differ, the file path can change independently of the context. So I think we can borrow from other domains here, or come up with something new. For example, in computer vision, the models you've seen recently — like Sora from OpenAI, which creates videos, or Midjourney, if you've heard of it —

don't use this type of training. There are different ideas, mostly based on diffusion and augmentation. For example, BYOL — Bootstrap Your Own Latent — pre-training takes an image and augments it in two ways: it crops it and, in this example, rotates it, or in another case makes it darker, and it tasks the model with recognizing that those two examples are still the same — making the difference between their representations as low as possible. In the other case, you take a completely different image and make the distance between them as large as possible. That way the model, interestingly, learns a lot of concepts about the input data.

So we can use some mix of ideas here, maybe build custom pre-training for system logs and malware data, and yield even better results — surpass the labeled data. The ideas are, I'd say, limitless here. About reproducibility of this work: everything is public — the data is public, the code is public and installable with pip — so if you want to play with it, feel free. Even the technical report, in the form of a paper, is published on arXiv, so feel free to grab it. And that's it for this talk. If you have any questions — I don't know if we have a

question session here, probably not — and if you are interested in chatting, grab me at any moment today; we can discuss these ideas, so don't hesitate to come over.

Cool, so if there are no questions… is there any question? — Hey, so you said the logs are unlabeled and therefore you need, for example, GPT. But aren't they structured, for example as JSON, where they have predefined keys and different forms? Why use GPT, which is for natural language, when you could use, for example, XGBoost? — Okay, can I paraphrase the question: we have these system logs in a well-defined structure, so if I understood correctly, why do we use GPT-like systems — ideas tuned for natural language — when we could use XGBoost, for

example? Well, XGBoost is a tabular model; it's not a deep-learning solution, it's a different machine-learning algorithm that does sort of the same job, but it's a purely supervised model — you cannot really apply these self-supervised pre-training ideas to XGBoost. But you're right: for many tasks where you have a well-defined supervised dataset, you can use gradient boosting, and it's often much better than deep-learning solutions. That's one of the things in practical applications: it's not sensible to go to neural networks right away if you have well-defined data — use simpler algorithms. But for this example, where we want to pre-train, we

have to go to neural networks. XGBoost cannot really incorporate these self-supervised learning algorithms; it doesn't work — we need attention here. I hope that answers it.

— Yeah, maybe, I would put it that way. Thank you very much for the great presentation; just a follow-up question to this one. The data you use is somehow structured data, so why do you use this kind of LLM-based or small-language-model approach, since it's not unstructured data? What is the reason? — Yeah, you are probably asking about the dichotomy with so-called feature engineering, which you can do on top of structured data: you can predefine rules that extract something from it. And actually there are good solutions — I'd say the best solution so far for

static malware detection, where you take a PE file and don't execute it but treat it as a raw byte blob, is based on feature extraction, because it's structured data, basically an RFC-style standard. It's called EMBER: you extract a static feature vector from the file, and it works really well. But it has limitations, and one of them, again, is that it's purely supervised — you have to label all the data — and especially for system logs that is not tractable. If you want to extract a feature vector from a system log, you have to define what that vector is. The idea behind these self-supervised

learning algorithms is that you make as few manipulations to the original data as possible and let deep learning extract the valuable representations itself. So I'd say both approaches are viable, but the feature-engineering approach, while historically proven to be good, does not scale. Back in, say, 2005, the computer vision domain also used feature-extraction techniques — they extracted edges and the like and built a feature vector out of the image, and it kind of worked — but as we've seen, a convolutional network that extracts good features from the data itself is just better, and even more so with

self-supervised learning. So this is a direction we can borrow from other domains. In our domain we are actually still behind — the good solutions use feature extraction — but maybe we can come up with something better here. Any other

questions?

— Thanks, a very simple question; I'm not sure I got the whole message here. If you have a language model like this, you feed it a prompt, or a set of tokens, whatever you extract. So what's the outcome? It tends to finish the sequence, right? So what does it do in your case? — Yeah, again, this is the difference between so-called generative applications and other uses. You can fine-tune this large model to do different tasks later; one of them is instruction following — you give it a context and it predicts output, like ChatGPT: you ask and it answers

back. But you can tune the model to do something different, and in this case that's what we do: you pre-train the model and then you do a different fine-tuning. You say: I pass a sample through the model, but I don't want you to continue what comes next; just give me a representation of what you think about it, and let's classify whether it's bad or good. With this fine-tuning you can play with the model like Lego pieces and build a different solution for each application. You could even ask ChatGPT something and map the word "true" to one and the word "false" to zero — but that's the naive way. There are lower-level ways, where you take a vector of numbers out of the model and steer it toward good or bad. I hope that answers your question. — Kind of, yeah. — It's already the engineering of neural networks; there is a knowledge gap to cover to really understand this, so it's a bit out of scope. — So you turn this language model into a classifier? — Yeah, you can say it that way, with some engineering tricks.
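The "take a vector out of the model" idea can be sketched as a tiny classification head — all numbers below are made up; in reality the per-token vectors come out of the pre-trained Transformer, and the head's weights are what fine-tuning trains:

```python
import math

def classify(hidden_states, weights, bias):
    """Mean-pool the per-token vectors from a (hypothetical) pre-trained
    transformer, then apply a logistic layer to get P(malicious)."""
    dim = len(hidden_states[0])
    pooled = [sum(vec[j] for vec in hidden_states) / len(hidden_states)
              for j in range(dim)]
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Two tokens, two hidden dimensions — purely illustrative numbers.
score = classify([[0.2, -1.0], [0.4, 0.0]], weights=[1.5, 0.5], bias=-0.1)
```

The generative head that predicts the next token is simply swapped for this scoring head; the pre-trained body is reused unchanged.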

Yeah, thanks. Any other questions?

Okay, that's it then. Thank you everyone for attending — thank you.