
So today we're going to be talking about binary analysis: malware analysis, reverse engineering, and vulnerability research, and how we can automate all three of those using machine learning, scripting, and other bits and pieces. Breaking it down into a few main areas: first we'll cover why automate in the first place; then some use cases — people who would benefit from that sort of automation; then quick wins — tools you could use right now to help with binary analysis automation; and finally some general, high-level techniques if you're interested in writing your own automation or, if you're brand new to scripting, places to start.

First things first though: who actually am I? My name is James Stevenson. I've been working in the computer security industry for about seven years now, in areas from security operations to software engineering to penetration testing, and these days I lead a team of vulnerability researchers at the intersection of machine learning and vulnerability research. Today I'm here as an individual — not as part of my organisation, not as part of the university where I'm doing a PhD — just as a person talking about some cool tools and some cool research. So let's jump straight into it.
Right — why automate in the first place? There's a quote I pulled offline (there are many like it), and generally what it's saying is that automation is about replacing tasks, not about replacing people. So when I talk about automation, I'm talking about automating out the boring stuff — about giving people cool tools and techniques to help them do what they're already doing. Take a hypothetical: say you all work in a technical role that doesn't have much automation. We could probably say that 80% of the time you're doing the monotonous, turning-the-crank kind of work, and 20% of the time you're doing the interesting, novel work. With automation we want to flip that: 20% of your time on the boring work, 80% on the interesting, human-touch-required work. That's what we're talking about today — automating out the boring stuff and giving people tools to help them do their jobs — not intrinsically replacing people with AI.

We're going to talk about three main types of users and how they benefit, or can benefit, from binary analysis automation: malware analysts, who look at malware and want to identify it; reverse engineers, who are trying to understand complex systems; and vulnerability researchers, who are interested in identifying and understanding new (or old) instances of vulnerabilities. We're also going to talk about three main questions all three of these users have, and when we get to quick wins we'll look at how automation can answer them. The first question: how similar is one binary to another? The second: what does this code in front of me actually do? And the third: where is a given component — where is component X — in the thing I'm looking at?

Next we'll dive into these questions, look at the specific sub-questions for each of those three user types, and then at tools we could use right now to help answer them. So, first question: how similar is one binary to another? It's a binary-diffing question. As a malware analyst, maybe we've got an older version of a piece of malware and a newer version, and we want to figure out what's changed. As a reverse engineer, maybe we've got the same question: we've looked at an older version of a binary and want to figure out what's different in the newer one.
Then as a vulnerability researcher, maybe we've identified a vulnerability in an old version of a binary and want to see if it's still present in a new version. So, a quick win — a tool we could use right now to help solve this (and there's actually a good handful of others, listed at the bottom of the slide) — is a really simple tool called Just Another Differ. As the name may suggest, it's just another differ. Primarily, it lets us look at two different binaries and say: function A in binary 1 looks the same as function B in binary 2, function B in binary 1 looks the same as function C in binary 2, and so on. The way it works: we take our two binaries and decompile them using Ghidra; we then iterate through all of the functions in binary 1, compare them to all the functions in binary 2, and use fuzzy string comparison on the decompiled pseudo-C to find which functions in the two binaries look the most similar. From that we create a mapping file — again, a file saying function A in binary 1 looks the same as function B in binary 2.

Now, the name of this technique (Levenshtein distance) is one I mispronounce every single time I talk about it, so we're going to call it fuzzy string matching, because intrinsically that's what it is. It takes our decompiled C code and looks at the minimum number of character edits needed to turn one function into another, which lets us say: this function is close enough to this other function that it's probably the same thing.
Let's run through a quick example: we take two binaries, run them through Just Another Differ, and look at the mapping file that's produced. We take BusyBox — a stripped version and an unstripped version, stripped meaning we don't have function names, variable names, things like that. We run that through Just Another Differ; you can see all the decompilation happening using Ghidra headless, and then right at the bottom we're iterating over all those functions. That produces the mapping file — there's an HTML version and a text version. If we focus on one entry, Just Another Differ has said that the function add_fname in binary 1 has a 70% correlation with FUN_00653e… in binary 2 — that 70% confidence saying these two functions generally look the same at about a 70% ratio. And we can confirm that: we just open Ghidra and look at the two functions. Here we have the stripped version on one side and the unstripped version on the other, and as a human being we can say, yeah, generally those look about the same. There are some differences, which could be due to a whole range of things — Ghidra having a bad day, the compilation, the decompilation, slightly different versions maybe — but generally those are probably the same thing.

And there we have it: we've answered our first question — are these two binaries the same, and what's different between them? — not even using machine learning, just a bunch of scripts. So let's jump to the next question: what does this code in front of me do? I'm looking at a binary — what's it actually doing? As a malware analyst, maybe I've reversed a bunch of malicious code and I'm looking at some new code, and I want to see if it has traits similar to code I've looked at before. As a reverse engineer, maybe I'm reversing a really complex state machine and I just want to understand at a high level what it's doing.
As a vulnerability researcher, maybe I have a back catalogue of functions I know to be vulnerable, and I'm looking at some new code and want to know whether any of those vulnerabilities are present in it. The quick win here — the tool we could use right now — is called Tweezer. The idea behind Tweezer: the first part lets you generate what we'll call a database (it's not really a database, but a dataset) of vectors of known functions; then, when we have a new target, we vectorise all of its functions and compare them against that database, letting us understand what each function we're looking at is like.

The way that works: in the top phase we aggregate all of those vectors. We grab all the binaries we're interested in — ideally unstripped ones; these could be your malware samples, your vulnerable code samples, or just a bunch of interesting unstripped Linux binaries — and again we decompile them using Ghidra (you might be able to tell I quite like Ghidra). We then run a word2vec model against all of the functions in each binary — all the decompiled C code — and that produces a vector file that we pickle and use later on.

Now we have a new target: binary 1, which we don't know much about but are interested in learning more about. We do the same — decompile it, vectorise it — and then compare the vectors of its functions against our stored vectors. The idea, the hypothesis, is that where similar functions sit close to each other in that vector space, they're probably doing the same thing. So we're using a word2vec model to translate those strings — those C functions — into vectors, and then something called cosine similarity to measure the distance between those functions.
Let's imagine it as a table — this is our pickle file, our vector space. Every time we get a new binary to add to our knowledge base, we vectorise its functions and place them in that vector space. We get a new binary: place its functions in the space. Another new binary: same again. Then, when we get our target binary — the one we want to learn more about — we place that in the vector space too, but this time we can ask: what is this function close to? What's nearby? In this case, the red function we know nothing about is closest to this green function up here, so we can hypothesise that if green is some sort of file I/O function, red might be file I/O as well.

Let's run through a quick demo. We generate our vector pickle file — for this we're using a fairly generic dataset — and then we match against it using BusyBox: take BusyBox, vectorise all of its functions, and compare them to our vector space. That produces a mapping file, very similar to the previous one: on one side, all of the functions in BusyBox; on the other, the functions they're each closest to in the vector space. Right at the top we have a function _001…, and Tweezer has said that the closest function in the vector space is one called willpo_update — so we can hypothesise that the function at the top might have something to do with updating in one way or another.

Last question for today: where is a given component? As a malware analyst, maybe I'm interested in identifying packers, crypters, or command-and-control systems in a binary. As a reverse engineer, maybe I'm interested in finding authentication routines or entry points into a binary.
And as a vulnerability researcher, I'm probably interested in identifying vulnerabilities in that binary. The quick win here is a tool called Monocle. Monocle lets us perform natural-language search against a binary: we can say, hey, take this binary and this find criteria — command-and-control systems, entry points, vulnerabilities — and go find it in the binary, all via natural-language search. The way it works: we provide the binary and the find criteria; Monocle decompiles the binary, then iterates through all of its functions, feeding each function along with the find criteria into a large language model. The LLM produces a score between 0 and 10 for how well that function meets the find criteria, along with an explanation of why, and that feeds into a kind of live analysis. At a super high level — I'm not going to dive into how LLMs work, or how this specific LLM works — one of the main things I want to get across today is that we can start approaching these problems with off-the-shelf tools and off-the-shelf techniques. This is using Mistral, an off-the-shelf large language model. This sort of thing would obviously work better with models tuned specifically for the problem space, but it's a good step in the right direction.
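That per-function scoring loop could be sketched roughly as below — with a canned stand-in reply instead of a real model call, and a prompt format, score marker, and function snippet that are my own invention rather than Monocle's actual internals:

```python
# Score one decompiled function against a natural-language find criteria.
# `fake_llm` stands in for the real LLM call (Monocle uses an off-the-
# shelf Mistral model); everything else here is the plumbing around it.
import re

def build_prompt(find_criteria: str, pseudo_c: str) -> str:
    return (
        f"On a scale of 0 to 10, how strongly does this function match "
        f"the criteria '{find_criteria}'? Reply with 'SCORE: <n>' and a "
        f"one-line explanation.\n\n{pseudo_c}"
    )

def parse_score(reply: str) -> int:
    """Pull the 0-10 score out of the model's free-text reply."""
    m = re.search(r"SCORE:\s*(\d+)", reply)
    return int(m.group(1)) if m else 0

def fake_llm(prompt: str) -> str:
    # Stand-in: a real implementation would send `prompt` to the model.
    return "SCORE: 10\nDirect memory writes in a loop with no bounds checks."

func = "void FUN_00401a2c(char *dst){int i;for(i=0;;i++){dst[i]=0x41;}}"
score = parse_score(fake_llm(build_prompt("memory corruption vulnerabilities", func)))
print(score)
```

In the real tool this runs once per decompiled function, and the scores and explanations stream into the live analysis view.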
Let's run through a quick example. We give Monocle the pure-ftpd Linux binary, and we give it the find criteria of "memory corruption vulnerabilities" — really simple. It then goes off and begins its analysis. We can see a bunch of functions down here — entry and a few obscure ones — where it's given them a score of zero and said: no, no memory corruption here. There's one in the middle it's given a score of one, saying: no memory corruption happening here, but if some of this code were to change, there's potential. And right at the top it's given one a score of 10. Focusing in on some of these areas, it's said: yep, there's direct writing to memory here, it's iterated over multiple times, and there are no bounds checks or validation happening at all. Okay, awesome — that sounds like potential memory corruption. This is the raw output, and I appreciate the contrast is horrid, but generally we can validate it: take the function that was given a 10, open it in Ghidra, and figure out whether we agree. And we can confirm: no bounds checking, there is a loop, and it is writing to memory — so, potential memory corruption.

Now, obviously we're missing a lot of information here. The LLM doesn't know things like what these memory regions actually relate to, or whether any bounds checking happens before this function is called — it doesn't know, and neither do we, from looking at this function alone. But it's a step in the right direction: as a researcher we can use it as a first port of call to say, okay, maybe this is something we want to look at — going back to that original point of it being an additional tool in our tool belt to help point us in the right direction. So those are all the tools I wanted to cover today.
The last thing I wanted to talk about: if you're here and maybe this is your first BSides, and you're new to scripting and new to automation, what direction should you go in, and how should you pick this up? There are three broad things I like to have when I build tools, when I script, when I do things like this. I like to have an idea — generally, who's going to use this tool, and why or how are they going to use it. I like to have a goal — setting a time frame: how long do I have, and what am I actually creating. And finally, inspiration — what's the point of this, why am I doing it? So we're going to run through one more tool, talk about how it hits those three criteria, and go from there.

The last tool is one for generating seed corpus files for fuzzing using machine learning. For this we had an idea: we wanted to accelerate fuzzing, especially that onboarding phase of fuzzing. The goal: I basically had a day off work and said, well, I have four hours, let's figure out how to do this — and I wanted to make a really simple command-line tool. The inspiration: I'd been using an LLM in Monocle and thought, I'd really like to use an LLM when it comes to fuzzing. That's where AutoCorpus came from. The idea behind AutoCorpus is that it lets us generate seed corpus files for fuzzing. At a super high level, when we fuzz we mutate from a seed corpus set, so we need a representative dataset to start fuzzing from — and AutoCorpus lets us generate that corpus set without having to make it ourselves or pull it from online or anything like that. The way it works: we can provide it with an existing corpus dataset, which it takes a sample of and feeds to the LLM, or we can just provide it with a prompt and say, hey, I want config files, or maybe I want some JSON files with this information in them. All of that then gets fed into the LLM, and it produces the seed corpus input file or files.

Let's run through our last example. We're going to create some orc command config files for a fuzzing run against BusyBox. We give the tool two inputs: a single corpus file — a generic orc command file you can find online — and the find criteria "generate some orc command config files".
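The input-assembly step could be sketched like this — a hypothetical prompt builder of my own, not AutoCorpus's actual code — sampling from an existing corpus and wrapping it with the generation criteria before it would be handed to the LLM:

```python
# Build an LLM prompt for seed-corpus generation: take a sample of
# existing corpus files (or none, if we only have a criteria string)
# and ask the model for structurally similar, valid variations.
import random

def build_corpus_prompt(criteria: str, corpus_files: list[str], sample_size: int = 2) -> str:
    prompt = criteria
    if corpus_files:
        sample = random.sample(corpus_files, min(sample_size, len(corpus_files)))
        prompt += (
            "\nHere are existing inputs to stay structurally similar to:\n"
            + "\n---\n".join(sample)
        )
    prompt += "\nGenerate new, valid variations as separate files."
    return prompt

# Hypothetical key=value corpus files standing in for real seed inputs.
prompt = build_corpus_prompt(
    "Generate some config files",
    ["key=value\nmode=fast", "key=other\nmode=slow"],
)
print(prompt.splitlines()[0])
```

The model's reply would then be split back out into individual files and dropped into the fuzzer's seed directory.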
It then goes off, does its thing, and after a while comes back having generated a handful of input files for your fuzzing run. Right at the top we can see an example of the fuzzing files it created — similar to, but different enough from, the original file we provided. And on the far side we can see that it is indeed a valid orc command config file, which is really important when it comes to fuzzing. So there we have it: we had an idea, we had a goal, and we had inspiration.

That brings the talk to an end. The last thing I wanted to mention: myself and a good friend, Nathan, are running some training and creating some resources in the binary analysis and machine learning space. If anyone's interested in getting some free resources or learning more about what we're doing there, jump to this website and drop your email address in, and we'll let you know as and when resources go live. Once again, thanks all for coming along, and thanks to the organisers and volunteers for having me here today. If you have any questions, feel free to ask me now or come find me later — I'm on LinkedIn, I'm on GitHub. Cheers, everyone.
Any questions?

[Audience] You were using Mistral 7B, which is a general-purpose model. As of a week ago there's one for coding, and before that we had code models and so on — do you think those are going to be better for this?

Yeah — so the question is around the choice of large language models. If you go on Hugging Face there are 101 different models for 101 different purposes. Generally, what I'd recommend is just trying a bunch out. I have used some code models and they didn't really do what I wanted; in my case I found generic models did the job well enough that I felt confident demoing this, but they still don't do the job well enough that I think you could use it in a real VR or other production environment. I think intrinsically it comes down to finding a really good model and then fine-tuning it, building an adapter, something like that — that's what it will come down to in a real enterprise environment. Great — right at the back.
[Audience question about the 80/20 split.]

Sure — so the question goes back to that original slide with the 80/20 split: 80% of the time doing the monotonous, turning-the-crank work, 20% doing the interesting, human-touch work — and whether the tools we covered today help bridge that gap. If I'm honest, these are tools I've made on my afternoons and days off; in practice they're not going to solve world hunger or world peace. But I do think they help demonstrate what we can do with this sort of technology if we adopt it and apply it to these new problem spaces, and I think not even years down the road we'll have tools like this that significantly help us in that direction. Great — any questions? Right here, yeah.

[Audience question about disassembler choice.]

Yeah — so in all the tools I covered we used Ghidra and Ghidra headless, and the question is whether I've noticed Ghidra being better than the others. In practice, I'm lazy and I know how to use Ghidra headless, so I've used Ghidra headless. I've seen really good results with radare2 — especially radare2's intermediate language, which looks really promising — but the quick answer is that I'm lazy and I like Ghidra. Binary Ninja is also a really nice tool, and I've seen really good results with that as well.
[Audience question about input representation.]

Yep — so the question is, especially for what we feed into the large language model (and into Tweezer as well): why did we use the decompiled pseudo-C over the disassembly, or over an intermediate language, or something like that? In practice, especially for the large language model, my gut feeling — because we were using generic models — was that pseudo-C is a bit more like human language, so the models would be a bit better at understanding it, especially because we're including all the symbols. Say we have a memory corruption vulnerability and it's a memcpy with a certain field: all of that information is there in the pseudo-C, whereas if we were looking at the disassembly, maybe we wouldn't have it. Great — cheers, everyone.