
Well, thanks everyone. Thanks for the introduction and thanks for having me. Special thanks to Claire: Claire and I work together, she's one of the organizers here, so that's how we made the little connection. A few sentences about myself. Before I joined Elastic I did everything from pentesting to vulnerability research; I even worked in law enforcement for a couple of years. At Elastic I focus on malware research and malware reverse engineering, and I have a slight preference for looking at Linux environments and Linux malware; that's where most of my research is done. Another fun fact about me: I have about 55, 56 slides with me today. That is a lot. My strategy for getting through them is to speak really fast. I'm here for the rest of the day, so if I went through something too fast, just come up to me; I'm happy to do it once or twice or five times again, we can make that happen. Now, a quick disclaimer: in the title you could see the words "vector search". People associate that with AI. This talk is not about AI, GenAI, or any hype buzzword you like; it's not about that.
If anyone is disappointed about that: no? Awesome. It is still about vector search, it is about technologies like TLSH, it is about disassemblers, it's going to be about malware and assembly code. But it's mostly a story. It's story time, it's fun time; it's things I've discovered that I had no knowledge of and, arguably, no business talking about, yet here I am on stage, so it's going to be fun. And after that broad statement of not being a person to talk about AI, let's start off with an AI-generated image. What you see in the image is a bit of a metaphor: a kid trying to get a ball off a roof using tools that were not made for doing that. I would not suggest getting on a chair and using a rake to get that ball; just call your dad, that's easier. But let's start with the chair, let's start with vector search. By a show of hands, and just be honest here: who knows what vector search actually is? OK, I see two hands. Out of those two, I'm going to take you as voluntary victims: do you actually know how vector search works? I'm going to explain it anyway. OK, then the whole audience is at the same level as I was last year, so that's good.
I'll take you through the entire journey: what is vector search? These days we have all these machine learning models, take ChatGPT for example, and they need to store their data in some format somewhere. But you can't just store machine learning data in a normal SQL database; you can't ask a machine learning model "give me the next word in this sentence" by doing SELECT * FROM whatever. You need something more specific to that use case, so someone invented vector search. Vector search is a database, let's call it a database type, where you can find data points that are close to each other. Now what do we mean by that? At its most basic, vector search works like this: you've got stuff, you give that stuff to something magic we call a machine learning model, and out the other end come numbers, also magic. Again, this is where my knowledge level was about a year ago, before I started this research. Let's make it really practical. Let's make a machine learning model that tells you whether an image is realistic or a cartoon, and it gives you a number between zero and one: if it's zero the image is really realistic, if it's one it's really cartoonish, and anything in between is a number in between. Now let's make that model even smarter and teach it the difference between mammals and birds: if an image looks a lot like a mammal it will output a one, and a zero for a bird. Now we've got two numbers, and two numbers on a graph make coordinates. That's exactly how you should imagine working with vector search: something has a coordinate, and that coordinate can be close to other things. So if, for example, you've got the image on the left side of the screen and you want to ask the model "how similar is this image to other images that you know?", it will analyze the image, give you its coordinates, and then you can do a lookup and see: hey, this image sits at a certain location in my vector search database, and everything around it is similar. That is how models like ChatGPT figure out which word is associated with "cat": they search the vector space for things that sit in the same location, near the word "cat".
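To make that concrete, here is a tiny sketch of the idea in Python; the image names and the numbers are made up for illustration, not output from a real model:

```python
import math

# each image gets two numbers from the imaginary model:
# (how cartoonish it is, how mammal-like it is) -- those are its coordinates
known_images = {
    "photo_of_a_cat.jpg":    (0.10, 0.90),  # realistic, mammal
    "cartoon_dog.png":       (0.95, 0.85),  # cartoon, mammal
    "photo_of_a_parrot.jpg": (0.05, 0.10),  # realistic, bird
}

query = (0.15, 0.80)  # coordinates the model gave us for a new image

# "similar" simply means "closest coordinates"
closest = min(known_images, key=lambda name: math.dist(known_images[name], query))
print(closest)  # -> photo_of_a_cat.jpg
```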
Now let's make it really technical. The next slide I hardly understand myself, so don't beat yourself up. What is actually "close" to each other? You have several methods of determining what in a vector database is close to what. If you take the three big dots on the left, the ones connected by a triangle, A, B and C: in Euclidean distance, as they call it, those are all equally far apart, so with that method of determining closeness those three dots are exactly the same. You can also do a cosine analysis of the data, and then you have that line all the way at the bottom, where A, B and C follow the same line, the same angle from the starting point. That's cosine similarity, meaning that A and C on that line are similar. You can forget everything I just said, it's just background, but that's the way to think about similarity. Also, this is two-dimensional. People can imagine this in three dimensions, but most machine learning models work with 300 to 7,000 dimensions, and these calculations become very complex very fast. So let's stick with two dimensions; that's all you need to know for now.
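As a small illustration of the two measures just mentioned, in two dimensions and with made-up points:

```python
import math

# three made-up points that all lie on the same line through the origin
A = (2.0, 2.0)
B = (4.0, 4.0)
C = (6.0, 6.0)

def euclidean(p, q):
    return math.dist(p, q)

def cosine_similarity(p, q):
    dot = p[0] * q[0] + p[1] * q[1]
    return dot / (math.hypot(*p) * math.hypot(*q))

# Euclidean distance: A is closer to B than to C
print(euclidean(A, B), euclidean(A, C))                  # ~2.83, ~5.66

# cosine similarity: all three point in exactly the same direction,
# so they count as maximally similar, however far apart they are
print(cosine_similarity(A, B), cosine_similarity(A, C))  # ~1.0, ~1.0
```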
Now, I've done a couple of presentations internally at our company, and I've heard people say: hey, the learning curve of vector search and these other technologies is steep, it's difficult, you need to teach people how to use it, you need certain skills. It's not that hard. It is really, really not that difficult; that's my personal opinion, and there are people who will beat me up for it, but I'll give you a few examples, starting with our own. If you do a vector search query, this is the entire request that you would send to the server: you tell it, this is the vector I'm searching for, these are the numbers, give me the top five most similar ones. Of course we're not the only ones doing something with vector search. You can use Postgres, for example, to do the same thing; this is a very simple-looking query for the SQL gurus among us, and in a few lines you can do a similarity search. And recently even SQLite got an update, or rather a plugin, called vss, that supports exactly the same thing, so even in SQLite you can do vector search with a simple query.
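I don't have the exact slide here, but a request along those lines against an Elasticsearch index can look roughly like this; the index name, the field name and the vector values are placeholders I made up:

```python
import json
import urllib.request

query = {
    "knn": {
        "field": "function_vector",               # a dense_vector field in the index
        "query_vector": [24, 87, 96, 12, 53, 1],  # "these are the numbers I'm searching for"
        "k": 5,                                   # "give me the top five most similar ones"
        "num_candidates": 50,
    }
}

req = urllib.request.Request(
    "http://localhost:9200/functions/_search",
    data=json.dumps(query).encode(),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode())
```

The response is ordinary JSON with the five closest documents and their scores.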
It does take, let's say, an hour or two of reading the documentation to know what you're doing, but it's as simple as this; this is all it takes. Now, the second technology I want to talk about, and we'll bind everything together later, this is just me rambling a little bit, is TLSH. Again, by a show of hands: anyone know what TLSH is? No? Oh wow, this is amazing, I love it. TLSH is the Trend Micro Locality Sensitive Hash. Normal hashes, SHA hashes, MD5 hashes, stuff like that, everyone probably knows what those are, right? I see people nodding, that's good. Locality-sensitive hashes are a little bit different. If you put some data into a normal hashing algorithm, you'd expect that if you change a single bit, the output is completely different; that's how the randomization in those crypto algorithms works. With locality-sensitive hashes that's not the case: if you change the input a little bit, the output changes a little bit, and given the outputs of two calculations you can calculate how much the two inputs differed. This is great for finding, I couldn't find the English word on stage, plagiarism: if someone steals your text and reuses it, you can use this to find the similarities. Trend Micro is of course an AV company; they published this algorithm and use it to find similar malware samples. So let's say you have a malware family and a new version of it appears with a slight modification: you can use this algorithm to calculate how much difference there is between the two pieces of data. Now, these hashes look like just random numbers, but in fact they have a format: you have the version, you have a checksum, and then you have the actual hash bytes.
Now, remember when we were talking about vector search: these hexadecimal bytes can actually be translated into what I would today call a vector, a list of numbers. Also good to remember: when a machine learning expert talks about vectors, a vector is simply a list of numbers; let's keep it that simple. So these hashes look really similar to the vectors we use in vector search, and someone said: hey, we have vector search these days, can't we use that to find similar malware samples? Spoiler alert: yes, we can.
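As a rough sketch of that translation, assuming the py-tlsh package (the input data here is made up, and the digest layout is simplified):

```python
import random
import tlsh  # the py-tlsh package

random.seed(7)
data = bytes(random.getrandbits(8) for _ in range(1024))  # stand-in for a real file or function

digest = tlsh.hash(data)
print(digest)       # version marker ("T1"), checksum, then the hash body, all in hex

body = digest[2:] if digest.startswith("T1") else digest
vector = [int(body[i:i + 2], 16) for i in range(0, len(body), 2)]
print(vector)       # just a list of numbers -- something a vector database can store
```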
But one of the more important things to remember: given two TLSH hashes, I can calculate the difference between them, but I cannot predict what the output will look like based on the input. That means I can't keep a sorted list of TLSH hashes that would tell me "these files are all similar"; the only way to find the similarity is to calculate the difference between two hashes. That's important to remember. Now, I've got a few Python examples. In the first example I generate two identical sets of data, calculate two TLSH hashes and two SHA hashes, and of course the output is exactly the same for data1 and data2, because the input is exactly the same, so all the hashes are the same. In example two I change the first byte of data2: the SHA hash changes completely, while for TLSH, if you look really far to the right, there's a zero in the first hash that turned into a one in the second.
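I don't have the slide code here, but a minimal reconstruction of that example could look like this (the random data is just a stand-in for the data on the slide):

```python
import hashlib
import random
import tlsh  # py-tlsh

random.seed(1)
data1 = bytes(random.getrandbits(8) for _ in range(1024))
data2 = bytearray(data1)

# identical input -> identical SHA hashes, and a TLSH difference of 0
print(hashlib.sha256(data1).hexdigest() == hashlib.sha256(bytes(data2)).hexdigest())  # True
print(tlsh.diff(tlsh.hash(data1), tlsh.hash(bytes(data2))))                           # 0

# flip a single byte of data2
data2[0] ^= 0xFF
print(hashlib.sha256(data1).hexdigest() == hashlib.sha256(bytes(data2)).hexdigest())  # False
print(tlsh.diff(tlsh.hash(data1), tlsh.hash(bytes(data2))))                           # a small number
```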
Now, the last tool: does anyone recognize this one? No one? Binary Ninja, never heard of it? One person, amazing. I love this tool, I work with it every day. It's a disassembler. Imagine you get malware binary samples on your desk and your manager asks you: analyze this, give me some information about the file. One of the first things I do is put it into a disassembler. The disassembler takes apart all the functions and gives me nice graphs and assembly instructions; this is what my work looks like on a daily basis. Now, while I was working with this disassembler, I realized that the whole file I'm working with is just a collection of bytes, nothing special, and a collection of bytes is exactly what you feed to a hashing algorithm like SHA or TLSH. But these functions, these machine instructions that I'm analyzing every day, are also just a collection of bytes, and what you see in red are the bytes representing the machine instructions. I thought: could I not just give these to a hashing algorithm? Turns out I can, and that actually started this research, which I did during one of our ON Weeks; that means we have a week where we do absolutely nothing "productive" and can do whatever research we want. I realized: hey, if I put this function data through TLSH and put the result into a vector search database, I can then find functions that are similar to each other.
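For what it's worth, a stripped-down sketch of that idea with the Binary Ninja Python API and py-tlsh could look roughly like this; the file name is a placeholder and the exact API calls may differ between Binary Ninja versions:

```python
import binaryninja
import tlsh

bv = binaryninja.load("sample.bin")   # hypothetical path to a sample
for func in bv.functions:
    # collect the raw machine-code bytes of the function, basic block by basic block
    raw = b"".join(bv.read(bb.start, bb.length) for bb in func.basic_blocks)
    digest = tlsh.hash(raw)
    if digest and digest != "TNULL":  # TLSH needs enough input to produce a digest
        print(hex(func.start), digest)
```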
So I spent some time coding that up, and a few hours later this was my output. What you see on screen is a function offset in a binary, that's the second column, which is exactly the same as a function offset in another binary, and all the way on the right are the labels. What I noticed here is that I have some functions in BlackMatter ransomware that I can also find in LockBit ransomware. I was pretty impressed, until I found out that two months earlier the researchers at Sophos had found exactly the same thing. Now, I'm a positive guy. You could say: hey, your research has been scooped, you did all that for nothing. No: I figured out that my research actually worked, my method worked. And what I wanted to do was automatically disassemble every malware sample that I have, put everything into a vector search database, and just hit a button to do my work. So I went on and found even more overlap: Clop ransomware had functions that are the same as in Ryuk ransomware, and Phobos and LockBit had overlapping code sequences. So, awesome, this is really going somewhere. Now, there are some constraints, but I'm going to talk about those later.
Back to the disassembler I mentioned. One of my specialties within Elastic is looking at Linux malware, and Linux has one particular property: it runs on a lot of architectures. You will see a lot of malware running on MIPS, ARM, ARM32, all kinds of things you normally don't see on a desktop PC. Take for example this little C program; it does absolutely nothing special, but if I compile it and put it into a disassembler, this is my output. On the left you see the result if I compile it for a Raspberry Pi, which runs on an ARMv6 processor, and on the right the output for an x86 system. In the middle you see the byte sequences, and if you look at them carefully you'll see they're absolutely not the same anymore, meaning that my whole idea of taking these bytes, turning them into vectors, putting them into a vector search database and finding similarities goes down the drain instantly. Well, luckily, disassemblers nowadays are way smarter than me, which is not that difficult, I admit. Disassemblers can translate assembly instructions into an intermediate language, and then eventually into C code, making the life of the analyst a little bit easier. What you see here are the same two functions, but processed by the disassembler, and you'll see that even though a lot of the details differ, the structure of the function is the same; it's basically recovered. Which means that if I do a TLSH calculation and compare the two functions after that, I get a difference value of 60. That doesn't mean there is 60% variance; 60 is an arbitrary number and it changes with how big the input function is. A common threshold is 80: anything below 80 is considered roughly the same. So that means I now have a second option: I can do the same calculations, but use the intermediate-language representation given to me by the disassembler.
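A sketch of that second option, hashing the intermediate-language text of a function instead of its raw bytes; the file names are placeholders, the property names are from memory of the Binary Ninja API and may differ between versions, and a real plugin would normalise the IL text much more carefully:

```python
import binaryninja
import tlsh

def il_digest(func):
    # render every medium-level IL instruction as text and hash the result;
    # func.mlil may be unavailable if analysis hasn't produced IL for this function
    text = "\n".join(str(insn) for insn in func.mlil.instructions)
    return tlsh.hash(text.encode())

bv_arm = binaryninja.load("hello_armv6.bin")  # hypothetical ARMv6 build
bv_x86 = binaryninja.load("hello_x86.bin")    # hypothetical x86 build

d1 = il_digest(list(bv_arm.functions)[0])
d2 = il_digest(list(bv_x86.functions)[0])
print(tlsh.diff(d1, d2))  # e.g. ~60 in the talk; below ~80 counts as "roughly the same"
```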
And that makes me happy, because now I can use plugins like this: I open my disassembler, the disassembler automatically sends a request to a vector search database, finds all the functions it already knows about, and gives me some information. It makes my life as an analyst way easier. Now of course that's not the only use case, and this whole talk isn't specifically about this one use case; it's about using techniques in a way they weren't built for. I know a lot of hackers use the same mindset. Vector search was developed to support machine learning models; it wasn't made for this. We do have multiple use cases for it internally now, but keep that mindset in mind. Now let's talk a little bit about constraints. Vector search is a new technology and a lot of development is still going on. Sorry, slides first. One of the constraints is code quality. TLSH was an interesting project; it was released a couple of years ago by Trend Micro and after that it was abandoned, and we can see that it really is abandoned: there are open issues, for example a memory leak that makes it basically unusable, and Trend Micro isn't addressing it. Probably they are internally, but the external code isn't maintained. Luckily one of my co-workers took it upon himself to make a C++ version of the library; if you want to use it, this is the link. However, I'm obligated to mention that I can mention the fork, but I'm not supposed to say that it is any good. So I've done my due diligence; Claire, make a note of that, please. But no, Chris did rewrite the entire algorithm, and this is the fork I use internally now, but yes, there are still a lot of issues with it.
Another thing: we analyze roughly a million files per week, and by "we" I don't mean manually; we have very smart people like Claire who build pipelines to make sure all these files get analyzed automatically. From every file I can extract over 200 functions and do a few calculations, and that means I would need to store about 10.4 billion vectors per year in a database.
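Here's the back-of-the-envelope version of that calculation; the ~48 bytes per stored vector is my own assumption to make the numbers line up, not a measured figure:

```python
files_per_week = 1_000_000   # roughly a million files analysed per week
functions_per_file = 200     # over 200 functions extracted per file
weeks_per_year = 52

vectors_per_year = files_per_week * functions_per_file * weeks_per_year
print(f"{vectors_per_year:,}")  # 10,400,000,000 vectors per year

bytes_per_vector = 48           # assumption: a TLSH hash body plus some overhead
print(f"{vectors_per_year * bytes_per_vector / 1e12:.2f} TB")  # ~0.5 TB -- all of it in RAM
```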
Now, you would say: OK, 10.4 billion is a big number, but databases are meant to handle big numbers. Yes, indeed, but for vector search the entire data set needs to be in RAM, in memory. So if we do a quick calculation of how much RAM I need to hold a single copy of one year's worth of functions, I would need roughly half a terabyte of RAM. I don't know about you, but when I went to our director and said I need a database cluster with about a terabyte of RAM, because I need some replication, he laughed in my face and said: try again, that's not happening, that's expensive, especially for a research project. Now, I mentioned that this space is moving very fast. There's a lot of interest in AI, in machine learning, in LLMs, and also in vector search, and that means you get to talk to a lot of people who are developing these solutions, as did I.
While I was talking to Benjamin Trent, one of the core vector search developers at Elastic and also an open-source Lucene developer, a very smart guy, he looked at my use case and said: Remco, those hashes, those are bit vectors. I looked at them and said: yeah, probably, I don't know, you would know better than me, you're the expert. A month later I was talking to him again: Remco, I have something for you, you should use a Hamming distance. I looked at him the same way and said: you're probably right, I have no clue what you're talking about, but if you say so. Turns out that calculating a Hamming distance is something very simple, and vector search developers are now implementing it in their products. What is it? Take two hexadecimal values and convert them into ones and zeros, into binary; the second one on the slide should have had one bit set at the end, so excuse the mess-up there. If you convert each hash into a binary value, do an XOR between the two, and then count the number of ones in the result, that count gives you the distance between the values.
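In Python, that whole description fits in a couple of lines; the two hex values are made up, differing only in the last bit:

```python
a = int("4941", 16)               # 0100 1001 0100 0001
b = int("4940", 16)               # 0100 1001 0100 0000

hamming = bin(a ^ b).count("1")   # XOR the two values, then count the ones
print(hamming)                    # 1 -- they differ in exactly one bit
```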
Now, I don't know if I'm making any sense, but this is really fast and really efficient, and it's a new algorithm that is now being implemented. The good thing about it, if you know something about CPU architectures, is that an XOR takes exactly one CPU cycle, so this calculation can be done three or four billion times per second on a modern CPU. Going back to my 10.4 billion vectors: I no longer need to store all of them in memory, because I can do an enormous number of these calculations on a modern system with this new technique.
Now, this specific technique is not the important part; you can forget about it. The reason I mention it at all is this: if you have an idea and you talk to the right people in the community about it, they will start thinking about it, they will start developing it. Especially in the world we live in today, where a lot of new code is being written and a lot of investment is going into AI, you can use those techniques for purposes other than what they were initially developed for. And that is also my closing remark. Is this the best solution for the use case? Probably not. There are probably better algorithms, easier or even more efficient hashing algorithms that I could have used, but this was an easy-to-implement, readily available solution for a problem I was facing. The whole concept of disassembling these functions, hashing them, storing them in a vector database and finding similarities took me maybe 20 hours of work, and I knew nothing about machine learning, nothing about vector search; I was where you are, like half a year ago. I just had to open the documentation and figure out how it works. It's readily available. So think about the techniques you work with on a daily basis, think about the new developments taking place in the world, and maybe you can combine the two and figure out a new solution. And that's it. If you want to reach me, these are a few contact details. I will be here for the rest of the day, including the afterparty, if you want me to repeat anything or if you have questions. I think I have 15 minutes left. Thank you. [applause] Any questions? Awesome, there you go.
Yeah, there is an algorithm that's widely used, if I'm pronouncing it right, ssdeep; that's one commonly used, you will see it on VirusTotal. There are other algorithms, but I don't know them by name. And actually TLSH uses another algorithm in the back end and does something smart with it. But yes, there are definitely other locality-sensitive hashing algorithms. I saw another question?
Well, let me find that one. Between, I think, Phobos and LockBit; I didn't expect that one. Sorry, too many slides. Yes, Phobos and LockBit, that was somewhat surprising, because we only thought BlackMatter and LockBit were working together, and Phobos apparently had a similar overlap. And Clop, that's an interesting one. While I was working in law enforcement I dealt with two major attacks where Clop was deployed, so I really didn't expect to see that in my results; that was definitely a surprising one. Anyone else? [inaudible audience comment] Yeah, that's way better, yeah. Anyone else? I guess not.