
Unmasking The Unseen: Effortless Linux Malware Reversing With LLMs - Remco Sprooten

BSides Munich · 27:16 · Published 2026-02

Yeah, thank you everyone. My name is Remco, but most people have trouble pronouncing that, so in most countries I just introduce myself as Bob; that's easier for everyone. Yes, indeed, we will have a bunch of demos today, so if there's anyone in the audience who doesn't like that, please let me know now so we can sacrifice someone to the demo gods. That's always helpful. No, in all seriousness, I have some demos. I'll try to keep it short because we only have 25 minutes, and I will be here for the rest of the day, so if we don't manage to get to any Q&A, just come up to me and ask.

Let's start with a little bit of background. The reason I started the project behind this talk was actually a project we presented at Virus Bulletin a couple of weeks ago. A colleague, Ruben, and I are both very interested in Linux malware, so at some point we decided it was about time we looked into it more, because there is way too much Linux malware going around and way too little attention paid to it. It's really under-researched.

Just to name an example: if anyone has a paid subscription to VirusTotal, you can do content searches over malware samples. The easiest way to find 80 to 85% of Linux rootkits is to search VirusTotal for the word "rootkit". It's actually in there most of the time; it is that stupidly simple to find. So we set out to investigate the current state of Linux rootkits in the world. We decided on some metrics and some things we wanted to investigate, and, as one does, one week into the research we decided: hey, this is a good idea, let's submit it to a bunch of conferences and see what happens. Of course, a month later you get an acceptance email and then think: oh, now we actually have to do the research.

What I haven't mentioned is that both Ruben and I are a little bit of overachievers. I have an ADHD brain, so stuff goes exploding all the time, everywhere. We started out with: oh, let's investigate the top 10 most used Linux rootkits. That became the top 20, the top 30, the top 40. Of course, every sample has multiple variants, multiple versions and updates. And before you know it, I was looking at a batch of some 300 samples that I needed to reverse engineer for this research, we had an acceptance email, and we needed to submit a paper within about two months. Yeah, that's not going to happen. [laughter]

Now, normally when I reverse engineer a sample, it takes me between four and maybe eight hours to really clean up the sample in my disassembler and get some know-how about what it's actually doing. Multiply that by 300 and you can figure out that that's not going to happen. Around the time we started this, there was a lot of talk about using something called MCP, the Model Context Protocol, in combination with disassemblers, and people had started playing around with that. Now, what is MCP? The Model Context Protocol is basically a protocol that was developed so that an LLM (GPT, Gemini, Grok, whatever you want) is able to use real-world tools and context during its execution.

Here's a very, very simple example, and I understand that not everyone may be able to read it, but here you see a simple MCP server that allows the LLM to do an addition. Very simple, but it gets the message across. I won't dwell on this example for too long, but you have to remember that MCP is still kind of new and people were experimenting with it.
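For reference, a minimal sketch of the kind of server shown on the slide, assuming the official MCP Python SDK's FastMCP helper (the talk doesn't name the exact library used):

```python
# Minimal MCP server exposing one addition tool -- a sketch, assuming the
# official `mcp` Python SDK; the slide's actual code may differ.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("adder")

@mcp.tool()
def add(a: int, b: int) -> int:
    """Add two numbers and return the result."""
    return a + b

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio by default
```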

One thing people really didn't take into account, and I'll come back to this at the end of the talk, is the amount of context you add to the LLM: the amount of overhead you generate. What you see here is a normal message, the kind you would send to GPT if you were typing into a web prompt: you have a system message where you tell the LLM "this is the role you're performing", and you have the human input, "please explain this code sample", for example. Now, if I add some tools via MCP to that conversation, the request looks like this: you tell the LLM "I'm giving you a function; the function allows you to decompile something", you describe that function, and for every function and every tool that I give to the LLM, the API request becomes significantly bigger. This is a very powerful mechanism, but you will see why the size of this request might become an issue at some point.
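To make that concrete, here is a hypothetical Anthropic-style request with a single tool attached; the tool name and schema are illustrative, not the speaker's actual plugin:

```python
# Hypothetical chat request with one MCP-provided tool. Every tool adds a
# full JSON schema to *every* request, so the payload keeps growing.
import json

request = {
    "model": "claude-sonnet-4-5",
    "system": "You are a reverse engineer. Explain what this code does.",
    "messages": [
        {"role": "user", "content": "Please analyse function sub_401200."}
    ],
    "tools": [
        {
            "name": "decompile_function",  # hypothetical tool name
            "description": "Decompile the function at the given address "
                           "and return pseudo-C source.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "address": {"type": "string",
                                "description": "Hex address, e.g. 0x401200"}
                },
                "required": ["address"],
            },
        },
        # ...one such entry per exposed tool
    ],
}
print(len(json.dumps(request)), "bytes before any code is even attached")
```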

Now, I set up a sort of workflow. Step back: I was still in the process of reverse engineering 300 samples. So what I did is take my laptop and write a small MCP server in Python. For the disassembler I used my favorite, Binary Ninja, which is great, for those in the reverse-engineering world. Yes, I know IDA is the de facto standard. I dislike IDA; I really don't like the product.

Binary Ninja has an amazing, easy-to-use API, so it was really easy for me to write the plugin I needed for this. So what I did: I have my LLM client (I used a bunch of different ones, but I'll show you that in a second). The LLM client talks to an MCP bridge, a program that translates MCP into something Binary Ninja will understand, and that bridge talks over HTTP to the disassembler.
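A rough sketch of what the Binary Ninja end of such a bridge could look like, assuming a headless-capable Binary Ninja licence and Flask for the HTTP layer; the endpoint shapes are mine, not the speaker's actual plugin:

```python
# Sketch of a disassembler-side HTTP service the MCP bridge could call.
# Assumes Binary Ninja's commercial/headless Python API and Flask.
from flask import Flask, jsonify
import binaryninja

app = Flask(__name__)
bv = binaryninja.load("sample.bin")  # analyses the sample at startup

@app.get("/functions")
def list_functions():
    # the MCP bridge relays this to the LLM as a tool result
    return jsonify([{"name": f.name, "start": hex(f.start)}
                    for f in bv.functions])

@app.get("/hlil/<name>")
def decompile(name: str):
    funcs = [f for f in bv.functions if f.name == name]
    if not funcs:
        return jsonify({"error": "no such function"}), 404
    hlil = "\n".join(str(i) for i in funcs[0].hlil.instructions)
    return jsonify({"name": name, "hlil": hlil})

if __name__ == "__main__":
    app.run(port=9000)
```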

Why HTTP? Because I was already thinking about the fact that these 300 samples were going to need an actual workflow, so I might need to put this disassembler on a machine somewhere that does the work. Just to keep things fair: people have written MCP servers for all the disassemblers, so if you're a fan of IDA, or even Ghidra (I don't know why), then please check out those projects. I also mentioned that I'm using an LLM client. In order to talk to an LLM via the actual API, you either need a client or you write the code yourself. I'm quite lazy, so I prefer to use anything that's already built.

Today in the demonstration I'll be using Cursor, but I also have good experiences with Roo Code. There are many more clients, but I haven't used them, so I can't recommend them. And that brings us to demo time. Let's see, did someone sacrifice anyone yet? In that case, be strong. What you see here on the left is Binary Ninja, just the normal interface, with a Linux rootkit loaded; it's slightly obfuscated. Because we're limited in time, I've already set up the connection between the LLM client and Binary Ninja in the background; you'll just have to believe me that that is what's happening here.

Normally my job is: okay, let's figure out what this function does, and this one, and this one, and then rename all the functions. Now, because I'm unable to actually type while people are watching, I'll copy and paste the prompt. I'm basically telling the LLM: reverse this binary, figure it out, be fast, because of time, you know. In this case I'm using Claude, and Claude is actually amazing at this stuff. It will figure out "hey, I've got an MCP server", because the API request that I send includes all the tools the MCP server provides, and it will start working on it.

Now, I told it to be fast. Of course it won't do that. But if you look at what it's doing right here, you can see every request that's being made, and it's trying to figure out what this file is. Give it a minute or two, and at that point it should present us with an outcome. I know it's hard to read, but it has already figured out: hey, this is a rootkit. It's really trying its best to reverse engineer it and figure out what it's doing.

Now, I actually do know what it's doing; I cheated a little bit. The rootkit here is one that I wrote myself, but it's slightly obfuscated, which makes it quite hard for an LLM to figure out what is happening. But as you can see, it has now confirmed that this is malware. Another thing: this report is great, but to me it is absolutely useless. Why? Because I can spin through 300 of these reports, but for the research I was actually looking for which syscalls are being used, what the binary actually looks like, which functions are used, and what the differences between multiple versions are.

So let's be honest: the prompt that I gave it was basically "reverse this malware". It's the same as a manager coming up to me, slapping me in the face, and saying "hey, reverse, reverse, reverse, do it". It's kind of rude, and I know we're not supposed to consider these LLMs human, but in some aspects they are a little bit like that. Now let's try again, and this time let's give it a more specific prompt. Again, I can't type while you're looking, so I'll paste it. In this prompt I'm asking it to actually rename the functions, and hopefully, if the demo gods are smiling upon me, we will see on the left that at some point some of these functions get renamed.

What that does is give me a database that I as an analyst can actually look at and extract the information I need from. I'll just let it run in the background.
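What "renaming functions in the database" amounts to on the Binary Ninja side is something like the following sketch; the API calls are real Binary Ninja Python API, but the address, names, and comment are made up for illustration:

```python
# Sketch: apply an LLM-suggested name and comment to a function, then
# persist the annotated database for the analyst to open later.
import binaryninja

bv = binaryninja.load("sample.bin")              # headless analysis
func = bv.get_functions_containing(0x401200)[0]  # hypothetical address
func.name = "xor_decode_config"                  # rename lands in the UI/DB
func.comment = "Decodes the embedded config with a rolling XOR key."
bv.create_database("sample.bndb")                # save the .bndb
```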

On average, this saved me a lot of time going through all the samples. Instead of 4 to 8 hours, once the LLM has done the initial work it only costs me about one hour per sample. Three hundred hours to go through all the samples is still a lot, but it's way better than trying to figure it all out by myself. This, however, is not free: for every prompt you see going up there, my employer is paying, AWS Bedrock in this case, and we will talk about the costs in a little bit. But as you can see now, if you can actually read it (I hope so), it figured out that this function is doing XOR decoding, and that this function is invoking the kprobe handler for process hooking. Even though the binary is slightly obfuscated, the LLM is able to figure out what it is doing. I also asked it to give me comments, so for every function it encountered it adds a comment, so that when I look at the database later I can read what is happening there.

Now, let's waste some money in the background; hey, I'm not paying for it. One thing I did halfway through running all the samples... well, actually, I didn't do it, Anthropic did: they released Claude 4.5. I was using Claude 4.0, and, tech boy that I am, I went "hey, new version, let's use this, let's upgrade", and switched to 4.5. It was actually getting better reversing results, but not completely the same ones.

Now, the thing is, and for this let's go back to the slides for a bit: is there anyone in the room who actually understands AI models? If so, I'm so sorry for what I'm about to do, because I'm oversimplifying the process. This is an AI model. Basically, you feed it a number, that number gets thrown into a network, all the nodes in the network do a computation based on the numbers we put in, and the output is hopefully the number we want as an output. Now, how do we actually generate this model? We basically start by choosing random numbers, check whether the input and the output are what we want, and if not, we throw it away and try again, and try again, and try again, and we keep trying until we finally find one model that gives us the output we expect.
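To illustrate that oversimplification (and only the oversimplification: real training uses gradient descent, not blind guessing), here is a toy version in a few lines:

```python
# Toy illustration of "pick random numbers until the output looks right".
# Not how real models are trained; it just mirrors the talk's mental model.
import random

def tiny_model(w, x):
    return w * x  # one "node", one weight

target_in, target_out = 3.0, 12.0  # we want f(3) == 12
best_w, best_err = None, float("inf")
for _ in range(10_000):
    w = random.uniform(-10, 10)                     # "choose random numbers"
    err = abs(tiny_model(w, target_in) - target_out)
    if err < best_err:                              # keep the best guess
        best_w, best_err = w, err

print(f"found w = {best_w:.3f} (ideal answer is 4.0)")
```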

Now, the thing is, if you go from model version 4.0 to 4.5, they will have given the model more information and trained it again, and the model will be smarter. However, it's a different model: the decision tree is slightly different, these numbers are slightly different. And again, I shouldn't compare this to a human, but think of it as the personality of the model having changed a little bit, so the way you ask the question can change slightly.

In my case, some things that Claude 4.0 did out of the box I had to spell out for Claude 4.5, in terms of "please don't output this" or "please add this to the output". So the context engineering, or prompt engineering, changed a little bit. My lesson learned here: if you work with models and you have a prompt ready for one model, there's no guarantee it will behave exactly the same on a new version of even the same model. But we can do better. As I said, I expected that I would want to run these files in batches, so we started using n8n to automate everything.

I know I'm running late on time, so please forgive me if I'm rushing a little bit. What I did is put the same setup I had into an n8n workflow. For those who haven't used n8n: it's an application where you can visually put building blocks together, connect multiple APIs to each other, and have it all work quite automatically. So basically the only thing I changed is that instead of an LLM client I'm now using n8n, and it's still communicating with Binary Ninja.

Binary Ninja is now running headless in the cloud somewhere, and this is what the setup looks like. Let's show that live. It's a slightly different sample, but it's doing just the same as what I did in Cursor a moment ago. There's a system message telling it exactly what it needs to do, and if you run this, you will see it start working; you can see what it's doing over here. Every request that's being made shows up as a list item here. And if you pay attention, you'll notice that every request I make is getting bigger and bigger and bigger.

Now, why is that? LLMs do not have a memory. The only memory an LLM actually has is whatever you send it at that moment. So if you want it to remember that it already reversed this function earlier, you need to send the entire chat over to the LLM again. Now, LLMs aren't that expensive per se; everything gets translated into tokens. Here you can see the number of tokens used for one request, and here the total tokens used for this conversation. We're talking about 0.3 cents per 1,000 tokens.

The thing is, I'm using 250,000 tokens for this conversation, and it's not done yet. I know from experience that you should not load up a huge Golang binary and then go get lunch. It is quite possible to run into the tens, maybe hundreds, maybe thousands of dollars if you just let this run and run and run. Yes, at some point you will hit the maximum context window, but if you make a config error, you will just keep sending 200,000 tokens in, over and over, until the end of time.
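The arithmetic is worth spelling out, using the talk's rough figure of 0.3 cents per 1,000 tokens (real prices differ per model and between input and output tokens):

```python
# Back-of-the-envelope token costs at $0.003 per 1,000 tokens.
PRICE_PER_1K = 0.003
for tokens in (250_000, 500_000, 10_000_000):
    print(f"{tokens:>10,} tokens -> ${tokens / 1000 * PRICE_PER_1K:,.2f}")
# 250,000 tokens -> $0.75; 500,000 -> $1.50 (the "$1.50 just by talking
# to you" below); 10,000,000 -> $30.00 -- per conversation, and a config
# error means you pay it again on every resend.
```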

And as I said, luckily I'm not paying for it, but we're now up to half a million tokens, so I already wasted about $1.50 just by talking to you. So, let's stop this for a second. Now, this was all good, but I'm basically getting the same result, and remember, this is still my manager coming up to me, slapping me in the face, saying "reverse". We can do better. So, five minutes for that. When I normally start looking at a binary, I gather more information first: I check VirusTotal, I send the sample into a sandbox, I do things to get more information before I even open a disassembler.

And thinking about that: maybe the LLM can use some external information as well. So what I built here is a second version of the workflow. Instead of sending the sample straight to the reverse engineer (the last node here is the reverse engineer), I first pull some information from VirusTotal, and some from our internal sandbox, and feed that into the model. But if you have ever worked with VirusTotal, you know that the reports it produces, and I'll show you one here, contain a lot of information. They are very, very big.

One thing you also have to know is that LLMs get overwhelmed quite fast: if you give one way too much information, it starts making stupid decisions. So what I'm doing here is, before I send anything to my reverse engineer, I have another model summarize the entire stream that I get from VirusTotal and feed the summary to the reverse engineer. The reverse engineer then knows "hey, I'm dealing with this kind of malware", it has some extra information, and it can actually produce some useful output.
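In pseudocode, that two-stage pattern looks roughly like this; `llm.complete` and both function names are hypothetical stand-ins for the actual n8n nodes:

```python
# Sketch of the two-stage triage pattern: condense the huge VirusTotal
# report first, then hand only the summary to the reverse-engineering agent.
def summarize_vt_report(llm, raw_report: str) -> str:
    return llm.complete(
        "Summarize this VirusTotal report in at most 200 words, keeping "
        "family names, detections, and notable behaviours:\n\n" + raw_report
    )

def reverse_engineer(llm, sample: str, context: str) -> str:
    return llm.complete(
        f"Triage context:\n{context}\n\n"
        f"Now reverse engineer {sample} using the MCP tools: rename the "
        "functions and add a comment to each one."
    )
```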

The last thing I want to show you, and then I'm almost out of time, is another version, and this one gets even more complex. I ran this execution before we started, just in the interest of time. We as engineers have been reverse engineering malware for years, and we've produced IDA databases, Ghidra databases, and Binary Ninja databases with marked-up code for every kind of malware we've investigated. And if I sit down to investigate a sample from a family I know I've already investigated, I'm not going to start by going through the sample all over again; I'm going to open my previous database.

So that's what I did here as well: I gave the LLM access to a code-similarity search. How is that set up? I basically set up a vector-search database with all kinds of code samples from databases we had already reverse engineered. In this example the main LLM will, let me find a decent example here... ah, here, I'll zoom in so you can actually see it... send over a code sample, just compressed; it doesn't even matter exactly what's in it.
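A minimal, self-contained illustration of the idea; the real setup used proper code embeddings and a vector database, whereas this toy uses character-trigram counts as a stand-in, and the index labels are invented:

```python
# Toy code-similarity lookup: "embed" functions, find the nearest known one.
from collections import Counter

def embed(code: str) -> Counter:
    # stand-in embedding: counts of character trigrams
    return Counter(code[i:i + 3] for i in range(len(code) - 2))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

# previously reversed, annotated functions (labels are illustrative)
index = {
    "WarmCookie: string_decrypt": "for(i=0;i<n;i++) buf[i]^=key[i%8];",
    "Generic: main_loop":         "while(1){recv(s,buf,len,0);dispatch(buf);}",
}

query = "for (j = 0; j < len; j++) out[j] ^= k[j % 8];"
best = max(index, key=lambda name: cosine(embed(index[name]), embed(query)))
print("closest known function:", best)
```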

In response, it gets an answer: hey, I already found this function inside the database, it's part of the WarmCookie malware. That gets fed back to the main LLM, which, just like a human analyst, can then look at the code and say: okay, I've seen this before, I can rename this function. That makes the whole thing more efficient. Now, wrapping up: this was the final setup. Instead of n8n just having access to Binary Ninja, it now also has access to a RAG setup, a vector database, and other sources of information, and that improves the analysis.

However, it does use a lot of tokens. Now, we did figure out that you can reduce the number of tokens by doing multiple passes. Instead of giving it the entire binary as code and asking it to reverse engineer the whole thing, you first ask it: take the first five functions, reverse engineer those, then stop. Then you ask a second LLM, in a second pass, to focus on one specific function, and you stop sending the whole conversation over and over again.
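Sketched in pseudocode (the helper names are hypothetical), the multi-pass idea looks like this:

```python
# Sketch: reverse a few functions per call and start each pass with a fresh,
# short context instead of replaying the entire conversation every time.
def reverse_in_passes(llm, function_names, batch_size=5):
    notes = {}
    for i in range(0, len(function_names), batch_size):
        batch = function_names[i:i + batch_size]
        result = llm.complete(
            "Reverse engineer only these functions, then stop: "
            + ", ".join(batch)
        )
        notes.update(dict.fromkeys(batch, result))
        # next pass starts clean: no accumulated transcript to resend
    return notes
```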

And a few lessons learned. Prompts are model-specific. LLMs get easily overwhelmed, just as humans do, if you give them too many tools and too much information. As with all AI projects, quality of input matters: garbage in means garbage out, so if you give it bad information, it will make bad decisions. And LLMs can get really expensive, so if you start an LLM-based project, make sure you preemptively consider the cost of what you're doing, because a fraction of a cent per thousand tokens translates into a lot of dollars once you are into the millions and tens of millions of tokens. Thank you. [applause]

Thank you for the talk. I'm sure we have a lot of questions. Just looking into the audience: one up front, and then back here. Think you can catch it? >> Oh, there you go. >> I think you probably get this obvious question all the time, but why Binary Ninja over IDA and Ghidra? >> Oh, this question I could spend two hours on. No. Ghidra: have you ever opened it? >> I have. >> Yeah, well, need I say more? If you've looked at it... no, the interface is just not made for humans. Binary Ninja, especially since version 4, has reached a level of decompilation where it's a huge competitor for IDA, at 20% of the price.

So just for that, it's my reason to stick with them. And if you look at the improvements in the API, the Rust API they provide now... and I'm not paid by Binary Ninja, by the way, I just love the project. I've been using it since the beginning, and you won't get me back to IDA anytime soon. [laughter] >> Thank you. >> Okay, I think we have time for one more question, back in the middle. >> Handing over the microphone again. I will be here; I'm open to people coming up to me, so if you have questions, just hit me.

>> Thank you for the presentation. One question: how much did the LLM hallucinate results, and did you ever have a point where you looked at it in Binary Ninja and thought, yeah, maybe the function name it gave, that's not what happened, or anything like that? >> Yeah, of course that happened, but to be honest, while using Claude 4.5 it was kind of rare to catch it hallucinating. I will say I used it a lot for Linux reverse engineering, and as I mentioned at the beginning, Linux is very under-researched, so the malware is rather simple. There are really complex Windows samples where more hallucination happens; with Linux samples it's quite obvious what is happening.

So I wouldn't say there was any really problematic part, no. >> Okay, thank you. >> Okay, thank you then. Another round of applause for Remco. Thank you.