
A Novel Approach to Script Language Detection

BSidesSF · 2025 · 22:29 · 125 views · Published 2025-06 · Watch on YouTube ↗
Speakers: Aaron James
Category: Technical
Style: Talk
About this talk
Lex Sleuther - A Novel Approach to Script Language Detection. Speaker: Aaron James. Join us as we go far off the beaten path in search of strange and exciting methods of script language detection. File signatures? Nope. Machine learning? Nah. Here be dragons, but dragons often guard treasure… https://bsidessf2025.sched.com/event/a3a271a71f0bb609a3e92ff9ef324a30
Transcript [en]

Hey folks, good morning. We're going to start off with the first talk of the day. Today we have Aaron talking about Lex Sleuther, a novel approach to script identification. A reminder: if you have questions, please log in to the Q&A app; the codes should be in your programs. We will be taking questions only through that as we go. And now, over to Aaron.

Thanks, everybody. Talk check. Talk check. There we go. All right, good morning everybody, and thank you for coming. I've got quite a bit to go through today, so the pace is going to be a little fast. Everything will be published at the end, and of course you can find the recording later. I'll try to leave as much time as possible at the end for questions, and if we end up going over, I'll let you know where to find me afterwards as well.

Welcome, as you said, to a novel approach to script identification. If you haven't heard those words in that order, stay tuned. The tool that I want to show you is called Lex Sleuther, and some of you might be able to guess how it works based on just the name, but if so, keep it to yourself. Spoilers. Who's responsible for this? My name is Aaron James and I like to build weird things and talk about them.

I've been coming to this conference for three or four years, but this is my first time on this stage, so thanks for having me up here. I do security research for the threat intelligence team at CrowdStrike. They're like a little security startup. They have a booth at RSAC; you should go check them out. And they pay me to build cool tools to support threat analysis at massive scale. That's basically me: I'm the cool tool guy. You can find me on GitHub. It's pretty sparse these days; having a full-time job will do that. If you want to come find me after the talk, I plan to oscillate between the CTF room and the bar. Just be warned that I will be busy winning.

With that said, let's get into it. Quick: what is this? Anyone know? Hey, okay, great. We're all at the same conference. This is VirusTotal. You submit a file, you get analysis aggregated over multiple sources. It's a little bit of an open secret that every security company that does malware analysis has some version of this, right? Some internal pipeline that automates this at varying degrees of scale. And based on how VirusTotal is organized, you might think such a pipeline looks like this: you submit a file, which is then processed by distinct dynamic analysis components that independently produce structured data.

But this is inaccurate, right? It really looks like this: static analysis is a prerequisite to effective dynamic analysis. They inform each other, and often the success of your dynamic analysis hinges on the success of your static analysis. Here's an example: static analysis could come to very different conclusions about the type of file being processed, and there is no one dynamic analysis engine that is going to handle all of these cases. Beyond that, each of these cases might need to be analyzed in a completely different environment. Which means if our static analyzer gets the file type wrong, then dynamic analysis is also unlikely to yield anything useful.

I don't want to oversimplify the file type decision in and of itself, either. It is certainly true that distinguishing a binary file from text is basically free, and it's also true that the static signatures available for identifying binary formats are pretty robust. The tricky part here is scripts. If we don't have the name of the file, if we don't have the file extension, what tools are at our disposal to tell PowerShell from, say, a Bash script? That's a rhetorical question. If those were our only two options, we could probably come up with a simple heuristic, but unfortunately malware is a diverse genus. If you pick randomly, you're probably going to get it wrong.

If you detect the wrong language, not only do you not get execution, but you wasted time and money, and the failure itself is difficult to detect. How you distinguish between an incorrectly identified script and a script that's plain broken is not obvious. Fun fact: most malware is broken. The answer is that you at least hope the script decision was right and the rest of your pipeline is working correctly. This conundrum is what Lex Sleuther tries to address. It takes your file as input and tells you what script language it's in. Now, I'm not coming up here to pretend that Lex Sleuther is the only tool that does this. There are actually heaps, which is why I'm going to briefly go over the main contenders.

For example, what about libmagic? That's the de facto standard that's been around forever: you run file on Linux and it tells you what a file is. To keep this part short, it's a suite of static signatures that is excellent at identifying most binary formats but poor, let's say, with script languages. Its signature spec just is not the right architecture for free-form text at all. It can catch some easy cases, but as you can see in my n=1 example, it is too prone to false negatives to be useful in this context. What about YARA rules? Well, the main problem is that you write them yourself and maintain them for all of eternity.

And this is a recipe for some very unpolished heuristics; rules like this one tend to, let's say, overachieve in the wild. I know what you're thinking: surely this is a contrived example with made-up numbers, right? So, in summary, YARA rules are the wrong tool for the job. Using them to identify script languages is like using a hammer to drive a screw: you can get surprisingly far with it, but you really should go find that screwdriver. Here's the screwdriver: Guesslang is a deep neural network with a linear classifier. And as you can tell by the GIF... oh, nice. It's used by VS Code, and it supports over 100 languages, which is actually kind of a bad thing for our context.

We'll talk more about why later. It's also not nearly as portable as the other options: it uses a lot of resources and is not suitable for constrained environments. In fact, a study done at CrowdStrike found that Guesslang was more accurate than our stitched-together heuristics, but it was also 1,000 times slower, so we made the call that the 6% accuracy boost was not worth it, and we kept looking. Now, if everything we've been looking at so far is the black-and-white portion of the infomercial, what we've got for you next is the magic tool that will breathe color back into the universe.

It's Lex Sleuther, right? No, the last word is Magika. You see, early last year Google used its bottomless ad money to summon the top minds in artificial intelligence and unify them, Avengers-style, around one goal: guess, but good. And honestly, they killed it. Magika supports every binary file type of libmagic and every text file type of Guesslang. It can infer both at over 99% accuracy in under five milliseconds per file, and the model that does this is small enough to fit on a floppy disk. How did they accomplish this? Honestly, it's not that complicated, but it is out of scope for this talk. So if you are interested in it, please go read the paper. It's short and sweet.

It's also worth mentioning that Google only created this project to improve a service of theirs called VirusTotal, which of course exists primarily to scan Gmail attachments, which in turn only exists to harvest user data for ads. So really, it's just ads all the way down. I mention this because this is what Google did to solve the exact problem that we're solving. So if Magika is so great, why build something new? Well, first of all, Magika was first released in February of 2024. Lex Sleuther started development in 2023. In other words, it didn't exist. Just being completely honest with you.

And if it did, maybe this talk wouldn't exist either. But I have other reasons. Second of all, Magika does have some weaknesses, which I will show, and I promise I am not huffing the copium; I have concrete examples. And third, it's fun, right? Building things is awesome. While it is true that perfect can become the enemy of good, it's also a shame to let good enough be the enemy of better. Better put: good enough can sometimes create holes. Let me show you what I mean. Here are some numbers from CrowdStrike, shared with permission. In our pipeline, only 8% of the files coming in are entirely text files, and of that 8%, less than 1% are executable scripts.

Of that 1% of 8%, CrowdStrike can classify scripts with around 90% accuracy. That should be good enough, right? Well, that's the funny thing about scale. The first number is 2.6 million files per day, which means we're processing at least 212,000 text files. And even if only 1% of those are executable, that's still about 2,000 scripts. Even at 90% classification accuracy, we're missing hundreds of scripts per day that are completely invisible to us. Those last few percent end up being a big problem, and that gap is worth trying to close. Which raises the question: do we stand a chance at creating something better than 90% accurate? I believe the answer is yes.

And I think we can get it done quickly and easily. Here's why. Most of the open source projects I've talked about support hundreds of file types and languages, but in our use case, we're only interested in about five or six languages for our pipeline. We can probably create a solution that's more effective than the state of the art by focusing on those languages only. A novel approach can often address gaps when combined with an existing system; narrowed scope is in and of itself a privilege. So let's brainstorm. Let's really just throw stuff at the wall here, as long as we throw stuff in the context of the problem that we're trying to solve.

Remember this diagram? Here's an idea: instead of trying to figure out which script runtime to use, use them all. Just send every script to the same dynamic analyzer, and once the script is in that virtual machine, execute it with every known scripting language. Start six processes and monitor them all. Probably only one of those processes stands a decent chance of executing correctly, and the rest will all error out. But at least you're much more likely to get execution, which was kind of the point to begin with.

As I'm sure you can tell, there are a few problems with this approach. The big one is that we now have to trace six processes simultaneously instead of one, and somehow disambiguate which one was successful. How to do that is not obvious, and pretty soon you're trying to solve that problem instead of the original problem. Maybe we get lucky and it's simple, though. So what would happen if we tried to execute a PowerShell script with Python? Well, usually we would get a syntax error like this. And that gives me an idea: what if, instead of trying to run the script with six different runtimes, we run it through six parsers instead? These parsers won't execute anything; they just tell us whether there was a syntax error or not. Surely the parser that doesn't error out is the correct script runtime to use.
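
To make the idea concrete, here's a rough sketch of that parser-as-oracle approach. The interpreters and their documented check-only flags (bash -n, node --check, python -m py_compile, perl -c) are my picks for illustration, not the talk's exact language list:

```rust
use std::process::Command;

// Sketch of the "six parsers" idea: ask each interpreter to syntax-check
// the file without executing it, and keep the languages whose parser
// exits cleanly. The interpreter list here is illustrative only.
fn plausible_languages(path: &str) -> Vec<&'static str> {
    let checks = [
        ("bash", "bash", vec!["-n", path]),
        ("javascript", "node", vec!["--check", path]),
        ("python", "python3", vec!["-m", "py_compile", path]),
        ("perl", "perl", vec!["-c", path]),
    ];
    let mut plausible = Vec::new();
    for (lang, program, args) in checks {
        // A parser that exits zero did not hit a syntax error.
        if let Ok(output) = Command::new(program).args(&args).output() {
            if output.status.success() {
                plausible.push(lang);
            }
        }
    }
    plausible
}
```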

And there are still issues with this idea. But something like this is at least feasible for a small project. Some of these parsers are very obtainable; some are not. For something like Microsoft Batch, the implementation is the language standard, and writing our own Batch parser also feels a little in the weeds. We're getting lost in the sauce here. We need to be less clever about this. We don't need to generate abstract syntax trees or do incremental parsing. We just want to know: does this file contain tokens characteristic, or uncharacteristic, of say a Python file? And you don't have to open your Dragon book (sorry about the blinding slide) to know that parsing is complicated.

So let's do as little as possible. Let's just focus on the first step. We said we only cared about the tokens, so what if we only did the lexing part? How hard could it possibly be to write six lexers? I know that sounds rhetorical, but honestly, I think it would be pretty easy if we used a lexer generator. Oh no, I found one on crates.io. Oh look, it uses Rust proc macros, so the syntax is easy; it's just like writing a formal PEG. Oh no, I'm suddenly writing lexers in Rust proc macros. If I'm not careful, I'll accidentally write 250,000 lines of expanded code in less than 12 hours.
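
For a flavor of what a generated lexer looks like, here's a tiny sketch using the logos crate (0.13-style derive API). The talk doesn't name which generator Lex Sleuther uses, and this token set is a toy subset of Python, so treat everything below as an assumption:

```rust
use logos::Logos;

// A toy lexer for a handful of Python-ish tokens. The real tool defines
// six of these, one per supported language, with far richer grammars.
#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"[ \t\r\n\f]+")] // whitespace is noise for our purposes
enum PyToken {
    #[token("def")]
    Def,
    #[token("return")]
    Return,
    #[token(":")]
    Colon,
    #[regex(r"[A-Za-z_][A-Za-z0-9_]*")]
    Ident,
    #[regex(r"[0-9]+")]
    Number,
    #[regex(r"#[^\n]*")]
    Comment,
}

fn main() {
    // Unknown characters come back as Err(()), which is itself a signal:
    // a genuine Python file should produce very few of them.
    for result in PyToken::lexer("def f(x): return x + 1  # demo") {
        println!("{:?}", result);
    }
}
```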

Oops, we just built Lex Sleuther, also known as "I wrote six lexers and stapled them together." And the trademark is not pending on that, by the way. That sounds cool, but I hear you asking: how is it useful for us to do this? Let's take it from the top. We can now create token streams from an input file. Through a process that is still a mystery, we want to take those token streams and produce a set of scores correlating with how likely each language is. Then we just pick the highest score to decide on a verdict. The only missing piece is the middle bit. We get that, and we're done.

So let's try something simple. We do not have to get into data science here; we do not have to come up with something brilliant. It can be dumb, and it might be. Taking things from the top, let's say we have some Python as our input file; don't worry about what the code does. This is what happens when it goes through our Python lexer. Look, you like that? I can go back, too. It goes backwards, it goes forwards. That's so cool. Anyway, I love PowerPoint. Now, there's a lot of information here that might be useful for scoring, but in the interest of keeping things simple, let's just throw most of it away and just count token types.
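
Continuing the lexer sketch above (this assumes the PyToken enum from it), counting token kinds is only a few lines; the variant-to-slot mapping is my own illustration:

```rust
// Throw everything away except the token kind and tally counts per kind.
fn py_token_counts(src: &str) -> Vec<f64> {
    // One slot per PyToken variant, in declaration order.
    let mut counts = vec![0.0_f64; 6];
    for result in PyToken::lexer(src) {
        if let Ok(tok) = result {
            counts[tok as usize] += 1.0; // fieldless enum casts to its index
        }
    }
    counts
}

// The model input is the concatenation across every lexer, e.g. something like
// [py_token_counts(src), ps1_token_counts(src), html_token_counts(src)].concat()
// where the other helpers are hypothetical counterparts for the other lexers.
```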

I said dumb, right? All of these token counts make up one long vector. And this isn't just Python tokens, but also token counts from the PowerShell lexer, the HTML lexer, and so on. We lex the file with all of them, and the counts are all concatenated into one vector. That is our input, and our output is those six language scores. Sorry about the math notation. Now, if our input is m rows and one column and our output is six rows and one column, then we know that any linear relationship between them can be modeled by a six-row, m-column mystery matrix. And I'm afraid I have tricked us into doing linear algebra.

So this isn't just handwaving; there's an intuitive definition for each of these weights: quantitatively, how much does each token contribute to each language verdict? We need to figure out our weights. We can't just pick random numbers; I tried. We need to mathematically derive them, which we can do if we create the inverse system, like so. Focusing on this system: to derive the weights for a single row of the matrix, we would ingest a corpus of files and populate the system so that each row is all of the token counts for a single file. Then we would assign ideal scores based on the expected verdict. We just make up the convention of zero and one here, but we could have used anything.
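
Spelled out in notation (mine, reconstructed from the description), the two systems are:

```latex
% Scoring: token counts in, language scores out.
%   x : m-vector of concatenated token counts for one file
%   W : 6 x m weight matrix (the "mystery matrix")
%   s : 6-vector of language scores
s = W x
% Training, one language (one row of W) at a time:
%   A   : n x m matrix, one row of token counts per corpus file
%   b_j : n-vector of ideal scores (1 if the file is language j, else 0)
%   w_j : m-vector of unknown weights
A w_j = b_j
```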

And then the only unknown is that weight vector, which means we are dealing with a textbook Ax = b linear system. Provided that n, the number of files, is strictly larger than m, the number of token types, the system is overdetermined and we can use linear regression to compute optimal weights. Most linear algebra libraries can do this in just a few lines of code. You can call this machine learning if you say something like "oh, it's a feed-forward neural network with zero hidden layers." I just call it pre-calculus. This approach does require you to have a corpus of pre-classified files to train with.
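
Here's a minimal sketch of those few lines, assuming the nalgebra crate (the talk doesn't say which library Lex Sleuther actually uses). fit_weights solves one row of the weight matrix by least squares; verdict is the scoring step from earlier:

```rust
use nalgebra::{DMatrix, DVector};

// Least-squares solution of the overdetermined system A w = b via SVD.
// `a` is the n x m matrix of per-file token counts; `b` holds the ideal
// 0/1 scores for one language. Repeat per language to build the 6 x m W.
fn fit_weights(a: &DMatrix<f64>, b: &DVector<f64>) -> DVector<f64> {
    a.clone()
        .svd(true, true)
        .solve(b, 1e-12)
        .expect("least-squares solve failed")
}

// Scoring: multiply the concatenated token-count vector by the weight
// matrix and take the index of the highest score as the verdict.
fn verdict(weights: &DMatrix<f64>, x: &DVector<f64>, langs: &[&str]) -> String {
    let scores = weights * x;
    let (best, _) = scores
        .iter()
        .enumerate()
        .max_by(|l, r| l.1.partial_cmp(r.1).unwrap())
        .expect("no scores");
    langs[best].to_string()
}
```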

I wrote a helper script that lets me preview text files and classify them using my numpad, so I can, I don't know, drink my coffee with my other hand. The script is called LSD, which stands for Lex Sleuther Dataset. While I'm using LSD, I can classify about a thousand files an hour, so it's sometimes just a good way to relax. Once that's done and we've trained our model, we can start using it to issue verdicts. To make all this useful, I gave Lex Sleuther a CLI as well as Python bindings (publishing pending). It's designed to have a usage that closely resembles the GNU file utility, and in fact we can compare the results of the two side by side.

For the small example folder that I showed earlier, it seems to be working: it manages to get them all right. That bodes well. And just as a reminder, this is what libmagic reported for the same data set. In my mind, this is a huge win. Our solution wasn't all that clever, but we're off to a good start. The next step is to evaluate performance over a much larger data set, and thankfully we can already do this with the help of our old friend LSD. With this single command, we can compare accuracy between Lex Sleuther and, in this case, File ID, which is CrowdStrike's legacy classification system powered by libmagic, YARA rules, and staples.

This will tell us how close we are to achieving 90% accuracy. Here's what it says. Oh. Well. Whoops. That's unexpected. So it turns out we're done, actually. It's good enough. And I think it goes without saying that we can already prefer Lex Sleuther over File ID. I don't want to understate it: this is a great result. But still, is there any reason to prefer this over something like Magika? Seemingly no. Magika is very, very good, and the head-to-head is not favorable here. But this is a good example of how evaluating different tools can get a little complicated.

Scope again, right? Narrowed scope is a privilege. Why did Lex Sleuther's accuracy go down here? Because this comparison covers Magika's supported file types, all 200 of them. Lex Sleuther only supports six, so it's going to give a wrong answer for anything that's not one of those six. If we limit our analysis strictly to the six file types that Lex Sleuther supports, the numbers tell a slightly different story. In those cases, Lex Sleuther is competitive with Magika in accuracy. But a more revealing metric is the false negative rate, where lower is better.
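
Since the term carries the comparison: a false negative here is a file that truly is language L but got classified as something else. A throwaway sketch of the metric, with names of my own choosing:

```rust
// False negative rate for `lang`: of all files whose ground truth is
// `lang`, the fraction the classifier labeled as anything else.
fn false_negative_rate(truth: &[&str], predicted: &[&str], lang: &str) -> f64 {
    let positives = truth.iter().filter(|&&t| t == lang).count();
    let missed = truth
        .iter()
        .zip(predicted.iter())
        .filter(|&(t, p)| *t == lang && *p != lang)
        .count();
    if positives == 0 {
        0.0
    } else {
        missed as f64 / positives as f64
    }
}
```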

The false negative rate is a metric by which File ID is laughable. Magika is an order-of-magnitude improvement, but interestingly, Lex Sleuther manages to be even better. The way you can think about this: both are pretty likely to get it right, but Lex Sleuther is less likely to get it wrong. Glad we cleared that up. So, remember this diagram from earlier? Who won? Does CrowdStrike end up using Magika or Lex Sleuther to fill in that mystery component? The answer, of course, is yes. The truth of the situation is that we use everything, libmagic, YARA rules, Magika, and Lex Sleuther, to draw conclusions about which dynamic analysis engine to select. Any mature classification system needs that big ugly block of business logic that makes decisions. Maybe it's, I don't know, a 1,400-line if-else statement.

You can call it that if you want to. I call it the synthesis selector when I'm feeling sophisticated. By carefully interleaving heuristics and tweaking their evaluation order, you can create a very intelligent decision engine that has the best features of its parts but is also easily maintainable. Meaning that if there's an edge case, you can easily insert a new rule to remediate it, and you can prioritize other tools in specific cases. You wouldn't have that ability to remediate if you just slapped Magika, or just slapped Lex Sleuther, onto the problem. But all of that is unnecessary detail. The bottom line is that, in the context of CrowdStrike, the presence of Lex Sleuther in this system boosts our dynamic analysis efficacy on script samples to 97%, which is pretty good.

And if you want to try it: as of this talk, well, actually as of five minutes before this talk (I had a little bit of trouble), Lex Sleuther is now open source, so you can go check it out. If you are the kind of person to pipe curl into bash, you can cargo install it and run it right now, and use it to tell you if a file is one of the six file types. I mean, it's kind of a limited tool, but we're putting it out there because I think its approach is unique.

And that is all I had to talk to you about today. Thanks for having me. [Applause] Thank you.