
Detecting, Deobfuscating, and Preventing Obfuscated Script Execution with Tree‑sitter

BSides Las Vegas · 38:39 · 26 views · Published 2025-12
About this talk
Identifier: LBQDEB

Description:
- “Detecting, Deobfuscating, and Preventing Obfuscated Script Execution with Tree‑sitter”
- Discusses obfuscation in PowerShell, Python, and JavaScript.
- Explains AMSI bypasses and detection challenges.
- Demonstrates custom AMSI provider DLL using tree‑sitter parsing.
- Showcases automated detection of AMSI bypass attempts and obfuscated payloads.

Location & Metadata:
- Location: Breaking Ground, Florentine A
- Date/Time: Monday, 18:00–18:45
- Speaker: David McDonald

Good afternoon and welcome to BSides Las Vegas Breaking Ground. This talk is "Detecting, Deobfuscating, and Blocking Fileless Malware with Tree-sitter," given by David McDonald. A few announcements before we begin. We'd like to thank our sponsors, especially our diamond sponsors Adobe and Aikido and our gold sponsors Formal and Dropzone AI. It's their support, along with other sponsors, donors, and volunteers, that makes this event possible. These talks are being streamed live, and as a courtesy to our speakers and audience, we ask that you check to make sure your cell phones are set to silent. If you have any questions, use the audience microphone so YouTube can hear you. With that, let's get started. Please welcome David McDonald.

Hey everybody, thank you all for being here with me at the end of the day. Can y'all hear me? Okay. Cool. So today I'm going to be talking a bit about Tree-sitter and how we can use it to detect and, you know, what the thing says. So when I first started at Volexity, I had not had much exposure to malware in general. So I came across this IcedID LNK payload, and I was really confused by just what it was, because to me it looks very obvious that it's malware. So my question was, why would somebody do that? Here's another sample that I looked at, which is an Emotet PowerShell payload.

And again, it's totally scrambled up. To me, it looks obviously bad, but there is a very good reason for doing this, and it's twofold. The reasons are to slow down your reverse engineering efforts when you're trying to investigate what a piece of code does, and to bypass signature scanners. So, I took this sample. I wanted it to be my benchmark for, can I deobfuscate this? And this is what I came up with. So this uses Tree-sitter to iterate over the actual syntax tree of the file, put together the pieces, and reveal the actual underlying malware. So you can see it now. This is a

slowed-down version, and I put a call to sleep between every iteration. But what it does is it looks for individual things that it can reduce down, and then applies them to the tree and updates it until you finally have something usable.

So there are some existing methods for deobfuscating PowerShell, as you can see here. I don't necessarily recommend that. There's also Revoke-Obfuscation, which doesn't exactly deobfuscate. It's more for identification of obfuscated PowerShell. And then there are things like PSDecode and PowerDecode, which are tools where you actually run the PowerShell in a PowerShell interpreter, but you patch certain functions to try to make the malware no longer malicious. It's kind of heavyweight. It requires PowerShell. You don't want to run it in your production environments. It needs to be in a sandbox. You certainly don't want to do it for every single script that runs on an endpoint. And it can't handle corrupt data. So if

you come across a broken piece of code, it's just going to choke. And here's an example of how they would patch that function. Here we have them overriding the Invoke-Expression function. So instead of actually invoking the expression, you just capture the parameters and write them out to the host. Some additional challenges are just the sheer number of data sources that you might have when you're dealing with PowerShell. You could get it from a file collected as part of a forensic collection. You might have script block logging, a C2 endpoint that you pulled it off of, cached files in memory, just loose data in the kernel, or even network traffic in some scenarios.

Additionally, in order to run those dynamic deobfuscators, you need to be somewhere with a PowerShell environment, which makes it challenging for cross-platform solutions. And again, the sheer scale of data for every piece of PowerShell that runs across an enterprise is massive, and you don't want to do that. So, as far as obfuscation goes, there are several popular open source obfuscators, and attackers are lazy, and they're going to use these things because they work. And who wants to actually go through and chop up a file bit by bit and put it together? If these things are out there and they bypass signature scanners, they're going to use them.

And they're still effective against signature-based scanning. Even relatively sophisticated signature scanners can be bypassed pretty easily if you just apply enough obfuscation. You iterate, you experiment. There are tools like, I forget what it's called. I don't know. There's a tool that lets you run AMSI on a file, and it'll show you exactly which portions of the file are the ones that the signature is flagging on. So, Windows has the Antimalware Scan Interface, and this can detect many malicious and obfuscated PowerShell payloads, but AMSI bypasses continue to evolve. You'll have a bypass, it'll work for a while, they'll put out a new signature for it, you tweak it, it works again, they put out a new signature, and

so on and so forth. So those are signature based. They're not syntax aware. They're just relying on byte patterns, and that's really their fundamental weakness. Here's an example of that. This is from the trolls repo. You can see here that Windows Defender detects this payload. It has the AmsiUtils string right there, and that's kind of what it's flagging on. But this is not detected, and it's really just as simple as splitting the string up into different components, concatenating them together, and substituting them in with a variable. This is a snippet from the trolls repo which says that System.Management.Automation.AmsiUtils is being flagged now. Just do a basic obfuscation of it, like the example

below, and it works again. And they're not even trying hard here. So it's kind of a sad state of affairs that you can do an AMSI bypass with something that trivial. Here's an example of how you might try to detect obfuscated PowerShell with YARA. You look for the common components of something that has been run through Invoke-Obfuscation, like these format strings, format specifiers, and Invoke method calls, and then you have to hit on really a lot of them, like more than 11, and more than 10, and more than 10 of the three. So for smaller snippets of PowerShell that have been obfuscated, that's just not going to work at all.
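To make the weakness of byte-pattern signatures concrete, here is a minimal Python sketch. It is not how Defender or AMSI works internally; the single `SIGNATURE` value, the function names, and the regex-based folding are all assumptions made for illustration, and the regex is a crude stand-in for real syntax-aware parsing.

```python
import re

SIGNATURE = b"AmsiUtils"  # hypothetical byte pattern a naive scanner looks for

def byte_signature_hit(script: str) -> bool:
    """Naive scanner: flag the script only if the raw bytes contain the signature."""
    return SIGNATURE in script.encode("utf-8")

# Two adjacent single-quoted literals joined by '+', e.g. 'Amsi' + 'Utils'.
CONCAT = re.compile(r"'([^']*)'\s*\+\s*'([^']*)'")

def fold_concatenations(script: str) -> str:
    """Repeatedly merge 'a' + 'b' into 'ab' until nothing changes."""
    previous = None
    while previous != script:
        previous = script
        script = CONCAT.sub(lambda m: "'" + m.group(1) + m.group(2) + "'", script)
    return script

plain = "[Ref].Assembly.GetType('System.Management.Automation.AmsiUtils')"
split = "[Ref].Assembly.GetType('System.Management.Automation.Am' + 'si' + 'Utils')"

print(byte_signature_hit(plain))                       # True
print(byte_signature_hit(split))                       # False
print(byte_signature_hit(fold_concatenations(split)))  # True
```

Splitting the string is enough to defeat the byte match, and folding the concatenation back together is enough to restore it, which is the gap the rest of the talk addresses with a real parser.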

So again, this was the challenge, and through Tree-sitter I found a solution, and the key was what I was calling at the time atomic operations. So what are the smallest discrete transformations you can make on a syntax tree in order to deobfuscate something? So let's say we have an operation like this, just adding two strings together in the source code. Well, it's really easy to deobfuscate this if you have the required information, which is that at byte range 0 to 4 you have a string literal. In the middle here, in byte range 5 to 6, you have an addition operator. On the right side, you have another string literal at range 7 to 11.
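The byte ranges just described, together with an enclosing additive expression spanning 0 to 11, can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Node` class, the regex tokenizer, and the sample source `"ab" + "cd"` are hypothetical stand-ins, not Tree-sitter's API, and the offsets are byte offsets only because the input is ASCII.

```python
import re
from dataclasses import dataclass

@dataclass
class Node:
    kind: str
    start: int  # inclusive offset
    end: int    # exclusive offset

TOKEN = re.compile(r'"[^"]*"|\+')

def parse_additive(source: str) -> list[Node]:
    """Tokenize `"a" + "b"` into leaf nodes plus one enclosing additive_expression."""
    leaves = [
        Node("string_literal" if m.group().startswith('"') else "+", m.start(), m.end())
        for m in TOKEN.finditer(source)
    ]
    parent = Node("additive_expression", leaves[0].start, leaves[-1].end)
    return [parent] + leaves

def fold(source: str) -> str:
    """Atomic operation: replace string + string with one string literal."""
    parent, left, _op, right = parse_additive(source)
    merged = source[left.start + 1:left.end - 1] + source[right.start + 1:right.end - 1]
    return source[:parent.start] + '"' + merged + '"' + source[parent.end:]

source = '"ab" + "cd"'
for node in parse_additive(source):
    print(node.kind, node.start, node.end)
print(fold(source))
```

Because the edit is expressed purely in terms of node kinds and ranges, the same operation applies wherever that shape appears in a larger tree.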

And then this is all encompassed as part of an additive expression in that range of 0 to 11. Now, this is where Tree-sitter comes in. Tree-sitter is really awesome. It's part of Neovim, and it was originally developed for the Atom text editor. So what you do is you write a language grammar in JSON and that gets converted down into, I'm sorry, you write the language grammar in JavaScript, it gets converted to JSON, and then it finally gets converted down into an actual parser in C. There's a ton of support for most mainstream languages. If it's a language you use every day, there's almost certainly a really, really high quality grammar for it. And

there's even an active development community for less common languages. The Tree-sitter API allows you to do queries against the actual syntax tree using something called S-expressions. I know Lisp is not the most popular language, but this is very much related to that. So in Neovim this lets you do fun things like language injections and advanced code navigation. This is really how I found out about this in the first place, just because I saw some YouTube video where TJ DeVries was like, you can get your strings formatted as SQL embedded within your Python source code files if you learn how to do all this stuff. So,

I was like, "You know what? All right. I'll spend two hours after work doing this. That's fine." And sure enough, it works. And it's kind of cool. So you can see here we have some Python, and it has a query, and inside of the multi-line string, the text is formatted as SQL. So, here's what that looks like more concretely in Python. I didn't use PowerShell yet because the grammar is not very nice, but Python's a lot cleaner. We're just going to query for an assignment operation to this variable here. We're going to label it statement. We're going to check to see if the identifier is equal to foo. Oops.

And you can see right here it has said, hey, I have an identifier named foo that's part of an assignment statement. So that's the query syntax right there. And on the bottom is your tree inspector. So the queries that you run look very similar to the trees themselves, which makes them a little easier to write. You can just grab a little bit out of this, paste it into your query editor here, start adding your conditions, and then you have some success. Here's another Python example. If we wanted to find a binary operation that was two strings that said, like, hello world, we have a binary operator,

the left side's a string, the right side's a string, the left side is equal to hello, and the right side matches as a regular expression, wl.
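The two queries described above can be approximated without Tree-sitter installed. The sketch below uses Python's built-in `ast` module as a stand-in, so it is an analogy rather than Tree-sitter query syntax; the sample source, the helper names, and the `^wor` pattern are assumptions chosen for illustration (the pattern used on the slide is not recoverable from the recording).

```python
import ast
import re

SOURCE = '''
foo = 1
bar = 2
greeting = "hello" + "world"
'''

tree = ast.parse(SOURCE)

def assignments_to(tree: ast.AST, name: str) -> list[ast.Assign]:
    """Stand-in for a query that captures an assignment whose identifier equals `name`."""
    return [
        node for node in ast.walk(tree)
        if isinstance(node, ast.Assign)
        and any(isinstance(t, ast.Name) and t.id == name for t in node.targets)
    ]

def string_additions(tree: ast.AST, left_equals: str, right_pattern: str) -> list[ast.BinOp]:
    """Stand-in for a binary-operator query with an equality and a regex predicate."""
    hits = []
    for node in ast.walk(tree):
        if (isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add)
                and isinstance(node.left, ast.Constant) and isinstance(node.left.value, str)
                and isinstance(node.right, ast.Constant) and isinstance(node.right.value, str)
                and node.left.value == left_equals
                and re.search(right_pattern, node.right.value)):
            hits.append(node)
    return hits

for node in assignments_to(tree, "foo"):
    print("assignment to foo on line", node.lineno)
for node in string_additions(tree, "hello", r"^wor"):
    print("string addition on line", node.lineno)
```

The structure mirrors the query idea: first constrain the node shape, then apply predicates to the captured children.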

And so when I was looking for a PowerShell grammar, I actually went to Microsoft's repos, since they owned the original version, but they had kind of just let it die. It wasn't being maintained. And I noticed that Airbus had forked it and had an active fork. So I was like, this is really promising. An incident response team has forked the Microsoft PowerShell repo, or the Tree-sitter PowerShell grammar repo, and is making it work. That's pretty cool. So here's an example of how Tree-sitter can be used for really powerful PowerShell scanning. We take 744 clean PowerShell management scripts, run them all through token obfuscation in Invoke-Obfuscation, and then design a query to target those

obfuscations specifically. And when I say query here, I mean those Tree-sitter queries like we wrote on the last slide, and then we see how it did versus YARA. So this is a query for the PowerShell format string command. PowerShell has a very recursive grammar, so you get these deeply nested queries, which makes it not very friendly to deal with. But we're looking for this, essentially, and I found a tool called tree-grepper that lets you, instead of using regular expressions, grep across files with Tree-sitter queries. So YARA had 86 hits out of all the files that were obfuscated. tree-grepper had 616. And this is the output of tree-grepper. You can see it figured out exactly which

parts of the file we were looking for and printed them out. So just right off the bat, that's really awesome PowerShell scanning. Tree-sitter has great documentation and a nice Rust API as well as Python bindings. There are actually bindings for quite a few languages. So, Invoke-Obfuscation is the most well-known of the PowerShell obfuscators. It uses several different methods in order to scramble up the code and make it hard to read.

Token obfuscation is pretty common. You'll see them injecting backticks, which is an easy way to throw off signature scanners; splatting and concatenating, breaking it up into strings and then using this ampersand operator, which will basically execute that; and then also reordering using format strings. And these are really trivial to detect with Tree-sitter, as you saw in the last slide. The syntax trees of obfuscated scripts have characteristics that are rarely found in benign scripts, and a lot of these things don't get reconstructed in script block logging, as useful as it is for finding things that have been loaded through a gzip cradle or something.

Here's an example of finding something that has been obfuscated using token obfuscation. We can actually write a really simple query that just looks for commands that have a backtick in them, and boom, there it is. There's more tree-grepper output for those splat and concat type obfuscations, where you break things down into different strings and then execute those. A lot of these are using the static environment variables as well.
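As a rough illustration of the backtick check, the Python sketch below looks at the first token of each statement or pipeline segment and reports command names that contain a backtick. The regex is a crude, assumed approximation of a "command name" node; a real Tree-sitter query would match the grammar's command node directly and would not be fooled by backticks inside strings or comments.

```python
import re

# First token after the start of the script or after | ; & ( or a newline,
# used here as a crude stand-in for a "command name" syntax node.
COMMAND = re.compile(r"(?:^|[|;&(\n])\s*([A-Za-z`][\w`-]*)")

def ticked_commands(script: str) -> list[tuple[str, str]]:
    """Return (as_written, backticks_removed) for command names containing a backtick."""
    hits = []
    for match in COMMAND.finditer(script):
        name = match.group(1)
        if "`" in name:
            hits.append((name, name.replace("`", "")))
    return hits

script = "Invo`ke-Expre`ssion (Ne`w-Ob`ject Net.WebClient).DownloadString($u)"
for written, cleaned in ticked_commands(script):
    print(written, "->", cleaned)
```

A benign management script has essentially no reason to write command names this way, which is why a hit is a strong signal even in a very short snippet.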

And then you have AST obfuscation, which is manipulation of the actual AST nodes themselves, just reordering things, shuffling them around, like arguments to functions. And if you are syntax aware and writing queries, this doesn't matter to you that much, because you can account for that, and sometimes it won't even matter for Tree-sitter queries. Then we have string obfuscation, which is basically, let's break this whole thing down into a string and then execute it as a new script block. You can guess here what this character or character X is going to do. And then you have the static environment variables. They are going to use the command invocation operator to basically

build commands out of strings and then run them. And there's actually a really special node type for that operator that you can query directly. So while writing a regular expression to try to find these would be a nightmare, and you'd get a million false positives, if you just query for the command invocation operator, it'll point you to every single one with perfect precision. More static environment variable nonsense. These are just fixed values that are generally going to be the same on every system. So you can use those to get an i and an e and so on and so forth. You can also do type casting to scramble up strings. So we just cast

numbers down to characters and then format them into a string. That's really easy to find: cast expressions, with a query. And so you can see what I'm getting at here is, if all of these individual components are very easy to find with queries, then we can actually build an engine that will locate all of them, edit the actual document, and work it towards something more readable. So what I came up with is a bunch of these operations that you just execute in a loop until you're no longer finding any more work to do. So that was how we got to the solution I showed earlier on. You can even do cool stuff like

dealing with random variable names, where they'll give it some kind of junk name, and it'll be kind of hard to hold in your head as you read the script because it's just nonsense. But you can rename all the variables to something like spiteful kit, using something like the Docker container naming conventions, and that'll at least make it a little easier to read. You have compression, which will create compressed commands, and those are just going to make a gzip-compressed blob, decode it from a Base64 string, decompress it, and then launch it with, once again, Invoke-Expression here. So this is Microsoft's benchmark for PowerShell detection tests: New-Object WebClient, DownloadFile, some .exe, and

then Start-Process with that file. If we were to use string obfuscations from Invoke-Obfuscation, it would end up looking like this. And if you use that engine, you can get right back to where you started. And it even has some statistics on how it was done. There were nine format string expressions to parse, 11 string literal pipelines, two casts, and two string member usages. So you can do even more than just detect. You can do really advanced unpacking of some kind of crazy payloads. So this is a script block built from a gzip stream that is built from a Base64 string that has a whole bunch of glued-together strings.

But because we're executing those queries, we can work it down towards something nice. And our actual big query that looks for the gzip-compressed blob won't fire until there's only a single string in that parameter and not a whole bunch of concatenations.
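The "run small operations in a loop until there is no more work" idea, including the per-operation statistics, can be sketched in Python like this. Everything here is an assumption for illustration: the three rules are regex approximations of format-string, concatenation, and cast folding, the rule names are hypothetical, and the real engine edits a Tree-sitter syntax tree rather than rewriting text.

```python
import re
from collections import Counter

FORMAT = re.compile(r"""\(\s*"((?:\{\d+\})+)"\s*-f\s*((?:'[^']*'\s*,\s*)*'[^']*')\s*\)""")
CONCAT = re.compile(r"'([^']*)'\s*\+\s*'([^']*)'")
CAST = re.compile(r"\[char\]\s*(\d+)", re.IGNORECASE)

def fold_format(script):
    """("{1}{0}" -f 'b','a') -> 'ab'"""
    def repl(m):
        args = re.findall(r"'([^']*)'", m.group(2))
        order = [int(i) for i in re.findall(r"\{(\d+)\}", m.group(1))]
        if max(order) >= len(args):
            return m.group(0)  # malformed: leave untouched
        return "'" + "".join(args[i] for i in order) + "'"
    return FORMAT.subn(repl, script)

def fold_concat(script):
    """'a' + 'b' -> 'ab'"""
    return CONCAT.subn(lambda m: "'" + m.group(1) + m.group(2) + "'", script)

def fold_cast(script):
    """[char]105 -> 'i'"""
    return CAST.subn(lambda m: "'" + chr(int(m.group(1))) + "'", script)

RULES = [("format_string", fold_format), ("concat", fold_concat), ("char_cast", fold_cast)]

def deobfuscate(script):
    """Apply every rule repeatedly until a full pass changes nothing."""
    stats = Counter()
    changed = True
    while changed:
        changed = False
        for name, rule in RULES:
            new_script, count = rule(script)
            if new_script != script:
                stats[name] += count
                script = new_script
                changed = True
    return script, stats

obfuscated = """IEX (("{1}{0}" -f 'Host','Write-') + ' ' + [char]104 + [char]105)"""
clean, stats = deobfuscate(obfuscated)
print(clean)
print(dict(stats))
```

Each rule is deliberately tiny; the loop is what lets one rule's output become another rule's input, for example a cast producing a literal that the concatenation rule can then merge.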

So it just goes through, works it down, and then you get your nice payload there. You can also extract binary payloads. So if you see a call to FromBase64String, you can query for a single string parameter to that and then just unpack it.
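A minimal Python sketch of that last step is shown below, assuming the script text has already been reduced so that `FromBase64String` receives one literal. The regex, the function name, and the sample payload are hypothetical; the point is only that the extraction rule stays silent while the argument is still a concatenation and fires once it is a single string.

```python
import base64
import gzip
import re

# Only matches when the argument is exactly one quoted Base64 literal.
SINGLE_LITERAL_CALL = re.compile(r"FromBase64String\(\s*'([A-Za-z0-9+/=]+)'\s*\)")

def extract_payloads(script: str) -> list[bytes]:
    """Decode each single-literal Base64 argument, unwrapping gzip when present."""
    payloads = []
    for match in SINGLE_LITERAL_CALL.finditer(script):
        raw = base64.b64decode(match.group(1))
        if raw[:2] == b"\x1f\x8b":  # gzip magic bytes
            raw = gzip.decompress(raw)
        payloads.append(raw)
    return payloads

blob = base64.b64encode(gzip.compress(b"Write-Host 'hello'")).decode()
half = len(blob) // 2
still_split = f"[Convert]::FromBase64String('{blob[:half]}' + '{blob[half:]}')"
fully_folded = f"[Convert]::FromBase64String('{blob}')"

print(extract_payloads(still_split))   # nothing yet: argument is not a single literal
print(extract_payloads(fully_folded))
```

Keeping the expensive rule strict like this means it does not have to understand every obfuscation itself; it relies on the smaller rules having done their work first.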

One of the really nice things about Tree-sitter is that it can handle incomplete or broken scripts. And this is extremely useful when you're dealing with sources that might be incomplete or corrupt, like if you're finding strings loose in memory as part of a memory scan or something. So here we've got a broken tree. Tree-sitter uses what are called error nodes to make it robust. The reason it does this is because it's actually designed for use in text editors. So it needs to be able to do keystroke updates when your syntax tree isn't actually fully valid, without breaking all the syntax highlighting and things like that. So you can actually, although this one is

broken, still work it down to a useful payload that you can examine. And then one huge advantage is speed and portability. Because it's written in Rust and C, you can make static binaries that can do this very quickly. This is it running at full speed against 745 samples. Again, PowerShell's grammar poses some serious challenges because it is highly recursive. In a Python program like this, you'll see a very nice flat structure to an assignment of a list of integers. But in PowerShell it's going to be very deeply nested assignment expressions, because an assignment expression is actually an assignment expression glued to

another object and so on. So when I gave this talk the first time at BSides San Antonio, the very same day I was like, okay, talk done, I'm going to go back, clean up the code, and publish this tool. And of course the Airbus CERT team that had recovered the PowerShell grammar released their tool on that very day. So I was kind of like, "Oh man." Kind of a shock, but it ended up being really cool actually, because they did a really fantastic job with the library that they wrote. And what was really interesting was that you'll notice that my talk was entirely focused on writing these Tree-sitter

queries. Theirs does not use a single query. Theirs is instead based on recursive traversals of the entire syntax tree. So you write these things called rules, which are basically akin to the little steps in my engine that I'd written, and when you enter a node you can do something, and when you leave a tree node you can do something. And you can do it either with a regular rule, which doesn't update the tree, or you can use a mutable rule. So you can achieve the exact same effect. And this is not quite time for questions or comments yet. This is just where the slideshow ends. So what I did was, at this point I was

like, okay, there's a really cool Rust library for this, and we've seen that AMSI really struggles to deal with bypasses against its own defenses using these PowerShell obfuscations. So what if we wrote our own version of AMSI that had these capabilities? And I looked into it, and you can actually, using Rust for Windows, implement the AMSI interface through the IAntimalwareProvider interface and install it on a system. When a script is launched from PowerShell, your code will be called, and the script will pass through your scanner. The requirements are only one thing: that you scan a buffer and then return one of three enumerations: AMSI_RESULT_CLEAN, which is, this is a known good piece of data; AMSI_RESULT_DETECTED; and AMSI_RESULT_NOT_DETECTED, which means we're not sure, just pass it on to the next AMSI provider and let it make a decision.

But in the meantime, you can also have a whole lot of interesting side effects, like you can log things, you can extract the payloads. So I built a system that does this, and I'm going to demonstrate that for y'all. Here we have a PowerShell interpreter on the right, obviously, and on the left we have our ProgramData directory for this AMSI provider.

So let's start by doing something really simple. Write-Host, try to make this readable, foo bar, and you see it gets executed. But up here we have our actual linted version that got passed through the engine. And you can see that it did the transformation. We didn't find anything bad. We allow it to run, but we made a note of it in our log here, which is just logging all the payloads that come through AMSI and a little peek at their contents.
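The provider contract described above (scan a buffer, return one of three results, optionally log as a side effect) can be sketched in Python as decision logic only. This is not a working AMSI provider: a real one is a COM DLL implementing `IAntimalwareProvider`, and the `scan` function, the rule list, the log format, and the UTF-16 decoding of the buffer are assumptions for illustration. The numeric values follow the `AMSI_RESULT` enumeration in the Windows SDK.

```python
from enum import IntEnum

class AmsiResult(IntEnum):
    CLEAN = 0          # known good
    NOT_DETECTED = 1   # no opinion; let the next provider decide
    DETECTED = 32768   # block

def scan(buffer: bytes, rules, log: list) -> AmsiResult:
    """Decode the buffer, run each (name, predicate) rule, log, and pick a result."""
    text = buffer.decode("utf-16-le", errors="replace")  # assumed encoding of script content
    hits = [name for name, predicate in rules if predicate(text)]
    log.append({"preview": text[:40], "hits": hits})  # side effect: always record what we saw
    return AmsiResult.DETECTED if hits else AmsiResult.NOT_DETECTED

# Hypothetical rule set for the example.
rules = [("naughty_string", lambda text: "amsiutils" in text.lower())]
log = []

print(scan("Write-Host 'foo bar'".encode("utf-16-le"), rules, log).name)
print(scan("[Ref].Assembly.GetType('...AmsiUtils')".encode("utf-16-le"), rules, log).name)
print(log)
```

The sketch never returns `CLEAN`, on the reasoning that a custom provider can say "I found something" or "I have no opinion" but rarely has grounds to vouch for a script outright; that is a design choice of this example, not something stated in the talk.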

All right. So, what if we did something that you would more commonly see with fileless malware, which is Invoke-RestMethod. I did that wrong. -Uri, http, 172 [...] 1, 8000, gzip.ps1. And over here I've got my little C2 server running. And

so for this rule, I basically just wanted to find out if something is a gzip payload. And the way that this works is that, using the Minusone library, you can actually access inferred values of nodes. So when it finds, like, FromBase64String here, it leaves it alone, but it also is going to store an inferred value within that Tree-sitter node that you can access. So you say, hey, show me that data, and let's actually do things with it. So when we run that, oh, I forgot to pipe it into iex.

The provider blocks it because it contains malicious content. And then we should see within our log an actual hit on the rule, which is a binary gzip payload here. So we're able to introspect the inferred values of nodes and do cool things with them. So one thing that I think is really amazing about that is that you can actually empower your signature scanners to do cool things again. So instead of them not working because the script is obfuscated, they can't see the contents, it's a string that's been split or replaced or reversed, you can just work through all of those things and then check the inferred data

and pass it off to YARA. Let's see. Another cool thing you can do is check for context-aware entropy. One thing that people will do to bypass signature scanners is affect the entropy of their files by doing things like giving variables really simple names. So for instance, here we have these ridiculous variable names that are very long. This is from the actual PowerShell Obfuscation Bible repo that's like, hey, if you want to get past the scanner, you need to tweak your entropy, and this will bring it down from like 16 matches on VirusTotal down to 11. But in this case it's actually the tell, because we can look at just the text of those

nodes, calculate their entropy, and write a rule that says, hey, if the Shannon entropy of a variable is less than one, report.

So again, we blocked it because it has giant weird variables, which should show up in the log here. Yep, there it goes. Variable entropy. We have highlighted exactly which variables are there. So if you weren't sure if it was a false positive or something, you have instantaneous intelligence on that.
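For reference, the Shannon entropy of a string with character frequencies p_i is H = -Σ p_i log2(p_i), measured in bits per character. A minimal Python sketch of the variable-entropy rule follows. The regex used to pull out variable names and the `min_length` cutoff (added so that short names like `$i`, which trivially have zero entropy, are not reported) are assumptions of this example; the rule in the talk works on variable nodes from the syntax tree.

```python
import math
import re
from collections import Counter

def shannon_entropy(text: str) -> float:
    """Entropy in bits per character of a non-empty string."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

VARIABLE = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")

def low_entropy_variables(script: str, threshold: float = 1.0, min_length: int = 8) -> list[str]:
    """Report long variable names whose entropy falls below the threshold."""
    names = {m.group(1) for m in VARIABLE.finditer(script)}
    return sorted(
        name for name in names
        if len(name) >= min_length and shannon_entropy(name) < threshold
    )

script = "$aaaaaaaaaaaaaaaaaaaaaaaa = 1; $lllllllllllIllllllllllll = 2; $payload = 3; $i = 4"
print(low_entropy_variables(script))
```

Measuring entropy per identifier rather than per file is what makes the padding trick backfire: the names that lower the file's overall entropy are themselves extreme outliers.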

So I was at work one day, and I was looking at the r77 rootkit, which is a big open source rootkit on GitHub, and one major component of its infection chain is an AMSI bypass using these polymorphic string obfuscations. So I was like, oh man, that's a perfect thing to practice on. The mistake I made was that I tried doing this on a fully patched version of Defender, which has obviously seen this open source rootkit a million times, so it'll blow up in your face. But I applied extra layers of obfuscation for the sake of this example.

So you get a giant nasty thing like this. And the rule that I wrote for this was one that basically says, I want to find when strings appear that we know people would try to hide. So things like amsi.dll or AmsiUtils or VirtualAlloc or VirtualProtect.
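A rule of that kind is simple once deobfuscation has produced readable strings. The Python sketch below assumes the inferred string values have already been collected from the tree into a list; the `NAUGHTY` tuple mirrors the examples named in the talk, while the function name and the case-insensitive substring match are assumptions of this example.

```python
NAUGHTY = ("amsi.dll", "amsiutils", "virtualalloc", "virtualprotect")

def naughty_strings(inferred_values: list[str]) -> list[tuple[str, str]]:
    """Return (needle, original_value) for every inferred string containing a watched name."""
    hits = []
    for value in inferred_values:
        lowered = value.lower()
        hits.extend((needle, value) for needle in NAUGHTY if needle in lowered)
    return hits

# Hypothetical inferred values, as they might look after string folding.
values = ["kernel32.dll", "amsi.dll", "System.Management.Automation.AmsiUtils", "Write-Host"]
for needle, value in naughty_strings(values):
    print(needle, "found in", value)
```

The check itself is no smarter than a signature; what changes is that it runs against reconstructed values instead of raw bytes, so splitting or reversing the string no longer hides it.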

So,

a lot of output there, but I think if we refresh this log,

we got the naughty string rule: amsi.dll,

and in one of these files, actually I did this earlier to help myself, B635.

And there's your rootkit component right there, fully deobfuscated. You can see all the strings that had been hidden before: LoadLibraryA, VirtualProtect, amsi.dll, and so on and so forth. So it's a really cool, powerful way to enhance your scanners and keep a little bit safer from all those really trivial bypasses.

Here's

another one that I wrote up. I had to write a special rule for this one because this is basically the equivalent of Python's unhexlify right here. But you can actually examine function statements and figure out, hey, that's exactly what this thing does. And then when you find its use later in the syntax tree, you can set the inferred value of that node to the result of what would be a call to unhexlify.

And so again, we find it, we block it.

because it has a binary portable executable embedded in the payload. That was one of those rules that was used in conjunction with YARA. It's the simplest YARA rule of all time: is PE. And because YARA-X is in Rust now, you can just super easily throw that in the mix.
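The two pieces of that rule, inferring the bytes a hex-decoding helper would return and then checking whether they look like a portable executable, can be sketched with Python's standard library. The function names and the hand-built header stub are assumptions for illustration; the stub is just enough bytes to satisfy the check and is not a runnable executable, and the check is a plain-Python stand-in for the YARA rule mentioned in the talk.

```python
import binascii
import struct

def infer_unhexlify(hex_text: str) -> bytes:
    """The value a recognised hex-decoding function call would evaluate to."""
    return binascii.unhexlify(hex_text)

def looks_like_pe(data: bytes) -> bool:
    """True if data starts with 'MZ' and e_lfanew (offset 0x3C) points at 'PE\\0\\0'."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    return data[e_lfanew:e_lfanew + 4] == b"PE\x00\x00"

# Build a minimal header stub: DOS magic, e_lfanew = 0x40, then the PE signature.
stub = bytearray(0x44)
stub[:2] = b"MZ"
struct.pack_into("<I", stub, 0x3C, 0x40)
stub[0x40:0x44] = b"PE\x00\x00"
hex_blob = stub.hex()

print(looks_like_pe(infer_unhexlify(hex_blob)))
print(looks_like_pe(infer_unhexlify("48656c6c6f")))
```

Once the inferred bytes are attached to the node, any byte-oriented scanner can be pointed at them, which is how a one-line signature becomes useful again.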

And yeah, that's my demo. I'm glad it did not blow up. Does anybody have any questions?

>> Is there anything you can get through Tree-sitter? >> Yeah, I've looked at like >> I'm sorry. Where's the weak point? >> The weak point of Tree-sitter is that there are many, many, many different ways you can obfuscate code, and the Minusone engine is by no means feature complete yet. But the good thing is that it doesn't have to be, because it's an improvement over the existing scanning engines. And what's really great is that every rule that you add is kind of a force multiplier, because it can rely on the inferred values within nodes that have been produced by other rules. So if you've got like a

FromBase64String rule that will capture binary payloads and say, hey, this node has binary data, then it makes future rules that you want to write able to say, hey, if this node has binary data within it, let's do something. So yeah, you can still get past it, I'm sure, but it's a really cool tool, and they keep making progress on it all the time, so it's just getting better and better. >> Yeah.

>> Interesting. if it's like importing other things into the scope kind of. >> Yeah. >> Um I have not experimented with that yet. I don't know. I'd be curious to find out. >> But you can see it it runs it scans a lot of different things. But uh yeah, I don't know how it works when you have imported modules.

a bunch of those obviously.

>> Right. So to recap the question: you might have a lot of imported modules and a very small snippet that uses code from them, and can Tree-sitter work with that right now? I would think the answer is no, not right now, because it's working purely on the syntax tree of a single document. You could build a larger framework around it that captures things from, so in this example, you have all these logged entries. I'd be curious to know if they come from the same session. Each AMSI scan has an associated session identifier. So when you run, when

you import a module >> Yeah. Yeah. You might be able to capture outside information from the imports and then use it in subsequent scans as part of the same session. I'm not sure though. Yeah. Yeah. Thank you. Any other questions? All right. Well, thank you all very much for your attention. I appreciate it.