Revoke-Obfuscation: PowerShell Obfuscation Detection (And Evasion) Using Science

BSides DC · 2017 · 52:07 · 182 views · Published 2017-10
About this talk
Attackers increasingly use PowerShell obfuscation to evade detection, bypassing signature-based and AMSI-based defenses. Revoke-Obfuscation applies statistical analysis techniques—including character frequency analysis and machine learning—to detect obfuscated PowerShell scripts across command lines, .evtx files, and script block logs. The talk demonstrates evasion techniques and how defenders can apply scientific methods to identify malicious activity at scale.
Show original YouTube description
Attackers, administrators and many legitimate products rely on PowerShell for their core functionality. However, its power has made it increasingly attractive for attackers and commodity malware authors alike. How do you separate the good from the bad? A/V signatures applied to command line arguments work sometimes. AMSI-based (Anti-malware Scan Interface) detection performs significantly better. But obfuscation and evasion techniques like Invoke-Obfuscation can and do bypass both approaches. Revoke-Obfuscation is a framework that transforms evasion into a treacherous deceit. By applying a suite of unique statistical analysis techniques against PowerShell scripts and their structures, what was once a cloak of invisibility is now a spotlight. It works with .evtx files, command lines, scripts, ScriptBlock logs, Module logs, and is easy to extend. Approaches for evading these detection techniques will be discussed and demonstrated. Revoke-Obfuscation has been used in numerous Mandiant investigations to successfully identify obfuscated and non-obfuscated malicious PowerShell scripts and commands. It also detects all obfuscation techniques in Invoke-Obfuscation, including two new techniques being released with this presentation. Daniel Bohannon (Senior Incident Response Consultant at MANDIANT, A FireEye Company) Daniel Bohannon is a Senior Incident Response Consultant at MANDIANT with over seven years of operations and information security experience. His particular areas of expertise include enterprise-wide incident response investigations, host-based security monitoring, data aggregation and anomaly detection, and PowerShell-based attack research and detection techniques.
Show transcript [en]

The BSides DC 2017 videos are brought to you by ThreatQuotient, introducing the industry's first threat intelligence platform designed to enable threat operations and management, and DataTribe, a new kind of startup studio co-building the next generation of commercial cybersecurity, analytics, and big data product companies. So good afternoon. My name is Daniel Bohannon, and today we'll be talking about Revoke-Obfuscation, looking at some of the depths of PowerShell obfuscation and how we can use science to detect it. A quick blurb about me: again, my name is Daniel, I work for Mandiant, and I'm here in DC, so it's nice to be at a local conference here. The commute was awesome. I've been

with Mandiant for about two and a half years. I started out doing IR consulting for almost two years and have recently switched to an applied security research position, so I focus heavily on obfuscation, evasion, and applying detection at scale to all of our clients and customers at FireEye and Mandiant, and there's my Twitter handle there. I want to give a huge, huge shout-out to Lee Holmes. He's not here today, but we both did this research together, and Lee is just an awesome guy. He's the lead security architect for Azure over at Microsoft, and he was actually one of the original PowerShell developers, so it's pretty cool to be talking about

PowerShell and writing PowerShell tools with a guy who literally wrote the language. It's also very humbling to see how little I know about PowerShell, but fortunately he's a really humble guy, so he let me know gently the things I didn't know, and I'm a better person for it. So what we're going to talk about today is kind of a treatise on blue team follies. As a blue teamer I've made all these mistakes, and I hope to share them with you in such a way that you can take it and learn from it, and then some of the cool new stuff that Lee and I spent several months on this year to try

to tackle this problem of detecting obfuscated PowerShell. So at a very high level, when it comes to actually investigating attacks where PowerShell is used, which is almost every single investigation I've ever been a part of: because it's native on Windows 7 and later, attackers love it. Copy and paste, it's easy, there are tons of offensive PowerShell frameworks out there, and so attackers are using it a lot. Also to be noted, there are a ton of blue team PowerShell frameworks out there that defenders are starting to pick up on, so a lot of really good things on both sides of the fence. But when it comes to actually investigating attacks where PowerShell

is leveraged, at the very least you want command-line argument information, so you can get that from your Security log EID 4688 or Sysmon EID 1. The really juicy stuff you want is at the bottom, which is PowerShell-specific logging, particularly PowerShell 5, and this is a blog post from Microsoft and one at the bottom from one of my colleagues at FireEye, basically looking at the benefits you actually get with PowerShell 5 and how it stacks up against the kind of obfuscation that we'll be talking about. So how might an attacker break our assumptions when it comes to detecting PowerShell? Well, let's say that we're

going this route of looking at powershell.exe when it runs, looking at its arguments, and looking for malicious things. Now obviously echoing out the word SUCCESS is not malicious, but this is just showing in a really small example how you can get code to execute and maybe undermine some of our assumptions about it appearing in argument logs. So in this case, typically we'll see cmd or something else launching PowerShell, and you have your command that follows; you see it runs perfectly fine here. However, you don't have to actually specify the PowerShell command in its arguments. If you just look at PowerShell's help, you can actually push stuff in via standard input, so you can do something like this and

have cmd echo your PowerShell command, the Write-Host, and then pipe it into powershell - or powershell IEX $input, and PowerShell runs it just fine. Now why might this matter? Well, you can see at the top that PowerShell's arguments only contain that - or that IEX $input, but the actual command itself, if we're looking for Write-Host SUCCESS -Fore Green, is actually in the parent process. So we've just shifted arguments into the parent process, so as a defender, if all your rules are based on PowerShell's arguments, then this might get by you. So maybe we start to look for any processes that have a pipe followed by powershell. Well, the problem there is

that depending on whatever process you're launching it from, you can obfuscate in that process's language and start stacking different languages of obfuscation. So in the example at the bottom, I'm basically chopping up "power" and "shell" into two cmd variables, which are process-level environment variables, and then echoing it into p1 and p2, such that you can see the parent process has pipe p1 p2 and not pipe powershell. So that rule wouldn't really work if we're looking for a pipe and then powershell. And we actually see attackers doing this in the wild. This is from mid-February this year, from a financial threat actor known as FIN8, where at the very bottom you can see

they had this PowerShell command they were putting in one environment variable called MicrosoftUpdateCatalog, a second environment variable holding powershell -, and then they were echoing var1 into var2, which looks really familiar to what we were just talking about. And they were launching all this from WinWord, from a Word macro, so you actually didn't see a lot of this on the command lines, which was really interesting. Another way that we can move pieces of information around and keep it off the command lines is through straight-up process-level environment variables. Kovter is a really popular piece of malware that does this, typically launched from obfuscated

mshta and Run keys and stuff like that, but basically you'll see PowerShell invoking with IEX the environment variable, some randomly named environment variable. Now typically we'll see $env: to reference environment variables, but there are tons of ways; at the bottom are just a few of the really uncommon ways that you can extract environment variable values. Another way that we could do this is we could pipe our arguments into clip and then just pull them off the clipboard. So maybe we're thinking, why don't we just look for these arguments in powershell.exe and then in its parent process? So the problem there is

what if we introduced a third process in the mix? So instead of cmd spawning PowerShell, what if we had cmd setting the arguments into environment variables, and then a second cmd echoing those environment variables into PowerShell, and then PowerShell launches it? Well, technically this works, but we don't get any benefit out of this, because as we can see, we still see the full command in the parent process. However, if we take notice of this pipe, if we escape it one layer, and this isn't PowerShell, we're doing cmd-level escaping, so that it's escaped for the first cmd but not for the second one, and if we do

that, then we're golden. PowerShell runs with nothing in its arguments about what happened, and the parent process only has the name of the process-level environment variable and not the contents itself, so we'd actually have to go all the way up to the grandparent process to get this, and this tree structure is a little outline of what this looks like. So maybe we can say, okay, why don't we just take any PowerShell process and recursively go up the stack all the way to see all the arguments, to see if any of them contain these malicious arguments? Well, the only problem there is that there are tons of ways where you can pass information back

and forth with no actual relationship between the processes themselves, or no recursive relationship up. This is one example of basically spawning cmd where pieces of code are stored in the title of each cmd window, and then WMI is used to query them out, reassemble them, and then invoke them. Clipboard is another example. You could use user-level environment variables, or you could pass information back through a file or the registry; there are tons of ways you can pass information back and forth such that it won't show up on the command line as you'd expect. So the good news in all of this is that PowerShell script block logging catches all this no matter how it's run, no

matter how it's passed in. PowerShell 5 is just insanely awesome in terms of the preventative and detective security controls that are in place for a scripting language; it is stupid awesome. The bad news is that there's token-layer obfuscation that persists even in these script block logs. And one note as we go: this PowerShell shield in the top right. For each obfuscation technique we talk about, when that shield shows up, that's saying, hey, no matter what else may have missed it, this is going to show up in PowerShell logs. Now it may look slightly different than you'd think, and I'll try to touch on those cases, but it's really awesome to have

that kind of data available. So let's look at an example of obfuscating a PowerShell command in some really crazy ways. We're going to start with the remote download cradle, which is basically this one-liner command for in-memory execution of some remotely hosted PowerShell script. Attackers use this all the time, because frameworks use this all the time, and we all like to use frameworks and copy and paste code from one another, so Veil, PowerSploit, Metasploit, you name it, we see this all the time. So let's play red team/blue team, which is kind of how I define my life: basically breaking my assumptions as a defender, and then how

can I bolster those assumptions? So at the top we have an attacker command, which is this remote download cradle with this bit.ly link, which is totally legit, and then let's say as a defender we want to catch this on the command line. How could we do that? Well, we can start to build out a list of terms where, if we see Invoke-Expression, New-Object, System.Net.WebClient, and then .DownloadString("http, this would catch this exact command. So now let's go through and see how we can obfuscate the command at the top and adjust our defender terms at the bottom to stay head to head with all the

attacker's doing. So first of all, System. is not necessary almost anywhere you see it in PowerShell; PowerShell is going to automatically prepend that underneath the hood, so an attacker doesn't have to put System. So if an attacker doesn't absolutely have to have this in their command, then I don't want to make that assumption as a defender, so I'm going to pull it out of both. The URL in this case is just a string, and it's a string token: we can do things like concatenate it inline, we're also not limited to double quotes, we can use single quotes, push whitespace around, stuff like that. We can also set it as a variable, so

let's go ahead and remove that http from the DownloadString portion. DownloadString: now this is the most common method that we see. So let's take a step back. New-Object Net.WebClient is instantiating the .NET class Net.WebClient, which has many methods; DownloadString is just one of them, but there are quite a few. DownloadString, DownloadFile, and DownloadData are some of the most common ones that we see. DownloadFile will hit disk, so it's used more by commodity malware, and with DownloadString and DownloadData, DownloadData will be a byte array as opposed to an expression itself. But maybe we can just say, okay, why don't we just look for .Download and

we'll catch all of these. So let's do that. So now for this .Download, how might this little parenthesis bite us if we have it as part of our detection? Well, you can actually take New-Object Net.WebClient and set it in a variable, and then have variable.DownloadString. Typically you'll see some frameworks calling that variable $wc, so you have $wc.DownloadString and there's no parenthesis, so let's go ahead and remove that from the arguments at the bottom. Now that dot: what could an attacker do to make this dot really problematic for us as defenders? Well, they could put single quotes around DownloadString

doesn't make it a string, but it still breaks that dot. But then we can put double quotes around it, and what I found is that if you have double quotes around DownloadString, what you can actually do is then put a tick mark in DownloadString and it still runs. Crazy, right? Now why does this work? Well, the tick mark is the escape character for PowerShell, and it has meaning before these escapable characters, you know, b for backspace, n for newline, all this kind of stuff. But if you escape something that has no escapable meaning, it just runs. And this is not just true of PowerShell; if you do the same with

cmd with a caret, if you escape something that doesn't have any meaning to be escaped, it just doesn't do anything and keeps on running. So you can put a lot of tick marks in there as long as you stay away from those eight special characters that have the special escapable meaning. But if you're like me, you really want to put tick marks in front of those characters, so all you have to do then is just uppercase them and you're good to go. So now for any method, we can put double quotes around it and put ticks in front of any character that we want, as long as it's not a zero, because unfortunately

there's no uppercase for zero, so, you know, out of luck on that one. But the problematic thing from a defensive perspective is that this persists in command-line arguments as well as all the way into the script block logs, which are PowerShell's latest and greatest, EID 4104; really awesome, really sexy logs, but these tick marks are still there, and if you take a closer look you'll see the concatenation is still there. So at the end of the day, script block logging will basically log every layer of the onion: if you have this command encrypted and encoded like 100 times, every layer gets its own script block log, even if you're

running something remotely. Even though you'll never see that in the command-line arguments, it'll all show up in PowerShell script block logging, but this token-layer obfuscation will still persist. So how do we deal with this as defenders? Well, we're not done yet. We could try to regex all the things, right, and basically take into account any tick marks. If we go that route, I would recommend also adding OpenRead, because we looked at DownloadString, DownloadFile, and DownloadData; OpenRead is another one that returns a byte stream instead of a byte array, and we've seen attackers use it a couple of times, but it's not too common. So if you want to go that route, make sure you have

.Download and OpenRead. But I would advise against it, because if you put parentheses around any method, then you can actually treat it as a full-blown string, so you can basically concatenate or reorder it and do a ton of crazy stuff. And in this case you'll also see there's this .Invoke after that method. That's only required in PowerShell 2; PowerShell 3 and later is smart enough that you don't need that .Invoke. So in a lot of the obfuscation frameworks that I've worked on, I make sure to include that .Invoke by default so it works on any version of PowerShell, but as a defender I want to make sure that we're not

focusing on that .Invoke for detection, because it's not necessary in later versions. So let's just remove DownloadString from our defensive list. All right, Net.WebClient: for this we can put double quotes and ticks, or we can put parentheses and again treat it like a string, concatenate it, reorder it, chunk it up into small pieces, whatever we want. So let's just go with the first option and say, yeah, let's go ahead and get rid of Net.WebClient from a defensive perspective and keep going. Now New-Object. So we're left with two pieces: we're left with our Invoke-Expression and New-Object, and these are both cmdlets. Now PowerShell is super, super friendly

and inviting to people who have never written PowerShell before, but it has tons of aliases. Basically, PowerShell has a cmdlet called Get-ChildItem, and if you want to list everything in a directory you just run Get-ChildItem on it. But if you have no clue what PowerShell is, then you might just type dir, and it will also work, because that's an alias for Get-ChildItem. If you come from the Linux world you can just type ls and that works, because they have all these aliases that translate into the thing you actually mean, and there are shortened forms of that. And so PowerShell gives you a lot of options. That's great from a usability

perspective, but it's terrifying from a defensive perspective, because now we have to find all these different options and make sure we have coverage for all of them. The good news here is that New-Object doesn't have any of those: there are no aliases and no shortened forms of it. So I was really excited when I came across this and was like, this is going to be a really solid detection. The only problem is that PowerShell is too helpful sometimes. There are so many functions and cmdlets that sometimes you forget which one you're looking for, and so you can use Get-Command and say, Get-Command, show me all the commands or functions that start with

New-P, and just put a wildcard at the end; it'll list all this information. Now the cool part here is that this is not text that's being returned, it's PowerShell objects. Now why does that matter? Well, if we have it return just a single cmdlet, we can then pass it in to Invoke-Expression, which will automatically convert that object into the string New-Object and then invoke it. But we can actually be more creative: since we're dealing with objects, instead of Invoke-Expression we can use the dot or the ampersand, which are invocation operators. Now here's where it's really fun: let's move those wildcards.
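A minimal Python sketch of the wildcard idea described here, from a defender's point of view: a pattern is usable in place of the literal cmdlet name as long as it resolves to exactly one command. The command list is a hypothetical, heavily trimmed stand-in for what Get-Command would return on a real host, Python's fnmatch is only an approximation of PowerShell wildcard matching, and the function names are illustrative.

```python
from fnmatch import fnmatchcase

# Hypothetical, trimmed command list; a real host exposes thousands of commands.
COMMANDS = [
    "New-Object", "New-Item", "New-PSDrive", "New-PSSession",
    "Get-Command", "Get-ChildItem",
]

def resolve(pattern, commands=COMMANDS):
    """Return every command whose name matches the wildcard pattern (case-insensitive)."""
    lowered = pattern.lower()
    return [name for name in commands if fnmatchcase(name.lower(), lowered)]

def resolves_only_to(pattern, target, commands=COMMANDS):
    """True when the pattern matches the target command and nothing else."""
    return resolve(pattern, commands) == [target]

if __name__ == "__main__":
    for pattern in ["New-P*", "New-Ob*", "N*-O*", "*w-*ct"]:
        print(pattern, "->", resolve(pattern))
```

The point for detection is that none of the uniquely resolving patterns contain the literal string New-Object, so a keyword rule on the command line never sees it.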

as long as we're returning a single object, this works, and this is New-Object. So instead of specifying New-Object on the command line, we could replace that T with a wildcard, or keep replacing a lot of characters, and as long as we have a query that will return only New-Object, that's now New-Object, and you will not see New-Object appearing in the command-line arguments or in script block logs. The one place you will see it, though, is in EID 4103, which is module logs. So basically, think about any kind of huge script. Module logs are the ones where I've not found really good

ways to obfuscate it, but the problem is that if you have a huge script, you would have hundreds or even thousands of module logs for just a single execution of a script, because it's logging every command, every parameter binding, all this stuff. So really helpful, really robust, but a lot of logs, so how do we actually meaningfully piece that back together to detect this kind of stuff? We're not done here yet. Get-Command also has an alias of gcm, and there's actually an undocumented alias called Command, because PowerShell will automatically take Get- and prepend it to anything you type, to say, when you type the word

Command, I didn't find a function, I didn't find a cmdlet, maybe you meant Get-Command? Oh, you did; that's what we're going to do. So again, rewind every detection that we've ever built as a defender for PowerShell cmdlets that included Get-: we totally don't need the Get-, so we want to make sure that we don't require that in our detection, otherwise we can get burned on this. Another thing we can do, if we don't want to use wildcards, is take our cmdlet name and chop it up into substrings, whatever we want, and then specify it there with Get-Command. These are several different PowerShell 1.0

syntaxes for doing this exact same thing. If you ever look at poshcode.org, it's one of the first PowerShell Gallery-like distribution and sharing platforms, and it's been around for a while, which is awesome, because there's a lot of PowerShell 1.0 syntax which still works today but no one really is using, because it's really verbose, like this. Well, if you're an attacker, it'd be great if you could find some things that no one is familiar with and no one's looking for, because this might do exactly what you need but no defender's looking for it. So if you notice this $ExecutionContext, that's an automatic variable within PowerShell, and we're going to see this coming up again

and again, so you want to make sure you're looking for that. For all this stuff with Get-Command, we can do the exact same thing with Get-Alias, but instead of dealing with a full cmdlet name we're dealing with the alias name itself. So we have Get-Alias, it has an alias of gal, and then Get-Alias's alias is called Alias, because of the Get- beforehand. Makes sense, all good. All right, let's keep going. So whenever we add all these things, let me say, okay, I'm going to look for New-Object, I'm going to look for Get-Command, Command, gcm, Get-Alias, all this kind of stuff, and at the end of the day it's not

really going to be feasible to find all this, because whenever you can get any piece of a command into a string, then you're golden; you can manipulate it in all kinds of crazy ways. But if we did want to go down this route, we have to keep in mind you can put tick marks in front of all these things. Instead of using Get-Command, you can also use those invocation operators again and start to treat the cmdlet name as a string, and so on the left we have concatenation and on the right we have reordering. And reordering is really awesome, because some people will say, okay, well, concatenation I don't care about, because

whenever I look at logs I remove all the whitespace and special characters such that New-Object will come back together, and then I evaluate at that point. Well, with things like using the -f format operator to start to reorder commands, New-Object will never come back together no matter how many special characters and whitespace you remove, because it's absolutely reordered, and this is where it gets really challenging. So let's just give up on this and be realistic here. So we're left with Invoke-Expression. Now as a defender, if you don't know where to start with PowerShell and you have command-line logs, look for IEX and Invoke-Expression; it pays crazy dividends, it's like

the best place to start; just start there. So what's problematic about Invoke-Expression? Well, you have the alias of IEX. The ordering doesn't matter: you can have it at the beginning or at the end of an expression. You can also put tick marks in it, because it's a cmdlet like anything else in PowerShell. You can use invocation operators and start to concatenate or reorder IEX and invoke it with the dot or the ampersand invocation operators. Then let's step back and say, okay, well, why all these ways to execute code in PowerShell? This seems really problematic, right? And what we'll talk about in a second is that Lee and I basically assembled this huge PowerShell corpus of scripts to

study what's normal and what's not in terms of finding evil, which we'll talk about in just a couple of slides, but basically three percent of scripts in the wild use Invoke-Expression in the script itself, which sounds like a small percentage, but it's actually a lot of scripts out there. So you can't even feasibly say, I don't want to let Invoke-Expression run in the script context, but again, on the command line it's pretty uncommon. But then you have its cousin, Invoke-Command. Now Invoke-Command is typically used to run a command on a remote system, and it's expecting a script block, whereas Invoke-Expression is expecting an expression, but if you don't specify a remote system

it runs locally. So you have Invoke-Command, icm, that .Invoke, or the invocation operators; you also have InvokeReturnAsIs, InvokeWithContext, tons of things out there. So okay, let's say we're adding this to the list of invocation options, right? Well, we also have PowerShell 1.0, where there's that $ExecutionContext again, and we have InvokeScript. But really what I'm worried about is how do we deal with this ampersand and this dot? How could we feasibly look for this on the command line without getting insane false positives? Well, in this case we're dealing with script blocks, and maybe we say, if I see a combination of an ampersand or dot and I see the left and

right braces, which is what you have to have to denote script blocks, then maybe we're good. It would help with false positives, but at the end of the day you don't have to have curly braces to define script blocks. You can actually convert expressions to script blocks; these are two options here, using the Create method of the ScriptBlock .NET class, or PowerShell 1.0 with that $ExecutionContext again. And then for each of these you can apply all the obfuscation we just talked about and get crazy results there. So there's still no silver bullet at this point; kind of sucks. Invoke-CradleCrafter is a tool that I released earlier this year, and it

has (these are all the in-memory invocation options) over 10 different invocation options that it'll randomly go and generate, including some of the ones we looked at and then some other crazier ones. So if you want an idea of some of the crazy things you can do to basically perform Invoke-Expression or IEX to invoke PowerShell code, then this is a tool that you can use just to randomly generate that kind of syntax. All right, take a breath, that was a lot; I've been droning on up here for a while. The good news is that this is really the extent of how you can obfuscate PowerShell. The bad news is that I'm

totally kidding; there's a lot more you can do. So we have our command, and we just did all that, right, all these crazy tick marks and wildcards and all this stuff. Well, what if we just said, let me treat the entire command as a single string and then invoke it? That means I've got the entire command as a string, so I can do crazy string manipulation on the whole thing. Like maybe we'll just reverse the entire command, and on the command line it's in reverse, but then in memory it re-reverses it and we invoke it. You can also put garbage delimiters all throughout the command and use split to basically pull out those delimiters, or

you can use replace operations to replace the delimiters with nothing. You can then just do sheer concatenation to chop up and reorder your command into a lot of pieces. So these are some options that again are going to evade the command-line stuff, but it's going to be reassembled in script block logs; in script block logs you're going to have it all put back together, which is really nice. So wouldn't it suck if someone wrote a tool that did all this stuff automatically? As defenders we'd probably hate the person who did that, right? So unfortunately this is what I got myself into last year, and I released this tool called Invoke-Obfuscation that basically took all this

research that I did and automated the obfuscation of PowerShell code. Any arbitrary command or script you feed in, it'll basically give you the whole slew of options of what you want to obfuscate. So in this case we take the exact same one-liner download cradle that we just walked through, then at the push of a button we can have it do something like this, which is going through and reordering all the components and invoking it with the invocation operators and stuff like that. But then we can take that and say, let me treat the whole thing as a string and do something crazier like this, and just reorder the whole command as a string.
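A small Python sketch of why the "strip whitespace and special characters, then match" normalization mentioned earlier holds up against ticks and concatenation but not against reordering. The three logged strings are hypothetical samples written for illustration, not output captured from Invoke-Obfuscation, and the helper names are assumptions. Python's str.format happens to use the same {0}{1} placeholder style as PowerShell's -f operator, so it stands in for the in-memory reassembly.

```python
import re

def strip_noise(logged):
    """Naive normalization: keep only letters, digits and hyphens."""
    return re.sub(r"[^A-Za-z0-9-]", "", logged)

def contains_keyword(logged, keyword="New-Object"):
    """Case-insensitive keyword match against the normalized log text."""
    return keyword.lower() in strip_noise(logged).lower()

# Hypothetical samples of how one cmdlet name might appear in a log.
ticked = "N`e`w-O`b`ject"
concatenated = "&('Ne'+'w-Ob'+'ject')"
reordered = "&(\"{1}{0}{2}\" -f 'w-Ob','Ne','ject')"

if __name__ == "__main__":
    for sample in (ticked, concatenated, reordered):
        print(contains_keyword(sample), strip_noise(sample))
    # The reordered pieces only come together when the format string is evaluated.
    print("{1}{0}{2}".format("w-Ob", "Ne", "ject"))
```

Stripping noise recovers the name from the first two samples, while the reordered one stays scrambled until it is evaluated, which is the case the talk calls out as the hard one for string matching.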

Now I put this flag up here because a couple of months ago we saw APT32, which is a Vietnamese threat group, using my tool to do their nefarious activity, and they actually wrapped their commands in a lot of layers, like at least five or six layers of this string manipulation stuff. So I get a lot of dirty looks in the office when those samples come in and they just kind of slide that over my way for me to go and decode. People say you should eat your own dog food; I do, a lot, and the dog food doesn't taste that good anymore, so hold the tomatoes. Then earlier this year I

released Invoke-CradleCrafter, which is not taking any arbitrary code in; it's basically saying, feed me a remote URI and any post-cradle commands you want to run, and then let me show you some really weird cradle syntaxes you can use that maybe the defenders aren't looking for, and that hopefully defenders will use to start looking for them. And so anyway, we take that same legit bit.ly URI and we can produce something like this. It looks totally different than Invoke-Obfuscation: there are no tick marks, it's basically doing method and member and cmdlet enumeration to produce strings like DownloadString out of this crazy long and nasty command. So then you do

something like this and just say, well, who needs alphanumerics, so I'll just use special characters. Now I've got to say, the guy who came up with this concept is a Japanese researcher who in 2010 wrote Hello World in PowerShell with nothing but special characters. The Twitter handle is @mutaguchi and it's just a fascinating blog post. Google Translate it if you don't speak Japanese, because I don't. But he basically took three steps to take special characters and come up with any arbitrary PowerShell code. It's a really beautiful idea, and this produces not beautiful code, but at the end of the day it's a ton of variables that are all stored in dollar sign curly braces

because when you put curly braces you can put special characters as variable names. So you could also put whitespace as the variable names and get something like this. And I was talking with Casey Smith, or @subTee, early this year and he said, oh, that reminds me of whitespace encoding. And I said, I don't know what that is, you've got to tell me about it. And so he said, oh yeah, you just take whitespace and tabs and basically use that for the encoding. And I was like, oh sweet, and so I wrote this. So this is that same command where the whole command is either in white spaces delimited by tabs or tabs delimited by

white spaces, and you have the stub decoder at the end. Now for both of these options, again, it is completely reassembled in script block logs. You don't need anything fancy to detect this in script block logs because it all comes back together magically. It's a really beautiful example of the power of script block logs. But let's say you don't have script block logs. Then maybe you're looking at me and saying, wow, I really feel like there's no hope here, what could we possibly do? So this is kind of the turning of the tides of this talk. Hopefully all the depressing stuff is out, and let's talk about some good news for defenders, right?
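The whitespace encoding mentioned above (runs of spaces delimited by tabs, plus a small stub decoder) can be sketched in a few lines. The talk does not spell out the exact scheme, so this Python sketch assumes one: each character becomes a run of spaces whose length is the character's code point, and runs are separated by tabs. The function names are hypothetical.

```python
def whitespace_encode(text):
    """Encode each character as a run of spaces (length = code point), tab-delimited."""
    return "\t".join(" " * ord(ch) for ch in text)

def whitespace_decode(blob):
    """The 'stub decoder': count the spaces in each tab-delimited run."""
    if not blob:
        return ""
    return "".join(chr(len(run)) for run in blob.split("\t"))

sample = "Write-Host hello"                 # harmless stand-in command (assumption)
encoded = whitespace_encode(sample)

print(set(encoded) <= {" ", "\t"})          # the payload is nothing but spaces and tabs
print(whitespace_decode(encoded) == sample) # and it decodes back to the original text
```

A character-based signature has nothing to match in the encoded payload, while the decoded text is an ordinary command again, which is why the reassembled form is the useful place to look.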

let's try to do a little science. Now I'm not a data scientist, nor do I try to pretend to be one, but there's some simple statistical analysis techniques that we can perhaps apply to this problem area and get some good results out of. So as a human, if we look at this code, no one here would look at this code and say, yep, let's let that right on through my environment, totally okay to run that. As humans we can look at this and say this code is not well. So how might we tell a computer to figure that out? So one of the first things: last year, almost exactly a year

ago, it was a couple weeks after I released Invoke-Obfuscation at DerbyCon, Lee Holmes wrote a blog post called "More Detecting Obfuscated PowerShell", and he actually referenced these examples in the blog post. He did something really, really cool, which actually started this entire conversation with me and Lee Holmes about what other ideas we could apply to detect this kind of stuff, and led to this collaboration, which is kind of funny looking back on. So anyways, he took all the PowerShell scripts off of PoshCode.org, which was 3,400 scripts at the time, and he basically did character frequency analysis on all of them. And so on the right is the average of all the

PoshCode scripts. As you'd expect, PowerShell the language itself is English, and so you'd expect the characters to kind of match English, you know, grammar construction and stuff. So we see E, T, A: these are the most common, which make up, you know, 20% of all the PowerShell scripts. The first special character we see is the dollar sign at 3%, so on average 3% of any PowerShell script on this site is dollar signs. Now you look on the left there and it's insane: 20% is tick marks, 8% for all this stuff. And then over here, as I say, it's all special characters. It looks really, really different. So again, we can look at that

and say, okay, yeah, these are really different, but how do we tell a computer to do that? So this is where we get a little nerdy here for a second and talk about cosine similarity. Now I'm not a math whiz, so if you're not either, please don't check out. I'm going to make this fun and really exciting. Basically the concept is, if you have two points, two coordinates, you make a line, right? You have line A and you have line B and there's an angle between them. And so cosine similarity is basically saying these lines are very similar when the angle between them is zero, but if you have 90 degrees, that's dead different, right? So we can actually

extend this idea beyond two-coordinate lines to three-coordinate lines, four-coordinate lines, maybe 26-coordinate lines for all the alpha characters. So that's what Lee did in this blog post. He basically said, okay, why don't I take this cosine similarity technique and apply it to all the scripts in PoshCode. And what you'll see is that almost all the scripts are really, really close to the value of one, so like 95 percent similar or 97 percent similar. But then we have our two samples at the bottom, which are 37% similar and 15% similar. Now if we took all the code on PoshCode and did a scatter chart, we'll see most of them are right up

there at one, really, really similar to each other, but then you have all these real randos just falling out at the bottom. And so what we can do is take 3,400 scripts and immediately just say, let's look at everything below 80%, and that's like 30 scripts. So this is great on just, you know, 3,400 scripts, but as Lee and I started through this research we said, we need a lot of data, we need as much data as we can possibly get our hands on. So a year and a half ago Microsoft put out this contest called Underhanded PowerShell, basically inviting people from the community to submit obtuse and obfuscated PowerShell to evade certain checks. So that was a little bit of data.
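The character-frequency and cosine-similarity approach described above can be sketched as follows. This is a minimal Python illustration, not the code from Lee Holmes's blog post: the three "normal" scripts and the tick-heavy sample are made up, and the helper names are hypothetical. Each script becomes a vector of character frequencies, a baseline is the average of those vectors over a corpus, and each candidate is scored by the cosine of the angle between its vector and the baseline.

```python
import math
from collections import Counter

def char_frequencies(text):
    """Map each (lowercased) character to its share of the text."""
    counts = Counter(text.lower())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

def average_frequencies(texts):
    """Average the per-script frequency vectors to get a corpus baseline."""
    summed = Counter()
    for text in texts:
        for ch, freq in char_frequencies(text).items():
            summed[ch] += freq
    return {ch: freq / len(texts) for ch, freq in summed.items()}

def cosine_similarity(a, b):
    """1.0 means the vectors point the same way; 0.0 means they share nothing."""
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in set(a) | set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Tiny made-up corpus standing in for the PoshCode scripts (assumption).
corpus = [
    "Get-ChildItem -Path C:\\Temp | Where-Object { $_.Length -gt 1MB }",
    "foreach ($item in $items) { Write-Output $item.Name }",
]
baseline = average_frequencies(corpus)

normal = "Get-Process | Sort-Object CPU -Descending | Select-Object -First 5"
tick = chr(96)                                   # the backtick character
obfuscated = "".join(tick + ch for ch in "Invoke-Expression")

normal_score = cosine_similarity(char_frequencies(normal), baseline)
obfuscated_score = cosine_similarity(char_frequencies(obfuscated), baseline)
print(round(normal_score, 2), round(obfuscated_score, 2))
```

With a threshold such as the 80% mentioned in the talk, the tick-heavy sample falls well below the line while the ordinary one-liner stays close to the baseline.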

But then we really wanted a huge corpus of data, so we started with Underhanded PowerShell data, GitHub, PoshCode, all this stuff. But the important thing to keep in mind is that Lee and I felt it was really important to highlight that you can actually gather all this data and be really polite in the process. And what do I mean by that? Well, Lee wrote the script to basically scrape GitHub, and all the blue portions are the actual pieces of code that are scraping GitHub, and all the red portions are a hundred percent Canadian, because Lee Holmes is Canadian and he's the nicest person you've ever met. It's just throttling and all this

great stuff. So he was basically scraping for like a month. He looked at the greatest GitHub ID and said, there's 10 million IDs, this should take us a month or something like that. And so anyway, it's getting close to a month, his scripts are still running, and he sees 10 million IDs, 11 million, 12 million. He's like, wait a second, and went back and double-checked, and it was a hundred million in GitHub. He's like, oh my gosh, the clock's a-tickin', how are we gonna get this data? And so apparently the most polite thing you can do is just ask the great folks at GitHub, hey, would you mind telling us what all

the PowerShell projects are and zipping those up for us? And over a lunch break they went and did just that. So kudos to GitHub for getting us that data. This is every single contributor in GitHub that has ever submitted a PowerShell script as of a couple months ago. I just want to ask, has anyone in here ever submitted a PowerShell script to GitHub or TechNet or any of these places? Can you just raise your hand real quick? Awesome, thank you very much, because you actually are part of this research. Your name is on the slide, your name is buried in the code in some interesting places, but this wouldn't be possible without all the open-source PowerShell scripts that are

out there, so seriously, thank you very much. I won't read these right now; if we have time at the end we'll go through that. So we got a lot of scripts, right? Now when we have a lot of scripts we have to do something with them, because it'd be irresponsible not to, right? So if you start looking through all these scripts you're gonna find some weird shenanigans out there, and one of the ones I'd like to highlight was just a really sad one: Remove-Games.ps1. Someone wrote this to go through and kill games' running processes and even nuke the directory. So sad, major buzzkill. But in all seriousness, we got a lot of scripts. How

many? Four hundred and eight thousand scripts from over 28,000 authors. And the fun thing that we did: people like to think that hacking is really cool, Mr. Robot, yeah, it's really fun, wear a hoodie. At the end of the day what we had to do was manually label 7,000 scripts, which meant that on the weekends you sit in front of the computer and Notepad would pop up and you scroll through the script and say, yep, that one looks pretty obfuscated, let's mark that as obfuscated, next please, and go through and do that 7,000 times. That's not fun. Now why in the world would we do that? One reason is so that someone

else wouldn't have to, so they could just use the labels that we did. But if you think back to this approach of character frequency, if we looked at everything under 80%, then we could say, okay, this was pretty good, most of what was under 80% was obfuscated, right? What we don't know is, are there any obfuscated scripts above that 80% mark, short of looking at all 3,400? When it comes to actually trying to think differently about this and applying some different techniques, we need to have labeled data so that we know at the end of the day, did we detect all the things that we labeled as bad?

Ryan Cobb put out a blog post basically taking some of these ideas and applying them, and showing that as you start to creep up and look at higher and higher percentages you're gonna deal with more false positives, but that's where you're gonna find some more of the wins, some more of the obfuscated scripts that are more similar than not. So how does this character frequency analysis score? The two things to look at here are precision and recall, and it's 89 percent and 37 percent. So precision means: let's say we have a hundred thousand PowerShell scripts, right, and 10 percent, 10 thousand of those, are bad, they're obfuscated. Now if we run

this tool and it says, okay, these are the thousand scripts out of the whole hundred thousand that are obfuscated, and let's say every single one it returned was correct, its precision would be a hundred percent. So basically every time this model says it's bad, it's bad. But it missed 90% of them out there. That would be a really low recall score; it'd be a ten percent recall score. So in this case ninety percent precision is pretty good, but 37 percent recall means it barely got over a third of the obfuscated scripts out there. So we need something better. Surely we can do better. So how might we do better? Well, in PowerShell there's a lot more

features. In the data science community, the information retrieval community, they would say these character counts and frequencies are features. So PowerShell, when it looks at itself, basically sees: I know that this is the command I'm calling, this is a generic, I know that these are the left parenthesis, right parenthesis, and this is an object, all this kind of stuff. But in PowerShell 3 and later, what became exposed to anyone is what's called an AST, or the abstract syntax tree, which basically looks like this. So not only is PowerShell able to tokenize each piece of the command, but it actually knows the relationship of those commands. Now why does this matter for us? This allows us to get insane

features and really focus in, because you could say, okay, character frequency makes sense, but you can have really obfuscated stuff that's all tick marks and have your evil code, and then you can literally just copy and paste the entire works of Shakespeare, and that's going to really throw off your overall character frequency analysis because it's going to look like normal English, right? Well, that's the whole point: why don't we drill down and start to gather features from each kind of cmdlet or each kind of token, so that we force someone to basically put in bad data in every single place that we're looking. There's a great tool out there called AST Explorer

which is just a little GUI where you can basically paste in any kind of code you want and explore how the AST structure is. So with this we were able to extract things like distribution of AST types. So again, out of the whole script, let's say 90% of this entire script is nothing but cmdlets or strings or arrays or stuff like that. So we did just a straight-up distribution of those types: assignment, binary, invocation operators. We did an array size range. Again, a lot of arrays aren't that big, but if you have arrays of like thousands of characters, maybe that's shellcode, that's kind of odd, right? And then for every kind of AST type, again,

cmdlets, strings, methods, all this stuff, functions, comments, we basically did character frequency on each of those individually as well as on the entire script itself. So we did character frequency, we also did things like entropy, we did the length (maximum, median, mode, range, average), whitespace density, character casing. Again, randomized casing is in a lot of frameworks; if you see 50/50 uppercase/lowercase, that's randomized casing, because typically it's gonna be like 12 or something. All these different things, and we basically pushed them all together so that for any input PowerShell command or script we get almost 5,000 features. That's a lot more than just the alphanumerics, right? And so we were really excited about this.
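A few of the per-string measurements named above (entropy, whitespace density, character casing) can be sketched like this. It is a simplified Python illustration and an assumption-laden one: the real tool computes such features per AST node type, whereas this sketch works on a plain string and does not parse PowerShell at all. The function names and the sample string are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Bits of entropy per character; 0.0 for empty or single-symbol text."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())

def upper_ratio(text):
    """Share of letters that are uppercase; near 0.5 hints at randomized casing."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch.isupper() for ch in letters) / len(letters)

def whitespace_density(text):
    """Share of characters that are whitespace."""
    if not text:
        return 0.0
    return sum(ch.isspace() for ch in text) / len(text)

def extract_features(text):
    """Bundle the measurements into one feature dictionary for a single string."""
    return {
        "length": len(text),
        "entropy": shannon_entropy(text),
        "upper_ratio": upper_ratio(text),
        "whitespace_density": whitespace_density(text),
    }

print(extract_features("iNvOkE-eXpReSsIoN"))   # alternating case gives an upper_ratio of 0.5
```

Applying the same extractor separately to each kind of token (cmdlet names, strings, comments and so on) is what multiplies a handful of measurements into thousands of features.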

We spent a lot of time just getting as many features as we possibly could. Never once did we say this feature is bad or this feature isn't; we just said, I just want the data. But then it's like, crap, what are we gonna do with 5,000 features, right? How do we actually make meaningful use of this data? So basically what we have here is that, of these 5,000 features, some of these features matter a lot more than others, and instead of us going through and figuring out which one matters more than the other, we used a little thing called logistic regression. So the coloring is a little bit unfortunate, but basically you have a logit function to say, again,

if you have thousands of cmdlets in one script but only like ten comments, you can't deal with a range like that. So we use something called a logit function to basically bring all the values between 0 and 1 and kind of smush it down on this plot, and then we use linear regression. So what does that look like? Basically we go through and we take all 5,000 features and we have a weight. That weight is basically saying feature 1 is this important, or feature 2 is this unimportant, or feature 558 is this important. And so for every single input script or command we extract all 5,000 features, and then for each feature we

multiply it by its weight to say how important are you. We then add up all those results, and if it's over a certain point then it's obfuscated; if it's below a certain point it's not obfuscated. So now again it comes back to, well, how do we decide the values of the weights? How important is one feature over the other? We let the model tell us that. What I mean by that is we take that labeled data, you know, 7,000 scripts, and through thousands of iterations we send it through and say, okay, pull all the features. And we start with all the weights being exactly the same, everything on a level playing field, and

it goes through and it pulls out all the features, does all the multiplications, and it says, are you over one or are you less than one, are you obfuscated or not? It makes that decision, and then it says, okay, since this is labeled data, I happen to know whether I was right or not. Now whenever it was right, it's all good, but whenever it was wrong it looks and says, okay, what weights contributed the most to me being wrong about this, and then adjusts those the most, either up or down. And the weights that contributed the least, it adjusts those the least. And so it keeps kind of tuning and tweaking each thing through thousands and

thousands and thousands of iterations, and as you can kind of see in this concentric circle, it kind of gets closer and closer to the truth as it moves all these weights up and down. So how does this actually work at the end of the day? Well, again, remember with our cosine similarity we had 89% precision and recall was 37%. Now for anyone who's not a blue teamer in the house, this next slide is our equivalent of getting DA, so we're gonna get really excited about this. So just to let you in on this: our model performed at 96 percent precision and 94 percent recall, which we were really excited about, and false positives were 0.01. So this model is literally ten

times better at finding obfuscated code with half the false positives. So we're really, really excited about this. Now you may step back and say, okay, what about sketchy stuff, right? With obfuscation there's kind of a spectrum. Something can be really, really obfuscated, like the thing on the left where it's just all special characters, but then there can be some other things where it's like, I don't think that's bad, but I kind of want a second look. So we actually went back and labeled things as sketchy. And if you think about it this way, it's like if you had the world's best intern, this intern would be the one to go

around and say, hey, I don't think this is bad, but I'd appreciate it if you took a second look. That's kind of the approach that we took. And so if we start labeling sketchy stuff, what we'll actually see here is that overall it performed at 88 and 89 percent, so basically it was wrong a lot, but it actually captured 97 percent of the evil that was out there as opposed to the 94. So what that means is that if you want the super high confidence, the stuff that was labeled as really, really obfuscated, then you're gonna get that 94, 95 percent, and that's basically the default for the tool: it goes for high confidence

first. But then if you want to cast a wider net and say, yeah, I'm okay to get some sketchy stuff as well, I'm willing to weed through some false positives and find some more stuff that maybe I'm missing in the other model, then you can do that wider net approach. So what about other algorithms? Again, not being a data science person: with this labeled data set you can take this and plug it into any kind of machine learning studio that you want. So Lee Holmes spent a little bit of time doing this and testing out tons of other algorithms to say, did our approach actually fare well against all the others? And it

actually did. It ended up being almost tied with one of the other ones, but just a little bit better. And so again, you don't have to have some fancy studio like this to do what we're doing. We actually did all this by hand, and it's all out there open source on GitHub for anyone to be able to take that same data and do cool stuff with it. So, a quick demo of the tool. There's three parts of the tool. This first one literally has nothing to do with finding evil; it's just all about ASCII art, because who doesn't like ASCII art? So it's colored ASCII art. It's just doing a quick little demo of, here's

the command, here's some of the feature extractions that we do, over 5,000 features in less than 300 milliseconds for most PowerShell scripts, blah blah blah. And here's the interactive menu. So basically the things to keep in mind are you have some options here. One is tutorial, which is a glorified colored README. Well, who doesn't like colored READMEs, right? So if you don't want to read the README on GitHub you just do tutorial. Fun facts: again, we spent a ton of time going through a ton of data, like 4 gigs of uncompressed PowerShell scripts. It is a lot of data, and there's a lot of fun things that you'll learn about that data, and so

there's some fun facts in there, some random fun facts. ASCII art: when you go through that much data, there's some really creative people out there who are way more artistic than me, so we just went and wrote some rules to go through and find ASCII art. I spent an entire weekend doing nothing but going through ASCII art, like on my exercise bike at my gym, and I felt like, you're such a nerd. So anyways, there's random samplings of ASCII art that you can check out. You can also do a fun set of quotes, and then also credits. So again, every single GitHub contributor for any PowerShell project as of a couple months ago, your

name is in here, and it'll go and randomly display, hey, thanks to these 10 contributors. And there's also a sub-list in there of our favorite contributor names, because there's some really creative contributor names. So some fun stuff in there. Now if you actually want to find evil, just throw that out the window and get to the really good stuff, which is two components. One is reassembling scripts out of script block logs. Earlier I talked about how PowerShell 5 is really awesome and you have these EID 4104 events. Well, they're really verbose, right? And so in some environments where you enable it, people are saying, well, we can't possibly index all this data

into our SIEM, and it's like, that's understandable, it's a lot of data. So how can we make meaningful use of it today? Well, if you enable that logging, you can either acquire the raw .evtx files or use Get-WinEvent; we include a lot of ingestion options. So basically we'll take the full event log and pipe it into this function and it'll go and reassemble the script out of the script block logs. So if you have a really long script, it can be spread across hundreds of script block logs, and this will reassemble that for you, and it turns it all into PowerShell objects, which is nice. So you have PowerShell scripts now, either command

lines from whatever sources you have, maybe you have a directory of scripts you've pulled out, you know, off the wire or other things, or you go out and you query your event logs and you reassemble all the scripts, and now you have it in memory. So the last piece is pushing it through the magic, which is Measure-RvoObfuscation. And here we are chewing through 39 scripts and labeling each as obfuscated or not obfuscated. Again, this is basically taking all the scripts, extracting 5,000 features, multiplying the 5,000 against the weights, adding them up, saying are you obfuscated or not, and then coming back with the full results of the script, saying true or

false, is this obfuscated. It gives the full time of how much it took to extract the features, measure the features, all this stuff, and then also all 5,000 features are listed in that object for every single result, so you have everything at your disposal if you want to dump it to disk or do whatever else you want to start finding obfuscated code immediately. Now keep in mind this does not find bad code per se; this finds obfuscated code. There is still a really important role for looking for indicators of compromise in whatever frameworks you're seeing out there, so I think it's important to add this as a supplement. So we're looking for these

known bad syntaxes or module logs or other things like that, and then for everything else that we're possibly missing, just run this against it and see what it comes up with.
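The scoring and training loop described in this talk (multiply each feature by its weight, add the results up, compare against a threshold, and nudge the weights whenever a labeled example is scored wrongly) can be sketched as a tiny logistic regression. This Python sketch is an illustration under stated assumptions, not the Revoke-Obfuscation implementation: it uses two made-up features instead of roughly 5,000, six made-up labeled samples instead of 7,000, and hypothetical function names. The squashing function that maps values into the range 0 to 1 is usually called the logistic or sigmoid function (the logit is its inverse). The `precision_recall` helper reproduces the arithmetic of the earlier worked example: 1,000 scripts flagged, all correct, out of 10,000 obfuscated ones.

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def score(weights, bias, features):
    """Multiply each feature by its weight, add everything up, squash to (0, 1)."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))

def train(samples, labels, epochs=2000, learning_rate=0.5):
    """Start with equal (zero) weights and adjust them after every labeled sample."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            error = score(weights, bias, features) - label
            # A weight is moved in proportion to the error and to its feature's value.
            weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias

def precision_recall(true_pos, false_pos, false_neg):
    """Precision: share of flagged items that are bad. Recall: share of bad items flagged."""
    flagged = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall

# Made-up features per script: [tick-mark ratio, special-character ratio] (assumption).
samples = [[0.02, 0.10], [0.00, 0.15], [0.01, 0.12],
           [0.45, 0.60], [0.30, 0.70], [0.50, 0.55]]
labels = [0, 0, 0, 1, 1, 1]            # 1 = hand-labeled as obfuscated

weights, bias = train(samples, labels)
predictions = [1 if score(weights, bias, s) >= 0.5 else 0 for s in samples]
print(predictions)

# The talk's example: 1,000 flagged and all correct, but 9,000 obfuscated scripts missed.
precision, recall = precision_recall(true_pos=1000, false_pos=0, false_neg=9000)
print(precision, recall)
```

Retraining after a bypass is found amounts to appending the newly labeled samples and calling `train` again, which yields a new weight vector.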

The very last thing I'll say is that Lee and I really wanted this tool to be used by anyone. We want anyone, after the whole really discouraging first part of this talk from a defensive perspective, to say, yes, there's a tool we can actually start using today, and to be able to go back to your organization and say to your manager, hey, we really need to get to PowerShell 5. And they say why, and you say, well, there's crazy obfuscation you can do, right? And they say, okay, what would we do with all that data, we can't index it. We don't have to. If you have a syslog server, just anything you can dump this

data to, and then point this tool to it, then you can actually start detecting this for zero dollars out of your pocket. With that being said, we want this to be as user-friendly as possible from an operational perspective, not just for the research community or anything like that, because we are operational security practitioners. So what that means is that if you're using this in your organization and stuff comes back and it gets flagged as obfuscated, you can say, okay, this actually isn't obfuscated, this is good, I never want to see this again, and you can basically whitelist that. And so we have a couple options. You can basically drop scripts into a

whitelist directory; it'll hash those and then never alert on those exact hashes again. You can also whitelist based on strings or regex matching, which you want to be a little careful with so you're not too open with it. But again, we wanted it to be really user-friendly to kind of operationalize this in your environment. So with that being said, you can get the code; it's on my GitHub. It's also hosted in the PowerShell Gallery, which means on any PowerShell prompt you can just type Install-Module Revoke-Obfuscation and have it up and running within 30 seconds, finding obfuscated PowerShell. And in the tutorial, if you follow

on GitHub it will take you to the tutorial. I actually included some .evtx files and some other files on there, so all the sample commands in this tutorial are actually running against real data that we produced. It's not actually malicious, but it is actual obfuscated data, so you can see right away how it works and kind of the speed at which it works. In conjunction with this research, Lee and I at Black Hat and DEF CON released this white paper, which is hosted on the FireEye blog there, which is everything I just said but without having to hear me say it. It's all written, and there's fewer jokes, let's be honest. So you can take your

pick there. And then a couple references, like to some of the previous tools for PowerShell obfuscation, some of the blogs about detecting it, and then again the last one here, "PowerShell ♥ the Blue Team" by Microsoft. If you've not read that blog, it's a blog post and white paper that is just a phenomenal overview of all the insanely awesome preventive and detection-based capabilities that Microsoft has put into PowerShell over the past several years, and it's just really, really cool. So that being said, that is the end of my talk, so thank you very, very much for listening.

So I think we have five minutes or so for questions. Yes? So the question is, of all the things that flagged as obfuscated from our corpus, how much was malicious? Most of it actually wasn't. A lot of the stuff that was obfuscated was like code golf competitions and stuff like that. There's a lot of the same stuff: if you just search through the corpus, which is available also, we made that public, if you just search for things like Invoke-Mimikatz, with the way that GitHub forks repos, everyone has their own copy of it. And so some stuff was obfuscated

that way. We did see some PowerShell ISESteroids, which is probably one of the original PowerShell obfuscators, which they were doing more for IP protection, but we do see some of that data. But in the end this is looking at data that people have uploaded to public sources, so we didn't find anything that was like, groundbreaking, this is really, really bad. But I've definitely used it on several engagements at Mandiant and it's been really cool to see it find some stuff. Some of the stuff we'd found through other means, but it was cool to see this model at work against just massive datasets. So really

good question. Yes?

character count by just attaching, like, Shakespeare to it. So what happens if you attach a sufficient amount of well-structured PowerShell code and you have a small amount of obfuscated code inside? Yeah, really good question. So the question was: in the character frequency analysis, if you have evil code and add all of Shakespeare's works, you could actually kind of counter that and it would all holistically look good. Well, with this model, could you basically take a really small piece of evil obfuscated code and have just massive amounts of code that's not evil? Yeah, that would be about the best approach to breaking it,

other than the other concept, which is, why don't I minimally obfuscate it? You don't need a thousand tick marks. If you know exactly what defenders are keying off of, you don't have to put tick marks everywhere, just obfuscate those small pieces. A really good project that was just released two weeks ago at DerbyCon is by Ryan Cobb. He's the guy who basically put together Obfuscated Empire, and he released a tool called PSAmsi, which uses AMSI, the Anti-Malware Scan Interface, to basically identify AV vendors that are using AMSI to inspect PowerShell and actually see what portions of the script they're keying off of, so that you can basically drill down and find, out of this

huge script, here are the five pieces that are part of the signature that this AV vendor has, and just obfuscate those. And he actually uses Invoke-Obfuscation to do just that. It's a really, really cool idea, super smart dude, so definitely a talk worth checking out in terms of evading the detection. Now at the end of the day you might think, well, that's kind of stupid for you to say, here's exactly how to evade this thing. Well, I have some amount of humility and realize that this is only so much, and we're really impressed by the results, but at the end of the day all it means is that

as we find examples in the wild or come up with our own examples that bypass this, literally the only thing we have to do is label those as obfuscated or not, add them back into the data set, and literally hit a button to retrain the model. Go get a cup of coffee, and in five minutes there's a new weighted vector that we can plug in. So yeah, but the exciting thing is that anyone in the community can do that. So as people find different models or different vectors that make more sense, hopefully they share that back out with the community and say, hey, this actually works better than what these guys came up

with. So that's our hope, that real data scientists will go and do some cool stuff with this, or people who aren't data scientists will start to take this data and play around with it and not have to scrape GitHub for like a month and a half. But really good question. Yes? It was definitely my code, yeah. With APT32 using obfuscated PowerShell, it was pretty evident to me pretty quickly that it was my code. That was the first time I'd seen an APT group using my tool, which is kind of a happy-sad moment. But I don't know, unfortunately there's no way to give this stuff out to only the

good guys. But yeah, I think the most responsible thing is sharing as openly as we can with the community and saying, be aware, this is a thing, how do we come together to detect it? And this was really my best attempt at an answer, to say, open source: how can people take hundreds of hours of mine and Lee's research, built on other people's research, and use it immediately to help better protect themselves? But yeah, really good question. Any other questions? Awesome, thank you again very, very much. [Applause]