Paul Melson - How To Write Good YARA Rules

Name: Paul Melson - How To Write Good YARA Rules
Uploaded: 2022-10-04
Duration: 59 min 6 s

BSides Augusta · 2022 · 59:06 · Published 2022-10

Speakers

Paul Melson

Tags

Category: Technical

Style: Talk

Mentioned in this talk

Tools used

7-Zip, CyberChef, YARA

Service

VirusTotal

About this talk

If you have ever used grep or Ctrl+F to search a file, you can write a YARA rule. But can you write a good YARA rule? YARA is VirusTotal's open source detection rule engine and language, and it is increasingly being leveraged by other security tools and teams because of its usefulness. This presentation will go into depth on YARA rule syntax, performance, YARA modules, and use cases including memory and packet capture analysis.


All right, hey. All right, I see we emptied out after the keynote. That's about what I expected, but I'm going to take it on faith, then, that everybody who's here in the room has a distinct interest in getting better at writing good YARA rules. So we're going to go deep on a little bit of this, and we're going to go fast, but hopefully this will be fun and useful for those of you that are interested in the topic of writing good YARA rules, emphasis on good. The original title of this talk was "The Derek Zoolander Center for Analysts Who Can't Write Rules So Good," but they told me that's too long for the bulletin

and no one would get the joke. Apparently they were right. That's fine. So for those of you that are interested in playing along and want to do YARA things while we do YARA things, there's a GitHub repo out there. I just put it up this morning. So go out to pmelson, BSides Augusta 2022. You can go ahead and clone that repo. It has a bunch of sample rules that I'm going to be going through in the talk. There's also a file out there, a 7-Zip file with a password. If you've been doing this for a minute and you know the password, then you also understand the value of that proceed

with caution sign. I'm not going to publicly admit to violating any GitHub terms of service. It'll be up to them to figure out what's in that file, but we might talk about it at the end. So that's the other thing, by the way: stick around, because I think the best part is the live demo at the end, where you're going to help me write a YARA rule for malware that's never been seen before. But first, a little bit of grounding. So what's YARA? YARA is an open source tool developed by VirusTotal. They call it the pattern matching Swiss Army knife, but what it really is is a rule engine that provides a human

readable rule language, a mnemonic rule language, and it supports sort of three key characteristics. First, user-defined patterns: strings, bytes, regex. We'll get into all of that in a little bit. Second, parsers. They call them modules, but what they really are is file parsers that give you direct shortcuts to be able to interact with particular types of files, and we'll play with some of those later. And then a Boolean logic engine, and that's really the magic in this, right? It's like grep and awk on steroids, because you can do kind of all the things with that logic engine and some of the things that they

expose in the conditional part of the rule. So let's ground a little bit on YARA syntax. This is an example rule. This rule doesn't really do anything; I wrote it specifically so that we could talk about rule structure. You'll notice all rules begin with the verb rule and then the name of the rule. That's how you declare a rule. Then open curly brace, and everything that happens next is the rule until the following closed curly brace, right? Fairly simple, kind of straightforward markup. And then there are sort of three sections. Meta is a section where you put your metadata. Fairly simple, it's a key-value kind of thing, right? So on

the left is your key name, equal sign, and then whatever's in the double quotes is your value. Put whatever you want in there; this is completely free-form, but there are some things you should consider when you're writing rules. So does anybody have a thought as to what the most important field in the meta section is? I'll give you a hint: it's not the first one. Ooh, hash is a good one. I like people that put hashes in their meta field, but that's not it. Description, right, absolutely: what does this do? The second most important field? Nope, not author, not hash. Date, exactly: when did you do this? What are you trying

to do and when did you do this, because if I'm looking for YARA rules to help me detect a particular piece of malware, a rule that was written two months ago is different than a rule that was written six years ago. Not to say that the six-year-old rule doesn't necessarily have value, but it changes my assumptions about what I'm looking at, how effective it's going to be, and what I might need to do to be able to leverage it, right? I do think author is sort of an important field, not just for the credit aspect of it, because frankly nobody ever has paid me for releasing YARA rules, but

moreover because it gives the person who's looking at that rule down the road a touch point to come back to you if they have questions. Another thing you'll notice is the hashes. What's the most important thing about that hash if you're going to put it in the rule? Well, sure, okay, it's got to be accurate. But the most important thing is it's got to be on a public repo, right? That file needs to be on VirusTotal or abuse.ch or vx-underground. It's got to be somewhere people can get to it. If you put a hash value up there that no one can get to, you're a jerk, and I'm not going to

name-check any companies that do this on a regular basis, but there are a couple that like to release YARA rules with IOCs that no one has access to, and that's the most useless thing in the world, because I can't check your work. I can't validate whether what you're saying is actually true or not, and now, best case scenario, you're going to make me burn a retrohunt to try and find something else so that I have a sample to work from, right? All right, so enough about meta. You can do all kinds of things with it: add tagging, categorization.
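A minimal sketch of a meta section along the lines just described. The slide is not part of the transcript, so the rule name, every field value, and the placeholder string are hypothetical; only the field names (description, date, author, hash) come from the talk.

```yara
rule Example_Meta_Fields
{
    meta:
        // Most important per the talk: what the rule is for
        description = "Placeholder: what this rule detects and why"
        // Second most important: when it was written
        date = "YYYY-MM-DD"
        // A contact point for whoever reads the rule later
        author = "Placeholder Name <name@example.com>"
        // Only useful if the sample is in a public repository
        hash = "placeholder: SHA-256 of a publicly available sample"
    strings:
        $placeholder = "placeholder string from the sample"
    condition:
        $placeholder
}
```

Meta values are free-form and are not evaluated when the rule runs, so a team can add its own tagging or categorization fields alongside these.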

I've seen our internal team has a whole standard for YARA rule headers that they leverage, which makes it easy for them to index and cross-reference actors and things like that, right? So you can put all kinds of data in them that make them useful for whatever your internal ops are. So those are some sort of standards. Strings: fairly straightforward, it's just pieces of values that you want to be able to reference. You'll notice the names start with the dollar sign, then the variable name, and then whatever's in double quotes. As you move down, my first example is just a regular

text string, a UTF-8 string. You'll also notice some of the lines have additional modifiers after them that indicate what type of string we're looking for. If you don't indicate something, it's the same as putting ascii in there. You can use ascii to say explicitly that this is an ASCII string, but if you don't, that's the implied value. The next one is a wide string. Wide, sometimes also called long or UTF-16, just means that when YARA looks at that, the compiler is going to insert a null between each of those characters before it begins its search. nocase: fairly straightforward, case-insensitive. I camel-cased that just to

emphasize that, but obviously the camel casing inside of a thing that I put nocase on is sort of redundant and silly. Again, you wouldn't really do this in a real rule. Regex: regular expressions, PCRE-ish. I say ish because there are some things that happen with escaping characters in YARA regexes that make them a little different, but I would say for the most part it's PCRE-compliant. You can do, I would say, 99 percent of the things you would want to do in a regex, and probably a hundred percent of the things you should do in a regex. So, regular expression support, and then byte strings. And byte strings

are fun because you essentially have full access to the entire character set, and byte strings also have some cool features. They support regex-style alternates and wildcards and some of those things, which are super handy. So especially when you start getting into opcodes and things like that, byte strings are the way to go. And then conditions. You would never actually use this one. So, any of them: them refers to every string in the strings block, so any of them would mean if any of these strings is present in the file, the condition becomes true. It returns

true. All of them means that every string listed has to appear at least once in order for the condition to be true. And obviously, if you're any good at Boolean logic, you know that this is stupid, because any of them supersedes all of them. If any of them or all of them can be true, then it's just an any of them statement. But I did it mostly to show off the parentheses and the fact that you can chain conditions together with ors and ands, and we'll get into some more interesting versions of that later. And this is what it looks like when we run it. So our test file is hw.txt, which says hello world. Our

rule we were just looking at is hw.yara, which is our hello world rule. Run it against the file and you can see the match. The output you get is rule name followed by file name, and that's the default output. I also added the -s switch, which is like my favorite debugging switch. It just shows you all the string matches and their positions. So you can see we matched on everything but the UTF-16 string, the wide string in that rule, because our file doesn't contain that value. Some interesting things to note: with the case-insensitive match, we didn't get the value of the string, we got the value of the block from the file, and the same

with the regular expression string: we didn't get the regular expression itself in the output, we got the value that it matched on. Interestingly enough, with the byte string we still got the value that it matched on, but it gives it to us in hex format and not in text format. I'm assuming that's just a default to make processing the output easier, because obviously you can access non-ASCII values and non-escaped values in byte strings. So it's interesting that it would output that way, but it's worth noting for those of you that don't speak hex that that is in fact the exact same hello world string that the other strings are matching on.
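The example rule is not reproduced in the transcript. A reconstruction consistent with the description might look like the sketch below; the rule name, the string identifiers, and the assumption that hw.txt holds the text Hello World are illustrative and are not the speaker's actual files.

```yara
rule hello_world
{
    meta:
        description = "Structure demo only; not a useful detection"
    strings:
        // No modifier: treated the same as adding ascii
        $ascii  = "Hello World"
        // wide: each character is followed by a null byte (UTF-16 style)
        $wide   = "Hello World" wide
        // nocase: matches regardless of letter case
        $nocase = "hElLo WoRlD" nocase
        // Regular expression, written between forward slashes
        $regex  = /[Hh]ello [Ww]orld/
        // Hex (byte) string: the same text spelled out as bytes
        $hex    = { 48 65 6C 6C 6F 20 57 6F 72 6C 64 }
    condition:
        // Redundant on purpose: shows parentheses and chaining with or
        (any of them) or (all of them)
}
```

Saved as hw.yara, this would be run as yara -s hw.yara hw.txt, where -s prints each matching string together with its offset. Against a plain ASCII file, the wide string is the one expected not to match.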

Oh, and 0x0: that's the offset of the beginning of each of the matches, and that's a super handy feature when you're doing rule development, to know where in a file you can find the strings that your rule is connecting on. All right, so a little more. This one is Hello World, take two. The point of this is really to demonstrate some of the more advanced functionality in the condition block, so we'll just skip straight to that. So filesize is an internal mnemonic. By the way, hw.txt is exactly 12 bytes, so filesize less than

13 is a condition; any file over that, we're not going to match on. And then we can start calling the variables directly: hello at position zero, and world in the range of 4 to filesize. So filesize just becomes a number. It's essentially an integer that is exactly the number of bytes in the file that is currently open. But you can use it in a range. We could have said four to a hundred, or four to twelve if we knew what it was going to be beforehand. filesize lets us express that; it just means end of file. Or hello world and, or hello and world, or all of
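A sketch of the condition features described so far (filesize, at, and in with a range). The slide is not in the transcript, so the rule name and string identifiers are assumptions, as is the 12-byte hw.txt containing Hello World.

```yara
rule hello_world_take_two
{
    strings:
        $hello = "Hello"
        $world = "World"
    condition:
        // filesize is the size in bytes of the file being scanned
        filesize < 13 and
        // $hello must start exactly at offset 0
        $hello at 0 and
        // $world must start somewhere between offset 4 and the end of the file
        $world in (4..filesize)
}
```

Because filesize is an ordinary integer, the upper bound of the range could equally be a literal such as 100 when the file length is not known in advance.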

And then this is a cool example. I love doing this, and you'll see some of this later: I classify the strings. I'll give them a grouping name and then usually an underscore to separate the individual value, so that you can use the asterisk to call groups of things in a single item. So you can wildcard your variable names inside the conditions, and that's super handy too. And so here's the output of this one. You'll see now, with the output and the offsets, hello is at zero, but because hello and world are now two different string matches, you see the offset for world has moved to the sixth

position in the file. Well, world was always at the sixth position, but the match starts at six now because it's a separate value. Okay, a little bit about modules. Like I said, we're going to start moving deep pretty quick. Modules are awesome; I highly recommend using them. They save you a bunch of time. Somebody spent a lot of time writing a module because it saved them time, so if it's going to save them time, it'll save you time. That top rule is an example of a DLL check. You'll notice there are no strings; it's just a single condition. We import pe, kind of Python style, at the top. So pe is

the module name, and that gives us access to all of the functionality within that module. Typically the way that you access that is in the conditions, and it'll begin with the module name, a dot, and then the capability. You can have essentially subordinate functionality, where there are multiple layers of functions, each separated by a period, but in this case it's is_dll, open and close parens. All this does is, if a file is in fact a Windows portable executable dynamic loadable library, a DLL file, it will return true. It'll match on all DLLs. In order to do this without using the PE parser, you end up with this condition stack down here, and I'll walk

you through kind of what we're testing for. So the first is we're going to look for the magic number. Not everybody, but you probably know that for Windows executables the first two bytes are capital M, capital Z, sometimes called 4D 5A. In YARA you're going to interact with integers inside of conditions as big-endian, and you'll see this a little later: if you do it in strings, you do it in byte order; if you do it down in the conditions, you're doing this big-endian. So what you have is uint16: declare an integer at a position. In this case we're taking position zero, the beginning of the file. uint16 says give

me an unsigned integer with a length of 16 bits, which is two bytes, right? So the first two bytes of the file, at position zero, equal big-endian 5A 4D, which is 4D 5A; it's MZ, right? Just to walk through what that's doing: literally, that first condition is just, what are the first two bytes, are they MZ? The next one is looking at offset 3C, and if uint16 is looking for two bytes, uint32 must be looking for four bytes, right, because 32 bits is four bytes. And again, it's big-endian. So really all that is is the letters P and E, because that's how you

declare a PE header, followed by two nulls, but again, big-endian style, so 00 00 45 50. And then the last piece is we're going to add an additional offset of 16, because 16 bytes off of the beginning of the PE header starts the PE header's characteristics section, and those four bytes at the characteristics section tell you what type of file it is. Essentially, if decimal 32, hex 20, is set in that position within those four bytes, that indicates that it's a DLL, right? And the whole point of showing you

this is that in order to do this manually, you have to really understand at a deep level what the files do and how they're laid out. You actually have to go read that Microsoft link that I included there, which gives you the full breakdown of how IMAGE_FILE_DLL gets set in the PE header flags. Or you can use the module and write that one line and never think about it again. All right, and just a proof of life. dll.yara contains both of these rules in it. Run them over the same files and you'll see they both match. And I just ran that against my zoo.
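The two DLL rules are not shown in the transcript. The sketch below is one common way to write them and should be read as a reconstruction under assumptions, not the speaker's exact code; the rule names are hypothetical. Two details differ from the spoken description. First, YARA's documentation describes uint16 and uint32 as little-endian reads, which is why the on-disk bytes 4D 5A are written as 0x5A4D. Second, the field offset and flag value used here follow Microsoft's PE format documentation (Characteristics is a 2-byte field 0x16 bytes past the PE signature, and IMAGE_FILE_DLL is 0x2000), since the numbers in the captions may be garbled.

```yara
import "pe"

// Module version: the pe parser does the header walking
rule is_dll_module
{
    condition:
        pe.is_dll()
}

// Manual version: the same test written against raw offsets
rule is_dll_manual
{
    condition:
        // "MZ" (bytes 4D 5A) at the start of the file
        uint16(0) == 0x5A4D and
        // The value at 0x3C points to the PE signature "PE\0\0"
        uint32(uint32(0x3C)) == 0x00004550 and
        // IMAGE_FILE_DLL bit set in the COFF Characteristics field
        (uint16(uint32(0x3C) + 0x16) & 0x2000) == 0x2000
}
```

Saving both rules in one file and scanning a directory of known DLLs is a quick way to confirm that they agree.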

These are just some loaded DLLs in the wild. All right, so let's talk a little bit about YARA use cases, right? That syntax, that's how to use it. What can we do with YARA? There's a variety of different use cases. I'm going to move through these fairly quickly, but I think it's worth understanding the four important use cases. You're going to wonder, why is this guy that has dedicated so much of his career to detection saying detection is not a use case? Because that's all YARA does. YARA is a true/false engine for matching. That's detection, done and done, right? Like, it

either matches or it doesn't. That's detection. So let's talk a little bit more specifically about the intent that you need to have authoring a rule, especially when you're thinking about how to write detection. So, identification rules. An identification rule would be an example of a rule where you want to positively ID a file. Let's pick on malware, right? I want this rule to say definitively that what I'm looking at is in fact a PlugX sample or a Bunitu sample or whatever the malware family is, right? And so the goals are going to be a really positive identification. I'm going to want to be confident that what I'm looking at is in

fact the malware family I'm looking for. I'm also going to want a really low false positive rate. The downside of that, though, is that I'm engineering for specificity. I'm going to include tool marks, I'm going to include metadata, I'm going to include functional code in those things that I'm looking for when I'm writing that rule. The downside is there's a chance of false negatives. That rule is going to be a little bit brittle. It may also be easy to bypass. If I publish this rule, the author may go out and say, oh... similar to the talk that Tim Medin just gave on the Defender bypass that was

hooking on Tavis Ormandy, which, by the way, I'm proud of myself for not coming on hot mic and just saying, yeah, well, defenders get to be lazy too, because if it works, it's not stupid. And Tavis isn't on my list, but there are a handful of offensive tool authors for whom I absolutely have detection just for their name occurring in a file, because it's always worth looking at, and bad guys don't strip out comments as often as they should. So anyway, they might leverage that same information if you release a rule into the public stratosphere. They may go, oh, I can change that

value and now your rule doesn't fire anymore. So when I think about identification rules, I definitely want to focus on those functional artifacts and required code to reduce the likelihood of false negatives. You really want to hook into what those capabilities are, so that in order to bypass a piece of detection the author has to change not just the metadata and arbitrary values that don't mean anything. It's not that you shouldn't use those, but you shouldn't use only those, because if you can hook into functional code, well, now they have to change how their tool works. And at the end of the day, if you can make an adversary have to change how

their tool works every time you write detection, you're eventually going to exhaust the resources and win. Classification rules: this is where we're looking for the existence of known values, right? We want to be able to say with a high degree of confidence that this thing does a particular thing. So, this file is an executable that loads a series of functions, maybe from Winsock's ws2_32.dll, that establish outbound socket connections with connect, right? That would be a really good one to have some detection for. Why? Because no modern application is using ws2_32 for

outbound socket connections. That's a terrible idea; we quit doing that in the early 2000s. But the functionality still exists, and you know who does it all the time? Cobalt Strike, boom, and Meterpreter, and Sliver, right? But that's just it, right? That's almost like a code TTP. I shouldn't have said TTP, but it's a code technique that isn't necessarily going to positively ID a thing, but if a thing does that thing, it's probably worth looking at. And so that's the goal of a classification rule. Obviously you're going to run into some false positives, because if you're

capitalizing on the functional selection and the bad coding practices of your adversaries (and, by the way, most of your adversaries have really bad coding practices), if you're focusing on those things, eventually you're going to come across a developer who's not a criminal or a spy, who also is bad and wrote a bad thing, and you're going to hook their stuff too. Oh well. And then this is also where you start to run into performance trade-offs, right? Because one of the other things you want to do in this scenario is account for variables and variability. How many different ways could this thing happen? I'm going to walk away from some of the specific context

I've seen in specific samples and try and broaden my horizons a little bit when I think about how I write this rule. The downside is that the number of strings and the types of encoding and things like that that I'm going to include in the rule are going to go up some. So that leads to the pro tip I have for writing this kind of rule: if your rule is more than 20 lines, it's probably not one rule. With identification rules and hunting rules, go nuts. One, because you can sort of accept the performance; you need the specificity for identification.

When we get to hunting, with hunting you have more cycles, you can be more expensive. But for classification rules, especially if you do intend to use them for alerting purposes, it's important to keep them kind of tight and constrained, and it's better to build groups of rules or families of rules rather than one rule to rule them all. So, hunting rules: distinguish anomalies or suspicious files. You want to match on suspicious content, you want it to be flexible, potentially with high false positive rates on things like this, and you'll never be done tuning these. And the other piece of

advice, and let me show you one of my scars here: this is why comments exist. Because you'll deploy a hunting rule and it won't fire very often and you'll forget about it, and then a year later it'll fire. And if you just kind of packed that one up and shot it off into the stratosphere, and you come back to it a year later, it's, man, what was I thinking, right? Do future you a favor and comment your rules, especially your hunting rules. You should comment all your rules anyway, but really, really with your hunting rules. Like, there's no

limit to the number of comments or free tags, and that doesn't have any performance impact. Just go nuts. Be like, well, I saw this thing and I was thinking this thing, and I'm not really sure, but maybe this could work or maybe that'll work, right? Go nuts. And then finally, triage rules. These are my favorite, and they're the most undervalued use case for YARA. The whole point of triage rules, and this is the -s flag use case, right, is we're going to use YARA to extract interesting things from files before we even start our analysis of the file. So the downside to

triage rules, of course, is they're completely useless for alerting. You wouldn't put these out in the world, right? So I'll give you an example. I have one that's my go-to that I call extract indicators. It has some really tight regular expressions and a little bit of good logic for finding encoded, obfuscated, and just plain URLs and fully qualified domain names and file paths in files, right? So if I'm looking in an executable file, if there are PDB references or there's a URL, that's potentially really, really interesting info to have as you're getting ready to look at a file. And if I can get all of that in whatever context,

instead of having to spend 20 or 30 minutes slicing it up in CyberChef or whatever, I can just run YARA over it. YARA just pukes out the things and tells me where they are in the file, and gives me some hints on where to get started. The downside, of course, is you would never ever use that for detection, because how often is there a URL string somewhere in a Windows executable? You might think to yourself, well, only if it does HTTP. No. Every signed binary everywhere has a URL in it, minimum. So, terrible for detection; never do that. All right, a little bit about YARA performance. So, some performance concepts. I want to talk

a little bit about anchoring. Some of these are really fancy, fun academic words; I'm going to break them down. Anchoring just means use a big string instead of a bunch of little strings. Why? Because the bigger string gives more context, more positionality. It reduces the likelihood of that value being misinterpreted. It also has some very real performance benefits. I'll talk a little bit about specifically how the YARA compiler deals with performance and how you can think about writing your rules to help there. But anchoring just means use a big string and minimize the amount of wildcards that you're using in those strings if you can, right? So for

example, in the hello world example, using hello world as one string, with the spaces and the punctuation, is far better than hello and world, because those matches can happen in any order, any place in the file, versus hello world, which tells me this comes before this and is separated by this. That's super useful. Cardinality: cardinality is a really fancy word. I recommend you learn it so that you sound smart, but it just means counting. The cardinality of a thing is the anticipated number of times that we will count it. High cardinality means lots; low cardinality means not lots. So anyway, we're looking for things that

are anticipated to be of low cardinality, which means they shouldn't happen a lot of times in a given file, right? And I'll give you some examples of good and some examples of bad. Regex performance: use ranges. If you're writing YARA rules with ranges like star, or curly brace, one, comma, curly brace, get out. You can do that on your local machine, but never, ever deploy that, because that means literally every time it tries to match on that, it is trying to match from a single byte to the end of the file. The number of cycles and the amount of memory you're wasting is obscene. You're the

hole in the ozone layer, okay? Wildcards, same thing: dot star, please don't, because dot star does the exact same thing if it's unbounded. And then use alternation sparingly. For those of you that know regex, alternation is parentheses, value, pipe, value, pipe, value, parentheses, and it means if any of those values exist in this position in the string, that's valid. It's okay to use that, and actually I highly recommend it, especially in the byte strings. That's a super handy feature to have. But keep in mind, if you have multiple alternates in a single regex, that's not a coefficient, right? That's an exponent on the number of strings

that have to be pre-computed in order for your rule to run, so keep those to a minimum. And quite honestly, if you're looking at your regex and you've got one that line-wraps and has a bunch of alternates and different tokens and things, what you probably should have is a couple of smaller regular expressions, and then work out the hard stuff in the conditions. And then finally, order of operations. You might think order of operations matters, but YARA doesn't care about your order of operations. So, a little bit about the YARA compiler, and credit where credit is due. So I

will link it at the end, but all of this information comes from a paper that Florian Roth wrote that's sitting out on GitHub. It is the paper on YARA performance and how it works. If you need to go deep on this, especially if you're going to be scanning tens of thousands of files or trying to do real-time YARA, I highly recommend reading it, but I'll try and give you the high-level overview. So YARA does two things when it compiles your rule: it optimizes strings and it does condition optimization. With string optimization, there's this concept called atoms, where it will take a larger string, so let's say a long string

with good anchoring, and it'll take three- or four-byte chunks out of it. Now, it's got its own algorithm as to which ones it will try and take, right? But if you had a long string that on one end had a wide amount of variety and at the other end was all A's, it's going to pick stuff where there's more variety, where there's distance between the bytes. And then once it builds that list of atoms, it searches the entire length of the file, from position zero to the end of the file, and creates an index of all those atomic matches. It doesn't matter

what's in your condition statement. So you might think your condition statement that says, hey, if this isn't the first byte of the file, quit and go away, seems like a good optimization strategy, but it's actually not for YARA, unfortunately. It's still good for you from an assurance perspective, but it doesn't make the rule faster doing it that way. And then finally, after it builds that index, it goes through and searches for each of the full string matches by expanding the strings, looking at wildcards and regular expressions. So that's how it optimizes the matches, right? It picks what it thinks are going

to be some good byte chunks, comes up with an index, and then expands each of the searches locally within those positions, so that it doesn't have to run your terrible regex over the entirety of the file. And then condition optimization: it picks which checks it runs first. So, simple logic, right? This equals this, this is present, this is true. And this is why it does the simple tests first: if you're just checking for the presence of a string, well, it already has the atoms in the index that make up your string. If any of those don't exist in the indices that are referenced in your condition, then it

can quit right then. It then moves on to expensive operations like math and comparisons. So it'll do simple things like equals, and then eventually moves on to comparatives like greater than and less than, and then eventually other things like ranges, and then finally it does module calls last. And so that's the performance trick for modules, right: modules go last. If the other conditions aren't met, the module never runs. All right, so an example of good rule, bad rule. These two rules do the exact same thing. They detect PE files, similar to the DLL rule I showed you earlier. In this case, the good rule has a

single string, "This program cannot". You could do "This program cannot be run in DOS mode", except I will tell you that there are more variations of this string out there in the wild than you would think. So my personal preference for hunting is I use "his program can". That sounds silly, but you get some variation in casing and wording in DOS stub headers for whatever reason; you know, Delphi does some weird things. Anyway, it's a nice string: high degree of cardinality, or sorry, low degree of cardinality, good anchoring. And then we go directly to the byte comparisons for: is

it, you know, are the first two bytes MZ, does it have a PE header, and then does this string exist, and we give it a particular position. In the bad rule, we declare these as strings, so they all get searched first. Well, what do you think happens when we search for two bytes? It can't atomize it, so it's got to do a full search for everything across the file. We break those up, which means that we lose the context and positioning and ordering among those strings and the PE header, and then we try to do the PE header in a range rather than being specific about

where it should be based on offset. Interestingly enough, that uint16(0) == 0x5A4D is significantly more performant than just declaring "MZ" as a string and saying MZ at 0. Literally, logically identical, but from a performance standpoint, because YARA doesn't care about your conditions until it's done searching the file, you're running an expensive search before you get to the logic piece. Hence the optimization. So it's a little counterintuitive, but trust me, the two rules perform very differently. But let's talk about a rule that performed even more differently. YARA added support for XOR matching on strings, which is awesome. If you've ever run into XOR as obfuscation,

this is super handy to be able to do this kind of detection. This is actually a use case that I ran into not that long ago. What I found was a .lnk file, a Windows shortcut file, that contained within it some PowerShell that took the file at a particular position, XORed it by a single-byte key, extracted a whole-ass RAT onto the disk, and then ran the RAT, which, yo, that's a nice trick if you can get it. I

had seen the Cobalt Strike version of that, where the payload's actually in the base64 PowerShell script, but those are pretty easy to detect because, you know, it's base64 in a link file. This was a workaround for it, which I thought was pretty novel. But I thought, hey, how do I go about this? Because I don't necessarily have a file structure to work from. So I'm just going to look for some header artifacts, and I got yelled at by YARA, as you can see there at the bottom. So this particular link file is 88K, it's a pretty small file. In fact, I'll be

specific: it was 89,878 bytes. That's a little file. How many times do you think that the XOR extrapolation of MZ appeared in that file? It was a couple times. So if you guessed that it was 526 matches in an 89K file, then you must have downloaded the GitHub repo at the beginning and be playing at home. But yeah, 1.2 percent of the file matches. Why? Well, because M and Z, capital M, capital Z, they're at a distance of 13, and by XOR-encoding them by all 256 potential values, you essentially end up with every two-byte sequence that has an exact distance of 13 matches, and

happens a lot. We'll just say a lot. It's really terrible performance, a bad rule. So if we were to go back and revisit this, how would you tune this rule?
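To see why a two-byte XOR string is so noisy, here is a minimal Python sketch. It is not from the talk, and the function names are made up for illustration. One refinement worth stating: what a single-byte XOR key strictly preserves is the XOR of the two bytes rather than their arithmetic difference, and 0x4D ^ 0x5A is 0x17, so any adjacent byte pair whose XOR is 0x17 is "MZ" under some key. In uniformly random data that is roughly one offset in 256. The random buffer below only borrows the file size from the talk; it is not the actual .lnk sample.

```python
def xor_variants(needle: bytes) -> set[bytes]:
    """Return the needle XORed with every single-byte key 0x00-0xFF."""
    return {bytes(b ^ key for b in needle) for key in range(256)}


def count_xor_hits(data: bytes, needle: bytes) -> int:
    """Count offsets in data where some single-byte-XOR variant of needle occurs."""
    variants = xor_variants(needle)
    n = len(needle)
    return sum(1 for i in range(len(data) - n + 1) if data[i:i + n] in variants)


if __name__ == "__main__":
    import random

    rng = random.Random(0)  # fixed seed so the run is repeatable
    # Assumption: random bytes stand in for a real file of the same size.
    blob = bytes(rng.randrange(256) for _ in range(89_878))
    print("2-byte needle hits:", count_xor_hits(blob, b"MZ"))
    print("long needle hits:  ", count_xor_hits(blob, b"This program cannot"))
```

The longer needle has far more bytes that all have to line up under one key, which is why keeping only the longer string is the easy tuning step.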

The answer is pretty easy: we just take that out. We leave the lower-cardinality string in, which only had one hit, and we take this out. A better solution is to do this. So this is my pro tip on XOR: if you're going to be using YARA for XOR, they have this really cool feature where you actually get to define keys and key ranges. So you can say xor but just give the specific key, or you can give it a range. In this case I'm intentionally excluding two XOR values. The first one is zero. Anybody know why we exclude XOR zero? If you XOR a value by zero, it's just

the value. It's like multiplying by one, right? And the other one, and this is just me showing you another scar, is hexadecimal 20, which is decimal 32. If you XOR an ASCII alphabetic character by hex 20, you will get the case alternate: you will take lowercase characters and make them uppercase, and uppercase characters and make them lowercase. There's probably a PhD-level computer science history paper on how we got there, going all the way back to ancient Greece and then decisions made by ANSI and things like that when we decided what characters

were what byte values. But anyway, those are my tips: if you're looking for obfuscation, avoid using those two particular XOR keys. All right, cool, we're doing all right on time. So now it's about to get weird: we're going to write a YARA rule today. Well, first let me introduce you to my guy. So this is JWoww. I think the appropriate Brazilian Portuguese pronunciation is João, which I think is an equivalent to John. But I've been stalking

this guy's malware for a few years now. So he's JWoww. This is an old picture of him; it's the only one I've been able to find. But he's got a YouTube channel and some other stuff. You can find him on HF and a few other places. And this summer he started playing with a new dropper that he wrote in .NET, and I didn't have any YARA for it. I hadn't seen it before, and I was working on this presentation on YARA rules anyway for BSides Augusta, and I was like, you know what, YOLO, right? Why

not write a YARA rule together, if for no other reason than to walk you through how I approach this process, which may come as a disappointment to some of you because it's not as hard or as cool as it sounds. So let me see, can you all see my VM? Nice. All right, so these are eight samples that I pulled. You're probably like, well, how are you attributing these to this guy? I'm going to just ask that you trust me; if you want to see receipts later, I'll walk you through it. But I can tie most of this stuff back to his username in a few different places, and he hasn't

done a video on this one yet. So the fun thing is we're actually going to get to see a couple of dev versions that he's been deploying. In fact, the other implant that this drops is named "teste", which is just the Portuguese for "test", right? He's got a dummy implant out there that he's been deploying all over the place just to get his attack chain working. So that's kind of fun; it's neat to watch an adversary work live in the wild. So first, what do we know about these samples? I just hit them with file, and, can everybody read that okay, or is it too small?

All right, see if I can... So anyway, if you're not able to see, they're all DLLs. These are dropped DLLs. They're coming in through an ugly combination: Windows Script Host to PowerShell, back to Windows Script Host, dropper to rundll32.exe. So anyway, these are roughly the same size and have a similar origin. So the first thing that I like to do when I'm looking at these, especially now that I know it's a DLL and it's .NET: something that I already know about .NET is in a lot of cases there's a

strings object and a bunch of different resource blocks in the .NET directory, and when it stores strings, it stores them as wide strings. So the first thing we'll do is just extract all the wide strings from all the files, and in Linux that's -e l, so in Linux it's "long" strings as opposed to wide strings, but it means the same thing. We'll sort them, then we'll use uniq to get a count, then we'll sort the output from uniq in reverse numerical order and pipe to less. And so what we get is this list of strings, and you can see, so the first

thing we know is, if the count on these strings is less than eight, because we have eight samples, then those strings don't appear in all the files, so we can exclude them from detection. That doesn't mean they're not useful for later analysis and intelligence purposes, but they're not going to make good detections because they aren't universal. But everything that's got an eight or better next to it is fair game. So now we're going to look around. I don't know how many of you have looked at the tail end of .NET binaries before, but you'll see a lot of things you recognize in there, like LegalTrademarks, LegalCopyright, InternalName, right? That's all

just stuff that Visual Studio does for you when you name a project, if you decide to. So that just lets you know that he's using Visual Studio, no surprise, but tons of stuff is going to match on those. So we're really only interested in a handful of strings here that might be unique. The first one at the top up there: it's odd that that sort of random-looking string would occur 32 times, which would tell you there are probably four copies in every file. So that's a good one; we'll grab that string.
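The triage step described here (pull the wide strings out of every sample, count them, and keep only the ones that show up everywhere) can be sketched in Python. This is an illustration under stated assumptions, not the speaker's tooling: the regex only approximates what `strings -e l` extracts, the minimum length of 4 is an assumed default, and all function names are hypothetical. It also counts how many samples contain each string, which tests "appears in all eight" more directly than a raw total of 8 or 32.

```python
import re
from collections import Counter

# Runs of 4+ printable ASCII characters, each followed by a NUL byte,
# i.e. UTF-16LE text (an approximation of `strings -e l`).
WIDE_RE = re.compile(rb"(?:[\x20-\x7e]\x00){4,}")


def wide_strings(data: bytes) -> list[str]:
    """Extract UTF-16LE ("wide") printable strings of length >= 4."""
    return [m.group().decode("utf-16-le") for m in WIDE_RE.finditer(data)]


def string_stats(samples: dict[str, bytes]) -> tuple[Counter, Counter]:
    """Return (total occurrences, number of samples containing) per string."""
    total, per_file = Counter(), Counter()
    for data in samples.values():
        found = wide_strings(data)
        total.update(found)
        per_file.update(set(found))  # count each sample at most once
    return total, per_file


def universal(samples: dict[str, bytes]) -> list[str]:
    """Strings present in every sample: the candidates for detection."""
    _, per_file = string_stats(samples)
    return sorted(s for s, n in per_file.items() if n == len(samples))
```

With the eight samples loaded into a dict of file name to bytes, `universal(samples)` would return the same shortlist as reading off the rows with a count of eight or more, minus strings that hit eight only because one file contains them several times.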

There's a couple others over there. I mean, there's the name of the DLL and then resources, so that's interesting. We'll grab those,

and kind of looking around, I'm not seeing anything else in the eight count that's necessarily super useful there. So let's switch over to the ASCII strings, or in this case s for short strings. So you see all I'm doing is just changing the l to an s, and now we're going to get the non-null-padded string values. There's quite a few more of these, but again we're only looking for ones that have eight or more. Fifteen is always scary, because that would tell you that there are two in every file except one that only has one. But in this case those aren't particularly great strings either: Dispose, Instance, CreateInstance. We can see pad

pad and a few other things, but we get a repeat of this y i p p h b, so we'll take that. That's interesting.

Just mark these up here. Another interesting one that kind of pops out to me is this version number right here, so we'll steal that one. Oops.

What am I doing wrong here? I'm failing at copy-paste; that doesn't bode well. All right. And as we look around, I see a bunch of other code-related stuff that's potentially interesting. Further down we see that that ppps XT qiu does repeat, so maybe we grab that one as well,

because that's interesting and it shows up in both cases. Let's see what else we've got. There's quite a few. Oh, there's some interesting stuff down here: killed, HideModuleNameAttribute, that's potentially interesting, especially if it's actually being used. Instances, probably not super useful in this case. We've got some debugger modes, DebuggableAttribute. By the way, if you see things here that you think are potentially interesting as we're scrolling through some of these, don't hesitate to shout them out. Ooh, WriteProcessMemory, that's a good one, but there's only seven of those. Oh, but there we go, here we go:

GetProcessById, that's maybe helpful.

All right, well, that's a good start. What we've done, though, is just go through and look for a list of strings that exist, or probably exist, across most of the files. The next thing, before we start writing a rule, is let's get an idea of what the context of some of these things is. So I'm going to flip over to my handy FLARE VM machine where I've got dnSpy loaded. So it's a live demo, but it's not totally uncooked. There are two different samples from that folder shown here, so you can see one up top and one down below, and you'll notice there's a little bit of a

difference between the two, where one of them has pretty crazy-looking resource names there. The reason for that is that our adversary, our actor, has run ConfuserEx over this, which is a packer, and so it's gone through and taken a bunch of the variable names and so on and renamed them to ridiculous, unpredictable things. And we could anticipate that if he's going to rerun ConfuserEx over this file every time he creates a new build, these names are going to be changed each time. So what's actually useful here is we get a

sense of which code artifacts, you know, down in the unobfuscated version, where we can come in here and actually look at the class and see, you know, this will be fairly close to the code as the actor wrote it. But as you can see in the upper example, a lot of the useful code names and strings have been blown away, which is why they don't exist in all of the files. So the important thing then is to start looking at what are those pieces of the file that do persist and repeat,

right? So we'll start by looking at the actual module itself. So that yipp hb.dll, that's what he named the module in the Visual Studio project. And we can flip around and see a few things here. We do see module type is DLL, so it validates what we were seeing earlier. Oh, we had seen that string before: that is in fact the .NET version string, so we can actually use that in constructing the rule, and YARA's dotnet module will let us key in on that, so it isn't just the presence of that string, but we can specify how that string is being used, which is potentially helpful. A few other

things we can look around at. Well, again, we know it's a DLL. Let's see what else. We can get down into this class here, so that Pol wkb, and we can poke around in here: interfaces, here we go, edit the method. It's not setting a lot of custom attributes on these things. Oh, we can show you characteristics while we're standing here. So you can see it there about midway down the screen on the left, and that's what the hex bytes look like: that 2000, or two zero zero zero in hex. In this case it's 2022, but that 2-0 still indicates it's a DLL.
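The characteristics value shown in dnSpy can also be read straight from the file by offset, which is the same style of check the "good rule" earlier relies on (first two bytes MZ, PE signature where the header says it is). Below is a minimal Python sketch under those assumptions; the function names are hypothetical, and it reads only the fields needed here rather than parsing the whole PE.

```python
import struct
from typing import Optional

IMAGE_FILE_DLL = 0x2000  # Characteristics bit that marks a DLL


def pe_characteristics(data: bytes) -> Optional[int]:
    """Return the COFF Characteristics field, or None if data is not a PE file."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return None
    # e_lfanew at 0x3C holds the file offset of the "PE\0\0" signature.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if e_lfanew + 24 > len(data) or data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        return None
    # The 20-byte COFF header follows the 4-byte signature; Characteristics
    # is its last 2-byte field, at offset 18 within that header.
    (characteristics,) = struct.unpack_from("<H", data, e_lfanew + 4 + 18)
    return characteristics


def is_dll(data: bytes) -> bool:
    """True if data is a PE file with the DLL bit set in Characteristics."""
    chars = pe_characteristics(data)
    return chars is not None and bool(chars & IMAGE_FILE_DLL)
```

A value of 0x2022 has the 0x2000 bit set along with two others, so `is_dll` returns True for it. The little-endian read is also why the two bytes "MZ" compare equal to 0x5A4D when read as a 16-bit integer.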

and then another thing that we can do

Ctrl+Shift+K, there's my keyword search.

I'm going to be in here here we go

All right, so a couple of the things that we can compare as we're looking through. In the deobfuscated version of the code, we can see that, because it's VB, he's manually calling DllImport rather than using native .NET inclusion for these functions. So he's got to call the DLL and the function by name here using DllImport, which is an interesting artifact, because when he goes to run his obfuscator on this, we also see that he has to leave those values alone, because if you obfuscate those particular values... Now, it's worth noting,

right, and he'll probably watch this later online and then fix this, and I guess I just accept that that's the cat-and-mouse game we're playing here in the name of education. But the way that he's implemented this now, we can take the CreateProcess and kernel32.dll strings, use those as pairs, and basically call out all the functions that he wants to use in here. And some of these are pretty useful, right? I mean, it's a dropper, so: ReadProcessMemory, WriteProcessMemory, UnmapViewOfSection, VirtualAlloc, ResumeThread. And it's the same functions that he's got to call for each of these, so we

could just build that out as a string list, because the obfuscators he's using aren't going to blow those away; because of the way they're implemented, if they did, it would break the functionality. Now, he could do something else, where he creates another function where he reconstructs those strings, populates them from an obfuscated string, decodes them, decrypts them, whatever, and then puts them back in there as a variable name inside of DllImport. He could do that, but he hasn't. So this is still a pretty good technique for us to use and get a list of some of these other things. The other thing that I was trying to

kept working

There we go. So the other thing that we can do is come back and take some of the interesting things that we found. We already found that one, that's a module name. But this guy here, this string value that we've been working with

to copy here oh there we go

That's whole words... so, where is, ah, there we go, that's what I was trying. Okay, now I'm feeling a little better. So you can see that that ppsx, that's how he's actually filling out the company, product name, whatever, so it's metadata that he's picked. So we can go ahead and use that, pull these values back together, and put them in some context. We're running a little long, so I will just go ahead and take you straight to the cheater, which is this rule here that we wound up writing,

and you can see we've got a solid matched rule there. But I'll walk you through what I did real quick and then we'll wrap. So essentially I took the meta build values, you know, for the different assembly name, the custom attributes, and the rest of that, and created a build-meta section in my variables, and then took the function names that he couldn't successfully obfuscate and put those into a set of functions. Then we use dotnet and pe, so pe.is_dll, and then the .NET version that we exploited. And the one thing I was unable to find and show you, though it shows up in strings, is the typelib,

which is going to be exclusive to our guy until he watches this video and then blows away his Visual Studio install. So quite honestly, just those two, .NET version and typelib, are pretty decent for being able to follow our guy around the internet right now as he's working. But yeah, this rule should probably burn his dropper for some period of time to come. So anyway, thanks for letting me live-work one up in front of you. Trying to write a rule in front of a live audience in 10 minutes, it was

a challenge, and thank you for indulging me while I tried it. And that's going to be it.

Oh yeah, I got two things to give away. So the first question: why don't you ever XOR something by the key hex 20? I saw that, Tom's hand... correct. All right, you get your choice: you want a Bash Bunny or the Black Hat Python? Which one? The book? All right, awesome. And then second question: where is our guy JWoww from? Who said Brazil? All right, awesome. Thank you so much for hanging out. Please, the links on the last slide there take you to the docs, take you to Florian Roth's awesome YARA tuning paper, and the GitHub repo where

these rules are, and the new rule will be shortly. So thanks so much.
