
Okay, so next up we have John Seymour from ZeroFOX, so put your hands together for John. [Applause]

And we're good? Awesome, cool. So today I'll be talking about labeling the VirusShare dataset and indexing it so that people can actually use it for machine learning and things like that. Before I start, I just want to get a feel for the people here: how many of you are on the data-science side of infosec? Okay. And how many are reverse engineers or malware analysts or anything like that? Do we have some? Okay, cool. I would love to speak with you after this talk, so feel free to come on up.

As was mentioned, my name is John Seymour. Here's my UMBC email, because this is part of my PhD work, and my Twitter handle; follow me and I'll tweet out some links after the talk so you can download some cool stuff. So who am I? I'm a PhD student at the University of Maryland, Baltimore County, and I'm also a data scientist at ZeroFOX, where I do a lot of security machine learning work. My interests are mainly in datasets: proving that they don't suck and can actually be used for learning something about whatever we're building machine learning models on.

We'll start today with a basic introduction for some of you on the security side: a very high-level description of malware classification in general, what works and what doesn't. Then we'll segue into labeling the VirusShare corpus, why that's good and how it helps; building an index, why we need an index in the first place and what that means, using a cool, sexy tool I wanted to learn about called PySpark; and we'll end with some pretty graphs, words of caution, and some useful extensions.

The main problem here is that there are more malware variants created every day than we could ever possibly analyze. There's this thing called polymorphic malware, which basically means you run the malware and it changes its SHA-256 hash, or whatever signature is being used to classify it.
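To make that concrete, here's a tiny sketch in plain Python (the sample bytes are made up) showing why exact-hash signatures lose to polymorphism: changing even a single byte yields a completely different SHA-256.

```python
import hashlib

# Hypothetical "sample" bytes; a polymorphic engine only needs to
# change one byte to defeat an exact-hash signature.
original = b"MZ\x90\x00...fake-sample-bytes..."
mutated = original[:-1] + b"X"  # flip just the last byte

h1 = hashlib.sha256(original).hexdigest()
h2 = hashlib.sha256(mutated).hexdigest()

print(h1)
print(h2)
print(h1 == h2)  # False: the hash signature no longer matches
```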
So we want to automate this process. How do we automate it? One thought is that maybe we can apply machine learning. So what is machine learning? At a high level, it's finding patterns in data. You have all this data and you want to find patterns in it, so you can say: hey, it's likely this thing is malicious, maybe we should check it out; or, this thing is obviously standard procedure, we don't need to look at it at all. The patterns we're looking for are what we call features, and we use models, which are statistical tools, to make sense of those features. There are of course libraries in every programming language you'd want to use; I love Python, I know a lot of people like R, and we just saw a talk that used Weka. Here are two links that I love for getting started on the data-science side of things.

Cool. If we want to do data science, we have to have data to actually work with, right? So where do we find data in the malware domain? Here are some of the best places I've found. I know there are lots of others, Contagio and things like that, but these are the ones I've found to be used most in industry and academia.

The first place, which I love, is malware-traffic-analysis.net. It's only got about 600 samples, so it's not very useful for full-scale, productionized machine learning, but it has a couple of really good points. First, there are lots of exploit kits in it, and exploit kits were all the rage a year or so ago. Second, it includes the analyses for every single sample: there's this guy who actually takes a pcap, takes the executable, uploads it, and writes down how he reverse engineered that executable. So you can go back, read it, and learn how to do all this reverse engineering yourself, which is very useful when you're creating models, because you need a little bit of domain expertise to know what's going on: why something calling out to, say, 192.168.0.1 or whatever is totally fine and not worth worrying about, things like that. So you can pick up a little domain expertise there. Then there's the Kaggle dataset.
For those of you who don't know, Kaggle is a data-science competition hosting site, and last year Microsoft hosted a competition on Kaggle for classifying malware. I'll get into it in a bit, but they released about 11,000 samples of malware, neutered a bit, which comes to about 500 GB on your hard drive. The task was: here are nine families of malware; figure out which executables in this unknown test set belong to which families. They distributed the data as hex dumps and disassembled files from IDA Pro, with all the PE headers removed so nobody could actually execute the files. Basically, people used standard natural language processing techniques and things like that to actually compete.

The next dataset I'd like to talk about is VX Heaven. VX Heaven has about an order of magnitude more samples than Kaggle had, and each file in VX Heaven is named with Kaspersky's antivirus label. For example, if Kaspersky labeled it a trojan, then it'll have "trojan" and a hash and so on in the file name, which is really nice because it's really well organized. But it was last updated in 2007, and if you know malware, things have changed a little bit since XP was still alive. So it's pretty stale, but it is the most used academic dataset, so it's nice to know about.

What we'll be talking about today is VirusShare. It has actually increased in size since I made this slide, but it has about 25 million samples, which is quite a bit. It's split into chunks, and this is actually kind of important, I'll get back to it in a minute: chunks of 65,536 samples each. Basically, chunk 0 has the first 65,536 files this guy found, chunk 1 has the next 65,536 files, and so on.
That actually ends up being pretty useful, but we'll talk about it in a minute. All the malware is available by torrent, and it's a simple password to get in. The guy who runs it is awesome; he didn't pay me anything to give this talk, but I like him a lot. The main issue with this dataset, though, is that it was unlabeled. As the last talk discussed, ground truth is really important when you're doing machine learning, but it's also the most expensive part; it's really hard to get. So what we've actually done is label this corpus.

I also want to shout out VirusTotal and Maltrieve, which both have methods for collecting malware as well; there's effectively no limit to how many pieces of malware you can grab from them. But there are a couple of issues with both. First, with VirusTotal you generally need the private API to do useful things like downloading malware, and (a) it's really, really expensive, and (b) there are licensing issues where you're not allowed to release raw data obtained from the platform. My research is into reproducibility of ML experiments and things like that, so it'd be nice to have an open dataset we can benchmark all our machine learning tools on. So VirusTotal is, I think, probably the wrong way to go for actually obtaining your dataset, though it's still really nice for its labeling, and obviously that changes if the licensing issues do. As for Maltrieve, it's basically a crawler that goes around and downloads all the malware you want, but what I've found in my own research is that when people collect their own datasets in that fashion, they end up overfitting pretty badly. There was a 2012 study from Adobe that actually went to Black Hat.
Basically they said: we have these nine features, and we can classify malware with them at 98% accuracy. It turned out the model was just overfitting to Microsoft-specific coding standards, so anything not written by Microsoft ended up being labeled as malware. These are the things we want to avoid, and this is how we do it.

So, some of the features that are commonly used in this domain. There's PE file metadata, which is basically the stuff inside your executable that tells Windows how to run it, how it should actually be interpreted by the machine. I'm not going to get too deep into it, but there are headers and sections, and people use these to grab features; it's rich with data, and there's a very good Python library called pefile which is excellent for scraping all of this out of your executables.

Another feature that's really commonly used in this domain is n-grams. Think of n-grams as a sliding window over text: you take every two-byte sequence in the executable, count them up, and use the counts as features. For example, in "dead beef", de ad, ad be, and be ef are all the two-byte n-grams, so your features would be those three n-grams with a count of one each. This is mainly what was used during the Kaggle competition, since they had stripped all the other useful information out of the files.

(Yeah? Did you make an outrageous speaker request? I did. What was your outrageous speaker request? No brown M&M's. No green M&M's? No green. All right, whatever, thank you so much.)

All right, cool, awesome. Some other features that are commonly used: opcodes, imports, assembly instructions, things like that; people are just trying to find data everywhere in these executables. If you'd like to learn more, there are a lot of references; come talk to me after the talk.
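The two-byte n-gram counting described a moment ago can be sketched in a few lines of plain Python, using the "dead beef" example from the slide:

```python
from collections import Counter

def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count every overlapping n-byte window in the input."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

# The "dead beef" example: the two-byte windows are de ad, ad be, be ef.
features = byte_ngrams(bytes.fromhex("deadbeef"))
print(features)  # each of de ad, ad be, be ef counted once
```

In practice you'd run this over whole executables and feed the counts (or their frequencies) to a model as a sparse feature vector.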
Finally, the top-performing models in this domain. Support vector machines are kind of old school at this point, but a lot of people still like to use them; they're robust against overfitting in certain ways, and they're nice. There's this newer thing called XGBoost, extreme gradient boosting of decision trees; it pretty much kills everything. If you know random forests or decision trees, think of XGBoost as a random forest on super-steroids. And finally, deep learning is a cool hot topic that people are starting to use in this domain too. There are Python libraries for all of these; I can't speak for Weka or R.

All right, so let's talk about labeling the VirusShare corpus and what went into it. My motivation: there's this Kaggle competition, and I'm trying to see whether the models from it overfit. So what I need to do is find, in some disparate dataset, a lot more samples of the families that were in the Kaggle competition. If we have a label like Ramnit, we want to be able to find lots more examples of Ramnit somewhere else, so that we can run the models built for the Kaggle competition on these other executables we find.
So first, we labeled this huge corpus using VirusTotal, so that we can use it for supervised learning and things like that, but also so that we can build a search index on top of it and find lots of a particular type of malware. That could be useful not only for machine learning, but I think also for pentesting and reverse-engineering practice, which is one of the reasons I'd like to talk to you folks after the talk.

So why did I choose VirusShare instead of one of the other corpora out there? First, obviously, the size: it's huge, 27 million samples, several orders of magnitude bigger than the Kaggle dataset. Furthermore, it's all full executables, not neutered at all. It's consistently updated; as I said, there are 20 more chunks of VirusShare malware up there since I started putting these slides together, so we can keep the dataset current. But really, what I want to do here is make future machine learning research more reproducible. I already talked about how, if you scrape your own data, you'll probably overfit and end up with one of those Titanic-level disasters from the keynote, and with VirusTotal you can't release any raw data from the platform. As a data-science paper writer, that sucks for me, because I normally have a methodology section in my papers and I need to say where I got my data; and it's really hard to download the dataset behind another academic or industry machine learning paper, because it's all secret sauce. VirusShare is really nice here: since it's split into chunks, I can just say, look, I downloaded all the Ramnit executables from chunks 25, 60, and 90, and that was the dataset I used for my machine learning corpus. That makes it really easy to reproduce the machine learning models other people build, and it eliminates a lot of the stochasticity that's just inherent in the machine learning process. If we actually start to reproduce machine learning research, we can figure out, hey, this machine learning model works, or doesn't work, in this other domain that I tried.

So we chose VirusTotal to label the executables, obviously because VirusTotal is one of the leading vendors in the space, but also because it has an awesome API, so we can do this programmatically.
We don't need to pay lots of malware analysts to go through every single executable in the VirusShare corpus and say, I think this is Vundo, or I think this is Ramnit, or I think this is Conficker, or some ransomware variant we don't know about, or whatever.

VirusTotal has two different types of APIs: a private/research API and a public API. The difference is that the public API is rate limited, while the private research API either costs money or comes with those licensing agreements under which you can't release raw data from the platform. If we labeled using the private API, we couldn't distribute our results for other people to actually use. So we ended up using the public API, and this slide is actually all the code needed to take in a batch of executables and label it with the VirusTotal API. (I'm going to wait... okay, cool.)

And this is what you get back. I've formatted it prettily here, but it's just one line of JSON: for each antivirus engine, you get back what that engine detected the executable as. These are basically going to be the labels we use as ground truth when working with this corpus.
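For reference, a minimal sketch of what that batch-labeling loop might look like, assuming the v2 public API file-report endpoint and pacing requests to respect its rate limit (roughly four per minute); `api_key` and the hash list are placeholders, and this is not the speaker's actual code:

```python
import json
import time
import urllib.parse
import urllib.request

VT_URL = "https://www.virustotal.com/vtapi/v2/file/report"

def build_request(api_key: str, sha256: str) -> str:
    """Build the URL for a file-report lookup on the v2 public API."""
    params = urllib.parse.urlencode({"apikey": api_key, "resource": sha256})
    return f"{VT_URL}?{params}"

def label_batch(api_key, hashes, delay=15.0):
    """Yield (hash, per-engine labels), sleeping between requests
    so we stay around 4 requests/minute on the public API."""
    for h in hashes:
        with urllib.request.urlopen(build_request(api_key, h)) as resp:
            report = json.load(resp)
        # 'scans' maps engine name -> {'detected': ..., 'result': label, ...}
        labels = {eng: s.get("result")
                  for eng, s in report.get("scans", {}).items()}
        yield h, labels
        time.sleep(delay)  # public-API pacing
```

The `scans` dictionary is where the per-engine labels shown on the slide come from; each `result` string is one candidate ground-truth label.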
Generally, antivirus labels are pretty inconsistent, but this is definitely a step in the right direction.

Using the public API, it actually took 30 people around six months to label the entire corpus, because of the rate limiting. We used a lot of undergraduates from UMBC who wanted extra credit, which was nice, and some people from the MLSec Project helped out and beta tested the tools, which was really cool too. This is all the labels from chunks 0 to 233, I think; there have been more since then, and I'll release those after DEF CON when I actually get access to Wi-Fi again. It's about seven gigs compressed of just those JSON lines you saw earlier, and it's really easy to look through and find things.

But what we want is a tool where we can say: hey, we want to find lots of Ramnit instances. What we have right now says: in chunk 0 we have, you know, 166 of Win32 HfsAdware or whatever, we've got some PUPs, things like that. What we really want is to take a label we're looking for and ask how many are in each chunk. That makes it a lot easier to search over this dataset: if we only wanted malware in chunks 4, 60, and 90, we'd just grep through it and ask what the counts are in chunks 4, 60, and 90. This is called an inverted index, and the way you build it is basically by counting things. Apparently the MapReduce framework is awesome for this; in fact, the introductory PySpark tutorial is exactly this problem, so it works out pretty well.
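The counting at the heart of that inverted index doesn't strictly need Spark; here's the same logic as a plain-Python sketch, with hypothetical (chunk, label) records standing in for the parsed VirusTotal lines:

```python
from collections import defaultdict

def build_inverted_index(records):
    """records: iterable of (chunk, label) pairs, one per detection.
    Returns label -> {chunk: count}, i.e. the inverted index."""
    index = defaultdict(lambda: defaultdict(int))
    for chunk, label in records:
        index[label][chunk] += 1
    return index

# Hypothetical toy records standing in for parsed VirusTotal labels.
records = [
    ("000", "Vundo"), ("000", "Vundo"), ("001", "Vundo"),
    ("001", "Ramnit"),
]
index = build_inverted_index(records)
print(dict(index["Vundo"]))  # {'000': 2, '001': 1}
```

In PySpark this is essentially the word-count pattern: map each record to ((label, chunk), 1) and reduce by key, which is why the tutorial maps onto the problem so directly.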
(I especially liked that, since I wanted to learn PySpark.) This ends up being the entire script, which is really nice: it's about 20 lines of code, everything else is just formatting, and it's really easy to learn and use. It's actually more work to install PySpark than to use it, which is pretty cool.

This is what the inverted index looks like when we finish. You can see it's a CSV with the label and how many there are in chunk 000, 001, etc. It ends up being really easy to use: you can just grep for Vundo and you get all the results for Vundo, how many are in each chunk, and you can even see by inspection that certain chunks have a lot of Vundo; if we want Vundo, we should download those chunks. The index also lets us easily explore the data. Really quickly, in like two minutes, I threw together the most frequent malware in the most recent chunks, and you can see there's about a thousand of a riskware adware screensaver, 132 Eldorados, and so on; you can find that out in less than two minutes, which is pretty nice. And you can also make a lot of pretty graphs.
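The "grep for Vundo, then eyeball which chunks to download" step can be sketched as a tiny query over that CSV; the header and row layout here are my assumption of the index format (one row per label, one column of counts per chunk):

```python
import csv
import io

def top_chunks(index_csv: str, label: str, k: int = 3):
    """Return the k chunks with the most hits for a label, assuming
    rows like: label,<count for chunk 000>,<count for chunk 001>,..."""
    reader = csv.reader(io.StringIO(index_csv))
    header = next(reader)            # e.g. ['label', '000', '001', ...]
    for row in reader:
        if row[0] == label:
            counts = zip(header[1:], map(int, row[1:]))
            return sorted(counts, key=lambda c: -c[1])[:k]
    return []

# Hypothetical miniature index.
index_csv = "label,000,001,002\nVundo,12,0,7\nRamnit,1,5,2\n"
print(top_chunks(index_csv, "Vundo", k=2))  # [('000', 12), ('002', 7)]
```

That sorted list is exactly the "which chunks should I torrent" answer the talk describes reading off by inspection.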
One of the cool things with VirusShare is that he actually posts each chunk temporally: chunk 0 first, chunk 1 second, and so on. This means we can use the chunk number as a proxy for time, and see how many instances of Vundo this guy collected over time. So here's a graph, again made in less than two minutes, of how many instances of Win32 Vundo he collected; you can see there's a spike there and a dip here. There are a lot of assumptions baked into this sort of temporal plotting: (a) that the chunks are a decent proxy for the time period, and (b) that the number of instances this guy collected is actually proportional to how many were floating around in the wild. But as long as you're aware of those assumptions, it's useful, and we'll talk about how to fix them in a second.

Now, some words of warning. First off: don't use this to compare antiviruses. That's one thing I definitely want to say, just to try to get VirusTotal off my back.
They have a big disclaimer about this on their site, and there are a couple of reasons I've thought of for why. I think the main one is how samples get submitted back to VirusTotal. Vendors choose to do that in different ways: one way is to basically give your product to VirusTotal and say, run it whenever you run into something new; but the main way most vendors work is that when they encounter something in the wild, they send it back to VirusTotal with the correct label. This means different vendors run into different slices of the malware out there, if that makes sense. For example, at a social media company you might find links on social media that, say, Kaspersky might not, or something like that.

Another thing to keep in mind when using this sort of index: the ground truth is very, very noisy. First, we've already seen that antivirus labels are sometimes different even though the specimens are similar, just from that Vundo index: all of those are part of the Vundo family, but as soon as the signature for Vundo.A no longer works, the vendor increments it, and now Vundo.B is a "new" specimen even though it's a very similar one. But antivirus labels are also sometimes similar even when the specimens are different. Here's an example of that: lots of different trojans, and I have absolutely no idea whether they're actually similar executables or not based on the name. Something like "Heuristic.LooksLike.Trojan" is basically a label telling you "I don't know, maybe a trojan."

So, some useful extensions. One thing: remember that graph where we used the chunk number as a proxy for time?
If we wanted to do that right, what we'd want is when each specimen was first seen, rather than just how many this guy collected over time. Unfortunately, first-seen dates are only available on the private research API, so we wouldn't be able to release that to the public. Finding a way to do it anyway would definitely be nice.

The other major useful extension I can see is some sort of stemming. Stemming and lemmatization in machine learning are basically ways of collapsing words with similar roots together, so that Generic.Vundo.A and Generic.Vundo.B would both be called one label: Vundo. That has some issues, though, especially when you have things like Trojan.ABCDEFG and Trojan.ABCDEFGH, where the compression lands right on the incremented counter itself, which is kind of funny. But if we could somehow compress those labels, it would help with the problem of antivirus labels being different even though the specimens are similar, which would be really nice. And since antivirus labeling is so inconsistent, it would also help us get higher granularity on our ground-truth labels.
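A naive version of that stemming might look like the following regex sketch; the suffix patterns are my own guess at common variant-counter conventions, not anything standardized, and as the talk notes they collide with labels where the counter is embedded in the name itself:

```python
import re

# Strip trailing variant counters like ".A", ".gen", "!x", "-1234".
# These suffix patterns are a guess at common AV naming, not a standard.
_VARIANT_SUFFIX = re.compile(r"(?:[.!-](?:[a-z]{1,4}|\d+))+$")

def stem_label(label: str) -> str:
    """Lowercase an AV label and strip trailing variant counters."""
    return _VARIANT_SUFFIX.sub("", label.lower())

print(stem_label("Generic.Vundo.A"))  # generic.vundo
print(stem_label("Generic.Vundo.B"))  # generic.vundo
print(stem_label("Vundo!gen"))        # vundo
```

Note the failure mode the talk warns about: "Trojan.ABCDEFG" vs "Trojan.ABCDEFGH" won't collapse here, because the counter isn't a short delimited suffix, so a purely syntactic stemmer can't recover it.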
I know I blazed through that, but if there are any questions, feel free to step up to the mic.

[Inaudible question from the audience about running the models back over this data.] So, I have not yet; that would definitely be a cool extension, and I would love to do that. [Inaudible follow-up.] Yeah, I would definitely like to look into that too. Any other questions?

With the names, did you consider splitting the names at the breaks and analyzing them that way, since most of these are divided by a period or a slash or something? So, with most of the ways I've tried to split naively so far, I've run into issues where there are all these exceptions that don't fit the mold, kind of like the malware classification problem in general, and there's just not enough time in the day to go through and fix all those exceptions. So I'm actually looking for some sort of statistical technique that might help.

One of the things I've tried, and maybe you've tried it better, is looking at the multiple names for the same malware, building the relationship graph between the names, and then trying to get clusters of names that are all generally the same piece of malware. There'll be overlap and things like that, but you can try to use it to group names in a way that helps you get over the one-two-three-four-five problem. Yeah, so that's actually the approach I'm thinking of taking in the next few months. I think it's probably the best way to go, but it ends up being really expensive for me, so I'm going to try it and see how it works out.

How did you handle the cases where VirusTotal labeled things as two different virus families, or multiple families that don't share similar code bases?
Right, so right now I release the labels with all of the different engines; it's literally that LDJSON, so you can actually build the index on even a single antivirus or whatever subset you like. I haven't really found a good solution for that actual problem yet; again, antivirus labels are very, very noisy, so I'm open to suggestions. What I can say right now is that I've built the index with all of them, because my major use case is just finding lots of, say, Vundo malware, and if I end up with a chunk that's not exactly optimal because two different antivirus vendors labeled it differently, that's not a major problem for me.

Have you considered running the data through a relational database as well, to see where other clusters might exist that don't readily come to the surface? That would be a really cool idea; I'd love to do that. It's a little further along the labeling process than I intended to go, but it's sounding like I probably should. That's actually a really good idea.

And have you used something like frequency analysis, language frequency on the names it's giving back to you, to see whether that provides any sort of analytics?
So, I have tried counting different labels, different substrings within the labels, and things like that. I still run into the issue where there are all these exceptions, where something like the ABCDEFG part ends up being the string that occurs most frequently (not in that exact example, but that kind of thing). With all these exceptions, I'd have to use something like Amazon's Mechanical Turk to get rid of them.

And would you mind sharing how you did in the Microsoft Kaggle competition last year? We tried very hard, but we were only 44th. Actually, I did not spend much time on it at all. I had a few friends; in fact, Gabe, were you on my team? Yeah, in only the most technical sense, because we got on it and then things got busy and my available time disappeared. Yeah, exactly; and I was presenting my Master's project and things like that, so we didn't try as hard as I would have liked. If it came out again, maybe.

Cool, is that all the questions? Anybody else? All right. Again, I'd like to talk to malware analysts and reverse engineers; come up front, or outside, and we'll figure out a space. Thank you for having me. [Applause]