
GT - Building a Benign Data Set - Rob Brandon & John Seymour

BSides Las Vegas · 19:26 · Published 2017-08
About this talk
GT - Building a Benign Data Set - Rob Brandon & John Seymour. Ground Truth track, BSidesLV 2017, Tuscany Hotel, July 26, 2017.
Transcript [en]

All right, so greetings, everybody. Who we are: I'm Rob Brandon. I'm a security researcher with Booz Allen Hamilton Dark Labs, and I just graduated from the University of Maryland, Baltimore County a couple of months ago, where I've been working with John Seymour for the last few years on this topic. Yep, and I'm John Seymour. I wear two hats right now: I'm a senior data scientist at ZeroFOX, which is a social media security startup, and I'm also a PhD candidate at the University of Maryland, Baltimore County. So, just a quick overview of what we're going to be talking about today.

The one-minute version has two key takeaways. First, negative examples in machine learning are just as important as the positive examples; those are the things that aren't what you're trying to detect. When you're doing machine learning, it's important that you teach the classifier what it's not looking for as much as you teach it what it is looking for. Second, in order to do any kind of reproducible machine learning, we need large, representative, and diverse data sets of benign data, and for the most part those data sets don't exist today. So, some of the motivation, and a quick overview of how machine learning works.

I like to think of machine learning algorithms as being a lot like little children: they're very eager to please authority figures such as teachers. For a machine learning algorithm, its authority figure is going to be its loss function; it really, really wants to make that loss function happy. But it doesn't have any understanding of the context for why it's doing its task. A machine learning algorithm that's classifying hot dogs versus not hot dogs has no idea why you're trying to teach it about hot dogs; all it knows is that you really, really want it to classify these things as one category and everything else as the other. So in order to do that task, it's going to build some kind of internal model of the world, which is basically its concept of reality.

That model is going to be constrained by the data that you give it to learn from. The only thing it knows is what you tell it; it doesn't have any kind of outside understanding of human motivation, or of the world in general beyond what you show it. So when you give it some kind of task, and you're telling the machine what you want it to do, it can come up with some really amazing, creative ways to model the data and represent the world in order to accomplish that task. On the other hand, sometimes it can also fail very spectacularly.

Usually that's going to be because the data it was given doesn't really match the data you actually want to work with; in other words, the training data doesn't match the real-world problem data. So here's a real-world example of the kind of area this might show up in. A lot of people like to have a bowl of M&Ms on their desk, and people being the kind of entity they are, somebody might come along and throw a few Skittles in with the M&Ms, just to surprise them a little bit.

So somebody might want to create a machine to go through their candy: I want to dump my bowl of candy in here and end up with a pile of Skittles on one side and a pile of M&Ms on the other. The way the machine is going to be trained is that you hand it a bowl of candy and say, okay, take a piece out and put it in a pile, and I'll tell you whether you put it in the right pile or not. So you might have a pile for Skittles and a pile for M&Ms. Now say your bowl has nothing but green Skittles, plus all the M&M colors other than green.

The machine picks something out; if it's a Skittle, you say to put it in that pile, and if it's an M&M, you say to put it in the other pile. Fairly quickly your algorithm is going to decide: anything that's green is a Skittle, and anything that's not green goes in the non-Skittle pile. And it's going to work great on the bowl that you've shown it. It's going to happily say, hey, I know everything about the world: Skittles are green, M&Ms are not green. Then you hand it a real bowl, where you've got all colors of Skittles and all colors of M&Ms, and it's going to look at it and say, I have no idea how to handle this problem. That's why the data set problem is so crucial in machine learning.
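To make that failure concrete, here's a toy sketch of the candy sorter; it assumes scikit-learn is available, and the color codes are invented for illustration.

```python
# Toy version of the biased candy sorter (assumes scikit-learn is installed).
# Feature: a single made-up color code (0 = green); label: 1 = Skittle, 0 = M&M.
from sklearn.tree import DecisionTreeClassifier

# Biased training bowl: every Skittle is green, and no M&M is.
X_train = [[0], [0], [0], [1], [2], [3], [4]]
y_train = [1, 1, 1, 0, 0, 0, 0]

clf = DecisionTreeClassifier().fit(X_train, y_train)

# Real bowl: Skittles come in every color too.
X_real = [[0], [1], [2], [3]]  # green, red, yellow, orange Skittles
print(clf.predict(X_real))     # [1 0 0 0]: everything non-green is called an M&M
```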

All right, cool. So we'll go ahead and talk through some of our favorite pitfalls in data set creation, and I'm going to go through each of these in a little bit of depth. The first is selection bias. Everybody here has probably heard the textbook example: a study on, I don't know, coffee consumption, where they interviewed everybody from the college that's right down the road, because nobody can really think of any reason

why a college student might drink coffee differently than an average person. We actually see this in a lot of different ways in information security. For example, if you're running a honeypot during the entire WannaCry outbreak, you're probably going to get a lot of WannaCry samples, and that's not going to be representative of the entire malware problem as a whole. There's also something called capture bias, where you preprocess your instances in such a way that it actually changes the problem a little bit. An example of this: say you're trying to create an image classifier to determine whether a picture is of a coffee mug or something else, and you go to Google and search for coffee mugs.

Almost every single coffee mug you'll get on Google Image Search is centered in the picture with the handle on the right side. So when you build a classifier, it's going to pick up on these things, and a mug with the handle on the left side might not actually be classified as a coffee mug. This can also be seen in information security: say you're neutering your malware before you actually run your data science problem on it, or you run it through something like IDA Pro, where your analysts are making annotations inside the malware itself.

If your model is actually using that extra information and making assumptions about the real world from it, those assumptions might not actually translate to the real world. The type of bias that we're most concerned about right now, which is actually sort of a special case of the selection bias we were talking about earlier, is negative set bias. What we see in information security is a lot of effort being put into finding interesting things, for example malware or bad network traffic, and then when people realize they actually need two classes in order to create a binary classifier, they just grab some Windows executables, or use the Alexa top 10,000, or whatever.

We're going to argue that a lot of these very simple approaches are actually not so beneficial: they're not representative of benign software in general, or of benign URLs, and we have a case study in a minute where we'll talk about that. The final type of bias that we're really interested in here is called category bias, or label bias. Oftentimes labelers disagree on the actual definition of what's interesting. Like in the tweet linked here about frozen water: is it a liquid, is it a solid? We don't know; science doesn't know. This happens all the time as well.

If you're creating a benign data set, everybody has to agree that every single sample in that data set is actually completely benign. And in some other contexts, like malware families, a lot of antivirus vendors don't agree on which family a particular sample belongs to.
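As a rough illustration of how that disagreement gets resolved in practice (tools like AVclass automate roughly this idea), here's a hedged sketch with hypothetical vendor labels:

```python
# Hypothetical AV vendor labels for one sample; a majority vote over the raw
# names is a common (imperfect) way to assign a single family label.
from collections import Counter

vendor_labels = {
    "VendorA": "zeus",
    "VendorB": "zbot",            # same family, different naming scheme
    "VendorC": "generic.trojan",  # effectively no family information
    "VendorD": "zeus",
}

counts = Counter(vendor_labels.values())
family, votes = counts.most_common(1)[0]
print(family, f"({votes} of {len(vendor_labels)} vendors)")
# Only 2 of 4 vendors "agree", and even that undercounts: zeus and zbot are
# the same family under different names.
```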

What we've been finding is that most prior work uses data sets that are scraped together from the authors' own networks, or industrial proprietary data sets that can't be shared with the research community, or they sample from larger data sets like VirusTotal without actually saying how they sampled. All of these contribute to a much larger problem of reproducibility: it's really hard to test for these simple pitfalls if you don't have access to the data set, and you can't actually reproduce the results in the literature. Here's an example, one that was actually a positive for me. There's a 2012 paper, presented at Black Hat USA as well as InfoSec Southwest, that claimed: we found seven features that classify malware with 98% accuracy. First off, I do want to say this was a good paper, because it explicitly said: here are our malware samples, here's where we got them, you can download them; here are our benign samples, here's where we got them, you can download them; here are the features we used; and here's some analysis on why we think those features make sense.

But for their benign set, they only used Windows PE files from a clean Windows install. So what ended up happening was that the model was actually keying on the presence of debug information: when a program crashes, Microsoft has decided it might be useful to have some information about why it crashed. That's not representative of all benign programs, though; lots of programs don't have debug information, don't have debug tables.
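That spurious feature is easy to measure yourself. Here's a minimal sketch using the third-party pefile library; the file path is just a placeholder:

```python
# Check whether a PE file carries a debug directory: the kind of artifact a
# clean-Windows-only benign set can teach a model to equate with "benign".
import pefile  # third-party: pip install pefile

def has_debug_info(path):
    pe = pefile.PE(path, fast_load=True)
    pe.parse_data_directories(
        directories=[pefile.DIRECTORY_ENTRY["IMAGE_DIRECTORY_ENTRY_DEBUG"]]
    )
    return hasattr(pe, "DIRECTORY_ENTRY_DEBUG")

# Placeholder path: Microsoft-shipped binaries usually return True, while many
# Cygwin or SourceForge binaries return False.
print(has_debug_info("C:/Windows/System32/notepad.exe"))
```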

So what we did was go out and grab some other benign samples from places like Cygwin, package managers, and SourceForge, and this model completely failed on the new instances: it got 0.5% accuracy, and I think it actually broke on several of the samples. So we need to actually focus on the benign data set. The earliest conference presentation I saw that talked specifically about data sets in information security, and about some of the things we actually need, was at CSET 2009. It gave some guidelines on what we'd actually like out of a benign data set. We need data sets that adapt over time: if you just have your 2012 snapshot of malware, and in 2020 everybody's still using it, it's not going to be useful anymore, because the landscape has changed.

We also want something that other researchers can actually use. It's no fair if I'm some big bad mean company reporting results from my own proprietary data set that no one else can reproduce, because they don't have the amount of data, or the type of data, that we have. And this is compounded in the benign software case, because benign software is something people write to make money, so they don't necessarily like people just training on their software all willy-nilly without paying anyone anything; there are a lot of licenses and issues involved.

That's another consideration we have to address in this space. Finally, going back to all the types of bias we were talking about a few minutes ago, we'd like something that's representative of benign software in general. That's kind of a moving target, and it's hard, but we're trying to make progress toward that end. So I'm going to talk a little bit about some of the common sources of bias that we find in executable software. Just within compiled C binaries, there are a lot of potential sources of bias that aren't necessarily accounted for in a lot of the data sets or research that's out there.

One example is programmers: everybody writes code differently. If you tell two people to go write a quicksort, they're going to write it differently. Another source of bias is the compiler: different compilers, given the same source code, will output different binary code. For example, say you want to zero out the EAX register; one compiler might do a move of zero into EAX, and another might XOR EAX with itself. That's a case where you have two instruction sequences doing the same thing, and it's another place where bias can creep in.

Conversely, if your data set only includes code from one compiler, and your malware was built with totally different compilers, you end up with a classifier that basically classifies compilers rather than benign versus malicious PEs. Another area is optimization settings. Depending on the optimization setting, the same source code can compile totally differently, even within the same compiler. It might decide, for example, that this recursive function you've got can be implemented as just a loop, or it might say, hey, this function is used in multiple places, but it's really simple, so I'm just going to inline it so you don't have the overhead of a function call. That code is going to look totally different from the non-optimized code.
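You can see this effect directly by compiling the same function under different optimization levels and hashing the results; a small sketch, assuming gcc is on your PATH:

```python
# Compile one C function at several optimization levels and hash the objects;
# the hashes generally differ even though the source is identical.
import hashlib, os, subprocess, tempfile

SRC = "unsigned fact(unsigned n) { return n < 2 ? 1 : n * fact(n - 1); }\n"

def build_hash(flag):
    with tempfile.TemporaryDirectory() as d:
        src, obj = os.path.join(d, "f.c"), os.path.join(d, "f.o")
        with open(src, "w") as f:
            f.write(SRC)
        subprocess.run(["gcc", flag, "-c", src, "-o", obj], check=True)
        with open(obj, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()[:16]

for flag in ("-O0", "-O2", "-Os"):
    print(flag, build_hash(flag))  # -O2 typically turns the recursion into a loop
```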

And that's just the stuff that goes into the code section of the binary. If you look at the binary as a whole, it gets even harder. You have material introduced by the linker: for example, Visual Studio will introduce something called a Rich header, whereas other toolchains won't; GCC produces totally different resource and data sections compared to Visual Studio; and Borland will give you something different again. So you're going to get a lot of artifacts that you aren't necessarily going to know are there, just based on the different tools that were used.
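One of those artifacts is easy to spot with a byte scan. This rough sketch looks for the Microsoft linker's "Rich" marker in the DOS stub; it's a heuristic, not a full Rich header parser:

```python
# Heuristic check for the "Rich" marker that the Microsoft linker leaves
# between the DOS stub and the PE header; other toolchains don't emit it.
import struct, sys

def has_rich_marker(path):
    with open(path, "rb") as f:
        head = f.read(0x1000)
    if len(head) < 0x40 or head[:2] != b"MZ":
        return False
    e_lfanew = struct.unpack_from("<I", head, 0x3C)[0]  # offset of "PE\0\0"
    return b"Rich" in head[0x80:e_lfanew]

for fname in sys.argv[1:]:
    print(fname, has_rich_marker(fname))
```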

As one small attempt to start solving this, I'm releasing a data set I'm calling the Multiple Architecture Machine Language (MAML) data set. Right now it's only 32-bit PE and ELF files. The original intent wasn't so much malware versus non-malware; it was mainly for doing code analysis, so building models of x86 code rather than models of whole executables, but it's still a good start. It contains files from Windows 10, Arch Linux, and ReactOS. This was actually one of the best uses of ReactOS I've seen, because ReactOS is fairly unique in that it gives you build scripts for both Visual Studio and plain GCC, so you can take the same code and build it with multiple compilers and optimization settings, and get some good ground truth on how each of those things actually affects the code and affects your model.

Unfortunately, I'm not releasing the Windows 10 executables; I did email Microsoft, and they never got back to me to say it was okay to release them, so this is the data set minus the Windows binaries. Really, this is just a start. Like I said, there's still a significant amount of bias if you use these full executables for a malware versus non-malware classifier: it's still built with a limited set of compilers, and, for example, some of the debug information is going to point back to the same build machine and user it was built with.

So you're really going to want to supplement your data set with some other sources if you're going to use this for infosec-type problems. John's going to talk a little more about some of those other sources. Yep. So we're looking to extend this; we've been working pretty hard at it. Here are some of the places we've been looking to grab more benign samples. There are actually a lot of good Windows package managers, like Chocolatey and OneGet. We've already been using utilities like Cygwin and PuTTY. There's Ninite.com, of course, which a lot of people use to download a lot of software and keep it up to date.

Unfortunately, we can't necessarily just release these things, again because of that licensing issue, but we're going to try to make it easy, so people can go, okay, I want to download a benign data set, where can I grab stuff? And then the final thing: there's this list of hashes of basically old software that NIST has created, called the NSRL, and there are a lot of places where you can go to download those applications. So another way to grab benign data is to try to collect all the data sources from that list. Unfortunately, NIST doesn't make the actual software available to download, so it's kind of a roundabout way, but we've been working on that front as well.
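If you do pull the NSRL hash list, membership checks are straightforward. A hedged sketch, assuming you've downloaded an RDS release where SHA-1 is the first column (verify against your copy):

```python
# Check local files against an NSRL RDS hash list (NSRLFile.txt from NIST).
# Column layout varies by release; here SHA-1 is assumed to be column one.
import csv, hashlib, sys

def load_nsrl_sha1(path):
    known = set()
    with open(path, newline="", errors="replace") as f:
        for row in csv.reader(f):
            if row and len(row[0]) == 40:  # SHA-1 hex digest
                known.add(row[0].upper())
    return known

def sha1_of(path):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest().upper()

known = load_nsrl_sha1("NSRLFile.txt")
for fname in sys.argv[1:]:
    print(fname, "in NSRL" if sha1_of(fname) in known else "not in NSRL")
```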

So that's basically our presentation. Ultimately this is a hard problem, and one or two people aren't going to be able to solve it alone, so this is really going to take a community effort: getting behind it and really starting to look at how we build a good reproducible data set, how we store it, maintain it, and augment it, because software is a moving target.

The code that's written today is not going to look like the code that's written next year, or five years from now. So: any questions?

How are you handling packed samples, you know, packers that are used, like zip or non-zip ones, or even harder ones? [inaudible] So, the question of packing is one that I've thought a good bit about. Right now we're not doing anything with packing, because the intent of MAML was to deal with code and functions rather than with malware versus non-malware. My personal opinion is that if I were building an actual malware versus non-malware classifier, I'd want to take the benign set and pack it with the packers that I thought would be used on benign software, so that the model isn't just picking up on packing.

On the other hand, there are some packers that are really only used for malware. But I think that's also a really interesting and unexplored question, and it's going to require building that benign data set in order to look at the prevalence of various packers in malicious versus benign software. The data sources you used were mostly Unix- or Linux-based and Windows-based, right? Did you use any Mac data sources, or anything like the Mac package managers? So, we haven't; that's something that's on the list of things to look at at some point in the future, but most of what we were interested in was just proving the concept in general. Yeah, absolutely.

Cool.

Okay, thank you very much. [Applause]