
thank you very much. Can the people in the back hear me all right? Okay, great. So I'm really excited to be here. This is my first time in Norway and also my first time at a security conference, and I'm going to be talking to you about what's going to happen to all of us when our AI overlords take over, or, as it was called in the previous talk, the AI nonsense. A little bit about me: I work on the data team at a company called Elastic. The data team is part of the bigger machine learning group, and we do a lot of data analysis. That's my day job, and at night I wear a giant bat costume and fight crime. Not really, only in my dreams. By night I do some reverse engineering on the side, and sometimes I look into interesting things. As always, the opinions presented here are my own, and the mistakes are definitely my own, so if you see anything wrong, please shout out.

Right, so I'm going to take you back to a nice warm day in June 2012, when humanity solved one of the most important problems we've ever had: how to identify cat faces in YouTube videos. This was a piece of research published by Google, where they trained a machine learning algorithm, a neural network, to identify cat faces, and I think the data set they used was YouTube videos. After this monumental discovery, where humanity was forever changed, a lot of people became very worried, because all of a sudden computers were starting to get good at the things that we were good at, namely watching cat videos. I typed this into Google, and this is what people are worried is going to happen: what is our world going to look like when automation replaces all of us? That was some people's worry. Others were very happy and ready to welcome our AI overlords, for example into our ovens. So it was quite a
surprise to me that a lot of people are very excited about putting an AI algorithm into their oven to find the optimal cooking temperature for muffins. Personally, I would be a bit scared to put software into my oven, because those things get very hot, and I'm dreading the bug where someone forgets to set a safe temperature limit. Anyway, there's an oven called June, it's already in its second edition and it's sold out, so apparently it's very popular. You can also use artificial intelligence to brew beer. I haven't tried it, but maybe it's good. And some people even say that in the future, when AI algorithms are doing all the work for us, our society and the nature of work will be transformed, and, as Aaron Bastani argued in the book he published, we'll be living in fully automated luxury communism, which inspired the title of my talk. So I was wondering: in this future where our AI overlords have taken over, what will happen to malware analysts? Will they just be pushing buttons while the AI algorithms perform all the important work?

As you know, a lot of this is hype, and I want to take a step back and look at the landscape from 10,000 feet to understand what's beneath the buzzwords. Throughout history, ever since we started trying to develop artificial
intelligence, there have been several different techniques that people have used, and the one that's been gaining popularity at the moment, and probably for the past 30 years, is a set of techniques known as machine learning, where we use various statistical algorithms to produce computer programs that learn from data without being explicitly programmed. A prime example is something all of us benefit from every day, probably one of the first successful applications of machine learning to cybersecurity: spam filtering. If you think about this problem, this is a computer program that would be very hard to write using traditional approaches. You couldn't just create a rule-based algorithm that tries to match all the possible words that could occur in spam, because if you were just filtering out all the emails containing the word "enlarged something", spam writers would try to outsmart you by, say, substituting numbers for letters, and then your program would break and you would have to rewrite it. But with machine learning, your algorithm automatically adapts to changes: it starts to associate the new word with a higher probability of being spam, and so it effectively retrains itself. This was definitely one of the first successful use cases of machine learning for security.
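To make the spam example concrete, here is a minimal sketch of the naive Bayes approach such filters typically use, on made-up toy messages (everything here is illustrative, not any real filter's implementation). The point is the adaptation described above: retrain on new labeled mail, and the word probabilities shift automatically.

```python
import math
from collections import Counter

def train(messages):
    """Count word occurrences per class from (text, label) pairs."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for text, label in messages:
        counts[label].update(text.lower().split())
        totals[label] += 1
    return counts, totals

def spam_probability(text, counts, totals):
    """Naive Bayes: P(spam | words), with add-one smoothing."""
    vocab = set(counts["spam"]) | set(counts["ham"])
    scores = {}
    for label in ("spam", "ham"):
        n = sum(counts[label].values())
        # log prior + sum of log likelihoods for each word
        score = math.log(totals[label] / sum(totals.values()))
        for w in text.lower().split():
            score += math.log((counts[label][w] + 1) / (n + len(vocab)))
        scores[label] = score
    # convert log scores back to a normalized probability
    m = max(scores.values())
    exp = {k: math.exp(v - m) for k, v in scores.items()}
    return exp["spam"] / (exp["spam"] + exp["ham"])

training = [
    ("buy enlargement pills now", "spam"),
    ("cheap pills buy now", "spam"),
    ("meeting notes for tomorrow", "ham"),
    ("lunch tomorrow with the team", "ham"),
]
counts, totals = train(training)
print(spam_probability("buy pills", counts, totals))   # high
print(spam_probability("team lunch", counts, totals))  # low
```

Retraining is just calling `train` again with the newly observed messages appended, which is exactly the self-updating behavior a rule list cannot give you.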
Now, if you look at the general landscape of machine learning, we can roughly break it up into three types. There is supervised machine learning, where you train your algorithm to classify data points into known classes or to predict some value. For supervised use cases you need to provide a labeled data set; for example, if you have a labeled data set of binaries that are malicious or benign, you could use it to train a supervised algorithm. The second big family of machine learning algorithms is called unsupervised. Here you're not using any labels; instead, you want the algorithm to tell you something interesting about the data. Clustering is the prime example: you give your algorithm a big chunk of data and tell it to group those data points into clusters that might be useful for you, say, grouping your malware samples into different families. The third area is called reinforcement learning, and I haven't seen much of that in cybersecurity applications. Here you have an intelligent agent that makes decisions and gets feedback from its environment so it can learn more. As I mentioned, for cybersecurity I'm going to focus mainly on the supervised use cases. Here your algorithm takes data points labeled with known classes and learns something about their distributions, so that when a new, unseen data point arrives, it can see that the point lies closer to one cluster and therefore must belong to that class. Likewise, in unsupervised learning you just have a big blob of data, and the algorithm tries to figure out how many clusters you have. This picture is a sample from a very popular open-source machine learning library called scikit-learn, where different clustering algorithms have been applied to the same data sets, each producing different clusters.

In both of these classes, supervised and unsupervised, there are many different algorithms you could choose, and there is a concept in machine learning called "no free lunch", which means there is no single algorithm that will perform better than all other algorithms on all problems. You can't just walk up and say, okay, I will always be using support vector machines because they are the best; that's not the case. Instead, you have to do a lot of experimentation, and this is where part of the so-called dark art of machine learning and data science comes in: a lot of trial and error to really get this right. This is also where someone like me, who mostly works with data, can learn from someone like you, who works in information security, because algorithms are context-agnostic. They don't care whether the data point they're classifying comes from a malware sample, an intrusion detection system, or, say, customer data from a marketing department; all they care about are the numbers coming in. So it's really up to the malware analyst and the data scientist to collaborate on a good way to represent malware samples, so that we actually work on problems that are meaningful.
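As a sketch of the unsupervised side described above, here is a minimal k-means clustering loop in plain Python. It is a toy stand-in for the scikit-learn algorithms mentioned, and the 2-D points are made up: two obvious blobs that the algorithm has to rediscover without labels.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: group points into k clusters, no labels needed."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # update step: move each centroid to the mean of its cluster
        for i, c in enumerate(clusters):
            if c:
                centroids[i] = tuple(sum(x) / len(c) for x in zip(*c))
    return centroids, clusters

# two well-separated blobs, e.g. "small" samples and "large" samples
points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

On real data the clusters are rarely this clean, which is exactly where the no-free-lunch experimentation comes in: a different algorithm (density-based, hierarchical) may suit a different data shape.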
This brings us to the next big problem when applying machine learning to cybersecurity. Say you want to do classification on binaries: you have a set of binaries coming in through your honeypots and you want to determine which are malicious and which are not. How do you actually take a binary and transform it into a data point? What kinds of things do you compute about it? Do you take the virtual size? The number of strings? This is what we call feature engineering: you want to compute statistics about your binaries that give you a good separation between the malicious and benign classes. Suppose you said, okay, I'm going to use the size of the binaries as my feature; if the distribution of sizes is the same in the malicious and benign classes, you won't get very far. So what have people used in the past when classifying binaries as malicious or benign? I've looked at a lot of literature surveys going back to 1995, and some things people have used are, for example, function names, the names of libraries that binaries import, byte histograms, sequences of function calls, and the strings present in binaries. Most of these are reasonable, except that strings and function names, for example, are very easily derailed by the packers and obfuscators people commonly use, so that's something we have to take into account.
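Here is a toy sketch of what feature engineering on a binary might look like, assuming all we have is the raw bytes. A real pipeline would parse the PE header, imports, sections and so on; the blob below is fabricated, and the feature names are my own.

```python
import math
import re

def extract_features(data: bytes) -> dict:
    """Turn a raw binary blob into a small numeric feature vector."""
    # printable-ASCII runs of length >= 4, like the `strings` tool finds
    strings = re.findall(rb"[\x20-\x7e]{4,}", data)
    # byte histogram, then Shannon entropy of the byte distribution
    counts = [data.count(bytes([b])) for b in range(256)]
    total = len(data) or 1
    entropy = -sum((c / total) * math.log2(c / total) for c in counts if c)
    return {
        "size": len(data),
        "num_strings": len(strings),
        "avg_string_len": sum(map(len, strings)) / len(strings) if strings else 0.0,
        "entropy": entropy,
    }

# a fake 80-byte "binary": an MZ-ish header, one import-like string, padding
blob = b"MZ\x90\x00" + b"KERNEL32.dll" + b"\x00" * 64
features = extract_features(blob)
print(features["size"], features["num_strings"])  # 80 1
```

Each binary becomes one row of numbers like this, and whether those numbers separate the classes is exactly the question the analyst and the data scientist have to answer together.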
Now, after you've selected your features and trained your algorithm, you of course have to find a way to evaluate your results, and if you're ever in an organization evaluating a machine learning solution, this is something you definitely have to scrutinize: how on earth has the vendor verified that their product really works? This is not as easy as it looks. You could say, okay, I'm going to take my features and my algorithm, train it, and then see how many times it gets the classification right. But say you had a data set with 99 good binaries and one malicious one. Your algorithm could simply answer "benign" all the time and it would be right 99 times out of 100. So accuracy alone doesn't tell us much about how well our algorithm performs on new, unseen samples. What we actually do in the machine learning space is a bit more involved. Algorithms don't usually output simple labels like benign or malicious; instead, they output a probability that a given binary is malicious, or some sort of score. We take these scores and compute what's called a receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate at different thresholds, and then we compute the area under this curve and try to get it as close to one as possible. If you're ever going through the machine learning literature, you'll see these kinds of metrics being used to evaluate classifiers.
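The accuracy pitfall and the area-under-the-curve idea can both be shown in a few lines. This sketch uses the rank-based (Mann-Whitney) formulation of AUC, which equals the area under the ROC curve, on the 99-benign/1-malicious toy example from above; the scores are invented for illustration.

```python
def accuracy(labels, predictions):
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)

def auc(labels, scores):
    """AUC via ranks: the probability that a random malicious sample
    scores higher than a random benign one (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 99 benign (0), 1 malicious (1): always answering "benign" looks great...
labels = [0] * 99 + [1]
print(accuracy(labels, [0] * 100))        # 0.99

# ...but the classifier's *scores* tell the real story
useless_scores = [0.5] * 100              # no separation at all
good_scores = [0.1] * 99 + [0.9]          # malicious sample ranked on top
print(auc(labels, useless_scores))        # 0.5
print(auc(labels, good_scores))           # 1.0
```

The always-benign classifier gets 99% accuracy but an AUC of 0.5, i.e. a coin flip, which is why the literature reports AUC rather than raw accuracy.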
When you actually try to put this into production, as I've learned in my daily job, things get complicated very quickly. Something that works well on your laptop does not work well when scaled to terabytes of data. You have to gather the data, clean it, train on it, and evaluate it, and the way the sausage is made is usually not very pretty. This is also something to keep in mind if you're buying a vendor solution that uses machine learning: how well are you going to be able to interpret the results it gives you? Usually in this space, when we push for performance, we sacrifice interpretability; the state-of-the-art algorithms that give the best results are usually the hardest to understand. In many industries, for example banking and insurance, it's very important for compliance purposes to be able to understand how the black box is thinking, and I think the same applies to cybersecurity: if an algorithm classifies something as malicious, you want at least a good understanding of why it did so. Definitely something to keep in mind.

So, beneath all this hype, I think it's worth asking whether this AI nonsense we're trying to introduce into our systems is really worth it, because ultimately what you're doing is unleashing a massive hulk of complexity into your system without knowing whether it will pay off in the end. I went through the literature to find reasons why people might want to introduce machine learning into malware
detection, and one reason that kept coming up, paper after paper, is that in the past 20 or 30 years we've seen an explosion in new ways to generate malware samples. What might have taken an expert, or someone really dedicated, to write 20 or 30 years ago is now actually pretty easy; you have script kiddies selling Mirai botnet variants on Instagram. Because of all the leaked source code, and even open-source malware, we've seen an explosion of new malware samples. There was an estimate by Symantec that in 2010 alone there were 286 million unique malware variants. As you might imagine, if we use the traditional antivirus approach, where we have to examine the malware and use heuristics to develop signatures for detection, this is going to get really hard really quickly, and we're going to have to start hiring a lot more malware analysts to cope with that kind of explosion. So essentially, one good argument for automating this is that we simply have too much to do with the people currently working in this industry. I decided to take a look at how practical it would actually be to automate malware detection with machine learning, and I looked at the literature going back to 1995. Now, we all
know that industry and academia are often very far apart, and industry has definitely had a very optimistic outlook on introducing AI nonsense into our stacks. If you look at the venture capital funding flowing into cybersecurity, you'll see that a lot of it, in very large amounts, is going to different kinds of AI solutions: not just solutions that automate malware analysis, but all kinds of things, user behavior monitoring, anomaly detection, you name it, even autonomous threat-hunting machines, which actually sounds a bit scary, like Skynet or something. Anyway, some people are saying this hype is a bit dangerous, because if you think about it, you're introducing something into your security stack that you don't fully understand, and I somewhat agree: you should exercise caution there.

Anyway, let's move on to academia. What do the academics think, and what kinds of breakthroughs have we seen in this space? This is not a new thing by any means; the problem of automatically detecting malware has its roots in at least 1996, which is the earliest reference I could find to research using machine learning techniques to detect malware. This was a group at IBM trying to develop a new
kind of virus engine that could classify boot-sector viruses; apparently that was a big thing at the time. I actually don't remember that far back, so maybe some people in the audience will. This was the first paper where people used these techniques, and they were not very successful, because the way their implementation was done, it could only handle very small viruses. Fast-forward five years and we have a paper by a research group led by Dr. Schultz, which was really the first to take PE files, the executables that run on Windows, extract the strings from them, convert them into data points, and try different kinds of classifiers like RIPPER and naive Bayes. The sample sizes used in that study were very small, in the thousands, so they weren't sure whether this could scale, and although they got some promising results, later papers have questioned how well it could work in practice. Five years forward from that, there's a really cool paper that applies lots of the techniques we use for spam and text classification to malware. I was quite excited to see it, because it was the first paper I found where people weren't using strings or function names extracted from binaries, but the actual byte sequences found in the binaries, to create the feature vectors. What the researchers did is they used hexdump to convert the binary code into what are called n-grams, little overlapping chunks of bytes, and then they assembled those into giant feature vectors that were used to train the classifiers. They had a fair bit of success doing this: with the 3,000 or 4,000 samples they used, they were able to get accuracies (which, as we said, is not a great metric, but still) of 92 to 95 percent, which is actually quite good.
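A sketch of that byte n-gram idea: slide a window over the raw bytes, count the sequences, and turn the counts into a fixed-length feature vector. The "binary" and the vocabulary below are made up for illustration; the real papers mined the most informative n-grams from the corpus.

```python
from collections import Counter

def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Sliding window of n-byte sequences, the 'words' of a binary."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

def to_feature_vector(data: bytes, vocabulary: list) -> list:
    """Fixed-length vector: the count of each vocabulary n-gram."""
    grams = byte_ngrams(data)
    return [grams[g] for g in vocabulary]

# a tiny made-up 'binary': the bytes of an MZ header, repeated
sample = bytes.fromhex("4d5a90004d5a9000")
grams = byte_ngrams(sample)
print(grams[b"\x4d\x5a"])   # the 'MZ' 2-gram occurs twice

vocab = [b"\x4d\x5a", b"\x5a\x90", b"\xff\xff"]
print(to_feature_vector(sample, vocab))   # [2, 2, 0]
```

Stacking one such vector per binary gives exactly the kind of giant feature matrix those classifiers were trained on, with no disassembly or string extraction required.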
Moving one year forward, there's a really interesting paper that didn't get much attention. It hypothesized that you can use the opcodes in disassembled malware to predict which binaries are good and which are bad, which I found very interesting because I had never thought about it that way. The researcher, Daniel Bilar, took a set of known-good binaries from his Windows machine and a set of viruses, disassembled all of them, and looked at the opcode distributions for the benign and the malicious samples. He found that there were some slight differences, and moreover, that the presence of really rare opcodes can be a good indicator of whether something is malicious or not. I thought this was quite cool, because as a data scientist I often find that the features we use in machine learning don't have much connection to the samples themselves, but here a feature like NOP, which correlates highly with being malicious, actually has a functional meaning: many virus writers use NOPs to pad their binaries. This research has been used in other papers as well, and I think there have been further attempts to use opcode-based classifiers to predict malware.
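The rare-opcode observation can be sketched like this. The disassembly listings below are fabricated toy data; a real experiment would disassemble actual benign and malicious binaries and aggregate over many samples.

```python
from collections import Counter

def opcode_frequencies(listings):
    """Aggregate opcode counts over a set of disassembly listings."""
    freq = Counter()
    for listing in listings:
        freq.update(listing)
    return freq

def rare_opcodes_present(sample_opcodes, benign_freq, threshold=1):
    """Opcodes in this sample that are (almost) never seen in benign code;
    the observation above was that these are good maliciousness hints."""
    return sorted({op for op in sample_opcodes if benign_freq[op] <= threshold})

# made-up listings: benign code here is mostly mov/push/call/ret
benign = [["mov", "call", "mov", "ret"], ["mov", "push", "call", "ret"]]
benign_freq = opcode_frequencies(benign)

suspicious = ["mov", "nop", "nop", "sidt", "call"]
print(rare_opcodes_present(suspicious, benign_freq))  # ['nop', 'sidt']
```

In this toy corpus the padding NOPs and the unusual `sidt` instruction stand out purely because benign code never used them; the real paper did this statistically over full opcode distributions.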
Fast-forward to today, and the trend in research is using more visual methods to classify malware. One paper was really cool: the researchers computed different kinds of statistics from binaries and used them to create RGB images, so malware samples were converted into visual representations. You can see from the images that different malware families produce different kinds of pictures, and these were then fed into classifiers for further classification. What I've learned from looking at a sample of these papers is that there are lots of challenges in applying this technology even in an academic setting, and looking back at my research, I wonder whether the vendors claiming this works in production today have really done their homework. In the papers I looked at, most of the sample sets were really small, maybe up to a thousand binaries at most, so I really wonder whether this can scale to production, and if you're ever evaluating a solution that claims it can, I would definitely ask some serious questions, for example: how has the vendor made certain that their training data is representative of the malware samples that are out there? Moreover, what we really have in this field are lots of challenges in making this scalable and
bringing it to production. One of those challenges is that we don't have a big public data set of benign binaries. In the papers I looked at, most of the data sets were proprietary, and even though you can probably get lots of malicious binaries from places like virus-exchange sites or VX Heaven, we don't really have good places to get benign binaries, because most of them are copyrighted, so you can't share them publicly; and being able to share data sets is usually very important for data science work. The second challenge is labeling a data set. If you remember my slides on supervised learning, most of these classifiers require some kind of labeled data set for training, and we don't have a large-scale automated way to analyze malware to produce those labels. It usually requires an expert to take a look, or you might use an external service like VirusTotal, but it's known to be flaky, and those classifications do change. And the third challenge is that you don't want someone like me running around with 10,000 virus samples on a USB stick, sharing them with my colleagues. That would definitely not end well, because unlike you, I'm not a
professional in handling this kind of hazardous material. Now, a bunch of researchers at a company called Endgame saw these challenges and decided to do something about them. They developed a data set called EMBER, the Endgame Malware BEnchmark for Research, which includes 1.1 million portable executable files, some benign and some malicious, collected in the years 2006 to 2017. They computed some statistics, some features, from each of these binaries and released them to the public, and as I was doing research for this talk, I decided it would be interesting to take a look at this data set. Unlike the data sets used in the previous papers, it doesn't include the binaries themselves, because most of the benign binaries are copyrighted, for instance by Microsoft, so you can't distribute them; but it does include each file's checksum, so if you're doing research you can take the checksum, go back to VirusTotal, and track down the sample to see what it is. So what kinds of features did the EMBER researchers compute? First, some general statistics about every binary, such as its size, how many imports it has, and so forth. They also extracted a lot of information about the strings, because, if you remember back to my section on academic research, strings are one of the
important features people often try when developing classifiers. Something I found very interesting personally is that they computed byte histograms: they took every byte value from 0 to 255, counted how many times it occurs in the binary, and published the histogram. So I started wondering whether it would be possible to develop something like a byte-histogram fingerprint, to see whether the byte histograms of malicious and benign Windows files are different. The reason I started on this is that I had seen a tool published by CERT, an Austrian site I believe, that uses this byte-histogram technique to detect whether a malware sample has been packed by a packer. You can see that an unpacked malware sample has a byte histogram that's quite uneven, whereas a packed one is almost completely even, and this is a great statistical way to detect whether something has been packed.
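The byte histogram and the packing heuristic are easy to sketch. A packed or encrypted sample tends toward a flat histogram, which shows up as near-maximal Shannon entropy; the two samples below are synthetic stand-ins, random bytes for "packed" and a zero-heavy blob for "unpacked".

```python
import math
import random
from collections import Counter

def byte_histogram(data: bytes) -> list:
    """How many times each byte value 0..255 occurs in the sample."""
    counts = Counter(data)
    return [counts[b] for b in range(256)]

def histogram_entropy(hist) -> float:
    """Shannon entropy of the byte distribution, in bits (max 8.0).
    A near-flat histogram, entropy close to 8, is a packing hint."""
    total = sum(hist)
    return -sum((c / total) * math.log2(c / total) for c in hist if c)

random.seed(0)
packed_like = bytes(random.randrange(256) for _ in range(4096))  # flat
unpacked_like = b"\x00" * 3000 + bytes(range(256)) * 4           # skewed

print(round(histogram_entropy(byte_histogram(packed_like)), 2))    # near 8
print(round(histogram_entropy(byte_histogram(unpacked_like)), 2))  # much lower
```

Stacking one 256-entry histogram per sample as a column of pixels is also exactly how the large-scale pictures described next were built.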
So I decided to see whether we could get anything interesting by computing a large-scale view of all the byte histograms in this data set. Because there were 1.1 million binaries, I wasn't able to process all of them, but I took a sample of about 130,000 malicious Windows binaries and 160,000 benign ones and created a picture of all the byte histograms. On the horizontal axis you have the index of the binary sample, so each vertical line represents one binary, and on the vertical axis you have the byte values; the lighter the shade, the more of that byte is present in the binary, and the darker the shade, the less. I was very excited when I first generated these images, because you can clearly see that there is a difference between the byte histograms of the benign files and the malicious files. So I started to take a closer look at some of the interesting features. I saw, for example, what looks like a binary with zero counts for most byte values except a few; and likewise, in the benign picture, you can see something like a track running right across the image, an increase in the counts of bytes between 100 and 125, which I thought was interesting too, especially because you don't really see it in the malicious binaries. So I zoomed into the picture, extracted the sample represented by the dark line, and checked what it was: it turned out to be a variant of CosmicDuke, a spyware trojan. I wasn't able to find out why its byte distribution was so skewed that most of the bytes are zeros; I suspected there was some kind of pre-processing or packing going on. But I wanted to see whether all the dark lines in the picture were the same kind of sample, so I looked at another one, and it was also CosmicDuke. That's actually quite interesting, because now I wonder whether this kind of visualization technique could be used to automate some of the detection of this kind of malware. Another thing I noticed when I zoomed into the picture was that some of the lines were
completely grey: there was no variation in the byte counts at all, and I started to wonder why. If you remember a few slides back, I showed you that when a sample is packed, its byte distribution becomes very even, so I wondered whether that was the case here as well: does a completely grey line mean the sample is packed? I was actually excited to find out that I was right. The line I picked out represents a trojan called Zusy (a very creative name, I know), and it did turn out to be packed. This was also quite exciting, because it shows that this kind of visualization technique can give you a quick overview of a large number of samples and still tell you something meaningful about them. Right, and another thing I mentioned before was that interesting swim lane in the middle of the benign binaries, which is kind of present in the malicious ones too, but not as clearly. I wondered why we see it in the good binaries but not in the bad ones, and when I asked my friends on Twitter, they hypothesized that this byte range corresponds to the printable ASCII characters, which are often stripped out of malicious files to help avoid detection. I think that could actually be true, so
this is something I'm going to investigate further. So, in summary, what we've seen in this whole talk about AI nonsense is that doing this kind of thing is very hard. The academics haven't been able to get it fully right, and I'm not sure the industry is at a point where this can be reliably deployed to production. However, the number of new malware binaries is increasing rapidly, so we probably can't rely on malware analysts alone to detect new samples, and some sort of machine learning methods will probably be introduced in the future. We've also taken a look at the process of developing a machine learning classifier: we have to be careful about feature selection, choose our algorithms, experiment, and test and evaluate our classifiers. Progress in the field is currently hindered by the fact that we still don't have great public data sets apart from EMBER, and we don't have benchmark algorithms, so when you're talking to vendors about introducing a malware-detection solution, you don't really have a benchmark against which to evaluate what they're proposing; but luckily we now have the EMBER data set to help close that gap. Lastly, I just wanted to say that it always takes a community to bring someone to a conference, and in my case I want to say
thank you to a community started by Marion Marschalek, aka BlackHoodie, which I attended last year. If you don't know about it, BlackHoodie is a group that aims to bring more women into reverse engineering. It's definitely one of the reasons I'm doing this work and why I'm here, so a big thank-you to them, and also thank you to all of you. I'm happy to take any questions you have.

[Applause]

Hello, does anybody have any questions about malware machine learning? Yes, thank you.

I just want to say thank you for a great talk. I have a question: once these vendors are actually able to detect malware, I'm guessing the malware authors will also adapt their tactics to try to subvert the AI. How do you see that playing out in the future?

That's a great question, because there is a lot of talk about this AI-versus-AI idea in cybersecurity. I would reiterate what Dave said: we're not at the point yet where malware authors would even need to do that, because there is so much low-hanging fruit in breaching organizations that you don't need to go to the trouble of developing malware that can evade a machine learning classifier. That being said, last year at the BlackHoodie conference I saw a great talk where someone had trained malware communication with a command-and-control server to look like Facebook Messenger traffic so it could evade detection. So in terms of academic research it is possible, and if you're a malware author you could certainly do it, but it's a lot of extra effort, and I don't think we're at the stage where they need to do that yet.

I saw that you work for Elastic, so do you know of any current or coming projects or tools that use Elastic to analyze, or productionize, the data coming from this kind of malware analysis?

So I guess our tools are
more generic data analysis tools. I don't know of any current project developing something like this, but it certainly is possible to put this data into Elastic, because I've done it, so if you want to talk to me offline about it, I'm happy to help. I don't think we have any specific malware analysis products coming up. Anybody else?

All right: you said that when you looked at some of the black samples from your histograms, you weren't able to determine why they looked like that?

Yes, you're correct. I looked at some of the samples that were completely black, and what I saw was that most of the byte counts were concentrated in the values 0 to 5, while the counts for the rest of the bytes were close to zero. For some reason these binaries had a lot of those byte values and none of the others, and I wasn't able to determine why. I'm hoping I can find a way to get the actual file from VirusTotal so I can disassemble it and take a look.

That was kind of my follow-up suggestion: convince someone with VirusTotal Intelligence, or something like that, to get you the actual sample.

Yeah, I think that's a great suggestion. I think there is a premium subscription or something that you can use to pull samples from VirusTotal.

All right, thank you very much everybody, and thank you to our speaker, Camilla. Let's give her a hand.

[Applause]

And we have a small token of our appreciation. There you go, thank you very much. Thank you. Is this alcohol? Sorry, the reason I'm asking is that I'm originally from Finland, and we smile even less than the Norwegians until we get alcohol. I'm not sure you're allowed to open that in the venue, but please enjoy it later. Thank you.

[Applause]