MAGIC: Malware Analysis to Generate Important Capabilities

Name: MAGIC: Malware Analysis to Generate Important Capabilities
Uploaded: 2018-11-09
Duration: 35 min 55 s

BSides Delaware · 2018 · 35:55 · 46 views · Published 2018-11

Speakers

Sean Kilgallon

Tags

Category: Research Technical

Topic: Malware Analysis

Research: Empirical Research Technical Deep-dives

Style: Talk

Mentioned in this talk

Tools used

Cuckoo Sandbox, Ghidra

Service

VirusTotal

Vendors

Reversing Labs

About this talk

Presents MAGIC, a machine learning approach that uses static assembly analysis to predict malware capabilities without running samples in sandbox environments. By analyzing instruction patterns and control flow graphs with decision trees, the method achieves 86–97% accuracy across multiple malware families, enabling faster threat classification and evasion detection while maintaining interpretability for threat analysts.

Transcript [en]


My research is in cybersecurity, using large-scale machine learning for the detection and classification of malware. This paper is entitled MAGIC: Malware Analysis to Generate Important Capabilities. It is all about using malware analysis and machine learning to predict a malware sample's capabilities before it is run in a sandbox environment. Normally you would find a malware sample's behavior by running it in a sandbox, where you can see what capabilities it has. In this paper we're trying to use fast static analysis instead. I'm also the lead data scientist at a cybersecurity company called Cyber 20/20, so a lot of my research goes into our platform. If you have any questions at the end about the research or my work, feel free to ask.

All right, let's get started. I'll talk a little bit about the major problems in this area of research. One is that there are millions and millions of malware samples out there. As of last year, 720 million malware variants were out in the wild, and right now I think a million or so malware variants are being released every day. That is coupled with malware becoming increasingly complex, for example malware that tries to detect whether or not it is being watched or tracked in a sandbox environment.

Number two is the scarcity of datasets. There's really no single place to go and find malware to do malware research. There's VirusTotal, which a lot of people use, but it's really hard to get actual malware files from there. So we partner with a company called ReversingLabs, from which we get a huge malware stream; we're now getting thousands of malware samples every day and running them through our analysis platforms.

That brings me to dynamic analysis. Dynamic analysis means taking a malware sample, running it in a sandbox environment, and seeing its behavior. That's really good in that you can actually watch the malware perform its malicious behavior, but it's expensive. You have to run a sample for anywhere from 30 seconds to 5 minutes in the sandbox environment. If you're trying to analyze a hundred thousand malware samples, that's going to take a lot of time and a lot of resources, so we're trying to come up with ways to cut that down.

Today I'm presenting our MAGIC model, which leverages fast static analysis to reduce the need to run dynamic analysis. We use disassembly analysis, which is our most complex form of static malware analysis, so we know what kind of instructions the malware has, and we map those instructions to specific capabilities. One of the key insights was to use a highly interpretable machine learning model: one, to provide that to threat intelligence people and organizations, and two, because people like to know why decisions are made in a company. More complex machine learning methods such as neural networks might be more accurate, but they don't provide any sort of feedback on why the decisions were made. So in this project we decided to use decision trees; I'll get to an example of that later.

The first thing we have to talk about is characterization, or static analysis of malware. Static analysis can range anywhere from looking at the distribution of bytes in the file, or looking at the strings of a file, all the way down to disassembly analysis and looking at the source code. In this specific example we're using [inaudible], which is a disassembler and reverse-engineering tool. Basically, we can take the malware binary, reverse-engineer it, and try to get a source code tree from it.

There are three different levels at which we can look at that source code: the function level, the block level (basically a control flow graph), or all the way down to the instruction level. Of those three granularities we chose control flow graphs at the block level, which gives us a middle granularity. At the function level there is sometimes only one node in the source code tree; at the block level there tend to be hundreds; at the instruction level you can have a graph with 20,000 nodes, which is really complicated to understand.

One of the features we use is an instruction histogram. We take basically all the instructions that the binary will execute over its lifetime and construct a histogram of them, and that's basically it. That becomes one of the inputs we can use for machine learning.

There is another, more complicated way to use this information. If we just use an instruction histogram, we lose a lot of the information that is in the graph. So we decided to do a random walk of the graph. We start at the start position, where the source code starts to execute, and construct a random walk; here it's just one walk of five nodes. Every node is a block, or a set of instructions, so we construct a feature vector that is a set of instructions, a set of instructions, a set of instructions, and so on for all five nodes. This gives us five instruction histograms, which gives us a little bit of detail about how the source code tree is laid out, and then we can store the feature vector.

The next thing we have to decide is what capabilities the malware has. We can find that out by doing dynamic analysis, that is, taking the malware and running it in a sandbox. There are two ways we do that. The first is that we have a partnership with ReversingLabs, where we get our data. They also curate it, so we have indicators of interest, which basically provide us with buckets of capabilities. We also have a dynamic analysis platform running on AWS. It runs Cuckoo, which is an open-source sandbox environment. They also partnered with MITRE, which is a large corporation that normally handles government contracts. MITRE started a capability vocabulary called MAEC, which stands for Malware Attribute Enumeration and Characterization. Basically they wanted to create a messaging protocol, so that if, say, a cyber intelligence team finds a malware sample, it can tell another organization that might also see that malware, and information can be passed from organization to organization.

So basically we can run malware through this Cuckoo cluster, we can also pull out the indicators of interest, and we get our targets. These are our MAEC capabilities, bucketed: things like anti-behavioral analysis. Anti-behavioral analysis would be something like the malware checking the browser history of the Windows machine; if there's no browser history, it's most likely a sandbox. It can also do things like check whether it's running on only one core, which most likely means a clean sandbox. Things like that we want to guard against. If we know the malware is anti-behavioral before we run it in the sandbox, then we can maybe run it on a bare-metal system, where there's no sandbox environment and the malware will run completely.

The indicators of compromise from ReversingLabs are bucketed as well. There are major categories, which have subcategories, and we construct models for the major categories.

For the dataset in this paper there are actually two sizes. Our Cuckoo dataset is around 6K malware samples, because we had to run the malware ourselves on our own clusters. Our A1000 dataset is around 60,000 malware samples, and they come from nine different malware families.
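The two static features described in the talk, a whole-binary instruction histogram and a five-block random walk over the control flow graph, can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy graph `CFG`, the vocabulary `VOCAB`, the function names, and the zero-padding used when a walk reaches a block with no successors are all hypothetical and are not taken from the talk.

```python
import random
from collections import Counter

# Hypothetical toy control flow graph:
# block id -> (instruction mnemonics in the block, successor block ids).
CFG = {
    "entry": (["push", "mov", "call"], ["check", "exit"]),
    "check": (["cmp", "jne"], ["loop", "exit"]),
    "loop": (["mov", "add", "jmp"], ["check"]),
    "exit": (["pop", "ret"], []),
}

# Hypothetical fixed instruction vocabulary (one histogram bin per mnemonic).
VOCAB = ["add", "call", "cmp", "jmp", "jne", "mov", "pop", "push", "ret"]


def histogram(mnemonics, vocab=VOCAB):
    """Count how often each vocabulary mnemonic occurs in a list of mnemonics."""
    counts = Counter(mnemonics)
    return [counts[m] for m in vocab]


def global_histogram(cfg, vocab=VOCAB):
    """Instruction histogram over every block of the binary."""
    all_mnemonics = [m for instrs, _ in cfg.values() for m in instrs]
    return histogram(all_mnemonics, vocab)


def random_walk_features(cfg, start="entry", length=5, seed=0, vocab=VOCAB):
    """Concatenate one per-block histogram for each step of a random walk.

    Assumption: if the walk reaches a block with no successors before
    `length` steps, the remaining steps are padded with zero histograms.
    """
    rng = random.Random(seed)
    features = []
    node = start
    for _ in range(length):
        if node is None:
            features.extend([0] * len(vocab))
            continue
        instrs, successors = cfg[node]
        features.extend(histogram(instrs, vocab))
        node = rng.choice(successors) if successors else None
    return features
```

With this layout the global histogram has one entry per vocabulary mnemonic, while the walk feature has five times as many entries, one histogram per visited block, which is how the walk keeps some of the ordering information that a single histogram discards.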

This dataset is all financial malware, so malware targeting either financial organizations or personal banking information, and the counts of the capabilities are in these two tables.

We chose to use decision trees because they're highly interpretable, and we can basically point to the decision trees and say why they're making a decision. (A1000 is ReversingLabs' analysis cluster; Cuckoo would be our dynamic analysis platform.) Decision trees also inherently perform feature selection and importance ranking, and they have very low computational cost; we trained all of our decision trees in two minutes, I think. Most importantly, we can identify which instructions go with which capabilities, so looking at our decision trees we can begin to recognize patterns and things like that.

So, the results. These are the results for the A1000 indicators of compromise. Our lowest accuracy is around 86 percent and the highest is 96 percent, so across the board the models perform very well, especially for decision trees, which are a very simplistic machine learning algorithm. Moving on, our Cuckoo results are bucketed, so there are only seven categories. They also range from around 86 to 97 percent, which leads us to believe that both sets of capabilities are similar.

These results basically allow us to make predictions, and those predictions allow us to use fast static analysis, which we can complete within, I don't know, 5 or 10 seconds, to predict important capabilities of malware that we would normally have to run in a sandbox for upwards of five minutes, and that potentially would never show its behavior there. We can also identify malware that is anti-sandbox, meaning it tries to detect whether it's being watched, and run it on something like a bare-metal system. That's just a computer with Windows installed on it, with some sort of supervisor watching the malware [inaudible]. It also allows us to spot evasive malware, which might require more sophisticated machine learning techniques or more sophisticated analysis techniques.

This would be an example of our decision tree model.

All of these splits are actually instructions. This comes from the binary global instruction histogram, which means we take all the instructions from the malware, construct a histogram, and then basically say whether or not each instruction exists. So we can say, you know, [inaudible], does it have mov? Yes. These types of patterns are really interesting for cyber analysts, threat hunters, and organizations that have lots of information that they don't know what to do with. We thought this might be useful for cyber analysts especially if they're going to senior management and trying to justify their actions. It's just simple instructions for now, but this is extensible to pretty much any instruction. We mapped, I think, 40 instructions or so and counted them from the source code.
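A decision tree split on "does this instruction exist in the binary" can be illustrated with a small pure-Python sketch. Everything here is an assumption made for illustration: the talk does not say which library, split criterion, or tree depth was used, so Gini impurity is swapped in as a common choice, and the feature names, counts, and labels in the toy data are made up.

```python
def to_presence(counts):
    """Turn an instruction histogram into binary 'instruction exists' features."""
    return [1 if c > 0 else 0 for c in counts]


def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means the group is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)


def best_split(rows, labels, feature_names):
    """Pick the presence feature whose yes/no split gives the lowest weighted Gini."""
    best = None
    n = len(rows)
    for j, name in enumerate(feature_names):
        present = [y for x, y in zip(rows, labels) if x[j]]
        absent = [y for x, y in zip(rows, labels) if not x[j]]
        score = (len(present) * gini(present) + len(absent) * gini(absent)) / n
        if best is None or score < best[1]:
            best = (name, score)
    return best


# Made-up toy data: per-sample counts of three instructions, and a 0/1 label
# saying whether a hypothetical capability was observed for that sample.
FEATURES = ["mov", "xor", "cpuid"]
COUNTS = [[12, 3, 1], [8, 0, 2], [5, 1, 0], [9, 0, 0], [7, 2, 1], [4, 0, 0]]
LABELS = [1, 1, 0, 0, 1, 0]
ROWS = [to_presence(c) for c in COUNTS]

root_feature, root_impurity = best_split(ROWS, LABELS, FEATURES)
```

In this toy data every sample contains `mov`, so splitting on it separates nothing, while the presence of `cpuid` lines up exactly with the label and is chosen as the root split. A full tree would repeat the same search inside each branch, and each resulting path reads as a chain of "instruction present?" questions of the kind shown on the slide.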

Any other questions? Yeah.

[Audience question] I've got a question about the use case for this. It seems like you're not including normal files, so it's not determining whether or not something is malware; you're determining the capabilities of the malware, right? [inaudible]

Right, yeah. So this has kind of two use cases. One is that this would happen after malware detection, probably at the same time as malware family classification, because predicting capabilities is all about what the malware does. You want to know if it's ransomware versus a spam bot, because a spam bot is just going to spam your email, while ransomware is going to encrypt your entire system. So knowing that capability when you detect that it's malware will determine how you remediate the malware as well. The other use case is basically to quickly determine malware capabilities without running it. Five minutes may not seem like a lot of time, but if it's ransomware, five minutes can do a lot of damage.
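The timing figures quoted in the talk (30 seconds to 5 minutes per sample in a sandbox, roughly 5 to 10 seconds for static analysis, and a workload of 100,000 samples) can be put side by side with a little arithmetic. The helper name `analysis_days` and its `workers` parameter are assumptions for illustration; the talk does not say how many sandboxes ran in parallel.

```python
SECONDS_PER_DAY = 86_400


def analysis_days(num_samples, seconds_per_sample, workers=1):
    """Wall-clock days to analyze all samples, split evenly across workers."""
    return num_samples * seconds_per_sample / workers / SECONDS_PER_DAY


SAMPLES = 100_000

sandbox_fast = analysis_days(SAMPLES, 30)    # 30 s per sample in a sandbox
sandbox_slow = analysis_days(SAMPLES, 300)   # 5 min per sample in a sandbox
static_fast = analysis_days(SAMPLES, 5)      # 5 s per sample, static analysis
static_slow = analysis_days(SAMPLES, 10)     # 10 s per sample, static analysis
```

On a single worker this comes to roughly 35 to 347 days of sandbox time against roughly 6 to 12 days of static analysis, and the gap scales the same way for any number of workers.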

I guess one more. Yeah.

[Audience question] You mentioned [inaudible], and as you know, malware writers test their malware. If this becomes more prominent, they could start to [inaudible] as well.

Yeah, absolutely. I think one of the biggest emerging topics in cybersecurity research is adversarial machine learning. There's a lot of work being done on adversarial machine learning in terms of things like image recognition, but this is the scary part about it: if the leading organizations use machine learning to do cybersecurity threat detection, there are ways around that, there are ways to attack it. That's a big problem, especially when using a simplistic model.

Yeah, so I think we just took a list of all the malware that we currently have and all the assembly instructions that we had, and we picked the top 40 or 50 most frequent instructions. That's also why we're using decision trees, because they inherently do that feature selection portion. The data we're using for this is definitely simplistic, because we wanted [inaudible]. If we wanted to extend this, we could use things like the distribution of bytes in the file, or byte histograms.
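Picking the most frequent instructions across a corpus, as described in this answer, might look like the sketch below. The function names and the three toy samples are hypothetical, and the real cut-off in the talk was the top 40 or 50 mnemonics rather than the 3 used here.

```python
from collections import Counter


def top_instructions(samples, n):
    """Return the n most frequent mnemonics across all disassembled samples."""
    counts = Counter()
    for mnemonics in samples:
        counts.update(mnemonics)
    return [mnemonic for mnemonic, _ in counts.most_common(n)]


def vectorize(mnemonics, vocab):
    """Histogram of one sample restricted to the chosen vocabulary."""
    counts = Counter(mnemonics)
    return [counts[m] for m in vocab]


# Made-up disassembly output for three samples.
SAMPLES = [
    ["mov", "mov", "push", "call"],
    ["mov", "push", "xor"],
    ["mov", "push", "call", "ret"],
]

VOCAB = top_instructions(SAMPLES, 3)
FEATURES = [vectorize(sample, VOCAB) for sample in SAMPLES]
```

Mnemonics outside the chosen vocabulary (here `xor` and `ret`) are simply dropped from the feature vectors, which keeps every sample the same length at the cost of ignoring rare instructions.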
