
Attacking The Malware With AI - Dimitris Prasakis

BSides Munich · 19:45 · 922 views · Published 2022-05
About this talk
Malware poses one of the greatest threats to the cyber industry. More than 450,000 new malicious programs and potentially unwanted applications (PUA) are registered every day (AV-Test Institute, 2022). As a result, there is an imperative need to automate the process of malware analysis by onboarding artificial intelligence into our defense toolbox. In this talk, we are going to discuss some of the state-of-the-art methodologies that modern antivirus products use for malware discovery and classification. More specifically, we are going to study the Malheur framework (Rieck et al., 2011). Based on this paper, we will explore: how the behavior of malware can be analyzed using sandboxes; how those sandbox reports can be embedded in a high-dimensional vector space; how the extracted data points can be compressed into a smaller set of representative prototypes to reduce the computational complexity of the machine learning algorithms; and how the embedded malware behavior can be incrementally analyzed on a recurrent basis with the use of clustering and classification algorithms, to either classify unknown programs or discover novel clusters of malware. In short, we will examine how state-of-the-art data science concepts and algorithms can be onboarded by cyber security researchers and engineers to automatically attack and expose the malware.
Speaker: Dimitris Prasakis
Transcript

Thank you. Hello everybody, and welcome. I hope that you have enjoyed your lunch break and that you're having an interesting and educational day so far. I'm really glad and delighted to be here with you today at BSides Munich, where we're gonna have the chance to explore one of the places where the finest concepts of data science and cyber security meet. Before doing so, I'd like to really quickly introduce myself to you. My name is Dimitris Prasakis, and I'm currently pursuing my master's in cybersecurity at the Georgia Institute of Technology, Georgia Tech, in the US. I'm also working as a security engineer at Trade Republic in Berlin, where I specialize in cloud security and infrastructure security. My interests

also lie in AI and privacy, and especially where all these worlds intersect with each other, and one of those intersections is what we're gonna see here today together. Our agenda is made out of four parts. The first two of them are introductory: we're gonna talk about malware analysis and about sandboxes, and we are also gonna talk about machine learning and introduce the tasks of classification and clustering. The third part is the heart of the talk, the Malheur framework. It's a malware analysis framework with AI, and we're gonna explain how it works step by step. And we're also gonna have a session for questions and discussion. So, without further ado, let's start

and let's talk about malware analysis. In cyber security there is a job title, malware analyst, and these people's day-to-day job is to analyze malware. They do this with the help of certain tools called sandboxes. A sandbox, in short, is an isolated environment, in the sense that everything that happens in the sandbox can't escape it: if we execute a program in that sandbox, it can't infect the host, right? So that's the first thing it does. The second thing it does is that it has the ability to study and monitor the behavior of what is being executed inside of it. So in essence it takes a malware binary as an input, executes it, studies it, and then produces

a text report. This report will contain things like network requests made or metadata about the malware, but today we care in particular about the commands this malware tried to execute. We're going to talk about Windows malware specifically, and Windows API calls. Apart from sandboxes, the second introductory point is the task of classification. Classification is a machine learning concept: it is the task of taking an observation, something whose category we don't know, and putting it into some pre-known category, into some class, to be more formal. So let's say that we download an executable from the internet. The task of classifying this executable as benign or malicious, as good or bad, is called classification.
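
As a toy illustration of the classification task just described, the sketch below labels an unknown point by majority vote among its closest labeled points. The two-dimensional coordinates, the labels and the function name `knn_classify` are made up for this example; real malware reports would first have to be turned into vectors.

```python
from collections import Counter
import math


def knn_classify(query, labeled_points, k=3):
    """Return the majority label among the k labeled points closest to query."""
    # Sort the labeled points by Euclidean distance to the query point.
    by_distance = sorted(labeled_points, key=lambda item: math.dist(query, item[0]))
    # Count the labels of the k nearest points and keep the most frequent one.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data: (vector, label) pairs.
training = [
    ((0.0, 0.0), "benign"),
    ((0.1, 0.2), "benign"),
    ((0.2, 0.1), "benign"),
    ((1.0, 1.0), "malicious"),
    ((0.9, 1.1), "malicious"),
]

print(knn_classify((0.15, 0.15), training))
```

With an even `k` and two classes a tie is possible; this sketch then returns whichever tied label appears first among the nearest points, which a real system would want to handle explicitly.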

And in machine learning we classify objects with the help of classification algorithms. One of the best-known ones is k-NN, k-nearest neighbors, and it operates based on the folk saying "show me your friends and I'm gonna tell you who you are", or in other words: show me your nearest neighbors in the vector space and, based on their class, I'm gonna predict your category. We see here the example where this observation up there gets classified as a benign program because its nearest neighbors are benign. Apart from classification we have other tasks in machine learning, and one of them is clustering. Unlike classification, which is a supervised kind of learning because we

feed our algorithms with labeled data, clustering is a kind of unsupervised learning: we feed our algorithm with unlabeled data, and the clustering algorithm tries to create groups of those unlabeled data, clusters, which we are going to equate today with the concept of classes. Let's see how all these three concepts, sandboxes, classification and clustering, bond together under a framework for malware analysis called Malheur. But before doing so, I would like to talk about a very interesting statistic: every day almost half a million new malicious programs and potentially unwanted applications are registered. I would like you to think about how much volume of malware this is. We

need to understand this malware. We need to study it so that we can attack it before it attacks us. We can't do that manually: we don't have enough manpower, we don't have enough bandwidth. Malware analysis is a very energy-consuming process and it takes time. There is therefore an imperative need to automate this process as much as possible, and this is where the Malheur framework comes into play. Malheur is an open source framework for malware analysis with machine learning, and it is used to do mainly two things: to discover new clusters of malware, in other words to take all these half a million malicious programs every day and try to

group them into clusters that contain malware with identical behavior, and to also classify unknown malware observations into the clusters that it created. We're gonna see how this happens in a bit. The Malheur framework looks like this, in a simplified version. We talked about the first concepts, right? So we take all the malware binaries, we pass them through a sandbox, and that produces some text reports. We also talked about clustering and classification: we want to cluster and classify those malware reports. But in essence clustering and classification are statistical algorithms, right? They don't understand text files. So this is where the second step, embedding of behavior, comes into play: we take those text files and we transform them into mathematical objects,

into vectors, so that we can later classify them and cluster them. There is also a step in the middle, the step of prototype extraction, a very beautiful idea which we are going to talk about in a bit. So, without further ado, what you see here is an XML-encoded malware report. I modified it a little bit for the sake of the presentation and for understanding better what's going on. We see three snippets of instructions. The first one is something you're going to see many programs do: the program tries to load the Windows kernel into the process memory, and it also reads the time. Nothing malicious about this snippet of commands, right? But if we scroll a little bit more down

and we study the next snippet, the next sequence of instructions, we're gonna notice something at least suspicious. The malware tries to copy itself from the location it is in to a Windows protected area called System32, and while doing so it also renames itself to a well-known Windows process, csrss.exe. After doing this, it also modifies the Windows registry keys so that it runs every time Windows starts up. So that's not something a benign program would do, right? Why would you copy yourself into a protected area and then put yourself on startup? You may do the latter, but especially those commands in combination show some malicious behavior. And we see another example, another snippet, where the malware just

tries to check if it's being debugged. Malicious programs usually do that to understand if they are being analyzed, and in this case they try to shut down their malicious behavior and conceal themselves; they try to hide themselves from the malware analysts. There is an important observation to be made here from this example: behavior is often manifested as a sequence of instructions, which we formally call q-grams. So a q-gram is a synonym for a behavioral pattern, let's say, and we saw two examples of 2-grams, because they are made out of two instructions. This concept of q-grams is important because it's what enables us to

embed the text files into the vector space. We want to do that first of all so that they can be read by the clustering and classification algorithms, but also because embedding the reports into vectors will enable us to express their behavior in a geometrical way: similar reports, similar malware, are going to be closer in the vector space, and vice versa. We embed the behavior using a mathematical function called the embedding function, which looks like this. We're going to explain at an intuitive level what this does. Simply, it gets an input, which is a malware report, a sequence of instructions, and produces an output, an n-dimensional binary-valued vector where each dimension represents a q-gram; it represents a

behavior. For example, imagine that we have the three-dimensional space, right? The z dimension, the x dimension and the y dimension. We may embed this behavior, this q-gram, in the z dimension, and say for example that the malware tries to copy itself into the Windows protected area, rename itself and put itself on startup if the value here is one; otherwise, if it's zero, the malware doesn't try to do it. And we may embed this other behavior, the malware checking for a debugger, on, let's say, the x axis: if it's one, the malware tries to do it; if it's zero, the malware doesn't try to do it.
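
The embedding just described can be sketched in a few lines: every distinct q-gram gets its own dimension, and a report's vector holds a 1 in that dimension if the q-gram occurs in the report and a 0 otherwise. This is a minimal sketch, not the framework's actual code. The instruction names below are invented placeholders rather than real Windows API calls or real sandbox output, and `qgrams` and `embed` are hypothetical helper names.

```python
def qgrams(instructions, q=2):
    """Return the set of all q-grams (runs of q consecutive instructions)."""
    return {tuple(instructions[i:i + q]) for i in range(len(instructions) - q + 1)}


def embed(instructions, vocabulary, q=2):
    """Map a report to a binary vector with one dimension per q-gram in vocabulary."""
    present = qgrams(instructions, q)
    return [1 if gram in present else 0 for gram in vocabulary]


# Hypothetical, heavily simplified reports (placeholder instruction names).
report_a = ["load_dll", "get_time", "copy_file", "set_registry_key"]
report_b = ["load_dll", "get_time", "check_debugger"]

# The vocabulary fixes which q-gram each dimension stands for.
vocabulary = sorted(qgrams(report_a) | qgrams(report_b))

vector_a = embed(report_a, vocabulary)
vector_b = embed(report_b, vocabulary)
print(vocabulary)
print(vector_a)
print(vector_b)
```

Both vectors share a 1 in the dimension for the harmless load-then-read-time pattern and differ in the others, which is what lets geometric distance stand in for behavioral similarity.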

Imagine this for n dimensions, in the n-dimensional space. Something really interesting happens when we embed malware in the vector space like that: because each dimension represents a behavior, similar reports will be close to each other, as we said, and they will form dense clouds in the vector space. If we look at this space from far away, we're gonna see clouds, clusters of malware, and each cluster has malware that behave in similar ways. This exact distribution is what we are gonna attack in order to exploit the malware before it does that to us. Before talking about that, however, let's dive into a very interesting topic of computer science: computational complexity. Algorithms take time to be executed,

correct? And especially machine learning algorithms scale in a super-linear way, let's say in polynomial time, depending on how much data we give them. That means that they are quite slow, and we want to execute this framework every day, so there is an imperative need to reduce the running time of the framework as much as possible. There are three things that we can do. We can buy more memory and more CPU, but this is limited, right? And by default we have some physical limits as well; we don't have quantum computers, for example. So that option is out. The second thing we can do is use faster algorithms, but again, machine learning algorithms are refined; they are as good as they may get. We

may be able to modify them slightly depending on the problem, but mostly we can't really reduce how much time they take. The last thing we can do, and that we are going to do, is reduce the data that we feed the framework. So what if, instead of feeding the framework with half a million reports every day, we compress them with a ratio of, let's say, one to ten, and we feed it with fifty thousand reports? This is where a very interesting and beautiful idea from 1989, the idea of prototype extraction, comes into play. Instead of analyzing all these initial malware reports, all these data points, what if we could extract representatives? What if we could

choose certain data points that would represent all the others? We call these data points prototypes, and luckily enough there is a prototype extraction algorithm, not the optimal one but a good enough one, that runs in linear time. So we take our reports and we try to compress them into a small set of reports which we call prototypes, and instead of feeding our classification and clustering algorithms with the initial reports, we're gonna cluster the prototypes themselves and propagate the results of the algorithm back to the initial data points. We see here the five reports, the five prototypes, excuse me, that we extracted before, and they are being clustered into three clusters: one on the top right, one on the

bottom left and one on the top left. One would say it doesn't really make much sense to create a cluster of one member, right, like we do on the top left. Why would we create a group of one? And we would agree with whoever claimed that. This is why the algorithm takes a parameter m, and we say that if a cluster has fewer than m members, if our clustering algorithm creates a cluster with fewer than, let's say, two members, we are gonna reject this cluster. We are gonna put it on the side and we are gonna keep these data points for future reference. Maybe tomorrow, when we're gonna have new malware

reports, some new prototype is gonna be extracted on the top left, and as a result maybe a new cluster will be formed, because it's gonna have more than m members. The same exact idea applies to classification. We draw our decision boundary, and we may say that if a new observation falls here, for example, the closest prototype is this one; as a result, we are going to classify our new observation into this cluster. But if it falls more than a distance d_r away, let's say here, we reject it and we keep it for future reference. We described the simplified version of Malheur: we described how we embed malware into the vector space, how we extract prototypes out of them, and how we can

use these prototypes to classify and cluster the malware faster. What we didn't talk about, however, is how we can make this framework recurrent. We said that it's gonna be executed every day, let's say, and analyze half a million malware samples, right? This is where our last concept comes into play, the idea of incremental analysis. The framework looks like this. The first steps are the same: we take the malware binaries, we monitor the behavior, we embed the behavior, and then, instead of classifying the initial data points, we classify the prototypes. On the rejected prototypes from the classification, we run the algorithm of prototype extraction and we extract prototypes out of them, and then we feed these prototypes into

the clustering algorithm, and these new prototypes maybe will create new clusters with yesterday's prototypes, and so on. What we described basically is this algorithm. This is the heart of Malheur; this is the entry point, the main function of it, and it invokes all the other three algorithms that we talked about. This way our model gets stronger and stronger every day: our framework can create new clusters of malware, and as a result it can identify new malware, and in this way we can attack the malware before it attacks us. The talk is based on a paper called "Automatic Analysis of Malware Behavior Using Machine Learning". It's written, I believe, by two German professors and

one Austrian professor about 10 years ago, but it's still relevant and it's very interesting. If you enjoyed the talk, I urge you to give it a read, because it also has some concepts that we did not have the time to talk about today. I want to thank Sneha for guiding me and helping me prepare for this talk today. Thank you, Sneha, I really appreciate it. I also want to thank BSides Munich for giving me the chance to be here with you today and explore one of the places where the finest concepts of data science and malware analysis meet. I would love to connect with you. I have a QR code down here; I hope it works. It's

not malware, I promise. You don't have to believe me though, right? So you can search me by, sorry, right, correct. So yeah, you can also search for me by my name, and I would love to hear your questions and discuss with you what we presented. Thank you.
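
The prototype extraction and the classification-with-rejection steps described in the talk can be sketched as follows. This is a sketch under stated assumptions, not the framework's implementation: it assumes a greedy farthest-first selection, where the point farthest from every prototype chosen so far becomes the next prototype, until every point lies within `max_dist` of some prototype. The talk does not spell out the exact procedure, and the function names, the parameter names `max_dist` and `reject_dist`, and the toy two-dimensional points are hypothetical.

```python
import math


def extract_prototypes(points, max_dist):
    """Greedily pick prototypes until every point is within max_dist of one."""
    if not points:
        return []
    prototypes = [points[0]]
    # Distance from each point to its nearest prototype chosen so far.
    nearest = [math.dist(p, prototypes[0]) for p in points]
    while max(nearest) > max_dist:
        # The point farthest from all current prototypes becomes a prototype.
        farthest = points[nearest.index(max(nearest))]
        prototypes.append(farthest)
        # One linear pass over the points per added prototype.
        nearest = [min(d, math.dist(p, farthest)) for p, d in zip(points, nearest)]
    return prototypes


def classify_or_reject(point, labeled_prototypes, reject_dist):
    """Return the label of the nearest prototype, or None if it is too far away."""
    prototype, label = min(labeled_prototypes, key=lambda item: math.dist(point, item[0]))
    if math.dist(point, prototype) > reject_dist:
        return None  # rejected: kept aside for a later clustering run
    return label


# Toy data: two dense groups and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
prototypes = extract_prototypes(points, max_dist=1.0)
print(prototypes)

# Hypothetical cluster labels attached to two of the prototypes.
labeled = [((0.0, 0.0), "cluster_a"), ((5.0, 5.0), "cluster_b")]
print(classify_or_reject((0.2, 0.1), labeled, reject_dist=1.0))
print(classify_or_reject((20.0, 20.0), labeled, reject_dist=1.0))
```

Six points are compressed to three prototypes here, and the far-away observation is rejected instead of being forced into a cluster, mirroring the "keep it for future reference" step.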

Awesome, thank you very much. Any questions? Just come in front here, please, and ask your

questions. Just come in front, thanks, so we can hear you online as well. Hello. Hi, hello. First of all, thank you very much for the very impressive introduction of your system. I have just one quick question about it: where do you get all that information related to the malware from? Because, I mean, it's a huge amount of data you need to put into the system, and do we have any sources, or where do you get it from? That's a very nice question, actually. Usually we get those from honeypots, right? A honeypot is something like a sandbox, we could say, that we expose to the internet, and it has some vulnerabilities, and

attackers may get access to this honeypot and try to install some malware, but they fail to do it, because this is how honeypots work. So we take this malware sample and we throw it into Malheur. So honeypots usually are the way we gather those. Apart from that, there is also self-reported malware from people, or things that antiviruses already detected, let's say. Okay, perfect. [Music] Any more questions?

I will suppose it was clear then. Awesome, thank you very much. I'm sure you are available for Q&A later on? Yeah, I would love to talk with you. Thank you very much.