
Machine Learning and Images for Malware Detection and Classification

BSides Athens · 2017 · 16:20 · 85 views · Published 2017-10
About this talk
Security BSides Athens 2017 (24/Jun/2017)

There are some disadvantages, though. Our approach is heavily data dependent: it relies on malware that already exists, so it is problematic for countering or detecting zero-day attacks, that is, something unknown. Also, when we have only the structure or construction of the malware, we don't have much information about what the malware actually does; we have only the signals.

When we want to collect malware and build malware databases, we usually find bytes files, ASM files, or .exe files. Our experiments were performed in a supervised setting, so the algorithms were supervised: you get the labels, meaning the malware families of each sample.

There are some problems that we may face in a large-scale problem. One is called overfitting. Overfitting happens when we have so many parameters that the model fits the training data too closely. If we don't do the necessary processing before running our experiments, we get errors, and we cannot find any real relationship when we present our results. To counter this problem, we use cross-validation. This means that we split our data: we make a training data set, and after we train our algorithm we have a test data set on which we get our results.

In my last figure below you can see how I processed the data: first I create the features, then I do the cross-validation on those features, and then I run my classification experiments.

My first feature is called the segment count. When we have an ASM text file, and this is a snippet of one, we see these columns: segment, address, bytes, opcode, and operands. The literature says that we can count the segments of this ASM file. By segment we mean the first column; it is the keyword of each line of the ASM file. So we count how many lines of the malware belong to each segment. In the snippet that I showed you, you can see only .text, but there are other segments in an ASM file, such as .rdata, .data, and headers. So the first feature that we can use to get good classification results is the segment counts of an ASM file.

Then again, the literature suggests that we can represent a malware sample as a binary and then generate images from it. So this is the same malware presented as an image: the first is the bytes file and the second is the ASM file. What does this mean? It means that we can now apply image processing algorithms to identify things, for example to find the visual similarities between these files. Because we have a grayscale image, we only need the first 800 pixels of this image. So we can use image processing algorithms such as scene classification algorithms, but instead of scenes we have images of malware.
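As a rough illustration of the two features described above, here is a minimal Python sketch. It is not the speaker's implementation. The `segment:address` line layout, the function names `segment_counts` and `bytes_to_image`, the row width of 32, and the sample lines are all assumptions made for the example; only the idea of counting lines per segment and keeping the first 800 grayscale pixels comes from the talk.

```python
from collections import Counter

def segment_counts(asm_lines):
    """Count lines per segment; the segment is the text before the first ':'."""
    counts = Counter()
    for line in asm_lines:
        head, sep, _ = line.partition(":")
        if sep and head.strip():
            counts[head.strip()] += 1
    return counts

def bytes_to_image(data, width=32, max_pixels=800):
    """Treat each byte as one grayscale pixel (0-255), keep the first
    max_pixels, and arrange them into rows of `width` pixels."""
    pixels = list(data[:max_pixels])
    pixels += [0] * (-len(pixels) % width)  # zero-pad the last row if needed
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]

# Hypothetical snippet in "segment:address bytes opcode operands" form.
sample_asm = [
    ".text:00401000 55        push ebp",
    ".text:00401001 8B EC     mov  ebp, esp",
    ".rdata:00402000 48 65    db 'He'",
    ".data:00403000 00        db 0",
]
counts = segment_counts(sample_asm)

# Made-up byte content standing in for a real sample.
image = bytes_to_image(bytes(range(256)) * 4, width=32)
```

The per-segment counts and the flattened pixel rows could then be used as feature vectors for a classifier; how the speaker combined them is not stated in the talk.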

Once we have identified these two features, we can perform our classification experiments. I present four classification algorithms: random forest, multilayer perceptron, decision tree, and nearest centroid. Starting with the decision tree, I don't want to focus on the definition of this algorithm.

So the definition will be brief. A decision tree is a binary tree and an algorithm that can be used to perform classification, where each leaf of the tree is considered a class label. The goal of the decision tree algorithm is to compute its predictions from the data that was extracted by our feature engineering.

Then we have random forest, which is similar to a decision tree, but at training time it constructs a multitude of decision trees, and its goal is to output a classification or a mean prediction of the individual trees.

Then we have the multilayer perceptron, which is a supervised algorithm that learns a function f: we have some inputs and a target output, so we need to train the algorithm to reach this target.

Then we have nearest centroid. It is again a classification model, which assigns an observation to the class of the training samples whose centroid is nearest.

And then we have our results. Here we see that our algorithm was fast, as we said, and we have good accuracy with this algorithm. We have the bar from 0 to 1, and if we have a good classification result it will actually put a 0.01 as a good classification result. The blue, as we see, means it found similarities between different malware families, so we need to find some other features in order to perform better classification. On the multilayer perceptron, we have one misclassification for one malware family. Then here we have our next algorithm's results. It's the same: again it found similarities between malware families as well.
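To make the evaluation loop concrete, the following is a small pure-Python sketch of k-fold cross-validation around a nearest centroid classifier, one of the four algorithms named in the talk. It is an illustration only: the function names, the simple `i % k` fold assignment (no shuffling or stratification), and the toy feature vectors and family labels are assumptions, not the speaker's code or data.

```python
def fit_centroids(X, y):
    """Return {label: mean feature vector} for the training samples."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def predict(centroids, row):
    """Assign the label whose centroid has the smallest squared distance."""
    def dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda lab: dist(centroids[lab]))

def cross_validate(X, y, k=5):
    """Plain k-fold CV: fold j holds out every sample i with i % k == j."""
    scores = []
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        test = [i for i in range(len(X)) if i % k == fold]
        centroids = fit_centroids([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(centroids, X[i]) == y[i] for i in test)
        scores.append(hits / len(test))  # accuracy on the held-out fold
    return scores

# Toy, made-up feature vectors for two hypothetical malware families.
X = [[10, 1], [11, 2], [9, 1], [10, 0], [12, 1],
     [1, 10], [2, 11], [0, 9], [1, 12], [2, 10]]
y = ["family_a"] * 5 + ["family_b"] * 5

scores = cross_validate(X, y, k=5)
mean_accuracy = sum(scores) / len(scores)
```

Each sample is held out exactly once, so the mean of `scores` estimates how the classifier behaves on data it was not trained on, which is the point of using cross-validation against overfitting.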

Actually, when we build such a system, we need to find a way to defend it, but first we need to find a way to attack it. In the literature this is called building adversarial examples. What does this mean? When we have an image that clearly shows a panda, we add noise to this image so that the algorithm identifies it as a different object. My attack model is based on the image part that I presented, and it is about how we can use steganography to make malicious examples. What does this mean? We have clean images, which we can take from MNIST or ImageNet, and we have the malicious images, which we generate as we said on the previous slide. I made a process that stores the second image in the two lowest bits of the first image. These two lowest bits can hold only a small range of values, while our malicious images have values from 0 to 255.
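The embedding step can be sketched as follows. This is a minimal illustration under stated assumptions, not the speaker's process: images are flat lists of grayscale values, and because two bits cannot hold a full 0 to 255 value, each hidden byte is assumed to be split across four cover pixels, high bits first. The names `embed` and `extract` and the sample values are hypothetical.

```python
def embed(cover, secret, bits=2):
    """Hide `secret` bytes in the `bits` lowest bits of the cover pixels."""
    per_byte = 8 // bits                 # cover pixels needed per secret byte
    if len(secret) * per_byte > len(cover):
        raise ValueError("cover image too small for the secret")
    mask = (1 << bits) - 1
    stego = list(cover)
    for n, byte in enumerate(secret):
        for j in range(per_byte):
            # Take the next `bits` bits of the secret byte, high bits first.
            chunk = (byte >> (8 - bits * (j + 1))) & mask
            idx = n * per_byte + j
            stego[idx] = (stego[idx] & ~mask) | chunk
    return stego

def extract(stego, length, bits=2):
    """Rebuild `length` hidden bytes from the lowest bits of the pixels."""
    per_byte = 8 // bits
    mask = (1 << bits) - 1
    out = []
    for n in range(length):
        byte = 0
        for j in range(per_byte):
            byte = (byte << bits) | (stego[n * per_byte + j] & mask)
        out.append(byte)
    return bytes(out)

cover = [200] * 16                       # made-up flat grayscale cover image
secret = bytes([0, 255, 77, 4])          # made-up "malicious image" pixels
stego = embed(cover, secret)
recovered = extract(stego, len(secret))
```

With two bits per pixel, no cover value changes by more than 3 out of 255, which is why the modified image looks the same to a human viewer.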

When we do such a process, we generate new malicious images, but the hidden content is not visible to the human eye. This is a theoretical model that I came up with, and it has not yet been tested on a large-scale problem, but I hope to test it at a larger scale.

Then, once we have this, we need to find ways to defend. One way is called adversarial training. It is a brute-force solution, meaning that once we have generated our malicious data, we train our algorithms again so that they become more robust. Then we have defensive distillation, which is a different solution. We have some probabilistic models that we have already trained on another task, and we take them as input, so that the outputs are probabilities of the class being considered malicious.

Then, to defend our system against the steganography part, and because we have images, we can use image forensics for the analysis, meaning that there are statistical and more advanced analysis techniques to identify whether noise or a second image has been added to another one. So we can train algorithms with these methods and automate this process on a large-scale problem. It is, though, really difficult to construct a theoretical model to show that this will work.
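The talk does not name a specific forensic test, so the sketch below swaps in one well-known statistical idea as an example: a chi-square check in the spirit of the classic pairs-of-values attack on LSB embedding, generalized here to groups of four gray values for two hidden bits. The function name, the grouping, and the toy pixel data are assumptions; real images and real payloads will be far less clear-cut.

```python
from collections import Counter

def lsb_chi_square(pixels, bits=2):
    """Chi-square statistic over groups of 2**bits adjacent gray values.

    Returns (statistic, degrees_of_freedom). A statistic that is small
    relative to the degrees of freedom means the low bits look uniformly
    filled, which is what embedding a uniform payload tends to produce.
    """
    size = 1 << bits
    hist = Counter(pixels)
    stat, dof = 0.0, 0
    for base in range(0, 256, size):
        observed = [hist.get(base + k, 0) for k in range(size)]
        total = sum(observed)
        if total == 0:
            continue                     # this group does not occur at all
        expected = total / size          # equal share per value in the group
        stat += sum((o - expected) ** 2 / expected for o in observed)
        dof += size - 1
    return stat, dof

# Made-up cover whose two lowest bits are always 00.
clean = [200] * 64 + [100] * 64
# Same pixels with the two lowest bits overwritten by a cycling payload.
stego = [(p & ~3) | (i % 4) for i, p in enumerate(clean)]

clean_stat, clean_dof = lsb_chi_square(clean)
stego_stat, stego_dof = lsb_chi_square(stego)
```

A score like this could serve as one input feature for a trained detector, which is closer to the automated, large-scale use the talk describes than a fixed threshold would be.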

Actually, because a machine learning algorithm is an optimization problem, it is hard to say that we can find a defensive rule that will work when we build an attack model or a defensive model. The second problem is a more general one, encountered in classification algorithms in general: it is really difficult to have good outputs on every possible input, or to find inputs that guarantee we will get good outputs.

Some future work, which I already mentioned, is that we need to combine malicious images with clean images and perform experiments in a larger-scale environment. Apart from images or PDF files, there are other objects, like voice recordings or app files, and something malicious could hide in such files. And then we need to test these on new databases.

Thank you for your time. I plan to put my implementation on GitHub soon. Thank you very much.
