2017 - Machine Learning Aided Malware Detection With Focus On Android by Nikola Milosevic

BSides Manchester · 50:36 · 2.5K views · Published 2017-10

Mentioned in this talk

Tools used

Androguard apk leaks Apktool jadx

Platforms

Android

Frameworks

scikit-learn

Thank you, thank you all for coming here. So yeah, I'm Nikola Milosevic. This is my Twitter, so in case you are tweeting anything, please tag me or mention me so I know; I want to have a picture. So, who am I? My security involvement started in 2012 back in Serbia: I founded the OWASP chapter there and was running it for around two years. Then in 2014 I moved to Manchester for my PhD, so I have both academic and industry background. When I came here I got involved in OWASP Manchester as well, so if you were at the OWASP Manchester chapter then maybe

you've seen me. I'm also the leader of the OWASP Seraphimdroid project, which I'll be talking about slightly here, because part of what I did is integrated in the app. I'm doing a PhD at the University of Manchester, finishing in a couple of months, and I started working at the Manchester Institute of Innovation Research, which is affiliated to Manchester Business School. This is my blog, so if you want to read more about what I do you can visit it. So what is the talk about? We tried to do some static, machine-learning-based methods for Android malware detection, and there were two things that we tried: one was based on the

permissions, the other one was based on the decompiled source code, so we were analyzing source code. I'll talk slightly about the Android security model in case someone is unfamiliar, and about malware detection as well. Academically I'm mostly involved in natural language processing, which is a subfield of AI that analyzes language, and here we try to cross security and natural language processing with machine learning. These are two fields that I don't find talk to each other that often, and talking to each other can help and can produce new techniques. So, the Android security

model, if you know something about it, uses sandboxes: it puts every application in a sandbox, so applications don't talk to each other and don't share resources, contacts and other things that are outside of the sandbox. There are permissions for some system features that the user needs to approve if the app wants to use them. Usually applications can talk to each other only if they are from the same manufacturer, that is, if they share the same digital signature. However, when users get a pop-up asking them to grant a permission, they usually don't check it; they approve everything, and basically they are bringing a lot of risk to

their devices through that. So why did we choose to analyze malware on Android? In 2016 Android was 82% of devices sold worldwide, while currently 68% of mobile users use Android. It's a huge market share, and therefore it's very interesting for malware developers to attack Android devices specifically, since all other operating systems together are around 30%, and that is decreasing as well, since 82% of devices sold are Android. There is also a huge amount of malware for Android: G DATA reported around 55,000 new Android malware samples, and these are unique samples. You can do something so

it's the same malware, like polymorphism, which gives the malware a different signature while it does the same thing. But there are around five thousand new malware samples released daily, and if you take all the polymorphic malware into account it comes to tens of thousands, even hundreds of thousands. The trend is increasing: in the whole of 2011 there were around 3,800 malware samples released, while in 2015 it was around 2,300,000. And the thing is that malware can be distributed through Google Play, so you can have legitimate stores where you can download malware. Google is trying to fight that, but their approach is kind of rule-based, so

for example, I got an email that our app uses a device admin permission, which is used for remote wipe and features like that, which we use for theft protection. You can have legitimate apps that use these permissions, while they are kind of disallowing them. It is probably not the best approach, I think; it adds to security to a degree, but it disables certain kinds of apps. Android malware is usually targeted at regular users, and the usual use case is that they want to steal personal data, things like contacts, banking details, some sort of secrets or

files, whatever it could be. You can also use it to make botnets that will mine cryptocurrencies or be used for DDoS attacks. And there is ransomware, which, if you were here for the previous talk, is the most common case for malware nowadays. Malware can also destroy the device, which is a less common case; however, if you mine cryptocurrency, for example, it will probably destroy your battery after some time. This is one example of a malicious app and how it asks for permissions: it is a flashlight app asking for access to storage, system tools, your location, phone calls,

network communication, hardware controls and so on, and most users wouldn't look at it and would just allow it. There is no reason for a flashlight app to have all these permissions. In terms of malware detection, there are traditionally two approaches. One approach is static, which reviews the source code or the binaries and tries to find suspicious patterns; quite often this is done by machine learning or by a human who analyzes the code. The other approach is dynamic, which involves executing the analyzed software in some isolated environment, some sort of sandbox, monitoring its behavior and trying to figure out whether

it does something malicious. In terms of static analysis, traditionally it was done with signatures: the most prominent approach for most anti-malware software was to have signatures. However, as I said before, there is a lot of malware coming out and it is less and less feasible to analyze it manually. Back in the day people were analyzing samples and saying, okay, this is malware, I'll take a snapshot of the app, make a signature out of it, push it to my anti-malware software, to my users, and if the same thing comes along they will get it detected. However, polymorphism can prevent that, and the scale of

malware development just grows. You can analyze patterns in binary files, in API calls, in the opcodes that the applications are using. This is general, not related to Android only; whether we are looking at PC malware, Android or iOS, it's pretty much the same. As for the methods, they started with manual analysis, then there was a lot of work on some sort of automation, such as pattern detection, and now the field is moving towards machine learning. In terms of behavior analysis, you execute the app in a sandbox and monitor certain things like

battery usage, code calls and API calls, the same as previously, except that you are not analyzing statically; here you are actually executing the code in some sort of sandbox. If you look at current anti-malware applications, they are moving towards this approach: they have heuristics that detect apps that are using a lot of certain opcodes, a lot of battery and so on, and they try to raise the alarm when something suspicious starts happening on the device. In many cases doing that requires root access; certain functionality you can't access, so you

can only do so much, so we are doing what we can. With rooted devices you can do pretty much anything you want, but you're not developing anti-malware software for people with rooted devices, because you'd expect that they know what they are doing if they rooted their device in the first place. You want to protect normal people who go to a store, buy a phone and don't do anything special with it. So our approach was static, and we used two machine learning approaches. Our baseline was permission-based, so we were looking only at permissions, and the other approach is code-based, so we were looking at the source

code: we decompiled it. We used two groups of machine learning algorithms, classification and clustering. Does anyone know the difference between them? Yes, one person. Okay, does anyone want to explain, to say what the difference is? Yeah... okay. Do you have an idea how you can use clustering in malware detection? Okay. So here is a short intro to machine learning. Basically, you throw some data in to train a model, and then you can throw some new data at it and it will predict based on experience. It is divided into supervised and unsupervised, as well as semi-supervised and reinforcement

learning, so there are four classes of machine learning, four ways to do it. We are mainly looking here at supervised and unsupervised learning. Supervised learning is, as he said, where you have labeled data and you're supervising by saying, okay, this is malware, this is not malware. Then you extract some features and try to learn the difference between malware and non-malware, or whatever you are classifying, whether it's sentiment (which text is positive, which is negative) or anything else. Generally it uses probability, statistics and some other kinds of maths. It was interesting: I was giving some training on machine

learning and natural language processing at a company, and a developer came to me saying, okay, I'm a bit disappointed with this machine learning, it's actually only maths, I was expecting some magic to be there. No, there is no magic, it's maths. So classification corresponds to supervised learning: you are classifying instances. In clustering you are trying to group similar instances, and I'm not going to tell you yet how we use clustering. The idea of the permission-based approach is that you can easily extract permissions from an Android app: you can do it on the phone, you can do it from the manifest

file, you can do it in a couple of ways. So you can look at them and try to learn the combinations of permissions that go together in malware or in non-malware. We used the M0Droid dataset, which was developed by [inaudible], who had the talk about IoT, and it contains 200 good APKs, so non-malicious ones, and 200 malicious ones. Now, if you are trying to learn the difference between malicious and non-malicious apps based only on permissions, what can be a problem there? Can anyone tell? Yeah: you can have access to the SD card and the Internet, and someone can be stealing your data, or you can just need to store

something somewhere while your app runs. So that could be a problem, and you can get some false positives, so it's not enough. This is Angry Birds, and it uses storage, phone calls, network communication and your location, so this app could steal pretty much a lot of your personal data, while it is basically a game and it's not doing anything too malicious. So we moved from there to a code-based methodology. If you go back in time to the 1980s and 1990s, to the first malware, the idea was to try to emulate what people were doing back then,

so you would have a malware analyst who would sit down at his computer, get the malware sample, decompile it with some tool like IDA or whatever, look at the decompiled code, either normal source code or assembly or even binary, and try to find malicious patterns. Then at some point he would write a report saying, okay, this app does this and that, it's malicious, it tries to do this, prevent it by integrating this and that protection in your anti-malware software. Basically a human can do this, and the assumption was that a machine can do this today with machine learning. So we used the same dataset. Yeah,

there was a slight problem with this: we couldn't decompile all the apps. 32 apps could not be decompiled; either it was a problem with our decompiler, or it was because of obfuscation, or whatever the problem was, I didn't go too deep into the details. Ten of the apps that couldn't be decompiled were non-malicious and 22 were malicious. The basic idea here was that code is pretty similar to text. We know that we can do sentiment analysis or topic analysis, so we can classify that this text is positive and this text is negative, or this text is about politics, this text is about sport, this text is about fashion. So, by comparison, we can as

well classify code, because code has pretty similar features to a language. It is pretty much a standardized language where you use a set of functions, import calls and so on, just as in natural language you use a set of words to communicate. So the idea was: if we can teach a machine to classify text, why wouldn't we be able to teach it to do the same with code? We used a bag-of-words approach. It is one of the simplest approaches in natural language processing, where you don't care too much about the order of the words and don't care about the grammar, but you

care about the occurrences of the words. When I speak about words here, those are function calls, variable names, import calls, libraries that you import, or whatever. It makes the somewhat naive assumption that the order of the code doesn't matter. It is very often used in natural language processing for sentiment analysis, language detection, pretty much any type of classification. So what was the workflow? You get an APK file, so the Android binary, you unzip it and get the DEX file from it. You use a tool called dex2jar that transforms that DEX file into a JAR file, and then you can again unzip it and

extract the class files. Now you have a bunch of class files and you can use any kind of Java decompiler to decompile them. We merged all the Java code into one big file, and then we used that to train our classifier. When we were training and testing, we used 10-fold cross-validation. Does anyone know what 10-fold cross-validation is? Basically you split your dataset into ten folds, so ten parts, and you use nine for training while you use the last one for testing, so it's previously unseen. Then you do it ten times, so each fold gets used for both

training and testing: nine times it gets used for training and once for testing. So even if you have a small dataset, you can test on pretty much every instance and see the results. We used a bunch of classification algorithms: support vector machines, naive Bayes, decision trees, random forests, JRip, logistic regression. We also used ensembles, so we combined a couple of algorithms, did the classification using each of them, and then used voting to get the final decision. Basically what it says is: if two algorithms say something is malware and one says it is goodware,

you are fairly sure that it is probably malware. Two algorithms are probably better than one, like two eyes are better than one, and if three people don't agree, you go for voting. We also used clustering; there are three algorithms that we used: simple k-means, expectation maximization and farthest first. A note on ensembles: we used random forest out of the box as a single machine learning algorithm, while it's not really; it's basically an ensemble learning algorithm. It's called a forest, and a forest is built out of many trees, right? So it's an algorithm that uses a bunch of, like, 200

decision trees, I think, in our case, then uses voting and gives the result. So this is already ensemble learning. For evaluation we used these measures. Precision is how many true positives you have compared to true positives plus false positives. Recall is true positives divided by true positives plus false negatives. The F1 measure is a common way to combine those two, so it says how good the classifier was overall. Accuracy is usually not a great measure in machine learning. Here we had a balanced dataset, so maybe I could have used accuracy, but this is what you commonly see in most of the

machine-learning-based papers. If you have an unbalanced dataset, it can happen that you have great accuracy while you are getting bad predictions, because of the imbalance. So, for permission-based classification, these are the single machine learning algorithms. The best results were with support vector machines, where you get a score of 87.9%, so around 88% of the time you correctly predict whether something is malware or not. It gets a pretty similar amount of false positives and false negatives, so the errors are quite evenly split between classifying something as malware when it is not, and classifying it as not malware

while it is malware. Then we tried ensembles, and you can boost it to 89.1, around 89.4, so by one to one and a half percent. However, we did some significance testing, so t-tests, and it turns out that this is not really significant; it ended up a bit higher pretty much by chance. For source code classification you get much better results. The best was again the support vector machine, with sequential minimal optimization, where you get around 95% correct, and it's also quite stable across precision and recall. You can boost it by around 0.05

percent, which is also insignificant. And when you run all these five algorithms, you waste a lot of computational power but you don't gain pretty much anything, especially since it's not significantly different. Now clustering. Clustering performed much worse: correctly classified instances were around 60%, 50%, 64%, especially for the permission-based approach. For permission-based clustering it's quite bad, so you can't really use these results, especially in production, while if you go to the source code it gets up into the 80s, 82, 81. And how can you use this, which was my question from the beginning? You

can use these to extend the training set. You have clusters, but the clusters are unlabeled. However, if you know that certain instances in one cluster are malware and certain instances in the other cluster are not malware, you can assume that the whole set of unlabeled instances in each cluster is the same, so malware or not malware. Therefore you can build a bigger dataset without too much human effort, then retrain your classifier and get better results, because you will have more data, and more data is quite often useful in machine learning. It's like experience: the assumption is that the more experience you

have, the better you are at something. However, the limitation of clustering is this: in how many ways can you cluster these people? You can cluster them by what family they are in, this is the Simpson family, these are the school employees; you can cluster them by gender, by color, by whatever. When you are doing clustering you pretty much don't know what the clustering algorithm will pick as the feature on which it clusters. There are ways to tweak it if you really want to target some features, but generally you just throw data at it and it tries to find something

that distinguishes them the most. If you're clustering apps, maybe whether the app is malware or not is not the most visible feature. That's pretty much the reason why the results of clustering are much worse than classification. In classification you really tell it: I want to classify into malware and goodware, or whatever you are trying to do. In clustering you are not telling it anything; you are just saying, find me the difference between these objects and make me two clusters, or four clusters, or however many clusters you want, and you don't know what the results are going to be.
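The idea of using clusters to extend a labeled training set can be sketched roughly as follows. This is only an illustration under assumptions: it uses scikit-learn's `KMeans` rather than the tooling from the talk, the feature vectors are made-up toy numbers, and `propagate_labels` is a hypothetical helper name, not something from the speaker's code.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans


def propagate_labels(features, known_labels, n_clusters=2, seed=0):
    """Give every instance the majority label of the labeled members of its cluster.

    known_labels maps instance index -> label for the few labeled instances.
    Instances in a cluster without any labeled member get None.
    """
    cluster_ids = KMeans(n_clusters=n_clusters, n_init=10,
                         random_state=seed).fit_predict(features)
    votes = {}
    for index, label in known_labels.items():
        votes.setdefault(int(cluster_ids[index]), Counter())[label] += 1
    cluster_label = {cid: counter.most_common(1)[0][0]
                     for cid, counter in votes.items()}
    return [cluster_label.get(int(cid)) for cid in cluster_ids]


# Toy feature vectors (assumed data): two well-separated groups of three "apps".
features = np.array([[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
                     [5.0, 5.1], [5.1, 5.0], [5.05, 5.05]])
known_labels = {0: "benign", 3: "malware"}  # only two instances are labeled
labels = propagate_labels(features, known_labels)
```

The propagated labels are only as good as the clusters: if the algorithm groups apps by something other than maliciousness, as described above, the extended training set inherits that noise.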

What are the use cases? Permission-based classification is really fast: you just pick up the permissions and give them to the algorithm, and it either learns or classifies. It has quite okay performance, 85 to 89 percent; it's not perfect, it's much worse than source-code-based classification, but it's fair. And the most important feature is that it can execute on the phone, so you can integrate it in an app. It doesn't need much computational power, it doesn't need anything special, so pretty much any smartphone can execute it. And it became part of the OWASP Seraphimdroid

permission scanner. If you go in there and run the permission scan, it takes a couple of seconds to scan your 80 apps or whatever, and then it says whether it classified anything as malicious, based on the permission model that we have. On the other hand, source-code-based classification is computationally much more expensive and cannot be executed on the phone. There are many reasons for that: especially if you don't have a rooted device, you can't access the APK files, you can't decompile them, you can't do all the things required in order to get to the text file. And then

even the text files are quite big for doing the classification, although current phones with four cores or so could probably handle it. But because of the permission model, because you can't access the APKs, decompile them and do all this on your phone, it can't be used there. It has state-of-the-art performance, though: whatever new research you find online has around 90 to 95 percent accuracy or F-measure, especially if it's machine-learning-based and for Android. So it's really good, but you have to execute it somewhere in the cloud and somehow push the APK files to it. So for

clustering, as I said, it has much worse performance than classification. It can be used for bootstrapping. Bootstrapping is the process where you are trying to make your dataset bigger: out of a couple of labeled instances you are trying, using clustering, to make a dataset that has everything labeled. However, it is not really useful for malware detection as such. Clusters can also overlap: especially with the permission-based approach you will have a group of permissions that appears in both clusters, and it will get confused. So, source code analysis can be done successfully by machine; you can basically emulate what humans are doing, or were doing back in the day. I don't know how many

people are actually analyzing malware by hand nowadays. Malware detection scores can be quite high. It's a generalizable approach: if you do this with machine learning, it will learn which functions, API calls and whatever else someone uses in code are common for normal applications and for malicious applications. Therefore, if you have a new application, it will recognize these patterns. So there is no need for signatures, no need to retrain the model every couple of days and so on; the model can be stable for some amount of time. Of course, if the language changes or, I don't know, the API calls change, so if there is a

significant difference in the language used in the source code, of course you will have to retrain the models, but for a long time it could be quite stable. And to end: I think that interdisciplinary security research can be quite useful. Security does look into certain other areas, and machine learning obviously looks for applications quite often, since if you look only at the algorithm it is not very useful; you need an application. Security as such also needs some application, it needs a real-case scenario. But NLP, for example, is really kind of specific; it

doesn't really talk to other fields, and here the idea came from my knowledge of NLP: okay, we have this dataset, can we combine the two, is there anything that we can do? It is published in the journal Computers & Electrical Engineering, so you can find this paper online. I think it's not open access, but if you are at a university you will have access to it, or you can email me and I'll send you the paper. As I said, it is part of our Seraphimdroid app, so you can download the app and try some of the

features. We have the permission scanner, which is basically this permission-based machine learning classification of whether the app is good for you or not, and then we have application lock, service lock, a settings checker, USB and SD scanners, SMS and MMS phishing prevention, although it is quite bad, geofencing, and a knowledge base. If somebody wants to help with development, email me; nothing has been done on this app for some time. And that's all, thanks. If you have any questions...
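As a recap of the evaluation setup described in the talk (bag-of-words features over merged decompiled code, a support vector machine, 10-fold cross-validation, precision, recall and F1), here is a minimal sketch. It rests on assumptions: it uses scikit-learn's `LinearSVC` as a stand-in for the SMO-trained SVM mentioned in the talk, and the "decompiled code" strings are invented toy data, not samples from the M0Droid dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical stand-ins for the merged decompiled Java source, one string per app.
benign = ["import android.app.Activity setContentView onCreate findViewById"] * 10
malicious = ["import android.telephony.SmsManager sendTextMessage getDeviceId exec"] * 10
documents = benign + malicious
labels = [0] * 10 + [1] * 10  # 0 = benign, 1 = malware

# Bag of words: count token occurrences, ignoring their order.
model = make_pipeline(CountVectorizer(lowercase=False), LinearSVC())

# 10-fold cross-validation: each fold is used once for testing, nine times for training.
scores = cross_validate(model, documents, labels, cv=10,
                        scoring=["precision", "recall", "f1"])
mean_f1 = scores["test_f1"].mean()
```

The toy data is trivially separable, so the scores here say nothing about real apps; the point is only the shape of the pipeline and of the per-fold evaluation.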

Yeah, it is possible, but if you have a malware family that is zero...

I mean, it depends what you want to do and what data you have. I guess if you have a closed set of families and you know which families you want to detect, then I think classification will be better. However, if you don't really know, if you think, okay, there will be some families out there and I just want to see the clusters of applications, then yeah, clustering is very useful. But from my experience, classification, if you have labeled data, is better. It's like with a human: if you show a human what he is supposed to do, he

will learn it better than if you let him play and try and figure it out himself. But that's if you have labeled data, a lot of labeled data, which is quite often very expensive, and that's why you go for clustering, because you want to cut costs. I haven't been looking too much into the source code myself, because it would be a lot of code review, but even if you do that, how likely is it to be very similar to the calls that you use for one or the other class? So you can have junk code, but it

doesn't change your performance.

But it doesn't matter, because bag-of-words doesn't take into account the order of the code, only the frequency and what occurs. So if you have one line, then sixteen lines, and then some other line, we learn that these two lines appear in the same code and totally ignore what is in between; if they appear next to each other, it doesn't really matter at all. However, it can have certain problems as well, since it's really naive in not taking order into account. If the order matters for some application, with this you can't do it: you can't do classification of, for example, part-of-speech tags, because order matters, but

for classification of text into a couple of classes it usually doesn't.
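The order-insensitivity of bag-of-words can be illustrated with a small sketch. It assumes scikit-learn's `CountVectorizer` as the tokenizer and counter, and the API-call names in the two snippets are illustrative only, not taken from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical "code" snippets with the same calls in a different order.
original = "getDeviceId openConnection sendTextMessage"
reordered = "sendTextMessage getDeviceId openConnection"

vectorizer = CountVectorizer(lowercase=False)
counts = vectorizer.fit_transform([original, reordered]).toarray()

# Both snippets map to the same count vector, so a classifier cannot tell them apart.
same_vector = bool((counts[0] == counts[1]).all())
vocabulary = sorted(vectorizer.vocabulary_)
```

This is why shuffling code does not change the features, and also why the approach cannot handle tasks where order carries the meaning.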

Yeah, but you can be mistaken as well; nothing is a hundred percent correct. I take that argument, but that applies to the whole field of machine learning: you aren't going to get a hundred percent, and if somebody claims a hundred percent, he is either using it wrong or cheating with the evaluation. I think he was first.

No, I haven't done any analysis of this. I mentioned these 32 apps that I couldn't decompile; it may be possible that they were native apps, but I'm not sure, I haven't done this sort of analysis. [inaudible] But yeah, this is a good point to look into further.

No, but I have kind of a history of Google ignoring me, so I'm not really trying. Well, now this just decompiles the APK. You probably can link them later on and do multiple classifications, but the current approach just decompiles [inaudible].

Okay. I mean, once you have the decompiled code you can do any type of pattern detection. You can do regex detection on the file, I guess, once you have the code decompiled, so I don't think that is hard to detect; I just haven't done it. Okay, that's all, thank you. [Applause]
