
Identifying Android Malware With Machine Learning by James Stevenson

BSides Basingstoke · 2022 · 14:35 · 93 views · Published 2022-07
Category: Technical
Style: Talk

Cool. Hey everyone. So I'm going to start the talk off with a question. The question I came up with for today is: what do Minority Report, the TV show Person of Interest, and a random tool I made one evening have in common? Super riveting question, I know. I'm sure you're all on the edge of your seats. The answer is that they all use machine learning or some sort of prediction technique to identify malicious activity, and that's what we're going to be talking about today. So in Minority Report we have the precogs, in Person of Interest we have the Machine, and this random tool I made one evening uses binary classifiers to identify malicious configurations in Android applications.

I wanted this talk to have two key takeaways, because I appreciate not everyone here is going to be super interested in Android reverse engineering. The first takeaway is just about that: we're going to reverse engineer some Android applications, take them apart, figure out how they work, and use this machine learning classifier to identify malware. The second key takeaway is about how we can apply techniques from one field to another field. In this case I'm applying some data science and some malware analysis to reverse engineering, which is what I do for a day job.

So first things first, who am I? Well, my name is James Stevenson and, I want to say six years ago now, which makes me feel old, I graduated from the University of South Wales studying computer security. Since then I've worked as an Android internals software engineer, I've worked as a penetration tester, and these days I'm working as a vulnerability researcher for the startup Interrupt Labs. There's a good handful of us, Max, Mike and a few others from Interrupt Labs, here today, so if you want to know what we do, feel free to grab us. I've also spoken at a few conferences, I'm doing a part-time PhD, and I have a few books as well.

So what are we actually going to be talking about? Well, first of all I want to talk about
Android reverse engineering here. I want us to have a look at a few Android applications, take them apart, figure out how they work, and really get everyone on the same page when it comes to Android reverse engineering. Then we're going to have a look at DroidDetective. Now, DroidDetective is a tool that I've made that allows us to take an application, strip it apart and figure out what's happening. Then I want to talk about the data science part: the machine learning model behind DroidDetective, how it actually works, if it works or if it doesn't work. And then finally I want to go through a demo. Now, I'm not brave enough to do a live demo, so it is a recording, so I know how it works. It's great.

So let's jump straight into it: Android reverse engineering. As a software engineer, we're going to write an Android application, commonly in either Java or Kotlin. That then gets compiled down into a Dalvik executable. That Dalvik executable is basically what's run on the device. There's a few additional steps there, but for the sake of this talk it's what's running on the device. As reverse engineers we cannot read that; it's not human readable. So we can do two things: we can either disassemble it or decompile it. We can decompile it to pseudo-Java, the decompiler's best guess at what that Java would look like, or we can disassemble it into smali, a human-readable representation of that Dalvik bytecode.

But for what we're doing today we actually don't care about code. We specifically care about the APK, and the APK is basically a bundle of information, files and different pieces that tell the device how to run an application. There's a few things in there, including the Dalvik executable, some resources and some files, and there's a brief example here. So we've got some assets, basically our arbitrary storage; native libraries; we've got that Dalvik executable; and then what we actually care about today is this Android manifest file. That's basically a configuration file that includes everything the device
and the app need to know to run. They look a bit like this. There you go: a bunch of XML. It has a few things in it: its package name, package ID, some components, entry points. As I said, basically everything the device needs to know to run that application. But again, what we actually care about today is these permissions, and this is an example of some permissions in one of those manifest files. Because of the way that Android is sandboxed, under normal circumstances, if an application doesn't have a permission it can't do the thing related to it. So let's look at the one in the middle, which is the internet permission. Under normal circumstances, if an application doesn't have that permission, it can't access the internet.

That can lead us to a few questions, especially if we put a malware analysis hat on, because we have all these different types of malware: spam, backdoors, hostile downloaders. It leads us to the question: if different permissions lead to different types of activity, can we identify malware based on the permissions it's using? So let's say a hostile downloader, maybe that's using the internet permission; a backdoor, maybe that's using the boot completed permission; spam, maybe that's using the phone permission. And that's the hypothesis for this talk: can we identify Android application malware solely off the permissions that it's using?

And that's where DroidDetective comes in. That's what DroidDetective tries to do: it takes an application's permissions and tries to identify if it's malware or not, solely off those permissions. This is the... this is not happening. There you go. This is the web UI of DroidDetective. It's quite simple, and if you ever visit this web page you'll realize I'm not a web developer: it crashes probably about 50% of the time. But there is a GitHub repository as well, which in theory works 100% of the time, and we'll have a look at this
later on when I show the demo. What this basically does, as I said, is it takes an APK, takes it apart, looks at the permissions and identifies if it's malware or not.

So how does that actually work? Well, there's a few data sets involved. The first data set we have is our known bad data set, and this is our malware. This is basically a malware GitHub repository that's online; it gives you a few hundred malware samples. The second data set we have is our known good data set. For this example I pulled 100 or so applications from the Google Play Store, the assumption being that Google and Google Play Protect have filtered out the malware. So we have our known good and our known bad data sets. We also have our prediction data set, and this is what we're going to provide the model so it can attempt to identify if it's malware or not.

But then we have this binary classifier. Currently this is a black box. This binary classifier, what actually is it? It's using random forest. I'm not a data scientist, I'm just some reverse engineer, so I'm going to explain it at a high level. Basically it's a series of decision trees, and those decision trees are made when we train the model. Then when we provide it new data, it goes through all of those decision trees and does either some sort of averaging or some sort of voting to come out with that final result. So again, what we're looking to do is take permissions from our application, go through these decision trees and identify if it's malware or not.

There's a few things, but two here, that we need to bear in mind when we talk about the data that we provide a model like this. The first is that these features, these variables that we provide the model, need to have some sort of predictive power. What does that actually mean? Well, it basically means that the features we provide the model need to have some sort of impact or weight on the decision that we want to
identify. So early on our hypothesis was that permissions would have an impact on whether something is malware or not. But if we look at something else in an Android application, let's say the application name, does that have an impact on whether it's malware or not? Probably not. A malware engineer can change the application name; it has very little impact on whether it's malware.

The next thing we need to bear in mind is that these multiple features should be uncorrelated. What that basically means is that these features aren't related to one another. We can look at Android permissions for a really good example of that. In Android we have two different types of permissions: our manifest permissions and our runtime permissions. Our manifest permissions are accepted by the user at download time, and our runtime permissions are accepted at runtime; that's the box that pops up and says, hey, do you want to use the phone? But the issue with both of these as a data set is that they're quite correlated. If your application has the read external storage manifest permission, it will also have the read external storage runtime permission, and what that will end up doing is confusing the model. So we're going to end up just using these manifest permissions.

But what does that data set actually look like? I've mentioned these three data sets. That might not be too clear on the screen, but we can imagine each of these data sets like a table, where in the top row we have the permissions in the Android Open Source Project, so that's everything from internet access to write external storage and phone state. Then in the left column, or the right column I suppose, we have all of the applications in that data set. We then put a one if it has that permission and a zero if it doesn't. So we can imagine each of those data sets as these kinds of tables.

So this is the demo we'll see. Video can't be loaded. Cool. Well, let's see if I can hotspot it. I feel like I'm probably making a lot of noise on
the table. There you go. You know, it's when you try to do everything possible to make a demo work, and then in practice it makes it not work. Yeah, what can you do? Okay, I think we're probably just going to skip the demo, because there is a little helper slide after this that would have shown us what's happening. Cool. So let's imagine there was a great, super amazing demo and everyone applauded.

What actually happened in that demo is we took the Facebook Lite application, Facebook Lite basically being a cut-down version of Facebook. We then stripped apart the bundle of files and took that manifest file, which, as we mentioned earlier on, is basically an XML file. We then took the permissions from inside that file and ran them through this binary classifier and these decision trees, which then came back with a result. In the case of Facebook Lite it came back as true. We also ran a trojan through it, which came back as malware, and Facebook Lite came back as not malware. We'd then be able to see that in the UI, where we have "this was marked as not malware" and these were all the permissions it used.

So... oh, what's it going to do? Great. Cool. So we've had that amazing demo. Now I want to talk about a few things related to this model. The first one is the scores in the top right. Again, I'm not a data scientist, I'm a reverse engineer and a vulnerability researcher, so we're going to take this at a high level. The first one we're going to focus on is accuracy. What this accuracy basically defines is how often it has successfully identified malware as malware and not malware as not malware. This is 0.93, or 93%, which is pretty good. There are a few things that we need to think about here, things like overfitting, which I'll briefly cover later on.

The next thing, which is quite interesting, is these model weights, and these model weights should
really be called feature weights. What this is, is the weight or the impact that this model, DroidDetective, has put on these features, and again our features are our permissions. Is it... sure. Cool. So these are our feature weights. Again, our features are the permissions and the weights are the impact that the model has put on each of these features.

This is quite interesting, even if we don't want to use DroidDetective. It's interesting if we're malware analysts, and it's interesting if we want to understand how malware works on Android, because we can have a look at these weights and understand why they have been put on these permissions. We can say, okay, this model really thinks that internet, write external storage and write SMS have an impact on the decision between something being malware or not. Now, that doesn't necessarily mean that because an app has the internet permission it's malware, but it means that it has an impact on that decision.

Come on, you can do it. Cool. So we've come towards the end of the talk. DroidDetective is open source and it isn't perfect, so if anyone wants to have a play around with it, I definitely recommend doing that. There's a few ideas here which need looking into. One of those I mentioned earlier on, which is all about overfitting. The DroidDetective data sets that I mentioned earlier on: we have that known bad data set, our malware; we have our known good data set with Google Play apps; and then we have our prediction data sets. The idea of overfitting is that maybe we've just trained this model to be really good at detecting what's in those data sets. So really we need a better data set; a random GitHub repository from four years ago isn't a great data set. So that's one thing that can be improved. There's a few other things, where we could start looking at the different types of malware. Earlier on we mentioned, well,
does spam have different permission sets to backdoors? That's another great example. But yeah, that's basically this talk drawn to a close. If you've got any questions, feel free to ask me now, you can find me on Twitter, or I'll be around at the Interrupt Labs table later on. Thanks everyone.
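
The permissions table and the accuracy score described in the talk can be sketched in a few lines of Python. This is a minimal illustration, not DroidDetective's actual code: the function names, the shortened FEATURES list and the example manifest are all made up for this sketch, and it assumes the manifest has already been decoded to plain-text XML (inside an APK it is stored in a binary XML form, so a decoding step would normally come first).

```python
import xml.etree.ElementTree as ET

# Namespace used for "android:" attributes in a decoded AndroidManifest.xml.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

# Assumption: a shortened, illustrative feature list. The talk describes
# using the full set of permissions from the Android Open Source Project.
FEATURES = [
    "android.permission.INTERNET",
    "android.permission.WRITE_EXTERNAL_STORAGE",
    "android.permission.WRITE_SMS",
    "android.permission.RECEIVE_BOOT_COMPLETED",
]

# Assumption: a made-up manifest, already decoded to plain-text XML.
EXAMPLE_MANIFEST = """<manifest
    xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.app">
    <uses-permission android:name="android.permission.INTERNET" />
    <uses-permission android:name="android.permission.WRITE_SMS" />
</manifest>"""


def manifest_permissions(manifest_xml):
    """Return the set of permission names requested via <uses-permission>."""
    root = ET.fromstring(manifest_xml)
    key = ANDROID_NS + "name"
    return {
        elem.attrib[key]
        for elem in root.iter("uses-permission")
        if key in elem.attrib
    }


def feature_row(permissions, features=FEATURES):
    """One table row: 1 if the app requests the permission, otherwise 0."""
    return [1 if name in permissions else 0 for name in features]


def accuracy(predicted, actual):
    """Fraction of apps whose predicted label matches the known label."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


if __name__ == "__main__":
    row = feature_row(manifest_permissions(EXAMPLE_MANIFEST))
    print(row)  # [1, 0, 1, 0]
```

Rows like this, one per application in the known good and known bad data sets together with a malware / not malware label, are the kind of input a random forest implementation (for example scikit-learn's RandomForestClassifier) could be trained on; that training step is left out here to keep the sketch dependency-free.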