
BSides Rochester 2018 - IoT Botnet Detection System using Machine Learning

BSidesROC · 19:40 · 557 views · Published 2018-04
About this talk
Talk Description: IoT botnets recently became a destructive weapon against the internet domain, most notably Mirai and the up-and-coming Reaper botnet. Our research focuses on determining which features are most relevant in detecting botnet activity and designing a machine learning infrastructure to detect anomalies. Our talk will provide a high-level overview of our system, which features a method for dynamically generating profiles about known device traffic and signatures for anomaly detection. Bio: Jonathan Myers is currently pursuing his Bachelor of Science in Computing Security while also working as a Research Assistant at the Rochester Institute of Technology. His areas of interest include binary exploitation, web application security, and security tool development. Jonathan is also an active member of the Security Practices and Research Student Association at the Rochester Institute of Technology.
Transcript [en]

Now obviously, but I've been going over this, the Russell middle ground. Okay, you got it? Thanks, dude. Good again? All right. Hi, I'm Jonathan Myers. I'm a security student here at RIT. Over the years I've been heavily involved with our clubs on campus as well as the competitions and everything. Over these years I haven't really found a specialization for myself in security. I don't really identify as a pentester, incident responder, SOC analyst, nothing like that, and I've kind of taken a liking to just being able to do something brand new every single day. Along my journeys I stumbled upon machine learning, and I got pulled onto that train because, you know, buzzwords. And so we had a professor on

campus by the name of Heim Sanders. He gave a presentation a bit earlier. He had done a lot of work with artificial intelligence and neural networks, and I was like, oh, this is a great source. I came to him and I was like, here's my ideas, and he's like, ah, that doesn't work the way you think it does. So I went back, did a little bit of research, came back to him: here's my new idea. He's like, no, nope, that's not gonna work, you should read some more books. I was like, okay. It got a little bit busy, he had a whole lot going on. Coming into the beginning of this year, I

was like, I really want to get published as an undergraduate student. So I met with a professor here named Tom Oh, and he told me about his research group, and he was like, yeah, we're trying to detect malicious activity on IoT devices using data analytics. I was like, oh, that's cool, I'll do that. And that's why I'm here giving you this presentation. So, going over the presentation: we're going to talk about our team, we're gonna talk about the devices that we used and our strategy and philosophy for building our solution, we're gonna do a little introduction to machine learning, and then we're gonna have a little conclusion at the end. So our principal

researcher is Tom Oh. He's with the Department of Information Sciences and Technology, he's an extended CSEC faculty member, and his role on the team is to sort of organize and direct everything. Additionally, he's an awesome resource to go to to learn how to actually do academic research, how to write the papers and all of that. We have me; already talked about me. We have Ohon Phil bak, who is another student here at RIT. He's heavily affiliated with all the clubs, specifically RC3, where he's the web admin for them. We have William McDonald, another security student; he's also heavily affiliated with RC3. We have Tappan as Jumeirah, who is part of the computer science department here.

He specializes in data analytics, and so he sort of replaced Heim for me, and now I go to him for all my questions, and he's been a real big help to me. And then of course we have our sponsors, ETRI, the Electronics and Telecommunications Research Institute, based out of South Korea. We have Young Ho Kim and Jongmyo Kim, who we work closely with. They're part of the mobile security research section and information security research division, and they provide the funding and made this whole project possible, so thank you for that. So coming into this research, I came across this quote that suggests that it is best practice to review our

logs every day. I know I personally don't do that. By show of hands, how many of you guys do that? Yeah, not too many. A couple, okay, that's cool, but most people don't. So then you think, well, will everyday users of IoT devices check their logs? And the answer to that is absolutely no, they probably will not. So that's where our research starts: we want to use machine learning to analyze the logs and sort of get rid of that and have the machine do it for people. So the goal with this research is to find a way to protect IoT devices, and of course we want to use machine learning to do it, because that's a cool thing right now. We

needed to narrow our goals for our research a little bit more, so we decided to focus particularly on IoT devices that consumers are going to be using: cameras, DVRs, routers, those kinds of things. These happen to be like the number one culprit in IoT botnets right now; we'll look at that a little bit down the line. And so then we're like, okay, well, we don't want to protect against all malicious activity, that's kind of a really big, ambitious goal, so let's narrow it down and specifically focus on botnets. We chose Mirai since it's been really big, it's in the news and everything, and we had some previous research done by Joel Margolis, who

wrote a pretty in-depth paper that's now published on IEEE on the inner workings of Mirai. When we were looking at Mirai, we were like, okay, we don't want to be on the devices themselves, so let's look at network traffic. What are the main components of Mirai? The random internet-wide scanning is one of them; that might be a little bit harder to detect. But the telnet brute force with dictionary attacks, that's the primary exploitation mechanism for this botnet, and that should be pretty easy to detect. And then of course you have DDoS; there are all kinds of flooding techniques, but we would like to detect it before it's a slave to a botnet and causing

damage. So looking at our devices: some previous research suggests that routers and cameras were the biggest component in the Mirai botnet, and so we got a hold of a D-Link DCS-930L. This thing is pretty horrible; it has had all kinds of unauthenticated remote access to it, there's all kinds of CVEs, I think there's even a Metasploit module for it at this point. So that's not good, but it's really good for what we're doing. We had a street cam, but that got bricked, so we don't have that anymore. We have a router, a D-Link 850L; it's very similar to the camera, it is also not good. We have a DVR, which is a digital

video recorder; you'll connect your cameras to this and it'll record things. I think that company is called a wazoo or something like that. It's the best seller on Walmart right now, and it also isn't good. And we have a Lutron Caséta Wireless, and it's a light bulb, and we had it, so why not use it. So now we're going to go over the overall topology. This is a really high-level overview. You'll have your users and your adversaries that are connecting to either the network or the device; a lot of cameras are just directly connected to the internet, and so they'll connect to the network or the device. And we wanted to separate non-IoT

devices from IoT devices. Protecting laptops and servers and clients and everything, that's a whole different problem; we just wanted to focus on IoT devices. So between the internet and the network and the IoT devices lives our solution, which is currently a Raspberry Pi. We wanted a really tiny thing that we could give to consumers and be good there. So all IoT devices live behind that. It collects traffic in between the internet and the IoT devices, all ingress and egress, and it pumps all that traffic to a remote machine learning infrastructure, which does all the processing, determines if the traffic is malicious or not, and talks back to the Raspberry Pi and

sets up defenses. So now we're gonna look at the individual components in a lot more depth. At the front we have the traffic processor, which is the brain for the entire operation. This is going in and collecting all ingress and egress traffic associated with the devices. We originally wrote this in Python, and it was just collecting individual features from the packets themselves, but we found out that when we wanted to collect more complex features, like how much congestion is on the network when this specific packet is received, we're like, oh wow, that's harder. What kind of data structures do we need? This is too hard at this point. So I was talking about this and a friend suggested using Bro, and I

was like, I know Bro. So we scrapped all of our code and replaced it with Bro, which we'll talk about a little bit later on. Later on we plan on implementing a signature-based IPS/IDS system, so we'll generate signatures on our machine learning infrastructure, forward them down, and we'll have the firewall and the signatures go over the traffic. The feature extractor: this is where Bro really shines. Bro has a bunch of different protocol analyzers that'll pump out different logs depending on what you want, so you'll have like HTTP logs with all the features associated with HTTP, SSH, files, DNS, ICMP, those kinds of things. Additionally, Bro comes with a really

complex but also really powerful scripting language, and companies like CrowdStrike develop things for this, and you'll find things like Tor traffic analysis and DNS sinkholing and things like that. We were particularly interested in auth brute-forcing because, you know, the telnet brute-forcing from Mirai. So the feature extractor extracts everything, puts it into logs, and then that forwards up to our remote machine learning infrastructure, which should not be living on the network; it'll be off somewhere else. Maybe if someone chooses to sell the product, they'll be running this thing. At the front we have Logstash, which is where we're receiving the logs. Logstash comes with three components: it has inputs, filters, and outputs. Inputs: you have different plugins

that you can put in. I think most people use Filebeat, but you can use Dropbox if you wanted to, all kinds of different things to get the logs into Logstash. Then Logstash uses filters to take the unstructured data and put it in a structured form, which is great for our feature samples. And then it has outputs, which do the same as inputs but the other way, and you have plugins for that too. And so we send all of our samples to Elasticsearch, and Elasticsearch is basically a really fast database that is super optimized for holding all kinds of logs, and you can go there and you can search for the different

features that you want and build different feature sets to test with your classifier. So we export those as a CSV and send those over to our classification system. We haven't decided exactly how that's gonna work. We don't know if we want to use one classifier to do all the traffic; we're not thinking that's the best idea. We might do multiple classifiers, one for each device, or you can do classifiers for each individual application or protocol; maybe that's the way to go. That's part of the ongoing research project [unclear], and we're gonna figure that out. So eventually that's going to pump out an answer as to whether or not that traffic is malicious or

benign, and it's also going to give us a confidence level. And so that's forwarded on to our firewall database; part of it is probably gonna live on the machine learning infrastructure, part of it's gonna live on the device itself. It takes that decision as well as its confidence value, and if it was high enough, it's gonna generate a firewall rule and send it back down to the device and be like, hey, you know, this IP address, this domain, now don't let that in, that's not good. And then of course if we were wrong about our prediction and it actually caused a lot of havoc on the user's network, they're gonna be able to

look at the alert that we generate for them and revert the rule so that it does not affect their network. All right, so now we're gonna go through a brief overview of machine learning. I decided to start with talking about the differences between supervised learning and unsupervised learning. With supervised learning, we're going to have a defined set of features, so in this case we have: is the IP whitelisted, is the domain blacklisted, the port, the protocol, etc. And then the key here is that you're going to want to know what your data is: is this particular set of features malicious, is it benign? And you're gonna want to append an is

malicious label to that set of features, and that's gonna be either true or false based off of what you know. If you're looking for popular algorithms for that, random forest is kind of cool and a lot of people are using it right now. We're not going to get into that because it gets really complex really quick. Then looking into unsupervised learning: you're gonna want to use this when you don't exactly know what your data is. You don't know what's malicious, you don't know what's not malicious, and so you omit the is-malicious label and instead you let the model search for different clusters of points on its own. And so once you have these

different clusters formed, you can search the data for different similarities, you can look for patterns, and you can now identify the outliers on the graph. If you're looking for an algorithm there, you're gonna want to use isolation forest; that's pretty cool right now, and it has a really cool visualization if you look up some pictures of that. So we're particularly interested in classification. There's two different types of machine learning: regression, which is predicting a value, and classification, which is determining if something is or is not a particular object based on the provided information about that something. That's kind of abstract, so looking

at an example: we have a large quantity of packets, and we want to train this model with this large quantity of packets, so we tell the model which of the packets are malicious and which ones are not. Then when we give it new packets that it's never seen before, ideally it should be able to tell us if they're malicious or not. And so we provide the classification model with these features, it does all the math and everything, and then we have our results. Some gotchas here: features must be numerical or enumerable. I was kind of thinking, oh well, we can just give it an IP address or maybe some of the strings and it should be able to

find some problems there, but it looks like you want to choose things that are more like booleans: is that IP address whitelisted, is that string malicious or not. Use booleans and enumerable fields instead. So looking at feature selection: our strategy behind this was to train the model as if it was a SOC analyst. If you're a SOC analyst or incident responder, you kind of want to know what the normal traffic on the network looks like. Once you have an idea as to what the normal traffic is, you'll be able to quickly identify the abnormal

traffic. Maybe a strange DNS request comes up, maybe some protocol that you usually don't have running on your network. As you're analyzing your pcap files, those are some not-good things. So we're starting to notice that if you train with a whole bunch of normal traffic with some various malicious traffic in there, it tends to do a really good job at figuring out what's malicious and what is not. We're looking at the basic features that you find in pcaps first, and so that's your source and destination IP addresses, application-layer stuff, SSH, HTTP, all that. And then you can't really determine if

the traffic is malicious based off of one packet, so we're often looking at the bigger picture. We want to look at the amount of packets over an interval of time: is there like 10,000 telnet logins in ten seconds? That's probably an indicator that someone's performing a brute-force attack. You look at TCP streams and you look at all kinds of other statistics. So we needed to train the model to also recognize these things, and we had to do that by writing code to generate these features for us using the data from the Bro logs. Right, here's an example of binary classification, a really cool example where we have a

bunch of rings, some bolts, some nuts, and a bunch of scrap metal, and the idea here is that you're gonna try to determine which ones are which. And so you have these lines here that sort of separate the categories; these are called boundary lines. It might be hard to see from the back there, but you'll find that some of the things are in the other categories, where they shouldn't be. This is what makes machine learning in security incredibly difficult: you need to have close to 99.9 percent accuracy, you need to be as close to the top as possible, and that is not an easy task to do. We have over

classification and overfitting of data; these are some really big gotchas that I learned about. Over-classification is where, I sort of thought at the beginning you could just pump as many features into this thing as possible, the more the merrier, it'll figure out which ones are the best and be good to go. But that's not the case. There's actually a sweet spot, and it's different depending on which problem you're trying to address. Sometimes you need fewer features, sometimes you need more features, but there's a certain point where you start to have negative impacts. And then you have overfitting of data, which is gonna be where you use too much training data to train your

model. Your model is too trained and it will only recognize your traffic, and when you throw it out in the wild it flops, it fails, and nothing good happens. There's some go-to ratios for that, like use 80% of your data to train the model and 20% to test it, those kinds of things. And then when we go to test our solution (that's kind of hard to read), one of our team members set up a Mirai C2 server on our research network so we can actually test with the legitimate version of Mirai. Additionally, we plan on writing Python scripts to exploit different components on the IoT devices themselves to generate traffic, and then we'll use this

to actually test our trained classifiers. So, conclusion here: our goal is to create a more effective IPS/IDS stack, and we want to use machine learning to do that. Ideally, if you do this correctly, you'll be able to get increased accuracy, so you'll have fewer false positives. If you have such a low number of false positives, you can eventually automate your defenses and hopefully be able to just generate firewall rules to be like, hey, no, you can't, that's bad, you shouldn't be in here. And then of course you have the potential to identify new threats that you haven't seen before; that's a cool ability

to have if you can get to that point. But that's really where our research is going to continue. We're looking at: maybe we should try computing more complex features to train these algorithms, maybe we should focus on individual components, so looking at applications and the devices themselves, trying to figure out what structure of classifier we want to have these things going. So this work was supported by the Institute of Information and Communications Technology Promotion grant funded by the Korean government, and ETRI was a major contributor here. So if you guys have any questions, I'm welcome to answer them now. So right now we're taking a packet by packet, a little

Bro... generally used Bro before? Okay. So Bro will generate logs for each individual application, and so we're taking those and we're combining them together in a format where we have traffic over a period of time as well as individual packets, so a little bit of both. We're trying to get a variety of information over a period of time and the packets themselves. So yeah. Yes? Looking at that, I believe we're using R. I don't do a lot of the data analytics stuff, so looking at the machine learning, I am not a machine learning expert. I've been just learning it along the way, trying to bridge the gap between data analytics and security.

I believe we're using R to do this, and there's a whole bunch of different libraries. I know there's something called scikit-learn that I'm seeing a lot of, and there's something called Bro Analysis Tools that allows you to convert your raw logs into pandas DataFrames and then push them through scikit-learn. So those are some more words for you to look up if you're interested in doing this kind of thing.
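
As a rough illustration of the time-window features described in the talk (for example, many telnet logins from one source within ten seconds), here is a minimal Python sketch. The record layout, the field names `ts`, `src`, and `dst_port`, and the threshold of 20 are assumptions made for this example; they are not the team's actual Bro log schema or code.

```python
from collections import defaultdict

TELNET_PORT = 23
WINDOW_SECONDS = 10  # interval mentioned in the talk; tune as needed


def telnet_attempts_per_window(records, window=WINDOW_SECONDS):
    """Count telnet connection attempts per (source IP, time window).

    `records` is an iterable of dicts with hypothetical keys:
    ts (epoch seconds), src (source IP), dst_port (int).
    """
    counts = defaultdict(int)
    for rec in records:
        if rec["dst_port"] != TELNET_PORT:
            continue  # only telnet traffic contributes to this feature
        bucket = int(rec["ts"] // window)  # index of the fixed-size window
        counts[(rec["src"], bucket)] += 1
    return dict(counts)


def flag_brute_force(counts, threshold):
    """Return source IPs whose count in any single window reaches threshold."""
    return {src for (src, _bucket), n in counts.items() if n >= threshold}


# Made-up sample: one noisy source, one quiet source, one non-telnet flow.
sample = [{"ts": 100.0 + i * 0.1, "src": "10.0.0.5", "dst_port": 23}
          for i in range(50)]
sample.append({"ts": 101.0, "src": "10.0.0.9", "dst_port": 23})
sample.append({"ts": 102.0, "src": "10.0.0.9", "dst_port": 80})

counts = telnet_attempts_per_window(sample)
suspects = flag_brute_force(counts, threshold=20)
print(suspects)
```

The per-window count could be used directly as a simple alert or added as one numeric column in the feature set handed to a classifier, which matches the talk's point that features should be numerical or enumerable.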

Okay, all righty, cool. [Applause]
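
The decision step described in the talk, where a verdict plus a confidence value either produces a firewall rule or does not, and the user can later revert a rule, could look roughly like the following. This is only a sketch: the `Verdict` and `RuleStore` names, the 0.95 threshold, and the rule text format are invented for illustration and are not the project's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    src_ip: str
    malicious: bool
    confidence: float  # 0.0 to 1.0, as reported by a classifier


@dataclass
class RuleStore:
    """Hypothetical stand-in for the talk's 'firewall database'."""
    threshold: float = 0.95  # assumed cutoff, not a value from the talk
    rules: dict = field(default_factory=dict)

    def consider(self, verdict):
        """Add a block rule only for confident malicious verdicts."""
        if verdict.malicious and verdict.confidence >= self.threshold:
            rule = f"block from {verdict.src_ip}"  # illustrative rule text
            self.rules[verdict.src_ip] = rule
            return rule
        return None

    def revert(self, src_ip):
        """Let the user undo a rule that turned out to be a false positive."""
        return self.rules.pop(src_ip, None) is not None


store = RuleStore()
store.consider(Verdict("203.0.113.7", True, 0.99))   # confident: rule added
store.consider(Verdict("198.51.100.2", True, 0.60))  # not confident: ignored
store.consider(Verdict("192.0.2.10", False, 0.99))   # benign: ignored
print(sorted(store.rules))
store.revert("203.0.113.7")                          # user reverts the rule
print(sorted(store.rules))
```

Keeping the threshold as a single field makes the trade-off explicit: a higher cutoff means fewer automatic blocks and fewer false positives for the user to revert.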