PG - Using Machine Learning to Support Information Security - Alex Pinto

Name: PG - Using Machine Learning to Support Information Security - Alex Pinto
Uploaded: 2017-01-20
Duration: 1 h 12 s
Description: PG - Using Machine Learning to Support Information Security - Alex Pinto Proving Ground BSidesLV 2013 - Tuscany Hotel - August 01, 2013

BSides Las Vegas1:00:12120 viewsPublished 2017-01Watch on YouTube ↗

About this talk

PG - Using Machine Learning to Support Information Security - Alex Pinto Proving Ground BSidesLV 2013 - Tuscany Hotel - August 01, 2013

Show transcript [en]

okay all right we good to go okay guys thank you very much for being here my name is Alex and I'm going to try to give you like the the most painless journey I can I can put you guys through around machine learning and what that could potentially mean for us information security F and uh this is actually a topic that's quite dear to me it will become clear very soon why I I actually delve into this and uh but before I actually start going I I really want to take these sides and our sponsors and especially my very now good friend Joel sitting here which helped me make sure this was appropriate I guess it's the it's the right word so

just before we start okay I have some important warnings here number one this is not about a heck all right I'm defending stuff I'm not attacking so if you're looking for vanity hats you know I just cracked the root key of Android and let fix the next week this is not a talk for you okay so you might as well just go find another place all right and we're actually trying to build something here okay and this is and uh one of the things I try to go through a lot in this presentation is to try to explain why this could be significant we by Nature we are paranoid we are suspicious of different things especially if they come

with a very big marketing High on top of them and this oh my God does this have marketing high so just want to make sure this is kind of an introduction so we can get start getting the conversation though and there's a lot of math okay not so much as you know people would start screaming and actually coming here and hitting me but should be all right all right you can just I I'll take any questions about anything later on so let's get into it so why do I care I guess that's that's an important thing so uh uh I've been doing this for like 12 years information security the reason why you've never heard of me is because

I'm from Brazil so uh most of what I did was there so I used to I used to actually run a fairly large of security consultancy there and uh I got to do a little bit of everything although if you ask me to do it here right now it would probably suck because it's been a while but anyway for seven years or so what my day job has really been is about running a sock all right building the team setting up those fancy s dashboards for customers and things like that and man it's very painful so anyone here who's had experience with that okay yeah a few hands I see so you know how it this [ __ ]

does not work right and I I touch a little bit into that and I'll touch a little bit about that uh on another presentation I'm doing because it's it's it's an interesting point at the end I can skip the the the mid one is that this is actually my first presentation at a real in conference so thank you very much guys for attending have feel very you're very supportive but uh it's only think a few hours because 5:00 pm I'm actually presenting like a giga version of this in in black hand so it's actually about uh the algorithm that I I put together which I'm going to touch a little bit here not going to go to the mod detail

but anyway so did a lot this [ __ ] got CED a lot by Sim Solutions and starting this year I started researching a lot of machine learning and uh things related to data science which we're going to touch it as well and uh participa on some competition I mean I'm not a master by any chance I probably making some very R mistakes in order to put together but it seems to work fine so I thought might as well share so okay here's what we're going to talk about first of all we're going to talk about the elephant in the room Big Data all right going to get that out of the way understand what

it is what it isn't and then we can start actually talking about machine learning a little bit about how it works or how it should work for you and what does it have to do in infac and a little bit about my project how you guys can get started if you're interested into it and some takeaways okay so let's get to it so what is this big data thing right and uh I think this is a slide that will actually give you uh uh a good overview of what people usually understand as data so when we're talking about Big Data what they should be talking about I mean 95% of the time is a right which is

an open source way for you to put together large amounts of of data in in this uh distributed F system it's called hdfs and then do a lot of crazy stuff with it uh which you probably heard of map reduce which would be equivalent to the assembly of big data so it's it's a it's actually API job which is pretty fun to call it but uh so that's that's the the the building blocks are where people build the rest of the stuff okay so when people talk about Big Data what they should be addressing in general is some part of this and you see there's all different things you can get blog colaps from the thing called FL this is

very much based on pletta which is one of the the leading guys they are like the head hat was to limits they're kind of what that is what it is to do in a sense so what want to point out here is that there is a machine learning component here so it's a part of it okay it's important to be in mind you don't have to do big data to do machine learning whatever I what I did here I didn't touch a single line or a single configuration of Ado I surely will in the future but not as of now but anyway this is like the cleancut view all right this is pretty much what you guys should

be wrapping your head around but then we have this problem right because everybody comes forward and they sudden they they're big data guy and they're doing all sorts of different things so you of course you have the based Technologies here okay but even Oracle will tell you even though they they don't really really touch a d they'll say that their original relational databases are actually Big Data just because you can run it on a little five machine cler come on that's that's lockal right when you talk about some of these guys they're putting like I know thousands and thousands of machines on do clusters and running all sorts of ghip and we'll get into this ghip which

is actually quite interesting so there's actually some guys here we actually do a little bit of that for log application so there know way they're trading this I I'll I'll get a little bit into that as well but uh it's very very interesting so a funny Point did you guys hear of you guys probably heard of new sequel right so we have a funny there's a funny joke usually usually in San Francisco Silicon Valley it's like yeah I know you know about no se because we currently have like 250 no no wait 251 compan who are doing this now so that's that's pretty much I mean this is crazy and all of them will have their marketing

message their hype everyone will try to convince with their doing big data but at the end of the day it's more about not how you store it or how you use it sorry not how you store it or or which kind of capabilities you have around that but how you actually use it and this is what I'm trying to do all right so let's start with the idea of machine learning and and uh the point here is that you don't actually it least sounds kind of fun especially for for some of the developers in the room you don't actually write the program you write a program that will be able to learn what it should do from the DAT that you that

you feed it so we'll have some examples around that but it's it's it's through repetition really okay so if you are you mean you're like a baby you just born and then you your parents okay this is a chair right right and then the first if the first time you see a chair okay a chair has to be exactly like that has to have this metal frame has to have this this custom uh whatever stuffing and uh but then you see a lot of different chairs a lot of different malls a lot of different things suddenly you can generalize where a chair is okay every time you see a chair again okay that's a chair I got it I don't need to that that

no that's that's move into tables or something like that so his pretty much much what we're trying to do here much what what these machine learning product are trying to do then some applications here all right so I mean you guys all seen this right you go to Amazon and it's some it's kind of guesses right what you would like to buy or or or what you'd like to do and the way it does this it's actually a method called collaborative filtering because it's so it figures out it knows everything you've bought and it knows everything everyone has bought in Amazon so they can pretty much much uh see measure up who are you similar to okay so you

bought all of these hacking books I got a lot of people there who actually buy hacking books as well and you know what they bought this one that you didn't so I'm pretty sure you're going to buy want to buy this one as well so it's a huge data mining effort all right they're Crossing it's like a think about a billion billion degree Matrix right on each on every direction they're just calculating that [ __ ] and they they can tell you you okay you want to buy this book this has got a lot of press for a few times it's the The High Frequency training stuff okay so actually a lot of the high a lot of the training nowadays

is done by algorithms that's not actually people doing it and uh it's this is actually a picture of when it crashed not long ago so they got confused and they dropped it like almost 10 points and they they oh wait wa wait let back up that's not really the way we should be doing it I don't know he found a local minimum in the function and it started selling everything and everybody's oh my God he's selling let's sell everything so anyway it got back up but this is this is a little bit of the barrels and which iend to tell a little bit about but anyway it's it wouldn't be able to be done without it you wouldn't

be able to done it in that previous but sometimes this kind of things happen and another cool one here this is actually a a project from Google which they recently put together like I don't know 10 10,000 machines 16,000 machines to uh try to figure out which pictures of the internet were of cats they really know Their audience right so they we try to automate a learning system to identify cat pictures and it was scaringly successful uh according to them it's actually it touches on a new field called Deep learning which should we should be hearing a lot on on the next few uh years or so but anyway this is what they do what do we do

but what do we should be doing right so when you talk about fraud detection systems they're usually uh in a way or using some aspects of it some of them only do like statistical analysis uh but most of them will have something like they will try to figure out what is your behavior so this is what you usually do in this in this region of space so to speak so if you do something that's too different from that they will flag this as an alert it's it's called it's called a clustering algorithm I'll touch this a little bit further as well uh there was some people who claimed to do some sort of artificial intelligence on stuff like

Network anomaly Behavior Uh sorry Network anomaly detection and uh it's much more like statistical analysis they were trying to do this clustering thing but they would always hit on the problem that uh what is normal with the network and all of us who have to do Sim implementation and had to do some sort of baselining in order to figure out what was the X we had to put in that role uh in order to okay up to this it's fine after this I want to be alerted that's exactly the same problem that these things hit uh uh from time to time the point I'm trying to make here is not that these things are terrible they actually help when

you're trying to do this sort of s configuration work but they're not really really machine learning so you just just have to be careful with what you're being sold all right uh the the bottom one here is actually one of the things that I'm trying to do is to predicting the likelihood of someone attacking you if they come from a specific region of the internet or they come from a specific uh you have some specific uh information about them so you can better I mean you're never going to predict who's attacking you right that doesn't it doesn't work like that but you can start predicting that if my attack came from this side okay I might

want to pay more attention because I've seen bad stuff happening from this side than if someone comes with this it's all a triage exercise and uh the more predictive models we can build around this and it it's kind of a trial by trial by everything and it's a little bit of water that we trying to develop we can increase the confidence in each step we may get to a point where it actually makes sense to tell a fireal to block someone because we've seen so much bad behavior in so many different layers of the network that it actually makes sense we have confidence enough and that's a word that they use a lot um that this this should be blocked and

finally our good F pan filters and uh I mean if you ever heard of the basian filters that's it you're right there so it's it's it's that's actually the name of a of a basian algorithm which I'm actually going to touch a little and uh anyway that's that's that's car on on that so how to do it there's ve two very interesting important points first of all you have to mind the data that you have so uh for for my example so I had a bunch of log data okay and I starty to extract features out of it so how many times did a specific IP address happens over several days how many times does a specific uh uh sl24 netlock

appears and then I can start putting figures and numbers to it I can actually start counting is actually when people call about feature engineering and it's a big word that gets thrown around in machine learning it's actually about counting stuff okay it's very fancy word for counting but then you have to try to count things on on ways that make sense to the problem that you're trying to solve so this is actually one of the biggest problems there and every every every algorithm development will have this wall and making sure that the dat is good enough for you to use okay and then bad data can mean data that you can parse properly or data that I mean I've

been trying to for instance I've been trying to ingest a lot of stuff from Sims and uh most of the time they they can't par they they can't tell me if this was a firewall block or a firewall out I mean simple things like that and uh you and we keep Reinventing the wheel adding new parts but anyway if you ask anyone they'll tell you that 80% is actually cleaning up the data and making sure that it works it's usable for you in any any shape and form and about when I say exploring the space is that really depending on what you're trying to do there are like dozens of algorithms there which do things in very different ways and uh you

really have to stress it out and try out different ideas and different things and you might come up with an algorithm that will actually give you a good result give you a good feedback on what whatever you're trying to do okay so deling a little different okay when we talk about machine learning we got two different guys we got what we call supervised learning which is people that we tell the computer what these things are so I'll give a chair to the computer I'll tell you this is a chair I'll give a table to the computer and I'll tell you this is a table all right if I give enough chairs and tables it will

probably be able to tell them apart which is what the classification problem is so it's is this a or b or it could be a multi-label classification as well but let's try to keep it a little bit simple now or a regression uh problem which is okay maybe I got a table with uh with three uh sorry I have like a table with three uh legs thank you very much see that's what needed not best to you and uh you have like uh small uh benches right only three letter as well okay that's weird so kind of this kind of looks like a table a smallish table all right but so okay so if from 1 to zero how much of

this is this a table or how much of it is this a chair that's what the regression problem is trying to it's trying to solve so you're really trying to get a measure from these uh classes so to speak that you have these labels where where this is so they both have a lot of interesting uses so I'm naming some algorithms here so Neo networks classification this support Vector Machin we'll talk a little bit more about that naive Bas which is our span filter hero uh linear logistic regression which pretty much fitting a line I'll talk to that as well and you go to supervised learning supervised learning you actually don't know anything so you're just I'm just giving

chairs and tables to the computer I have no idea if it's a chair or table and I'm hoping that the computer will be able to tell them apart all right so the computer will look at this okay these are actually different okay so I'm not to sure what this is I'm not to sure what that is but I can tell they're different and then a human can come on oh this is a chair or right chair this is a table and then you can start doing the the other kind of stuff when we talk about what uh fraud detection okay it's actually trying to Cluster uh all the behaviors of the users okay it will find uh a lot of

users that spend very little tickets uh not so often all spend very High tickets not so of so you have all sorts of different classifications nobody really knows how many different uh uh and this is what this Cas stands for right it's how many clusters you're creating uh this fraud detection things will do I mean sometimes it's even a secret sauce for them uh but uh then if you're classified into one of these groups if what you start to do is too far away from where you are from where you should be that would be flat the Frog and finally uh there's this decomposition uh Solutions which are usually help with the data mining and the exploring the

space plan so if you actually have a bunch of different uh features all right you actually know a lot of stuff about the chairs about the tables but you want to try to figure out what are the most relevant uh features what would explain better what a table is what would explain better what a chair is so I mean using the tables example with probably height you know would be a very good uh uh one for chairs and tables so if it's y High it's probably a table no but it's it's probably a chair but I mean there's a lot of other different things that have to be considered so this is all this is

when you this is when you got too much stuff on your hands it's not running properly you don't have enough time so you just try to simplify it and then be able to to run your more anyway this is fast uh we when we talk about naive base which is what the fan filters do it's actually trying to it it kind of guesses okay uh how what is the chance that a message is Spam and and this at the end this doesn't really matter a lot uh if if you the idea is here if you get a message uh How likely is it to be spam okay so if you look at uh those me oh 80% of the messages you

sp okay don't care about that let's wa 50/50 all right so uh then you start calculating okay if I get a word uh like Viagra on an email all right and uh How likely is a is a is a word is a is is uh an IM with the world Viagra to is supposed to be span okay and you use and you do that for all badw like enlargement and things like that and you start running this for all different words so what is the chance of a male having this word being spam and what are the chances of of an emale have dis word not being spam so when you run this enough times for every single word you

have okay and they call it naive here because it actually think believes that doesn't believe that there's any correlation between the words so it it doesn't think like uh enlargement should always come after penises you know so it just thinks it's completely random words appear at random and if the words are there okay that's a span and that's not a span so obviously there's a lot of optimizations to this that people do they use several words but but that's not the point and uh so they just run this uh this model here which actually calculates the times this word appears as a span times that what we know as spam appears uh uh and the time that we

actually get this word at all so this is time the words is a span times the probability of it being a span time the words is not a SP span and times the probability of it not being a Spam so this actually means every time the word appears right so then you keep running this every time you run this you're a little bit more confident or less confident that something is a span and the reason I wanted to put this into here is because um this has been going around for ages now okay I found a paper for Microsoft research from 98 and uh We've implemented this and um I don't know about you guys

I don't see a lot of new talks about spend it's it's almost like the problem has been solved all right of course I'm not talking about fishing spam so fishing spans will be something that will be exactly like what a ham would look like and that will get through and people will click and all all the bad things I'm not talking about the all the the crazy stuff you sign up on the Internet okay that will be actually have triggers all the stand filters for not go going through to you to actually go through to you and bypass those stand filters but it's interesting right we got a very high efficacy in this right now if you

look at gmail I don't know I mean I've had the Gmail ACR since 2014 right I don't get spam anymore I mean arguably they are the ones with the largest data set okay and also they have a lot of awesome in-house teams who really understand how this and all the algorithms work but when you think about that I think it's it's a very unappreciated part of information security here where this actually these concept have been using heavily for ages and ages and ages anyway moving on just wanted to throw this out here I don't want to get too heavy uh on this but uh one of the simplest examples is what's called linear regression so the idea is

you got some stuff you know okay I have some uh the linear combination of of some variables so an example I would give is that um I think that paternities do a lot all right so they try to if they they have the ultrasound and they they can pretty much figure out what's the size of the baby what's its height okay so based on the height and how many weeks of the gestation you have you might want to guess okay or infer what would be the the weight of the baby once it's born all right so actually they're linear right the bigger the baby intuition it makes sense as well right taller the baby heavier it should be the longer it

is gestation yeah it will get stronger allegedly yeah anyway I'm talking about um non disease related stuff so you can actually get all these observations you have plot them into different points so I I mentioned something with two variables so let's think of one variable for now because it's much easier to see so think about the height you plot all of this and then you try to find the line that actually messes up less okay so you're trying to find if you if you drew a line for this okay what is the minimum error I can get on this which is pretty much the distance between the points you measure and what the the the

and what the the line will do so this is a very classic uh regression model everybody does this for almost everything this is every this is really an everyday uh application that it's I mean it's just everywhere and if you if you need to know something to impress your friends you know like you're drinking the bar you want to talk math this is the one you want to talk about okay because nobody nobody can get this wrong this is actually a this is actually this is a very cool blog if you're into this and the guy actually has a lot of uh visualizations of what these uh algorithms look like and it actually so this is would be the

probability of where the line should be okay based on based on on the on the error that the error that I mentioned before and finally and I promise this is the last one for this this this is my favorite one I love this ship okay so it's called Uh support Vector machines and it's especially good for classification problems okay so you have a lot of numeric features okay you've calculated a lot of likelihoods a lot of probabilities of things and then you want to get a yay or nay or a zero or one or whatever you you want to do so what I really like so thing to about that is that okay these are two valid

ways and I don't know if the colors is good but you can tell that this color is this color right so these are all valid ways of separating those these surface right this put all the blue guys here and all guess guys here but this is a much better one because the I have kind of confidence B there I haven't seen anyone uh uh that could throw me away on this region so I'm trying to maximize this region Arrow here that you can see so that I mean since I don't know everything and I always you always work with limited data you're trying to make it the most better prediction is possible the the thing this awesome is that it's actually

it does a mathematical trick because you have you're going to have a surface like this right you can't really throw a line through that you got like a very squiggly thing that's going around and uh so the math behind this okay there is going to be some Dimension somewhere where I can just draw a plane or a hyperdimensional plane whatever whatever suits your your um math Necessities um that this will be linear separable so I I'll be able to try to get a plane equation and uh they do the calculations and they okay this is true okay there is a plane there is a space where this happens but we don't really know what it is we Pro it's probably

infinite Dimensions okay so you do a calculation in an infinite Dimension that you don't really know what it is okay nobody knows if it exists or not but then you can run this and run it back and You' got it separated here it's like this is crazy this is really really crazy it's really exciting I don't claim to understand it but the idea that I'm doing calculations on Ann Dimension I don't know that and the computer is doing it for me I like that I like that ific pH but you probably right anyway what's the point okay okay why why should we care right what is the what is the RO of infoset here and uh

anyway I saw a few hands earlier guys the similog mon stuff we have is actually just vertical data warehousing tools from the '90s you know where Callum databases were never invented for some reason I mean actually no they just coming up with that sorry I'm being I'm being evil so we are really banging our heads against a problem that has been solved again and again by other Industries uh using a very s par okay we cannot we don't have the speed we don't have the agility we don't have the capability to actually run the Bri that we want okay and uh my point here is that we got a lot of logs we're getting more and more each day all the vendors

are starting to push uh Big Data uh stuff with their with their uh specific tools and when I say big data I mean like do like storage okay some of them are it's starting to dable into analytics but I it's still in its infancy there's going to be a time where we are not going to be able to deal with all this okay it's too much for analyst I I believe it's already too much it's been too much for a while now okay and this is only going to get worse and worse and we need help we need help from machine learning we need help from data analyst we need help from data scientists and uh that's that's pretty

much the place where I'm coming from and uh so just to make sure that we we understand this perfectly this is what a data scientist should be okay so the the most important thing and this guy this is a this is actually a very cool guy uh that I met there in the Bay Area and he he he has this this SP right which is the guy is better than statistics than software engineer and better than software engineer than any statistician and the thing is you actually got to do a lot of stuff you know to be able to to do this right oh so just there who you're wondering hacking skills means development in startup speak right they

they totally wrongly appropriated that but we got to roll with that they they're more numerous than we are so there's nothing we can do that so I mean if you have the discal knowledge and this is a tough one this was this was one I had to really go uh learn a lot you know to do stupid stuff uh and and you know what you're doing you're just a regular scientist right if you if you can program and you know what you're doing yeah you you're just going to create some Killer Robots here machine learning let's let's do it but I really love the fact that if you know if you know the field you are you know and you

do have some programming skills you are totally on the danger zone because you're going to mess up all the different things all the statistic significance that you need to have in order to have 100 so there's I didn't have time to put it here but there's a very cool skd about that about people trying stuff on jelly beans and they try so many times if they find like an out leer and they would find anyway so that's pretty much the the problem that we have but it Al could be double P rows if we get to all of this and uh this is also an important consideration and this is the I think this is the point that I I think this is

the one of the biggest equations to the information security ground is that um and it ties into the statistical significance I was talking about you have to be aware that whatever data you're getting there there potentially is some bias okay even if it's a self- selection bias where you've asked people to volunteer data okay this is the kind of data that I get from people who actually volunteer at this so you might if you're not trying to generalize or not trying to be cognition to that you might make some some rooking M stakes and the thing that I find is most uh important is actually the adversarial aspect of information security so once we get more of these uh models and once

people start developing this people will find ways to supper that so this is something that you also guys have to have in mind um and but other than that if you if you take those two things into consideration the more data you have and this is an error measure this is the error measure on much the data that you have and this is the error measure on the data that you've never seen before you know you you tend to have a go to what would be the optimum here you're never going to get to 100% okay that's not what these things are bu for okay you can never maybe if you saw the whole data set in the world which you're not

um you might get closer to that but anytime someone tells you that they have like a 100% confidence model just run away from them very fast okay because they're doing something M on so anyway and of course this is the second point I was trying to give and I I think an important magnitude about this is the way that spam engines involved so once people started uh uh getting this right and uh the the the beach F started working a little bit well so spam emails started pasting The Works of Shakespeare at the end because the number of words there that actually made sense that could potentially be on an email they actually would offset the

model on the sense of what should be of what should be gathered or not uh from the span side so I mean long time has passed they we all got over that but uh this is the kind of thing people will constantly try to test it and try to make to make sense on how to break this un this is what we do right this is this there's a lot a lot of our culture in the hacking the real hacking Community it's about actually trying to for this but I think this is this this this will get a good R so anyway so just to tell you a little bit of what I've been doing so after all

this very very long introduction what machine learning is uh I've been researching models and ways to uh make uh create some projected models that will have information security I'm sick and tired of looking through logs you know I'd rather have a computer look through logs okay tell me what the pain points are and then I can do some honesty goodness instant responsib okay and oh my God what about the tier one jobs no they can learn the good stuff okay nobody wants to do tier one work okay especially at midnight okay they will miss stuff they will sleep you know they will go off for coffee you know just let's try to gain some consistency

here okay that's that's pretty much what I'm trying to go for so if you go to the website you can talk to me you can me that I'm complely crazy or you can sign up you know and we can find a way that we can share some logs and I can use this logs to a few the model okay so just to give you a very quick summary of what I'm talking uh in black hat uh what I did was based it based on some I mean I'm not trying to do behavioral analysis here I'm not trying to count stuff I'm not trying to infer what's good what's bad no no no no no let's start very low

okay let's get all everything that's blatantly bad all right this guy is scanning your firewall he can't get much better much worse than that you know so we I mean all the intelligence Community when they try to make inferences on I don't know threats to the nation or something like that they go from much much less than that you know we have actually people trying to break in our doors all the time if you we can't tag these guys as relationss I mean there's something very wrong here so let's get what's blatantly bad okay and then I went to to D Shield okay the the sense program it's actually awesome they take fire logs all over the world although I

do suspect it's mostly us Centric from the results I got uh and they create some summaries for you top 10 ports top 10 attackers and things like that and if you're very nice they will give you the bul data so I just mined that for six months okay and uh what I got okay when you do the math and I won't go through the math here is that there's a false positive ratio there's a false negative so I gets the wrong both ways and that's expected but at the end of the day all right if you look at the result what I'm telling you to look at should be 13 to 18% more likely to be attacking you than

otherwise okay I wouldn't go blocking stuff on that still because it's just this is the the trouble about statistics right I mean what is the likelihood of someone in the internet is out to attack you if you ask someone from this room oh everyone is all together but really okay so there's a number there we don't know it's called the prior in in many situations it's one of the the things of of basian uh principle so but if you are in a in a position where you are uh looking through this you are looking through laws okay you are in a in an analyst position you got so much stuff to look at you can't possibly eyeball

all of that all right so if you had a a okay that created a separate screen for you okay guys this this is like if you only have time to look 25 look this 25 Lev okay because they they seem to be they seem to be doing the most noisy I I see some I comp that and that could potentially help us move forward and scale is more most important the amount of stuff we have to monitor okay so blah blah blah Internet of Things y TV6 let a lot more grasses uh screw sorry we have no chance to help in we need help this this is what I'm trying to do so this is this is one of

the visualizations I'm using so I don't know if you guys ever seen this this is called the Hillwood curve this is this is my attempt to plot the internet okay so what actually looks like is that if you go it like this like the first fourth of the internet and then it goes here here and here actually the way this works it's like does something like this it's you know it's trying to feel this but what this construct does is that it's trying to uh preserve uh closeness so it's the best it's it's kind of like a map okay when you see a map on the screen it's not your round and just put it on a flat surface so this is a map

that's trying to preserve closedness so you actually get here the the sl8 and the slash 16s or 24s the closest that we can get to each other and um so this is a very simple statistic I'm just counting how many times I'm getting blocked uh 22 uh 42 blocks from fireal and this is like almost 7 months of data guys this is is not Rand you can definitely see some hot spots oh by the way we're here we this is uh 24 SL sh the resolution is I mean it's not a very big resolution here it's it's actually quite a lot of the addresses but anyway we can see some guys we love here you know I make sure

to point Brazil here so that people wouldn't think like I was being bad to other countries but you see some PR stuff going here it's it's quite more breu than what's going on around and this is the kind of behavior that we we try to analyze and then we try to inforce some solutions so this is an idea this is this is another visualization on the same uh vein of uh what the the the algorithm is doing now so what I did here is BAS this is also 422 right so just to make sure we keep on the same on the same example so uh I got model trained so this is this is actually the the the model from trained

on that same day and I just randomly picked 100,000 IP addresses from each uh uh SL a each class A okay and then I asked the algorith okay predict this for me do you think these guys would be out to attack me and then again always remember uh this is from the frame of reference of the guys who are submitting lws of sense okay this doesn't mean anything uh or me actually means little if you are not if you are not of the data that we mind there's actually a transference uh that you can get but the more data you have the transferences should go up we should get a a model that works fairly well for everyone but

uh the brightest the tile here you know more likely are this guys out to that to cat you so what I see more likely here like pretty much 100 100K okay who got flagged us who got flagged as as positive and when I say positive I mean malicious so what's the L likelihood here of there being a malicious on this on this sample so you get from 10 Theus 4 to 10 Theus 7 here so the brightest guys here they're actually almost a 100 times more likely for you to see an attack from these guys or let me rephrase that an attack from these guys could a PO a packet from these guys could be interpreted as malicious okay

than uh someone who is in the darker times so anyway I I actually struggled a lot in how to try to demonstrate this because this I mean this is a very big scale it's very hard but I hope I can I I mean of course there will be a lot but if you guys want some questions so and it also those fun things because I was asking it I I try so one of the dry runs I did was in Port a0 I had a lot of stuff in Port a0 right so okay tell me robot who are the bad guys Microsoft and Google knows that [ __ ] right actually no right actually since what I'm what I'm mining is blocks

and port a okay it's very easy to see that these guys will always do uh the Google Bots and and the the B Bots they willon be they actually don't care if there a website yet they will just starty to connect 48 and that's it but uh this is the this is the point one of the points I'm trying to make here is about the the the data scientist diagram where we were talking about substantive expertise all right if we get this in front of a traditional scientists they would like no no no this means they're bad the data doesn't lie or something like that no no no you actually just need to have a different view you need

to create some exceptions and again it ties back to the data mining and the in a way the making sure that you understand what the data is telling you so that you can extract some good some good results with this anyway sounds cool how to get started all right uh and I I apologize you do have to program there are a few uh like websites that let you do machine learning but being honestly they're like so clunky and you have to to know almost as much as you would if you were programming I personally think they're they're kind of a waste of time you're better off trying to learn a language to do that a lot of you will be familiar

with python so python has the pandas uh uh like library for uh for you creating like data frames and and stuff like you you can represent like like you would in an Excel table in a way badly comparing and the psychic learn which I showed in the previous slide which is actually their machine learning dat machine learning library which is I mean most popular one I guess I should be should be honest about that and I personally enjoy art a lot that's where I learned my stuff okay and it's a lot slower than to be on python but I know some kind of personal really want to talk about uh anyway you got to learn the theory okay

but nowadays the internet is an amazing okay a lot of interesting courses in corera okay there's also some paid stuff from universities you will get uh I mean traditional education online educ traditional college education online for very Hefty price I'll say but you can get to to interface with this highly you know uh uh Rec professors in these fields and of course you have to practice okay it's not just about being a book so C is a website where machine learn is pretty much fight to the death to see who comes up with the best model to solve a specific problem and uh we had there's a lot of other competitions so skd here which is a very famous one

it's not specific about security but it's about machine learning visualization in general the vest challenge you might have heard of this before they actually the vest challenge every year they have an infoset kind of computation where you either have to mine something to come up to conclusion or come up with a visualization that you help to see what's going on and tally VAP okay so I love I love database I wish I I I had I'll definitely something that I'll be researching more uh going forward and they do some very very cool stuff trying to visualize of that again it's another way to help the analyst in way anyway so what do I want you guys to remember

uh the most important thing is that big data is just but it is right it is going to be a reality it is going to be something that we're especially if you're doing defense especially if you if you are in a largish when I say largish because every single moment Pop shop now will have a log uh management solution because of PCI this is such uh untack potential there to help themselves on what they they they could be doing uh I really do believe that this will come okay and it should be I don't think it will come soon enough I actually uh my personal belief is that no big vendor will be talking about this

I mean we be talking about this but they won't actually Implement anything for like I know two or 3 years or so and we're going to be left fending for ourselves you know everybody's going to like burn out and lose all their hair because they can't make sense of all the data and then this will start get to I don't want to wait anymore okay I'm trying to learn this and I'm trying to make sure that's something that we can we can use as of now or as soon as possible uh anyway learn your daily stuff man everybody this is the as they say it's the sexiest job in the 21st century the data scientist okay and you

think about it's actually much better than statistician right only if accountants call themselves like money scientists man [ __ ] rock stars I mean and um anyway check it out it's interesting check out my talk if you have the time so I'm I'm tonight tonight 5:00 p.m. at blackhead now I'm going to be in dcon on Sunday if any of you survive the paries on Saturday okay and uh but most importantly there's one takeway that I want you guys can have from this thought is that this is what machine going all right and it's a robot learn by itself but again guys I don't want to add to the H here this is not a FIA have to know what

you're doing most of all okay so I've I I've been doing I've been doing this research for less than six months okay I still think that I think there's a lot of potential there for commun to okay for from the little that I did and I mean uh IP addresses that's like the lowest bottom Fe you know in in in IR but still we get some confidence we might not be wor the really real Packers or something like that but think about all the little guys who don't have the money or don't have the expertise or don't have the time okay to actually have a very good inst uh uh is SCH I don't I it just feel like I've

made so much money not so much money but there's so much money being made in consultancy right where people like just sell Sim and then configure all this [ __ ] okay do not work let's buy another one and configure everything I know I felt like okay you got it let's finish this right I I can't do this anymore and uh but anyway I really want to emphasize that especially the danger part you really have to learn this okay you really have to look for forward to to get some more technical uh basis on this because otherwise this happens okay we don't want that to happen all right most important anyway that's it guys that's

all I

have questions yes please so uh Sim versus machine language or something we spent hundreds of thousands of dollars on apps there on Capital on Consultants that make them they work etc there and have not gotten a lot of value out of most of those appointments and now we're looking at Big Data type of scales there and my concern is that we're kind of Reinventing s because in order to do the types of data analytics that we want to do we have to deploy collection agents you know Flume and so forth there we have to do parsing all over again to figure out how to get the elements out there we're basically Reinventing the whole s stack just to get the

foundational level where you say okay now I actually have data that can feed into my M jobs my art machine Lang there and I'm concerned about the the level of investment that we're going to have to do on an individual company basis just to get to a basic level of functionality are there solutions to that well I wish no I mean you you you tou you touched on a point that I also believe very strongly I I personally believe that what's going to happen is that in a year from now now these big data products Security prods will be released a lot of people will spend money on them they will be frustrated as much as they earn sin maybe even more in

a larger scale of f preservation you know and they will just I mean it will get as much as a bad r as okay I don't really think it's act I don't think Sims are actable without some sort of assist ma okay and uh the Big L stuff certainly is not going to be acable what I would suggest Gest okay and um as as painful as as might might seem to me for it to come out of my mouth I would just stick with the big guys from now in the sense that hopefully they will have a migration path where you have your where you you have your sim solution uh running on traditional relation

databases okay and then you can you can there's a lot of guys in the cloud talking up sorry okay uh there's a lot of guys poing out uh that actually started out as like a like that mostly in the 12 so that can be an option so um on his point most of the especially bigger they already have connectors talk directly to hdfs yeah but again that doesn't mean anything but if you but if you can get your correlated data into hdfs that's it's already been decorated so you kind have an idea of what you're going to look for then then that gives you a starting point for for your ml so it's connectors from you've already gotten

you're already ingesting to your sim you know there's no reason for you to start sending to your sim and so that you have a my big my biggest yeah I'm totally with you there my biggest concern where I'm coming from is that not a lot of people will have this uh capability so I mean a few of you guys in the room undoubtedly will and uh they will get you know HFS up and running those big scripts on it all kind of chy stuff uh but I mean are we going back to the wall your Solutions all all back again okay let's scrap everything out let me run some M summarizing scripts here that's where we're moving that to

okay realistically if we're not talking if we're not having the same conversation about ingesting more data with automation of this data we're just going back and everybody's going to be unhappy again I'm sorry so you had a question I was going to ask you about the data quality we started looking at this about a year ago one of the biggest problems theog quality of it and have you given any thought to trying to provide input or feedback is how Le to what actually gets put on a long what to applications have a lot of flexibility regards to what they throw along and is there a way to increase the data quality to allow for better results

the the way I'm approaching this okay and uh you have to bear in mind that this of course this is a working progress so all the data I got from s oh sorry all the data I got from s was already relatively clean okay there I mean a lot of missing uh Missing uh values okay they had like like it's supposed to be outside Firs a lot of uh T dot stuff there you know but anyway you just have to clean this up but in general uh the the the overall uh framework we trying to create is that if you have an actor okay so the law has to have an actor someone is doing something

okay that that's the first part okay a lot of long entries you can't really tell who did it maybe it was on the log before something like that uh if you can somehow uh first of all uh identify uh a behavior from this law okay this is a law that actually tells me something someone logged in an application someone used this role specific he had in an application in order to I know es not escalate the Privileges it would be it would be allowed but he accessed this page because of this so you start to get actors you start to get behaviors that you would like to predict or not okay and then the hardest part comes which is

since you have those actors how can you cluster them in a way that you can start so for instance on on on this IP address uh example I used uh the relative position of them on the internet as a we to plus so that's that's pretty much the the the point there so it actually propagates so are people from the people from the the marketing department more likely to have this or that role that would make sense so if someone shows up of that Ro that's not from this groupy that could potentially be flagged as as a red flag on so that's that's the way I'm seeing them okay and of course this like I said I tried to make sure I got I

got to the ground Flor okay so I I I really couldn't be uh I wasn't trying to get stuff is this bad or not no no no let's go to the most blatantly bad so that we can start this so maybe if you if there's something you're looking for and you can identify that of logs you can find the actors and you can in a way cluster them or make sense of what are the relationships between those actors you can potentially use something much we've just seen that there certain vendors name as far as the Ling they give you this components that is we're trying to CL and some vable kind of things those S of

algorithms and it's like if only this vendor would give us this additional information we would clear I can only tell you what I'm doing I'm I'm doing really one step at a time so I'm very confident this is good for f logs or IBS logs and things like that these are I have IP addresses and I can L find that stuff okay so then we'll I get to maybe the harder questions and it's all about the questions we just have to to ask bigger questions and older questions and then start mining the [ __ ] out of bit and know how much time sleepless nights and know to try to find something that fits anyone else yes

please uh the black hat talk is at 5:00 p.m. uh today yeah 5m. what is it yeah good good point so there's a little bit of production I bash in Sims a lot more at the start uh yeah just and uh I talk just a tiny bit about machine learning just as much as to get to people don't think I'm utterly crazy about the results how are you doing this and then it's more than half of the presentation it's about the specific method on how I mine the logs how I compose the data what would the intermediate results so what do the features look like when they ploted against each other so just to try

and it's really really trying to tell a story on guys this looks like it's working please does anybody else in the room does anybody think I'm crazy does anybody think like I'm doing now you're doing like a rookie mistake here please okay I'll have my life back right but I don't know I really feel like this is something not only it might be worth f as a as an example of this but I really want to get this discussion going I don't think we're doing enough of this and I I really really think this is the future we're not going to do it on our own not at all especially not the the boring part man

who likes to be S through this well let's just let's just give it to the robots anyway done what about the death it's the same that's yeah it's it's on it's on Sunday uh 1 or 130 so if you survive if you wake up like noon can you get a coffee anyway thanks a lot guys thanks Alex and thanks to our sponsor strong off I have some swag here to uh hand

PG - Using Machine Learning to Support Information Security - Alex Pinto

Related talks