
GT - Intro to Data Science for Security - Rob Bird & Alex Shagla-McKotch

BSides Las Vegas · 1:56:38 · Published 2016-12
About this talk
GT - Intro to Data Science for Security - Rob Bird & Alex Shagla-McKotch. Ground Truth, BSidesLV 2015 - Tuscany Hotel - August 05, 2015
Transcript [en]

So I'm the speaker liaison, and I'll be introducing Rob Bird and Alex Shagla-McKotch here. They're doing the Intro to Data Science for Security talk today, so hopefully you're in the right place. Come on in, sit down. Okay, so to begin: who are we? I'm Robert. I work at Akamai, where I lead big data architecture for our platform team and the deep learning research group. We analyze about half the internet's traffic every day, and it's a lot of traffic.

And I'm Alex Shagla-McKotch; I work at Rapid7 on the global services team. So here's the standard disclaimer: I am not Akamai and Alex is not Rapid7; we do not speak on their behalf. We didn't harm any marketing people in the making of this presentation, and nobody approved this message, so if you complain to my boss, he won't care, I promise. Our agenda today is pretty straightforward. We're going to take you through a historical background on data science, some of its definitions and process, and what it usually means for security applications. Then we'll get you up to speed on a bunch of

methods and talk through some of the background lingo that you'll just have to know. If you're getting into data science, if you've never done it before and this is your first time even considering it, you will have to know some terminology, just like you have to know terminology for security; it's the same kind of thing. Then finally we'll start looking at some data and apply some of these methods to all kinds of different data, to keep it diverse and interesting for everybody. With that, I'm going to hand it off to Alex, and he's going to start with our background.

All right, so let's get started. Many people think data science is something very new; it's actually quite old. It started about two hundred years ago, and a lot of researchers have driven the advancements in data science since; right now we're just finding interesting new applications for it. Probably the first data scientist was Gauss: as a teenager he created least squares, which was pretty awesome, because he was only around 16 years old. In 1960, Peter Naur first coined the term "data science," using it interchangeably with computer science, and in 2008 the first person to actually get the title of data scientist was DJ Patil at LinkedIn. Someone at Facebook also claims

that he coined the term for a job position around the same time, but we're going to go with DJ in this particular instance. So what is data science? It's mainly statistics: statistics gives us many of the most powerful tools in data science, the most important of which is handling uncertainty. The next part of data science is math, which is one of the main tools as well: linear algebra, topology, graph theory, combinatorics. Basically, mathematics is at the center of just about anything to

do with computer science. In computer science itself, a lot of the foundation comes from Shannon's information theory, and lots of methods in data science have connections to it. And most data scientists are, at the end of the day, programmers. Before, it was all done with pencil, paper, and lots of theory; now that we have computers we can compute these algorithms and optimize them to make them faster and more usable, using Python and R. A large part of data science is machine learning, which provides many of its tools; in particular, machine learning is used to infer models using optimization, and this isn't just optimization for speed, it's optimization of different parts of the algorithm. That brings

us to visualization, and having to do it responsibly; we'll get to that in the next slide. It plays a big role in data science, because it's very difficult to look at a graph of a hundred thousand nodes. You want to intelligently display the information so that someone can actually look at it and interpret it, because most people are visual learners and want to see things; if you just have a bunch of numbers, it's difficult to take them in and grasp them. And this is a good example of data science that's not very responsible: this is showing searches for "farts" and "big data" and how they're

converging. And yes, of course "data science" is a marketing buzzword now, but the main takeaway should be that data science should be used to actually showcase the value in your data. There are tons of data created daily, and we need to come up with different and better methods to actually consume that data and show its value, so it's not just taking up tons of disk somewhere in a data center. And just to add to that: in security, a lot of people say we have a hard time conveying the value of what we do; it's difficult to convey that to supervisors and so forth, especially if you're not actively under attack

or some terrible breach has occurred. Obviously we can always point to external things, but the great thing about data science is that it gives you lots of ways to demonstrate the value of this stuff. Next, one of the key things you need as a data scientist is lots of patience. You are going to tune your data models and clean your data sets with pre-processing, and you must not rush to conclusions: if you start looking at something and it looks obvious, you're probably doing it wrong, or it's something a very standard model could find. Which brings us to the fool's errand: you do not

want to get caught in this. Patterns can occur randomly in time, even ones that are convenient or too good to be true. Remember, the first rule of data science should be to understand your uncertainty, not your certainty. If you're looking at something that's very well defined, data science is probably not going to be a good choice for it; and if a pattern pops out and you start going down that path, you should probably broaden your scope and consider all the different methods for looking at it before you head down a rabbit hole. Data science is

more of a mindset than anything else, and it's a mindset about process. A simple definition: find things you don't know about, and communicate the value. Data science boils down to thinking about the data you're going to ingest, thinking about the process you're going to use to analyze that data, and then seeing what the outcome is and putting it into a format others can consume, people who might not be data scientists or have a statistics or machine learning background. In security, it often means knowing what usually happens. You want to look at your data sets and

think: okay, if I'm looking at logs, I know what the logs should contain. For syslog, say, there should be these kernel messages and things like that; NetFlow has a structured format. You want to be able to understand the actual data you're analyzing. If someone just hands you random data and tells you to do analytics or statistics on it, that's going to be difficult; you need to actually read up on the data you'll be processing. Then it's about finding the things that shouldn't happen, looking at the outliers. Basically, we're winning, right? This is very easy. All right, no, it's not.

Are there any questions so far? Okay, so here we're going to get our hands dirty. For all that overview — we all know math and statistics are part of it — as Alex points out, it's deceptively simple in security: that problem of understanding what usually happens, and understanding what's unusual when it happens, is a little trickier than it sounds. We'll talk about some of the things that make security a little more difficult than other domains of data science. Probably the first thing is that we have complicated data types in security. We've got pcaps, which in the data

science field would be called semi-structured data: there's some structure, you have fields, but then you've got other fields that are basically just payload, wide open, where anything can conceivably be in there. We have high dimensionality in our data: we're dealing with a lot of plain text, a lot of measurements, a lot of potential systems and sources. Imagine, for example, you wanted to create an interaction graph that shows all the different source IP addresses and all the different destination IP addresses for a large network, and who talks to whom; that's actually pretty large. Now, if you try to embed that, or feed that data

into a neural network or some other machine learning model, you have to convert it into a form the model can digest; it can't digest tuples like that. When you do that conversion, you create a kind of binary state-space explosion, and it can get very complicated very quickly, and the models you can usefully apply rapidly disappear. We also have dynamic data, which for statisticians basically means everything is changing over time; "the means are changing over time" is probably the simplest way of thinking about it. Imagine a time series chart where the price for a stock is just going up and up:

if you take a slice of that at different periods in time, the mean is changing, and the number of methods that can be applied to dynamic systems like that is much smaller than the number that can be applied to so-called static data. There are some tricks for converting dynamic data to static data and so forth, and we'll get into all that later on. The next thing is that it's highly stochastic data: there's lots of randomness in our data, in timings, in everything. We have failing devices, failing sensors, the ordering of packets coming into a queue somewhere; all of these things combine

to make the problem tricky. If you're doing, say, sequence analysis on events, the timings between different events can vary dramatically, so it's not as simple as trying to model it with white noise or with a uniform distribution; you really have to pay attention to the randomness in your data. For a lot of detailed security problems this can be one of the most challenging things; it's called state-space modeling. And our data has a relatively short value window. Obviously there's a lot of value in forensic analysis: there are lots of things you can do three months

later, after you've detected the breach, someone's complained, and your stock price has plummeted. But if you think about actually reacting in an incident-response way, real-time data science is a hard problem, and potentially a very computationally expensive one. Again, for security these things all add up to make our job a little more difficult than the practice examples you'll frequently run across in data science textbooks. In fact, I was at a large data science conference and one of the speakers said the hardest thing in data science is finding those magical data sets that seem to show up only in classes, because they don't exist anywhere in the real world; as far as I know, no one's ever

actually needed to analyze irises for a living, which is a bit of a data science in-joke. So the first point here, and maybe this is super obvious: when the data is obvious and you have a model for it, just use the model. Don't apply data science tools to well-defined things. If you have a protocol that works in a certain way, and you know the protocol is either broken or being manipulated whenever it works in a different way, just define it like that; there's a lot of this kind of blacklist/whitelist rule-based stuff out there. A good example of this is

relativistic mechanics. We know relativistic mechanics works within our frame of reference in the universe, so if something shows up that's completely violating, say, the speed of light, we know that's really weird and we should look into it; we don't need to train a neural network to figure that out. So take advantage of the tools at your disposal, especially with complicated problems. A good example: I was analyzing video content recently, trying to determine quality of experience, and one of the tricks there is that you need different frame rates for different kinds of content. For sports content you need a really high frame rate;

for a talking head like me, five frames a second is really all you need for maximum quality. I started looking at the data and I was seeing video that was supposedly playing at five hundred thousand frames a second. It's probably reasonable to assume that's a problem, a bad measurement or an error, right? So I don't need an anomaly detector to do that analysis; I could still find it that way, but it was much easier to just say that value is invalid. And you will run into this; you'll run into bad measurements. Just by show of hands, who in the room has ever worked with NetFlow before?

That's awesome, okay. Have you ever seen weird things in NetFlow that make no sense whatsoever? That's about the same number of hands, okay. I would say the first rule of data science is: follow the process, every single time. I say this because it's really easy to get sucked into shortcuts, and I've made that mistake before. Especially when you're first starting, just follow a process, and the process is so straightforward and obvious that it's actually the one you would follow anyway; formalizing it just helps, because you'll remind yourself not to forget parts of it.

The popular view of the data science process, at least as I've found a lot of folks tend to think about it, is: you have a bunch of data, and of course you've got a nice big Hadoop cluster somewhere because somebody spent money on it, so now you're just going to make money, you're just going to win automatically. There's nothing else to be done; once it's in Hadoop, it's done. And obviously, if you're a programmer, you should just figure this all out for yourself, right? "We bought you the cluster, why do I not have results? I

went to a conference, I heard there were gold nuggets in my data, why don't you have nuggets for me?" This is the kind of thing people will say to you. These also tend to be the people that say things like "I've got MongoDB, so obviously we're doing great" — just a little dig at MongoDB there. But the reality is that data science is not your database, okay? The database is part of the data science process: structuring and storing your data, how you access it, and, more importantly for data science, the kinds of things you can get the database to do for you can be really valuable. For example, we're all used to

traditional transactional databases that have a row-oriented store. Have any of you ever worked with a column-oriented database before? Some of you, okay. With column-oriented databases, if you think about how you do analytics, you may have a column that's, say, bandwidth use; if you index on that column, the data is represented literally as a column on disk, so when you're actually doing your analysis, everything will be very fast for bandwidth use. So column-oriented databases, and array databases, which represent your data as big arrays and let you do operations in parallel, are some of the nice tools that shortcut the scaling problem for data science — but they are not data science tools in their own right.
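
To make the column-store idea concrete, here's a minimal Python sketch; the flow-record fields and sizes are invented, and NumPy just stands in for what a real column store does on disk:

```python
import numpy as np

# Toy flow records. A row store keeps whole records together; a column store
# keeps each field contiguous, so a scan over one field touches far less data.
rows = [("10.0.0.%d" % (i % 250), 443, float(i % 9973)) for i in range(100_000)]

# "Column-oriented" view: extract the bandwidth field once into one dense array.
bandwidth = np.fromiter((r[2] for r in rows), dtype=float, count=len(rows))

# Analytics over the pre-extracted column are a single vectorized pass.
print(bandwidth.mean(), bandwidth.max())
```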

The process itself is called CRISP-DM, which I think is a much cooler name than what it actually stands for: the Cross-Industry Standard Process for Data Mining. The point is, it's a very straightforward process that, frankly, is what you'd be doing anyway. What it really says is that you want to combine the knowledge and information from your business side — understanding what valuable problems you're trying to solve — with the technical knowledge of your data. Quite frankly, one of the biggest problems with what I would call anonymous or drive-by data science, where you just grab some data set off a website and try to analyze it, is that you're not an expert in that data at all.

For security data, you are experts in that data. You've looked at packets; you know strange things are going to show up, and in fact they should show up if you're finding malicious behavior. You have a good understanding of the values you should anticipate: timeouts, retries, the general flow of protocols. Those things are familiar to you at almost an instinctual level, just from looking at this data for so long.

That domain experience is really hard to capture in most generic data science problems, so you already have a head start, and we'll see later why this is very important. The second step is to clean up and, as I would say, spruce up the data. You want to take this ugly raw data you've collected and probably reformat it, break it up into different pieces. Take packet payloads, for example: you probably can't just take the raw text content of the payload and analyze it; you're going to have to drop out stuff that

doesn't matter, maybe remove random strings, maybe run regexes to pull out certain key values that you know are inside a certain protocol. This is all part of the structuring and setup process for your data, and I realize it's not the sexy part of data science, but the reality is, if you want to be a data scientist, you will be doing this all the time, and you'd better like it; if you don't, get out of this room, I'm telling you, for your own good. And he's leaving right now for this very reason; that guy cannot stand pre-processing. Third, you're going to model your data.

You're going to come up with reasonable models that seem to fit the data and apply them; you're going to formulate a hypothesis and try to validate it. I'm being a little pedantic by saying you have to formulate a hypothesis, but don't forget that you need to do it. As Alex pointed out earlier, there are patterns in every form of data: take random binary strings, and there are so many patterns in there, and who knows what they could be — they could also just be random binary strings. So have a good model, understand exactly what you're doing here, and understand that you can't data-

dredge, which is the process of just throwing algorithms blindly at things and hoping results come out; avoiding that is very important. Now, some people say, well, I've very successfully used something like that data-dredging technique; you'll hear it called exploratory data analysis. They're not exactly the same thing, and the reason I say that is that people doing exploratory data analysis — maybe even you, when you're doing it — are really doing it more as an ensemble. That's a technical term we'll talk about later, but when you run many, many different models blindly against

data, it can be done validly, but it has to be applied in such a way that you're accumulating belief, accumulating evidence for the eventual hypothesis you think you're formulating. But trust me, having screwed this up many times: get a formal hypothesis. Write it down on a piece of paper, write it on the wall of your shower, just get it down as early in the process as you can, so you don't forget what the heck you were looking for to begin with when, later on, you think you've found something cool over in the corner — like Alex, for example. We'll come back to him later.

So in reality, what does the process actually look like? This is what most people think it's going to look like: they're going to take their awesome knowledge — they just know so much, man, I've been doing security for 20 years, I am so leet you have no idea, I go to the 303 party, it's awesome — they're going to take all that, squeeze the knowledge out of their boss, make him tell them the valuable questions he needs answers to, and go get those answers. Then they're just going to do a

little modeling, or a little cleanup, and then it's going to be all about the modeling, man. We're going to use neural networks, we're going to use genetic optimization, we're going to do all these cool things you've never even heard of before, and they sound so impressive they will raise VC money faster than you've ever seen. And then eventually you're going to figure out how to actually evaluate whether it was any good or not — but you know what, that deep neural network you used is so cool. Part of the problem here is that in the common literature right now, modeling is the sexy thing. It is so hot right now.

Which, by the way, is ironic in this scene: he's actually saying that Hansel is so hot right now, and Hansel is right behind him in the upper right corner, for anyone who's seen the movie — a bit of trivia. So in reality, this is what the process really looks like. It's very hard to find domain knowledge, and it is very hard to get business insight on the questions people want answered, aside from "I want valuable answers." Well, don't we all. But the question is: do you have valuable questions? There's a lot of data science done without this focus, and I think that discredits the whole effort, because really

all you end up saying is: we found all this stuff, and we have no idea what it's for or whether you're going to care, but we did a lot of algorithms. The reality also is that you're going to do a ton of data cleanup, and it's very painful. Think of something like IP addresses. On the surface, an IP address sounds like a very simple data point: I've got a column in my data set, and it's got the IP address. Now, what if you want to gauge the similarity between two IP addresses? There are a lot of ways to answer that question. You can say, okay, maybe it's just the last octet; maybe

I run a traceroute between those two IP addresses and count the number of hops; maybe I have a BGP network map and use that to identify some notion of proximity or similarity. The point is, there are a lot of ways to skin that cat, and it depends on what you're trying to find out from the IP addresses and how you present them to the algorithms, because the algorithms aren't going to solve this problem for you. The modeling algorithms don't know what an IP address is and don't care; they expect you to care and to figure out what the values mean. Otherwise they will make you look like a jackass, and I promise you this:

you can draw bad conclusions; it's very, very easy to do. I've seen people use IP addresses in their integer form and plug them into a regression algorithm, and all that's really telling you is the ordering of the IP addresses in integer form; it's literally useless from the perspective of identifying unique endpoints. [Audience question about converting to a 32-bit number.] Right — converting it to a 32-bit number isn't exactly destroying data, but in that case you're presenting it to the modeling system as an ordinal, a form of rank data, and it's not. You are destroying data in the sense that it's not what a

typical integer value means: a typical regression algorithm might look at that integer expecting something like a bandwidth counter, not a global graph structure over a hierarchy of data, where you've really got leaf indicators on a tree — which is a totally different kind of data than just numbers. [Audience question.]

That's a good question, because it's the reality of data science: you're biasing the search. I'll state our big secret: you're biasing the search in ways that you know are meaningful, and that's part of the applied-human-knowledge component here. I recently came across a really clever idea for how to vectorize IP addresses — how to turn IP addresses into a form I could use to calculate neighbors and find clusters and those kinds of things in terms of activity — and it was a brilliant idea.

Over the course of the last 15 years I've looked for ways to vectorize IP addresses, and there are probably only two or three papers total in all of the academic literature. That gives you a sense of how weird our data is: people just go, oh, it's just IP addresses, whatever, they're categories, I'll treat them as unique hash values or whatever. But that's destroying data, and a security person knows that's destroying data, and knows we can take advantage of it; the question is how. So presenting the data in a form useful to the algorithms is critical.
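
As a toy illustration of the kind of encoding choice being described here, a sketch that treats the four octets as hierarchical features instead of one opaque integer; the feature design is just one invented option, not the paper's method:

```python
import ipaddress
import numpy as np

def ip_features(ip: str) -> np.ndarray:
    """Encode an IPv4 address as four octet features scaled to [0, 1]."""
    octets = ipaddress.ip_address(ip).packed          # the 4 raw bytes
    return np.frombuffer(octets, dtype=np.uint8).astype(float) / 255.0

a = ip_features("192.168.1.10")
b = ip_features("192.168.1.200")   # same /24 as a
c = ip_features("8.8.8.8")         # unrelated network

print(np.linalg.norm(a - b))  # small distance: near neighbor
print(np.linalg.norm(a - c))  # large distance: far away
```

Treating the address as rank data or as a hash would throw that neighborhood structure away, which is exactly the point being made above.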

And then finally, there's measuring success. Measuring success is hard in security. It's obvious for classification problems like "it's an attack / it's not an attack," but it's not obvious for things like good and bad: what one person thinks is good, another person might think is bad. A great example of this, since I was on one side of the fight for a while, is detecting illegal P2P activity. Imagine the classification problem of analyzing P2P network traffic and determining whether it's quote-unquote illegal, copyright-violating behavior or legitimate behavior. Now assume the P2P network is encrypted: how exactly are you going to make that classification? You can't; it's very difficult to make that conclusion unless you insert yourself in the path and do a bunch of other tricks, and even then

you're not really making any kind of judgment whatsoever. You can't determine, for example, whether this person's use is fair use: is this Britney Spears downloading her own music and distributing it from her dorm room? That's literally possible; when I was at the University of Florida we actually had a famous singer living in the dorms who we busted for distributing her own music over a P2P network. Funny story, but true. So, any questions on this stuff before we get into some of the lingo background? Okay, I'll let Alex pick it up.

All right, so next we're going to get into some of the lingo. We've touched on some of it while we've been going along; now we're going to add some formal definitions. First, data types. For our purposes here we'll be covering three main data types that cover about ninety percent of what you'll see in security, and each of them has to be pre-processed and modeled differently. One thing to note on the different data types: if you can't put your data into one of these three forms, you may not be

looking hard enough, and if you really can't put it into one of these three forms, talk to me, because I want to hear what it is. All right, the first data type is categorical data. This is represented by counts or binary existence, like bags or sets — anything you can put a real category to. Next is ordinal data, which is ranked data: in our example we have the different medals, ranked based on the color of the medal, and the ranking isn't based on equal spacing. The important thing to note here is that it's not the same thing as just assigning integers one, two, three. The subtlety

is that gold is not one unit better than silver, which is one unit better than bronze; gold might be a hundred units better than silver, and silver maybe 10 units better than bronze, and so on. The algorithms that handle ordinal data — which, again, is a very small subset of algorithms — deal with this nicely and will let you tune relative values and weights and how to deal with ties and things like that. And finally, numerical data. This is very easy; we see it all the time, it doesn't take a lot of pre-processing, and it's usually very direct data, lots of numbers.

It's pretty easy to look at in a spreadsheet: you just have the numbers and the values associated with them, and you don't really have to put much weighting on them or anything like that. Numerical data is probably the simplest data type to actually work with; nice continuous real numbers are the goal.
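
A small sketch of how these three types typically get encoded before modeling; the categories and values are invented:

```python
import numpy as np

# Categorical: counts or one-hot existence, with no implied order.
protocols = ["tcp", "udp", "tcp", "icmp"]
vocab = sorted(set(protocols))
one_hot = np.array([[p == v for v in vocab] for p in protocols], dtype=float)

# Ordinal: ranked, with spacing you choose -- gold isn't "one unit" above silver.
medal_value = {"bronze": 1.0, "silver": 10.0, "gold": 100.0}
medals = np.array([medal_value[m] for m in ["gold", "silver", "bronze"]])

# Numerical: continuous real values, usable almost directly.
bytes_seen = np.array([1032.0, 88.5, 40210.0])
```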

All right, we've touched on pre-processing before; this is probably the actual hardest problem. We have all these equations we can plug data into, and the main thing you have to do first is sanitize your data and pre-process it: normalizing time series, normalizing the actual data itself, making it so it can be ingested by these models and algorithms. If you have a spreadsheet with, say, 50 different fields and most of them are blank, that will throw off your model, so you want to extract just the data for the specific use case you're going for. If you're looking at logs, for instance, you might not care about the "warning" or "observed" severities; you just want to know what the messages are in terms of time, so you don't need that extra data, because it can skew your model at the end.

This is really where you can spend most of your time. When you're training your algorithms or computing your fitness functions, you're going to want nice data: you don't want lots of sporadic jumps or leaps in your data, or different pieces missing from it. So how do we actually do this? For most numeric types of data we can compute a z-score; basically, you're normalizing to a mean of zero. For time, you'll want to normalize the time as well. What you want to do first is figure out what outliers could exist:

if something has a z-score six standard deviations away from your mean, it's really going to throw off your model, so you want to look at why that is beforehand and try to correct it before modeling. The best way to normalize is by using the z-score, which is pretty standard statistics: z = (x − μ) / σ, where μ is the mean and σ is the standard deviation of your data. On a bell curve, you're basically computing, for each value, how many deviations it sits away from the mean. The key net result here is that after you z-score numeric data, the mean of that data is going to be 0, or at

least extremely close to 0, and the variance, as well as the standard deviation, will be 1; you should have zero mean and unit standard deviation. The other half of the normalization equation for numeric data is getting columns onto the same, or an equivalent, scale. For example, if you've got data that goes from 0 to 2^32 in one column and data that goes from 0 to 1 in another column, and you just plug those raw numbers into a linear regression algorithm, the scale of the one that goes from 0 to 2^32 will dramatically dominate the

data on the 0-to-1 scale. So you have to put them on the same scale, basically to put them on equal footing, to give them equal space in the data space.
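
Both halves of that normalization in a minimal sketch, with made-up columns:

```python
import numpy as np

counts = np.array([1.2e9, 3.4e9, 2.2e9, 9.9e8])  # column on a 0..2**32 scale
ratios = np.array([0.11, 0.85, 0.40, 0.62])      # column on a 0..1 scale

def zscore(x):
    return (x - x.mean()) / x.std()   # z = (x - mu) / sigma

def minmax(x):
    return (x - x.min()) / (x.max() - x.min())  # squash onto [0, 1]

print(zscore(counts).mean(), zscore(counts).std())      # ~0.0 and 1.0
X = np.column_stack([minmax(counts), minmax(ratios)])   # columns on equal footing
```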

Now, text is a special case, because text can be encoded. Look at a packet, for instance: the data portion of the packet is usually encoded, so you don't have the raw information there, and you're going to have to pre-process packets especially heavily. Text in general can have different things encoded into it; if you're looking at a covert data channel, as in another talk, different parts can mean different things in the actual string itself. You can add context to these things, and there are a couple of different ways to do it. One is a sliding window: this is basically how Google does word searches, a sliding window that creates n-grams, moving across the text to pick out weighted words. Text sequences, like DNA, are all encoded; even numeric time series get encoded over windows. Alternatively, there are specific models that learn by taking one data point at a time and predicting the next symbol.

A great example of predicting the next symbol one data point at a time is any kind of advanced compressor, like an xz compressor or ZPAQ: they take a single binary value at a time and predict the next binary value. The other way, as Alex mentioned, is what they call bag of words, or bag of letters, depending on how you do it. For example, Google takes what they call five-grams, sequences of five words, and they've actually released a data set containing every sequence of five words found in every book on Google Books; it's this gigantic thousand-file data set. What that gives you is an encoding for a model that understands five-word context.

understands five word context right now it seems like what it will why five why not three well you can use increasing amount of context to give your data more and more value maybe give it more relevance you know so so the kind of the basic case would be I just count all the words i count the instances of the account the instance of a etc etc etc maybe I remove the very frequent words like the NA and I just look at sort of the valuable content words that's going to tell you something about what you're looking at but it's not going to be able to predict the next word with any real accuracy especially when you look at things like human

language. The complexity of human language is very high — the speaker's claim is that it's Turing-complete — so it needs all kinds of phrase structure, context across sentences, "what was I reading in the last paragraph," and everything else to really tell you what it's doing. You can imagine if I tried to build an n-gram model a thousand words long: first of all, I've got to have a thousand words of context; second, I need such an incredibly large amount of data to get counts that are not all one for every single thousand-word block that it becomes effectively useless — everything looks random again. So text is really tricky, and sequences are really tricky in general.
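
Here's a minimal sketch of the two text encodings just described, sliding-window n-grams and bag of words; the stop-word list and the sample strings are invented:

```python
from collections import Counter

def char_ngrams(text: str, n: int = 4) -> Counter:
    """Slide a window of width n across the text and count each sequence."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def bag_of_words(text: str, stop={"the", "a", "an", "of", "over"}) -> Counter:
    """Order-free counts of the content words."""
    return Counter(w for w in text.lower().split() if w not in stop)

print(char_ngrams("GET /index.html HTTP/1.1").most_common(3))
print(bag_of_words("the quick brown fox jumps over the lazy dog"))
```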

Time series are really tricky in general too. Our goal today is to show you the methods that won't trip you up and can be practically used quickly. And yes, Markov chains use this same n-gram idea to predict the next word; that's what's used for sentence generation and the like, and there's lots of material on that online. The next kind of data is graphs. These can be encoded as graphs, binary trees, and so on; different input models will give you different graphs, and low-dimensional embeddings can be used to fix the dimensionality. If you have a very large sparse graph, you'll probably want to encode different parts of it depending on your model.

This goes back to pre-processing. To lay it out: take the Facebook social graph, and imagine we're going to create a giant matrix of zeros and ones. You have a one, for that row and column, if you're connected to somebody, and a zero if you're not connected to that person. So if I want to analyze your connections, I've got this enormous row that's you: it's probably nine hundred ninety million zeros, and then a couple of ones — maybe a few hundred — in there. That's a very sparse

matrix; the matrix of all the users of Facebook is very sparse. The problem is how big the input vector is: you've got a nine-hundred-ninety-million-element array for each user, and if you're calculating values and probabilities over all those values, it becomes very difficult. What you can do is use techniques that take a giant high-dimensional thing like that — each row — and compress it down to a fixed number of columns, like 64 or 256. Those are the low-dimensional embeddings he's talking about, and there are some great algorithms in

scikit-learn for doing this kind of low-dimensional projection. We're going to talk mainly about Python tools today, but there are tons of tools that will do this for you; don't mess around reinventing it, and please don't try to plug those raw rows into your prediction algorithms — you'll kick yourself for how long it takes, and it will blow them up, especially if you're using MATLAB. Don't use MATLAB; use R or Python.
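
For instance, TruncatedSVD in scikit-learn handles exactly this sparse-row-to-dense-embedding step; the matrix below is a random stand-in for a huge 0/1 connection matrix:

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Stand-in for a giant, very sparse adjacency matrix (users x users);
# the real thing would have millions of columns, this is just shaped like it.
adjacency = sparse_random(1000, 5000, density=0.001, format="csr", random_state=0)

# Compress each enormous sparse row down to a fixed 64-dimensional embedding.
svd = TruncatedSVD(n_components=64, random_state=0)
embedded = svd.fit_transform(adjacency)
print(embedded.shape)  # (1000, 64)
```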

The next thing is missing data, which is one of the more difficult parts of the pre-processing portion of your analysis. It should be noted that the models can cope with missing data in most cases; it won't really throw them off unless there are huge gaps in the data. One solution can be adding noisy data in: you'd think that wouldn't work, but injecting noisy data into the normalized data can fill those gaps for you. If you're doing signal analysis and you have a missing piece — the frequency dropped, or you just have a chunk gone — you can take something very similar to it, inject noise into it, and compute how that would fill in those parts of the data.

When we said earlier that continuous numeric data is easy, it's easy for a reason: we can use interpolation techniques. For example, if at two o'clock in the morning you have a measurement of the number of BSides attendees down in the casino, at three o'clock you don't have a measurement because somebody got drunk and forgot to write it down, and at four o'clock you have one, you can interpolate between the two values and estimate with a linear model what the count probably was at three, and then plug all of that into the prediction algorithms.
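
That interpolation trick is one line with NumPy; the hours and attendee counts below are the invented example from above:

```python
import numpy as np

hours = np.array([2.0, 4.0])      # measurements we actually have
counts = np.array([120.0, 80.0])  # attendee counts at those hours

# Linearly estimate the missing 3 o'clock reading from its neighbors.
print(np.interp(3.0, hours, counts))  # 100.0
```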

The main thing is — and I'm going to contradict Alex here — without doing this stuff, most models cannot work with missing data; they will fail. There are a few models that handle it for you under the hood nicely, but I guarantee what they're doing is exactly the same techniques you would apply manually. So what you really want to do is find data that's not missing a bunch of values; try to get good measurements. This speaks to the core problem too: oftentimes with security data, you've been archiving data for years, and you decide, okay, I went to this data science class, I'm going to try some of it — and you get in there and realize, oh, it's not measured at equal time steps, it's

measured whenever the measurement happened to be taken, there's missing data all over the place, that file got deleted three months ago, and so on; it becomes very tricky. Now, if you have a giant gap in your data set, as Alex said, interpolation probably can't help you; it's probably more valuable to throw that data out entirely. If you're measuring bandwidth usage every minute and you're missing an entire month, throw the month out; don't try to interpolate it, because you're missing too many points to get a meaningful interpolation.

You see this with logs especially: if an attacker comes in and wipes your logs, you can't really look at that gap and try to model it; you just have to look at the data and decide what the best method is. All right, next we're going to go over data sets. The first kind is data that's pretty much independent of time. If you get a spreadsheet, for instance, and there's no time in it — pivot tables are a good example of a use — there's no actual time scale to follow. If you have, say, colors and ice cream flavors, there's no time scale to look at; or people's

names and what their choices would be — there's no way to inject time into that. The next thing is time series. There are really two types of time series: stationary and non-stationary, or dynamic. In a stationary time series, the mean is basically stable; if you're looking at something and the graph just spikes up, or there are big jumps, or it's multimodal, that won't be stationary. Dynamic data is data where the descriptive statistics are changing over

time. Stock indexes are a good example of this: pull up the stock data for the last 20 years and you'll see it's very multimodal, with big jumps. With the stock market there's really no stable mean; it's ever-changing, there are dips and spikes, and there's a huge push to try to predict these things. Are there any questions?

Okay, so misclassified data. More sophisticated classification models are stochastic models: they assume a certain amount of error in the data, which can include the classification itself. There are a lot of different tricks you can do here. For example, if you have a certain amount of confidence in each source of data — say for antivirus vendor B we see misclassifications one percent of the time, and for antivirus vendor C we see ninety percent misclassification — you can factor that uncertainty into your model. Granted, you're just estimating it, but it's better than nothing; even if the ninety percent is really eighty-seven percent or whatever, the point is you're trying to factor some of it in.
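
One crude way to fold those per-source error rates into a decision, as a sketch; the vendors and their accuracies are invented:

```python
# Conflicting labels for one sample, weighted by each source's estimated accuracy.
votes = {
    "vendor_b": ("malicious", 0.99),  # wrong ~1% of the time
    "vendor_c": ("benign", 0.10),     # wrong ~90% of the time
}

scores = {}
for source, (label, accuracy) in votes.items():
    scores[label] = scores.get(label, 0.0) + accuracy

print(max(scores, key=scores.get))  # "malicious": the trustworthy source dominates
```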

We have other techniques too: Bayesian methods, which deal with uncertainty implicitly in numeric data very nicely; some categorical methods for it as well; and methods for ranked and ordinal data. Okay, so in this part we're going to talk about analysis, and coincidentally we're going to start with the analysis of uncertainty. As we mentioned earlier, uncertainty is in some ways the most important lesson we take away from statistics. When most people think about statistics, they think about probability — but what is probability really telling you? It's giving you a degree of belief in something, as opposed to absolute binary certainty

one way or the other. In fact, it's funny: I was talking to somebody at work who said he doesn't even believe in binary at all; it's just an illusion to him. He's a pure statistician. The reason this is so important is that essentially all real-world data has uncertainty in it. At Akamai scale — we have a quarter-million machines scattered around the planet in 10,000 data centers — we even look at cosmic-ray modification of timings, and drops and loss; that's factored into our models because it has to be, because the amount of data we see is large

enough that it's measurable. Network traffic gets dropped all the time, routers fail, measurements are bad, hackers control your devices and inject fake data; all those things factor in. So really what you have to figure out — one of the core questions in statistics — is this idea of: do I have enough samples for what I'm looking at? How do you even know if you have enough samples? Are the samples you have so far even representative of your data, or are they just a huge burst of outliers? Unfortunately, a lot of data science in security tends to get applied in an emergency, almost in

desperation: we think we can quickly find something — oh, I see a spike here, okay — and you just roll with it, but that doesn't mean it was actually valid or useful at all. This is a core conundrum in statistics, and people will give you different rules of thumb for it. The general rule of thumb I use: if you're dealing with univariate data, you plot it, and the distribution appears relatively normal or stable — like a distribution you could look up on Wikipedia, maybe — that's pretty

well-understood data. If you're looking at data with high dimensionality, the other rule of thumb is that you essentially want two to the power of the dimensionality of your data — the number of columns — in samples before you consider it an adequate number. Now, why should that throw red flags for you? Think about the character n-grams we talked about earlier: say I'm going to take text and encode it as a vector that counts the number of four-character sequences in that text. I've got "abcd" once, then the next one comes along and I've got

three of those, and whatever else — say the space character, which I might see a lot. The problem is that the dimensionality of that is extremely high: you have 27 to the fourth possible combinations there, and if you have to get 2 to the power of 27-to-the-fourth examples before your data is valid, you're going to wait a very long time to get enough data. So oftentimes you're going to be in situations where you just have to go with what you have, and, as Alex pointed out, there are some interesting tricks here, which basically involve injecting random versions of the data you already have.

You take the samples you have and gently perturb them. Especially if they're numeric values, typically it's additive noise: you add a slight deviation to each value. So rather than "we measured 1000 bytes from this user," you could turn that into, say, a hundred lines of "we measured 1000 bytes, we measured 1000.6 bytes, we measured 999.1 bytes." Even though the .6 doesn't mean anything to you, to the algorithm what you're really doing is giving it smooth data.
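
A minimal sketch of that additive-noise trick; the measurement, copy count, and noise scale are all invented knobs you'd tune:

```python
import numpy as np

rng = np.random.default_rng(0)

def jitter(values, copies=100, scale=0.8):
    """Replace each measurement with `copies` gently perturbed versions.

    Additive Gaussian noise with a scale tiny relative to the values
    smooths the data without changing what it says.
    """
    values = np.asarray(values, dtype=float)
    noise = rng.normal(0.0, scale, (len(values), copies))
    return (values[:, None] + noise).ravel()

print(jitter([1000.0])[:3])  # e.g. 1000.1, 999.9, 1000.5
```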

When something is trying to learn, it's trying to drive a rover over a complicated landscape, if you will, to find the lowest point in that landscape. If you give it sharp rocks, that's going to blow the tires out and it will get stuck; when this happens, it's called getting trapped in a local optimum. If you give it smooth data, it can very frequently find that minimum with no problem. So dealing with uncertainty, in a sense, can also mean injecting uncertainty; it's a little counterintuitive, but it works in practice. We really have two main tools for dealing with uncertainty in statistics. One is what I would call the maximum entropy rule, which basically means that, all things considered, if you have a bunch of different models for your data, you want to choose the model that maximizes your

uncertainty given what you already know. Notice I'm not saying pick the model that maximizes the correctness of what I do know, because what you have to assume is that you don't know everything — in fact, you probably really don't know everything — and you're better off picking maximum entropy as your selection point. A great example of this, I think, is the market crash around 9/11. Quite frankly, there were a lot of mental models and computational models that factored in some measure of uncertainty, to different degrees: some people were completely skeptical of the market, some people were already questioning it because of the tech bubble,

some people were worried about terrorist attacks — but most people were working within a certain bound. Very few, if not zero, people predicted the simultaneous occurrence of the tech bubble exploding and 9/11, a major terrorist event in New York City, actually happening at the same time. However, it's maybe sobering to know that a maximum entropy selection model would have accounted for those two things. So while on the surface your intuition says, well, I'm accounting for every possible thing that can happen — you're not accounting for everything equally. You're not saying that a plane crashing into the World Trade

Center is equally probable to an ear of corn growing from a cornstalk. You're saying: I'm assuming all of them can exist, and I can assign some type of probability to those events, even if I don't know what the events are that I'm trying to assign a probability for. So how the heck do we do that? Well, that's where our second tool comes in. This is Thomas Bayes, the creator of what's called Bayes' theorem, or Bayes' rule. Basically, Bayes' theorem lets you work in your prior knowledge — your prior expectations about the probability of certain events — combine it with the evidence you're actually gathering, and then make a new

prediction based on all of that combined. It's actually interesting, because human beings in blind trial tests do not act in a purely probabilistic fashion; they tend to act in a Bayes' theorem fashion. Think about everything you've ever learned in the course of your life: somebody comes up and tells you something, and it may line up well with your notion of uncertainty — it sounds like what you've heard before — so you don't need a lot more evidence. If somebody comes up and tells you something absolutely astonishing, like, hey, I built a car, it's outside in the parking lot, and it can drive faster than the speed of light, you're

going to want to go see that damn thing, right? You're going to want to touch it, look at the motor, see it actually driving faster than the speed of light, sit in the car while it's going faster than the speed of light. So it's not saying that uncertain things can't happen, and it's not saying you have to perfectly predict whether uncertain things will happen or not. What it is saying is that you can give them all a chance, and that an increasing amount of evidence is necessary to confirm your belief when things are really unusual. It's a pretty intuitive notion for humans, and it's

actually expressed literally in this formula here. I cropped out the name of the company on the slide — they were acquired by HP for something like 10 billion dollars, if anybody knows it; this is from their lobby. The way this breaks down: A is the prior information, what we believe before we start, our probability belief in our hypothesis; B is what we measure, the evidence we've gained. What it's really saying, informally, is just what I expressed before: if the hypothesis was extremely unlikely, we should reject it unless we have a lot of evidence to completely realign our expectations.
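
The formula is Bayes' rule, P(A|B) = P(B|A) P(A) / P(B). A tiny worked sketch with invented numbers shows the "unlikely hypotheses need lots of evidence" behavior:

```python
# A = "host is compromised" (hypothesis), B = "alert fired" (evidence).
p_a = 0.001             # prior: 1 in 1000 hosts compromised
p_b_given_a = 0.95      # alert fires on 95% of compromised hosts
p_b_given_not_a = 0.02  # false-alarm rate on clean hosts

p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)   # total evidence
p_a_given_b = p_b_given_a * p_a / p_b

print(round(p_a_given_b, 3))  # ~0.045: one alert barely moves an unlikely prior
```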

Bayesian methods in general — in machine learning, in optimization, in regression problems — tend to represent the gold standard for accommodating uncertainty in your model, and we'll focus on Bayesian methods because, just from experience, they've applied very nicely to security problems. Okay, so now we're going to talk about the analysis of data to actually find patterns and anomalies. As we pointed out earlier, one of the simplest problems in security is knowing what usually happens and finding the stuff that probably shouldn't be happening. The question is: how do you know that? How do you know what's usually happening? How do you know what's unusual,

and how do you do that in practice? This is the part of the talk where I say: as much as we in security tend to think of ourselves as this ivory-tower niche where we're special and know stuff other people don't, in the analytics section of data science there's nothing special about security except for the complexity. So I'm going to show you some really complex data that's got nothing to do with security, and I could hide the axes and swap them out for something else and it would work exactly the same way for you. So, what is normal? This is an interesting concept, and it's maybe

not coincidental that "normal" is actually a name for something in statistics. In statistics we have the normal distribution, this bell curve. Has anyone never seen the bell curve before? Okay, a few hands, good to know. This is also called the Gaussian distribution, named after Gauss, obviously. In the modal sense, we have one dominant value here and a range of values around it. Basically, we know what normal is by taking the average: if we have normally distributed data and I want to know what quote-unquote normally happens — the value I would

expect to see most of the time — my best estimate of that value would be the mean. And if I want to know what's unusual, I'm going to look out here on the tail. So the normal distribution, being a well-understood formal distribution, tells us exactly how to find normal and unusual; it gives us a rule, and we even have percentages for it. That's really convenient. Remember when I said earlier: if we know something follows a certain model, just use the damn model, don't go throwing tools at it? If you know you have normally distributed data, just use it. You can use these percentages and calculate your z-score, which, by the way, equates to

these values here — this point is three standard deviations away from zero. The z-score basically tells you how unusual your data is. Simple, right? I can just filter out my unusual data, or look for high deviations, and everything is dandy.
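
That three-sigma filter is a couple of lines; note the sample needs to be reasonably large, since in a tiny sample a single outlier inflates the standard deviation and hides itself:

```python
import numpy as np

x = np.concatenate([np.full(19, 10.0), [47.0]])  # mostly normal-ish, one spike

z = (x - x.mean()) / x.std()
print(x[np.abs(z) > 3])  # [47.]: the classic three-sigma rule flags the spike
```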

distributions they're all the same they're all related to each other functionally like in the equations this is interesting because actually the sum of arbitrary noise is Gaussian so if you take you know I create levy noise and couchy know he's always weird noise types and I for them all together now anima I'm going to get a distribution of data that's basically normal so it makes sense that this occurs in nature very frequently because there's all kinds of weird random noise all the time created by different processes and when you lump them together you get normally distributed data and this is very fortunate because it's really easy to work with we can calculate the mean we
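(A quick numerical illustration of that lumping-together effect; my sketch, with made-up exponential noise, not data from the talk:)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

raw = rng.exponential(scale=1.0, size=100_000)                  # a single, very skewed noise source
lumped = rng.exponential(1.0, size=(100_000, 50)).sum(axis=1)   # fifty such sources added together

print(stats.skew(raw))     # ~2.0: nowhere near a bell curve
print(stats.skew(lumped))  # ~0.28: the sum is already much closer to Gaussian
```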

And this is very fortunate, because the normal distribution is really easy to work with: we can calculate the mean, we can calculate the standard deviation. We can't do that for every distribution; for some of them you can't even calculate the mean. And it matters. This is called an attractor, a term you may have heard from chaos theory and things like that: if you take different stochastic processes and combine them together, they tend to attract to one of these states. So a lot of things in finance are modeled by, well, not this anymore, but by Cauchy distributions, these fat-tailed distributions and so forth, for this very reason.

Okay, so let's talk about how we do this for categorical data. Categorical data is arguably the simplest kind of data to do this for, because it's already in the exact form you need to analyze it; there's nothing else to do, you just look at the counts and look for the outlier. Okay, so in this photograph we want to find our outlier. Who is it? Probably the one with the iPad right next to the Pope there, behind the shoulder. No, but what I mean by this is: if you've got a bar chart that counts up the number of TCP packets, the number of UDP packets, etc., and you've got this one weird DNS TXT packet floating around, or a hundred of those, I want to look at those first. So you can just use the straight counts; they indicate your probability directly. And if you want to be Bayesian about this, you can apply the categorical distribution as your prior and then apply your evidence, your actual counts, here.
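(A minimal counting sketch; the protocol labels and counts are invented for illustration:)

```python
from collections import Counter

packets = ["TCP"] * 5000 + ["UDP"] * 1200 + ["ICMP"] * 90 + ["DNS-TXT"] * 3

counts = Counter(packets)
total = sum(counts.values())

# The straight counts are the probability estimate; rare categories float to the top.
for proto, n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f"{proto:8s} {n:6d}  p={n / total:.4f}")
```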

Okay, let's talk about one-dimensional data; we're going to start with numeric data here and put it in a standard context. Finding patterns in one-dimensional data is typically done with a histogram. If you've seen a probability distribution, it looks like a histogram, a bunch of bins lined up; that's estimating exactly the process that gives us, say, this nice continuous

curve. So why is this of any relevance? If the data is shaped like this, it's easy, right? We know how to find the mean, we know how to find the standard deviations, piece of cake. If it is not shaped like this, we can't use the mean, and this happens all the time. I cannot tell you how many security products get this wrong: if you use commercial security tools, find out what the heck those scores mean and how they're calculated in your tool. Oftentimes they will be assuming that the data is normally distributed, and the data could not be farther from normally distributed; in fact the values that you're getting are

not just wrong, they're screwy. Okay, so a good example of this is this data set here. These are the same data set, represented by two different histograms with different bin widths. The one on the left's got nice big fat bins; the one on the right's got an equal number of bins to the number of points, so it's about as good as it can be just using raw counts. Obviously they're totally different; they don't even look the same. There's a huge mode in the center on the right side that's absolutely gone on the left side, blurred into the center bins. Well, that's a

challenge, right? How wide do I make the bins? Do they all have to be the same width? How do you make this decision? There are a couple of different rules of thumb that people have tried for dealing with this, and strangely enough it was not solved for quite a while; the solution was discovered fairly recently, and it turns out to be very elegant and simple, but a solution nonetheless. So here's an example of two different methods. The one on the left is called Knuth's rule; not to be confused with Don Knuth, this is Kevin Knuth, who's not related in any way, and

it attempts to estimate the optimal average bin width. It's pretty good, but it's not great, mainly because it doesn't incorporate our uncertainty around these measurements at any point; it's just using the straight data as it is, there's no Bayesian anything involved. The second method is a Bayesian method called Bayesian blocks, and it was developed by Jeff Scargle at NASA. It was specifically used to spot rare measurements in deep space, deep-field light measurements with photon sensors. What he's looking for is really, really far away stars, where there may be a sharp transition in that region of space, but because it's so far away it's just small,

so we need something very sensitive that can still pick up a tiny spike, without running over it and without blowing up our complexity (we'll talk about why that second one matters). One thing you'll notice if you compare these two is the fat-tail handling: out here this bin is very wide, so it's assuming that this is essentially the measurement out here, while in here the bin width is very narrow. So basically it uses adaptive bin widths, specifically to create an optimal model. It's a dynamic programming solution, and it gives us a nice answer to this problem. The key is that this is already implemented in a bunch of Python libraries: if you want to histogram a bunch of 1-D data, you can either go read the algorithms and understand them, or you can just go apply it, get a great histogram output, and use it.
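(A sketch using astropy's implementation of Bayesian blocks; the lumpy two-component data set is made up to show off the adaptive bin widths:)

```python
import numpy as np
from astropy.stats import bayesian_blocks   # Scargle's algorithm; astropy also ships knuth_bin_width

rng = np.random.default_rng(2)
data = np.concatenate([rng.normal(0, 2, 5000),      # broad background
                       rng.normal(4, 0.05, 100)])   # tiny, narrow spike

edges = bayesian_blocks(data)             # adaptive, unequal bin edges
counts, _ = np.histogram(data, bins=edges)
print(len(edges) - 1, "bins")             # far fewer bins than one-per-point, same structure
```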

Now, why does it matter? Why can't we just use the one on the right? Does anyone know why? Why would we want the Bayesian blocks one here versus the fine-bin-width one on the right? Yeah, that's one reason, definitely: you don't know how many bins there are going to be; who knows how many points you're going to end up measuring, maybe today you've got a million and

tomorrow you've got a hundred billion. But there's something more subtle going on here: there is noise in the measurements, and the histogram should actually account for that; that's part of what you're trying to do with the histogram. It should be noted that in two or more dimensions a histogram gets called clustering; if you've heard of clustering before, it sounds a lot sexier than a histogram but it's the exact same thing. Still not quite there; any other guesses? Exactly: it does not reduce the complexity. That is the main issue here. Okay, now think about Occam's razor: we want the simplest description of the problem that we can come up with, the simplest description of our data

that describes the data as accurately as possible. So while the one on the right here works, it's incredibly complex. If I try to calculate, or estimate, the entropy from those fine bin measurements, I'll get a value that's essentially very, very close to one. When you calculate entropy you're taking a log, and the log base, if you do this correctly, needs to be the number of bins. The number of bins here is very large, and if I have a hundred million as my logarithmic base, it takes incredibly large counts to move the result. So at the end of the day your entropy ends up very, very close to one; it approaches one.
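(A small sketch of that normalized-entropy argument; the counts are invented:)

```python
import numpy as np

def normalized_entropy(counts):
    """Entropy with log base = number of bins, so the result lands in [0, 1]."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return -(p * np.log(p)).sum() / np.log(len(counts))

print(normalized_entropy(np.ones(1_000_000)))    # ~1.0: one point per bin, maximal complexity
print(normalized_entropy([500, 300, 150, 50]))   # ~0.82: a handful of bins, a much simpler description
```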

If you use Bayesian blocks you get a very small number of bins, so even though it's just as descriptive, it's much simpler, and we would prefer that model. Even if, for example, the entropy measurement of that one versus this one were identical for some weird reason, you would prefer the Bayesian blocks measurement because it's simpler; it's a simpler description of the same thing. Think about it this way: I can take a program and write it entirely in machine language, and that's great, but the complexity of that solution is extraordinarily high. I'd prefer to write the program in something like Python; yeah,

it's going to turn into machine language, but in terms of my description of that problem, it's a very efficient way to describe the solution. This is the sort of mathematical equivalent of the same thing; for the super-nerds, we're minimizing our Kolmogorov complexity by doing this. Okay, so what does this look like for greater-than-one-dimensional data? It's one thing when you've got one-dimensional data; it's pretty easy to come up with bins. But when you move out of 1-D, suddenly shape starts to matter. You don't really have shape in 1-D, just regions of density and non-density, but in higher dimensions we have different ways of handling this. So

here's an example displaying population density of the United States; this is from Wikipedia. 2-D data doesn't usually lend itself well to the simplest methods, though on practice data sets you can use a grid: break up the entire country into grid squares, like our bins in one dimension, count the number of people in each grid square, and use that as a raw histogram estimator of the data. But typically speaking this can yield some pretty ugly-looking output, frankly, and it can be pretty counterintuitive, because what does a grid square mean? The country is not a square, so how

do you map a grid square to the coast of California? Are there people living offshore, on houseboats out there? What are we talking about? There's a typical solution to this, called kernel density estimation, which basically takes a grid-square implementation like this and then smooths it. If you've ever used Gaussian smoothing in image processing, like in Photoshop for adding blur, that is exactly the method applied here. So you can either take the grid-square implementation and smooth the resulting counts using a Gaussian filter, or you can take every point where there is a person, imagine placing a Gaussian function, that bell curve, at that point, and then look and see what the sum of all of those looks like; they're two different ways to get to the same place.
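(A sketch of kernel density estimation with SciPy; the two-town point cloud is fabricated, and the wide-bandwidth variant previews the smoothing problem discussed next:)

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
pts = np.vstack([rng.normal((0, 0), 0.3, (500, 2)),    # a dense town
                 rng.normal((4, 1), 0.2, (300, 2)),    # another town
                 rng.uniform(-2, 6, (100, 2))]).T      # scattered rural points; shape (2, n)

kde = gaussian_kde(pts)                      # bandwidth chosen by Scott's rule of thumb
kde_wide = gaussian_kde(pts, bw_method=1.0)  # a much fatter bell curve

print(kde([[0], [0]]))        # sharp density peak at the first town
print(kde_wide([[0], [0]]))   # the fat kernel smears that peak away
```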

So what's wrong with this kind of kernel density estimation? What's wrong with it is that you're making an assumption, and the assumption is: what is the shape of my bell curve? Is it a fat bell curve, is it a really skinny bell curve? That assumption is hard to make, and why is it hard to make? Let me give you a great example. This one is from Wikipedia, I think originally done by NASA, and here's

what Stanford got: this is the exact same data. Now, the important thing here is not that the data was wrong; the data was exactly right. But the smoothing process made it look totally different, and we can see this if you look at the northeastern United States: everything from basically Indiana to the coast is shown at the same population density, and we know that's not the case, right? We see the difference here, but only by choosing a very narrow Gaussian function. So we come back to the same problem again: how do we choose our bin width, or smoothing-function width, to apply to higher-dimensional data? And if you think this is hard in 2-D, when you have a

million dimensions it gets way harder; the problem becomes exponentially more complex at that point. There are literally volumes of books written on kernel density estimation just to figure out different rules of thumb for picking that width: maybe sometimes it's wider than others, maybe it's adaptive, maybe it's predicted by something else, maybe there's, I don't know, cheese balls that give you the answer. The point is: is there a better way? And it turns out that there actually is; nature gives us a better way. The intuitive notion here is: wouldn't it be great if bins weren't

shaped like cubes? If I'm in higher-dimensional space, I don't need to assume that everything is a big grid; maybe there's a more natural shape that lends itself more appropriately to this problem of estimating density, because by estimating density we're estimating our normal and our unusual. How can we do this better? It turns out that nature does this implicitly. The pattern of lines that you see here is something called the Voronoi tessellation. The Voronoi tessellation is a natural minimization: if I have a data point, say the data point right here, and this is its Voronoi cell, that means that within the space outlined by

this border, that is the closest point to whatever your input is. If your input lands here, that is the closest point one hundred percent of the time, and these are the boundaries where some other point becomes the closest point, here, here, and so forth. The point is: the inverse of the area of these cells is your density for the cell; it's how unusual that region of space is. In very dense areas of space, where there's lots of people for example, your Voronoi cells may be tiny. Let's take a look here. Here's an example using the coast of Finland, I think; apologies if it's another country and

someone beats me up for it. But it shows you, very simply: if I want to figure out what area has the least people in it, it's pretty easy to find, I just look for the biggest cell. The biggest cell is this one down here in the lower left corner, this huge offshore area where some person is living out on an island; he's the only guy in that whole region. It seems kind of obvious, but the nice thing is this is really easily calculated as well. There are packages for this; in C it's called the Qhull package, and it's also wrapped in Python, in SciPy. And this is great for finding individual points.
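(A sketch of exactly that with SciPy's Qhull wrapper; random points stand in for the map, and unbounded border cells are simply skipped:)

```python
import numpy as np
from scipy.spatial import Voronoi, ConvexHull   # SciPy wraps the Qhull library

rng = np.random.default_rng(4)
pts = rng.uniform(0, 10, (200, 2))
vor = Voronoi(pts)

area = {}
for i, region_idx in enumerate(vor.point_region):
    region = vor.regions[region_idx]
    if len(region) == 0 or -1 in region:
        continue                                 # cell extends to infinity; skip it
    # Voronoi cells are convex, so the hull of the cell's vertices gives its area.
    area[i] = ConvexHull(vor.vertices[region]).volume   # .volume is area in 2-D

loneliest = max(area, key=area.get)              # biggest cell = lowest density = most unusual point
print(loneliest, pts[loneliest], area[loneliest])
```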

But a cluster is a collection of points, right? A bin in our histogram is not one individual point, so there's got to be some way to build off of this, some way to compose it, and we're going to go through that now. What you're seeing here, by the way, is: if you were to take a line and draw it between every pair of points, and then take the perpendicular, so for example if I take this one here and draw a line to that dot there, this border right here is perpendicular to that line and halfway in between, the midpoint perpendicular line against every one of the other points;

that's the implicit mathematical procedure that generates the Voronoi tessellation. So let's take a look and see what this looks like. I chose this one from Jeff Scargle's work, the guy who did Bayesian blocks, and he said: okay, unfortunately, as NASA scientists we look at more than one-dimensional data, and I need to figure out a way to cluster this. This is photon position data from a sky survey. What's interesting about this is: does anyone want to guess how many clusters are in this data? Thank you, I'm glad somebody said a high number: there are 23 clusters in this data. Totally not obvious; this is not

something where a human is just going to walk up and see it. I mean, three of them are pretty obvious, but the other ones, where the density is relatively consistent in the area and the cluster can take an arbitrary shape, are not so obvious. Now, there are lots of clustering algorithms that, if you were to apply them to this data right here, would give you nonsense. In fact, the irony is that a lot of clustering algorithms actually require you to tell them how many clusters are in the data before they go and cluster the data. Think about that for a second; that's like cheating. I know we have some college professors in here: do you ever

give your students the answer before the test? Really, that's what you're doing here. So what Scargle said was: let's start off with that data and create the Voronoi tessellation (by the way, there's some really cool art done with Voronoi tessellations; you should check it out). Does it tell us anything more than just the density of the individual regions? It does: it turns out this also gives us a gradient. If you think about the cells as being tilted in a certain direction, it tells us the tilt as well. We don't really see this directly, but it's

calculated by these packages for us, and it makes life a lot simpler. And this actually indicates (it's 21, not 23, sorry) the number of clusters in the data, and which nodes should be joined to other nodes. Now, the interesting problem is that this is just a heuristic. If I go back to my tessellation: this gradient idea, this idea of linking things on their path toward higher density, which is really what he's doing here, makes good sense, it makes sense to us as humans, and it works on a lot of data, but it's not perfect. The reason is that this turns out to be a global optimization problem, so

that means I can't just use local information around any individual point; I really need to use every piece of information every time I make an analysis for every point, and that gets really complicated, really expensive, really quickly, and hard to apply to real-world problems. So we create these heuristics that work pretty well, but realize that you can always find some room for improvement; he's actually explored two or three different variants of this. The key, and what's nice about this particular method, is that again it's a Bayesian method: he assumes a certain distribution within each of these cells, and the result is that we're

incorporating our uncertainty into the clustering, and the clusters can be of arbitrary shape. By contrast, the algorithm I mentioned earlier, the one that requires you to tell it the answer before you start, can only create circular clusters. It's used very broadly; in fact, of all the machine learning libraries I'm aware of for quote-unquote big data applications, the one thing they will all have in them is that particular clustering algorithm. It's called k-means, and k-means is out there all over the place; but there are all kinds of clusters in nature that are not round.
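(A sketch of that limitation; the crescent-shaped data comes from scikit-learn, and note that k-means has to be handed the answer, k, up front:)

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)   # two crescents: obvious to a human, not round

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)    # we must supply k ourselves
print(np.bincount(km.labels_))   # it finds two groups, but cuts across the crescents, not between them
```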

So some folks have said: well, the obvious answer is to just use lots and lots of clusters; we'll say there are a thousand clusters in this data and let it figure things out from there. But the moment you try to link up micro-clusters like that, you're right back to the heuristic problem, right back to the same global optimization problem. So there are a lot of ways to skin the cat: you can fake the Voronoi tessellation with k-means and then apply a similar aggregation method to build up a cluster, and so on, but it's a little tricky and you're going to have to spend some time with it. And we talked about patience: a lot of the methods we're

showing you today can be used almost, I stress almost, out of the box; you can just apply them, but they should still be checked by a human. That's the fusion with your domain expertise: if you see clustering that doesn't make any damn sense, it doesn't make sense. Sometimes we'll get rare stuff, like here, that is a little unusual, especially in sparse space, but it's an important point. Are there any questions on this stuff? Yeah, the gradient slide, this one, the Voronoi tessellation? Right. What's also interesting is, if you look at it, you'll see standard lattices, but that's actually a spin-glass-like structure;

that's the microscopic structure of glass as well. Any non-uniform lattice is going to look almost like this. That's why I'm saying nature abhors complexity and finds lowest-energy solutions; the way that bubbles pop and merge and so forth is a very natural form of energy minimization. Thanks, that's great; it's like seeing a microscopic picture of steel or something. Okay, so let's talk a little bit about time series. I guess at some level time series is maybe the data we're all most familiar with; in fact a lot of people think of data science and they're thinking about applying it to

time series, and then they start taking data science classes, and all they learn on is stationary, normal data, and then they go: how the hell do I analyze a time series? Okay, so it's a little bit trickier than looking at a standard data set. The reason it's trickier is that when you look at a cross-sectional data set, like an Excel spreadsheet, you can look at all the points at once: you have every single data point that's ever going to happen, and they're assumed to happen effectively at the same time, or in a way where the time doesn't matter in the measurement. To convert a time series, imagine

a stock price changing second by second by second: you have to think about how you're going to vectorize that, how you're going to put it into a form that typical predictive stationary models can use. So when you're modeling, you have two options: you can choose a stationary model, which again is ninety-eight percent of the models that you learn, or a sequential model, a time-series-style model. Now you may say: well, it's a time series, why wouldn't I just use time series methods? The reason is that (a) stationary models are simple to use, and (b) you can get a

lot of practical results out of simple stationary models; simple linear models can provide good information about a lot of problems. And quite frankly, the mathematics of sequential processes can be quite complicated; a lot of them are not even well understood. We understand Gaussian processes: imagine a random walker, so I'm randomly taking steps, my step length as I walk around the room is effectively a fixed size, and I just randomly turn and walk; we understand the dynamics of that. But, as I was describing to someone earlier, if I'm a Lévy walker, a Lévy flight I think it's called, I may take random steps and then

suddenly teleport across the room to a different place. Estimating where I might be at any given time for that Lévy process is much harder than for the random-walk process. So part of the challenge, and this comes back to that sampling discussion earlier, is that I don't even know what the hell process I'm looking at. It's fine if somebody tells you, oh, this is a Lévy process; but what process governs the stock market? They don't know. If they knew, they would have modeled it by now, but they don't know; that's the point. And the process itself might be changing over time, too: it might be Gaussian right now, a random walk, and

suddenly a hurricane happens, or there's a 9/11, and now it's behaving like a Lévy flight, a completely different thing. So this is a great example: is it just a heartbeat? Or is it? This is actually eight hours of data, so it's not a heartbeat; it's seismic activity superimposed over an eight-hour time scale. So what's the challenge, why would this matter, why would I want to think about two different types of models, stationary versus dynamic? The problem with stationary models is that I have to pick a window. Let's say I want to turn this into a higher-dimensional problem: I have one-dimensional data here, I've got some

value that varies, and I could give the model just the instantaneous value and tell it to predict the next value, but that probably isn't going to give me a good indication. So maybe I want to give it some context: the last five values, the last ten values. Well, where do you stop? Realize that as you add those points you're increasing the dimensionality of your problem, so if I take the last hundred points, my complexity, in terms of the number of samples I need, is very high compared to the simple case. Let me try to give an example: if this is eight hours of duration and I pick a

window of, say, ten minutes, I'm going to have all kinds of windows that are just flat, nothing windows, and even the transition points show relatively slow growth compared to the window, so it's not going to capture the dynamics of the system. The real dynamic that the human sees is: there's this small activity, a big spike, kind of a recovery, and then another period of stability. So two notes here. One, you can use a sequential model, which again has its own host of problems; or you can use a vectorized model, a stationary model, and this is where the human says: you know what, the relevant window size is

this big; it's at least big enough to capture the largest feature in our data set that we care about. That should be your primary rule of thumb: if you're stationarizing a data set, using linear regression or other methods, neural networks, whatever you want to try, make sure the window captures the feature you really care about. And that can be tough. It's easy for this data, but it's hard for things like English-language text: what's the right window size for English text? It kind of depends on what the heck the context is. If it's tweets, you can probably get away with just

looking inside the tweet; if it's a book, you might need information from the last paragraph, from the last chapter, etc. So the vectorization represents a memory, a working memory, for your analysis. It doesn't have to be exactly the previous data, either: at this point here I can structure a vector that takes the previous value, the tenth-previous value, the hundredth-previous value; there are all kinds of different ways to vectorize this. So when we said patience earlier, part of that is parameterization: dealing with these parameterizations, these window functions, can be particularly tricky.
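(A minimal windowing sketch, one of the many vectorizations just described; the sine-plus-noise series and the window length of 100 are arbitrary choices:)

```python
import numpy as np

def vectorize(series, window):
    """Stationarize a series: each row holds the previous `window` values,
    and the target is the value that follows that window."""
    series = np.asarray(series, dtype=float)
    X = np.lib.stride_tricks.sliding_window_view(series[:-1], window)
    y = series[window:]
    return X, y

rng = np.random.default_rng(5)
series = np.sin(np.arange(500) / 20) + 0.1 * rng.normal(size=500)
X, y = vectorize(series, window=100)
print(X.shape, y.shape)   # (400, 100), (400,): the window size *is* the dimensionality
```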

So when you're looking at time series, the very essence of time series is wave analysis. I know we have a lot of people here who play around with audio and do video editing; you're doing this all the time if you're using those tools, and you can apply it to work. Some waves, like a sine wave, are very smooth and predictable, very easy to break down and analyze into a spectrum. Other waves are incredibly turbulent: they have hierarchical processes that induce turbulence, and they're not well understood. In fact, turbulence is one of the great unsolved problems in physics; it's one of the Millennium Problems. I think just recently, like a week or two ago,

they showed that the Navier-Stokes equations that define turbulence are incomplete; we thought they were complete for 80 years, and they're not. And the reason this is difficult, if you think about it: we just looked at a one-dimensional time series wave; now imagine a hundred or a thousand dimensions of wave analysis. It gets really complicated really quickly; there's no good way to model turbulence, and you end up having to basically use optimization. So we have a bunch of tools we can use to look at time series. Probably the most frequent one, the one most people are used to, is what they call spectral density

analysis. It's basically the time-series equivalent of clustering, but you're looking at frequencies: I'm trying to break a time series down into a series of waves of different frequencies, big slow waves, real fast waves, etc. You see this all the time in audio processing and so forth; it's the standard go-to method. In particular, the Fourier transform breaks your input signal down into a series of sine waves. The trick, and I should mention this by the way, is that it doesn't actually have to be sine waves: there are whole other families of waves, even chaotic ones, that you can use to break things down.
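(A spectral-density sketch with SciPy; the two planted frequencies are invented so the peak-finding has something to find:)

```python
import numpy as np
from scipy.signal import periodogram

fs = 100.0                                   # sample rate in Hz
t = np.arange(0, 600, 1 / fs)                # ten minutes of signal
sig = (np.sin(2 * np.pi * 0.5 * t)           # a slow wave at 0.5 Hz
       + 0.5 * np.sin(2 * np.pi * 7.0 * t)   # a faster wave at 7 Hz
       + np.random.default_rng(6).normal(0, 0.2, t.size))

freqs, psd = periodogram(sig, fs=fs)
print(freqs[np.argsort(psd)[-2:]])           # the two dominant frequencies: ~7.0 and ~0.5 Hz
```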

So if you go down this path of spectral analysis, there are a lot of interesting tools you can apply that can account for this. Another tool that we use is really about the comparison of time series. Let's say I want to compare two different time series: imagine that you're vectorizing the motions of a person doing sign language, and you want to interpret that person's sign language over time. If I take the strict approach, like every tenth of a second I take a snapshot of that person and use that information, there's a certain brittleness to it: I might move my hands

faster or slower tomorrow; if I'm doing speech recognition, I might slow my speech down, I might speed up really quickly, I could be breathing helium and something weird's going on. The point is that just using the Euclidean distance to compare points does not give you a very good estimate over time; for speech analysis it's basically useless. Instead, the optimal way of doing this, at least the best way anyone knows at the moment, is called dynamic time warping, which sounds complicated but is basically just the Euclidean distance where we allow one of the time series to be squished and shifted in different ways, so we can find

the best alignment, making it flexible enough for things like speech recognition. For this, dynamic time warping is the de facto gold standard, period; there's nothing as good as it, and for very large collections of time series it is still the best method. It's a little expensive, though: its complexity is quadratic, so you've got a pretty expensive distance function here. Oh, I should also mention a neat little feature: dynamic time warping is the continuous, numeric equivalent of aligning DNA sequences. The sequence alignment problem with gaps for DNA is equivalent to dynamic time warping in the numeric domain, which is kind of neat.
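(A bare-bones dynamic programming implementation of DTW, my sketch of the standard recurrence rather than any particular library; the time-scaled sine stands in for a "same gesture, performed faster" pair:)

```python
import numpy as np

def dtw(a, b):
    """Dynamic time warping distance: Euclidean matching where one series may
    stretch and shift. O(len(a) * len(b)), hence the 'quadratic' cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return np.sqrt(D[n, m])

t = np.linspace(0, 2 * np.pi, 100)
slow, fast = np.sin(t), np.sin(1.3 * t)                # same shape, different speed
print(dtw(slow, fast), np.linalg.norm(slow - fast))    # the DTW distance is far smaller
```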

Now, there are two other interesting little tools here, if you build off of dynamic time warping and ask: okay, what if I want to look for the most frequent patterns, the common patterns in my time series, the things that happen all the time? Those get a slightly different name from clusters: in the time series literature they're called motifs. By the way, the sort of lord of all things time series is a guy by the name of Eamonn Keogh, and he teaches at UC Riverside.

I was advising someone recently and basically said: look, everything you're trying to do with anomaly detection in time series is basically solved; this problem is extremely well understood in time series, which is not yet the case for stationary data. So even though time series are super complex, we can basically create an anomaly detector and a classifier for time series with incredible accuracy using nothing more complex than just storing the time series as our database. It seems like, how could you do that; don't you need a deep neural network or something? No, you don't, and I'll explain why in a second. The discord is the equivalent of an anomaly in time series; I'm not sure why he

chose these words, by the way, but a discord is basically the most unusual thing in your time series. What's interesting about his methods for motif and discord discovery is that they line up really well with human intuition about where the unusual things are. Take, for example, a heart specialist looking at EKG output: he will notice subtle fluctuations and variations in that output that are absolutely critical and indicative of something to him, and of course he can classify them even though they may vary in duration; and it's almost always just the shape. So people are kind of shocked to find out that it's not really

anything more complicated than that. Here's an example of a tool; if this were the four-hour class, we would be spending time applying this. This is Explore CSV; if you hit me up after the show, I'm happy to give you a copy of all the software. This particular package takes time series (I believe these are actually from a bunch of sensor measurements) and highlights in red all of the most unusual sections of the time series, and how unusual they are. The things that are more deeply highlighted red

are highly unusual; the things that are lighter red are less unusual; these are all the top five percent of items here. The second column is the power spectral density analysis of the same set of time series, and the final column is what they call the linear autocorrelation, basically repeats of patterns in linear time. Now, if we just look at the power spectrum, it's not exactly obvious how to find the unusual sections of the time series. We might say: well, I'll just select out this frequency, remove the other frequencies, then reverse the transform and see what pops up. It turns out that gives

really bad results, surprisingly, because it's the combination of frequencies at a specific time that yields specific patterns, and you don't get that captured here. You're not destroying information with the PSD, but there's a whole bunch of complex domain information that isn't being displayed, so it's kind of lost. So how do we calculate motifs and discords? When I say this guy is the lord of time series: he has one paper out where he analyzes a, I'm very sorry, a trillion-point time series, and another where he analyzes huge numbers of these crazy synthetic

data sets. What he's found is that to find motifs, the common patterns, you use time warping, and to find the discords you use just the standard Euclidean distance: you take a window length, look at the Euclidean distance, and find the nearest neighbor of each section of the time series. So for example, I take this window here and I compare it to every other window, except, and this is a very critical point, because people do not do this, except any window that overlaps with me (a concern you don't have with stationary data). If you compare against a window that overlaps with you, you will just get mud; but if you compare against every other window that doesn't overlap with you, you'll get these nice clean results that show you exactly where the unusual things are.
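(A brute-force discord finder in that spirit; my sketch, not Keogh's actual algorithm, which uses clever pruning to scale. The planted bump and window length are invented:)

```python
import numpy as np

def discord(series, w):
    """Return the start index of the window whose nearest *non-overlapping*
    neighbor is farthest away: the most unusual stretch of the series."""
    wins = np.lib.stride_tricks.sliding_window_view(np.asarray(series, float), w)
    best_i, best_d = 0, -1.0
    for i in range(len(wins)):
        dists = np.linalg.norm(wins - wins[i], axis=1)
        dists[max(0, i - w + 1): i + w] = np.inf   # exclude overlapping (trivial) matches
        d = dists.min()
        if d > best_d:
            best_i, best_d = i, d
    return best_i, best_d

t = np.linspace(0, 40 * np.pi, 2000)
sig = np.sin(t)
sig[970:1030] += 1.5              # a planted anomaly
print(discord(sig, 50))           # the reported index lands in or near the planted stretch
```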

Interestingly enough, over extremely large collections of time series, even the dynamic time warping approach for finding the motifs converges to the Euclidean distance; it's founded on Euclidean distance, so that's not that surprising. It also turns out that the vast majority of time series classification and prediction problems can be solved with a very simple technique called nearest-neighbor classification, which is basically: find the thing most similar to me; whatever its classification was, that's what I am. If it was bad, I'm bad; if it was good, I'm good. There are some modifications to that, like:

if my nearest neighbor is still really far away from me, then I'm probably not that thing. So again, there are lots of variations on a theme, and we're going to get into those in a second. Any questions on this stuff? Okay, is everybody doing okay, do you need a break? The name again is Eamonn Keogh, spelled K-E-O-G-H. It's funny: if you look up just about any time series data set, you'll find his name; he's that widely known. Okay, any other questions? Okay, so on that note we're going to get into the prediction and analysis stuff; this is the hot model section of

the show. Since we just talked about it, we're going to start off with k-nearest neighbors. I think this is not where most people start their introduction to prediction, but it's definitely where I start mine. K-nearest neighbors is the best-known version of what are called lazy learners; any idea why I picked this picture? The idea is very simple: we can predict the future based on stuff we already know, and we can use our best examples, the things most similar to what we're seeing now, to tell us what the future should probably hold. In that sense it's almost the most

intuitive method there is. There's been some interesting speculation recently that, for all of the deep neural network and other research, the human brain may actually just be k-nearest neighbors over a giant, continuously updated database of examples. There's also another interesting thing here: there's a recent study by Andrew Ng, who's the chief scientist at Baidu and also the professor of a very popular machine learning course from Stanford, on YouTube. Because modeling is so sexy, frankly because it raises money, there's a big focus in graduate computer science on new modeling methods: everybody

takes these old methods, applies their new method, tweaks one little parameter, and now it's a whole new method with a new name, I'm so much better than you, and there are these very competitive dynamics. What he asked was: has there really been steady improvement in these algorithms over the last ten years, a strong march of improvement? And what he found, when they actually compared the raw methods side by side, was that no, the methods themselves were really no different from each other in this particular case; the newer results just had more data and more compute time. If you

give linear regression more data, it will give you a better prediction; if you give k-nearest neighbors more data, it will give you a better prediction. So in that sense a lot of models just get stronger the more data you have. Now, finding structure with complicated models is a different problem, and that's part of what makes stuff like deep learning interesting for very big systems; but I've also seen, for example, very advanced variants of k-nearest neighbors, which induce hierarchical structure, graph structures, etc. on top of the data, perform just as well as deep learning networks, and they don't use a single neuron or black-box optimizer anywhere. The

k in k-nearest neighbors is the number of neighbors that you look at, and it's notable that the one-nearest-neighbor graph is the Voronoi tessellation: if I'm doing a simple one-nearest-neighbor classification on this data set and ask what cluster a new point belongs to, finding the nearest neighbor is just finding which tessellation cell the point falls in and reading off that neighbor's label. So the nice thing is that what we used to find structure, to find our clusters and our outliers, we can directly use to make predictions. Note that this particular model is obviously not a linear model, not one straight line, but it is a piecewise linear model: remember, all of our borders between the cells were straight lines.
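(A two-line demonstration with scikit-learn; the blob data is synthetic:)

```python
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=300, centers=3, random_state=7)

# With k=1 the decision regions are exactly the Voronoi cells of the training points.
knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(knn.predict([[0.0, 0.0]]))   # the label of whichever stored point is nearest
```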

So there are really two kinds of models in general: linear models and nonlinear models. Basically, linear models try to find the best straight-line fit, and nonlinear models try to find the best curved-line fit; that's a high-level way of thinking about it. I'll let Alex cover linear models, and then we'll bounce right back quickly and cover some other nonlinear stuff. Thanks for your patience.

Alright, hopefully you can hear me better. So we're going to go over linear regression; has anyone ever used this before? Alright. We're going to start with a very simple example: we're going to look at tip amount versus meal amount. Look at this distribution: it's basically showing us that, in the absence of any other data, the sample mean is the best estimate of a future value. We're going to lean on this idea for now: the mean is the null hypothesis, meaning any predictive model has to be better than the mean to have any value. It is notable that the sample mean is an unbiased estimator (we'll be going over

bias later on, when we get to PageRank, what Google uses), and that the mean itself has a distribution. So with this, you can see the points distributed by the actual tip amount for the meals, and you can see the average, the mean, here. Now, if we look at the actual errors against the mean, we'll see that the sum of the squared errors is 120; remember 120, because we're going to come back to it once we get into the regression. So next we take the tip amount and the actual bill amount and create a

scatter plot, and we're going to be creating the linear regression for this. You can kind of see from the scatter plot that a linear model makes sense here; if you just eyeball it, the points roughly follow a line.

If you plotted a scatter plot and it was just chaos up there, you probably wouldn't want to be using a linear model for that. For this case you can see there is a best-fit line that we can fit, so that's what we do. Step one is to find the centroid, which is the mean of each variable; it will be one of the points on your best-fit line. You can see it in this plot, and like the mean I was showing before, it sits on that line. Alright, now we're not going to

really go into calculating this, because we're tight on time; basically we've now created that best-fit line, finding the slope of the line from the changes in each variable. One thing to note: if the denominator here is zero, you're going to get a degenerate value, and that means you're not going to want to use this type of model for that data. And maybe the best lesson here is that models do not apply infinitely

over the entire domain: this one is only applicable when there's actually a bill, so you're not going to hand money back to somebody who got a free meal, and you don't tip when you just get water or something like that, when you're not charged for anything. Then, how do we know how good our model is? Basically, we look at the residuals, the square of the distance from the prediction of our model to the data

points; from that we can tell how accurate we are. Let's say the line hits this point, but all of our other data points are all over the place; that's really not a good model. So, going back to that example, what we want to do now is calculate this same type of error for the regression. The first one we looked at was the bottom one here, just the tip amount against the mean; the second one is the actual tip amount versus the meal. Now you can see that we have two different errors, and if we subtract them we get about 80-something; that reduction tells us how good our model is.
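(A sketch of that comparison; the bill/tip numbers are fabricated stand-ins for the slide's data set:)

```python
import numpy as np

rng = np.random.default_rng(8)
bill = rng.uniform(10, 120, 200)
tip = 0.15 * bill + rng.normal(0, 2, 200)        # made-up tipping behavior

sse_mean = np.sum((tip - tip.mean()) ** 2)       # the null model: always predict the mean
slope, intercept = np.polyfit(bill, tip, 1)      # least squares fit
sse_fit = np.sum((tip - (slope * bill + intercept)) ** 2)

print(sse_mean, sse_fit)   # the regression has to beat the mean's error to be worth anything
```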

Okay, so now we're going to cover a pair of nonlinear models; these are very commonly used. One important thing to recognize is that every nonlinear model has data that it's good at and data that it's not good at, and most nonlinear models can be massaged to be good at the kind of data they're not naturally good at. So you should never really rule something out:

just because deep learning networks are new and cool, and they've got that "deep" in them, doesn't mean they're any good for your problem, and it doesn't mean they're going to be easy to apply to your problem. For the last decade, support vector machines were the thing, and if you weren't using support vector machines you were probably not going to get good results, or at least not results as good as the people who were. So keep an open mind. What is a support vector machine? It comes back to understanding how good a model can actually be: why is one model necessarily better than

another model, and is there a way to maximize goodness? (It's relatively easy to see in the linear case.) Support vector machines have this idea that we want to separate what we call classes, two different sets of predictions, predictions for red and predictions for blue, by a hyperplane: just a line in two-dimensional space, or a plane in higher-dimensional space. And we want to find the hyperplane that has what's called the maximum margin: the hyperplane with the most space between it and the example data points.

So for example, we could choose B; B is not wrong, and if B were your model you could classify red and blue, at least for this data, just fine. But if a new point came in, say right here, it's going to be classified wrong; it's going to end up over here. By choosing A, we're choosing the fattest road, if you will, between our different data sets, and that gives us a nice, intuitive, geometric understanding: there's a certain robustness to this model, a nice margin of error within which we can still be right, even for data we haven't seen yet, and

everything is just dandy. These are also pretty easy to calculate. The other thing is that it doesn't have to be a flat hyperplane: there's this great thing called the kernel trick, which lets us take a very nonlinear data set, like this one here, where this would be the best straight-line maximum-margin hyperplane, and project it into a higher-dimensional space where we can get a nice linear separator between the classes; and there it's significantly easier to calculate that margin. This lets us handle data sets that are significantly more complicated, for example this one, which is called the Swiss roll data set.
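(A sketch of the kernel trick with scikit-learn; concentric rings stand in here, since they are not linearly separable in the plane:)

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=9)

linear = SVC(kernel="linear").fit(X, y)   # a straight hyperplane in the original space
rbf = SVC(kernel="rbf").fit(X, y)         # kernel trick: implicit higher-dimensional space

print(linear.score(X, y))   # ~0.5: no single line can split the rings
print(rbf.score(X, y))      # ~1.0: separable after the implicit projection
```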

This is widely used: it breaks clustering algorithms, it breaks prediction algorithms all the time. It's kind of the entry-level gauntlet: oh, you think you're a nonlinear model and you can really find good margins? Then you should be able to handle the Swiss roll with no problem whatsoever. If you think about it, similarity and adjacency and so forth are tough here: if I use k-nearest neighbors, it's probably okay out here, and it's probably okay in here where the points are really dense and there's a significant distance between them and the

next class; but out here the distance between each point and its own class is almost the same as the distance between it and the next class, so k-nearest neighbors gives very spurious, weird results on this kind of data. Support vector machines give you a way to shortcut this process. I really do apologize, we have like five minutes, so I'm going to try to fly through the slides here. Another very popular family of nonlinear models is neural networks. And what are neural networks, really? Neural networks are function approximation systems; that's one way to think about them. They're using a very simple underlying piece of function, in this case the sigmoid function, and we're going to use these

in combination: we take a linear combination of these things with different weights, and use that to estimate some crazy, wacky nonlinear function. It's interesting to note that this is essentially the cumulative distribution function of the bell curve, which is also called the error function. There are other kinds of neurons, too; in fact, in deep learning you might think, okay, this is very interesting, what about deep learning, what does it do? It doesn't even use a nonlinear sigmoid neuron; it uses a unit called a ReLU, a rectified linear unit, which is actually just a flat line and then an angled line.
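(Both units in a few lines, plus the combination idea; the weights and biases are hand-picked to build a little bump, my choice for illustration:)

```python
import numpy as np

def sigmoid(x):
    """The classic smooth squashing nonlinearity."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Rectified linear unit: flat, then an angled line."""
    return np.maximum(0.0, x)

# Two shifted sigmoids, weighted and summed: already a decent bump approximator.
x = np.linspace(-5, 5, 11)
hidden = sigmoid(np.outer(x, [4.0, 4.0]) + np.array([4.0, -4.0]))
bump = hidden @ np.array([1.0, -1.0])    # sigmoid(4x + 4) - sigmoid(4x - 4)
print(np.round(bump, 2))                 # near zero at the edges, close to one in the middle
```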

So you don't have to have really complicated component functions to create really complicated learning systems, which is maybe one of the important lessons here: it's based on very simple building blocks, applied in combination and in hierarchy. It is an optimization-driven approach; you typically wouldn't use ant colony optimization, but you literally could. Most of the time it uses a method called stochastic gradient descent, which is basically following the curvature of our data, the manifold of our data if you will, and trying to find the valleys, the things that minimize our error; it does that by calculating derivatives and then using that information. So what is deep learning? Is it just more of the same, just layers, like

50 layers of a neural network, or what? Well, it's a lot of things, really, but deep networks were actually developed around 1968, and they used nine-layer-deep neural networks then, which seemed astonishingly sophisticated for the time. The breakthroughs that have happened in the last ten years, the ones you read about, are really better ways to train neural networks, simpler ways to represent those functions, and the understanding that what a neural network is doing is creating a lossy compression of the data. When you extrapolate that back, what you find when you look at a neural network is that it's learning a hierarchy of features. Who's seen the

stuff about Google and, like, Google's image system learning cats? Cool, okay. So what it did there was learn a series of stages. The first layer of the neural network, on imagery, learned edges: it learned that dark next to light next to dark indicates an edge in a picture, and it learned edges of different angles and orientations. The next layer up learned that features like corners are composed of multiple edges, and then combinations of corners can create circles and squares and other things like that. As you apply more and more layers, you're taking this already-lossy thing, like edges, and composing

edges into shapes, and shapes into parts, and eventually you get to cats and human faces and trees and other things. The reason this is important is that a lot of structure in nature is self-similar: it occurs at many different scales. If you zoom way in on a fractal, it's exactly the same as if you zoom way out on the fractal. Now, nature is not that perfect, but it's similar enough that hierarchical representations are very important. We see this all the time: if you write an outline for a paper, you write the high level, there's an overview, and it's

section 1, section 2, section 3, and it's going to cover these major points, whatever; that's a hierarchical representation. That's all a neural network is doing; it's just learning the same kind of thing. Okay, we're going to quickly touch on graph analysis as well. Graphs are really useful for analyzing structure and relationships between data, so interaction graphs are particularly important for security. We're just going to cover two algorithms here. Probably the most well-known algorithm for graph analysis is something called PageRank. PageRank is looking at the popularity of pages, or nodes, in our graph, based on the quality of the links that link to it and the number of links that link to it:

if you have a really high-quality site, like NBC.com or something, and you link to my personal homepage, wow, I am important; but if some junk site links to you, it's basically disregarded. You can actually apply this directly to security data; our last slide actually uses it to find the most important, say, log lines in your log file, or the most important sequences of events. It's very easy to visualize and understand, and that helps us: even though it's a link analysis algorithm, it's actually ranking the nodes in our graph, the circles in the graph; that's what we're measuring.
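(A toy PageRank run with networkx; the site names are invented, and the same call works if the nodes are log-event types and the edges are observed transitions:)

```python
import networkx as nx

G = nx.DiGraph([("junk-site", "me"), ("nbc", "me"),
                ("a", "nbc"), ("b", "nbc"), ("c", "nbc")])

ranks = nx.pagerank(G)   # importance flows along links, weighted by the linker's own importance
for node, r in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{node:10s} {r:.3f}")   # "me" ranks high thanks to the quality inbound link
```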

So what if we want to understand the most important paths through this data? The nicest way to do that is something called link salience. This turns out to be a really flexible technique; there are literally hundreds of ways to analyze graphs, and I'm not trying to say these are the best or the only ones, but they're really practical for big security-type tasks, mainly because they reduce the complexity of the representation down nicely. Link salience basically uses the shortest-path algorithm that all our networks are powered by, Dijkstra's: it computes the shortest-path tree from every node in our graph. So say this is our original graph; for each node it's going to find the

shortest paths from that node, then it combines them all together, and the aggregate gives us a measure of the importance of each link. The intuitive notion here is: if a certain connection is on the shortest path from a lot of nodes to other stuff, it's really critical. As an example, look at the world travel network: the top one has the frequency of every single route flown by a plane over the course of a year, and that's kind of useful, but where do you start? Everything looks almost equally important. Same thing with the food web, in grams of carbon per year, down at the bottom, or the world trade

one at the bottom. (Oh, we skipped the break, darn. Oh man; that's okay.) So basically, if you use link salience to analyze this, it will reduce these graphs down to just a few critical hubs, the things that really are important. And what's astonishing is that for the air traffic network, the salience rank is literally the world rank in popularity of that airport; it just naturally aligned one-to-one, which, in the original experiment, kind of drove everyone nuts about this method. You can then drop all the links with salience below a certain threshold and just look at the high-salience links in your graph, and if

you're looking at a big interaction graph, this will pull anomalies out like mad. The other thing, too, is that if you take PageRank and you invert the weights, you now have an anomaly detection algorithm for your graph; so these two methods, paired back to back, work really nicely. So here it is actually applied to a log file. This is from a USENIX conference, and it's about 25 million lines of a supercomputer log. First of all, how nice is that for a 25-million-line log? I can deal with that, even if I were a total inept boob when it came to building interactive

UIs: you can click in and drill into these different things and understand them. But the key is that the links are also weighted by their salience, and colored by their salience. Notice: why is this one so important? These nodes themselves are not that important in the PageRank, but they're critical to the path. It turns out that particular link, this bottom one, is the "so-and-so authenticated, so-and-so has logged in" sequence; that kind of standard sequence, like Dave is authenticating against this provider, such-and-such has logged in on this TTY, is so common in the log file that it's a

critical path through the log. So again, it helps you understand very complicated data very quickly. And with that, we're done, I think with 30 seconds to spare; any questions for us? Oh man, that's what I call a miracle, okay, thank goodness. Yes? t-SNE, awesome; right, so t-SNE is a great example of a low-dimensional embedding, like I talked about earlier, especially for visualization; it's fantastic. I think they're all great, and literally what I don't want to do here is imply that these are the methods you should use for security; what I want to do is get you curious about these,

and learning the right things in terms of the path will set you on the right path. So in the full-day course, of which there are three variants, we cover t-SNE, we cover the code, we go through getting some data; it's just a matter of time. Yes, oh, definitely, sure. Yeah, so this will go up online; DEF CON has asked us to do this in depth at 101 next year with a longer session.