Data Science or Data Pseudo-Science? - Ken Westin

Name: Data Science or Data Pseudo-Science? - Ken Westin
Uploaded: 2016-08-26
Duration: 41 min 51 s
Description: Data Science or Data Pseudo-Science? Applying Data Science Concepts to Info sec without a PhD - Ken Westin Ground Truth BSidesLV 2016 - Tuscany Hotel - Aug 02, 2016

Mentioned in this talk

Tools used

Bokeh Splunk

Frameworks

scikit-learn

Transcript

Hello, thank you BSides for having me. I really appreciate it. I think it's one of the better security events. I've spoken at BSides Las Vegas, San Francisco, Vancouver, Portland, and DC as well. Last year I actually presented at DEF CON, which was pretty scary, so you guys are a little friendlier, hopefully. Don't throw anything, please. Today I'm going to be talking about data science or pseudo-science: applying data science concepts to infosec without a PhD. That's my Twitter handle and my email, so if you do have questions and you don't want everyone to hear them, feel free to email me. I do want to

have a disclaimer here: I am not a data scientist. But I've actually had the opportunity to work with data scientists. I went to a hackathon with the company I work for (I won't say where I work) with two PhDs, one from MIT and one from Purdue. I have a lot of respect for them and what they're doing. I think one of the challenges, though, has been communication with some of the data scientists. Has anyone actually talked to a data scientist? Yeah, it's kind of difficult sometimes to get a straight answer from

them when you're actually having a discussion. I think that's one of the challenges: we need yes-or-no types of answers, and it's very difficult to get a concrete answer sometimes from a data scientist. It's also very difficult to understand some of those concepts and how they actually apply to infosec. So I really wanted to dig in deeper, try to identify how we can leverage data science, and try to explain it in simpler terms. So just a little bit about me: my name is Ken Westin. I've been involved in security and technology for the

last 16 years, trained in both defensive and offensive security. I presented at DEF CON last year, where I talked about being a professional cyberstalker, where I actually put a lot of bad guys in jail using data. So I was really interested in collecting information, data mining, and open source intelligence in particular. I'm currently focused on security analytics, and I hate tomatoes and math. So, a little bit about me: at DEF CON I presented about how I can actually mine information and track criminals, so I've been really interested in data for quite a long time. And when it comes to

information security, particularly when it comes to the network, there's a lot of data that we have available to us. It starts at the endpoint; we had that with AV. We expand that out and now we have network security intelligence, where we're looking at IDS and firewalls. We started doing correlation rules, fairly simple, your SIEM use case. Then we started incorporating threat intelligence, and now we're dealing with even more information and more complex correlations. Does anyone here actually write rules for your SIEM? Right. Does anyone here like writing rules for your SIEM? There's always one. All right. But from this, it's really great for

us when we actually start doing these correlations, because we learn a lot more about who is attacking us, particularly with some of the threat intelligence stuff that's been coming through with STIX and TAXII; we're able to share a lot more information. But it's really difficult to write those correlation rules, and sometimes we can't actually identify all those different patterns, particularly when we talk about zero-day threats, and I'll talk particularly about insider threats or advanced adversaries, usually when you're dealing with compromised credentials and things like that. So when it comes to data science, I'm going to be focusing mostly on machine learning, because that's kind of

the area of interest right now. I was really excited to see that there are a lot of machine learning talks at BSides, and I'll list some of those at the end; I highly recommend you take a look at them. My goal here is for you to have a basic understanding of machine learning and data science concepts, so that when you go there you'll be experts. I really like this. This is actually from an internal Google document where they're talking about leveraging machine learning: apply machine learning like the great engineer you are, not like the great machine learning expert you

aren't. Right? So what we can actually do is leverage things like machine learning without having to be data scientists ourselves. I'll walk through how some of those models get created, and you'll sort of understand why you don't want to spend your time doing that. There are a lot of tools out there, things like UBA-type tools, that you can actually leverage, and I'll talk a little bit about how they work. But you also need to have an understanding of some of those terms, particularly when you're buying a machine learning product. Sometimes there's what I call data

Scientology actually happening, particularly with the marketing terms. When you're walking the floor of Black Hat this year, or RSA, you're going to hear a lot about machine learning, and a lot of times some of these security tools aren't actually using machine learning; they're just doing advanced correlations or maybe some statistics. So I'm going to walk you through some of the terms so that you understand the difference between unsupervised and supervised, so that when these vendors come up to you, you can ask those kinds of questions: what sort of machine learning are you leveraging? What sort of graphs? How does this work? Because I think it's

really important for us to understand how those black boxes function. Just an example: I did a little word cloud. These are all the different terms I saw around data science, and it's really confusing; there's just a lot of word soup. I'm going to start a little bit with big data, because that's sort of how data science got started. This is the definition I looked up, and I really thought this was funny, because if you look at the example sentence here, of how to use big data in a sentence, right: so much IT

investment is going towards managing and maintaining big data. So it's not talking about the value that you're actually getting out of leveraging big data; it's always about the cost of collecting this information. That's also what I have here: "Let's solve this problem by using big data that none of us have the slightest idea what to do with." I see this time and time again. I'm even seeing CISOs do this at some of the customers I go in and talk with: they start collecting a lot of information, they think that they're going to use this information at some point, but what they end up doing is

just storing that information, and a lot of times that data becomes a liability. Particularly when you have log data in there that might have credit card information, you're actually opening yourself up to a lot more liability as a result, and sometimes IT guys don't quite understand that. So that's where, with data, we're talking about when the shark actually jumps you. Big data was a big buzzword, so a lot of IT organizations started gathering this information, and then security guys are expected to also make sense of it: here's terabytes of data, or petabytes of data, what threats can you find in this information? And that's really

difficult to do, particularly when it comes to security. There's a lot of insight that we can actually get from big data, but the trick, whether it's security or not, is actually identifying and asking the questions: what do we want to do with this data? What sort of data sources do we need to accomplish this? We don't necessarily have to grab every single possible piece of data and bring it into the environment. And with security it's not just big data. If you're actually doing NetFlow or full packet capture, we've been doing big data for a long time in security. All of our tools are very

noisy; they generate a lot of machine data. When we're dealing with security, we're actually dealing with morbidly obese data in a lot of ways. It's almost ridiculous sometimes how much information we're actually dealing with, and sometimes it's good to narrow the focus: identify key data sources, what sort of reports we're going to generate from them, and what sort of threats we're actually looking for. I'll be the first to say that I believe all data is security relevant, but not necessarily all pieces of that data. You're going to want to look for time series information, timestamps,

IP addresses, and things like that, but you don't necessarily need the entire contents of a log event, for example. And there's a term that comes up a lot: data lake. Basically, when I hear it, I just think, oh, so you have a bunch of information that you don't know what you're going to do with yet, but you're going to start collecting it. And this is what it sort of becomes. Particularly for security, these data lakes are not going to be particularly useful if it's just a bunch of stagnant data. When we're dealing with

data, it has a half-life: the longer we have that information, the less valuable it actually is. Our adversaries change their tactics and their tools constantly, so if we're holding on to this information, the attack information that we were running models on a year ago is not going to be relevant today. So I'll talk a little about how we make that data flow. There are some differences between big data and data science, and I like to compare this to Hoarders versus an Amazon warehouse. Anyone see the show Hoarders? Okay, yeah. So the idea

here is that people collect all this crap in their house, they can't get rid of it, they don't know... it's almost like a mental disorder. But with an Amazon warehouse, the idea is to get things in and get things out; it has throughput. To accomplish that they have to be highly efficient, and they have to be able to tag those particular packages to actually get them out of the door. So what we're talking about is collection versus insight. When we actually start applying data science to our

data flows, we're actually able to gain insight from that, versus just hoarding that information. So we want to make big data flow; we want to move beyond just the data lake into this sort of flowing type of environment. That's how we're actually able to gain insight, and one way we can accomplish that is with what's called a Lambda architecture. You guys know where this is from? Am I dating myself? Revenge of the Nerds: Lambda Lambda Lambda. So, lambda in physics actually represents half-life or a wavelength. It's also a pretty awesome game: it's the symbol for Half-Life, the video game.
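A minimal sketch of the Lambda architecture idea, assuming a toy in-memory event list rather than real infrastructure: a batch layer periodically recomputes a view over the whole master data set, a speed layer keeps an incremental view of events that arrived since the last batch run, and a query merges the two. The names `LambdaStore`, `ingest`, `run_batch`, and `query`, and the login events, are hypothetical and chosen only for illustration.

```python
from collections import Counter


class LambdaStore:
    """Toy Lambda architecture: batch view + speed view, merged at query time."""

    def __init__(self):
        self.master = []             # append-only master data set
        self.batch_view = Counter()  # precomputed from the master data set
        self.speed_view = Counter()  # incremental counts since the last batch run

    def ingest(self, user):
        """New event: append to the master data set and update the speed layer."""
        self.master.append(user)
        self.speed_view[user] += 1

    def run_batch(self):
        """Batch layer: recompute the view from scratch, then reset the speed layer."""
        self.batch_view = Counter(self.master)
        self.speed_view.clear()

    def query(self, user):
        """Serving layer: merge the older batch view with the fresh speed view."""
        return self.batch_view[user] + self.speed_view[user]


store = LambdaStore()
for user in ["alice", "bob", "alice"]:
    store.ingest(user)
store.run_batch()        # batch view now covers the first three logins
store.ingest("alice")    # arrives after the batch run, lives only in the speed view

print(store.query("alice"))  # 3: two from the batch view, one from the speed view
print(store.query("bob"))    # 1
```

In a real deployment the batch and speed layers would be separate systems; the point of the sketch is only that a query sees both the comprehensive batch result and the events that are too recent to be in it.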

But a Lambda architecture is a useful framework to think about designing big data applications. Nathan Marz, who worked at Twitter, designed this generic architecture for addressing common requirements for big data, based on his experience working on distributed data processing systems at Twitter. Lambda architecture is a data processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream processing methods. This approach to the architecture attempts to balance latency, throughput, and fault tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide

views of streaming online data. So that's a lot of technical gobbledygook. What it actually allows us to do is deal with massive data sets: we can move data into the master data set, and that allows us to run batch processes on that information. This is important because this is where you're going to see a lot of machine learning algorithms leverage that data. But then we're also able to look at real-time information as well, and then we're able to run queries and actually see this information flowing through. So if you look at a lot of the tools for machine learning for security,

this is the type of architecture they're actually built on top of. And if you want to get a little more technical and try to build your own, these are some of the open source tools you're going to hear about: Hadoop, of course, Cassandra, Spark for the speed layer, and then Kafka for gathering information. If you want to learn more about this, I highly recommend this book. It was actually written by Nathan Marz from Twitter, about how he actually built out this architecture. If you don't want to learn about how to deploy Hadoop and things like that, you don't need to, but the first maybe three to four

chapters I highly recommend, so at least you understand a little bit more how that works. So now we've talked about how to collect information and how to get it to flow, so now I want to go a little bit more into the actual data science part. So data science: how does it work? This is by Drew Conway. It's the data science Venn diagram, where he talks about all the different skill sets that are important for data science. The hacking skills are not necessarily trying to hack into networks, the things we're familiar with; it's

actually more just command-line skills: the ability to collect, manipulate, and extract data, and knowing how to manipulate text files at the command line is important. Math and statistics knowledge: that's where I get a little scared sometimes. Some of the guys in data science are just wizards when it comes to this. You need to understand math and statistical methods, which requires at least a baseline familiarity with some of the tools, and I'll talk about some of the tools as well. Substantive expertise: I think this is a really important factor, and this is where there's a bit of a disconnect when it comes to using data science and machine

learning in security. Let's just say there's a report out there, and this report is really popular. It talks about all the vulnerabilities that are targeting organizations: here are the top five vulnerabilities, and one of them might have been FREAK. But the problem with that data was that it actually came from some data scientists who were able to pull this information from IDS and vulnerability management applications, but they didn't quite realize there are a lot of false positives in that data. So data scientists may go look at the data, but without any sort of knowledge when it comes to security, they don't

really understand how to make sense of it, and they start to make assumptions, and that's where things get really dangerous, when we start talking about this danger zone here. You have the skills, just enough to get dangerous, but if you don't have that substantive expertise, a lot of things can fall apart. When I talk about machine learning, a lot of times people will say, "Oh, it's AI, they're going to come take my job." I guarantee you that for security analysts in particular, the salaries are increasing. The goal here in actually leveraging data science

and machine learning is not to replace you; the goal is to enhance you. So this is more what it's going to look like. Leveraging things like machine learning, we're going to be able to identify patterns; we're going to see things in our environments and our data that we weren't able to see before. That's the whole idea of this. And a lot of times, too, if you actually leverage machine learning, you don't have to hire maybe quite as many analysts, but you'll also have a fatter paycheck if you actually understand some of these concepts and can deploy them. So some of the tools that are actually used

in machine learning: there's Java, Scala, and R. Those three are usually what you're going to see data scientists use. I stick with Python because that's basically what I know how to program in; I'm kind of a shitty programmer and I know Python, so that's what I work with. And in Python there's actually a really cool library called scikit-learn. Has anyone used this library before? I'm kind of curious how many of you are actually using machine learning in your environments. Very cool, awesome, it's really good to see. There are a lot of algorithms that come with this; you can look here, if you go to the website. There's

algorithms for classification and regression; I'll go through and define some of those and what they are. But one of the challenges, too, is that you also have to visualize that information. I think one of the critical things for the analyst is not just gaining that insight, but how do you make that information useful to the analyst, so it makes sense to them. What I actually use is Splunk. They have a machine learning toolkit that's free: you can download the free version of Splunk and then run all these exercises I'm walking through. I'm actually going to use some of their demos. You

can download Splunk directly from their website and run it locally on your computer, which is what I'm going to do. You install this application, and it'll show you there's another dependent application that you have to install, and you're good to go. The nice thing about this is that I don't have to worry about visualization; there are all sorts of visualizations available to me. So I can create and run models and then create these nice visualizations without having to mess with a bunch of other dependencies. I'm kind of lazy, and trying to install these different dependencies and do visualization in Python is really hard and takes more time than I want to spend on it.
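As a rough illustration of what "creating a model" means in a library such as scikit-learn, whose estimators expose `fit` and `predict` methods, the sketch below mimics that shape in plain Python with a one-variable least-squares line. It is not scikit-learn code and has no dependencies; the class name `TinyLinearModel` and the numbers are made up for the example.

```python
class TinyLinearModel:
    """One-feature ordinary least squares with a fit/predict interface."""

    def fit(self, xs, ys):
        n = len(xs)
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        # slope = covariance(x, y) / variance(x)
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        var = sum((x - mean_x) ** 2 for x in xs)
        self.slope = cov / var
        self.intercept = mean_y - self.slope * mean_x
        return self

    def predict(self, xs):
        return [self.slope * x + self.intercept for x in xs]


# Hypothetical training data: hour of the workday vs. events seen in that hour.
hours = [1, 2, 3, 4, 5]
events = [12, 14, 16, 18, 20]   # exactly events = 2 * hour + 10

model = TinyLinearModel().fit(hours, events)
print(model.slope, model.intercept)   # 2.0 10.0
print(model.predict([6, 7]))          # [22.0, 24.0]
```

Fitting is the step that learns the coefficients from known data; predicting applies them to inputs the model has not seen. A library handles many features, validation, and numerical edge cases that this toy ignores.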

I'm going to talk a little bit about different types of machine learning. There are two main types. There's supervised machine learning, where the focus is to build models that make predictions based on evidence and labeled data in the presence of uncertainty; adaptive algorithms identify patterns in the data, learn from those observations, and you create models as a result. Then you have unsupervised machine learning, where we're able to draw inferences from data sets that may not have labels, and I'll talk about the differences between labeled and unlabeled data. There's also semi-supervised machine learning, which gets a little more advanced, but it's talking more

about using labeled data to go out and actually label unlabeled data. If we're able to do supervised learning, we're generalizing from some of the labeled data. So this is an example: usually that data is some sort of a table, but this is just a nice visualization of it. With supervised machine learning there are three core areas where it's actually used. Regression: a regression problem is when the output variable is a real value, such as authorizations over time, and I'll show some examples of that. Classification: it's a classification problem when the output variable is a category; it could be binary,

could be malicious or non-malicious, authorized or not authorized, spam or not spam. And then you have anomaly detection, where you're able to identify unusual activity by learning what normal looks like. An example would be using a history of normal web authorizations to then identify anything significantly different. This can also be used a lot in fraud. So here's the process that we go through to create a supervised machine learning algorithm. We have raw security event data. The data scientist will take a sample, he'll start developing and training the model, and then he'll test that, and it's a highly iterative process. He'll write an algorithm and

then he'll have a product, but he's not done yet. A lot of times when they write these algorithms, it's one thing to run these models on very clean data, but when you actually release them out into the real world and look at real data, that's when sometimes things will fall apart and the wheels will come off. That's where you have to have a sort of verification process. Then we actually move that into production. This is very similar to a process you might have for developing correlation rules, IDS rules, or firewall rules, so just imagine that process but maybe 10 times more complex,

more math, and a lot more profanity. And then this is what it might look like. This is a sample: maybe we're running a model where we're trying to identify malicious domains. It's going through, this is all tagged data, and it can identify that yes, these two are malicious: a string of consonants, numbers, all these sorts of things that it has identified in malicious domains. What it's actually looking at is a list of known malicious domains, so you have to have a list of known malicious domains to run this

sort of thing, where it can actually start to differentiate and identify the differences between those. So, supervised machine learning with regression: it's used for predictive modeling, to investigate the relationship between a dependent (target) variable and independent variables. Here are a few examples of algorithms that you're going to run into. Memorize these, use them at cocktail parties, you'll sound smart. I'm going to focus here just on linear regression in our example; it's one you're going to run into quite a bit. I'm going to do a demo where we're actually going to predict VPN usage. So let's say you've been tasked with

identifying or predicting how much your VPNs are going to be used, and, I know this never happens, but the IT guys forgot to enable logging on the VPN. Never happens in real life, I know. So what we're going to do is look at internal applications and how they're used, and we know that in order to use an internal application, users have to access the VPN, so we can make predictions about VPN usage based on that information. So without further ado... please work. All right, so there's a CSV file, app usage; it's basically that same data I showed you. Here's what it

actually looks like: it's just a simple CSV file with the data, and it's tagged. I have CRM, CloudDrive, ERP, expenses. What I'm actually doing here is bringing in the CRM, the CloudDrive, HR, and webmail; I'm going to use those for my prediction. Then I can run this. I fit this to the model, it takes a bit to run, and this maps it out for me. Remember what I talked about, why I use Splunk for the visualization? This is super simple; it's built into the app, so I don't have to create the visualization, and it actually plots it out. So what it's doing is

actually showing the VPN usage based on the usage of these applications, so we can start to make predictions based on that. I don't know what the actual VPN data is, but I know that there's a relationship between the applications and the VPN, and so I'm able to make these predictions. Another one is machine learning classification. In classification we have data that we want to sort into predetermined categories. I'm going to be using binary classification, so it's a little bit different than some other ones, in that we're actually looking at yes/no or true/false. So again, I'm trying to keep these really

kind of dumbed down and fairly simple. So this is what our CSV file looks like in Excel; we've cleaned it up. So here basically we have a training data set that's been cleaned up for us. We have data samples from multiple firewalls. Some of the firewalls we know are affected by malware and have critical vulnerabilities. We may know that a number of our firewalls are affected by a critical vulnerability but not be sure if they have been exploited by malware targeting that vulnerability, or the inverse may be the case, where we suddenly see a number of firewalls hit by malware and we want to identify if they're actually affected by a known

vulnerability. So this again is another example of why it's important to have domain expertise. What we're seeing is that there's a slight anomaly in the data that hints at malware behavior. Then we add whether a particular host has known vulnerabilities to actually make a prediction. So I'm going to do the example. Demo gods... So, in classification we have data that we want to sort into predetermined categories. I'm going to click on this... work. So here I'm bringing in that firewall traffic CSV data. What I'm going to be predicting is: is this firewall affected by malware? Some of the fields I'm bringing in: bytes

received, bytes sent, destination port, whether it has a known vulnerability or not (maybe this is coming from our vulnerability management tool), packets received, packets sent. I can then run and fit my

model. So now we're down here with our classification results. What this does is look at the data and at how successful it was with its predictions. For the predicted "no" (is it affected by malware?), it has an 83% confidence; for predicted "yes", it has a 78% confidence. And I can add additional fields, or maybe remove one. Let's say I remove whether there's a vulnerability on that host. Sorry, it takes a little while to process here. So here the predicted "yes" is at 90%, but with our "no" there's not really confidence; it's about
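The two percentages in the demo look like per-class scores: of the firewalls the model labeled "no", how many really were clean, and likewise for "yes". Assuming that reading (the talk does not say exactly how the toolkit computes them), the sketch below scores a deliberately crude, hand-written rule on made-up rows. The field names `bytes_sent` and `has_known_vuln`, the 5000-byte cutoff, and the data are all hypothetical; a real classifier would learn its rule from the training set.

```python
# Hypothetical labeled firewall samples: (bytes_sent, has_known_vuln, infected)
rows = [
    (9000, True,  True),
    (8500, True,  True),
    (7000, True,  False),
    (1200, True,  False),
    (9500, False, True),
    (800,  False, False),
    (600,  False, False),
    (400,  False, False),
]


def predict(bytes_sent, has_known_vuln):
    """Stand-in for a trained model: flag vulnerable hosts sending a lot of data."""
    return has_known_vuln and bytes_sent > 5000


def precision_by_predicted_label(rows):
    """For each predicted label, the fraction of those predictions that were right."""
    correct = {True: 0, False: 0}
    total = {True: 0, False: 0}
    for bytes_sent, has_known_vuln, infected in rows:
        label = predict(bytes_sent, has_known_vuln)
        total[label] += 1
        correct[label] += (label == infected)
    return {label: correct[label] / total[label] for label in total if total[label]}


scores = precision_by_predicted_label(rows)
print(scores[True])   # predicted "yes": 2 of 3 correct
print(scores[False])  # predicted "no": 4 of 5 correct
```

Dropping or adding an input field changes which rows land under each predicted label, which is why the per-class numbers in the demo move when a field is removed.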

50/50. So another one would be anomaly detection: numeric outliers. What we're going to do is look at logins versus predicted over

time. So here we see we have our logins page; we've cleaned up that data and loaded it in. What we want to do is identify any sort of anomalies that are outside the norm. It sort of establishes this baseline, and we see a number of outliers in this graph. Another nice thing is the visualization. So let's say I want to update the threshold for this, so then I'll have fewer outliers, of course. Okay, and now I see that there's something happening here. This is another example of having the substantive expertise when it comes to this. So on November 26th we were

attacked, right? All of a sudden we had a bunch of people logging into our e-commerce website, and they're trying to hack us, right? No: Black Friday. This is the sort of thing where the machines don't know about Black Friday; they don't know about holidays. So you'd have to go through and train the model, and you would add some additional exceptions and things like that to the model so it understands that. Or maybe this comes through, you pass it to your incident responders, and it's a false positive. So these are just a few examples of supervised machine
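A minimal sketch of the numeric-outlier idea from the demo, assuming a median-based baseline (the talk does not say which method the toolkit uses): a day is flagged when its login count sits more than `threshold` median absolute deviations from the median. The daily counts are invented, with one Black Friday-style spike at the end.

```python
from statistics import median

# Hypothetical daily login counts; the last value is a holiday-style spike.
logins = [100, 104, 98, 102, 96, 110, 100, 90, 101, 300]


def outliers(values, threshold):
    """Return (index, value) pairs more than `threshold` MADs from the median."""
    center = median(values)
    mad = median(abs(v - center) for v in values)  # median absolute deviation
    return [(i, v) for i, v in enumerate(values) if abs(v - center) > threshold * mad]


print(outliers(logins, threshold=3))   # flags two mildly odd days and the spike
print(outliers(logins, threshold=10))  # only the spike survives the looser threshold
```

Raising the threshold yields fewer outliers, as in the demo. The code still cannot tell a sale day from an attack; that judgment needs the analyst's context or an explicit exception for known holidays.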

learning. As you can see, it's a lot of work to go through this process: you're creating these models, making assumptions, looking at the data, and constantly tweaking it. It takes quite a bit of work, and I have a lot of respect for the folks that actually do this. Now I'm going to talk about unsupervised machine learning. Unsupervised machine learning is where you have no labels in the data. We had nice columns of information before with our supervised machine learning; with this, we don't. I think a good analogy is Netflix. When we're looking at movies, we

know, hey, there's romance, there's action. That's a good example of supervised machine learning: things we're telling the system, and we're categorizing that. With unsupervised machine learning it's more about identifying things we don't know about: did you know that girls between 12 and 30 like Renée Zellweger types of movies, maybe not always rom-coms? There's a lot of insight that people like Netflix can gain from that, maybe for marketing purposes and things like that. But we can also do that with our adversaries, particularly when we're talking about user behavior, and we'll show some examples of

that. We can actually identify anomalies within our environment. Unsupervised machine learning is about gaining a general understanding of the present data by discovering hidden patterns. So in contrast, again, to supervised learning, we are working with data that is not marked or indicated by labels. And here are some examples, again, of algorithms that you're going to hear about; again, use these at parties and you'll sound smart. The most popular one you're going to hear about is k-means, and that's what most folks use to actually identify those clusters. The unsupervised machine learning process looks a little simpler: you have raw security data, you have

an algorithm, and then it has a sort of automated clustering. Primarily where this is going to be used is what's called UEBA. It used to be called just UBA, but Gartner changed that; now it's user and entity behavior analytics. Really, it's used a lot for identifying unauthorized users using authorized credentials doing unauthorized things. This is an area of security that's very difficult for SIEMs to track. SIEMs are really good at known threats: incorporating threat intelligence, maybe bringing in IDS signatures, and things like that. But what about attacks that don't have those signatures: compromised credentials, or if someone is using advanced malware to get

into your organization, or you have a malicious insider who's accessing data they shouldn't? A couple of interesting use cases that UEBA helps with: account takeover, so privileged account compromise; data exfiltration; lateral movement within a network, it's good at detecting that as well; and any sort of suspicious activity, malware attacks, botnet command and control. Especially with things like ransomware, where the C2 infrastructure is constantly changing, you can't necessarily rely just on threat intelligence, particularly with some of the more sophisticated ransomware coming out. You also do user and entity behavior

analytics as well, so suspicious behavior by accounts or devices. So it can be compromised credentials or it can be compromised systems. So I really like this: SANS, a few months ago, had a summit for incident response and threat hunting, and they actually modeled out this threat hunting maturity model, where most folks are using ad hoc search, statistical analysis, visualization techniques, and aggregation. So here we have basic log search, we have maybe creating some dashboard visualizations leveraging our SIEM, right? We have maybe even up to some advanced correlations, and then what we're starting to see is more machine

learning and data science being leveraged. And they actually polled the people in attendance, and this is what it worked out to: at least 85% are using search. You kind of have to have that, so even if you identify something strange here, you have to be able to go back and access the original log data for your forensic analysis. Statistical analysis, about 55%. Visualization techniques are around 50. So 32% are leveraging something in this machine learning and data science area: either they're doing it directly themselves, or trying to, or they're using a tool, an existing security tool that maybe has an add-on, or like what we saw with an

app. All right, who is this guy? That's right, Kasparov. So why is he famous, right? He's not famous because he's a good chess player; he's famous because he got his butt kicked by a computer, right? What's interesting is that after Kasparov did get beat by a computer at chess, they actually did some additional exercises where they did what they called freestyle chess. And what they actually found was that a weak human plus machine and better process was superior to a strong computer alone and, more remarkably, superior to a strong human plus machine and an inferior process. So someone who is not necessarily, you know, good

at chess, having machines to strategize, it's really interesting that the combination of that, the strong humans, the machines, and the process, was better, right? So this carries over into what we're doing, right? So leveraging machine learning, we don't want to rely on it completely. We still have to have the strong analyst; we still have to have a process around it. So one thing I want to talk about too that we want to incorporate from data science is threat modeling with graphs. I'll show some examples of this, but graph-based threat computation is where we're actually able to build graphs of

anomalies and detect neighborhoods of anomalies that indicate that there's malware activity. We're able to see, as an example, multiple internal nodes beaconing to the same IP address. It's geared towards detecting malware and those types of threats. We also have pattern-based computation, where we actually compute threat based on patterns that can be observed over sets of anomalies. This is where, when you actually are identifying anomalies within the environment, we want to map that to the threat models. That's where the substantive experience comes in. So this is what the UBA model looks like. What we're doing is we're bringing in raw security data, we're running
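
The "multiple internal nodes beaconing to the same IP address" example above can be sketched as a tiny graph lookup in plain Python. Everything named here is an assumption for illustration: the `10.0.0.0/8` internal range, the `(source, destination)` anomaly format, the `min_hosts` threshold, and the sample addresses. It is not how any particular UBA product computes this.

```python
import ipaddress
from collections import defaultdict

# Assumed internal address space for this sketch.
INTERNAL_NET = ipaddress.ip_network("10.0.0.0/8")

def shared_destinations(anomalies, min_hosts=3):
    """Group anomalous connections by external destination and keep the
    destinations reached by at least `min_hosts` distinct internal hosts."""
    graph = defaultdict(set)  # external destination -> internal sources
    for src, dst in anomalies:
        src_internal = ipaddress.ip_address(src) in INTERNAL_NET
        dst_internal = ipaddress.ip_address(dst) in INTERNAL_NET
        if src_internal and not dst_internal:
            graph[dst].add(src)
    return {
        dst: sorted(srcs)
        for dst, srcs in graph.items()
        if len(srcs) >= min_hosts
    }

# Hypothetical anomaly events: (internal source, destination).
anomalies = [
    ("10.0.0.11", "203.0.113.7"),
    ("10.0.0.12", "203.0.113.7"),
    ("10.0.0.13", "203.0.113.7"),
    ("10.0.0.11", "203.0.113.7"),   # repeat beacon from the same host
    ("10.0.0.14", "198.51.100.20"), # a one-off, not a neighborhood
]
suspects = shared_destinations(anomalies)
print(suspects)
```

The input is assumed to be connections already flagged as anomalous. Run against all traffic, popular legitimate destinations would dominate the result.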

machine learning. And a lot of these UEBA tools will use a combination of unsupervised machine learning, statistical analysis, and a lot of other components and algorithms that will actually be leveraged to identify those anomalies. They can even incorporate things like threat intelligence data, so a lot of the data that you're already collecting and using in your SIEM, you can actually bring into these systems, and it'll start to identify some of those anomalies. Then what we do is use the graph mining concept. So here we use an anomalies graph and entity relationships; what that'll actually do is it'll start to score, and it'll look for connections in those

anomalies. And then from that, what we have is what I call anomaly chains. What we actually do is identify the threats, right? So now what we're doing, for example, is we see an anomaly from a strange IP address, we see something on an endpoint that connected to that IP address, and we're able to see that, hey, there's a registry change, and then we see some beaconing activity out to another IP address or something else, right? So now it's not just an anomaly, something strange that happened. Because if you go chasing those anomalies, you're going to spend a lot of time. You
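
The anomaly chain idea above (strange IP, endpoint connection, registry change, beaconing) can be sketched by linking anomalies that share an entity and summing their scores. The record layout, entity names, and scores below are hypothetical, and linking on shared entities with a simple sum is an assumed simplification, not the scoring any specific tool uses.

```python
from collections import defaultdict

# Hypothetical anomaly records; "entities" are the hosts/IPs each one touches.
anomalies = [
    {"id": 1, "time": 10, "kind": "rare_source_ip",
     "entities": {"198.51.100.9"}, "score": 2},
    {"id": 2, "time": 12, "kind": "endpoint_connection",
     "entities": {"198.51.100.9", "host-a"}, "score": 3},
    {"id": 3, "time": 15, "kind": "registry_change",
     "entities": {"host-a"}, "score": 4},
    {"id": 4, "time": 20, "kind": "beaconing",
     "entities": {"host-a", "203.0.113.7"}, "score": 5},
    {"id": 5, "time": 11, "kind": "odd_login_hour",
     "entities": {"host-z"}, "score": 2},
]

def build_chains(anomalies):
    """Group anomalies that share any entity (union-find), order each group
    by time, and rank the groups by total score."""
    parent = {a["id"]: a["id"] for a in anomalies}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_seen = {}  # entity -> id of the first anomaly that mentioned it
    for a in anomalies:
        for entity in a["entities"]:
            if entity in first_seen:
                parent[find(a["id"])] = find(first_seen[entity])
            else:
                first_seen[entity] = a["id"]

    groups = defaultdict(list)
    for a in anomalies:
        groups[find(a["id"])].append(a)

    chains = []
    for members in groups.values():
        members.sort(key=lambda a: a["time"])
        chains.append({
            "kinds": [a["kind"] for a in members],
            "score": sum(a["score"] for a in members),
        })
    return sorted(chains, key=lambda c: c["score"], reverse=True)

chains = build_chains(anomalies)
print(chains[0])
```

The four related anomalies surface as one ranked chain, while the unrelated odd-hour login stays a low-scoring singleton instead of becoming its own alert to chase.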

think firewall and IDS false positives were a pain? If you go chasing anomalies, it's almost a dead end a lot of times. And that's actually been a challenge with some of the early UBA types of tools: people would get these anomalies and they wouldn't have the information to do anything with it. Okay, something strange happened; what does this mean within my environment? I don't understand, right? But when we start to chain them together and we actually identify those threats, it's going to make a lot more sense to the analyst. And so then we map that to threat models. So we may map it to,

you know, lateral movement, right? So beaconing, or all the different threat models that we would consider in our organization, we start to map those anomalies: does this look like lateral movement? Does this look like beaconing? Land speed violation, you guys know what that is? The Superman problem, basically: if I log in here from Las Vegas and then all of a sudden I log in from Moscow, there's no physical way I can actually be in two places at once, right? So those are based off GeoIP, things like that. And then what we're doing is we map that, of course, the HCI component,
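
The land speed violation ("Superman problem") above comes down to distance divided by time. A minimal sketch, assuming each login already carries GeoIP-derived coordinates; the 1000 km/h threshold, the record layout, and the approximate city coordinates are assumptions for illustration.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0
MAX_SPEED_KMH = 1000.0  # assumed threshold, roughly airliner speed

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def is_land_speed_violation(prev, curr, max_speed_kmh=MAX_SPEED_KMH):
    """True if moving between two logins would require exceeding the limit."""
    hours = (curr["time"] - prev["time"]).total_seconds() / 3600
    distance = haversine_km(prev["geo"], curr["geo"])
    if hours <= 0:
        # Simultaneous logins: any separation at all is impossible travel.
        return distance > 0
    return distance / hours > max_speed_kmh

# Hypothetical logins for one account; coordinates are approximate.
vegas = {"time": datetime(2016, 8, 2, 10, 0), "geo": (36.17, -115.14)}
moscow = {"time": datetime(2016, 8, 2, 11, 0), "geo": (55.76, 37.62)}
vegas_later = {"time": datetime(2016, 8, 2, 15, 0), "geo": (36.17, -115.14)}

print(is_land_speed_violation(vegas, moscow))
print(is_land_speed_violation(vegas, vegas_later))
```

GeoIP locations are coarse and VPNs move users around legitimately, so a hit here is better treated as one anomaly to chain with others than as a verdict on its own.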

the human component, where we actually can map those threat models when we do the visualization, where, let's map that to the Lockheed Martin kill chain. So if we're talking about this particular case here, it's like, hey, we see where they actually got in, we see that there was some lateral movement where they were actually collecting data, and then we see that, hey, there was an exfiltration event. And we can actually map that out and visualize it for the analyst. And then we can also go in and get forensic artifacts, gather other information. So now the analyst can not only identify that there's a threat, he sees the

connection, he can also dig deeper and actually look at those events, and either pass that off to someone else to go and remediate those systems, or identify that we have a larger problem, that this is part of a larger attack as well. And also threat and risk scoring, things like that. So that is it for my presentation. Since you guys all sat in for almost 40 minutes of a talk on data science, I'm giving you guys your Doctorate of Divinity in data science. You guys are all now PhDs. So you get your slide; print this out, show it to your employer. All right, so again, I

think it's really interesting that there's a lot of great talks on machine learning. I wanted to highlight some of the other talks that are happening. This one here, Joe S and Roto, they're friends of mine; really recommend taking a look at that one. But these all look really interesting. I think it's really interesting how machine learning is actually being leveraged. It's gone from something that was sort of theoretical to some of the tools I'm seeing actually identifying some of these threats. I've seen it actually in some of the customers I'm on site with, where they started to run these tools. It takes about, you know,

it could take about 6 weeks before you'll actually start to identify those anomalies, and it's very different than your SIEM use case, where you can write the correlation and see it now. It takes about 6 weeks or so for it to actually learn what normal is in the environment. But it's really interesting: even highly sophisticated security organizations, where their maturity is very high, they have very good correlation rules, they're able to identify a lot of threats, they're finding things that they wouldn't have seen before, right? So I believe that machine learning isn't going to replace your existing security best practices, it's not going to replace your SIEM, but again, just like in the chess example,

it's going to enhance the analyst, it's going to enhance your security program. You're going to identify threats you didn't see before. Another thing I found too is that you'll start to look at anomalies, right? So the SOC's going to look at the threats, they're going to look at the chain of anomalies, whereas your hunters might actually go dig into those anomalies specifically, because a lot of times what they'll ask is: is this something that maybe one of our correlation rules in our SIEM should have picked up? And sometimes it's yes. So it'll allow them to also go through and fine-tune some of their correlation rules

as well. So that is it for my talk, a little bit early. I'm on Twitter, @kwestin. Feel free to email me, death threats, whatever works. I guess we have a little bit of time for questions. I hope you guys don't get too technical on me, because again, I'm not a data scientist. So any questions? If you have a question, please use the

microphone hey yeah great talk so far but um you mentioned you using the the tooling from Splunk for your visualizations and some of your modeling you aware of any tools related or used for like elastic search or those platforms um I haven't seen any for machine learning does anyone else is anyone using elastic and machine learning anyone using elastic machine

learning libraries, if you don't need a, you don't need a pared

solution. Yeah, I use one because I know it, and that's also where the data that I have, that I want to go and look at, is going to be. So that's just the tool that I use there. That's why I wanted to list all the different tools; there's tons of them out there. I don't even understand how some of them work, frankly. That's well beyond what I do. But it's important to understand what tools some of the data scientists are using, so at least you can understand: hey, they're using Scala, well, maybe that's not going to work in my environment. I think Elastic on their website has a

to... Yeah, yeah, actually, that's what I was going to say. If you're doing a lot of data science with, like, Python, IPython Notebook, everybody's familiar with that, and now there's this thing called Jupyter Notebook that's kind of like the same thing, and you can do a lot of really cool visualizations with that as well. Yeah, yeah, and it's great too. Like, I'm just experimenting and learning how these things work. I'm never going to become a data scientist myself; I don't have the time. I'm more of the security analyst, but I still want to understand how things work, right? I don't just want to trust that, hey,

you know, use data science, this black box is going to solve all your security problems. I want to understand the differences between the different algorithms, at least so I can have a conversation with folks when we actually look at some of this data. It's also good, I think, to be able to understand some of those basic concepts, so when you actually go to get funding for some of these tools, you know: how does it work? Well, here's how we can actually identify new threats and things like that. So it's not necessarily important where the data is and how it's stored. You can use Elastic, you can use any sort of other

data source. Like, I'm just using CSV files, right? But the important thing is it's just more around the general concepts, I think. So, because there's no other question, I just wanted to make a quick comment that there's a project that I'm working on that almost is exactly that Lambda architecture slide that you had, and the components and things like that. It's a new Apache project called Metron, so if people want to play... Metron, cool. Yeah, thanks. Great. So, any more questions? Well, thanks a lot, I really appreciate it. Thanks for having me. Appreciate it.
