
The BSides DC 2017 videos are brought to you by ThreatQuotient, introducing the industry's first threat intelligence platform designed to enable threat operations and management, and by DataTribe, a new kind of startup studio co-building the next generation of commercial cybersecurity analytics and big-data product companies.

Show of hands: who here knows P and NP? Okay, I got a couple. This is going to be fun, because my background is in large-scale data analysis, cluster computing, high-performance computing, making things work, and I backed into security about 12 to 15 years ago. So I'm coming at this from that lens, and this story is about sharing with you the things we've figured out as we've been trying to move around big pieces of data,
and leaving you with some of the first principles and the key elements we focus on when we try to solve these problems. I'll talk some about the stuff we're finding, but for me the really hard part is: how do you actually carry around a trillion records, and how do you make them useful? So you go to DEF CON and there are basically two kinds of talks. There's the one really cool talk where the guy hacks a satellite — everybody seen that talk? Okay, I got a snicker, that's good. And then there's the "let me tell you a story" guy. The title of this talk is Adventures with Trillions of Records, and
so every adventure starts someplace, and every useful analytic exploration starts someplace. My argument is that we've got to start with a problem. "I've got a trillion records" isn't a problem; it's an opportunity to geek out for a while. Here are some problems you have to think about: why we're looking at the data, what question we're trying to answer. It took about ten years, and my soul being broken by my bosses, before I realized it just wasn't enough to answer a cool question — because that cool question actually did nothing, even though it made a cool presentation and it was fun. None of these actually have much to do with TCP or servers or any of that. A buddy of mine
is out in Seattle, and he keeps talking about presuming you're going to get hacked. Yeah, everybody gets scanned every day. Which are the Cat 5 hurricanes — the attacks you should worry about, the ones that are actually getting close to the stuff that is core to your business? He runs an educational institution, so he worries about making sure labs work and students' PII doesn't get stolen. At my church, we've got the people who go there, we've got the employees, and we've got the PCI information so we can donate — and that's about it. That's the core infrastructure. If the Wi-Fi goes down in the church, nobody really cares. But we've got to start with a real problem,
not with "how cool is it to carry this stuff around?"
Then we start talking about the data we apply to it. The groups I work in look not only at large amounts of data but at many different types of data. My AI professor pointed out that bias is good, and this is good because otherwise you have a huge undirected search through multivariate space: you're wasting lots of CPU and people time not going to the place where our gut — everybody in this room — already knows what's hinky. That bias also tells us which data sets we should pull together. If we're doing insider-threat analysis, there are some things we actually don't care about; if we're doing external-exposure analysis,
there's a completely different set of things we care about. The reason CVE and CPE are up here — besides being available in one of my favorite graph formats, which we'll talk about in a few minutes — is that they give us a common way, across data sets, to represent technologies and the candidate threats to those technologies. Defense in cyber is a team sport that requires everybody to play the same way. Unfortunately, that's impossible. It's the opposite of the NFL, where usually one guy in the right place at the right time can blow the whole play up. (Unfortunately for Bill Belichick right now, he's suffering a little bit because he doesn't have a good defense; in cyber defense, we're unfortunately all the 2017 Patriots.)
In cyber offense, conversely, you only need one guy in the right spot at the right time. So what we end up doing is not saying "this is the attack that's going to get you," but "this is the coalescing, the gathering of energy, that represents the highest amount of threat." Security is a risk-reward trade-off. Anybody else here run into the "don't turn off the boss" problem? Nobody? Okay, cool — again, apologies for being a revivalist preacher. This is the problem where you see this wireless mobile device coming in to check webmail at 3:00 p.m. on a Friday, you turn it off, it's awesome — and about ten minutes later you get a phone call
because the boss was on the golf course, checking whatever his important email was as he was getting ready to do a deal with the guy he was on the sixth hole with, and all of a sudden you stopped the deal. Now, instead of being a hero, you're actually just going to get kicked in the teeth. That calculation along the risk-reward continuum is exactly what plays into using CVEs — for as good and bad as they are — as a representation of available technological risk, and mapping that back to résumés and the technical indicators you've got about the organization. There's no sense in hunting for, or trying to patch, things that don't exist in your world; there's only sense in trying to patch the vulnerable technologies that you actually have running and that are actually externally visible.
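As a toy illustration of that idea — the data shapes and file contents here are mine, not anything from the talk — a minimal Python sketch that keeps only the CVEs whose CPE identifiers match technologies actually present in your asset inventory:

```python
# Hypothetical inputs: cve_to_cpes maps CVE IDs to the CPE strings they affect
# (e.g., flattened out of the NVD feed); inventory is the set of CPE strings
# for technologies you actually run and can see from the outside.
cve_to_cpes = {
    "CVE-2017-0144": ["cpe:2.3:o:microsoft:windows_7",
                      "cpe:2.3:o:microsoft:windows_server_2008"],
    "CVE-2014-0160": ["cpe:2.3:a:openssl:openssl:1.0.1"],
}
inventory = {"cpe:2.3:a:openssl:openssl:1.0.1",
             "cpe:2.3:o:canonical:ubuntu_linux:16.04"}

# Keep only CVEs that touch technology we actually have running.
relevant = {
    cve: [c for c in cpes if c in inventory]
    for cve, cpes in cve_to_cpes.items()
    if any(c in inventory for c in cpes)
}
print(relevant)  # {'CVE-2014-0160': ['cpe:2.3:a:openssl:openssl:1.0.1']}
```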
Okay. So now that we've got an adventure to go on, we've set a couple of questions, and we've got a set of data to pull together, we can just search for something and we'll be done, right? This will be good. It doesn't work that way, and the reason it doesn't work that way is that every data set is different, just like a special snowflake. When you get a data set, you actually have to measure it. You have to understand the diurnal patterns in it, if there are any. You have to understand the weekday/weekend differences. You have to understand which holidays it keeps. You have to understand just how big it is on an hourly basis. All of these things together inform the bias we talked about two slides ago: you need to make sure the biases you've got in your head are the same as the biases present in the data sets. I was chatting with a team I'm friends with. They had just turned on DNS monitoring, and they were getting all of these DNS-tunneling alerts off of this appliance they'd plugged in. They were all getting ready and getting excited, because it was reporting critical vulnerabilities in big piles — 30,000 over about four days. Then they actually bothered to talk to the guy who helped them plug it in, and he said, "oh yeah, that's because of the way you've got redirection set up in your network." They didn't look at the alerts before they started getting ready to call the cops. They didn't perform sense-making on their data sets. We also call this exploratory data analysis, and it's the one case where it is completely okay to ask wacky questions. You want to know the foibles of your data: how tall is it, what's its favorite tea time, all of that craziness — because then you know which questions you can actually ask of it. If you don't go through that stage, at best you're getting lucky, and at worst you're reporting noise as signal and wasting people's time.
Okay, so we've got a question, we've got some data, and we're getting ready to sense-make it — but we need the right tools, both for the sense-making and for answering the question we set at the beginning. These tools are different, and the tools depend on the size of your data. Anybody here good on sparse matrices? Cool, okay. The reason this matters is that six degrees of separation is very real. There's this thing called Zipf's law, and it's present in almost every element of natural behavior: the vast majority of human activities — or cell pairs in strains of bacteria, et cetera — are dominantly the same, and the long tail is where you get a lot of
the very useful information — Shannon entropy. Okay, so why does this matter? It matters because joining the data sets — bringing the right data sets together to answer your question — becomes a computationally difficult problem. Dense graphs, ones that are not Zipfian (negative power law), are a well-filled-in matrix: they're well represented by ones and zeros in a table. Sparse graphs, which are called small-world graphs or negative-power-law graphs, are what we find in LinkedIn, on your network, in the way your clients and servers interact with each other, the way my clients and servers interact with each other, the way you interact with the internet. The Alexa one million is interesting
because after you get out of the top few, you're getting into really weird behaviors that are highly discriminating. Technically, the way to do that merger is not through SQL, because SQL is built to do financial analysis: the key tool we all use started as a mainframe program so that actuaries and accountants could add up sums and do arithmetic. Those numbers and their patterns have no bearing on the underlying mathematics and organization of human behavior, so the fundamental computation SQL is optimized for actually slows down the exploration process — everybody here has tried to do a left join and then pounded their head on the table because it just doesn't work. And above about three joins, you're looking at a hard problem: matching across multiple data sets without a primary key is a pattern match, and when you transform that into the graph world it becomes NP-hard — it's called subgraph isomorphism. The way you cheat on that is that you keep everything in memory and you represent everything as a graph. Instead of representing things as matrices, you start representing things as ordered lists, and you can merge ordered lists in linear time and find the joins as you walk through them. Being in memory is a hundred times faster than being on disk; this explains the difference between Spark and Hadoop as computing platforms.
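To make the list-merge point concrete — my own toy illustration, not code from the talk — here's the classic sorted-merge walk: given two data sets keyed on the same field and already sorted, you find the join pairs in a single linear pass:

```python
def merge_join(left, right):
    """Walk two lists of (key, value) pairs, each sorted by key with unique
    keys, and yield the matches in O(len(left) + len(right))."""
    i = j = 0
    while i < len(left) and j < len(right):
        lk, lv = left[i]
        rk, rv = right[j]
        if lk == rk:
            yield lk, lv, rv
            i += 1
            j += 1
        elif lk < rk:
            i += 1   # left key is behind; advance it
        else:
            j += 1   # right key is behind; advance it

# Toy example: flow records keyed by IP merged against an enrichment table.
flows = [("10.0.0.1", "dns"), ("10.0.0.2", "ssh"), ("10.0.0.9", "http")]
asns = [("10.0.0.1", "AS64500"), ("10.0.0.9", "AS64501")]
print(list(merge_join(flows, asns)))
# [('10.0.0.1', 'dns', 'AS64500'), ('10.0.0.9', 'http', 'AS64501')]
```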
So when we combine the difference in representational model with just using RAM like it's going out of style, we get a lot more performance, and that's kind of fun. That performance becomes really important, because the most expensive thing we've got is all of our time. There's a reason B-Sides is on Saturday and Sunday: because we couldn't come otherwise. I was in two cities this week and my wife still doesn't like me. We travel too much, we work too much, and we're using our spare free time to do this, because our time is too expensive. So we shouldn't use platforms for this exploration that are not designed to be most efficient with our time. One of the design principles we've learned is that the clock time to prototype an idea is the most important thing, because if you've only got 20 or 30 minutes to analyze a problem, that's the amount of time you have to discover something, to learn something, to identify something where you say, "hey, that's kind of cool — I should give this to my other analyst who's got another 30 minutes, so I can keep the ball rolling." And you can make technical choices, in engine type and in data representation, that make putting in new data sets really easy. In any database we've got a trade-off between density and
flexibility. You can hand-write C++ representations of something like NetFlow, and if you work at it kind of hard and you're a little smart about it, you can get it down to about 24 bytes. That's awesome, but it can only take NetFlow, and if you want to merge it against a different data set, it's going to take three to ten days of a really good programmer writing a completely new data structure and working out the merge. If you're willing to sacrifice and spend 500 to 700 bytes per NetFlow record, you can put it into an XML-like format that gets you a lot of flexibility and makes the merges easier.
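For a feel of the dense end of that trade-off — a hedged sketch, since the real hand-tuned C++ layouts vary — Python's struct module can pack a minimal flow record into exactly 24 bytes:

```python
import struct

# src IP (4) + dst IP (4) + src port (2) + dst port (2) +
# packets (4) + bytes (4) + start time (4) = 24 bytes, and nothing else fits.
FLOW = struct.Struct("!4s4sHHIII")
rec = FLOW.pack(b"\x0a\x00\x00\x01", b"\x08\x08\x08\x08",
                51512, 53, 2, 180, 1500000000)
print(FLOW.size, len(rec))  # 24 24
```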
And the reason you want to make the merges easier is that bias is good: bias tells us which data sets we should put together, and how we should down-project the data from multiple dimensions into two. In this form — what's called a joint distribution — you've got one feature on this axis, one feature coming up this way, and an X-Y plot of all the data in those two dimensions. If you choose your dimensions well, this really cool thing happens that makes you feel really dumb: you've got these outliers over here where you say, "what the heck is that?", and then you start the exploration process over again on that cluster.
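A minimal sketch of that down-projection, with made-up feature names (flow duration versus bytes per flow) standing in for whatever two dimensions your bias suggests:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Bulk of traffic: one well-behaved cluster in (duration, bytes) space.
duration = rng.normal(2.0, 0.5, 5000)
bytes_per_flow = rng.normal(1500, 300, 5000)
# A handful of odd flows: long-lived and tiny -- the "what the heck is that?" corner.
duration = np.append(duration, rng.normal(30.0, 2.0, 20))
bytes_per_flow = np.append(bytes_per_flow, rng.normal(80, 10, 20))

plt.scatter(duration, bytes_per_flow, s=4, alpha=0.3)
plt.xlabel("flow duration (s)")
plt.ylabel("bytes per flow")
plt.title("Joint distribution of two chosen features")
plt.show()
```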
And the speed of putting data together so you can ask that question, get that representation, and start that cycle again is what we call time to second solution: I've done it once, let's do it again — hey, I had this cool idea, can I do it over again? And then the key metric is how easy it is for us to spin up more people. I don't know about where you guys work; where we work, we never have enough people. Right. And they said, "oh, well, just make more of you." Okay, no: (a) nobody wants more of me, and (b) if you take the crucible that got me here, you could never recreate it. I started in '97-'98, and the world's a lot
different than it was 20 years ago. The speeds are different now; these things called cell phones actually do a lot more than they did. All we can do is get smart people and put them into an environment where they can go through some of the growth process we did, hopefully with fewer unproductive side trips. There's millipede development and there's grasshopper development. (I'll make a martial-arts reference in a second, but this has nothing to do with grasshoppers.) Grasshopper development is: "hey guys, we've got 40 million dollars, we're going to do this great thing, we're going to be so awesome at cyber in like three years." Okay, so you take this big leap, and you jump off the stage, and you've got
Superman legs, and my son is so proud of me — and I misjudged: I hit the exit sign over there, I didn't go through the door, I kind of got a bloody nose, I wasted my superpower, and I didn't catch any bad guys. Or I can walk off the stage. The agile cycle doesn't just feel good, and scrum master isn't just a thing we put on our résumés; it is actually a better way of doing analysis in this experimental world. When you can actually ask questions and get answers back in seconds to minutes, you can mature your way of thinking and accelerate that process of growing people by forcing them to make decisions. Okay, so I'm going to ask another question:
anybody here got an MBA? Wow, okay. My wife does. She's super cool; she makes me look like I'm really flighty. What she taught me going through her MBA — besides the fact that it's a 40-hour-a-week job on top of your 40-hour job — is that it's a hundred business use cases, a hundred small decisions, a hundred examples of "choose something now." What they're trying to do through the use-case model is the same thing we all do: here's a problem, try to solve it. "No, you're wrong." They're driving bias into people. There are some MBA programs where you stop and do it for a year and a half, and that's all you
do, and that's fine. There are some programs where you do it over 17 or 18 months, evenings and weekends. It's about the compression of decision-making cycles: to force you to learn, to force you to accurately sample the underlying feature space — which, in business, is good and bad decisions. We've got exactly the same problem: how do we get new people on the team, and how do we make them productive members of society? The trade-off for us is that we cheat, two ways. Exactly — that's a good snicker. We cheat because we give them languages that are very easy for them to use, because that's what we want: we want them interacting with the data and accelerating that
growth process. So we give them things that are either SQL or look very much like SQL. You'll say, "hey Eric, you just told me SQL is bad, because it's got underlying table representations that are really dense and inappropriate for cyber." Yeah — but there's a SQL for graphs. It's called SPARQL. It's an open W3C standard that came out of the Semantic Web effort. I don't know if the rest of you noticed, but we're still not in the Semantic Web nirvana we were promised; SPARQL is the one thing that hung around after that. SPARQL processes RDF and does pattern matching natively. Well, CVE and CPE already come in RDF, and DBpedia already comes in RDF, so you can
get all of DBpedia, either through a web API — there's this thing called a RESTful SPARQL endpoint that you can send SPARQL to and get an answer back — which means you can use anybody's SPARQL/RDF engine, small ones or really big ones, depending on how much data you have, because you did your sense-making. So now you have a way for your analysts to do native merges of their data sets using a language like SQL. And because it's RESTful, we can embed it in Jupyter notebooks.
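As a hedged sketch of what hitting a RESTful SPARQL endpoint looks like — the DBpedia endpoint and this trivial label query are just for illustration:

```python
import requests

# Any SPARQL endpoint works roughly the same way: HTTP GET with a 'query' parameter.
endpoint = "https://dbpedia.org/sparql"
query = """
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Common_Vulnerabilities_and_Exposures>
      rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
"""
resp = requests.get(
    endpoint,
    params={"query": query, "format": "application/sparql-results+json"},
    timeout=30,
)
for row in resp.json()["results"]["bindings"]:
    print(row["label"]["value"])
```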
This is really, really handy, because it helps us crystallize people's thinking as we're growing them up in that forge, as we're helping them mature. By giving somebody a Jupyter notebook with an analytic workflow — how do you actually evaluate this type of IDS alert? well, you drop it in here, you get some not-really-attractive but highly information-dense tables back, you look at them, you find the break point, you put it in, you keep running — you ensure that the only thing they can screw up is making the important decision. They don't screw up where the equals signs are; they don't screw up the query itself. Jupyter notebooks become this handy thing that you can hand around, that you can put under source-code control, that you can use to actually instantiate knowledge into people. Which means that, since we've got
Python in the Jupyter notebooks, our data scientists can start to go crazy without actually bothering people with words like "nonparametric goodness of fit" or "Zipfian" — you guys are really nice to listen to me talk about negative-power-law stuff. So you can jam these teams together, where each of them knows how to contribute to a Jupyter notebook, and sit around over a beer and say, "this is what this means," so we can answer one of those questions we decided on at the beginning. Okay. So I said trillions, right? You guys are all waiting for the big payoff pitch, so I'm going to let you marinate on this number. I've got a friend who's got
a network with about a quarter of a million things on it, who rebuilt their layer 3 so they could get full east-west and north-south NetFlow. That's kind of cool. They average about a quarter of a million NetFlow records a second; you can do the sums — seven trillion a year is a lot. The good news is they're collecting it because they know they have questions that need answering; they just don't know how to answer them yet.
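The sums, for the record:

```python
records_per_second = 250_000
per_year = records_per_second * 60 * 60 * 24 * 365
print(f"{per_year:.2e}")  # ~7.9e12 -- call it seven-to-eight trillion NetFlow records a year
```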
And that's a scale that makes me a little nervous — and I'm the crazy guy who's been spending my life doing HPC. Scale comes in density, in just transactional volume, and that's about 95-odd percent of your record set when you start looking at a corpus. Your credit card transactions, for instance, are just transactions: they're highly structured, they come in repeatedly, over and over and over again, and you know exactly what they're going to look like. About four months ago I got a phone call from American Express at 10:30 at night. Nothing good ever happens when American Express is calling you. I know Wells Fargo calls me all the time to tell me about my credit card debt — I don't bank
with Wells Fargo, but I do actually bank with American Express. So they called me and said, "guess what, you're shopping at Walmart.com," and I said, "I'm going to bed," because I legally know I'm out $0 on this one. The next morning, the folks who had gotten their hands on my number had gone to HotTopic.com, to Walmart.com, and then to some no-name gas station, and run up $85 with a cloned card. Cool. The way American Express found that had nothing to do with the transactions themselves, necessarily. I don't reliably go to sleep at 9:30 every night — just most nights — so they can't just use time
of day. What they use is what we call enrichment data sets: all of the stuff that brings context to the table. It's what we use to reason about whether this NetFlow record is okay, sort-of okay, awful, or "I should pull the plug on the network." We do the same thing with our credit card statements. So when you're doing analysis at this scale, you've got 95% of your data set as transactions and 5% as everything else — the context you bring in with the bias that helps you answer the questions. So, I've got a handy triangle here, and this is my answer to everybody who says, "oh, we should just use machine learning —"
"it'll predict the bad guys, and then we're done." Did you guys hear that too? Fantastic, we found an answer. Okay — it's an utter fallacy, because there's published research, most notably by Robin Sommer and Vern Paxson in 2010, showing that outlier and anomaly detection doesn't work this way: the false-positive and false-negative rates are too high. Why is that, Eric? Well, I'll tell you. The reason is that the multivariate space is so large — there are so many dimensions to an IDS alert — and machine learning works by densely sampling the underlying behavior in each of those dimensions. There's a continuous curve (or a discontinuous one), with pockets of good and evil all
through this 25-dimensional space that looks like Swiss cheese, and you have to have enough examples of good or bad to actually identify, with discrete points, where the good and the bad are. If you're sufficiently under-sampled, you don't even know what some of the dimensions are. There's no way machine learning is going to get anywhere close to doing anything but getting you fired in about three days. The first day, you're going to be like, "no, no, boss, I've got this, I just set the knobs wrong." After the second day, you're going to look a lot worse, because you didn't sleep overnight, because you were nervous about having used machine learning improperly the first day. And by the
third day, you'll just have handed in your papers, because you shut off the guy on the golf course too many times. So you have to start at the bottom. Yes, you start with a trillion records first, and then you search them — and for some people that's grep, and that's okay, if it's fast enough to help you answer the business problem we set out at the beginning. If you can't search the data, just delete it, unless you need to keep it for a retention requirement. And if the only reason you're keeping it is retention and you cannot search it, just be super up-front with your OGC and your legal folks, so that when they come to you and say, "do we have
evidence of blah?", you say, "I don't know; the data is unsearchable." I have a team I'm friends with who said, "we had an event." Okay — events are bad. "And for this event, we've got the data that can tell us about it." Okay, that's a step in the right direction: we know we have the data. Then they said, "it's taking us a month to get a return off of one query." Okay, so the Earth's crust is going to cool before you actually analyze your event, which means that against the 178 days or whatever it is in the Verizon report, you guys are well behind. They can't search it, and I told them so. Context is that
handy thing Google gives you in that little snippet; it's the stuff we carry around in our heads, so that when we look at port 23 we know we don't want that on our network, and port 21 is worse. It's summarizing the data in a way that is informative to our analysis process. Once you can do searches, you start pre-thinking about the next search you want to do, and building those tables ahead of time is the contextual piece. And once you start doing the context well enough, then we can do outlier detection, which lets us say, "hey, wait — that doesn't make sense": we understand the context well enough to highlight this automatically and show it to people. And then
you can do change detection: my diurnal pattern changed, and I know it because I had a good enough definition; I know the number of mobile devices on the network fundamentally changed, because it went from here over to there and then stayed there. There's this thing called a threshold random walk, which is also known as sequential hypothesis testing. Vern Paxson — whom I've now referenced twice, so you can tell I like him — wrote a really nice piece of math around using sequential hypothesis testing to identify port scanning, based on the relative frequency of successful and unsuccessful TCP connection attempts in your network. Well, the good news is that this same math and this same
process works for any event stream process and as one as soon as you can put good evil bounds around how frequently should this behavior be happening if it's not happening at that frequency start counting because all a sequential hypothesis test is under the covers is a number of steps one direction or another if you make the number of expected steps you should have about four or five in one direction hey I'm benign this is expected behavior cool well being paranoid cynical grown-ups that we are that just means we start writing the test again because the next five or six events could be the the ones that show us that something bad has happened conversely if you have five or six steps
in the other, wrong direction (five and six are actually derived in the paper), then you know something bad is happening, and you keep watching it to make sure it stays evil while you call the cops, get paged at two or three a.m., et cetera. This is a very simple nonparametric method — meaning we don't actually need a model; we can use past behaviors to set the good and bad thresholds and then just run it. It gets a little fancier than that, but that's about all it is: you just need enough statistics to do it. Cool — that's how you do change detection.
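A minimal sketch of that threshold random walk, in the spirit of the Jung/Paxson scan detector — my simplification, with illustrative probabilities rather than the paper's: each failed connection attempt pushes a likelihood ratio toward "scanner," each success pushes it toward "benign," and you decide when it crosses either bound.

```python
import math

# Assumed per-attempt success probabilities under the two hypotheses
# (illustrative values, not the paper's).
P_SUCCESS_BENIGN = 0.8   # benign hosts mostly connect to things that exist
P_SUCCESS_SCANNER = 0.2  # scanners mostly hit closed ports / dark space

# Decision thresholds from desired error rates (standard SPRT bounds).
alpha, beta = 0.01, 0.01               # false-positive / false-negative targets
UPPER = math.log((1 - beta) / alpha)   # cross above -> declare scanner
LOWER = math.log(beta / (1 - alpha))   # cross below -> declare benign

def classify(attempts):
    """attempts: iterable of booleans, True = connection succeeded."""
    llr = 0.0
    for ok in attempts:
        if ok:
            llr += math.log(P_SUCCESS_SCANNER / P_SUCCESS_BENIGN)
        else:
            llr += math.log((1 - P_SUCCESS_SCANNER) / (1 - P_SUCCESS_BENIGN))
        if llr >= UPPER:
            return "scanner"
        if llr <= LOWER:
            return "benign"  # paranoid grown-ups: reset and start the test again
    return "undecided"

print(classify([False] * 6))                      # scanner after a handful of failures
print(classify([True, True, True, True, True]))   # benign
```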
And once you've done all of that, you've got a well-defined analytic problem: you've got the data pulled together, and you understand the underlying behaviors with reasonable likelihood. Now you can do what's called classification, which is a fancy way to say, "based on this set of attributes about this data point, I think this is whether it's going to be good or evil" — and that's where you start getting your blinky red light. You can't do that for everything on your network, for all behavior, because the underlying feature space is so big. You can do it for things you understand well enough that you can give the system an expectation of thumbs-up or thumbs-down, to use a rating-system analogy. And this gets into one of my rants on AI. AI is traditionally used
for fast things: is this a bad DNS query, is this a bad host name, is this login good or bad? Cool — that's a good application, but it fundamentally won't get us there. It'll do some automation, and it'll catch some of the underlying bad guys, but it won't help us have intuition, or use the computer as a starting point for intuition. And this is where I believe AI is most applicable to the field we're all in, because AI can do large amounts of unsupervised machine learning. I just threw a new term at you. The difference between supervised and unsupervised machine learning: in supervised machine learning, I have a really good label space — I know if this
is good or evil, and I've accurately identified my twenty-five dimensions. Unsupervised is: "hey, where are the hills and the valleys, where are the clusters?" By being able to express your underlying data set in that way, you can then start saying, "do I believe these clusters should exist?" Watson playing Jeopardy! is an example of what I call slow-speed AI: it was trained to ask questions, and it had a whole bunch of data. But here's where we can use large amounts of machine learning and larger piles of data — you know, at the trillions level — to ask: did new connections show up? Did something new and important bubble up? Is there a change in the underlying bias that I can highlight?
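As a hedged illustration of "where are the hills and the valleys" — scikit-learn's KMeans on two made-up per-host features, just to show the unsupervised shape of the question (no labels go in; clusters come out, and a human decides whether they should exist):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Two behavioral features per host, e.g. (connections/hour, unique peers) -- invented.
normal = rng.normal([50, 10], [10, 3], size=(500, 2))
odd = rng.normal([400, 200], [20, 10], size=(8, 2))  # a small, far-away hill
hosts = np.vstack([normal, odd])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(hosts)
for label in set(km.labels_):
    members = hosts[km.labels_ == label]
    print(f"cluster {label}: {len(members)} hosts, "
          f"centroid {members.mean(axis=0).round(1)}")
# The analyst's question is then: do I believe this tiny cluster should exist?
```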
So in those two hours on Friday afternoon — when, depending on your organization, you're having a beer or thinking about having a beer, and you've decided your to-do list is done enough for the week, and you actually unclench your brain — you start having these really cool ideas. For me they're depressing, because I'm going to mostly forget them by Monday, or I know we won't have time to do them. But in that time, we should be looking at the weather on our network and how it's fundamentally changing. I'm a big hurricane nerd; I'm also a snowstorm nerd. NOAA's National Hurricane Center
website is the coolest thing ever. We should have the same thing for our networks. One of the dominant questions I get asked is "how many connections do I have?" — like, how many physical connections leaving my network? You don't know? Is anybody here willing to say with certainty that they know how many they've got? Right, nobody puts their hand up — because it's your network, and you still don't know. I've got one client with 47. They're global; the only thing they know is that they don't have coverage of them all. That's a problem. How do those look? How are they changing? Being able to understand the underlying behaviors, so that you know what a
bump in the noise looks like — that's what we should be using prediction for, not saying "this DNS thing is more likely to bite me in the tush in about three days." Okay, I gave you some bullet points to think about. It mostly comes down to really good fundamentals. I don't know if you've tried to scp a terabyte of data around. Right, okay, exactly: it's easier to fly a pimply-faced youth — for all of you who read The Register a long time ago — put them on a plane, send them down with a hard drive, and get them to bring it back. Yeah, we actually do that, because it's easier. Why? Because data is heavy. You're better
off making fundamentally good decisions, once or twice, about where you put it and the tools you put over it, using some of my considerations on the last slide, than saying, "okay, well, the first thing we're going to do is build an awesome data lake, and it's going to have all these channels, and it's going to be like the Netherlands, and it's got this great technology, and we're set." Maybe — if you've got farmers like the Netherlands does. Make sure you're using the right representation for the stuff, so that your team can ask the questions. Because otherwise we're back down below search: we've got data that we're paying EC2 money
for — or whatever the other cloud service is — that you don't know what to do with and physically can't do anything with. Remember what the question is that you're asking. I am guilty of nerding out, in case you guys haven't figured that out by now; this means I think these things and data are really, really cool. You guys do too, otherwise you wouldn't be spending your Saturday walking around here listening to mostly interesting people talk. We all want to go down rabbit holes; that doesn't answer the underlying security questions. We use a lot of partial summarization, called question-focused data sets, primarily to pre-summarize the data and use bulk compute for what it's really good at: turning lots of those
transactions into something slightly smaller that's easier to handle. Okay, that's pretty cool. And we would get to a point where one of our guys would run something and reduce a trillion down to a million, and he'd put two thumbs up and say, "I won — and now I'm going to pull it off of that big beefy server, bring it down to my laptop, and run a million-row Excel spreadsheet on my laptop." We're like: you just gave away 128 gigabytes of RAM, you're so stupid — and I still love him. Don't move to Excel until you're ready. Change all of your thought processes around where the most
computationally efficient place to answer the question is. And that building-block approach is really important. It's really easy to pull a bunch of data together, make it into a pile, make a sandcastle, win the sandcastle competition — and then the tide comes in, and you've got to build another one. Did you actually automate the way you built the last data set, so that the next one is exactly the same and all of your sandcastle-building tooling still works? Usually not. That's okay; we all hack it up. Try to force as much rigor on yourself as you can when you're building your subsets of data, to get the benefit of learning from the
first time. And then plan for asking the next question — because as soon as a prediction fires, you're going to start searching immediately, since now you've got a new thing to ask questions about. The reason I've got this brace down here on context and search is that's where about 90% of your questions get asked and answered. They're small questions; they're generally not big, computationally complex questions. They may take a long time to answer because of the scale of the data and the nature of the engine you put them in, but you're going to ask lots of them. You don't ask very many big prediction questions, but that's where you spend most of your computational time. Don't expect that FLOPS put in are directly
equivalent — or sometimes even proportional — to the analytic value you get out. I'm working with some friends, and they've got an event — you know, that's exciting — and we were given about 85,000 IP addresses and asked, "what can you tell us about these?" We said: well, we could actively scan them, if we can get permission from OGC and legal; we could go to Shodan or Zmap or something similar, because that's fun; or we could just ask our passive DNS and reverse DNS resolvers. So we did. As dumb as it sounds, we looked up 85,000 IPs, built a spreadsheet, and handed it off to the guys doing the day-to-day triage, and they think it's
amazing. Why? Because they don't have to go to another tab in the web browser; they've got one place to ask a question. Good. So what are some of the things that my jumping up and down and waving actually helps find? We don't do signature-based stuff; there are lots of tools that are really, really good at that, and the technology for doing pattern matching that way is well defined. We hunt the invariants in the behaviors, and then depend on the enrichment from the additional data sets coming in to give us the context that helps us find the things worth people spending time on. We also try to focus on the things that are most
important. We don't focus on infections coming in at the beginning, because you've got lots of technology that does that. We spend the majority of our time focused on what's gotten through — what has gotten all the way through your stack and now looks like it's doing something like beaconing, or looks like it's starting to exfil. An exfil and an upload into a cloud hosting service are identical; the only differences are the scale, who we think might be doing it, what time of day it is — a number of things that indicate it violates expectations and is worth a person taking a look at. We keep people in the loop on 99% of our
analyses, because we know that we don't know what we're really looking for — but we can reduce a trillion records down to few enough that somebody can take a look and make actionable decisions. This interactive traffic over web ports is some of my favorite stuff, actually, just because it highlights that the more people try to be sneaky, the more obvious they are. There was a paper that came out a year or so ago on detecting SSH brute-forcing, and it's cool: they showed there's a piece of fast machine learning that does this — there's math that does it, and it's now in Bro, actually — that will detect SSH brute-forcing just
from the nature of the connections, their speed, and the number of packets in them. You can determine empirically how many packets there can possibly be if you go through all of the authentication methods and get your password wrong three times, or wrong twice and right the third time. If you have a session that's longer than that, it's established, and somebody dropped to a prompt somewhere or is doing an SCP. Cool. Once we can identify those, we have elements of the packet stream that are directly indicative of whether it's interactive or a bulk upload or download. At that point you've got an interactivity detector: it detects things that may or may not have
PSH set on the packets but are not acting like web or any other bulk thing. Now, there are lots of things that should be running on your network, and there are lots of interactive things that should be running on it and going in and out of it — indeed, we all use TLS-encrypted things to talk to our buddies outside our network. And this is where that enrichment data comes in: as soon as we can identify the interactivity and whitelist out a number of those things, the question becomes "what else is running that shouldn't be?" It allows us to say, "this is normalcy for
interactivity; what's the weird stuff?" — which is what we get excited about, and where the really evil bad guys are.
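A toy version of that session heuristic — the threshold and record fields are invented for illustration, not the paper's or Bro's:

```python
# Empirically, a full SSH authentication exchange (banner, key exchange,
# a few password attempts) fits in a bounded number of packets. Anything
# well past that bound is an established session: someone got a prompt,
# or a bulk transfer like SCP is running.
MAX_AUTH_ONLY_PACKETS = 50  # illustrative bound, not a measured one

def classify_ssh_session(packet_count, total_bytes):
    if packet_count <= MAX_AUTH_ONLY_PACKETS:
        return "auth-only (failed or aborted login)"
    # Past the bound: interactive sessions are chatty with small payloads;
    # bulk transfers move big payloads per packet.
    avg_payload = total_bytes / packet_count
    return "bulk transfer" if avg_payload > 800 else "interactive"

print(classify_ssh_session(30, 9_000))         # auth-only
print(classify_ssh_session(400, 60_000))       # interactive (~150 B/pkt)
print(classify_ssh_session(5_000, 7_000_000))  # bulk transfer (~1400 B/pkt)
```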
So we've still got a data-merge problem. The hardest thing we've got is not a math problem — I mean, you and I can have a chat over a beer and I can give you the three references with all of the math you could possibly want to go implement. We've got a visibility problem. I've got one set of buddies who have good coverage and good visibility; I know a whole bunch who don't even have the metadata to be able to answer the question of what happened. These techniques are similar to ones I used on a bunch of incomplete packet captures several years ago. Something happened, somebody gave us a call and said, "here
are 1,600 packet captures — what can you find?" And we found lots of interesting things doing exploration. We found that the sysadmin was a high roller in Vegas, which was kind of concerning considering what he was being paid. And we found that because the website he was using to book his hotels sent cleartext passwords, and we saw them in the POSTs — that was kind of something. And then we found other evidence of beaconing that wasn't supposed to be there. We did that on 1,600 captures — nowhere near a complete set. And this is where I try to leave you with a moment of restorative hope. I've just
talked with you about a bunch of theory. It's practical theory, but it's not something you go buy in a box; it's a way of thinking. I told you I'd give you a martial-arts reference, and here it is. I've been training for about ten years. It's annoying: I don't want to get up in the morning, I go to the dojo every day, I have really good people kick my ass, I go to work — and yet I still keep doing it. One, because I'm incrementally getting better. Two, I look good in a hakama — let's not lie. And three, the dojo is a forge, and it is a reflection of what I bring to the mat, and I am making myself better.
I'm refining myself, and the way that I think, through getting thrown around and trying to hit people with swords, et cetera. It's all mental. So these principles we've talked about, which are hard, can serve to guide us in that maturation process as we bring new members onto the team, to help them grow and mature — so that when we're doing analysis, it's like being at the dojo, where we make sure we execute correctly, so we get better at analysis. We're paranoid and we're cynical because they are out to get us. We're also paranoid and cynical because that's the way we've learned we have to be with the data sources we have,
and because the threat behaviors are well ahead of the things we have tools to see. The only way we bridge that gap is by training ourselves to ask better questions and to perform better all-source analysis, very rigorously. I've got a buddy who works for a big software firm that gets crash reports in, and crash reports are really, really cool. Okay, about 60% of them are not cool — they're video drivers crashing, fundamentally uninteresting. And I was that guy once. I'm going to date myself: I played Unreal in college, and I wanted the two cool video cards with the cable between them so they'd do the
interlacing. Yeah, I was never that good — exactly right, the NVIDIA one — okay, yeah, I'm old. The point of this is — I'm really sorry, I lost it; it'll come back to me when I remember it. Oh yeah, the other two things: plan for operationalizing what you find. If you find something really cool, it doesn't matter unless you can actually change behavior on the network. Oh, okay, so I remembered what I was saying about the software guy — it just took me a second. So then they find other fun things happening in these crash reports, and for them the fun part is looking
at the piece of malware that's only ever seen once or twice — I'll let you hypothesize why that might be the case. So the guy was telling a story much like I am now (he had much better graphics; I only had one), and he's saying: "okay, we kept our eyes on the open-source research, and we saw, about six months ahead, that the open-source research was getting close to finding this particular zero-day." It had been used to hack a large organization for significant impact; before that, it had been used once, three years earlier, the entire way around the world. They used the crash report to say, "hey, this happened once," and then they
were able to do some fancy new attribution. Why did I bother telling you this story? Well, it's kind of cool, right? It's fun to tell stories about catching bad guys. But also: every piece of data we use, we need to think about with that analytic mindset, that frame of mind — be it trying to figure out whether my ten-year-old is sneaking Minecraft again (turns out we've got to hide the PS3 regularly, and by regularly I mean every two days), or a trillion NetFlow records. Okay, back to the slide: operationalize your discoveries. I've described using large piles of RAM and letting data slam into itself to discover things. This is called interactive, you
know, analysis — and in the Spark stack, this is one of the three types of engines they build (they tend to put GraphX in there). The next phase of developing an analytic is something that MapReduce does really, really well: test it. You've got a trillion records — how many times does the behavior occur? Is it statistically stable? Did you find a behavior that's repeatable, that you're seeing happen over and over again? This is a bulk process, where you won't care whether it takes a minute, an hour, or a weekend — though you'll be a little annoyed if it takes a month or so. Interactivity is not required there. And then you want to put it into an event-stream processor:
Spark Streaming, Storm, a number of commercially available complex event processors that look specifically for this behavior at line rate. As dumb as it sounds, 40 gigabits a second isn't that fast if you know exactly the behavior you're looking for. There are only, what, four billion things on the internet — IPv6 aside — which means that with a sufficiently large pile of RAM and a very carefully written C++ representation, you could have a single sensor keep track of a set of behaviors for the entire internet. Now, what I just said is not free to do, right? But it's possible. If you discover a behavior and you test it, you're no longer just sitting on
40 gigs a second on your uplinks: now you've got something to go look for. Plan for that cycle. My buddies with the seven trillion records a year planned, knowing they were going to need to answer these questions. They know they don't have the stuff in place right now to do it, and that's okay: they've got the data that supports search, and that's letting them handle the bottom set of analytic problems, so the quick wins are happening, so the decision-makers realize that this story we've been telling actually comes in three acts and it's going to be a great blockbuster and I'm going to cast the right people in it — all those nice things.
But if you can't get through the first act, nobody's going to let you record the second one, so plan for the whole thing, so that when you discover something, you're ready to operationalize it. And this last thing: use time to your advantage. This is a particular hobbyhorse of mine. Whatever is happening to you — with the possible exception of some fantastic ransomware — is not instantaneous. Everything has a wait phase in it, and there are elements of that wait phase that tell you about the beaconing periods, or the callback periods, that say when to go active. In queueing theory we call this an Erlang; an Erlang tends to be the length of the interaction. When you call up on the phone — because we
still call on the phone to solve whatever problem the airline created for us — they use Erlangs to figure out how many operators they need to have, based on calls and volume. Well, if you call the Erlang the time period that the person who put the malware on your network wants as the responsiveness of that activity, then they have to sample at half the Erlang to be able to pick up commands and be responsive in that time. Which means there are now outer bounds on how frequently the beaconing or phone-home behavior has to be happening to sustain the ongoing underlying effect. Cool.
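A back-of-the-envelope version of that bound — my own numbers, just to make the sampling argument concrete: if an implant has to act within some responsiveness window, it must check in at least every half-window, which caps how stealthy its beaconing can be.

```python
# If the operator needs the implant to react within `responsiveness_days`,
# the implant must phone home at least every responsiveness_days / 2
# (sample at half the interaction period to catch anything within it).
def max_beacon_period_days(responsiveness_days):
    return responsiveness_days / 2

print(max_beacon_period_days(7))    # react within a week  -> beacon at least every 3.5 days
print(max_beacon_period_days(180))  # six-month window     -> once every ~3 months suffices
```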
That means they can't be super stealthy and only phone home once every three months — unless the impact they want to have has a six-month window on it. That gives us a better fighting chance, more stability, and a more rapid start at finding clusters of things and saying, "hey, what's that?" Now, we've got plenty of math, and we've got lots of cool computers — EC2 and the other cloud service providers are making things that cost money, sure, but you can get a computer by clicking your mouse a couple of times. What we don't have is enough people who think about the problems and ask, "what am I actually trying to answer?" "I've got all the IDS alerts." Okay — why?
"I'm running this big blah-blah-blah instance." Okay, how is that furthering your security goal? If you want to get a hold of me, I'm local; feel free to email or call. Email is better, just because, contrary to popular belief, I don't like talking that much. I'm going to stop now. You guys got any questions? Yes, sir?
Yes — where I started, and what was really helpful for me, was study. (And by the way, thank you, everybody — I know I'm the last thing before lunch, and I really appreciate you listening.) [Applause] So I started by studying what small-sample-size statistics is — a really smart statistician told me that's about 13 to 15 elements in a data set. I studied something called bootstrapping; bootstrapping is a way to take a small-sample-size data set and, in effect, estimate it larger, so you can do more things with it when you don't have more than 15.
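A hedged sketch of the bootstrapping idea mentioned here — resample a tiny data set with replacement, many times, to estimate how uncertain a statistic is (the values are invented):

```python
import random

random.seed(0)
sample = [12, 15, 9, 22, 14, 11, 18, 13, 16, 10, 20, 14, 17]  # 13 observations

# Resample with replacement many times and look at the spread of the mean.
boot_means = sorted(
    sum(random.choices(sample, k=len(sample))) / len(sample)
    for _ in range(10_000)
)
lo, hi = boot_means[250], boot_means[9750]  # rough 95% interval
print(f"mean ~ {sum(sample)/len(sample):.1f}, 95% bootstrap CI ({lo:.1f}, {hi:.1f})")
```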
I actually looked at Numerical Recipes in C, because there are cogent explanations of things like chi-square similarity coefficients there. And then I read a bunch of Vern Paxson's papers. Vern is out of UC Berkeley, and he's been writing on this for 20 years; he's got a really cogent description of why parametric methods — where you calculate some actual likelihoods — don't hold here. Bro is also Vern's baby, and we use Bro heavily, because it runs up to 40 gigabits a second given the right conditions. It's a finite-state machine under the covers, which means you can write arbitrary metadata extractors for it. It will give you layer-5, NetFlow-like data natively, and because it's a finite-state machine it does a
lot of really fancy behavioral-analytics stuff and gives you great signal for doing these sorts of analyses. And if you want to give me a call, I love just getting together and talking about nerdy math. More questions? Yes, sir?
No — what I'm saying is: use machine learning very carefully. Know the question you're asking before you start running the machine — or Spark. There's a thing called a Turing machine, which is this intellectual construct that says this problem will eventually halt. I like to take it and say there are analytic Turing machines: analytic engines that possess sufficient mathematical operators — arithmetic operators, for all of my real crazy math homies in the crowd — such that you can compute any piece of math you need using the operators present in the engine. Spark is one of those, and MLlib is a great place to start. I
would encourage you to look at some of the DEF CON forensics puzzles as a place to get good data with known badness in it, so that you're doing machine learning over something interesting. Additionally, there's another interesting thing to think about: user agents are very descriptive. My iPhone, on average — and this was a few years ago — lets out 13 user agents over an hour; my MacBook does about six or seven. The reason my iPhone emits so many more is that every one of the apps has its own user agent. Now, using the apps themselves to do operating-system prediction — that's an interesting ML problem. Now you might say, "oh, I know how to do it."
Okay: I beat on it for six days in an environment where we had 1,300 devices on the network and 4,000 devices in the DHCP pool. That's an ugly problem. So, I love Spark — we use it extensively — but my perspective is: know the question you're asking before you just start jumping into ML. More questions? Okay, there's one other thing I have to say: do not file your taxes based on what I just said. Seriously, thank you very much; I've had a great time. [Applause]