
All right, good afternoon. Welcome to BSides Las Vegas, Ground Truth. This talk is "Attack Flow: From Data Points to Data Paths" by Gabriel Bassett. A few announcements before we begin. We'd like to thank our sponsors, especially our diamond sponsors, LastPass and Palo Alto Networks, and our gold sponsors, Amazon, PlexTrac, and BlueCat. It is their support, along with our other sponsors, donors, and volunteers, that makes this event possible. These talks are being streamed live, except of course in Underground, and as a courtesy to our speakers and audience we ask that you make sure your cell phones are set to silent. Questions will be at the end. If you have a question, use the audience microphone so YouTube can hear you. I am holding that mic; I will put it back on the stand in the middle of the room. As a reminder, the BSides LV photo policy prohibits taking pictures without the explicit permission of everyone in frame. All the talks are being recorded, again except in Underground, and will be available on YouTube in the future. Please keep your masks on at all times. If you want to hear better or see better, feel free to move closer to the center and front of the room, keeping social distancing in mind. With that, let's get started. Please welcome Gabriel Bassett.

It's good that we get the clapping in now, because you don't know what I'm going to say, and if it goes bad at least I've gotten one clap for the presentation. So I'm Gabriel Bassett. We're going to talk about attack flows. We're going to do a quick introduction, and then we're going to do this in the data-driven way. I'm going to show you a bunch of data, particularly data we use in information security, before attack flow: what it looks like now. Then we're going to walk through the process of taking data and turning it into attack flow, and then we're going to look at the data structured as attack flows. Then we'll wrap things up a little bit.

So who am I? This is me. This is my Twitter
handle. I have done a lot of graph things, and I want to take a second to talk about where that comes from, because I've been doing graph stuff for about 15 years now. It started when I was in the government. We were doing risk models, and this is way back. We came in to our executive and said, "Hey, we ran Nessus," and we showed the output of Nessus, all of the output of Nessus, and it was a little too much. She said, "Go back and find a better idea." So we came back again, and this time we grouped things. We said, "This high risk happened 100 times, this one happened 50 times," and she goes, "That's great. I don't know what that means. So what?" So okay, we go back to the drawing board. We come back with the five-by-five chart that has likelihood and impact on it, with red, yellow, and green areas, and we go, "Now we know this risk is right here."

The problem was we never could agree, because it wasn't just one person making this decision. I was representing the government, I had a contractor doing the testing, then there were the people that built the system and their government representative, and we all disagreed. The question was, why do we disagree? Why can't we all just get along? Well, it turned out that it depended on how you made assumptions. My tester would make an assumption like, "Hey, here's the server. The power button isn't covered by a door or anything. What if someone walked in and turned that off? This is a really important server. That's a high risk." Now, understandably, the other side, the government's contractor, said yes, but that server is inside a room that has its own access system and set of credentials. It's locked down to about 20 people, and that's inside this portion of the building that is locked down to only the people who can get access to it, which is inside of that half
of the building, which is locked down, which is inside the building, which has its own access, which is inside the fenced area of the base, which is inside the base. So really the threat over here is probably not what we're worried about. We're worried about the one out here.

See, the idea is there's a path here. The narrative for the threat outside the building was very different from the threat inside. So we realized, or I realized, hey, we need to be writing that down. So I did what everyone does when they first start to do this: they get an Excel spreadsheet and they ask, "Is the threat inside the building or not?" If you answer yes, it's high, or we add, you know, 10, and if it's no, we add 1. We come up with a bunch of questions like that, and then we add them up, and maybe we multiply by some random number we came up with because then it looks prettier, and then we say any number between this and this is a high risk, and these are the lows. And that doesn't work. I apologize if anyone in the room is actually still doing this. A lot of people do; it's not just you. But the problem is people already know what they want the risk to be, and after a few times using these kinds of tables they know how to get it out. They come up with some narrative that fits their mental model for what they want that risk to be: "I want it to be high, so I know I select this, this, and this to make it high." So now not only are you not getting the narrative you needed, you're getting some false narrative.

So I said, well, let's scrub that. We'll go back to the five-by-five, but when you put that dot on that five-by-five matrix, you're going to have to tell me why. You're going to have to write out a little paragraph that says, "I think the threat does this, and then this thing happens, and then this, and then..." And when I realized what we were writing out, this is where I started to get a little bit different from what we'd normally do today. And we realized that
what we were really documenting was this attack path: the attacker does this to this system and it has this effect, they do this and it has this effect. A sequence of things. So we started to build paths, and that was actually really cool, because once you start building paths you start asking this question: what does that mean in context? It's cool if you've got one. What if I have 10? What if my 10 all include the same thing? How can I combine those together?

I remember this moment really clearly. I'm grappling with this idea, it's the end of the day, everyone's kind of tired, and it's me and this one guy left. I'm walking past this guy and he's like, "Hey, I know you've been working on that problem. My company had this project a couple of years ago that involved these things called graphs." I'm like, "Cool, what's a graph?" Because I didn't know. He had no clue, but he understood enough to know that it might apply in my situation. So I go back home and start searching for it, and it turns out it's a really great solution. By the way, if you don't know what graphs are, we're going to talk about that in three slides.

So I go back and I build Bayesian inference networks in the back of an Excel spreadsheet, which sounds like a bad place to do that until you realize that I was in the government, and they don't let you install things in the government. They really don't let you install programming languages, because then you can run whatever you want. But they do let you write Visual Basic, because it comes with Microsoft Office. So that's where I built this.

I left the government after a while. I got a patent around graphs, I wrote a bunch of blogs, I published some stuff, I've done a couple of talks, and that brings us to today. It's been clear for a long time that atomic information, atomic infosec data, is not cutting it. We need to be able to describe the paths and graphs that attackers take,
you know, flows, so to speak, but we lack a common language to do that. Attack Flow is that common language. It's a schema for storing path and graph data, and it's really cool because it's incredibly simple and it's incredibly strong.

If anyone wants to clap and tell me that I was great at doing this, now would be awesome, because on the next slide I'm going to tell you there was actually a team of people that did it. This was something done through MITRE's Center for Threat-Informed Defense, with Raj at AttackIQ, [name unclear] at Anomali, [name unclear] at Fortinet, Mark at Citi, Ryu at Fujitsu, Andy, who's now at Apple, and Mark at MITRE. We were lucky; we had many of the right people. Ryu worked on this stuff back in his PhD years in the early 2000s, and this guy is so smart that I just sit in awe. He could be wandering around here, and he's so mild-mannered that if he passed by we would never see him, but he's so smart and so awesome. I love Ryu. Andy was instrumental in writing Caldera, Mark was instrumental in writing STIX, and of course I've worked with graphs for a while, and I also maintain the VERIS schema used by Verizon for the Data Breach Investigations Report. So it was this team of people that did this.

So what does it look like? Graphs are these mathematical things, and they're actually pretty simple. They're made up of two things. Let me see, should I just push buttons to figure out which one's the laser, or should I actually think about it? Ah, yeah, there we go. They're made up of dots and they're made up of lines. The dots are nodes and the lines are edges. Every edge has to have a node connected at each of its ends; you can't draw a line in a graph and have nothing on the end. But nodes themselves can have multiple lines or even none. So this is a graph, this is a graph, and this is not quite a graph, because there's no node on the end. And there's a lot of
different types of graphs. There are simple graphs, where there's no direction; you don't know which end of the edge is the start and which is the end. There are directed graphs, where an edge goes from something to something, which is why we have this little arrow on this one. There are tree graphs, acyclic graphs, cyclic graphs, and directed acyclic graphs, or DAGs, which come up in a lot of different situations. There are also hypergraphs and property graphs. But we're going to talk about a very specific type of graph: linked data. That means a couple of things. The first is that every node and edge is defined by a single string, a single URI specifically. So every edge has a URI; on this one it's rdf:type, and I'll explain the rdf thing in a second. Every node has a URI; this is, I think, attack flow action one, or VERIS phishing. The only time you don't have those is when a node is a literal, so an actual string that is not a URI, or a number, something like that.

You combine those into what are called triples. Here we have a triple of attack flow action, rdf:type, VERIS phishing. It's just three things. You can picture a table with three columns; you put your edges in there and you have your triples, and you have your graph. But you'll notice that I didn't spell out "attack flow." I defined attack flow over here: "af" means this namespace. Namespaces are one of the cool things about linked data, because you don't have to reinvent the wheel. If someone else has an entire definition of how to explain things such as actions or assets, you can just use that; you don't have to come up with your own. In fact, a lot of it is predefined. RDF defines things like "is a type of," as in this action is a type of phishing. There are other namespaces you'll see a lot, and I'm going to bring these up for a specific reason. RDFS is used for
labels, so if you want to give your node a common name, something that sounds a lot better than "URI blah blah blah." There's Dublin Core, which adds things like a description for the node. There's Time for timestamps. And there's OWL, the Web Ontology Language, which has a lot of useful things, like "this node is the same as this node," or "this node has an object property or a data property of this thing," or "this is a named individual." There's phishing the concept, and then there's the actual phishing action that happened, as a named individual.

The reason I bring these up is that if you go back and Google this, or if you're Googling it right now, it's going to scare the bejesus out of you, because OWL does scary things. The people who invented OWL were a bunch of academics who said, "You know what's cool? We can do actual reasoning over this. We can do first-order logic. We can do all these fancy things." And we come in and go, "That's cool, and when we need to know that, we can know that. What we want right now is to say that one node is the same as another node." We're using this much of this big thing, and we don't have to understand the big thing. It doesn't affect us in any way.

The nice thing is you have this namespace. I'll explain this a little more later, but if you have a dataset that includes something like the city of Sydney, you don't also have to say that Sydney is in Australia, that Australia is a continent, and that it has this location. Someone else has already built a graph that explains all that. If you reference their Sydney node, then anyone who wants to know that stuff will be able to find it.

So now we know what graphs are. What about Attack Flow? Attack Flow is five things: actions, assets, properties, relationships, and the flow. The action is the thing that happened. The asset is the thing that has its state changed. Properties are these nodes
that describe actions, assets, other properties, things like that; they're the descriptions. Relationships are just the edges between them, and the flow is the set of all this together. When we look at it as a graph, it looks something like this. We can see this causal path through the middle of it, and one thing you'll notice is that it goes action, asset, action, asset. That's very intentional. We'll get into it a little later, but in security, different people think about causality in different ways. If I'm a blue teamer, I think that I have an asset and this action happened to it, because that's what my logs look like. If I'm on the red team, I might say I did this, which is an action, then I did this, which is an action, and the assets are implicit between them: I did this action to something. To capture all the ways that we think about security, it's important to have both and to have them go back and forth. We'll see some visualizations that show that later.

So now we're coming to the fun part: what does our data look like today? This is the opportunity to look at a bunch of data but not use it. Starting with a red team report. This is just something I downloaded off of pentestreports.com. What do we see? A lot of free-form text, not a lot of structure, and not a lot of solid grouping either. You can't say, "Here's a concept." There's grouping, but it's all over the place. So this is what our red team data looks like today.

Now, attack simulation data is a little bit better. It's structured; it's a tree. But a lot of times in attack simulation data you get a lot of raw code. It's hard to describe anything more than a single thing without [unclear]. This one single thing points to some code. And we don't want to have to describe our attack simulations as arbitrary code, because we don't want to be like, hey, I need you to
go run this attack simulation in your environment with your tools, here's just arbitrary code, have fun. We wanted something a little bit better than that.

Signatures. Again, structured. It looks very similar to the attack simulation data, because it's hierarchical data. But there are no real links. We're seeing one thing happening in an instant, and we want to be able to detect more. We want to be able to detect subtler things. We want to be able to detect multiple things.

Then we get into intelligence data. This is a subset of an eye chart, a very small subset of an eye chart. These are two records just off of Shodan, and there are some substantial problems with this. It's super dense. You also have duplicate data: if I have the same vulnerability down here, I'm going to get duplicated references and duplicated summaries, and those are going to be in every single record that references this vulnerability. That's going to take a lot of space. But the biggest problem is that when we use these data structures, we don't know what's in them. Over here, this is an HTTP test. This one, I don't think it's actually HTTP; it's a different test, the HTTPS test. And there's nothing that says they need to have the same information. So if I'm looking in my dictionaries for a key that's five layers down in every one of my records, it's not there in every one of my records. This becomes a huge problem as you look through data if you don't know exactly what the entire structure of the data looks like. And who wants to know the structure of all the data just to find the few things they want to look for? I say this from profoundly depressing experience.

Okay, so moving on: incident response data. This is all very textual; it's all these text files. By the way, big thanks to Chris Sanders and his training; he donated the text files for this. He's also helped out a lot.
So these are text files of indicators and information that were collected during this simulated IR. The problem is it's all text. Even though some of this is structured data, it's being stored as text, and there's nothing that links 222 to something in this file right there, or over here, or even in the same file, other than grepping the raw text. That's not very searchable. That's not very usable. You're never going to find that in six months.

Moving on to threat intel. This is some data donated by GreyNoise. This is actually pretty nicely structured, and it's relatively clean. But you can end up with a lot of raw data. In this one they scanned two ports. What if they scan every port? Then you get this thing that's this big. Also, what if you have 40,000 of these and you need to find every place that 23 is mentioned? Do you really want to look through 40,000 pieces of data to find every one that's got port 23 in it?

And even when we aggregate things like threat intelligence, well, I like to think that this looks good. I like to think that because I made it. You can all tell me that it looks good; you don't have to believe it, though. It's all aggregated, it's pretty, it communicates, it's clean. It hides a lot of nuance because of that aggregation, and we'll see that a little bit later.

But let's keep moving up the chain. Let's say we're making decisions about our architecture, what we're going to invest in. How do we do that these days? We make a 1-to-n list. At an organization I was at, the way we did it was that the entire security department could propose projects. They said, "Hey, we think it'll cost this much, and here's some text, maybe a paragraph, explaining the benefit we expect to get from it," and we'd build a 1-to-n list. At some point the money runs out, you draw a line, and you fund this stuff and you don't
fund that stuff abov