Building Intelligent Automatons with Semantic Reasoning

Name: Building Intelligent Automatons with Semantic Reasoning
Uploaded: 2018-04-25
Duration: 32 min 6 s

BSidesSF · 2018 · 32:06 · 203 views · Published 2018-04

Speakers

Anton Goncharov

Tags

Category: Technical

Style: Talk

About this talk

Anton Goncharov - Building Intelligent Automatons with Semantic Reasoning and Horse Glue

Proper data modeling is probably the most underrated aspect of security data analysis. Our addiction to logs and string pattern matching as a primary source of knowledge has painted security industry practitioners into a corner. The data never tells the full story, and the path to discovery is laborious and painful. We'll discover how graph-based ontologies can help consolidate all relevant information across technical verticals, model expert knowledge, and serve as a single source of knowledge. We'll discuss how semantic reasoning can revolutionize low-level data analysis and reduce 'zombie workflows' by automatically drawing hard logical conclusions the same way a human analyst does. And lastly, we'll touch on how Bayesian belief networks can help trace cause and effect in events reported by common monitoring and detection tools, establishing chains of events.

Transcript [en]

Yeah. I see a lot of new faces at BSides. I guess that means a lot of you guys have been breaking the first rule of the BSides club. There's going to be a lot of these going on. I apologize for that ahead of time. So yes, my name is Anton Goncharov, and we're going to talk about semantic reasoning and semantic technology and how it can help in a lot of areas that we're struggling with today. This is an attempt to condense about 3 to 4 years of practical experience and a lot of really tedious research with lots of long and boring words into about a 30-minute conversation. So I

apologize if we go a little bit fast. In your schedule there should be an attachment with a slide deck. There are a lot of links and references in there. I tried to make sure that you guys can follow up on any of these things if it sounds interesting. And of course, if you have any questions, just let me know after the talk. So let's just breeze through this. All right. So what are we going to talk about? We're going to talk about why semantic reasoning is important. How did we come to the state of things where we are today? What are semantic technologies all about? How

do we look at things versus how do we look at strings and patterns and fields? What is up with ontologies and how can we take advantage of them? How can intelligence be achieved with semantic reasoning? And we'll also do a little bit of stargazing about some of the cool things that we could do with this tech in the future. So that's who I am. I'm just going to skip that. I've been around for a while. I've done some things. You know, take a look at my profiles. Long story short, I spent quite a few years trying to figure out what is the

story that the data is telling us, how to take that story and tell it in a larger context, how to solve problems with it, how to solve pains, and how to fix things in such a way that they don't break again in the future. So now that you guys know a little bit about me, I would like to know a little bit about you. How many folks in the audience are on the defense side? How many of you would say you're in security operations? Fantastic. I'm really sorry, guys. Any women in the audience? Woo. You guys are the best. We need more of you. All right.

What is this talk about? Again, two parts. We'll talk about why things are broken, not just how but why, and what semantic reasoning can do to help. But before we dive in, I promised myself, okay, I will do just one anecdote, and I hope you guys will like it. So you guys use Lyft, right? They're right there. Awesome guys. I'll take them over Uber any day. And my experiences with Lyft are always phenomenal. I always run into these tremendously fun characters, talk to them, they share their life stories. So a few weeks ago I was coming back from vacation, and the driver just very quickly

kind of started talking about himself, and he's like, "Yeah, you know, I'm generally a nice guy. Just got engaged, but my fiancée doesn't know that I have a gambling problem." I'm like, "All right, that sounds like something you should talk about." And he's like, "Well, you know, sometimes I'll just buy things sporadically. I'll sit up late at night on QVC and I'll just buy things. Like, I bought two Star Wars lightsabers for my fiancée for her birthday." I'm like, "She's a Star Wars fan?" He's like, "No, they're more for me." Smart, right? I'm like, "Wow." And I'm like, "All

right, what's the most recent thing you bought?" And he says, "I bought this litter box that's fully automated, and basically it's really cool. It just sits there and it combs through your sand and it picks up the bits of cat poop." And he's like, "And it totally works too. I tested it last night. I melted a little fun-size Snickers bar and I put it in the sand and it just totally cleaned it up." I'm like, "Cool. Cool. Wait, why do you put a chocolate... Do you even have a cat?" He's like, "No, I just thought it was really

awesome." So I think part of the conversation I want to have is that yes, it's a talk about technology, it's a talk about technical capability, but it's very focused, right? We have to be very judicious with our time and the tech that we're using, so that we are not like that guy who has an awesome litter box but doesn't even have a cat. We actually want to have something in our environment that is useful and that we can do something with. So let's move forward. A day in the life of a security analyst. You guys have probably heard this during the vendor sales pitches. You know, things are horrible, things are falling

apart, but I always ask why. Why are they horrible? Why is the life of the security analyst so difficult, especially for tier one and tier two, the guys who are just doing triage? To me it kind of boils down to this: we have this data, right? And then we have insights, and we try to get from data to insights. So how do we do that? That's the big question. What does it take for us to look at the things that are coming in, all of the alerts, all of the detections, all of the third-party notifications, threat intel? And so

that's where the pain is, right? The pain is all of these things, all of the above. I don't have to go into details. You know them better than I do. But essentially, let's just face it: it's not working out, right? This whole log-based security analytics is wasting so much of our time, it's making our lives miserable, and it's making people leave. And they go to another place where they think things are going to be better, but they're not. Here's an example with a couple of vendors that I love to beat up on. I've obfuscated the names so you don't know who they are. So this is an

app from one vendor inside of the other vendor, and it's a threat intel feed, which you would probably pay money for to get to this point, right? It's going to cost you money. And first of all, look at that query, right? I mean, I know a little bit about SPL, the query language, but that would drive me nuts. And even after you get this, these are essentially reliable indicators. This should go straight to incident response, right? Well, congratulations, because now you have to... What happens?

Now you have to investigate every single one of those IP addresses and figure out which system it is, who it belongs to, what else we know about it, when was the last time it got patched, who logged into it, who logged out from it, what binaries are on it. So that's just the beginning of the pain. And looking back at how we got here, we started out with log management, you know, grepping, pattern matching, and the promise of a single pane of glass. Instead we got a single glass of pain. But why? Why are the SIEMs so painful? Part of it has

to do with technology. The identity models, asset models, the things that we need to do our job: that's as far as the SIEMs have been able to get. Like even in Splunk there is... Oh, sorry. Didn't say that. There are maybe five or six models in there, but no more than that, right? And we have to kind of finagle our way through it, and the more things we string together, the more they break because of the join statements, and things just don't scale. In the end it all boils down to this one very simple fact. And this is something that I

truly believe in, something that I'm willing to scream from the rooftops: the tools that we're given to do our job, just like the guys in the factory beating the rocks with hammers, are just not good enough to do the things that we need to do with them. It's as simple as that. So how do we fix that? Yay. All right. So you guys have probably seen the traditional data, information, knowledge, wisdom pyramid. I like this one a little bit better. I stole it from a gentleman named Shawn Riley, so I'll admit that. But I tweaked it a little bit

and basically the idea is that most of our time is spent on the left side of this diagram, right? We dig in data, and very rarely we get into the information part, and almost never we get into the knowledge part. The knowledge part is something that just naturally kind of happens to you as an analyst the more time you spend in the data. And you don't have to answer this, but think about it for yourself: how does your organization move things from data to information, and how does it move things from information to knowledge? Chances are it's very manual and very tactical and very occasional,

right? Which, as you can imagine, does not help things out. Another nice aspect of this diagram is that all of that is context. Yes, there is overlap, obviously, and some of the data has context, but really the context is between the information and the knowledge. And if you spend enough time in the trenches, eventually you'll start getting experience, which is kind of the superset of knowledge. So, how does semantic technology help? There are several key things that enable us to do things that were not possible before. First of all, we can represent the

things that we see as a network of facts, right? Which is very important. The data is stored the same way that we humans think about it, which is critical. Now that we store this information, or knowledge rather, in a network of facts, we can also automatically draw conclusions based on this information. We'll talk a little bit about how we can do that in a few slides. We can also fill in the gaps in the information provided by data, right? The data is never complete. There's always that missing data stream or that data source or a parser or a mapping or a

threat intel or a notification or an alert, or some kind of critical tie-in between the asset management database and the privileges in Active Directory and the configuration file on the server that just got deployed, that just does not connect together. And we have to have technology that fills those blanks in for us when they are obvious. When they're obvious to us as analysts when we look at them, right? We have to be able to have machines do the same thing. If it's obvious to me that IP addresses don't exist in a vacuum and there must be a system behind each one, we should be able to

represent that. That it's not just an IP address that talks to another IP address. They're just addresses. It's actually systems that communicate over the network, identified by addresses. But I'll try not to ramble. And most importantly, by automating this low-level data analysis, we as humans can focus on problems of a higher order, which is very important. That means using our skills, which are better suited for things like remembering things, following our intuition, hunches, getting a hint of something that's possible. Machines suck at that, right? But machines are great at "take this thing here, put it together, put it over here." So let's have

them do that. So that's the premise at least, right? Oh, and by the way, semantic means language-based, just in case you guys don't know, which means it looks and functions like language, which is, guess what, what we use. Smart. All right. So there are going to be some components to this, and you will find diagrams similar to this that are 10 times more complicated. I purposefully left only these blocks on screen. These are the logical components that make this whole magic happen. On the bottom you have a triple store, which by the way does

not have to be a graph database. I mean, you can even model this in a spreadsheet, right? Because in the end it's just tables. Then you have RDF, which is a basic data model that expresses everything in triples. We're going to talk a little bit about that. So everything in the RDF world is triples. Everything. And that makes some things easy but also makes certain things hard. We'll touch on that. Then we have RDFS, which is kind of an extension of RDF, and it gives you more complexity and more flexibility, which is very important in the information security field because sometimes one thing could be

two things and we need to be able to recognize it as both. For example, inheritance. We can have a host that could also be compromised. So it's a compromised host, but it's still a host. You see? It needs to be both for us. We'll talk about that. And then OWL. OWL is the mechanism that provides the ontology, which is kind of like a data dictionary. And then you have a graph query language that just kind of goes in between. Essentially, OWL is what teaches the store how to store information, so then you can query it.
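As a rough illustration of the two ideas above (everything is a triple, and a compromised host is still a host), here is a minimal Python sketch. It is not the speaker's implementation and does not use a real triple store. The `ex:` identifiers and the helper names are made up for this example, and only the `rdf:type` and `rdfs:subClassOf` terms come from the RDF and RDFS vocabularies.

```python
# A toy in-memory "triple store": every fact is (subject, predicate, object).
# All ex: identifiers are hypothetical and exist only for illustration.
triples = {
    ("ex:CompromisedHost", "rdfs:subClassOf", "ex:Host"),
    ("ex:web01", "rdf:type", "ex:CompromisedHost"),
    ("ex:db01", "rdf:type", "ex:Host"),
}


def superclasses(cls, facts):
    """Return cls plus every class reachable from it via rdfs:subClassOf."""
    seen, todo = set(), [cls]
    while todo:
        current = todo.pop()
        if current in seen:
            continue
        seen.add(current)
        todo.extend(
            o for s, p, o in facts
            if s == current and p == "rdfs:subClassOf"
        )
    return seen


def instances_of(cls, facts):
    """Instances whose declared type is cls or any subclass of cls."""
    return sorted(
        s for s, p, o in facts
        if p == "rdf:type" and cls in superclasses(o, facts)
    )


print(instances_of("ex:Host", triples))             # both hosts
print(instances_of("ex:CompromisedHost", triples))  # only the compromised one
```

Querying for hosts returns the compromised host as well, while querying for compromised hosts returns only the narrower set, which is the "it needs to be both" behaviour described in the talk.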

All right, moving along. I think this is from BSides Las Vegas, maybe a couple of years ago. Somebody made these labels that said "this is not a camera," and I think he made like a thousand of them. He was just giving them away. So literally, things in the Tuscany hotel had a label on them that said "this is not a camera." That was hilarious. But the reason I bring it up is that we spend most of our time looking at what we call labels in semantic technology, but they're actually fields. They're field values. And through

the field values and through pivoting through these fields is how we do our investigations. It's how we figure out what is what, how it acts, and where it goes. So I picked a dumb example: user Bob. What is Bob? Is Bob an account? Is Bob a login name? Is it an attempted credential? Is it a person? Is it Microsoft Bob? I hope not. So the point is, a lot of these standards or mappings are done for the analyst's benefit, but they're actually not really that

helpful. They don't help me understand what "user" means here in this particular case, because by "user" I can understand a whole lot of different things. And for machines it's even harder. So we try to weed them out and be as specific as possible. "User account" would have been much better. And having a domain to which that user account is tied, maybe a host, maybe Active Directory, would be even more spectacular. So for example, root, right? I always say, yeah, okay, but which one? I have 10,000 Unix servers in my organization.

Each one has a root account. Which root are you? That seems important, right? And that's part of the complexity that is inherent in our everyday activity. Now, basically I'm trying to deliver a very simple message here. When thinking about things versus strings, we have to think of data in terms of objects, not in terms of these little strings. And we do that, but we constantly have to translate it from the lines that we see on screen. And the strings are actually much less relevant, right? So for example, I'll tell you that I have a friend and his name is

George. What does it tell you? Like literally nothing, right? You still don't know who the person is. But if I tell you that I have a friend, and I'm not going to tell you his name, but he does late-night shopping on QVC and he owns an automatic litter box but does not own a cat, you're beginning to understand what kind of person my friend is, right? And that's the idea. The idea is that we understand our world as objects and by the associations those objects have with other objects. And that's why semantic technology fits into this model so perfectly

because that's exactly what it does. So let's talk a little bit about how that works. Semantic concepts. Now, I use this kind of loosely, right? Let's say I can use my domain account to log into my laptop. That can be represented as two objects. One is my account for Acme Corporation. The other is my laptop. And the account, as a credential, exists on this host. So that's a triple: subject, predicate, object. Both of those things are called vertices, or a vertex if it's just one, and the edge is the relationship between them. Objects and relationships, very simple, right? So now let's take a look at what that looks like for the event data.
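The "which root are you?" point and the account-exists-on-host triple can be sketched in a few lines of Python. This is only an illustration under assumptions: the `UserAccount` class, the `scope` field, the `existsOn` predicate and the host names are all hypothetical, chosen to show that two accounts with the same login string are still different things.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserAccount:
    """A 'thing', not a string: a login name plus the scope it lives in."""
    login: str
    scope: str  # hypothetical: a host name or a directory domain


# Same string "root", but two different objects.
root_a = UserAccount("root", "unix-a")
root_b = UserAccount("root", "unix-b")
anton = UserAccount("anton", "ACME")

# Facts as (subject, predicate, object) triples; names are made up.
facts = {
    (anton, "existsOn", "anton-laptop"),
    (root_a, "existsOn", "unix-a"),
    (root_b, "existsOn", "unix-b"),
}


def accounts_on(host, facts):
    """Pivot from a host vertex to the accounts known to exist on it."""
    return sorted(
        (s for s, p, o in facts if p == "existsOn" and o == host),
        key=lambda account: (account.scope, account.login),
    )


print(root_a == root_b)               # the two roots are distinct objects
print(accounts_on("unix-a", facts))   # pivot: accounts on one host
```

Because each account carries its scope, a question like "what accounts exist on this host?" becomes a walk along edges instead of a string match on the word root.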

We have basically the same thing, and they both have a relationship to an event. An event is also an object. This event has actually been initiated by Anton. So I am logged into my laptop, and from this host I am trying to log in somewhere. I am typing in "Anton", and I am trying to log in to the production server, whatever that is, right? Now, notice that I have been able to infer a fact: since I am logging in from my account on the host, I can safely say this account exists on the host, otherwise I wouldn't be able to log in from it. But because it's an attempted

authentication event, it does not mean that Anton exists on the prod server. So in this case it's not a user account, it's an attempted credential or authentication token. See how that works? It's very cool. And the prod server has an IP address. So when we look at strings, when we look at the logs, this is what's happening in our minds every single day, but we don't even think about it. It's just happening automatically for us. Now a semantic graph can extract that and map that, and it can put it in as a part of a larger network of facts. So now you have this gigantic knowledge

base. You can track what other accounts exist on my laptop, right? Who else tried to log into the prod server? What other IP addresses have been assigned to the prod server? You can pivot on those facts very quickly and very easily. And you don't have to know the query language, you don't have to care about where the data came from or what format it was in originally, you don't have to restructure your mappings. That's the whole point. So now let's talk a little bit about ontologies. This is definitely a favorite topic of mine. Ontologies are basically what separates a semantic graph from just a general graph data store. It's a

formal way to describe the way the knowledge should be structured. It's basically like a schema for your schema, if you will. A data dictionary. So, very powerful stuff. And using ontologies gives you certain benefits, which I would like to spend a little bit of time on. First of all, it federates data in a common language. We no longer care which format firewall vendor X is using, or this vulnerability scanner, or whatever. In the end, there is a model that says a vulnerability is a vulnerability, a host is a host, and when the vulnerability exists on a host, I don't care where the data came from.
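The federation idea just described can be sketched as two small mappers that turn differently shaped vendor records into the same kind of triple. Everything here is hypothetical: the record layouts, the field names, the `hasVulnerability` predicate and the placeholder identifier `VULN-0001` are invented for the example and do not correspond to any real product or advisory.

```python
# Two made-up vendor record shapes describing the same kind of fact.
scanner_a_record = {"hostname": "web01", "vuln": "VULN-0001"}
scanner_b_record = {"asset": {"name": "db01"}, "finding_id": "VULN-0001"}


def from_scanner_a(record):
    """Map vendor A's flat record onto the shared model."""
    return (
        ("Host", record["hostname"]),
        "hasVulnerability",
        ("Vulnerability", record["vuln"]),
    )


def from_scanner_b(record):
    """Map vendor B's nested record onto the same shared model."""
    return (
        ("Host", record["asset"]["name"]),
        "hasVulnerability",
        ("Vulnerability", record["finding_id"]),
    )


# Once mapped, the graph no longer remembers which vendor said what.
graph = {from_scanner_a(scanner_a_record), from_scanner_b(scanner_b_record)}


def hosts_with(vuln_id, facts):
    """One query over the common model, regardless of the data source."""
    target = ("Vulnerability", vuln_id)
    return sorted(
        s[1] for s, p, o in facts
        if p == "hasVulnerability" and o == target
    )


print(hosts_with("VULN-0001", graph))
```

The vendor-specific knowledge lives only in the mappers, so the query side needs just one vocabulary.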

The fact is, the vulnerability exists on the host. And it's tied to a piece of software. And by the way, here are the other things, too. So that is the main advantage of using ontologies. You can search your data across domains, vertical domains, without having to learn the query languages and the data structures of all those things that have come in. Second of all, it facilitates reasoning. If you give me two facts and you tell me that they're true, I will always give you the third fact. That's reasoning. We'll see how that works in a little while, but it also means that a lot of this

processing, the low-level processing, can be done automatically without having to go hunt for things that are actually obvious to you but are not obvious in the data that's coming in. Analytic pivoting. This is where you are able to dive into the data and look at all the facts that have been collected without necessarily knowing where they're going to lead you. So you can ask questions like, "I wonder what interesting things have happened to this host in the last 24 hours." And you can get this sort of network of facts that you did not know to ask for. You might see failed logins. You might

see account resets. You might see service interruptions, right? You might see reboots, which you did not necessarily even know how to ask about, but because they're connected to that object, they're right there in the neighborhood of that object to be discovered. And that's a very powerful, opportunistic way of doing analysis. You can test things out very quickly and prove whether they're true or not. You can also chain attack evidence, right? So if we look at this beautiful thing and see how many stages it goes through, that's just one tool, right? And there are millions of those out there. Being able to string these things

together in such a way that they make sense and can be represented as a part of one data set is extremely powerful. And then guess what? Then you can take parts of the chain and see if you have other places in your knowledge graph where those parts of the chain take place, and maybe they will lead you to some very interesting things you did not even know that you had. So that is the other part of it. Okay. To make these things work, we're going to use some concepts. First of all, inheritance. Very cool stuff. This is something that we have

used at my previous place, Gemini Data, and it works great. So for example, you have a person, and a person has properties. Every person in the world has a date of birth and a driver's license, or maybe an ID of some sort, right? That's how we identify them. Maybe it's a place of birth. But then if you have an employee, the employee will have an additional property, which is an employee ID. So each employee is still a person, but not every person is an employee. And you can query it either way. Very cool. Reverse edges. Let's say we have a host that belongs to a domain, and it's a one-way

relationship. We will probably also want to be able to ask the domain, "Hey, which hosts belong to you?" So being able to automatically traverse the outbound relationships in reverse is also very important. And support for actions, which are the automated rules that can be built right into the ontology and help fill in the blanks. All right, here's a very quick example of an ontology. You don't have to go through it too much. Basically, you can see how classes bubble up the hierarchy. There are a couple of things I want to mention. Generally the class hardware has a vendor, as an organization, for example, which means all of its children will

have a vendor. So it doesn't matter if you're buying a laptop or a printer, it will have a vendor, right? But only some of them, for example stationary machines or stationary computers, will have a "deployed at" relationship to a location, and hardware in general wouldn't necessarily. So you can manage your knowledge in this way. Here's an example of OWL. We're kind of running close on time. What you see here is the class that describes a user account, then one of its properties, which is the name, like a person's name, that is associated with the class, and then the third one is an edge, meaning that this object is allowed to

be a member of a group, and you see at the bottom it says range: group, which means it's only allowed to have relationships with objects of type group. All right, very quick. Here's a slide I'm not going to go through. Here are some references to existing ontologies. Some people have done some work and published them as open source. So you can go dig in. Very useful. If you're interested in this field, that's a good place to get started, especially this ICAS DARPA-funded initiative. Very cool stuff. All right. Getting close. Semantic reasoning. Very quickly, how does it work? There are two ways

reasoning works. One is that we can create an object, which could be a vertex or an edge, that must always exist, no matter what, usually by definition. So if a person named Ira has an uncle Olaf, we can always say that one of his parents, and we don't even care which one, has a brother. Because that's what uncles are, by definition. You see, the definition drives relationships. Another way is to promote a vertex or an edge into a more specific subclass based on information that you receive. So again, here's Ira and his beloved uncle Olaf. I don't know, maybe he hates him.

But let's say we find out that Olaf's mother had two children, two boys, right? And not a boy and a girl, which means Olaf's sibling is male, which means that Ira has a father now. He used to have a "has parent" edge. We don't know anything about that object, but we know that it exists, and now, based on additional information, we can make that relationship more specific. How does it work in infosec? Very quickly. Say we have a host. Let's say there is a software release. Let's say that software is installed on that host, so the instance of the software installed on that host is its own object. It's somewhere in that path.

And let's say that the software release over here has a vulnerability. So if we collect these three edges, we can say that whenever these three edges are present between these objects, we can always create a fourth edge, which is that the host has that vulnerability. That's an example of hard logic. I mean, you can quantify it, too, but for the sake of argument, let's say it will always be true. And we have figured out that the host is vulnerable not by running a vulnerability scan, which we can, but by simply using the three facts that were given to us. So that's pretty cool. So, where does it lead us?
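The three-edges-imply-a-fourth rule above can be written as a tiny forward-chaining step. This is a sketch under assumptions, not a real reasoner: the predicate names (`installedOn`, `instanceOf`, `hasVulnerability`, `vulnerableTo`) and all object identifiers are invented for illustration.

```python
# Known facts as (subject, predicate, object); all names are hypothetical.
facts = {
    ("install-1", "installedOn", "host-1"),
    ("install-1", "instanceOf", "release-2.3"),
    ("release-2.3", "hasVulnerability", "vuln-X"),
    ("install-2", "installedOn", "host-2"),
    ("install-2", "instanceOf", "release-2.4"),  # no known vulnerability
}


def infer_vulnerable_hosts(facts):
    """If an install sits on a host, is an instance of a release, and that
    release has a vulnerability, derive (host, vulnerableTo, vulnerability)."""
    inferred = set()
    for install, p1, host in facts:
        if p1 != "installedOn":
            continue
        for install2, p2, release in facts:
            if p2 != "instanceOf" or install2 != install:
                continue
            for release2, p3, vuln in facts:
                if p3 == "hasVulnerability" and release2 == release:
                    inferred.add((host, "vulnerableTo", vuln))
    return inferred


print(infer_vulnerable_hosts(facts))
```

The derived edge comes purely from joining the three existing facts, with no scan involved. A production system would express this as a rule in the ontology or in a rule language instead of nested loops.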

There are several things this enables that weren't possible before. Contextual analytics, which we talked about a little bit. Instead of looking at the raw data and statistics and top 10s and bottom 10s and all of that, we can actually look at the logical facts and see which facts make sense and which do not. Clustering is being able to identify objects simply by the relationships that they have. You know, I have 20 web servers, and this system, I didn't know it was a web server, but everything that it does and everything that it says is exactly like a web server, so maybe that's a web server. Outliers: hey, this sales guy has really strange

and unusual permissions for a sales guy, because everybody in his organization does this, yet he's in all the admin groups. What's up with that? And the tricky one, similar subgraphs. If this is an attack, find me another subgraph in my knowledge graph that is very similar to it, because that means it's also probably going to be an attack. I think that's about it. So, parting words. Remember, semantic web technologies have been around for a while. They've been around since the beginning of the web. XML is built on it, Google's PageRank was built on it. It's not just for Google anymore. Let's get in there. Let's use it. Even when you're looking at fields,

you're always dealing with things, so remember that when you're doing your data modeling and take it into consideration. Pay attention to your data modeling. It's very important. It's going to save you a lot of heartache. It's like taking the trash out. Nobody wants to do it, but you've got to do it. Manage knowledge. Think about how you manage knowledge in your organization. How do you retain all of these lessons and learnings that come after an incident? Where do they go? Who takes care of them? Can they be read later? Can that part of it be automated? So think about that. And stay in touch, and thank you very much,

guys.
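The outlier idea mentioned near the end of the talk (a sales account that sits in admin groups while its peers do not) can be sketched as a comparison of one object's edges against those of its peers. This is an assumption-laden illustration: the user names, group names, the `max_share` threshold and the function name are all hypothetical.

```python
from collections import Counter

# Made-up group membership edges for four accounts in the same department.
member_of = {
    "alice": {"sales", "crm-users"},
    "bob": {"sales", "crm-users"},
    "carol": {"sales", "crm-users"},
    "dave": {"sales", "crm-users", "domain-admins"},
}


def unusual_memberships(member_of, peers, max_share=0.25):
    """Flag memberships held by at most max_share of the peer group."""
    counts = Counter(group for user in peers for group in member_of[user])
    flagged = {}
    for user in peers:
        rare = {
            group for group in member_of[user]
            if counts[group] / len(peers) <= max_share
        }
        if rare:
            flagged[user] = rare
    return flagged


print(unusual_memberships(member_of, list(member_of)))
```

Only the account whose relationships differ from the rest of its peer group is flagged. The same comparison could be applied to any edge type in a knowledge graph, not just group membership.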
