
The BSides DC 2017 videos are brought to you by ThreatQuotient, introducing the industry's first threat intelligence platform designed to enable threat operations and management, and by DataTribe, a new kind of startup studio co-building the next generation of commercial cybersecurity analytics and big-data product companies.

Good afternoon, and thanks for being here. My name is Tim Mather, and I work with a startup in Silicon Valley called PatternEx that applies artificial intelligence and machine learning to security. I'm a longtime security guy. Unfortunately my partner, who is a true data scientist with a PhD from MIT, got called away; he was not able to fly here from Madrid because of an urgent customer requirement, good for us but bad for you, that's due on Monday, so that bumped into this. Anyway, we'll go ahead and get started, and I encourage your questions. I don't have a clicker, so I'm stuck here behind the podium.
Okay. So, what is the problem we're actually trying to solve? I'm making the assumption that you're all security folks here: the problem is that we're trying to detect threats, and today we're not doing a very good job of it. What we're trying to do is apply AI so we can detect on a much more flexible basis. I'm sure many of you, if not all of you, have some sort of SIEM deployed, or use an MSSP to run one on your behalf, and we'd like to intercept attacks as far left on the kill chain as possible. Unfortunately that's not working out very well. The fact of the matter is that we have way too many false positives and way too many false negatives; all you have to do is read the paper every week and see how many companies are getting breached. And all of this demands very high resources to actually investigate the incidents. So, is there a better way?

What we are doing with our models is detecting at multiple stages of the kill chain. At the moment we can do this at a couple of stages using artificial intelligence and modeling: for delivery, for command and control, and for exfiltration. We use something called entities. An entity can be, for example, a source IP address, a destination IP address, a domain, a user, an application, and so on. We take those entities, with various features attached to them, and build them into pipelines to do the modeling; I'll go into this in some detail in just a moment. Very importantly, we keep analysts, people, in the loop to speed up the learning curve. And then, of course, it's not nearly enough for a system to tell me it has detected a problem; how do you actually do something about remediating it? We want to at least facilitate the response. We're not doing the response ourselves, but with whatever product you have, whether that's Phantom, Ayehu, whoever that might be, we want to be able to facilitate it. Most of you are probably aware that there is now a standard for that called OpenC2, managed by OASIS, which is specifically designed to let detection products talk to response products.

So if you're looking at your log data, you're probably only looking at a fraction of what you're logging, and what you're logging is probably a fraction of what you're actually generating. You're probably looking at some combination of billions or even trillions of events per day. This is something that no human, and no group of humans, can actually go through. Very, very few of those events are bad, and you want to quickly find the ones that are. The problem with that massive amount of data is that probably 99% of analyst time is spent chasing things they don't really need to chase, but then again, they don't know that. And from third-party research, take the Verizon Data Breach Investigations Report as an example: in the last report, 82% of the time the attack data was actually there in the logs, but no one even found it. It wasn't until the breach was detected, usually by an external party rather than the company that was hacked, that anyone could go back forensically and say, oh, there's the data. At that point, so what? Maybe that's good for forensic purposes, but it certainly didn't do anything to stop the breach to begin with.

So how do we do this? Again, we're working with artificial intelligence models to get something much higher fidelity for the analyst. One of the types of modeling we use is graph analysis, which shows not only first-order connections between the entities (I'll get more into entities in a moment) but also lets us find second- and even third-order connections between them, which makes the investigation process far, far faster. That means a lot less work for the analyst to do their investigations.
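Those second- and third-order connections are straightforward to picture in code. Here is a toy sketch, mine rather than PatternEx's implementation, using a breadth-first search over a small entity graph to label every related entity with its hop count:

```python
from collections import deque

def connections_within(graph, start, max_hops):
    """Breadth-first search returning each reachable entity and its
    hop count (1 = first-order, 2 = second-order, 3 = third-order)."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand past the requested order
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    seen.pop(start)
    return seen

# Toy entity graph: user -> host -> domain -> another host
graph = {
    "alice": ["10.0.0.5"],
    "10.0.0.5": ["evil.example", "alice"],
    "evil.example": ["10.0.0.9", "10.0.0.5"],
    "10.0.0.9": ["evil.example"],
}
hops = connections_within(graph, "alice", 3)
```

An analyst pivoting from the user entity immediately sees not just the host the user touched (first order), but the domain that host contacted (second order) and the other host behind that domain (third order), without walking the logs by hand.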
Now, this isn't easy. There's a lot of talk about AI, and in the security space just about every vendor today claims to be doing some sort of AI or machine learning. But for all the noise and hype, AI as employed in the InfoSec space is not the same as AI in other spaces. A lot of work has been done on AI for facial recognition, for example; Facebook in particular has the largest set of individual photos in the world, which is great, because Facebook has a huge training set to work with, and there are more than 20 large data sets available for facial recognition. Unfortunately, in the InfoSec space, outside of malware, and I'm talking about exploits here, there are very, very few, really almost no, current large-scale data sets containing dirty traffic to model with. So availability of data sets is a very difficult problem.

As far as variety is concerned, it's also very difficult in information security, because there are probably hundreds of different products out there, and if you count the versions of those products and the versions of their logs, we're probably talking several thousand different formats that have to be worked with and modeled. If you take faces, I can model those very easily, because there are maybe four formats I have to work with: PNG, TIFF, GIF, JPEG. I can normalize those training sets very rapidly. We don't have that luxury in information security.

When it comes to labels, meaning I have actually determined what this is, whether it's malicious or benign, and if it's malicious, what it actually is: there are large data sets available for malware, but for exploits they simply don't exist. So it's very hard for us to get labels to work with; we basically have to generate them ourselves, working with our customers. On the photo side, I don't even need a human; I can have Amazon's Mechanical Turk go through and label everything for me, so that's easy. And as far as static versus dynamic is concerned, the attacks, as you're all well aware, change on basically a daily basis; they are very, very dynamic. What a cat's face looks like doesn't change very often. Once I know what a cat looks like, or what a person looks like, it's pretty darn easy to zero in and recognize that specific object. We don't have that luxury currently in the InfoSec space.

So here's an example of finding domains used for phishing delivery. Here's a simple one, and that's fine; but what happens when the bad guys change it? You go to VirusTotal, you say, yep, that's malicious; so the bad guys go and change it, and now it looks like this: here's my original, and I just change it a little bit, and a little bit more, and a little bit more. If I'm working with blacklists, this is a game of whack-a-mole; you're never going to win. But this is a perfect example of where machine learning through models can pick this up, because these are very subtle changes, yet enough to throw off a blacklist, which you or your vendor then has to update and apply on a daily basis. From a machine learning perspective, this is trivial; it's quite easy.

So what does the training process look like? First of all, we have to acquire the data. We need a large set of dirty data to work with, and we need it on a constantly refreshed basis, because again, the attacks keep changing.
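Going back to the mutated-domain example for a moment, here is a toy illustration of the whack-a-mole point (mine, not the product's): an exact-match blacklist misses a one-character mutation, while even a simple edit-distance check, never mind a trained model, still flags it.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def near_known_bad(domain, blacklist, max_edits=2):
    """Flag a domain that is a small mutation of a blacklisted one."""
    return any(edit_distance(domain, bad) <= max_edits for bad in blacklist)

blacklist = {"paypa1-secure.example"}
mutated = "paypa1-secvre.example"   # one character changed by the attacker
exact_hit = mutated in blacklist    # the blacklist misses it
fuzzy_hit = near_known_bad(mutated, blacklist)
```

The exact lookup fails the moment the attacker flips one character; the lexical comparison still catches it, which is the intuition behind modeling domain characteristics instead of memorizing domain strings.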
So what I do, as the security guy working with the data scientist guys, is point out, here's a type of attack; and then we ask, based on that type of attack, what are the sources of log data that might reveal it? That may be a firewall, that may be EDR on the endpoint, whatever it might be. So we identify the types of products that will be sources for that information. We then document the specific data elements in each source that we want to look at, and the thresholds for them: is it high, is it low, is it first, is it last, whatever it might be. Then we say, okay, we're looking for that, and by the way, we're collecting all these different features out of that log data (I'll get to more of that in just a moment). And from those features, more specifically the feature vectors, we can use different types of AI models to go through and look for that type of attack.
The important part here is that we're not looking for a specific attack; unlike, for example, looking for a specific strain of malware, we're looking for a type of attack, based on the behavioral model associated with that attack. So what am I talking about? We work with entities, and you can think of an entity as one of these examples; there are others, but let's start with some very basic ones. You have a source IP address; a destination IP address; a domain associated with that destination IP address; a user associated with the source IP address; an application that that specific user is working with; and others that go along with those. These are the entities we use to model in our pipeline. Then, out of the sources we're looking at for each type of entity, we look for the various features that go along with it, and these entities and features are built into what we call pipelines, an AI term, if you will. For the source IP pipeline, as an example, there are 32 different features we capture. So when we talk about a source IP, it's not just a number; we look at all the attributes, or features, that go along with it. Some of those I care about as a security guy: when did it happen, when was the first time it happened, when was the last time, what was the largest size, what was the smallest size. And a lot of those features really don't mean anything to me as a security guy, but they're extremely important to the data scientists, because that's how they build their models: what was the average size of this, what was the mean of that, what was the standard deviation? So about half of those 32 probably don't mean anything to me as a security guy, but they mean everything to the data scientists, so that they can build the model that looks for that type of attack. As a second example, for a domain, we don't just care about the domain name; again, there are all these different features that go along with it. In that case we currently capture about 20 different features related to the domain; I care about roughly half, and the data scientists care about the other half for their modeling.
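As a rough sketch of what one of these pipelines computes, here is a handful of stand-in features aggregated per source IP from simplified flow records of (source IP, timestamp, byte count); this is illustrative, not the actual 32-feature pipeline:

```python
from collections import defaultdict
from statistics import mean, pstdev

def source_ip_features(flow_records):
    """Aggregate a few per-source-IP features from (src_ip, ts, nbytes)
    flow records: a small stand-in for the ~32 features in the pipeline."""
    by_src = defaultdict(list)
    for src, ts, nbytes in flow_records:
        by_src[src].append((ts, nbytes))
    features = {}
    for src, events in by_src.items():
        times = [t for t, _ in events]
        sizes = [b for _, b in events]
        features[src] = {
            "first_seen": min(times),   # the analyst-facing features
            "last_seen": max(times),
            "max_bytes": max(sizes),
            "min_bytes": min(sizes),
            "mean_bytes": mean(sizes),  # the data-scientist-facing ones
            "stdev_bytes": pstdev(sizes),
            "event_count": len(events),
        }
    return features

flows = [("10.0.0.5", 100, 500), ("10.0.0.5", 160, 1500), ("10.0.0.7", 120, 80)]
feats = source_ip_features(flows)
```

The split in the comments mirrors the point above: first/last seen and sizes matter to the security analyst, while means and standard deviations feed the models.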
As we go through this process, we use something called active learning. There is unsupervised learning, there is supervised learning, and active learning is a combination of the two. We definitely want a human in the loop, for two reasons. Number one, the human allows the model to be trained much faster. The human looks at the predictions we make, based on the modeling we've done with the features, which we turn into what we call feature vectors, and we put up a probability: is this benign, is this malicious, what is it? The analyst can look at that and say, yeah, that's benign; or, no, that's wrong, that's actually malicious. Or we say we think this is malicious, and it may be, for example, an outlier; we think it's malicious simply because we haven't seen it very often. And the analyst can look at it and say, no, no, I know what that is: on the last day of the month we do this transfer to this entity, it's payroll or something like that, and that's why you see it so infrequently. So it's really not an outlier; it really is benign instead of malicious. That feedback from the analyst lets us train the model much faster than if we expected the model to learn only on its own in an unsupervised mode.
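The feedback loop just described can be sketched in a few lines. The uncertainty-sampling heuristic here is a common active-learning choice and my assumption, not necessarily the exact strategy used:

```python
def pick_for_review(predictions, k=2):
    """Uncertainty sampling: send the model's least-confident
    predictions (probability nearest 0.5) to the analyst first."""
    return sorted(predictions, key=lambda p: abs(p[1] - 0.5))[:k]

def apply_feedback(labels, feedback):
    """Analyst verdicts override model guesses and become training labels."""
    labels.update(feedback)
    return labels

# (event_id, P(malicious)) from the current model
preds = [("e1", 0.97), ("e2", 0.52), ("e3", 0.07), ("e4", 0.45)]
queue = pick_for_review(preds)
# The analyst recognizes e2 as the monthly payroll transfer (benign)
# and flags e4 as malicious; both verdicts feed the next training round.
labels = apply_feedback({}, {"e2": "benign", "e4": "malicious"})
```

The confident predictions (e1, e3) skip the queue; the ambiguous ones go to the human, which is exactly where the human adds the most training value.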
So, based on the feedback from the analyst, we adjust the model accordingly. This active learning process is key to generating accurate labels for behavior: is this behavior benign, is it malicious, and if it's malicious, what type of attack is it? For those of you familiar with MITRE and their work, they have an ontology called ATT&CK, which lays out part of the kill chain: roughly ten different tactics, ten types of attack, with techniques under each describing the different ways a tactic can be accomplished. We use that ontology as a starting point for the model. Why? Because it's from a third party; it's not just us as a vendor making this up. It's a trusted source, MITRE, who have done work for the US government for years, and it's pretty darn accurate. So we feed that in, we base our labels on the feedback provided by the analyst, and we label things into a classification that has some standardization. That standardization becomes very important when it comes to transfer learning, which I'll get to in just a minute. And by the way, questions are welcome at any time; you don't have to wait until you get a beer to ask one. Yes, please. You're raising your beer, okay, good job. One for me? I guess I have to wait, then.

Okay, so transfer learning itself is not new; it has been around for years. I used to be at Symantec, and all the major anti-malware vendors have had so-called zoos for years. The dirty little secret, so to speak, is that they have actually shared their samples between zoos for years. If you look at Snort and its community rules, that's another example of transfer learning; again, Snort has been around since about 1998. So nothing new there. The ISACs were launched in 1999, with the presidential directive signed the year before, in 1998, around the whole idea of how you share this data. And now, of course, we have I don't know how many dozens and dozens of threat intelligence and threat feed vendors doing that. Those are all examples of transfer learning that have existed within InfoSec for many years.

So what's new when we do this with AI? We are not sharing static rules, if you will. It's not that strain of malware; it's not a Snort rule for this type of attack over RPC or FTP or whatever the protocol is; and when it comes to threat intelligence, it's not this specific domain or this specific IP address. No: we are sharing behavioral models. This is, if you will, a more generic, much more flexible approach. We don't care what the specific source IP address is, as an example; we don't care what the specific destination IP address is. What we care about is the behavioral model that goes along with them, and that's why we collect all the features from these different entities: it allows us to build the model, which can then be shared. It's the models we're sharing, not static rules.

The other thing we're doing, which to date has not been done previously, is that we're not sharing any private information. When information is shared between organizations, we're not sharing the source IP address, we're not sharing the user name, we're not sharing the host name; we don't need to share any of that. All we're sharing is what the behavioral model looks like. For the ISACs, this has of course been a major problem for years: the reluctance to share information, certainly not publicly but even privately, because much of the information shared is specific to that organization. That's why they're closed communities. We don't have to worry about that, because we're not sharing anything private; all we're sharing is the model itself, which is generic.

What this transfer learning allows us to do is address the silo problem. You can easily share information between organizations, and again there's no threat of private information being released; we're just sharing the label data and saying what it is. This enables the analysts to collaboratively train off of it, and to say, well, actually, I think the model looks like this in my organization. So the analysts can work on the models and not the data itself, and, just as importantly, there's no private data that needs to be exchanged. That in itself is a big win.
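A minimal sketch of what a shareable label might contain, with hypothetical field names of my own; the point is simply that identifying fields are stripped and only aggregate behavioral features plus a tactic label leave the organization:

```python
def shareable_label(entity_features, tactic):
    """Build a shareable behavioral label: aggregate feature statistics
    plus an ATT&CK-style tactic name, with identifying fields dropped."""
    PRIVATE = {"src_ip", "dst_ip", "user", "hostname"}
    return {
        "tactic": tactic,
        "features": {k: v for k, v in entity_features.items()
                     if k not in PRIVATE},
    }

observed = {
    "src_ip": "10.1.2.3",         # never leaves the organization
    "user": "alice",              # never leaves the organization
    "mean_bytes": 1000.0,
    "beacon_interval_s": 60.0,
}
label = shareable_label(observed, "command-and-control")
```

Another organization receiving this label learns what beaconing command-and-control traffic looks like statistically, without learning anything about who generated it.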
So we have three objectives here. One, we want better detection rates; we need to get that false negative rate way, way down. Two, we want to learn faster; that's why we have the human in the loop, and that's why we're doing the active learning. And three, we want to detect attacks that are not being detected today; there are attacks flying by that no one has even seen. We want to move that learning curve up, make it steeper, and get to a higher constant level faster, so we can detect those and vastly improve on where we are today. Now, is it perfect? No, it's not perfect, and I'll give you an example of that in just a moment, but it is an improvement on how we're doing things today.
Okay, so we put all of this into a shared label repository, and it looks like this. It means something to the models we have; it may not mean much to individuals, but quite honestly, at this point we're not concerned that it means something to individuals. It needs to make sense to the models. We take this information and upload it to a label server, a collection point where we take all the different models that have been seen and aggregate them together. Now, that doesn't mean every organization is going to accept every single label that's up there. It's just like Snort rules: you're not going to download and install every single Snort rule; you're going to filter and tailor that rule set to what's applicable to your organization, what OSes you're using and not using, what protocols you're using and not using, and so on. We're going to do the same thing with the labels. It's likely that a label from a retail environment may not be applicable to financial services, or vice versa, so we need a way to handle that, and we do it through the training set: is this label applicable to this organization? Again, an organization is not going to download every single label; if it's not applicable, it frankly just slows down the learning process. So we need a way to do some A/B testing, if you will: does this label work for you, does it not work for you? We can tag the label with the industry sector it came from, start with labels that have already been vetted in your vertical, and take it from there as far as the training set is concerned. And we can easily share this data today over STIX/TAXII; that's no problem whatsoever. We just extend the STIX JSON format with the entity and the features at the bottom.
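As an illustrative sketch of that extension (the x_ property names here are my invention, not an official schema; STIX 2 reserves the x_ prefix for custom properties), the extended label might look roughly like:

```python
import json

# Hypothetical extension of a STIX 2 indicator with entity/feature data.
label = {
    "type": "indicator",
    "spec_version": "2.0",
    "id": "indicator--00000000-0000-0000-0000-000000000001",
    "labels": ["malicious-activity"],
    # Custom (x_-prefixed) properties carrying the behavioral label:
    "x_entity": {"kind": "domain"},
    "x_features": {"vowel_ratio": 0.18, "digit_ratio": 0.22,
                   "domain_length": 34},
}
payload = json.dumps(label)  # ships over plain TAXII, unchanged
```

Because the additions are ordinary custom properties, any STIX-aware consumer can ignore them, and the TAXII transport underneath needs no changes at all.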
So there's no change other than a slight extension of the STIX format to include the information specific to this new type of capability; you just use STIX and share it. Many of you, probably all of you, are already using STIX, and there's no change whatsoever to the underlying TAXII protocol as far as the actual transport is concerned.

Here's a quick experiment we did; you can read the numbers there. We took three different organizations and looked at them from January through June of this year. This was for a phishing model we tried, and we ranked it against the Alexa top 10,000; hopefully you all know what Alexa is as far as website rankings are concerned. This is the number of phishing attempts we found and the number of phishing domains we found, and by the way, those were found before they showed up anywhere else. This slide lost some formatting, sorry about that; not sure what happened, it looked fine in Google Slides. It gives you some analysis of the data itself in terms of what we saw. We're using a random forest here, which just happens to be one of the four different types of models we use. There are numerous libraries out there to help train these; R is quite popular, and we happen to be using scikit-learn.

Then, for the active learning part, we've got a number of different types of models on the left that we use, depending on what it is. We take the top several thousand events, analyze those, and come up with what each really is. Then, most importantly, we use the feedback from the analysts: is that prediction correct, or is that prediction wrong? And we track ourselves internally, because it's important to us how good our models are: what is our own models' false positive rate, what is our own models' false negative rate? We need that information to improve the models themselves, and frankly, the analysts may have missed some things too; they're only human, so we need that sort of data as well.

Okay, here's an example of an organization doing this on its own; we're going through and doing the classifications, in this particular case with a random forest. And here's a second organization. We're not perfect, as I said; I'm not sure what happened with the fall-off in the last month for organization number two. That's them on their own.
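That internal tracking step can be sketched as follows: compare model verdicts against analyst verdicts to compute the model's own false-positive and false-negative rates (a simplified stand-in for whatever bookkeeping is actually used):

```python
def score_model(predictions, analyst_truth):
    """Compare model verdicts against analyst verdicts and compute
    the model's own false-positive and false-negative rates."""
    fp = fn = pos = neg = 0
    for event_id, predicted in predictions.items():
        actual = analyst_truth[event_id]
        if actual == "benign":
            neg += 1
            if predicted == "malicious":
                fp += 1  # model cried wolf
        else:
            pos += 1
            if predicted == "benign":
                fn += 1  # model missed a real attack
    return {"fp_rate": fp / neg if neg else 0.0,
            "fn_rate": fn / pos if pos else 0.0}

preds = {"e1": "malicious", "e2": "malicious", "e3": "benign", "e4": "benign"}
truth = {"e1": "malicious", "e2": "benign", "e3": "benign", "e4": "malicious"}
rates = score_model(preds, truth)
```

Tracking these two rates over time is what tells you whether the retraining is actually helping, and, as noted above, the analyst verdicts themselves are imperfect, so the "truth" here is really the best available label.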
If we supplement that and say we're going to do this with active learning and shared labels, you can see that the curves are steeper and you get to a higher point at the end. So the transfer learning does work. Is it absolutely necessary? No, but why not get there faster? If you can use other organizations' labels, their models, why not do that? Why not share that type of information? Now, in this particular case we were obviously looking at phishing domains, and what's interesting is that if you look at what machine learning could find before the blacklists, more than 90 percent was found before the blacklists had it. Again, the blacklists are your threat intelligence feeds; I'm not bashing them, I'm just saying they're static, if you will, in that they're based on static IP addresses and static domains. I have to know the exact IP address, I have to know the exact domain; those change regularly, and then I update my blacklist accordingly. If there's a way to do this on a generic basis, better and much faster, why not? In this case, 90% of the time the models beat the blacklists, so this is a pretty strong indicator that the process actually works.

One of the ways we do that is by looking at different types of features: the vowel ratio, the digit ratio, the number of phishing-associated names, other ratios, the TLD frequency, the domain length, the consonant ratio, and so on. All of these get factored into the model, which then looks for something whose characteristics fall within the ranges we see. In English, for example, the consonant ratio is fairly constant, the vowel ratio is fairly constant, and so on; in French it's slightly different, but also fairly constant. And there are certain keywords that show up repeatedly and are a pretty good clue that it's a phishing domain: "download", "free", that sort of thing. So we factor those in as well.
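A minimal sketch of computing a few of those lexical features (my simplified version; the real pipeline extracts many more, and the suspicious ranges come from trained models rather than hand-set thresholds):

```python
import math
from collections import Counter

VOWELS = set("aeiou")

def lexical_features(domain):
    """Lexical ratios of the kind listed above; values outside what a
    natural language normally produces are a phishing signal."""
    name = domain.split(".")[0].lower()   # left-most label only
    letters = [c for c in name if c.isalpha()]
    return {
        "length": len(name),
        "vowel_ratio": sum(c in VOWELS for c in letters) / max(len(letters), 1),
        "digit_ratio": sum(c.isdigit() for c in name) / max(len(name), 1),
        # Shannon entropy of the characters; algorithmically generated
        # names tend to score higher than dictionary words.
        "entropy": -sum((n / len(name)) * math.log2(n / len(name))
                        for n in Counter(name).values()) if name else 0.0,
    }

f = lexical_features("paypa1-secure-login.example")
```

Each feature value then becomes one coordinate of the feature vector fed to the model, alongside TLD frequency, keyword hits, and the rest.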
Some of this can also be done using n-grams, which is fairly common. Of course, not all the world speaks English, so you're going to need various foreign-language libraries to account for that, and you also have to account for the TLDs that have been internationalized now. What holds for English doesn't necessarily have any bearing on Cyrillic or Arabic, so all of that needs to be accounted for, which is a lot of work, but that's where we are.

The results we're seeing so far, and it's early on, have been peer reviewed through academic conferences, specifically at MIT and through IEEE, and the results so far are pretty good. Not entirely positive, as you can see; there are mistakes being made, but the machine learning is learning, and that's good. If you're interested in this, there's a peer-reviewed paper called AI2, published out of IEEE, that you can read. These numbers are not just a marketing ploy; they have been validated academically. We've got a lot of next steps to try; there's progress that's been made, but there are things we need to do, and new techniques we need to develop. The phishing detection is very accurate, as we've seen to date.
Finding DGAs is also very accurate, as we've seen to date. Unfortunately, you now see less use of DGAs, as attackers move more toward common websites, platforms if you will, to do their C&C: specifically through Twitter, as an example; Microsoft's TechNet has been used for that; Facebook has been used for that; anywhere there are open pages that can be written to. So the DGA model is accurate, but unfortunately for us, DGA use has actually been declining over the last couple of years. And again, of course, we're working with logs. It doesn't need to be syslog, though that's the most common variety; unfortunately for us, syslog is not syslog is not syslog, since every vendor seems to have its own slight variant, and that's difficult. It can be DNS logs, it can be Bro logs, whatever the log format is. And then, of course, actually finding the most recent attacks is very difficult. We think we're doing a better job than some other sources, but that's a constant challenge, and there are other things we're working on as well.

It's all about labels: is the traffic benign, is the traffic malicious, and if it's malicious, what tactic is it, what technique is it? The more refined those labels are, the more accurate they are, and therefore the greater the ability to actually share them with other organizations through transfer learning. This is the key; this is the goal; everything is about labels. And again, while labels exist fairly widely on the malware side of InfoSec, they are very, very thin on the exploit side, and that's really where we're focused: how do we find those labels on the exploit side, and how do we bring down those false negatives? That's all I have, so I'm happy to address any questions, or you can go grab a beer. Yes, please.
"With the n-gram approach, are you doing word segmentation of the domains, followed by n-grams, and then something like TF-IDF?" I'm not sure about TF-IDF specifically; it's one of the techniques that can be applied. As I mentioned on that one slide, between the consonant ratio, the vowel ratio, the length, and so on, there are a number of different techniques used to say what the probability is that a domain is actually a phishing domain. One of the things we want to get to as well is a real-time lookup against WHOIS, one of the registrars, to ask when the domain was actually registered. If the domain was registered yesterday, that's a pretty strong indicator, which is also why we use the Alexa rankings there: to try to see whether this is an established domain or not. So we're constantly trying to improve that model. Yes, please.

Yep. Oh, yes. Yes, the whole idea is to correlate across the different entities. You take the entities, and this is where what we call the autocorrelation comes in, using the graph analysis to find the second- and third-order connections across the different entities. We're not looking just at the source IP pipeline, not just at the domain pipeline, not just at the destination pipeline, and so on, on an individual basis. No: we start within a pipeline and the different features within it, and then we correlate that against the other pipelines, and that's what gives us the behavioral model, as a result of correlations across the different entities. Correct. Yes, please.
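Going back to the WHOIS idea for a moment, the age check itself is trivial once you have a registration date. In this sketch the date is passed in directly; in practice it would come from a WHOIS client, which is out of scope here:

```python
from datetime import datetime, timedelta

def domain_age_suspicious(creation_date, now=None, min_age_days=30):
    """A domain registered very recently is a strong phishing indicator.
    creation_date would come from a real-time WHOIS lookup; the 30-day
    threshold here is an illustrative assumption, not a product setting."""
    now = now or datetime.utcnow()
    return (now - creation_date) < timedelta(days=min_age_days)

ref = datetime(2017, 10, 1)
fresh = domain_age_suspicious(datetime(2017, 9, 30), now=ref)  # 1 day old
old = domain_age_suspicious(datetime(2012, 1, 1), now=ref)     # years old
```

Combined with an Alexa-rank check, this distinguishes an established domain from one stood up yesterday for a campaign.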
So the question is: any difficulties in training environments with attackers living off the land versus sysadmins? Say a little bit more, please. Ah, referring to external versus insider attacks.
PowerShell so let's talk about that as an example obviously PowerShell is being used fairly extensively these days for lateral movement the problem with putter coming with PowerShell is not necessarily what it's doing the problem is actually getting the logs that do that that can tell you what exactly is going on almost nobody actually has PowerShell logging turned on so again we're back to the problem that we have is actually finding large sources of data that that log data that may contain that information and PowerShell is high on the list that we'd love to get to almost no one's logging that data so it's really hard to work with it believe me for as a security guy
it's number one on my list, but my ability to deliver that to my data scientist colleagues is pretty darn low right now, unfortunately. So, yes please? No, not for phishing; for command and control, correct. So anywhere, any social media platform where I can write to various pages. In essence I'm telling my compromised host, you know, communicate with the C&C server, so I post there. I use Twitter as an example, or Facebook, or TechNet, anywhere I can write. I'm going to call it a social media platform; Microsoft TechNet is not a social media platform, but I can post comments there, so any platform where I can do that counts. It's very hard for me, going through the logs, to discern that that's actually malicious traffic to Twitter versus legitimate traffic to Twitter, and since those direct messages, in the case of Twitter for example, are HTTPS, I do not get the full URL. The only thing I get out of the logs is "that's Twitter." So it's very hard for me to discern, if I'm using only firewall logs or proxy logs, that this is actually some sort of command-and-control traffic, because it's going to some obscure address, some obscure page on Twitter. I need something from the host side that tells me that. Ideally, I want to be able to marry host-based data with network-based data to get an accurate read on what's going on, because if I only have one or the other, I'm really limited: I can see something that I suspect, but I don't, if you will, have any confirming data on it. So it's very hard right now to pick that up. Again, I've got to have both sources. Just firewalls, just proxies, doesn't give me enough, because it's encrypted; I can't get it from the network side, so I've got to have the host side, if it's being logged at all.
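That host-plus-network "marriage" is essentially a join: the proxy log only shows the destination domain, the host telemetry names the process, and only together do they flag the odd flow. All records, hostnames, and field names below are invented for illustration:

```python
# Proxy log: for HTTPS all we see is the domain, never the full URL.
proxy_events = [
    {"host": "wkstn-12", "dest": "twitter.com", "bytes_out": 400},
    {"host": "wkstn-12", "dest": "twitter.com", "bytes_out": 390},
    {"host": "wkstn-07", "dest": "twitter.com", "bytes_out": 12000},
]

# Host telemetry: which process actually opened the connection.
host_events = [
    {"host": "wkstn-12", "dest": "twitter.com", "process": "svch0st.exe"},
    {"host": "wkstn-07", "dest": "twitter.com", "process": "chrome.exe"},
]

BROWSERS = {"chrome.exe", "firefox.exe", "msedge.exe"}

def suspicious_flows(proxy, host_side):
    """Join on (host, dest); flag traffic not driven by a known browser."""
    procs = {(e["host"], e["dest"]): e["process"] for e in host_side}
    flagged = []
    for e in proxy:
        proc = procs.get((e["host"], e["dest"]))
        if proc is not None and proc not in BROWSERS:
            flagged.append({**e, "process": proc})
    return flagged
```

Here the two `wkstn-12` flows get flagged because a non-browser process is talking to twitter.com; neither the proxy log nor the host log alone would have been enough to say that.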
And given how high the volume of full-URL logging would probably be, how many hosts are probably logging that? Not many; probably about as popular as logging PowerShell at the moment. So it's difficult, it's a challenge. Getting large-scale dirty traffic on a timely basis is a challenge; it's a challenge for all the security vendors. There are two large government data sets out there, but quite honestly they're several years old now at this point. So, any other questions? Yes, please? Say more, please. Well, we're constantly trying to improve that, constantly trying to improve that. Again, we started with MITRE's ATT&CK ontology because it's standardized, if you will; it's open, it's free, your tax dollars already paid for it. But we're constantly looking for new features that are out there. Now, we're somewhat constrained by what it is the vendors actually log, so we can't just invent something; if it's not in the logs, we can't invent it. We're constrained by that, and then constrained, well, I don't want to say constrained, but it's time-consuming to actually determine what the vendors' data values in the logs actually are. Some vendors are good about describing their syslog, if you will: okay, this is the field, this is the value, this is what it means. Other vendors, quite honestly, are just absolutely horrible about it. There are a couple of them out there, I won't say who, but I don't even think they know themselves; they haven't documented it for themselves, let alone for their customers. So trying to decipher the logs is, frankly, a challenge for us. If it's not in the logs, we can't use it as a feature, and even if it is in the logs, it's not always worthwhile; sometimes it only concerns the product's own performance, and we don't care about the product's performance. But it's a good question, thank you. Anyone else? Yes, please?
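A minimal sketch of the field-extraction step just described, for a key=value style syslog line. The sample line and field names are invented; in practice every vendor format needs its own parser, documented or not:

```python
import re

# Hypothetical firewall line in key=value style; real formats vary per vendor.
SAMPLE = "Oct 12 14:03:11 fw01 action=deny src=10.0.0.5 dst=203.0.113.9 dport=4444 proto=tcp"

KV = re.compile(r"(\w+)=(\S+)")

def parse_kv_syslog(line):
    """Extract key=value pairs; undocumented fields stay raw strings."""
    return dict(KV.findall(line))
```

Only the fields whose meaning is actually documented (or reverse-engineered) become model features; the rest are kept but unused.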
Give me your business card; that would be a better question for my colleague, who's the data scientist. I'm the security guy, not the data scientist, and he could give you a much better answer. So, anything else? Okay, thank you for attending. Go grab a beer. [Applause]