
Please, let's welcome — thank you. [Applause] Howdy! Hi, thank you all for coming. After the lunch talks, this is a super hard act to follow. Hi, I'm John Seymour, and I'm basically going to build on and stand on the shoulders of these giants. The way this talk is structured, my goal is not just to get through these slides — I hope you'll have a lot of opportunities to ask questions. And the second thing: even though I'm the one presenting, this work represents the corpus of 32 applied ML engineers in our group, so it's not just
me who did all this work, so I want to make that clear. And Gabe is not here, but thank you, Gabe, for inviting me again. So the first thing I want to talk about is how we think about the current state of security. If you think of the red team kill chain, it's something we're all familiar with: you've got an attacker doing some reconnaissance, establishing some sort of persistence by compromising a service account; they move laterally, and they exfiltrate data. We all know about this. So what we did, back in 2012, when we were thinking about rejiggering the way our team approaches security data science, was talk to a lot of analysts and a lot of blue team members, and we found that they also have a discrete set of steps they take to attain their goal. This is what we call the blue team kill chain. Blue team members gather data sets, they author atomic detections, some sort of alerts get written, and then it comes to triaging — this is when they're looking at multiple panes of glass, trying to see which alert is the one sticking out like a sore thumb. Once they find the alert, they gather context, they run their playbooks, and then they actually execute some sort of remediation or response procedure. And what we found is that if the length of the blue team kill chain is the same as that of the red team kill chain, you're essentially playing the role of an arson investigator: the house has already burned down, and you're now going to figure out what caused the fire — which really doesn't add much value. So we asked a question: for attack detection, can we pivot the conversation to attack disruption? And what it really means is
that you're trying to play the role of a firefighter: the length of the blue team kill chain is now reduced — it's no longer the same size as the red team kill chain — and you can disrupt the attacker before she attains her goal. That's going to be the theme for today's talk: how can we disrupt attacks, how can we pivot the conversation from attack detection to attack disruption, and why can we do that today? For us, the biggest roadblock we kept hearing again and again was: hey, there are too many false positives; there are so many noisy, benign alerts that we're not able to get to the goal we want to attain. When we thought about this in the context of our red team and blue team kill chains, what really happens when people say 'false positives' and 'annoying' is that the blue team member is not able to go beyond alerting: they get all these alerts, but they're not able to triage, they're not able to gather context, and therefore they're not able to execute on the million-dollar playbooks they've authored. So, being the bashful young folks that we all are, we said: let's solve this using visualizations — let's get them graphs, let's give them some pew-pew graphs to make it look
awesome. So here's an alert for our automated accounts: when an automated service account logs in interactively — which should never happen — we alert on that sort of anomaly. And we did what everybody else would do: we had dashboards, we had Power BI charts, and we basically said, hey, if you click on this, you'll see this graph — go do your thing. The problem is that, in the context of Azure, there are API calls being made — thousands of them, on the order of a second — so instead of intelligent triaging, it all becomes Where's Waldo. And the second problem is that our red team members have shown again and again that attacks do not stick out like a sore thumb. If I'm not able to find what's anomalous in thousands of API calls, there's no way I can expect my analysts to do it. So the first thing we had to come to terms with is the scale at which we operate: we see six hundred thirty billion authentications per month. If I have a false positive rate — a modest one of ten percent — that is still in the
tens-of-billions range. So I need to think of methodologies that really shrink that down, that operate at the scale of the data I'm seeing, without that kind of false positive rate. So we had a couple of mind shifts, and for the first few minutes I'm just going to walk you through the different mind shifts our team had. The first is that we want to focus on what we call successful detections. Picture a chart: one axis is the sophistication of the algorithm, from basic to advanced; another is the security domain knowledge — are you putting a lot of effort into domain knowledge, or only a little; and the other side is the utility of those alerts — given an alert, is it actually useful or not? If you do basic outlier detection — say, the number of failed logins is greater than two standard deviations above the mean, so it's an outlier — it's a fairly straightforward method, there isn't really much security domain knowledge that goes into it, and you should not expect big utility in terms of your rate of return; that's what we found. The immediate knee-jerk reaction is: let's bring in time-series analysis, let's really increase the complexity of our algorithm. And this is what I actually did, right off the bat: we looked at the weird times people check code into their systems, and we built a very regular, vanilla time-series filtering system to look at failed logins around code check-ins. But here's the deal: blindly increasing the complexity only produces anomalies. The way my system failed is that it worked really well in the North America region, but the concept of a weekend in the Middle East is extremely different — in Israel the weekend is Friday and Saturday, so people are actually checking in code on Sunday, because that's their Monday. Blindly increasing complexity really does not give you much value. What we think is the true goal is successful detections — security-interesting alerts — and that's what you really want to strive for.
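As an aside, the 'failed logins greater than two standard deviations' rule can be sketched in a few lines of Python. This is a toy illustration — the user names and counts here are invented, not real data:

```python
from statistics import mean, stdev

def flag_outliers(failed_logins, k=2.0):
    """Flag users whose failed-login count exceeds mean + k standard deviations."""
    values = list(failed_logins.values())
    threshold = mean(values) + k * stdev(values)
    return {user for user, count in failed_logins.items() if count > threshold}

counts = {"alice": 3, "bob": 2, "carol": 4, "dave": 3, "eve": 2,
          "frank": 3, "grace": 4, "heidi": 2, "ivan": 3, "mallory": 40}
print(flag_outliers(counts))  # only mallory's burst exceeds the threshold
```

Note the classic weakness: a single huge outlier inflates the standard deviation, so with only a handful of users this rule can mask the very anomaly it should catch — one reason its utility plateaus.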
The second mind shift we had: as data scientists, we have this insane curiosity towards feedback — hey, I'm producing an alert, let me go talk to my analyst, let my analyst be my Mechanical Turk, and let me reap the feedback she provides. And that's really sad, because that's not an analyst's job. An analyst is there to secure the system, and looking at them just as a proxy Mechanical Turk, a vendor for your feedback, was not useful for us. So what we did was tap into the other assets we have to get labels. For instance, bug bounties: we had simple bug bounties, first internal to Microsoft and then soon outside — we said, hey, if you come in and attack us, you'll probably get a monetary reward for it. Red teams are another great way for us to get labeled data. My favorite one is actually getting labels from other products, which I'll talk about in a case study later. The third mind shift is understanding how the cloud is very different from an on-premises setting. On-premises, we all have this really nice, comfy
feeling of a private network. We know exactly what the crown jewels are — Active Directory, for example — and any time you own Active Directory, you get the keys to the kingdom; you can be domain admin. If you think in terms of the cloud, it's a little bit different, and also a little bit the same. It's the same in the sense that anything you see on-premises, you will also see a cloud analog for. For instance, you might have an on-premises SQL Server to host your data; in the cloud it's SQL Azure. You might have your domain controller, which would be Azure Active Directory. There are a lot of analogs — servers become services — but it's also inherently different: the whole point of the cloud is that there is no single point of failure. So what are the crown jewels of Azure, what are the crown jewels of your cloud? Asking this question, and really talking with our red team and our blue team, helped us identify them. OK, so LSASS over here maps to something like Key Vault in the cloud; protecting storage accounts is important in the same way that protecting SQL Server is important. We tried to make these analogs — this was done with our red team — taking the knowledge we had on-premises and translating it to the cloud. Things like the attacker's goal: on-premises she might want to be domain admin, and on the cloud that translates very nicely into becoming subscription admin. Things like pass-the-hash on-premises become a credential pivot in the cloud, where an attacker who owns a subscription might get its certificates and pass those certificates to authenticate. So we tried to do these translations.
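To make the translation concrete, the on-prem-to-cloud analogs named in this talk can be written down as a simple lookup table. This is just the mapping restated as code; pairing pass-the-hash with "pass-the-certificate" is my shorthand for the credential-pivot analog described above:

```python
# On-premises concepts and their (rough) Azure analogs, as discussed above.
ONPREM_TO_CLOUD = {
    "SQL Server": "SQL Azure",                     # data hosting
    "Active Directory": "Azure Active Directory",  # domain controller / identity
    "LSASS": "Key Vault",                          # where the secrets live
    "domain admin": "subscription admin",          # attacker's end goal
    "pass-the-hash": "pass-the-certificate",       # credential pivot
    "servers": "services",
}

def cloud_analog(onprem_concept):
    """Translate an on-premises security concept to its cloud analog."""
    return ONPREM_TO_CLOUD.get(onprem_concept, "no direct analog")

print(cloud_analog("LSASS"))  # Key Vault
```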
These translations helped guide us toward intelligent cloud-security detection methodologies. The fourth mind shift: we want to solve for classes of attacks. We have a 33-member data science team, and we still have an extremely healthy backlog of things to get done every semester. What this tells us is that there are always going to be more security scenarios to solve than the resources you have at hand. And this is an important insight, because if you're going to build a unique custom solution for every security scenario your management asks you to solve, it's going to be hard. That's traditional machine learning, where every task gets its own learning system. What you really want to pivot to is learning from related tasks: if you've solved detecting geo-login anomalies in Azure Active Directory logs, try to reuse that methodology — and we'll see how it transfers to detecting unusual SSH logins. This helped us keep our backlog manageable and healthy. The last mind shift: you really want to embrace empathy. I know this sounds a little frou-frou, but one thing that's really helped us build what I think are solid ML detections is to think and talk
to our security analysts. We put them front and center, and again, we do not look at them as Mechanical Turks. This is a picture of what we call the grading fiesta: we host it every other week, and every member of our team comes together and we grade a person's detection. If we're not able to grade it — if we're not able to say what the anomaly is — there's no way in hell our analysts or our customers are going to be able to do it. The second thing we started doing is partner calls: we'd ask customers, hey, we have something in private preview, would you like to test it? Really getting that feedback from our customers helped us remember that we're building solutions for them and we want to help them. So, given these five mind shifts I've talked about, I'm now going to ground them in how we protect assets in the cloud — at the host, the identity, and the service level. The remaining time is going to be case studies in each of these areas. I'm going to start at the host and talk about detecting malicious PowerShell commands. I probably don't have to say this out loud, but
malicious usage of PowerShell is on the rise. I love the graph from Symantec — it's a nice hockey stick; I think it was right after PowerSploit was released. Today we're going to think specifically about PowerShell obfuscation. Let's run through this really fast: here's a very simple PowerShell command to download malware from a malicious website. First off, I can escape the commands; I can then escape the characters in the command; I can even escape the arguments; I could put a pointer to New-Object — and all of these do the same thing. And then I could basically escape everything — this is classic Daniel Bohannon, who came and schooled us at BlueHat. Essentially, all of these variants do the same thing: go to the same website and get malware from it. And if you want to decode PowerShell command lines, rules really don't work that well, because you'd have to write a regex for each variant, and classical machine learning didn't work that well for us either, because every command line is unique — there may be no discernible pattern. Our previous approach actually used n-grams — like the cool n-grams talk Hiram just gave in the last session — we basically used 6-grams, and our true positive rate was around 67%. We wanted to ask ourselves: can we do better? Our hypothesis was that deep learning methods are really efficient at capturing semantic variation — is it possible to use that in this particular problem domain? In the next couple of slides, I'm going to show how we capture semantic relationships using embeddings, and how we use those embeddings to classify command lines.
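For reference, the 6-gram baseline mentioned above works roughly like this — a minimal sketch of character n-gram featurization, not the production featurizer:

```python
from collections import Counter

def char_ngrams(text, n=6):
    """Overlapping character n-grams of a (lowercased) command line."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def featurize(command_line, n=6):
    """Bag-of-n-grams counts: the sparse features a classical model consumes."""
    return Counter(char_ngrams(command_line, n))

print(featurize("IEX (New-Object Net.WebClient)").most_common(3))
```

The weakness is exactly the one described above: an obfuscated variant produces almost entirely different 6-grams than the plain command, so the model only recognizes what it has literally seen before.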
So, a quick primer on embeddings. They're really popular in natural language processing — there was a fantastic talk yesterday on embeddings and malware, which probably does a much better job than the couple of slides I'm going to show you. But essentially, when somebody says embeddings, all you want to think is: taking words and converting them to vectors. The classic traditional approach would be one-hot encoding, but that really doesn't give you much, because you lose semantic information. One-hot encoding just says: if I see this word, put a 1 there, and everything else in my vocabulary is a 0 — so you get an extremely sparse matrix. Embeddings, on the other hand, are dense vectors: the meaning of a word is captured in, say, four bits of information, and it's smeared across those four bits. And once you use embeddings, what they help you do is capture semantic relationships between words. This is the classic hello-world of embeddings: if you take the vector representation of queen, subtract the vector representation of woman, and add the vector representation of man, you should land somewhere in the neighborhood of the vector representation of a king.
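The queen − woman + man ≈ king arithmetic can be demonstrated with toy vectors. Note these four-dimensional embeddings are hand-crafted so the relationship holds by construction; real embeddings learn this structure from data:

```python
import math

# Toy 4-d embeddings; dimensions roughly encode [royalty, feminine, masculine, person].
emb = {
    "king":  [0.9, 0.0, 0.9, 1.0],
    "queen": [0.9, 0.9, 0.0, 1.0],
    "man":   [0.0, 0.0, 0.9, 1.0],
    "woman": [0.0, 0.9, 0.0, 1.0],
    "apple": [0.0, 0.0, 0.0, 0.1],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def analogy(a, minus, plus):
    """Return the word closest to vec(a) - vec(minus) + vec(plus)."""
    target = [x - y + z for x, y, z in zip(emb[a], emb[minus], emb[plus])]
    candidates = set(emb) - {a, minus, plus}
    return max(candidates, key=lambda w: cosine(emb[w], target))

print(analogy("queen", "woman", "man"))  # → king
```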
And the reason why this works is that the meaning of a word can be inferred from the company it keeps. Our goal is: can we infer the meaning of a command from the company of the other commands next to it? That's the problem we want to solve, and we use word2vec. What you essentially get is the ability to distinguish things that don't match: for instance, these tokens are all vanilla literals except one, which is a boolean variable — you'd be able to find that. Or all of these tokens are window styles except 'bypass' — you'd be able to figure that out using contextual embeddings. But the real power of contextual embeddings is that you can also learn linear relationships like the one I showed you: Export-Csv − Csv + Html gives you ConvertTo-Html. So what does this give you, other than the fact that we spent probably a day, if I'm generous, coming up with this slide? The thing is: if in your training set you only see Export-Csv, Csv, and Html, you can still get ConvertTo-Html — you do not have to see ConvertTo-Html in your training set, because you can derive it from the others. And that's how we circumvent the 'no discernible pattern' problem of the previous approaches: when an attacker uses a command we have not seen before, but we have a healthy corpus of training data, we are able to infer what's unseen. So this is basically our data set: we get benign scripts from our own environment, but we also take them from
the PowerShell Gallery. We tokenize them — we get about 1.4 million distinct tokens — and we learn the embeddings from the unlabeled scripts. We also have a very, very tiny corpus of labeled scripts, and that is what goes into our convolutional neural net when we train the classifier. I'll show you this very fast — and I have to give props to a lot of folks here. So this is what an embedding looks like in the TensorFlow projector: we've taken about 10,000 command lines and projected them onto three principal components. Now imagine you've seen Invoke-WebRequest in your training set, but an attacker now uses IWR — an alias for Invoke-WebRequest. If I search for IWR, something I've not seen before, in my embeddings — I'm not sure if you can see it, but the first neighbor is Invoke-WebRequest. That's essentially how we again circumvent the 'I've not seen this before' problem: I just check my embeddings and find the nearest one. Because just as a word is known by the company it keeps, a command is known by the company it keeps — IWR is most likely Invoke-WebRequest.
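That nearest-neighbor lookup can be sketched like this. The embedding values are hand-placed for illustration — in the real system they would come from word2vec trained on tokenized PowerShell:

```python
import math

# Toy "learned" embeddings: hand-placed so the alias sits near its full cmdlet.
embeddings = {
    "invoke-webrequest": [0.95, 0.10, 0.05],
    "invoke-expression": [0.20, 0.90, 0.10],
    "export-csv":        [0.05, 0.15, 0.95],
}
iwr = [0.90, 0.15, 0.08]  # the never-before-seen alias 'iwr'

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def nearest(vec, table):
    """Nearest known command in embedding space:
    'a command is known by the company it keeps'."""
    return max(table, key=lambda name: cosine(table[name], vec))

print(nearest(iwr, embeddings))  # → invoke-webrequest
```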
So, projecting back: these are the results we have. First off, the model is retrained multiple times per day, and — most important — classification completes on the order of seconds, which comes back to shortening the blue team kill chain. And we got a really good, hefty increase in true positive rate while keeping the same false positive rate — which I think is a good way to show our management why our team should exist: we haven't made the system worse, and if anything we're able to catch more attacks. We also put a link to the paper on arXiv — I'll be distributing the slides anyway — and the code is also published on GitHub, so please feel free to play with it. The second case study — I'm still in the realm of the host — is detecting compromised virtual machines. Our previous approach used rules and heuristics. The problem at hand: is a virtual machine compromised or not? And our true positive rate was 55% — marginally better than tossing a coin to decide whether a VM is compromised. So we obviously wanted to do
better, and the solution we gravitated towards was leveraging the spam information from Office 365 alongside the NetFlow data from Azure. Essentially, our hypothesis is: if a VM is sending spam, it's most likely compromised. And we know the VM is sending spam because we get the spam labels from Office 365 — they maintain a corpus of information saying this IP has been sending out spam, and if we see traffic from a particular virtual machine at that IP address, we use that as label data for spam. There are a lot of good reasons why network data is good for detections: first of all, you don't have to do any installation — it comes for free, with no overhead on the customers — and essentially it's OS-independent, which is important in Azure, where there's a good healthy mix of operating systems running on top of it. Some of the features we extract from IPFIX — which is basically NetFlow, by the way — from that flow data, we look at the ports of the traffic, the number of connections, and which TCP flags are set. Those are our features.
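A sketch of turning raw flow records into features of that shape — the field names here are illustrative, not the actual IPFIX schema:

```python
from collections import Counter

def featurize_flows(flows):
    """Aggregate per-VM flow records into simple count features."""
    feats = {
        "n_connections": len(flows),
        "n_distinct_dst_ports": len({f["dst_port"] for f in flows}),
        "n_smtp_connections": sum(1 for f in flows if f["dst_port"] == 25),
    }
    flag_counts = Counter(flag for f in flows for flag in f["tcp_flags"])
    feats["n_syn"] = flag_counts["SYN"]
    feats["n_rst"] = flag_counts["RST"]
    return feats

# A spam-sending VM would show many SMTP (port 25) connections:
flows = [
    {"dst_port": 25, "tcp_flags": ["SYN"]},
    {"dst_port": 25, "tcp_flags": ["SYN", "RST"]},
    {"dst_port": 443, "tcp_flags": ["SYN"]},
]
print(featurize_flows(flows))
```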
They're essentially the columns of our data set — and, like I said, the labels are the spam tags from Building 34 in Redmond. So this is what the training data looks like: we've got a handful of spam-labeled data and a lot of benign IPFIX data, and we run it through our random forest model. Then, given a new case — previously unseen NetFlow data — we run it through our ensemble and make a judgment: spam or not. I love talking about ensembles — they're my favorite — so here's essentially what it looks like. Imagine all the positive examples are benign traffic and the negative examples are malicious traffic, and now — representing this in two dimensions — you have to find a hyperplane that divides them. But it's not really possible: there's no single line that would cleanly divide both, so you need something with a nonlinear decision surface. The first time around, you learn with one learner, you do some classification, and you see this learner gets some of the positives and calls some things malicious. You identify those and update the weights: in the next iteration, anything you got correct you down-weight, and anything you got wrong you up-weight — think of a really bad teacher: she only looks at the things you got wrong and wants you to get those correct. So you up-weight those, and your second learner gets those correct, but at the risk of getting the others wrong; then you do one more round and boost again. You iterate through rounds of boosting, and the final result is a combination of multiple learners — and you get this nice nonlinear decision surface.
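The up-weight/down-weight loop described above is essentially AdaBoost. Here's a minimal version with threshold stumps on one-dimensional toy data that no single stump can separate:

```python
import math

def stump_predict(x, threshold, polarity):
    """Weak learner: a single threshold test."""
    return polarity if x >= threshold else -polarity

def train_adaboost(xs, ys, rounds=5):
    """Minimal AdaBoost: up-weight mistakes, down-weight correct answers each round."""
    n = len(xs)
    w = [1.0 / n] * n
    learners = []
    for _ in range(rounds):
        # Pick the stump minimizing weighted error.
        best = None
        for threshold in xs:
            for polarity in (1, -1):
                err = sum(wi for wi, x, y in zip(w, xs, ys)
                          if stump_predict(x, threshold, polarity) != y)
                if best is None or err < best[0]:
                    best = (err, threshold, polarity)
        err, threshold, polarity = best
        err = max(err, 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        learners.append((alpha, threshold, polarity))
        # The re-weighting step described above.
        w = [wi * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
             for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]
    return learners

def predict(learners, x):
    """Weighted vote of all the weak learners."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in learners)
    return 1 if score >= 0 else -1

xs = [0, 1, 2, 3, 4, 5]
ys = [1, 1, -1, -1, 1, 1]   # not separable by any single threshold
model = train_adaboost(xs, ys)
print([predict(model, x) for x in xs])
```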
So the reason we use ensembles is that they're ridiculously fast — ensembles were actually used in Kinect for pose estimation. Again, thinking about shortening the blue team kill chain: we're able to classify on the order of seconds, and our training is relatively fast — we actually run it multiple times per day, and it completes within the order of minutes. And we got a twenty-six-point improvement just by using an out-of-the-box ensemble method. This is also a good lesson to keep in mind: if there's an easy solution, please take it — there's no reason to reach for fancier methods. In fact, we actually started off with logistic regression; we could have had a high-school student come and solve this problem for us. The important thing to keep in mind is to optimize for a business metric — hey, I want to be able to detect attacks faster — and most of the time, out-of-the-box solutions are pretty good at scalability. The third case study is going to be identity-focused. We saw two case studies focused on the host;
now I'm going to talk about a case study focused on identity. One of the things we like to think about is the ML journey — this is something I got a lot of help with from my product marketing team, and I think they did a really good job. On one hand, you've got folks who really aren't machine learning powerhouses, but who still yearn to do machine learning — probably their board wants them to do machine learning. The unfortunate truth is that finding a security data scientist is like finding a unicorn, and they come in at NFL-style salaries — you can't build an enormous team of them. So our problem statement was: empower businesses that are not able to hire security data scientists to do machine learning. Specifically: we have a solution to detect anomalous Azure Active Directory logins — open that model up so anybody can detect anomalous logins for their own problem domain. It could be SSH, it could be Linux logins, it doesn't matter. That was the problem at hand: helping the folks who don't have any ML investments — we're not talking about people like you, who have advanced ML investments, but people who
have no ML investments: taking the things we've built and opening them up so they can build on top of them. So in the use case, the challenge presented to us was anomalous logins. This is the classic time-travel scenario: Ram lives and works in Redmond; at minute one he logs in from Redmond, and at minute two he logs in from Hungary — that's not possible. We've actually solved this problem in Azure Active Directory, and the question was: can we take this and open it up so other people can build on top of it for their own login space? There was no previous approach here.
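The time-travel check itself is easy to sketch: compute the great-circle distance between consecutive logins and flag physically impossible speeds. A toy version, with coordinates and a speed cap I picked for illustration:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2 +
         math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def impossible_travel(login_a, login_b, max_kmh=1000.0):
    """True if the implied speed between two logins exceeds airliner speed."""
    (lat1, lon1, t1), (lat2, lon2, t2) = login_a, login_b  # t in seconds
    hours = abs(t2 - t1) / 3600.0
    if hours == 0:
        return True
    return haversine_km(lat1, lon1, lat2, lon2) / hours > max_kmh

# Redmond at minute one, Hungary (roughly Budapest) at minute two — not possible.
redmond = (47.67, -122.12, 60)
budapest = (47.50, 19.04, 120)
print(impossible_travel(redmond, budapest))  # True
```

Of course, this naive rule is exactly what VPNs, proxies, and legitimate travel break — which is why the production system layers peer similarity on top of it, as described next.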
Our solution was: can we reuse what we've already built by turning it into a more generic model? I'll first explain what we have, and then I'll tell you what modifications we made. This is the geo-login anomaly detection that's in production in Identity Protection. It has basically three straightforward steps. First, you capture a 45-day window of your login data; each user is essentially mapped to some ISPs, and a geolocation service looks them up. So user one is in the Washington area, around Bellevue, and user four seems to be in the Massachusetts area — but we don't know that yet. Second, once we've captured this login data, we calculate user-to-user similarity metrics using custom mappings. All this does is help you find how similar users are to each other — and it's done within the context of a tenant, by the way, not globally. It's an extremely sparse matrix, in the sense that users who tend to have the same set of login patterns end up with similar mappings. And this is interesting, because geo-login anomaly detection trips up a lot of vanilla systems: people travel, people use VPNs or proxies. The goal was: if my manager travels to Israel — because we have a team there — then when there's a login for me from Israel, there should be a small but finite probability accorded to it. And that's exactly what the second step does. The third step is that we basically run random walk with restarts on this similarity matrix, and this gives us a reachability score: given a point in time, what are the different locations a person's login history says they could plausibly be in? This is great at Microsoft scale, but to open it up, we had a lot of problems. First off, it's extremely heavyweight, in the sense that it's compute-intensive, and, as I showed you, it needs training data on the order of billions — not many organizations see hundreds of billions of logins every month. And it also uses handcrafted features.
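Random walk with restarts on a user-similarity matrix can be sketched with plain power iteration. The three-user similarity matrix here is made up for illustration:

```python
def rwr(similarity, start, restart=0.15, iters=100):
    """Random walk with restarts: steady-state reachability scores from `start`."""
    n = len(similarity)
    # Row-normalize the similarity matrix into transition probabilities.
    trans = []
    for row in similarity:
        s = sum(row)
        trans.append([x / s if s else 0.0 for x in row])
    p = [1.0 if i == start else 0.0 for i in range(n)]
    e = p[:]
    for _ in range(iters):
        # One step of the walk, then teleport back to `start` with prob `restart`.
        step = [sum(p[j] * trans[j][i] for j in range(n)) for i in range(n)]
        p = [(1 - restart) * step[i] + restart * e[i] for i in range(n)]
    return p

# Users 0 and 1 log in from similar places; user 2 is unrelated.
sim = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
scores = rwr(sim, start=0)
print(scores)  # user 1 is far more reachable from user 0 than user 2
```

A login location that is plausible for a highly reachable peer (your manager's trip to Israel) then gets a small but nonzero probability, instead of an automatic alert.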
Those handcrafted features mean it really doesn't transfer well to other domains, and we found it very challenging to add new data sources — the patterns are very specific to different log sources. So what we ended up doing was use recurrent neural nets. If anybody hadn't said the words 'deep learning' yet, I felt I should get the pity brownie — but we tried to do this thoughtfully. The reasons we went the LSTM route: first of all, LSTMs are great because they're built for sequential data, and if you're trying to detect login anomalies, nothing works better. They also deal well with scale invariance, which is a fancy way of saying that people don't log in at specific intervals — it would actually be really weird if I logged in every day at 8 a.m. and logged off at 5 p.m. When you have that kind of irregular spacing, LSTMs work really well. So essentially, we took our bulky geo-login anomaly detection system and said: you're too bulky, we can't transfer you — and we used it as a teacher to get labels for our SSH logins, and the LSTM turned out to have really good capacity to mimic it; that's one interpretation of it. So we used the labels from our bulky geo-login anomaly system, and we extracted features like timestamp, user, and location. If you look at the features, they're really not specific to SSH logs — they could be used on network logs, or other types of log sources. We deliberately looked for fields that are common across multiple log sources, and used an LSTM on them. This is essentially what the data set looks like: we take two weeks of login data per user, take those features, and put them in sequence.
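To ground what the LSTM layers compute over such a sequence, here's a single LSTM cell step for scalar inputs in plain Python — tiny fixed weights, purely illustrative of the recurrence, not the production model:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM step for scalar input/state: gates decide what to keep and emit."""
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])    # forget gate
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])    # input gate
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])  # candidate state
    c = f * c_prev + i * g       # cell state carries memory across logins
    h = o * math.tanh(c)         # hidden state is what the next layer sees
    return h, c

# Toy weights, all 0.5 — a trained model would learn these from labeled logins.
weights = {k: 0.5 for k in ("wf", "uf", "bf", "wi", "ui", "bi",
                            "wo", "uo", "bo", "wg", "ug", "bg")}
h = c = 0.0
for x in [0.1, 0.9, 0.2]:   # e.g. one feature per login, in sequence
    h, c = lstm_step(x, h, c, weights)
print(round(h, 4))
```

The key property for login data is that the cell state `c` persists across arbitrary gaps between events, which is why irregular login spacing isn't a problem.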
sequence and the task at hand is is a login ID say time T 2 is that like malicious or is that benign that's essentially what you want to do in the context of SSH logs so you take like those sequence of like law logins and you know this is the feature Iser that's when you extract like user location timestamp and you can see it's very actually pretty shallow like lsdm model so we got three layers and then you get an output for that particular vlog in whether if it's like malicious or benign so if my you know if my login at t2 you know isn't contingent on my login at t1 but also login at t3 so you know if if I log
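To make that concrete, here is a minimal, illustrative sketch of an LSTM-style per-login scorer in plain NumPy. It is a toy under stated assumptions: a single forward-only layer with random, untrained weights, whereas the model described in the talk has three layers and uses context on both sides of each login; the feature encoding is left abstract.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyLSTM:
    """Minimal single-layer LSTM cell, for illustration only."""
    def __init__(self, n_in, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        # one stacked weight matrix for the four gates (input, forget, cell, output)
        self.W = rng.normal(0, 0.1, size=(4 * n_hidden, n_in + n_hidden))
        self.b = np.zeros(4 * n_hidden)
        self.n_hidden = n_hidden

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, g, o = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)       # new cell state
        h = o * np.tanh(c)               # new hidden state
        return h, c

def score_logins(features, lstm, w_out):
    """Return a malicious-probability for each login in the sequence."""
    h = np.zeros(lstm.n_hidden)
    c = np.zeros(lstm.n_hidden)
    scores = []
    for x in features:                   # one feature vector per login event
        h, c = lstm.step(x, h, c)
        scores.append(sigmoid(w_out @ h))
    return np.array(scores)
```

In the real setup, each feature vector would encode fields like timestamp, user, and location, and the weights would be trained against labels produced by the heavyweight geo system acting as teacher.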
If I generally log in from Redmond, but then I see a login from Malaysia, and then a login from Redmond again, Malaysia should stand out, and that's what the scoring phase captures. So how does this look in production? This is something I feel very passionate about. We had to do this in near-real-time, so we use a Databricks cluster and a high-throughput Event Hub, which is a Kafka-like service, and that lets us ingest billions of logins per minute. The state of where each user logs in is stored in Azure Blob storage, so any time I see a user's login I can query the blob very fast inside our production pipeline, and the result is written back to an Event Hub that customers can consume from. First off, I want to say this is still in private preview, which is a nice way of saying it's still in a bit of a research phase, so I don't have metrics to share, but that will change very fast.
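A rough sketch of that streaming loop, with plain Python standing in for the Event Hub consumer, the Blob-backed per-user state, and the model (the `score` function here is a naive stand-in for the LSTM, and all the field names are invented for illustration):

```python
from collections import defaultdict, deque

# In production this would be an Event Hub stream and an Azure Blob state store;
# here a list stands in for the stream and a dict for the per-user state.
login_stream = [
    {"user": "alice", "ts": 1000, "location": "Redmond"},
    {"user": "alice", "ts": 1060, "location": "Kuala Lumpur"},
]

user_state = defaultdict(lambda: deque(maxlen=100))  # recent logins per user

def score(history, event):
    # naive stand-in for the LSTM scorer: flag a location never seen for this user
    seen = {e["location"] for e in history}
    return 0.9 if event["location"] not in seen and history else 0.1

alerts = []
for event in login_stream:
    hist = user_state[event["user"]]
    s = score(hist, event)
    hist.append(event)                    # update the stored login history
    if s > 0.5:
        alerts.append({**event, "score": s})  # written back to an output hub
```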
The cool thing about LSTMs is that so many people have worked on optimizing them for scale and throughput that training takes on the order of seconds, and if you run it in streaming mode, the mean time to detection is also on the order of seconds. Once again, we really wanted to think about shortening the kill chain, and in this specific scenario the login might be the first vector for an attacker, so you want to be really fast at blocking the login or making a determination. At the same time, you don't want to lock people out; you don't want to be the thing standing between a person and the resource they're trying to reach. So keeping this on the order of seconds was super important, and this is something you'll find in Azure Sentinel. Given all that, I'm going to talk a lot about how we protect the service using the mindsets we had. Our rallying cry soon became: we do not want analysts to triage individual alerts, we want them to look at incidents. Say I have three independent alerts: one for an anomalous DLL being loaded, one for some new process that started, and another, separate alert for some sort of large transfer from a SQL server. That really doesn't tell a story, and you're just adding to the analyst's triage burden. What we really wanted to see was whether there's a way to extract a story from these alerts in the context of a host: there was an anomalous DLL, there was a suspicious process, and here's a large SQL transfer that happened on this particular SQL box. So we tried taking existing approaches like the ones you've seen earlier and seeing whether we could do this at cloud security scale, and for us those approaches did not work, just because
of the volume of data that we had, and because our entire goal was to do this not at the host level alone, but at the service level, host level, and network level combined. Unlike previous approaches, where the dataset expertise sat in one place and I could talk to folks from different teams to understand it, in Azure Sentinel, where we had to add product value, there are multiple different data sources. For Microsoft products I can go talk to people internally, but customers also use AWS; that's a fact of life, and what that means is we have to understand how anomalous activity manifests in AWS logs. If you just look at the network level, you've got Palo Alto, Cisco, Barracuda, and even on the endpoint we looked not only at Windows Defender, which we love, but also at Symantec and CrowdStrike. So our rallying question became: given that a customer has so many different products, is there a way for us to stitch multiple alerts from multiple products into some sort of cohesive story? That's what this case study tries to cover. I'm first going to give you the 50,000-foot overview and
then I'm going to drill a little bit into the details. The input to our system is raw events from all of these products, and that by itself is in the trillions range, because each product emits a voluminous amount of raw events. Those raw events get converted into anomalous behaviors, either through out-of-the-box alerts (say, the anomaly alerts that come with Azure Information Protection) or through something custom we built on top of that, and that's in the billions range, which is really not actionable. So, as I'll cover on the next slide, you've got these alerts and anomalous activities that aren't actionable on their own, and then we use this concept called a probabilistic kill chain, which I'll talk about in just a minute, and that helps us drastically reduce the number of security-interesting cases we ticket to customers. We then do one more round of scoring so that we end up with only a handful of cases. All these metrics, by the way, are per month. The goal of this project is not the Cheesecake Factory, where you've got multiple assortments; this is more artisanal and handcrafted (I don't want to say hippie, but I'm in Seattle, so I can say that), customized for a particular tenant. The expectation is that it's not going to fire every day, but when it fires, it's DEFCON 1: you want to put your biggest resources on this alert. We really wanted to focus on reducing alert fatigue. I know that has drawbacks, for instance you might alert less than you should, but we made a decision that if we're taking time away from an analyst to look at an alert, we want it to be productive, and we
want it to be useful to them. So why does this work? I'm going to do a deep dive on the last two boxes. The first thing we do is construct a graph where a vertex is an entity: it could be a username, an IP address, a VM, a host, it doesn't matter. An edge is any sort of connection between entities, so if Ram is logging into a VM, you should see edges connecting Ram, the VM, and another node for the IP address. The events behind this come straight from Microsoft and partner security products; we haven't done anything here other than construct the graph, and the graph itself has billions of nodes and edges. What we essentially did was prune this graph using what we call a probabilistic kill chain. What we understood by talking to analysts is that the blue team kill chain I showed you at the beginning is a static kill chain: it assumes that after the attacker compromises a box, she's only going to do lateral movement or move to the next step.
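The graph construction just described can be sketched minimally like this, with a plain dictionary as the adjacency structure and invented event fields:

```python
from collections import defaultdict

# adjacency: entity -> set of connected entities; an entity is a (type, value) pair
graph = defaultdict(set)

def add_event(event):
    """Each raw event links every entity it mentions to every other one."""
    entities = [("user", event["user"]), ("host", event["host"]), ("ip", event["ip"])]
    for a in entities:
        for b in entities:
            if a != b:
                graph[a].add(b)

# e.g. Ram logging into a VM from some IP yields a small clique of three nodes
add_event({"user": "ram", "host": "vm-42", "ip": "10.0.0.7"})
```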
It does not account for the fact that once she compromises a box, she might go on to compromise other boxes. So we wanted to bring an element of probability to it, and we started thinking about what it means to make a kill chain probabilistic. What we found from blue team folks especially is that not all kill chains are treated equally. For instance, if I see an alert where the kill chain has progressed all the way to exfiltration, that's probably a bigger bang for the buck than an alert for just reconnaissance, which can be a one-step chain. So complete kill chains are preferred over incomplete kill chains; that was our first criterion. The second was to be time-bound: if I see an event for an unusual process, but all the preceding anomalous logins happened two years back, it's really not that interesting, but if I see an anomalous process and, right before it, an alert for malware on the same host, that's way more interesting to an analyst. So we put time conditions on how we prune the graph. We also look for commonalities: even though each graph is constructed in the context of a particular tenant, if the same kill chain manifests across different tenants with the same profile, that's probably interesting. Probably there's a campaign against a particular type of industry, which an analyst definitely needs to know about. We essentially use these as priors to regularize the graph we constructed, pruning away any uninteresting noise coming through it, and that leaves us with the interesting subgraphs. We also made a decision to do one more round of scoring.
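The three pruning criteria described above (completeness, time-boundedness, and cross-tenant commonality) might be sketched as a heuristic like this; the stage names, the 1.5x prior, and the seven-day window are all invented for illustration:

```python
from datetime import datetime, timedelta

STAGES = ["reconnaissance", "persistence", "lateral_movement", "exfiltration"]

def chain_score(alerts, seen_on_other_tenants=False, max_gap=timedelta(days=7)):
    """Score a candidate kill chain: prefer complete, recent, cross-tenant chains."""
    alerts = sorted(alerts, key=lambda a: a["time"])
    # 1. completeness: chains that progress further down the kill chain score higher
    depth = max(STAGES.index(a["stage"]) for a in alerts) + 1
    score = depth / len(STAGES)
    # 2. time-bound: a long gap between consecutive alerts makes the chain stale
    for prev, cur in zip(alerts, alerts[1:]):
        if cur["time"] - prev["time"] > max_gap:
            return 0.0
    # 3. commonality: the same chain on other tenants of the same profile is a prior
    if seen_on_other_tenants:
        score = min(1.0, 1.5 * score)
    return score
```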
The reason is that including domain knowledge is fantastic, but we also wanted to include the labels we had from our previous cases. So we took features like whether this attack shows up across different tenants and the number of high-impact activities across the graph, and we did one more round of scoring on top of these interesting subgraphs to reduce the noise further. So this is how it looks today: the first thing to notice is that classification still takes on the order of hours, not the order of seconds like the login system earlier, and we're trying to drive that down.
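That second scoring pass might look like a simple logistic scorer over the features just mentioned; the weights here are hand-set for illustration, whereas the real ones would be learned from the labeled historical cases:

```python
import math

# hypothetical weights; in production these would be fit to labels from past cases
WEIGHTS = {"bias": -3.0, "seen_on_other_tenants": 2.0, "high_impact_alerts": 0.8}

def rescore(subgraph_features):
    """Logistic-regression-style score for one 'interesting' subgraph."""
    z = WEIGHTS["bias"]
    z += WEIGHTS["seen_on_other_tenants"] * subgraph_features["seen_on_other_tenants"]
    z += WEIGHTS["high_impact_alerts"] * subgraph_features["high_impact_alerts"]
    return 1.0 / (1.0 + math.exp(-z))
```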
We get billions of alerts per day, and for that particular volume, coming from multiple data products, we're gated on the throughput of those products, but we're still able to keep it within the order of hours. I have about five minutes, but I'll show you a very quick demo right here. This is what a case looks like in Azure Sentinel. Let me pull up the details: there was an anomalous login followed by a suspicious Office 365 mailbox forwarding rule. Let me walk through what we found at a
bank headquartered on the East Coast. We first saw a suspicious login from Germany, which by itself is not that interesting, because people travel, but subsequently, within a short period of time, we saw that person's mailbox get forwarded to an external Gmail address, which is super weird. Because these two events came from multiple different services, we were able to pull them together and stitch them into a story, and this is what an analyst would see: the classic case of taking two yellow alerts, which by themselves may not be interesting, and producing that one high-severity red alert. One of the big advantages of having that graph representation shows up when the user clicks the Investigate button. Here are the two alerts, and here's the user entity, and you can get any related alerts for that user entity, because we're pivoting on what we already have in our graph back end. I can switch over to the timeline, but actually, let me show you this first: for the user under investigation, I can see all the alerts related to that particular user. We're able to display this relatively fast because it's already part of our graph back end, and for a particular entity you can also do things like show their Office activity, the services they created, and the user account sign-ins that failed. For any investigator, it delivers that experience from our back end when they click that big blue Investigate button. I'm going to wrap up in the interest of time so we can have questions. The most important thing I want you to take
away is this: if you're protecting the cloud, the mindsets we found useful are to think about the enormous volumes of data you have, and to think about the differences in architecture between on-premises and the cloud. There's a lot of interesting debate out there about whether machine learning can even help in cloud security, and we obviously feel very strongly and positively that it can. Hopefully you've taken away how we drove down time to detection, how we were able to increase the true positive rate, and how we were able to scale into the billions range. I've put some resources in here, and I'll send the slides out, but if you have any questions, please let me know. You can always email me, or I'll be hanging out over here, super happy to help with any of this. Thank you. [Applause]

We've got about three minutes left, so we can only take a few questions.

You mentioned that you were doing that probabilistic kill chain for subgraph or graph pruning. I'm curious whether you looked at other methods, like the unified kill chain that leverages the ATT&CK matrix, or whether there were gaps you found with the kill chain, especially given that you want to weight the later stages more than the earlier ones, and how else you would have dealt with that.

It's a great question. I think one of the deficiencies of the ATT&CK techniques for us was that we did not find them mapping directly to cloud-based attacks and service-level attacks. I know that's rapidly changing, and in fact folks on my team are working on that, but that was our biggest deficiency: we could not find anything cloud-based that we could directly reuse.

You mentioned that if a simple model works, use it, and you don't need to overcomplicate the problem. But I've seen a lot of cases where people who are getting started in machine learning
can run a model, and the model produces something, but they may not know whether the simple model was good, or whether it was too good and they've got some overfitting problem. How does someone who's starting out know whether the simple model is actually good?

It's a good question. The way we convinced ourselves that simple models work is partly engineering: code maintenance, integration with legacy code, being able to do things like unit tests and code coverage on your pipeline. You get all of that goodness by using simple models. But how do you know a simple model is good enough? We have really good program managers, and we get good business metrics from them; if our system is able to meet or exceed those business metrics, checkbox. For the overfitting problem, when you're evaluating, take into consideration the diversity of your tenants: if all your training data was only from, say, small SMBs, but then you have to put this into production on data from Fortune 500 companies, you've got a data distribution disparity. I would say collect very different sorts of datasets and test on them; that would be my advice. Anyway, we've got to stop, but I'm going to be hanging out outside for a couple of minutes, though I also want to attend the next talk. So yeah, I'm super happy to help. Thank you. [Applause]