
My name is Salma Taoufiq. I'm a senior data scientist with Sophos, and I'm here today to present one of the projects we've been working on within the Sophos AI group: automating false positive reduction with real-time behavioral analytics. Before we jump into the presentation, I'd like to give a shout-out to all my colleagues who have worked on the bigger effort this project is part of, involving data cleaning, pre-processing, labeling, as well as research and brainstorming: Ben Gelman, Denis Vosnyuk, Dora Szabo, Konstantin Berlin, Tamás Vörös, who will actually be presenting another project this afternoon, and, last but not least,
Victor Hollow. Before we dig into any details, here's an outline so you have an idea of what we're going to talk about. First, we'll go over the problem statement, what it is that we're trying to solve: we'll briefly explain what a security operations center, SOC for short, is, as well as the problem of alert fatigue, which is exactly what we set out to solve. Then we'll go over our proposed solution, which comes in the form of a machine learning model that uses historical contextual features. And finally, we'll go over our evaluation setup and the results we've obtained. So, what is the problem that we're trying to solve?
First of all, it's important to understand what a security operations center, a SOC, is. It's a unit that brings together three main building blocks, people, processes, and technology, with the objective of defending an organization from whatever cybersecurity threats and incidents might be out there. It's also worth mentioning that a SOC can be tasked with defending the organization it is part of, or it can be offered as a service by one company to many other customer organizations that subscribe to it, in which case the SOC has to defend all of them. To give a very generic picture of the SOC workflow: first, there are
the customer endpoints, meaning their machines, desktops, laptops, everything, and the SOC collects data from them. Every single event that happens on those endpoints, every process that runs, everything is logged, and usually the data collection step also unifies the formats of these logs for easier processing down the line. After that, every logged event has to go through a phase of monitoring via security sensors, or detectors. This layer of detectors is usually rule-based or heuristic-based, engineered to leverage the domain expertise of security analysts in capturing what might be malicious. Now, every single event that comes in has
to go through these detectors, and a detector can be as simple as a rule that says: if a process is writing to multiple files within a very short time span, notify on that, because there may be a problem underway. It could be ransomware running, or it could just be a network administrator modifying many files very quickly. These detectors really try to cast a wide net, to capture even needle-in-the-haystack attacks. After our logged events go through the security sensors, we get alerts: our events with some added information from the detectors. One of the most important pieces of that information is the severity score, a value given by the detector to the event saying "this is how bad I think this might be." It can be a value from one to five, one to ten, whatever the SOC picks. Once we have these alerts, there's usually a thresholding step, where the SOC uses some empirically found threshold: for instance, with severity scores ranging from one to five, they may pick three, so that anything scoring higher than three requires analyst attention. After this thresholding, we keep only the severe alerts, those scoring higher than three in this example, and those are the ones that go to the security analysts.
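To make these two stages concrete, here is a minimal sketch of a rule-based detector followed by the severity-thresholding step. The event fields, the 50-files-in-10-seconds rule, and the severity values are all hypothetical illustrations, not Sophos's actual detection logic.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Alert:
    event_id: str
    detector: str
    severity: int  # illustrative 1-5 scale

def many_file_writes_rule(event: dict) -> Optional[Alert]:
    """Toy detector: notify when a process writes to many files in a short span.
    The field names and the 50-files/10-seconds rule are made up for illustration."""
    if event["files_written"] >= 50 and event["window_seconds"] <= 10:
        # Could be ransomware encrypting files -- or an admin doing bulk edits.
        return Alert(event["event_id"], "many_file_writes", severity=4)
    return None

def keep_severe(alerts, threshold=3):
    """The thresholding step: only alerts scoring above the SOC's
    empirically chosen threshold go on to the analysts."""
    return [a for a in alerts if a.severity > threshold]

events = [
    {"event_id": "e1", "files_written": 120, "window_seconds": 5},  # bulk writes
    {"event_id": "e2", "files_written": 3, "window_seconds": 60},   # normal activity
]
alerts = [a for e in events for a in [many_file_writes_rule(e)] if a is not None]
severe = keep_severe(alerts)
print([a.event_id for a in severe])  # only e1 survives the wide net + threshold
```

Note how the rule fires on both the malicious and the benign interpretation of bulk file writes; it is exactly this deliberate over-breadth that produces the false positives the talk is about.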
Analysts are humans; it's their job to sit down and look at these alerts, to investigate what happened before and after the event: is this really something bad, or is it actually just legitimate activity? Now, it's important to think about the amount of data the SOC has to deal with. On any given day, they collect millions of events, coming from all the endpoints of all their customers, possibly all around the world. Those can end up triggering tens of thousands of detector rules, and that finally leads to hundreds of to-be-inspected alerts, with more than 50% of them being false alarms. This is well documented by multiple studies and surveys on SOCs, and it means that because those detectors cast very wide nets, even with the severity thresholding we end up with a lot of severe alerts that are actually nothing: someone was doing their normal work on the endpoints, and nothing bad was happening. That is precisely the issue, and it's what we call alert fatigue: the number of alerts generated far exceeds security analysts' capacity to deal with them, which makes their job absolutely exhausting and demotivating. To illustrate this, here's a plot over a one-month period, from March 2022 to April 2022, where every single bar represents the alert count for that
day. We are not disclosing the exact values because we consider that sensitive data, but the most important takeaway from this plot is that, for every single day, the dark orange part of each bar is tiny compared to the light orange part, and the light orange represents the false alarms. The height of a bar is the amount of work the analysts have to put in, the number of alerts they get, but only a small proportion of those are real threats they should be looking into. Analysts' time and energy is wasted on this deluge of irrelevant noise, and that runs the risk of actual threats going unnoticed. For us, determining which alerts and incidents can be ignored, so reducing that light shaded area, is as crucial as deciding which ones are actually important and worth investigating. Our proposed solution comes in the form of an intelligent safeguard system: a temporal model that uses machine learning to distill the critical alerts and suppress the false alarms. This model can be implemented on top of existing detection pipelines, because it acts on alerts: regardless of which detectors a SOC may have, it can benefit from such a solution. The model is supposed to act as a shield; it comes between the severe incoming
alerts and the analysts, and it acts as a filtering funnel: we take all of the incoming alerts, with the noise and the false alarms and everything, and we try to bubble up the interesting cases. The pipeline is very simple. Like I said, we act on severe alerts, so that's our starting point. After that, we extract features, build our feature vectors, and pass them through an XGBoost model, which scores them. Those scores can later be used either to eliminate false positives, if we find a good threshold for that, or to prioritize alerts by importance, so that analysts know that what they see at the top of the queue are the important alerts, and what's at the bottom is quite low priority. The most important step here is the feature extraction. We compute features over different time windows, ranging from one second up to a week, and we compute them on the basis of a variety of predicates: some features are computed per customer, some per endpoint, some per detector. As two examples of the many features we compute: the number of customers for which a given detector has fired over the previous hour, or minute, or whatever time window; and the number of alerts encountered on a given endpoint over the previous day. We design these features so that they can capture signals like endpoint vulnerability, customer estate size, and detector firing-behavior patterns. For instance, say a new detector is added to the appliances already out there; it can happen that this detector is too broad a rule, or just poorly tuned, human error, these things happen. As soon as that detector is deployed, it starts firing across all the customer estates, across the whole customer base, all around the world. So it's really important for us to capture these things, so that after anchoring an alert in the global context we can tell whether it is actually important or not.
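The windowed, per-predicate counts described above can be sketched in a few lines. This is a toy illustration only: the keys, window sizes, and feature definitions are made up and stand in for the real feature extraction, which the talk does not detail.

```python
from collections import defaultdict
from bisect import bisect_left, insort

class WindowedCounts:
    """Toy version of the contextual features described above: for each key
    (e.g. a detector ID or an endpoint ID), count how many alerts were seen
    in the trailing window ending at the current alert's timestamp.
    Key names and window sizes here are illustrative, not Sophos's."""
    def __init__(self):
        self.times = defaultdict(list)  # key -> sorted alert timestamps (seconds)

    def update_and_count(self, key, ts, window_seconds):
        insort(self.times[key], ts)
        lo = bisect_left(self.times[key], ts - window_seconds)
        return len(self.times[key]) - lo  # alerts in [ts - window, ts]

feats = WindowedCounts()
HOUR, DAY = 3600, 86400
# e.g. "number of alerts on endpoint ep1 over the previous day"
print(feats.update_and_count("endpoint:ep1", ts=0, window_seconds=DAY))      # 1
print(feats.update_and_count("endpoint:ep1", ts=1800, window_seconds=DAY))   # 2
print(feats.update_and_count("detector:d7", ts=1800, window_seconds=HOUR))   # 1
```

Computing the same count over many window lengths and many predicates (customer, endpoint, detector) yields the feature vector that is then fed to the gradient-boosted model.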
To evaluate how this system fares, we took a dataset of real enterprise alerts from Sophos, collected in collaboration with security and data analysts to refine the labeling and get a clean dataset. The dataset spans a six-month period, and we do a time split: the first five months are our training set, ranging from October 2021 to March 2022, and our test set goes from March 2022 to April 2022. These alerts come from over 3,000 customers and over 15,000 endpoints, or hosts. On the left-hand side we can see the label distribution for the two sets, training and test, and we can readily see that the positives are a smaller proportion than the negatives. This points back to the problem we were talking about: of the severe alerts that analysts have to look through, most are actually nothing. To evaluate the model, we look at two different aspects: the classification performance,
using metrics such as the receiver operating characteristic (ROC) curve with its area under the curve (AUC) and the precision-recall curve; and a simulated deployment of the system, to see how it would impact the analysts' workload. First, the ROC curve. It plots the true positive rate versus the false positive rate at different decision thresholds. What that means is: our temporal model outputs prediction probabilities, values from zero to one that capture how probable the model thinks it is that a particular severe alert is worth an analyst's time, and you can pick different decision thresholds to translate that probability into a decision: yes, this is important, or no, it is not. With the ROC curve, we plot, at every possible decision threshold, our true positive rate and false positive rate, and one of the most important summary measures is the AUC, the area under the curve, literally the area between the curve and the x-axis; the closer to one we are, the better we're doing. The purple diagonal line is the current situation, where no model is deployed and analysts have to look at every single severe alert: that's an AUC of 0.5. With our temporal model, by contrast, we reach an AUC of 0.81 on this particular dataset, so we're doing much better, and the model is really able to distinguish between the positives and the negatives. The other plot is a precision-recall curve: it plots precision versus recall. Precision is the positive predictive value; it tells us, out of everything the model thought was positive, how many were correct. Recall is essentially the same as the true positive rate.
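As a sketch of how the headline metric works, here is a minimal ROC AUC computation over model scores and binary labels, using the rank-sum identity. The labels and scores are made up; in practice a library routine such as scikit-learn's `roc_auc_score` would be used.

```python
def roc_auc(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney) identity: the probability
    that a random positive is scored above a random negative.
    Assumes no tied scores, which keeps the sketch short."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank_sum = sum(r + 1 for r, i in enumerate(order) if labels[i] == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# A scorer no better than chance hovers around 0.5 (the "no model" diagonal);
# a good model pushes toward 1.0.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.2, 0.8, 0.9]
print(roc_auc(labels, scores))  # ~0.89 on this toy data
```

The 0.5 baseline quoted for the "no model" scenario is exactly what this statistic gives when scores carry no information about the labels.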
Here we can see that we reach an AUC of 0.63, and even at higher recall values we achieve a precision much higher than the current situation, whose baseline precision is just 30%. Now, to look at this in a more tangible way, here's a plot similar to the one we saw earlier. Earlier we saw what the daily life of analysts looks like without the model; that's the orange scenario here, and in blue we have what reality would look like with our model implemented. The first striking thing is that, over that one month of data, the alert volume for every single day is much lower with our model deployed, so we can say, undoubtedly, that yes, we are able to reduce the amount of work the analysts have to do. But the more striking thing is that we do this while capturing most of the relevant alerts: the reduction in alert volume comes at the expense of the false alarms. That can be seen by comparing the dark orange and the dark blue: on most days they meet, and those are the critical alerts, which we are able to capture. The lighter orange and lighter blue, by contrast, are the false alarms, and we are able, with
our model, to reduce those significantly, especially on days like the fourth from the left and the fourth from the right, which show extremely high peaks of false alarms, days of catastrophic false positives. Upon analyzing such days, we found that it's usually indeed a poorly tuned detector that starts firing over many customers' estates, or over one specific customer, or maybe a network administrator was doing some tasks on that particular day and the detectors went crazy on it. We can see that on those days, where it's very crucial to reduce the alert volume, our model is really able to do so. For a slightly different perspective on the same data points, we can look specifically at the percentage of false alarms we're removing, the gray bars: on average, every day, we remove about 40% of the false alarms. At the same time, the blue bars show the recall, the detection rate or true positive rate, and it's quite high on every day; and we can tune the decision threshold we pick for our model to improve upon this recall.
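One simple way to tune the threshold for a target recall, as mentioned above, is to set it from the scores of the known positives on held-out data. The scores, labels, and the 75% target below are illustrative, not the values used in the deployed system.

```python
import math

def threshold_for_recall(labels, scores, target_recall):
    """Return the highest decision threshold whose recall on this
    validation data is at least `target_recall`. A higher threshold
    suppresses more false alarms but risks missing critical alerts."""
    pos_scores = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    # Keeping every alert scoring >= pos_scores[k-1] captures k positives.
    k = math.ceil(target_recall * len(pos_scores))
    return pos_scores[k - 1]

labels = [1, 0, 1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.7, 0.4, 0.6, 0.1, 0.5, 0.3]
t = threshold_for_recall(labels, scores, target_recall=0.75)
kept = [s for s in scores if s >= t]
recall = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t) / 4
print(t, len(kept), recall)  # threshold 0.6 keeps 3 of 8 alerts at 75% recall
```

Setting the target to 1.0 reproduces the trade-off discussed in the Q&A: no critical alert is missed, but far more false positives get through.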
Now, to see what the model scores mean when set against the analysts' notes, we ranked the alerts in our test set by the model's prediction probabilities. If we take the top, we see the alerts the model is quite confident are important and relevant and should be looked at, and we can compare them to the analysts' notes. Here, just from the sheer length of those notes, we can see that real work was done by the analysts: they had to carry out an investigation and an analysis to really look into this data. By contrast, we can look at the alerts that got a very low probability from the model, and those are actually the true negatives, things that are truly just noise. If we look at the analysts' notes here, they just write "this is benign activity," or "this is a known FP," we've seen it before, it's a false positive, we know it. So they had to look into the alert, but then realized there was nothing there, and the model is able to find those. In the third alert here, we can see a case they hadn't seen before and had to investigate; the note then says
that, based on this evidence, they believe this to be a false positive. So the model was able to capture this one as well. Now, to look at the critical alerts we are missing, and whether we need to tune the threshold higher and so on, we can look at the false negatives: things that are critical, but where the model is not so confident that they are critical. Looking at the notes here, yes, the analysts had to do some investigation, but for most of these, from our analysis, they contacted the customer to double-check that it wasn't some pen testing or anything like that, and the customer never responded, so we couldn't really say whether these were bad or benign; the model was also a bit doubtful on these ones. In order to close the gap on these false negatives, we are working on tuning this model, monitoring it every single week and retraining it on new data, but we are also working on a different model, and that's one of the next steps we've actually already started. We are looking to take this model, which uses contextual data, and fuse it with a model that looks at the alerts themselves. Then we have two orthogonal models using orthogonal signals, and by combining them we may be able to get even better results. In fact, so far our experimental results have shown that we reach a much higher ROC AUC with this ensemble, sometimes by plus seven percent. Like I said, this is an ongoing project, so we are continuously monitoring the results week by week, as we are also working on refining the labeling and so on; we really want to keep monitoring how the model is doing as the detector technology evolves and changes over time. And finally, we are working on a research
paper that we hope to publish this year, summarizing all of our results with the stateful model that uses information from the alerts and this model that uses information from the context. So, to conclude: current detector technology appliances are rule-based and heuristic-based, as we said earlier, and they are notoriously error-prone. The consequence is that it ends up being extremely time-consuming, exhausting, and very expensive for security operators to manually investigate every single incoming alert. We propose a system that can be implemented agnostically, regardless of the existing detection workflows, and that can automatically reduce the number of to-be-inspected alerts while ensuring that the really critical ones are brought to analysts' attention and bubbled up; and while doing so, it eliminates a significant portion of the time wasted on false alarms. So, that was it. Thank you very much, and if you have any questions or want to learn more about what we do at Sophos AI, please ping me, or check out our microsite. Thank you very much. [Applause]
Many thanks. Anybody with any questions? In the meantime, the contact information is up on the screen. We have a question there, just a moment, please.
I really enjoyed your presentation, and I have a question: what's the trade-off, or what's the point at which you'd say this model can be applied and used to reduce the false positives, accepting the possible false negatives, and take that risk to
reduce the fatigue for the employees? So, that's a great question, because that's something we're thinking about: what are we risking by missing some of the actual critical alerts versus reducing the alert fatigue? What we're doing now, because this is still in the development phase, like I said, is trying to use another model that brings in completely new signals, not seen by this model, and we've already found that it's able to really improve the model's results. Also, if I just go back quickly to this plot: we can tune which decision threshold we use, and we can pick one that sits at a very high recall. The recall tells us what percentage of the critical alerts we are catching, so if we pick, for example, the threshold for 100% recall, then we are sure we're not missing any, but at the same time we're also letting in many of the false positives. So, like I said, our approach is first to try to improve on these models' results, and also, maybe, to frame this as a prioritization system rather than a system that suppresses alerts: it would just rank them, so analysts still have to look at all of them, but it's easier, because they know that the ones at the top are the ones we confidently think are bad, though they still have to look at the others. Awesome, thank you. Sure.
Is there anything else from anybody at this time?
It would appear not, Salma, so once again, thank you. Thanks, everyone. [Applause]