GT - Making & Breaking Machine Learning Anomaly Detectors in Real Life - Clarence Chio

Name: GT - Making & Breaking Machine Learning Anomaly Detectors in Real Life - Clarence Chio
Uploaded: 2016-12-14
Duration: 57 min 44 s
Description: GT - Making & Breaking Machine Learning Anomaly Detectors in Real Life - Clarence Chio Ground Truth BSidesLV 2015 - Tuscany Hotel - August 04, 2015



Hi, I'm Clarence. Thanks for coming to this talk. Do we have any machine learning experts in the house? Cool, that's good to know, so if I say anything wrong, feel free to stop me. I graduated last year in machine learning, so I'm still new in the industry, but I've done some research, and this is an area I really love talking about because it sparks off lots of interesting conversation. That's why I'm here. There's a lot of hype surrounding machine learning. People talk about it a lot, but not many people really know what to do when it comes to implementing machine

learning pipelines and using machine learning in their specific industry. We're in the security field, so of course there's talk about using machine learning in the security industry, and a lot of companies have been started on machine learning, using deep learning and classification to detect anomalies and malware. There's a lot of interesting work in this area, but there are also a lot of pitfalls that people fall into, which causes confusion and gives us a lot of false positives as to whether we're actually achieving success in this area. So the goal of this talk is not to

give an account of how you build a machine learning anomaly detector or a machine learning pipeline for security. It's really to give an overview of how machine learning anomaly detectors work, to spark some discussion on when, where, and how you should use these things, and to show how to exploit these systems, because they are far from secure and machine learning is not something that was designed for security. Lastly, we're going to discuss where we go from here, because there are definitely lots of possibilities going forward in machine learning and security. It's not a dead end, and there are lots of interesting

developments even today. So, anomaly detection and machine learning: what's the difference? I would say anomaly detection has a role to play in machine learning's timeline. There is heuristics-based anomaly detection, where you use a rule engine; WAFs, for example. You define an access pattern for an IP address, look at the number of requests that this IP address made to a particular web server in the last 10 minutes, and if it exceeds 10, then maybe that triggers a rule. There are also machine learning anomaly detectors, where these rules are more dynamic, and it potentially

can help catch people who are flying under the radar and making use of the static nature of heuristics to bypass your rules. So predictive machine learning is really the intersection between machine learning and anomaly detection. As I mentioned earlier, heuristics and rule-based anomaly detection have been the norm for the last 20 to 30 years, but because of how easily adversaries can bypass them, people have been looking to machine learning techniques to try to implement defenses that are less easy to bypass and less easy to capture in a single statement. Intrusion detection, for the purposes of this talk, won't be anything too specific. It's really general:

anything that can be described as an adversary trying to gain access to your infrastructure, for example sending spam into your email server, committing fraud on payment accounts that you own, or sending a DoS attack at your web server. That's considered intrusion detection for the purposes of this talk. Again, there's lots of debate to be had there, and if at any point something is unclear in the talk, feel free to stop me and ask a question. As I mentioned earlier, machine learning has had lots of interesting and successful applications in the last few decades, during which it has seen a huge gain in popularity. Gmail

famously uses it for spam detection. This particular example that I'm showing here is one that's actually pretty difficult for a Bayesian model to catch, because it's using some form of ASCII art instead of plain text. That's harder to catch, but surprisingly Gmail is able to catch it as spam. Also, for any generic time series data, like what you see in the top right-hand corner, you can run forecasting algorithms that use very simple statistical methods to alert the admin when there's an anomaly in the series. Credit card companies have famously been successful with machine learning as a way of detecting fraud and

when a credit card has been stolen. And I work for a company that tries to detect whether the party on the other end of the browser is a human or a script, so that's another popular area in which machine learning can be used. How do we find anomalies in these? Let's say you have a web server and you have a bunch of logs. It's not entirely clear how you would mine this data and generate a bunch of metrics that you can perform classification on, for example. Feature engineering is really the bulk of the work in machine learning, as many people would tell you.
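A minimal sketch of what mining web server logs into metrics could look like. The log format (timestamp, client IP, URL path), the sample lines, and the two features are all assumptions made for illustration; none of them come from the talk.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical log format for this sketch: "<ISO timestamp> <client IP> <URL path>"
SAMPLE_LOGS = [
    "2015-08-04T10:00:05 10.0.0.1 /login",
    "2015-08-04T10:01:10 10.0.0.1 /login",
    "2015-08-04T10:02:30 10.0.0.2 /index",
    "2015-08-04T10:09:59 10.0.0.1 /admin",
    "2015-08-04T10:12:00 10.0.0.1 /login",
]

WINDOW_SECONDS = 600  # aggregate requests into 10-minute windows
EPOCH = datetime(1970, 1, 1)


def extract_features(lines, window_seconds=WINDOW_SECONDS):
    """Turn raw log lines into numeric features per (window start, client IP)."""
    request_counts = defaultdict(int)
    seen_urls = defaultdict(set)
    for line in lines:
        stamp, ip, url = line.split()
        ts = datetime.fromisoformat(stamp)
        # Round the timestamp down to the start of its window.
        bucket = int((ts - EPOCH).total_seconds()) // window_seconds
        window_start = EPOCH + timedelta(seconds=bucket * window_seconds)
        key = (window_start.isoformat(), ip)
        request_counts[key] += 1
        seen_urls[key].add(url)
    return {
        key: {"requests": request_counts[key], "distinct_urls": len(seen_urls[key])}
        for key in request_counts
    }


features = extract_features(SAMPLE_LOGS)
for key in sorted(features):
    print(key, features[key])
```

Each row of the result is one candidate input for a classifier or a threshold rule. Choosing which aggregates to compute is the feature engineering work the talk refers to.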

The actual performing of the machine learning is maybe 5 to 10% of the work. A lot of the work is cleaning the data that you get and feature engineering. Because of how hard feature engineering is, methods that allow you to do feature engineering automatically, like PCA and feature selection algorithms, have been increasingly popular, and we'll discuss these later. So why are machine learning based techniques so attractive compared to heuristics? For one, they're adaptive. As mentioned earlier, adversaries can't just devise an attack strategy that flies under the radar such that you can't detect them. They're more dynamic

in the sense that the model adapts to the traffic, especially when you have a model that goes through online learning. Online learning in this context means that you have a model that goes through constant and ongoing learning, so the model adapts to your current traffic. Why would you want to do that? For example, if you ran a website and changes in its popularity resulted in changes in your web traffic model, then you want to account for these and not flag them as anomalies. How would you account for these? By training your model periodically, such that you would be able to get a signature of

normality. Whether the signature of normality exists or not, we'll discuss later. Machine learning techniques also require minimal human intervention. Theoretically, at least, that's the goal, but we'll see whether this actually holds given the challenges that we face in this area, especially in the security field. Of course, machine learning is not a silver bullet. Many people who have read about machine learning online or heard a lot of people talking about it seem to think that it's this magical black box that can tell you when something is going right or going wrong. In my previous

project, I was working on a machine learning platform that would listen to the sounds your bicycle chain made and tell you predictively if your bicycle was about to break down. Without even understanding how the actual thing worked, we had investors coming up to us and asking whether we needed money for this. We found that really scary, because obviously they weren't interested in understanding the problem. There's a lot of hype surrounding this. I've even had founders come up to me and say that, you know, we actually don't use machine learning that much in our product, we rely mainly on rules, but

saying that we use machine learning helps us get a lot of extra funding. So that's definitely something to be careful about. On the other side of the argument, why are threshold and rule-based heuristics good? For one, they're easy to understand. If something goes wrong and a rule is triggered, you know exactly why the rule was triggered, and it's easy to reproduce. They're simple and understandable for the same reasons, and they can also be dynamic and adaptive, which we'll discuss later. The crux is that they make things easy to understand. So, successful machine learning applications: let's look at some of these and see how they differ from what we're talking about today.

Amazon uses machine learning for recommendation engines. Others that use recommendation engines pretty well are Netflix and YouTube. They recommend things they think you'll like by building a profile of you and matching you with other people who have a similar profile and with what those people have liked. Gmail spam we mentioned earlier, just using a simple Bayesian technique. Of course Gmail isn't using a simple Bayesian technique today; they're using much more complex methods, from what I know, but pretty much the idea is there. This is a pretty interesting application. It came out of Stanford a couple of years ago, and it's basically doing deep learning with a recurrent neural

network on a sentence, and it's able to tell you whether the sentence has a positive or a negative connotation. It's surprisingly and scarily accurate, and I urge you to go try it online. Just look up deep learning by Richard Socher. So we want to set some expectations for what machine learning can do for us and what it can't. In the context of anomaly detection, the idea is that if someone were to formulate the problem and the solution, you want to find anomalies in data, you want to be able to form a model of normality, if it exists, and then you want to be able to alert the admin if

anything deviates from this model of normality. There is a big problem with machine learning and anomaly detection, and I'm not able to pinpoint exactly why it exists. There has been a lot of machine learning anomaly detection research in the last couple of decades. From the '70s, '80s, and '90s onwards there have been a lot of papers written in this area, and I know because I looked them up and tried to read through all of them. There are a lot. But why aren't there more successful systems being used in the real world? We haven't heard of ubiquitous anomaly detection systems that have used machine learning in a way

that can solve a problem much more successfully than a heuristics-based one. Why? Here's my proposal. Traditional machine learning is meant for identifying patterns, for learning similar things. In the case of a recommendation engine, you can identify patterns in how users like certain things and then recommend similar things to other users who are like them. This is the act of finding similarities. Anomaly detection is about finding anomalies, and the paradox behind this is that when you're finding an anomaly, you train the negative example on a very small subset of the large positive example data space. This is something that machine learning wasn't really designed

for, and that's why, when you're training a classifier and trying to generate the model of normality, it's hard. The second thing is that machine learning is based on reinforcement learning; a lot of it is based on trial and error. For example, when you have a large data set and you want to train a model to be more and more accurate over time, you rely on the feedback loop to increase the accuracy of this model. This works in Gmail: if a piece of spam comes through to your email, and you see that it's spam and wonder why it's in your inbox,

you just report it as spam to help improve the model in general, and that's Gmail's feedback loop. Amazon's and Netflix's feedback loop is more passive, but it still exists: if you don't click on the things that they recommend to you, that's a piece of negative feedback. In anomaly detection, it's hard to pinpoint what this feedback loop actually is because, as mentioned earlier, the whole paradox is that if you can accurately, and with a certain level of confidence, identify what is anomalous, then what's the point of having the anomaly detection in the first place? These are a couple of problems, in a whole plethora of them, that kind

of illustrate what is different about using machine learning techniques for anomaly detection in the security space. The very high cost of errors is also something that we have to consider. If a piece of spam comes through to your inbox, it doesn't cost you much. It doesn't cost Gmail much either; it actually helps them train their model to become more accurate. However, if an attack gets through an anomaly detector, it hurts the integrity of the system quite a lot, because of the high cost of false positives and false negatives. If you think about it, analyst time

is actually the most expensive aspect of any security company or of any large organization's security team. When you have an anomaly detector that constantly flags anomalies, analysts obviously have to spend a lot of human time verifying them, and what we've seen in many companies, as I'm sure you've heard, is that IDS and IPS systems have failed in this way and have caused analysts to just hit the mute button. Lack of training data: as mentioned earlier, there's just no way of generating enough negative examples to train a good enough classifier to classify something as negative. Then again, there's the semantic

gap. The semantic gap, I would argue, is the single largest challenge surrounding machine learning in general. If something is flagged as anomalous in a rule-based engine, you can verify that by running the event that failed against the rule engine again, and you'll see exactly why it failed. In a machine learning engine it's harder to do so, because if something failed and you run it against the classifier model, you know that it failed, but why? Why did the classifier model evolve from its original state to the state it is in today? It's harder to know, and with things like deep learning and

recurrent neural networks, where the internals are totally opaque to human understanding, at this point in time at least, it's even worse, I would argue. Difficulties in evaluation: I think it's pretty hard to evaluate how effective a machine learning anomaly detector is, just because, how do you know what a real attack is, especially in the wild? It's hard to measure the accuracy of such things. And something that we'll dive deeper into today is the adversarial setting. We'll see that machine learning anomaly detectors can actually be bypassed pretty easily and effortlessly using poisoning techniques. So it's really bad when the

system is wrong. When we have a high false positive rate, which means we're flagging a lot of non-anomalous activity as anomalies, we're degrading the integrity of the system, taking up analyst time, and burning money for the organization. If we have a high false negative rate, and maybe the rules and heuristics are catching things that the anomaly detector isn't, then what's the point of having it? So it's very intolerant to errors, and that's different from traditional machine learning applications. Lack of training data: it's hard to clean data. If you're looking

at the input logs from the web server that you saw earlier, it's not clear exactly how you would clean the data. It's not clear that when you're training a model of positive examples, you're not including anything negative in it, and that also helps adversaries when they're trying to poison your model. The semantic gap I mentioned earlier. Devising a sound evaluation scheme is more difficult than building the system itself, which is paradoxical, because then you're spending more time evaluating the system than building it. And if you read papers on anomaly detection, they're all written in a very closed environment. They all use the same two or three data sets from the

1980s and '90s, even papers written in the last couple of decades. So that's interesting. They're always origin-destination flow graphs, and that's weird, and I think it's because fundamentally evaluating a machine learning anomaly detector is hard. So it's tough to evaluate how good an anomaly detector is based on academic research alone, without doing your own evaluation. Advanced actors will spend time to bypass the system, and we'll look at that later. So how have real-world anomaly detection systems failed? There are so many false positives. It's hard to find attack-free training data; it's hard to get any data at all, actually. And it's used

without deep understanding. When people think about machine learning and security, they think: okay, maybe I'll just start out with a bunch of data from my web servers and my event logs, then I'll use maybe SciPy or scikit-learn to create a prototype, and if I get a working prototype, I can work from there. But it's actually pretty hard to evaluate once you've gotten that first prototype, and that's the hard part: feature engineering and trying to make your model more accurate, because if the model isn't accurate, it's pointless. And model poisoning has also been seen in the real world. So is

it hopeless? We'll spend some time looking at this problem, because I'm not trying to say that machine learning shouldn't be used in the security context. I'm just saying that if we were to use machine learning more meaningfully in the security context, then we should reconsider the problem parameters and look at the problem from a different angle than the one we used for recommendation engines, spam filters, and other common successful applications. We'll go about doing this by actually looking at how machine learning anomaly detectors are

built and how they function. This is basically what we mentioned earlier: we alert when incoming points deviate from the model of normality. This is an example infrastructure that I got from a very popular paper that proposes a kind of machine learning anomaly detector. First of all, we have the data sources on the extreme left, and this data will have to go through some kind of cleaning process. This cleaning process will take up lots of human time, and if it can be automated, it probably doesn't work that well. Time series construction is important, because if you're doing streaming learning on this input

data, then you have to construct a kind of time series that's a representation of this incoming data. You perform some aggregation on this data, and then you feed it into a model that selects features. In this particular case it uses PCA, which is principal component analysis. It sounds pretty fancy, but it's actually nothing more than an algorithm that selects features in any data that you pass it. That sounds pretty magical, and it actually is pretty magical. I'll go into further detail on that later, but there are problems with using it in the security space, and we'll see why. After generating

features, the data is fed into a machine learning algorithm (a classifier, a threshold, or a top-k algorithm) that then generates some results for what's anomalous and what's not. That goes through manual validation, and that manual validation part is where a lot of systems fail. Then it generates the actual anomalies. Common techniques for machine learning: clustering, density-based methods, SVMs, neural networks. Let's focus on clustering today, because it's the easiest to understand without getting into too many nitty-gritty details, and it's easy to visualize how it actually works. When we talk about the model in machine learning, we're actually talking about different clusters, basically a statistical representation

of your data. So the machine, or the statistics, actually learns what the different areas of your input data are, and the ideal is that when you feed it data, it will form clusters, and if anything deviates from these clusters, it's abnormal. This in particular is a centroid cluster model, and it's good for online learning because, as you can see, we can keep a running model over time, and if new points come in we can simply add them to the old model. How to select features is often the most challenging problem. If you think about it,

there's often a combinatorial explosion in how many features you can choose for your model, and it's always a trade-off between overfitting and underfitting, and also computation and space. If you have too many features to train on, you require lots of computation and time, and you may also overfit. Overfitting is a big problem in machine learning in all industries. But isn't this just parameter optimization? If you can iterate through all possible combinations of features, can't you just evaluate which produces the best result and select that subset of features? The difficulties: it's impossible to iterate combinatorially through the subsets of features, it's hard to evaluate

for the reasons mentioned earlier, and the notion of optimal is not clear. What's optimal performance? Accuracy is not the only criterion. You're not only optimizing for precision; you have to optimize for both precision and recall, and also interpretability, which is harder to quantify and harder to compare. Because it's a streaming model, you want to optimize for shorter training times. It has to be real time; if you're doing anomaly detection in batch mode, then what's the point? You want to reduce overfitting as well. So let me just dive a little bit into PCA. PCA is a method that people use

to select features automatically. Selecting features is very labor intensive. I remember reading somewhere that Yann LeCun, the Facebook AI director, once said to his grad students at MIT that one day all of you will be replaced by algorithms, and I'll just have a bunch of workhorses that will feature engineer for me; there will be no need for machine learning grad students anymore, because all of it will be done by machines. And he's kind of right. Deep learning kind of does this, to a certain extent, and we're seeing that even feature engineering can be automated. PCA uses statistical

methods. It transforms data into different dimensions, so the features that it selects are actually latent. For example, if you pass in column data from web logs that contains, for example, IP, timestamp, and URL, it returns features that are perhaps a weighted combination of each of these fields. So they're latent, and it's not obvious what they mean. This is a small visualization of how PCA works. On the left you see the original data set on two axes. This data set only has two dimensions, for ease of demonstration. The output from PCA aims to maximize the variance captured by any one

particular dimension selected, and each principal component is orthogonal to the next principal component. It's not that important to understand the math behind this, but PCA basically uses some fancy optimization techniques to do this in a computationally reasonable way. This is PCA in a 3D space, where if we rotate the model around, you see the variance captured by PC1 increase, and this organization of the principal components helps PC1 capture the most variance in the entire data set. So when we do PCA, we want to select principal components that cover 80 to 90% of the data set's variance, and this allows us to minimize the computation required to perform machine

learning while still capturing most of the variance. Again, this is a purely statistical method and there's no context. PCA doesn't know whether the component chosen is an IP address, a URL, or the amount of a transaction. This is a scree plot. The x-axis is the number of principal components included in the evaluation, and the y-axis represents the cumulative variance captured in this model. As you can see, having an earlier knee, an earlier inflection in the graph, means that fewer components are required to capture a lot of the data set's variance. So how do you avoid common pitfalls in using machine learning in

anomaly detection? These are things I'm proposing. I think that you have to understand your threat model well. Using machine learning on a statistical data set makes a lot of sense, but using the results of this model without any contextual filtering makes no sense, because machine learning doesn't know context. You have to have at least a second layer, or multiple layers, of contextual filtering before you can use the results in a meaningful way. You have to keep the detection scope narrow, because it's not a be-all and end-all solution. You have to reduce the cost of false negatives and positives

as well, and you also have to close the semantic gap. This is a funny picture of something that I think actually happened, where the two ends of the bridge didn't meet. So, evaluation: just a couple of points on how you should evaluate this. How easily can you filter out false positives? If your model generates a lot of false positives, how can you filter these out with a contextual layer? Evaluating true positives is also important. Let's say a model performs perfectly in your evaluation process. How do you know that it's actually learning what you wanted it to learn?
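The two evaluation questions above can be sketched in a few lines: measure precision and recall on labeled events, then apply a contextual layer and measure again. The events, the raw alerts, and the allowlist of known-good internal IPs are all hypothetical data invented for this illustration.

```python
def precision_recall(alerts, truth):
    """alerts: set of flagged event ids; truth: set of truly malicious ids."""
    true_positives = len(alerts & truth)
    precision = true_positives / len(alerts) if alerts else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


def contextual_filter(alerts, events, allowlist):
    """Second layer: drop alerts whose source is known to be benign."""
    return {eid for eid in alerts if events[eid]["ip"] not in allowlist}


# Hypothetical labeled evaluation set.
events = {
    "e1": {"ip": "10.0.0.5", "malicious": True},
    "e2": {"ip": "192.168.1.10", "malicious": False},  # internal health check
    "e3": {"ip": "192.168.1.11", "malicious": False},  # internal backup job
    "e4": {"ip": "10.0.0.9", "malicious": True},       # missed by the detector
    "e5": {"ip": "10.0.0.7", "malicious": False},
}
raw_alerts = {"e1", "e2", "e3", "e5"}  # what the statistical model flagged
KNOWN_GOOD_IPS = {"192.168.1.10", "192.168.1.11"}

truth = {eid for eid, event in events.items() if event["malicious"]}
before = precision_recall(raw_alerts, truth)
filtered_alerts = contextual_filter(raw_alerts, events, KNOWN_GOOD_IPS)
after = precision_recall(filtered_alerts, truth)

print("before filtering: precision=%.2f recall=%.2f" % before)
print("after filtering:  precision=%.2f recall=%.2f" % after)
```

In this toy data the contextual layer removes two false positives, so precision improves, while recall stays the same because filtering cannot recover the attack the detector missed.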

There's an interesting anecdote here that may be mythical; I'm not sure if it's true, because I can't find anything about it online, but it's been written about in at least a couple of papers. The DoD in the 1980s was performing experiments on neural networks, and they wanted to be able to do some simple image recognition. At the time it was cutting edge: a machine would take in an image of a tank or a car and output whether the image was a tank or a car. The story is that papers were written about it, that they were very successful in these tests, and that they

achieved 80 to 90% accuracy in guessing whether an image contained a tank or a car. After that, the paper was retracted, because they found that if they passed in a car with a blue sky background, the model said it was a tank. So you really have to understand what your model is learning, because it's so hard to understand how a model learns and what it learns. It's not as easy as it sounds. That's something we have to consider when we build machine learning pipelines as well: we have to evaluate true positives. Okay, lastly I'll go into how

we attack something like this. Why do we want to do that? We want to see how secure this is. If we're using it for a security application, we have to see whether it fundamentally meets the security requirements of any system that we're using in production. So how do we attack this? We want to manipulate the learning system to let a specific attack through, and also degrade its performance such that people using the system can't trust it. So there's this notion of chaff. Chaff actually came from the stuff that fighter jets emit to confuse homing missiles when they're in flight and under attack. That's

something that papers have used to describe data points that adversaries send into machine learning pipelines to confuse anomaly detection systems. How do you attack it? This is a simple representation of a centroid-based machine learning anomaly detector. It's a classifier. I omitted all the points and just included the center of the centroid. As you can see, by injecting chaff it's possible to shift the decision boundary of any classifier. Of course you would shift it slowly, and the difference between the center before and after the attack is what allows an adversary to shift the model to his needs. A different kind of attack would

be basically to confuse the system, and this requires a lot more volume: you inject a lot of traffic to expand the decision boundaries, and because there's no clear attack direction, it looks perfectly normal, and traditional algorithms for detecting whether a decision boundary has been shifted meaningfully will not be able to detect this. The crux of the problem is that when you're using machine learning anomaly detectors as opposed to heuristic-based anomaly detectors, you are doing this because you want to take into account a certain kind of drift in your traffic. If your traffic were static, you wouldn't be looking into machine learning; you'd just be using rules.
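A toy sketch of the centroid-shifting attack described above, in one dimension. The detector here is an assumption made for illustration: a fixed radius around a center that is updated with an exponential moving average whenever a point is accepted as normal. The class name, the learning rate, and the attack helper are hypothetical and are not the system used in the talk.

```python
class CentroidDetector:
    """Flags points farther than `radius` from a center that learns online."""

    def __init__(self, center, radius, learning_rate=0.05):
        self.center = center
        self.radius = radius
        self.learning_rate = learning_rate

    def is_anomalous(self, x):
        return abs(x - self.center) > self.radius

    def observe(self, x):
        """Score a point; if it looks normal, fold it into the model."""
        anomalous = self.is_anomalous(x)
        if not anomalous:
            self.center += self.learning_rate * (x - self.center)
        return anomalous


def poison(detector, target, margin=0.99, max_steps=10_000):
    """Send chaff just inside the boundary until `target` is accepted."""
    steps = 0
    while detector.is_anomalous(target) and steps < max_steps:
        direction = 1.0 if target > detector.center else -1.0
        chaff = detector.center + direction * margin * detector.radius
        detector.observe(chaff)  # accepted as normal, so the center drifts
        steps += 1
    return steps


detector = CentroidDetector(center=0.0, radius=1.0)
attack_point = 3.0

flagged_before = detector.is_anomalous(attack_point)
chaff_sent = poison(detector, attack_point)
flagged_after = detector.is_anomalous(attack_point)

print("flagged before poisoning:", flagged_before)
print("chaff points sent:", chaff_sent)
print("flagged after poisoning:", flagged_after)
```

Every chaff point is individually normal by the detector's own definition, which is why the same adaptivity that tracks legitimate drift is what the attacker exploits.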

If you had to take into account some form of dynamic nature in your data, then you would be looking into this, and that's exactly what attackers are using to bypass these systems. The boiling frog attack is interesting. There are a few papers written about this, and I tried it out. Basically what it means is that to avoid detection, you go slow. You play around with the traffic volume and injection period, and basically you want to reduce the volume and elongate the injection period such that if you were to perform injections on any kind of model, you wouldn't be easily detected. This makes a lot of sense, but how

would you defend against this? I'll propose a few defense mechanisms for how model poisoning can be circumvented, and then I'll go into why these don't necessarily work. The first thing is to maintain a calibration test set. When you have the initial model, you have a test set, you run the test set against your model, and you note down which events are anomalous and which are not. Then, after some period of training, you run the same test set again and see if this has changed or not. There

are a few reasons why this doesn't really work, the most important of which is that when you're selecting features, it's hard to generate a test set that exercises all of the features, especially when the features are latent. By changing the IP address and URL, for example, you may not exercise certain latent dimensions that your model is taking into account. So this is an example of a test set, the green points. On the left you have the initial model and on the right you have the updated model that has shifted, and you can see that the yellow points have fallen out of the

notion of normality and the purple points have fallen in. Something else that you can do is decision boundary ratio detection. This means that if you detect that a lot of points coming into your web server, for example, have been falling very close to the actual decision boundary, then you would be able to flag that as anomalous. It's kind of like an inception of anomalies: you would have an anomaly on your anomaly detection model. That again complicates things a lot, because how do you define how wide this region should be? It's again easy for the attacker to

figure out where your decision boundary ratio lies and then just fly under it, because that's static as well. So again this allows adversaries to expand your decision boundaries. So can machine learning be secure? It's not easy, because the very notion of reinforcement learning, the very notion of online learning, means that you have to adapt to the data, and fundamentally it's hard to distinguish between real drift, real changes in the traffic that you expect, and adversarial traffic that aims to poison your model. But what you can do is slow adversaries down, which gives you time to detect when you're being targeted, and I think it's definitely not a

lost cause. How do you defend against this? Obviously there have been lots of papers written about how you improve principal component analysis, how you improve models to circumvent poisoning. There's this thing called Antidote, and principal component pursuit, and robust PCA, and these are actually in use now. Netflix wrote a blog post in February about how they're using robust PCA in their RAD anomaly detection system. I think now it's called train wack, yeah, I think, not sure, but that's in use now. But again, I did some experiments on this. I implemented the PCA algorithms myself, so it's a simple toy system, and I'll be

showing the results later, and we'll see that it's actually not that much more effective than the naive one, and we can still get past it with a certain level of certainty. Robust statistics is interesting. Principal component analysis measures the variance captured by any principal component by using the mean. The median is a much more robust thing to use, because it's harder to shift the median of something unless you have enough volume injected into it. You also have to find an appropriate distribution that models your data set. A lot of research papers use Gaussian distributions just because they're easy to analyze and there's no reason to really

deviate from that notion. Because of that, a lot of libraries that implement machine learning functions also default to Gaussian and normal models, and people who don't really understand their data set well will just go with the default, and that causes a lot of inaccuracy. So it's impossible to tune, it just generates a lot of confusion, and, pretty much, use robust PCA. So I'd like to finish off with some kind of result presentation. I ran my own simulations with some real data. I didn't have access to actual anomaly detection systems that were used in other companies, because there are no commercial

solutions that are ubiquitous, and I couldn't just go up to, for example, Netflix and say, "Hey, can I run this test on your secret system that you don't want people to know about?" But if anyone wants to work with me and do future research, I'd be more than happy to do so. Basically, I looked for a large data set of Apache access logs and ran the tests that I mentioned earlier. This is a projection (sorry, the dots in gray are a little hard to see), but basically this is just a projection onto target flow space and the first principal component space of the data set, and we

see that robust PCA and naive PCA give roughly the same direction of flow, which makes sense. This is the traffic injected. Again, it's not easy to generate traffic like this. I had to go through a pretty long period of trial and error because these are latent dimensions, and I couldn't just, for example, change the IP address by one to generate traffic that shifted right by one. It was hard to generate this, but it pretty much gave me what I wanted to see. Naive PCA actually shifted by a lot more than robust PCA. Robust PCA is a combination of Antidote techniques and using a Laplacian

distribution, which is more suited to web access logs in this particular case. So you can see the faint blue line shifted to the dark blue line and the light blue line shifted to the darker green line. I also tested the boiling frog approach over 10 training periods. I tried to represent the different amounts of traffic injected into this model with different colors: the green ones were injected first and the red ones were generated last. So that's over 10 training periods, and we'll see the results later, how the model changed. So the thing is

that boiling frog attacks are harder to detect, because people can't basically block any group of IP addresses, or any characteristic of the method that you're using to generate this chaff, by simply putting it under a threshold, or by saying that you're sending too much data, or maybe guessing that they're undergoing some kind of DoS attack. The last method of attack is simply putting a lot of random data into this to try to expand the decision boundary, and we see that the shift is also pretty significant for naive PCA and less so for robust PCA.
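
The difference between naive injection and the boiling frog approach can be sketched with a toy simulation. This is an illustration under stated assumptions, not the experiment from the talk: it uses a centroid detector rather than PCA, a fixed acceptance radius `RADIUS`, a defender that retrains each period only on points the current model accepts, and arbitrary traffic volumes. The names `retrain`, `run`, `naive_attack`, and `boiling_frog` are hypothetical.

```python
import numpy as np

RADIUS = 3.0                     # assumed fixed acceptance radius
target = np.array([8.0, 0.0])    # hypothetical point the adversary wants accepted

def retrain(center, batch):
    """Re-estimate the centroid from points the current model accepts."""
    accepted = batch[np.linalg.norm(batch - center, axis=1) <= RADIUS]
    return accepted.mean(axis=0) if len(accepted) else center

def naive_attack(center, rng):
    """Send all chaff straight at the target, far outside the boundary."""
    return rng.normal(target, 0.1, size=(500, 2))

def boiling_frog(center, rng):
    """Send chaff just inside the current boundary, nudged toward the target."""
    gap = target - center
    dist = np.linalg.norm(gap)
    step = min(0.8 * RADIUS, dist)          # never overshoot the target
    aim = center + step * gap / dist if dist > 0 else target
    return rng.normal(aim, 0.1, size=(500, 2))

def run(attack, periods=10, seed=0):
    """Simulate `periods` training periods of legitimate traffic plus chaff."""
    rng = np.random.default_rng(seed)
    center = np.zeros(2)
    for _ in range(periods):
        legit = rng.normal(0.0, 1.0, size=(500, 2))
        chaff = attack(center, rng)
        center = retrain(center, np.vstack([legit, chaff]))
    return center

naive_center = run(naive_attack)
frog_center = run(boiling_frog)

print("target accepted after naive attack: ",
      np.linalg.norm(target - naive_center) <= RADIUS)
print("target accepted after boiling frog: ",
      np.linalg.norm(target - frog_center) <= RADIUS)
```

In this toy setup the naive chaff is rejected outright and never influences the model, while the slow attack stays inside the boundary each period and walks the centroid toward the target over the 10 periods. The sketch also shows a side effect the defender could watch for: once the centroid has drifted far enough, legitimate traffic itself starts being rejected.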

So here are the screen plots for RPCA and naive PCA. With no poisoning, robust PCA performs pretty well for that particular data set. With the boiling frog attack over 10 training periods, we see that it also performs pretty well, but there's a lower chance of detection on the adversary's side, because they won't be that easily detected as a DoS attack or as someone trying to poison the model. We see the RPCA with 50% chaff, which means that 50% of the total traffic that you're receiving is adversarial traffic. It actually degrades the integrity of the system quite a lot, and it almost nears the effects of a

random detector. So just jumping ahead to evasion success, let's look at these numbers. Naive chaff injection with naive PCA gives you 76% evasion success, so that's pretty bad. With boiling frog injection it gave you 87% evasion success. But even when you're using both boiling frog and robust PCA, it gave the adversary 38% evasion success, which in my opinion is still pretty high. If you're able to attack an anomaly detector and evade it with 40% accuracy, I'd argue that the anomaly detector isn't working as well as it should. So that's it: anomaly detection systems today are not so good, but they're improving. I think machine

learning still has a role to play in the security space. These systems are still vulnerable to compromise, but obviously there still is a lot of research being done in this area, and if we can find more robust ways to perform classification, to perform feature selection, then that'll be great. But in general the wisdom surrounding this area is that you should use machine learning to find features and thresholds, then use these selected features to write heuristics, and then run your anomaly detector with these heuristics, because then you'd be less susceptible to online model poisoning attacks and you'll be able to understand what

caused these attacks, what caused the anomalies, better. So what next? I want to do more tests on anomaly detection systems that others have created. If you or your company or a friend has implemented any anomaly detection systems and wants to work with me to test them out, to see how susceptible they are, how secure they are against model poisoning or any other kinds of attacks, then that'll be great. Defenses against poisoning techniques: I mentioned a couple earlier, using the decision boundary ratio and using chaff detection. If there are better ways to detect whether someone is trying to poison your model, then

obviously that will help the cause a lot. I also want to experiment on more resilient ML models. There are a lot that are in the research phase today, and I think in the coming years they'll be much more popular if they show promise. Lastly, I'm not sure how many of you guys are from the SF Bay Area, but I run a meetup group called Data Mining for Cyber Security. It's a pretty active group. We have meetings once every two to four weeks. Our last one was actually last week at Area 1 Security and our next one is at Netflix. So we

talk a lot about machine learning and using statistical models, using data mining, to solve security problems. It's an active area of research and there's a lot of talk on it, but what I've seen is that people want to go about using machine learning in their security problems but they don't know how to start, and if they have started it's more of a prototyping thing and they haven't found enough cause to switch over. Our last meetups were at Facebook and LinkedIn. I work for a company called Shape Security, and what we do is detect automation in web traffic. So that's it. Any questions?

Here. Okay, the question is whether the lack of effectiveness of machine learning techniques in detection is due more to the lack of training data or to the ineffectiveness of the algorithms at detecting these anomalies. It's hard to attribute how much each reason contributes to a failure point, but I think that the data fundamentally needs to be there in order to tune any models. If we can't get any data, then we can't tune the algorithms to make the model better. So personally I feel that the lack of training data is a big problem, because if we're doing data mining we need data. So to me

that's a larger problem.

world

that

Right, right. I agree with what you said fully. I think there is a problem with using out-of-the-box machine learning in security applications, because of what I said earlier. I think there are promising applications of machine learning in security. I think there's a paper written a couple of years ago whereby malware in Android applications was successfully found with up to 95 to 97% accuracy, on their test set of course, using deep learning techniques. That's interesting because malware detection using machine learning has been pretty established for a while, and it's a small enclave in

which machine learning has shown success. But in a more generalized approach, where we're talking about finding anomalies in general without any context surrounding them, I think personally I'm pessimistic until something new comes up; then we have to reconsider the problem.

yeah

yeah yeah

Yeah, exactly. I can't say I know exactly how that Android malware thing worked, because it's been a while since I looked at it, but if I remember correctly what they were doing was to train the model on a bunch of malware that it had seen before and have a test set containing malware that it hadn't seen before. So it used a recurrent neural network to train a model of normality and a model of abnormality, and then the model had a soft enough decision boundary ratio that it was able to classify something that it hadn't seen

before, a totally different kind of malware, or weird function calls or API calls that would let it classify the application as malicious. Yeah.

yes

yeah so

Right, that's a great point. The scenario that I created was really a toy scenario, in which I got a whole bunch of web logs that were all timestamped, and I just fed them in over a simulated training period. So I didn't run it over, let's say, a week; I ran it over a short amount of time and simulated the changes in time. So yes, the chaff had an immediate effect. The boiling frog attack is really more complicated than I made it sound earlier, because the adversary really has to have knowledge of when you're

training on a data set. Presumably the model won't be trained continuously 24/7. Commonly seen scenarios are that it'll be trained once a week, maybe every Sunday at 5:00 it'll be trained for an hour, and then this would be the model for the coming week. So if the adversary didn't know when the training occurs, or if you randomize the training period, then it's theoretically easier for you to detect if someone is sending weird traffic to your endpoint. But then again, we have to assume that the attacker has a level of knowledge that you are not expecting. Yeah.

MH yeah

right

that makes sense yeah I agree

Good. So I haven't looked at all kinds of algorithms, obviously, but... oh, okay, yeah, sorry, have to wrap up after this. I've seen that classifiers are particularly susceptible to poisoning attacks and are maybe the most insecure, because they use a notion of the means of sets of points, which is an insecure statistic. Using SVMs would be helpful, I think. I haven't played around with testing it too much, but it sounds like it would perform better against attacks. Okay.
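
The point about means being an insecure statistic, and the earlier remark that the median is harder to shift without enough volume, can be illustrated with a few lines of Python. This is a sketch on synthetic one-dimensional data; the sample sizes and the chaff value of 50.0 are arbitrary assumptions, not numbers from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical clean feature values.
clean = rng.normal(0.0, 1.0, size=1000)

# 100 extreme adversarial values, roughly 9% of the poisoned data.
chaff = np.full(100, 50.0)
poisoned = np.concatenate([clean, chaff])

# How far each location estimate moves once the chaff is mixed in.
mean_shift = abs(poisoned.mean() - clean.mean())
median_shift = abs(np.median(poisoned) - np.median(clean))

print(f"mean shifted by   {mean_shift:.2f}")
print(f"median shifted by {median_shift:.2f}")
```

A small fraction of extreme points moves the mean by several standard deviations of the clean data, while the median barely moves. The median only gives way once the adversary controls a large share of the traffic, which matches the volume argument made earlier in the talk.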

Yeah. Oh, got it. Thank you, thanks so much. [Applause]
