
This is the Ground Truth track at BSides Las Vegas 2016. Thank you all for being here. I just want to mention that our sponsors are awesome and we couldn't do this without them, so please make sure to visit some of their booths out in the chill area. This is a live stream and it's recorded, so please turn off the ringers on your phones. Also, the area in the back is a fire lane, so we have to keep that clear; if you go out and come back in, please don't stand back there. Okay.
We have Ryan Peters and Wes Connell from BlueVector. Ryan Peters is an applied data scientist and Wes Connell is a threat researcher. Give it up for these two.

Thank you. By way of introduction, as you mentioned, my name is Wes Connell, and I'm a threat researcher and software systems security guy with a firm called BlueVector. Alongside me is Ryan Peters. His official title is applied data scientist, but having worked with him for a few years, I can tell you he's a superhuman software developer and a machine learning expert. He's also a die-hard Cleveland sports fan, so he hasn't stopped smiling since Kyrie drained that three in Game 7. If you're from Golden State, I apologize; I'm from DC, so I know heartbreak quite well.

At any rate, BlueVector develops technology to address pervasive network security challenges. Along those lines, we do a lot of analytics, and machine learning is one of our core competencies. The manner in which we have operationalized machine learning can be much more broadly applied to solve countless problems in the security space, which brings us here, and that's what we'll be highlighting today.

The agenda is as follows. We will briefly review how cyber defense capabilities have evolved over time. We will highlight the common model problem that exists not only in machine learning deployments but across all security solutions. From there, we'll demonstrate an attacker's perspective of defeating this problem with persistence. I'll then hand it off to Ryan, and he will introduce a moving defense solution through data diversification, which adds an element of uncertainty to the attacker's perspective. We'll review this moving defense from concept to practical implementation to quantitative results; the results are numbers and can get pretty dry, but they're really important, so stay with us. We'll then summarize everything, provide final takeaways, and wrap things up with a quiz at the end. I'm just kidding; the quiz is in the middle. No, there's no quiz.

Here is what the security landscape looked like 30 years ago: not a whole lot going on. Given the non-existent defenses,
attackers didn't have to work particularly hard to wreak havoc. The Morris worm in 1988 got the ball rolling, followed by an onslaught of viruses in the 90s and even through to this day. In response, antivirus engines surfaced, and they were, and continue to be, a very reactive approach, in that you can't write a signature for a sample that hasn't been released or seen in the wild before. Along those lines, we had packet and stateful filters deployed in the form of firewalls, which became the standard: exact hash matches on files, and packet filters limited to exact matching on IP address, port, and protocol. If you had a firewall, you were deemed okay.

Antivirus and firewalls eventually gave way to sandboxing, emulation, and virtual detonation. These are fantastic in that they provide in-depth, detailed behavioral context while also recognizing malicious indicators. Unfortunately, as most of you know, there is no shortage of sandbox evasion techniques. In particular, you've got malware with environmental checks for signs of a sandbox: looking for common analysis tools, checking for mouse clicks and mouse movement, checking whether the CPU has multiple cores, checking the disk size. You have arbitrary sleep statements, and exploits that target only specific versions of Flash Player and PDF viewers. We're also seeing varying methods of persistence: typically malware will bootstrap itself to load and execute at system startup, but we've seen, as recently as last month, the APT28 hacking group persist their payload so it triggered when the user opened a Microsoft Office application. In general, these technologies are very difficult to scale, given that we are drowning in data today.

The attack surface has now grown, expanded, and evolved, and we're seeing everything from nation-state-sponsored attacks to point-of-sale attacks, insider threats, social engineering, and, especially today, ransomware. Users and enterprises are demanding additional protection. Here we are again, and the security industry must respond and adapt. So what does our security posture look like today? We've got machine learning, and many vendors would say this is where
we are: this super tech robot, where machine learning, a technology from the future with a power that may mark the beginning of the end for the human race, solves all of our problems and is totally unbreakable. In reality, we're probably here. Machine learning does add a new dynamic: it's far less brittle, much more robust, dynamic, and proactive, but the weaknesses vary depending on the implementation. So let's take a deeper dive into how machine learning has been deployed across the security industry.

We break it out into a few categories. Across the top we have supervised machine learning, which entails labeling the data you're training on so that the model has an understanding of the differentiating characteristics, and unsupervised, with no labels, which is merely clustering algorithms attempting to find structure in unlabeled data. On the opposite axis we've got incremental and batch: incremental learning continuously integrates new data, while batch algorithms train all at once.

In the top-left category we have things like user behavior analytics and insider threat detection. It's deemed incremental, generally because it's building profiles over time, and unsupervised because we are identifying outliers, anomalies, and departures from what's normal. Occasionally network anomaly detection falls into the supervised arena, if known examples of bad behavior are available for training. In the top right we have network traffic profiling, which is generally supervised and incremental; an example would be users labeling a stream of emails as spam or not spam. In the bottom left, unsupervised batch, we have malware family identification: generally unsupervised because we're using clustering algorithms to find structure in the data, and batch because we generally build these on very large libraries of malware. The bottom right is our forte: malware detection. It almost exclusively relies on supervised batch algorithms, where the training is performed on large corpora of data labeled as
malicious or benign. We'll be fixating on this for the remainder of our talk.

With that, let's take a closer look at supervised machine learning. Here we have the supervised machine learning overview: we collect malicious and benign data in the top left, create feature vectors, label them appropriately, and pass that off to the machine learning algorithm, which generates a classifier, a predictive model. If you are a machine learning practitioner, this is a simple concept; if you are not, this concept can range from intermediate to stupid-difficult, so we're going to dumb it down a little bit.

Let's pretend we're building a model that distinguishes Hollywood celebrities from software developers. The training data for Hollywood celebrities would include people like Ben Affleck, Matt Damon, and Brad Pitt, and their feature vectors, these representations of who they are, would be things like: tall, dark, handsome, shredded, dating supermodels, eight figures in the bank, stacks on stacks of cash. On the complete opposite side of the spectrum you have software developers, people like Wes Connell and Ryan Peters and Joe and Sean and most if not all of you, and our feature vectors would be things like: tall, skinny, socially awkward, goofy-looking, terrified of public speaking, hooked on Pokémon Go, probably on Reddit, developing carpal tunnel one work day at a time. These are two polar opposite classes of people, so it's going to be a damn good model. We pass these feature vectors, these character traits, along with their appropriate labels, celebrity or developer, to an algorithm, and that creates a model that should have no problem distinguishing the two groups. We deploy the model, and it tries to distinguish developers from celebrities based on what it knows. Instead of building a model that distinguishes developers and celebrities, we're building one merely to distinguish malware from benign content. Are there any questions before we press on? Okay, great.
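As a rough sketch of that pipeline (not BlueVector's actual implementation; the synthetic data and the random-forest choice below are stand-ins, since the talk doesn't name a specific algorithm), supervised training in scikit-learn looks something like this:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in for real feature extraction: in practice each row would be a
# feature vector derived from a PE file (imports, section entropy, header
# fields, strings, ...), labeled 0 = benign or 1 = malicious.
X, y = make_classification(n_samples=40_000, n_features=50, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)  # this is the "lock" being built
print("held-out accuracy:", clf.score(X_test, y_test))
```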
So we have a few potential vulnerabilities of machine learning. Number one is training set manipulation: within this malware-versus-benignware model, one example would be a web scraping source getting popped and beginning to serve malicious content, resulting in incorrectly labeled training sets. Number two, a malware author can design malware that resembles benign software, for example by inserting benign strings into malicious programs; we see this happen very commonly with things like PuTTY. And number three, malware authors can manipulate their samples to avoid being analyzed altogether: obfuscation, bit shifting, maybe putting the payload in a password-protected zip so that the model never even has access to it.
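As a toy illustration of that benign-string trick (purely illustrative; the file names in the usage comment are hypothetical), an author might append printable strings harvested from a benign binary onto the end of a malicious PE, where they land in the overlay and don't affect execution:

```python
import re
import shutil

def extract_strings(path, min_len=8):
    """Pull printable ASCII runs out of a binary, like the `strings` tool."""
    data = open(path, "rb").read()
    return re.findall(rb"[ -~]{%d,}" % min_len, data)

def pad_with_benign_strings(malware_path, benign_path, out_path):
    """Append benign-looking strings after the PE's data (the overlay),
    hoping a naive static feature extractor now sees 'benign' content."""
    shutil.copyfile(malware_path, out_path)
    with open(out_path, "ab") as f:
        for s in extract_strings(benign_path):
            f.write(s + b"\x00")

# Hypothetical usage:
# pad_with_benign_strings("payload.exe", "putty.exe", "padded_payload.exe")
```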
But if we take a step back, there's a vulnerability that all malware detection technologies are susceptible to, including current implementations of machine learning. We refer to this as the common model problem: we share identical signatures, rule sets, engines, models, and emulators. I want you to imagine that each user or enterprise is represented as a building, and each is seeking to keep intruders out. They go to the town lock store and buy a lock, and that lock is a representation of the security solution. Intruders interested in breaking into one of these targets can ensure a swift, silent, undetected entry by knowing how to break the lock before they show up. Under the common signature-based deployment paradigm, everyone in town is protected by an identical lock, because they're using the same set of signatures, and in most cases an intruder can just visit the locksmith and buy a lock for himself. He takes the lock home and works away at picking it and breaking it, free of risk, time constraints, or any neighbors noticing. The idea here is that an attacker can easily verify that his exploit will be successful against a target protected by such a signature-based solution: because all deployments are identical, all the attacker needs is to obtain any copy of the signature-based product and iterate against it until he learns how to evade detection.

Those were the earlier solutions, particularly antivirus. What about newer approaches that use things like emulation, sandboxing, and heuristics? We find ourselves in the same predicament: each deployment is the same, even though the locks are stronger. And again, because all deployments are identical, all the attacker needs is to obtain any copy of the heuristic engine or sandbox or emulator and iterate against it until he learns how to evade detection. Keep in mind the attacker doesn't need to understand the inner workings of the security solution; he just needs his own copy to test with.
So how about machine-learning-based security solutions? We have this super robust, dynamic, proactive solution, but we're seeing the same problem: we are still distributing the same identical lock to each individual enterprise, so it's no different. Let's see if a malware author in possession of a machine-learning-based security solution can evade detection, in this case through obfuscation. The workflow is pretty straightforward. We generate a payload; for our lab we used a reverse TCP shell, so when the victim gets infected it spawns a shell back to the attacker. Then we obfuscate it; we used Shikata Ga Nai encoding and a handful of others, but the emphasis is not the specific payload or the encoder or the tooling. The point is that the attacker is working to evade detection and bypass the controls in place, here through obfuscation: generating the payload, using a randomly selected encoder to hide and obfuscate what he's doing, embedding it into a calc.exe template, and then testing it in his lab against his copy of the AV and the machine learning model.

In our lab we used the AV software ClamWin, because it's free and open source, and the model we built used a supervised machine learning algorithm with scikit-learn, trained on twenty thousand benign and twenty thousand malicious portable executable files. On the right-hand side we have our test list: training data that was held out so we could test how effective our model is. It had a 3.5 percent false positive rate, files flagged by our classifier that were not in fact malicious, and almost a 4 percent false negative rate, files not flagged that were in fact malicious. Again, we have to assume the attacker has access to both the AV and the machine learning software in order to do this.
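Continuing the toy sketch from earlier, those two rates fall straight out of the confusion matrix on the held-out test set:

```python
from sklearn.metrics import confusion_matrix

y_pred = clf.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

print("false positive rate:", fp / (fp + tn))  # benign flagged as malicious
print("false negative rate:", fn / (fn + tp))  # malicious not flagged
```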
The antivirus solution is looking for an exact static signature, so permuting the file until the signature is evaded is pretty straightforward: in our experiment, the attacker only had to obfuscate his payload one time before evading detection. The machine learning solution is more robust. Obfuscating could actually increase the likelihood of the malware sample being detected: while the payload is being obfuscated, the calc.exe sample could be losing characteristics of benignware, so not only is it hiding its malicious intent, it's also becoming absent of benign characteristics. In our example, the attacker had to obfuscate over 1,900 times before he was able to evade detection by our machine learning model. But the attacker here is in the driver's seat for two reasons. Number one, he's confident the model has not changed, as most vendors only deploy new classifiers a handful of times per year, which is common with batch learning. Number two, he's confident that all targets have the same model; therefore, if the attacker can defeat one instance, one copy, he can defeat all copies. All it takes is persistence, as in the loop sketched below.
So if we take a step back, let's revisit the lock analogy and see how we can do better. With machine learning, the locksmith makes newer, stronger, really robust locks, but unfortunately he's still selling the same lock to everyone. The problem here is pretty obvious: it's not necessarily that we should build stronger locks, it's that we should be using different locks, and additionally they should be changed every once in a while, just like passwords. Machine learning is well suited for implementing these dynamic moving defenses, mostly due to the vast number of independent variables we are able to control: the feature space, as I demonstrated, the learning algorithm, and the data input. And it can be accomplished in an operationally realizable way.

So why hasn't this been done before? Well, for one, it's very difficult to implement; you can imagine it's a logistical nightmare. Most importantly, how do you verify that each lock is different across these different enterprises yet equally effective, and still find a way to distribute and maintain them? It's very costly, and it's not discussed among vendors, as it's easier to just produce one model and share it with everyone. By altering the machine learning model so it's unique for each deployment, we can achieve this moving defense, and again, these defenses should be changing over time. To discuss this moving defense solution in greater detail, I'm going to hand it off to Ryan.

Thank you, Wes. Let's look at how we
can implement a moving defense solution using machine learning. As Wes mentioned earlier, there are three main knobs we can turn to change our machine learning model. The first is the feature space: the data will look different to the learner depending on which features are used. The problem is that in the lab we chose a feature space because we believe it to be optimal, and it's really hard to maintain efficacy if we force ourselves to use a sub-optimal feature space. In addition, we lose the ability to describe samples as feature vectors if each deployment is using a different feature space. The second knob is the learning algorithm, but there are a limited number of learners available, and it's also extremely difficult to maintain efficacy, as some algorithms are better suited than others to a classification problem like malware detection; it's also pretty cumbersome for the vendor to have to develop unique solutions for every single customer. The third knob is the data inputs. If you change the data, there's almost no bound to the number of different classifiers you can produce, and by fixing the feature space we allow our data to be represented as feature vectors. So we're going to choose this knob to
turn in our moving defense solution. One thing we haven't discussed yet are the practical issues of where this training will actually take place: does it happen back at the vendor's lab, or wherever the machine learning model is being deployed? Centralizing with the vendor would be fairly difficult, and you might have to send data back to the vendor, depending on where you're getting that data from. So moving forward we're going to assume a distributed model, where the permutations happen locally, on site with the user, but it's important to note that either approach would accomplish our goals.

So how can we construct these classifiers to achieve a moving defense? Let's first look again at the current machine learning paradigm for malware detection; let's peer into the locksmith's workshop. Back in the vendor's lab, we're curating a large benign and malicious library, often terabytes of data and millions of samples. The vendor adds their secret sauce, the feature space and the algorithms they've selected, and trains the learner to recognize differences between samples. The result is a black-box classifier: your lock. The vendor then pushes that classifier out to the user environment, where the classifier is exposed to unknown samples and issues determinations as to whether it believes each one to be benign or malicious. It's important to note that the samples here, represented by yellow circles, are still yellow, because there is no ground truth available to the classifier; different classifiers might make different determinations on the same sample, and this is a concept we'll be revisiting later. If we want to generate a new lock, it's fairly straightforward: we can just use a different set of data, push the new lock out to the user, and repeat the process as necessary.

So the question is: what is the best way to instantiate a moving defense in practice, across many user environments? We've seen that by using different data
input we can generate different classifiers, but there are many possibilities for how to actually vary that data input. The simplest approach is to use the vendor as a data source: we could sample the vendor data in some intelligent manner, maybe with different sampling algorithms, or age off old data. The main problem is that we're drawing from the same source, so we're really lacking diversity, and it's likely going to produce locks that are not that different from each other. It's also unlikely that vendors would be willing or able to provide this service: they'd have to curate a lot of data, and if they have the data anyway, why are they not including it in the original base model? So we can do better.

The second option is to use a local data source from the user environment: we use data uniquely seen by that deployment, and we feed the data labeled by the classifier back into the data set. This does make the classifier unique to your local data, but it's only reinforcing what the classifier already thinks it knows; it's reinforcing imperfect knowledge. If we peer under the hood, we see that the classifier is getting some of the classifications wrong, so it's just reinforcing the errors. Again, we can do better.

The third option is to still use a local data source, but feed back only the errors. This requires correctly labeled samples, so it will require trusted analyst review, but often these adjudications are already being made when using the security tools. What we would be doing is adding knowledge that users are already generating but that has been previously untapped by the learning model. The resulting classifiers are tailored to that user environment: they're stronger on the content they're seeing, but more importantly, these classifiers are unique, and now everyone is using their own lock.

We're going to select this option for the remainder of our talk, and we're going to call it in situ learning, named for the locality of the data source. With in situ, people aren't just able to buy different, unique locks; we are actually making everyone a locksmith.
This in situ concept is actually pretty simple, but there are many factors to consider when operationally implementing the idea. Here we'll discuss two dimensions, balanced versus unbalanced and replacement versus addition, to consider when determining how to incorporate additional local data into the training set. The first option is balanced replacement: an equal amount of the benign and malicious training sets is replaced with new data. Then there's unbalanced replacement, in which you only replace samples in one of the old training sets with new data; balanced addition, in which equal amounts of new data are added to both the benign and the malicious training sets; and unbalanced addition, in which you only add new data to one of the old training sets.

Why would you choose one versus another? Operationally, you might want to balance the addition of your data to prevent the classifier from favoring one class over another just because there is more of it. Slight unbalancing might not be a major issue, and you might be able to deal with it with whatever learner you've selected, but it could become a concern as more and more data is added. You might choose a replacement approach if you want to more quickly tailor the model to your local samples, but you're doing that at the cost of losing part of the original training set. If you choose addition, you're preserving the original training set, but you're increasing the time required to train your new models and possibly increasing storage requirements for the new samples.

We tried all four approaches, and when adding or replacing up to twenty percent for a single branch off the base model, we did not see a significant difference in performance. So we're going to choose unbalanced addition for this experiment, because it's the easiest to implement operationally: it doesn't require additional samples or feature vectors to balance the training set, which is something the vendor would likely have to ship to the user.
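Here is a minimal sketch of that unbalanced-addition step, reusing the toy classifier setup from earlier; local_X and local_y are assumed stand-ins for analyst-adjudicated local files the base classifier got wrong, and the sampling fraction is taken relative to the base training set:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def in_situ_retrain(base_X, base_y, local_X, local_y, fraction, seed=0):
    """Unbalanced addition: append a random sampling of adjudicated local
    errors to the base training set (no rebalancing) and retrain."""
    rng = np.random.default_rng(seed)
    n = min(int(fraction * len(base_y)), len(local_y))
    idx = rng.choice(len(local_y), size=n, replace=False)
    X = np.vstack([base_X, local_X[idx]])
    y = np.concatenate([base_y, local_y[idx]])
    return RandomForestClassifier(n_estimators=100, random_state=seed).fit(X, y)

# Ten different random samplings yield ten different classifiers, R1..R10:
# models = [in_situ_retrain(X_train, y_train, local_X, local_y, 0.05, seed=k)
#           for k in range(10)]
```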
Now let's look at some efficacy results for this unbalanced addition approach. The top row of this table is the performance of the original base classifier; as you'll remember from earlier, it was trained on twenty thousand benign and twenty thousand malicious PE32 samples in the lab. In the local-data column, the false positive rate starts at 100 percent, because the local data set consists entirely of files the base classifier missed. If we add one percent to the benign training set, or 200 benign files, the false positive rate falls to 14 percent; adding 2 percent, or 400 benign files, it falls to 8 percent, and the trend continues as more and more local data is added to the original model. We have therefore shown that adding local data previously missed by the base classifier to an in situ model results in lower false positive rates on that local data.

So how far has this performance deviated on the original lab test set? A different classifier is not necessarily a better one; we want to make sure we did not significantly bias away from our broader test set. As the results show, there was little to no degradation in performance on the lab data, so the in situ models we generated perform equal to or better than the base classifier.

We can extend this by generating 10 random classifiers using five percent unbalanced addition and evaluating their performance; you can consider this similar to a k-fold validation. We ultimately want to prove that we're not just getting lucky with a specific sampling of the new data set. We take the original lab data set and add a random sampling of local data missed by the base classifier, producing classifier R1, trial number one. We repeat with a different random sampling, producing classifier R2, trial number two, and so on and so forth for R3 through R10. It's important to note that while these classifiers were generated from the same local data source for this experiment, they represent 10 different operational environments, each with its own data source.

Looking across all 10 classifiers, we see fairly consistent performance: standard deviations for the false positive and false negative rates are about half a percent. The main takeaway is that we're generating classifiers with equal or greater effectiveness compared to the base model. In other words, we are producing equally effective locks. But the question remains: are these locks fundamentally different from each other?
We're going to use two different metrics to try to capture model similarity. The first metric is what we call feature space commonality, and this one is specific to our implementation of machine learning models. When we're training our models, the total feature space is down-selected in the lab from a global feature set: we go from potentially millions of features down to tens of thousands, and only a subset of those features is actually used by the learner when generating a new model. We define commonality as the features used by both the base classifier model and the in situ model, divided by the sum of the features used by each model. The results show about 30 percent of the features were used by both models, which means 70 percent of the features used by a model are unique to that model, and the results are very consistent across all 10 models.

The second metric is what we call overlapping misclassifications. This is a more general approach and can be applied to any machine learning algorithm. Every classifier is going to make mistakes; what we're looking to capture is whether the in situ model's mistakes are the same as or different from the base classifier's. We use a consistent test set for all the classifiers, and the subset misclassified by each model, about three percent of the files, is shown in the circles of this Venn diagram. The commonality here is the samples missed by both the base and the in situ model, divided by the sum of the samples missed by each model. The results show only about 50 percent of the samples missed were the same between the two models, and again these results are very consistent across all 10 models. Using these two metrics, we're pretty confident that our in situ models are significantly different from the base model.
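Both metrics reduce to set overlap. A small sketch follows, interpreting "the sum of the features used by each model" as the union of the two sets (and likewise for misclassified samples), and using nonzero feature importances as a stand-in for "features used by the learner"; both interpretations are assumptions, not the talk's exact definitions:

```python
import numpy as np

def feature_space_commonality(model_a, model_b):
    """Fraction of used features shared by two tree-ensemble models."""
    used_a = set(np.flatnonzero(model_a.feature_importances_))
    used_b = set(np.flatnonzero(model_b.feature_importances_))
    return len(used_a & used_b) / len(used_a | used_b)

def overlapping_misclassifications(model_a, model_b, X_test, y_test):
    """Fraction of test-set mistakes shared by two models."""
    miss_a = set(np.flatnonzero(model_a.predict(X_test) != y_test))
    miss_b = set(np.flatnonzero(model_b.predict(X_test) != y_test))
    return len(miss_a & miss_b) / len(miss_a | miss_b)
```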
But are we just forking these models in the same direction, or are they being forked in different directions? What we really want to look at is how the in situ models differ from each other, and we can answer that by simply doing a pairwise comparison of all the randomly generated in situ models using our misclassification metric. There are a lot of numbers here, so rather than focusing on exact numbers, let me explain what's going on with a comparison of classifier R1 versus R2. We established earlier that R1 is different from the base and R2 is different from the base; what we're trying to answer is how R1 differs from R2. The forward slashes represent the files missed by R1, about three percent of the total samples it analyzes; the backward slashes are the files missed by R2, also about three percent of the total samples analyzed; and the intersection of the slashes is the files missed by both classifiers, which ends up being about half of that three percent. We can repeat this comparison for R2 and R4, and so on and so forth. What we see is that for any two given in situ classifiers, there is roughly a 45 to 50 percent overlap in the misclassifications within the three percent that each of them misses. So half of the time they are missing the same files, but half of the time they are not. Again, it's important to re-emphasize that each in situ classifier represents a different deployment; therefore we are gaining diversity across all these deployments. These locks are not just different from the original lock, but significantly different from one another.

So let's summarize the benefits of in situ. First, we have diversity of defense, both spatially, across deployments, and temporally: if you got hacked on August 1st and the attackers got access to your classifier model, you could retrain it on August 2nd and incorporate even the data that was used to hack you. This really increases the
uncertainty for the attacker. Second, we're generating environment-specific classifiers: as we showed earlier, you've increased performance on the type of data observed in that environment, and these results are consistent with performance improvements we've seen in enterprise environments. Third, we have increased responsiveness to new threats. Previously, if you missed a file, the only way to improve the classifier was to submit it to the vendor and cross your fingers that the sample got incorporated into either a custom model from the vendor or, ideally, the global model; often you would have to rely on short-term solutions, which could potentially become indefinite, like brittle signatures, rule matching, or whitelisting. An approach like in situ allows data to be incorporated into the model immediately, as dictated by the user. The fourth benefit is one we didn't really touch on: since with this approach we're training on site, where the machine learning model has been deployed, there is no need to share personal or proprietary data. This is not an argument against data or threat intelligence sharing in general, but regardless of the desire to share, or whatever technical hurdles might exist, in some situations it's impractical or even illegal to share certain types of information without excessive modification: for example, PDF files with personally identifiable information (PII), health records, proprietary company information, or classified data. An approach like in situ allows this data to be incorporated into the machine learning model, rather than having to discard it and rely entirely on the vendor's data, or on the vendor's willingness to implement a custom solution for your environment.

Summarizing the big picture: first, improvements in machine learning methods for malware detection are weakened by the reliance on the traditional deployment paradigm; a dedicated attacker, as we showed earlier, only needs a copy of the security solution to break through the target with confidence. Second, the concept of a moving defense addresses the shared model vulnerability and may be naturally applied to some machine learning solutions; we discussed just a few of the many ways of achieving a moving defense using machine learning and settled on the design decisions that were easiest to operationalize and yielded the best results. And third, the diversity offered by a moving defense is better for the herd, and by better for the herd we mean that if we are all using identical defenses, we are all worse off for it; users should engage with their vendors about its implementation.

I'll leave you with one thought: changes in the security industry don't just come about because of changes in the tactics of adversaries; users must demand change. So we challenge you to gain more understanding of the underlying technologies in our security products, in particular the problems with them, and to continually challenge commonly accepted practices across the industry. Thank you very much.

So now we're open for any questions.
I remember the part where you were talking about the different ways you would change the training data to achieve the different models. I think I maybe misunderstood something about how you're choosing the different learning algorithms and winding up with a different classifier. Could you touch on that?

Are you referring to this slide here, or...

Well, at a high level I see how differentiating the training data could give people different models, but ultimately, because we're all still working from the same base, and at least a subset of the attacks are the same, I'm having trouble seeing how this doesn't just lead to one person having really diesel protection if their attack surface is novel, versus everybody else having a more standard thing, because only the data is changing, in some slight ways.

That's true; you're going to get a variety of defenses. The better-for-the-herd comment was more about the fact that we don't know about an attack until someone actually gets popped, and the reality is that machine learning is a probabilistic approach. So if we can at least move some of those defenses to catch the attack, it helps out the herd by educating them about new threats. I don't know if that answers your question.

I guess, just: how are you also switching up the learning algorithms?

Oh, we're not switching up the learning algorithms; that would cause too many issues, and there's a limited number of learning algorithms available. It's just training the model differently, to look for different features, because the data local to that environment has been recycled into that classifier on site.
The feature space we extract from a file or sample is going to be the same, but the features actually used by the learning algorithm are going to be different, and that's where you get a lot of your diversity; that's what really results in the changes in classifications you're seeing.

Did you generate any attacks against the machine learning models, and then test those, we'll call them vulnerabilities, against the different in situ models you generated? When the earlier speaker talked, he said that in their experience, once a vulnerability is found, it tends to apply across model types, from neural nets to decision trees to whatever. So did you actually test against any vulnerabilities to see if changing up the in situ model actually prevented the vulnerability?

Yeah. We shortened the demo part here, but we tested that malware variant we showed earlier against our in situ models and found that about half of the models now caught it, but half of them still missed it, which is fairly consistent with what we've seen.

Okay, that was actually part of my question as well: how did that calc sample do?

In hindsight we should have included that slide and circled back, but yeah, that's the result.

Then my other question, and this is fascinating work, this is awesome: what if you just had one model that you were always incrementally updating, like with the unbalanced additions, but every time you got a new sample to scan, the model was getting changed?

Yeah, there are a lot of continuous learning algorithms out there, and I think a next step would certainly be looking into those algorithms. We haven't really spent much
time looking at them, but that would also address your point. Again, it has the same barrier, in that an analyst has to adjudicate the samples so you can assign labels, but I do think looking at continuous learning algorithms is a good idea. Thank you.

This is, in one way, a follow-up to the previous question. You talk about false negatives, malware that gets in but isn't detected: how does the analyst know to add it back in if it hasn't been detected? We know from the Verizon studies, I think it's 18 to 24 months, that malware is often in the organization before someone notices it's there. So have you thought about how to get those false negatives back into the loop?

We've seen a few examples where larger enterprises, after they've gotten popped, go do the forensics and curate a list of all the malware they've seen. They house that, and if we give them the capability to develop their own models, they can funnel it in from the get-go. But beyond that, it's tough.

Yeah, that's the billion-dollar question.

So that's a hurdle; you haven't thought about how to get over the lag between the infiltration by false negatives and the discovery of those false negatives?

That's a very tough question. This approach certainly lends itself better to false positive reduction, and that's really important, because the analyst workload can be pretty hefty if you have high false positive rates on your classifier. But note that even though we were only feeding false positives in here (we had separate experiments where we fed in false negatives), even just feeding false positives, we got diversity in both the malicious files the models missed and the benign files they missed.

Let's go up to the mic, dude.
It just occurred to me that maybe the key to making inroads on the billion-dollar question is, I hate to say a VirusTotal approach, but crowdsourcing: getting all the samples you can and just constantly defeating this thing, whatever it is.

Yeah, there are some creative sharing approaches we could look into. If you had trusted groups that were willing to share these samples with each other, you could still get the diversity without having to actually communicate with the vendor. This road goes pretty deep if we look into potential ways to share data. Thank you.

Thank you for your talk, that was good. I have a question; I'm actually a vendor trying to build a product in this area, and it's amazing how people are talking about it; we're discussing the same things back in the lab as well. I'm the founder of a startup, and my background is ArcSight, which most of you probably know. The challenge we see as a vendor is, like you said, that the algorithms can be built, but the data is different for each environment, and when we try to train the model
we need access to that data, which becomes a bottleneck for us, because, like you said, you can't share everything. When we talk to customers, they say, "show me the demo, but you can't see my data." So how do we break through that challenge and work together on that?

Do you want to speak to feature vectors? Yeah, one of the benefits of the approach we selected is that we can abstract our data as feature vectors. That's not to say the features you extract can't contain personally identifiable information, but you could design them in a way where the data is completely anonymized. In that way, a user could share data back to the vendor without actually sending the samples, and you could avoid those issues. So that's one avenue that could be used. Alternatively, you'd have to give them the capability to generate a model locally with their data, without sending anything back. That's one of the reasons we talked about the distributed approach: they don't have to share any data, even feature vectors, back with the vendor, and the model really becomes specialized for their environment.
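One hedged example of what "sharing feature vectors instead of samples" could look like: scikit-learn's hashing trick maps raw tokens (imports, strings, and so on) into a fixed-width numeric vector that can't be trivially inverted back to the original content. The token names here are made up:

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical raw features pulled from one sample.
tokens = ["import:kernel32.CreateFileA", "section:.text", "str:Invoice.pdf"]

hasher = FeatureHasher(n_features=1024, input_type="string")
vector = hasher.transform([tokens])  # 1 x 1024 sparse vector, no raw strings
print(vector.nnz, "nonzero features")
```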
Great. So I'm looking for people who want to participate: give us your data, we will provide our services and train the model to detect these attacks. If anybody's interested, let me know; I'm in the room.

Cool, thank you. All right, I think that's it. Thank you guys very much; we appreciate it. Thank you very much.