
2016 BSidesLV - Ground Truth - Day One

BSides Las Vegas · 4:20:47 · 466 views · Published 2016-08 · Watch on YouTube ↗

A week ago someone asked me, "Aren't you worried about de-anonymizing the blockchain?" Not really, because if you think about it, by that logic Google was de-anonymizing the internet when they built their search engine. The data is already there, so I don't really feel like I'm compromising anybody's anonymity. There are probably other people doing this a lot better than me, I have no idea, but it's good for people to think about these kinds of things when they write about it or when they make transactions. It is important to know that any transaction you make on the blockchain ledger is going to be recorded forever, and everyone can

see it. It's harder to figure out who owns a wallet, but it is certainly possible, and people are going to write analytics like this for whatever purpose. So this database allows us to ask any question of the blockchain: who has the most unspent Bitcoin, where the Bitcoin is, who has the most money, how much money was spent in a day, what the average is. It lets you correlate with things like the stock market, with current events, or any other economic issues. And that is basically it. Oh, crap. Anybody have any questions?
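As a sketch of the kinds of queries just listed, here is a toy version using an in-memory SQLite table. The schema, column names, and sample rows are all invented for illustration, not the speaker's actual database.

```python
import sqlite3

# Toy stand-in for the speaker's blockchain database: one table of
# outputs with a block timestamp, receiving address, and value in satoshi.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tx_outputs (
        block_time INTEGER,   -- Unix epoch seconds of the containing block
        address    TEXT,
        satoshi    INTEGER,
        spent      INTEGER    -- 0 = unspent, 1 = spent
    )
""")
rows = [
    (1470009600, "addrA", 5_000_000_000, 0),  # 50 BTC, unspent
    (1470013200, "addrB", 1_000_000_000, 1),  # 10 BTC, already spent
    (1470096000, "addrA", 2_500_000_000, 0),  # 25 BTC, unspent
]
conn.executemany("INSERT INTO tx_outputs VALUES (?, ?, ?, ?)", rows)

# "Who has the most unspent Bitcoin?" -- sum unspent outputs per address.
top = conn.execute("""
    SELECT address, SUM(satoshi) FROM tx_outputs
    WHERE spent = 0
    GROUP BY address
    ORDER BY SUM(satoshi) DESC
    LIMIT 1
""").fetchone()
print(top)  # addrA, with 75 BTC of unspent outputs

# "How much money moved in a day?" -- group by the calendar day of block_time.
daily = conn.execute("""
    SELECT date(block_time, 'unixepoch') AS day, SUM(satoshi)
    FROM tx_outputs
    GROUP BY day ORDER BY day
""").fetchall()
print(daily)
```

Once the ledger is in any SQL-queryable store, every question in the list above reduces to a GROUP BY along some column.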

Yes, two things. For the timestamps, first of all, how detailed, how in-depth does it go? Nanoseconds? I was just curious. (Do I repeat the question, or do people have it? I don't have to repeat the question.) So the timestamp is a Unix epoch timestamp: seconds from when the block that the transaction took place on was mined. As for how precise, I would give it a margin of error of about 15 minutes, because that's the average time period, so it's hard to correlate exactly when it happened. You basically have about a 15-minute window of being right or wrong.

That's because that's the average time it takes to get a new block. Ten... oh, it's 10 minutes, thank you. And you said you had another question? Yeah: do time zones have anything to do with it? Time zones do not have anything to do with it, because the way the Unix epoch timestamp works, it's only the number of seconds that have elapsed since July 1st... wait, what did I just say? I'm sorry, January 1st, 1970. But I don't know what time zone that's anchored to. GMT? Okay, yeah, GMT. So it allows some variance? Say again?
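The timestamp behavior described above, timezone-free seconds since January 1st, 1970 UTC, plus an uncertainty window around the roughly 10-minute block interval, can be illustrated in a few lines of Python. The block timestamp value here is made up for the example.

```python
from datetime import datetime, timezone

# A block timestamp is just seconds elapsed since 1970-01-01 00:00:00 UTC,
# so it carries no time-zone information of its own.
block_time = 1470009600  # hypothetical block timestamp

utc = datetime.fromtimestamp(block_time, tz=timezone.utc)
print(utc.isoformat())  # 2016-08-01T00:00:00+00:00

# The ~10-minute average block interval means the transaction itself
# happened somewhere in a window around this value, not at this exact second.
AVG_BLOCK_SECONDS = 600
window = (block_time - AVG_BLOCK_SECONDS, block_time + AVG_BLOCK_SECONDS)
print(window)
```

Rendering the same integer in any local time zone is purely a display choice; the stored value never changes.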

It allows some variance in time, basically. When you're mining on the blockchain, or if you were to make a transaction, you can put future times in the timestamp too, as long as it's not too far off. Okay, cool, thank you. No problem. Any other questions? Yes: when you were showing the hex dump, is it like looking at a packet header, where you know these particular bytes are the address fields? Okay, so that part is always static, and you could patch that out. Could you then, if you wanted to

track a specific person's transactions, socially engineer them by saying, "Hey, I heard about you losing your 15 mil, I wanted to give you some Bitcoin," and now you've got their address to query against? So you're saying: tell him you're donating because you feel bad about him losing his 15, see what address you just donated to, and then run that against the data to see who else took from or gave to this address? So you're basically just asking him for his Bitcoin wallet: hey, I want to give you some money. Yeah, you could do that, but the thing is, it's

trivial to generate new Bitcoin wallets, so you don't have to reuse them, and it's actually best practice to use lots of different Bitcoin addresses, for anonymity and whatever other purposes. A new address for every transaction? A new address for every transaction, yep, is the right way to do it, best practice according to experts. And building on that question, there is a known attack vector that darknet market operators look out for, which is basically where somebody sends dust transactions, a few satoshi, so that they can pin that address and see where the money flows, where it ends up. So there are complicated addresses that

take those basically tainted inputs and let them not move out, because when you saw the inputs and outputs, it's down at the satoshi level, how many satoshi were in there, and there are people who spam the network with satoshi hoping to be able to track it. And then, yeah, the Base58 encoding was basically so you could double-click the address and be able to copy and paste it without punctuation breaking it. That makes so much sense. Yeah, I figured it was either that or something with the QR codes, but it was basically that.
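A minimal sketch of the Base58 point above: this is an illustration-only encoder (no checksum or address versioning), but the alphabet is Bitcoin's real one, with 0, O, I, and l deliberately absent so addresses survive copy and paste and can't be misread.

```python
# Bitcoin's Base58 alphabet: zero, capital O, capital I, and lowercase l
# are intentionally excluded to avoid visual confusion.
B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58_encode(data: bytes) -> str:
    """Minimal Base58 encoder, for illustration only (no checksum)."""
    n = int.from_bytes(data, "big")
    out = ""
    while n > 0:
        n, rem = divmod(n, 58)
        out = B58_ALPHABET[rem] + out
    # Leading zero bytes map to leading '1' characters.
    for byte in data:
        if byte == 0:
            out = "1" + out
        else:
            break
    return out

print(base58_encode(b"\x00\x01"))  # "12"
```

Because every character is a letter or digit, a double-click selects the whole address, which is exactly the usability property the speaker describes.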

I don't know why I'm thinking of this, but I don't remember exactly; the copy-and-paste part I do. It's also why you have no capital O and no zero, so you don't confuse them. Oh yeah, sure, that was in the original design, but it is a pain to work with. Then there's this project out there called Blockseer that does something similar but with graphical relationships; you can actually see the inputs and outputs. And then my question was, how big was the actual data set after you put it in the database? Like, how big was it on disk once it was in the database?

Uh, how much did the database reduce it? Yeah, what is the size of the database, the current state of it, if you were to keep it as an on-disk model? How big is that? I don't know exactly. I remember it being very close to the size of the CSV; the database that I used didn't seem to compress it at all or do anything like that. So basically I had an 80 gig, or rather a 100 gig, CSV, and when I loaded it in and looked at that directory as it was loading, I'd have to get back to you on

specifics, but I remember it being about the same size. Do you remember if it was less than the full 80 gigs? I don't remember; I'd have to go back and look. I was just curious. Yeah, I know there's other stuff out there. I mean, Cassandra would reduce the crap out of it if you had really rigid data types and things like that. But ClickHouse I didn't have as much experience with, and when I tried to read the docs, a lot of them were in Russian. So I would search, like, "is there compression," and I'd find one post of

a bunch of Russian. I don't know what they're saying; they give you a link to something, yes, another link to another Russian forum. I don't know what any of this means, man, you're really not helping. But yeah. Yes? So is there anything in the blockchain protocol to stop, say, me having two wallets and just transferring one satoshi back and forth? There is: there's a transaction fee. Okay, cool. So the way that works, and this gets into a lot of other stuff, is that there's a finite amount of Bitcoin that can be mined, by design, and eventually, when all of those

Bitcoin are mined, Bitcoin miners will be rewarded solely by fees. A transaction fee is extracted from every transaction, and it pools together to reward the miner once coinbase rewards are gone, once all the Bitcoin are in circulation. So to answer your question, yes, there is something discouraging you from doing that. Cool, any other questions? No? Then I'm done. My name is Andrew. Thank you guys so much for coming.
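For background on the finite-supply answer above: the block subsidy starts at 50 BTC and halves every 210,000 blocks, so the eventual total supply can be computed directly. This is a sketch using the protocol's published constants.

```python
# Block subsidy starts at 50 BTC and halves every 210,000 blocks; once it
# rounds down to zero satoshi, miners are paid by transaction fees alone.
SATOSHI_PER_BTC = 100_000_000
HALVING_INTERVAL = 210_000

subsidy = 50 * SATOSHI_PER_BTC
total = 0
while subsidy > 0:
    total += subsidy * HALVING_INTERVAL
    subsidy //= 2  # integer halving, as the protocol does

print(total / SATOSHI_PER_BTC)  # just under 21 million BTC
```

The integer floor division is what makes the series terminate, which is why the cap lands slightly below a round 21 million.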


Yeah, so this is the Ground Truth track at BSides Las Vegas 2016. Thank you all for being here. I just want to mention that our sponsors are awesome, and we couldn't do this without them, so you should all make sure to visit some of their booths out in the chill-out area. Let's see, this is a live stream and it's recorded, so please turn off the ringers on your phones. Oh yeah, and also the area in the back is a fire lane, so we have to keep that clear; if you go out and come back in, please don't stand back there. Okay, so we

have Ryan Peters and Wes Connell from BlueVector. Ryan Peters is an applied data scientist and Wes Connell is a threat researcher. All right, give it up for these two.

Thanks. Thank you. By way of introduction, as he mentioned, my name is Wes Connell, and I'm a threat researcher and software systems security guy with a firm called BlueVector. Alongside me is Ryan Peters; his official title is applied data scientist, but I've been working with him for a few years and he's a superhuman software developer and machine learning expert. He's also a diehard Cleveland sports fan, so he hasn't stopped smiling since Kyrie drained that three in Game 7. If you're from Golden State, I apologize; I'm from DC, so I know heartbreak quite well. At any rate, BlueVector develops technology to address pervasive network security challenges, and along those lines we do a

lot of analytics. Machine learning is one of our core competencies, and the manner in which we have operationalized machine learning can be much more broadly applied to solve countless problems in the security space, which brings us here and is what we'll be highlighting today. The agenda is as follows: we will briefly review how cyber defense capabilities have evolved over time; we will highlight the common model problem that exists not only in machine learning deployments but across all security solutions; from there we'll demonstrate an attacker's perspective on defeating this problem with persistence; then I'll hand it off to Ryan, and he will introduce a moving-defense solution through data diversification, which adds an element of uncertainty to

the attacker's perspective. We'll review this moving defense from concept to practical implementation to quantitative results, and the results are numbers; it can get pretty dry, but they're really important, so just stay with us. We'll then summarize everything, provide final takeaways, and wrap things up with a quiz at the end. I'm just kidding, the quiz is in the middle. No, there's no quiz. So here is what the security landscape looked like 30 years ago: not a whole lot going on. Given the non-existent defenses, attackers didn't have to work particularly hard to wreak havoc, and so the Morris worm in 1988 got the ball rolling, followed by an onslaught of viruses in the 90s and even

through to this day. In response, antivirus engines surfaced, and they were, and continue to be, a very reactive approach, in that you can't write a signature for a sample that hasn't been released or seen in the wild before. Along those lines, we had packet and stateful filters deployed in the form of firewalls, which became the standard; we had exact hash matches on files, and packet filters limited to exact matching on IP address, port, and protocol. If you had a firewall, you were deemed okay. Antivirus and firewalls eventually gave way to sandboxing, emulation, and virtual detonation, which are fantastic in that they provide in-depth and detailed behavioral context

while also recognizing malicious indicators. Unfortunately, as most of you know, there's no shortage of sandbox evasion techniques. In particular, you've got malware that runs environmental checks for signs of a sandbox: it can look for common analysis tools, check for mouse clicks and mouse movement, check whether the CPU has multiple cores, check the disk size. You have arbitrary sleep statements, you have exploits that target only specific versions of Flash Player and PDF viewers, and we're also seeing varying methods of persistence. Typically malware will bootstrap itself to load and execute at system startup, but we've seen, as recently as last month, that the APT28 hacking group would

persist their payload only when the user opened a Microsoft Office application. In general, these technologies are very difficult to scale, given that we are drowning in data today. The attack surface has grown, expanded, and evolved, and we're seeing everything from nation-state-sponsored attacks to point-of-sale attacks, insider threats, social engineering, and, especially today, ransomware. Users and enterprises are demanding additional protection, so here we are again, and the security industry must respond and adapt. So what does our security posture look like today? We've got machine learning, and many vendors would say this is where we are: this super tech robot, machine learning as a

technology from the future, with a power that may mark the beginning of the end for the human race, that solves all of our problems and is totally unbreakable. In reality, we're probably here. Machine learning has added a new dynamic: it's far less brittle, much more robust, dynamic, and proactive, but the weaknesses vary depending on the implementation. So let's take a deeper dive into how machine learning has been deployed across the security industry. We break it out into a couple of categories. Across the top we have supervised machine learning, which entails labeling the data you're training on so that the model has an

understanding of the differentiating characteristics, and then we have unsupervised learning, with no labels, which is merely clustering algorithms that attempt to find structure in unlabeled data. On the opposite axis we've got incremental versus batch: incremental learns continuously, continuously integrating new data, while batch algorithms train all at once. In the top-left category we have things like user behavior analytics and insider threat detection; it's deemed incremental, generally, because it's building profiles over time, and it's unsupervised because we are identifying outliers, anomalies, and departures from what's normal. Occasionally we'll see network anomaly detection in the supervised arena, if known examples of bad behavior are available for

training. In the top right we have network traffic profiling, which is generally supervised and incremental; an example would be users labeling a stream of emails as spam or not spam. In the bottom left, unsupervised batch, we have malware family identification; it's generally unsupervised because we are using clustering algorithms to find structure in the data, and it's batch because we generally build these on very large libraries of malware. And in the bottom right is our forte: malware detection. It relies almost exclusively on supervised batch algorithms, where the training is performed on large corpuses of data labeled as malicious or benign. We'll be fixating on this for

the remainder of our talk. So with that, let's take a closer look at supervised machine learning. Here we have the supervised machine learning overview: we collect malicious and benign data in the top left, we create feature vectors, label them appropriately, and pass that off to the machine learning algorithm, which generates a classifier, a predictive model. If you are a machine learning practitioner, this is a simple concept; if you are not, it can be intermediate to stupidly difficult, so we're going to dumb it down a little bit. Let's pretend we're building a model that distinguishes Hollywood celebrities from software developers. The training data for

Hollywood celebrities would include people like Ben Affleck, Matt Damon, and Brad Pitt, and their feature vectors, these representations of who they are, would be things like: tall, dark, handsome, shredded, dating supermodels, eight figures in the bank, stacks on stacks of cash. On the complete opposite side of the spectrum you have software developers, people like Wes Connell and Ryan Peters and Joe and Sean and most if not all of you, and these feature vectors, these representations of who we are, would be things like: tall, skinny, socially awkward, goofy-looking, terrified of public speaking, hooked on Pokémon Go, probably on Reddit, developing carpal tunnel one workday at a time. So these are two polar opposite

classes of people, and it's going to be a damn good model. So we pass these feature vectors, these character traits, along with their appropriate labels, celebrity or developer, to an algorithm, and that creates a model that should have no problem distinguishing the two groups of people. We deploy the model, and it will try to distinguish developers from celebrities based on what it knows. So instead of building a model that distinguishes developers and celebrities, we're merely building one to distinguish malware from benign content. Are there any questions before we press on? Okay, great. So we have a couple of potential vulnerabilities of machine learning.
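The pipeline just described, feature vectors plus labels in, predictive model out, can be sketched with a toy nearest-centroid learner. The feature names and numbers here are invented for illustration, and a real deployment would use a proper library such as scikit-learn.

```python
# A toy supervised classifier in the spirit of the celebrity-vs-developer
# example: numeric feature vectors with labels go in, and prediction is by
# nearest class centroid. A stand-in for a real learner, not one.
def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def train(samples):
    """samples: list of (feature_vector, label) -> per-class centroids."""
    by_label = {}
    for vec, label in samples:
        by_label.setdefault(label, []).append(vec)
    return {label: centroid(vecs) for label, vecs in by_label.items()}

def predict(model, vec):
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda label: dist(model[label], vec))

# Hypothetical PE features: [entropy, import_count, is_signed]
training = [
    ([7.8, 3, 0], "malicious"),
    ([7.5, 5, 0], "malicious"),
    ([5.1, 40, 1], "benign"),
    ([4.8, 55, 1], "benign"),
]
model = train(training)
print(predict(model, [7.6, 4, 0]))   # malicious
print(predict(model, [5.0, 50, 1]))  # benign
```

The key point carried into the rest of the talk: whatever the algorithm, the classifier is entirely a product of its training data, which is exactly the knob exploited and later diversified.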

First, training set manipulation: within this malware-versus-benign model, one example would be a web source that gets popped and begins serving malicious content, resulting in incorrectly labeled training sets. Another example would be a malware author who designs malware to resemble benign software, or who inserts benign strings into malicious programs; we see this happen very commonly with things like PuTTY. Number three is that malware authors can manipulate their sample to avoid being analyzed altogether: things like obfuscation, bit shifting, maybe putting the payload in a password-protected zip so that the model never even has access to it. But if we take a

step back, there is a vulnerability that all malware detection technologies are susceptible to, including current implementations of machine learning, and we refer to it as the common model problem: we share identical signatures, rule sets, engines, models, and emulators. I want you to imagine that each user or enterprise is represented as a building here, and each is seeking to keep intruders out, so they go to the town lock store and buy a lock; that lock is a representation of a security solution. Intruders interested in breaking into one of these targets can ensure a swift, silent, undetected entry by knowing how to break the lock before they show up, and under

the common signature-based deployment paradigm, everyone in town is protected by an identical lock, because they're all using the same set of signatures. In most cases, an intruder can just visit the locksmith and buy a lock for himself; he takes the lock home and works away at picking it and breaking it, free of risk, time constraints, or any neighbors noticing. The idea here is that an attacker can easily verify that his exploit will be successful against a target protected by such a signature-based solution, and because all deployments are identical, all the attacker needs is to obtain any copy of the signature-based product and iterate against it until he learns how to evade

detection. Those were the earlier solutions, particularly antivirus. What about newer approaches that use things like emulation, sandboxing, and heuristics? Well, we find ourselves in the same predicament: each deployment is the same, just with stronger locks, and again, because all the deployments are identical, all the attacker needs is to obtain any copy of the heuristic engine or sandbox or emulator and iterate against it until he learns how to evade detection. Keep in mind the attacker doesn't need to understand the inner workings of the security solution; he just needs his own copy to test with. So how about machine learning based security solutions? We have this super robust, dynamic, proactive solution, but we're

seeing the same problem here: we are still distributing the same identical lock to each individual enterprise, so it's no different. Let's see if a malware author in possession of a machine learning based security solution can evade detection, and in this case we're going to try it with obfuscation. The workflow here is pretty straightforward. We're generating a payload; for our lab we were using a reverse TCP shell, so when the victim gets infected it spawns a shell back to the attacker. Then we are obfuscating it; we used Shikata Ga Nai encoding and a handful of others, but again, the emphasis is not the specific payload or the encoder or the

tooling; the point is that the attacker is working to evade detection and bypass the controls in place, and this is through obfuscation. So: generating the payload, using a randomly selected encoder to hide and obfuscate what he's doing, embedding it into a .exe template, and then testing in his lab against that copy of the AV and the machine learning. In our lab we used the AV software ClamWin, because it's free and open source, and the model that we built was built using a supervised machine learning algorithm with scikit-learn, trained on 20,000 benign and 20,000 malicious portable executable files. On the right-hand side we have our

test set: training data that was held out so we could test against it and check how effective our model is. It had a 3.5% false positive rate, meaning a file was flagged by our classifier but was not in fact malicious, and almost a 4% false negative rate, where a file was not flagged but was in fact malicious. Again, we have to assume that the attacker has access to both the AV and the machine learning software in order to do this. The antivirus solution is looking for an exact static signature, so permuting the file until the signature is evaded is pretty straightforward.
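The held-out evaluation above boils down to two ratios. The sample counts below are invented purely to reproduce the quoted roughly 3.5% and 4% figures; they are not the talk's actual confusion-matrix numbers.

```python
# Held-out test evaluation as described: false-positive rate over the
# benign samples, false-negative rate over the malicious ones.
def rates(benign_flagged, benign_total, malicious_missed, malicious_total):
    fp_rate = benign_flagged / benign_total      # benign wrongly flagged
    fn_rate = malicious_missed / malicious_total  # malicious not flagged
    return fp_rate, fn_rate

# Hypothetical counts chosen to match the quoted figures.
fp, fn = rates(benign_flagged=70, benign_total=2000,
               malicious_missed=80, malicious_total=2000)
print(f"FP {fp:.1%}  FN {fn:.1%}")  # FP 3.5%  FN 4.0%
```

Keeping the two rates separate matters later in the talk, since the in situ feedback loop targets false positives on local data specifically.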

In our experiment, the attacker only had to obfuscate his payload one time before evading the AV's detection. The machine learning solution, again, is pretty robust, and obfuscation could actually increase the likelihood of the malware sample being detected: while the payload is being obfuscated, the calc.exe sample can lose the characteristics of benign software, so it's not only hiding its malicious intent, it's also becoming absent of benign characteristics. In our example, the attacker had to obfuscate over 1,000 times before he was able to evade detection by our machine learning model. So the attacker here is in the driver's seat for two reasons. Number one, he's confident the model has

not changed, as most vendors only deploy new classifiers a handful of times per year, which is common with batch learning; and number two, he's confident that all targets have the same model. Therefore, if the attacker can defeat one instance, one copy, he can defeat all copies. All it takes is persistence. So if we take a step back, let's revisit the lock analogy and see how we can do better.
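The persistence loop just described can be sketched with a mock detector. The byte pattern, the XOR "encoder," and the detector itself are all invented stand-ins; this is not ClamWin, Shikata Ga Nai, or the real model, just the shape of the iterate-until-evade workflow.

```python
import random

# Sketch of the attacker's workflow against a static copy of a defense:
# mutate (re-encode) the payload and re-test until the local copy of the
# detector no longer flags it.
def mock_detector(payload: bytes) -> bool:
    """Pretend signature engine: flags payloads containing a byte pattern."""
    return b"\xde\xad" in payload

def reencode(payload: bytes, rng: random.Random) -> bytes:
    """Stand-in for an encoder pass: XOR every byte with a random key."""
    key = rng.randrange(1, 256)
    return bytes(b ^ key for b in payload)

rng = random.Random(7)  # fixed seed so the run is repeatable
payload = b"\xde\xad\xbe\xef" + b"shellcode"
attempts = 0
while mock_detector(payload):
    payload = reencode(payload, rng)
    attempts += 1

print(attempts)  # passes needed against this fixed copy of the lock
```

Against this trivial static signature one pass suffices, mirroring the AV result; the point of the talk is that even the 1,000-plus passes needed against the robust model are cheap when the attacker knows every target runs the identical copy.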

With machine learning, the locksmith can make numerous stronger, really robust locks, but unfortunately he's still selling the same lock to everyone. The problem here is pretty obvious: it's not necessarily that we should build stronger locks, it's that we should be using different locks, and additionally they should be changed every once in a while, just like passwords. Machine learning is well suited to implementing these dynamic moving defenses, mostly due to the vast number of independent variables we are able to control: the feature space, as I demonstrated, the learning algorithm, and the data input. And it can be accomplished in an operationally realizable way. So why hasn't this been done

before? Well, for one, it's very difficult to implement; you can imagine it's a logistical nightmare. Most importantly, how do you verify that each lock is different across these different enterprises yet equally effective, and still find a way to distribute and maintain them? It's very costly, and it's not discussed among vendors, as it's easier to just produce one model and share it with everyone. By altering the machine learning model so that it's unique for each deployment, we can achieve this moving defense, and again, these defenses should be changing over time. To discuss this moving-defense solution in greater detail, I'm going to hand it off to Ryan. Thank you, Wes. So let's look at how

we can implement a moving-defense solution using machine learning. As Wes mentioned earlier, there are three main knobs we can turn to change our machine learning model. The first is the feature space: the data will look different to the learner depending on which features are used. But the problem is that in the lab we're choosing a feature space because we believe it to be the optimal feature space, and it's really hard to maintain efficacy if we're forcing ourselves to use a suboptimal one. In addition, we lose the ability to describe samples as feature vectors if each deployment is using a different feature space. The second knob we can turn is the

learning algorithm, but the problem with this is that there are a limited number of learners available, and it's also extremely difficult to maintain efficacy, as some algorithms may be better suited than others to a classification problem like malware detection. It's also pretty cumbersome for the vendor to have to develop unique solutions for every single customer. The third knob is the data input: if you change the data, there's almost no bound on the number of different classifiers we can produce, and by fixing the feature space we keep the ability to represent our data as feature vectors. So we're going to choose this knob to turn in our moving-defense

solution. One thing we haven't discussed yet are the practical issues of where this training will actually take place: does it happen back at the vendor's lab, or wherever the machine learning model is being deployed? Centralizing with the vendor would be fairly difficult, and you might have to send data back to the vendor, depending on where you're getting that data from. So moving forward we're going to assume a distributed model, where the permutations happen locally, on site with the user, but it's important to note that either approach would accomplish our goals. So how can we construct these

classifiers to achieve a moving defense? Let's first look again at the current machine-learning-for-malware-detection paradigm; let's peer into the locksmith's workshop. Back in the vendor's lab, we're curating a large benign and malicious library, often terabytes of data, millions of samples. The vendor adds their secret sauce, the feature space and the algorithms they've selected, and trains the learner to recognize differences between samples; the result is a black-box classifier, or your lock. The vendor then pushes that classifier out to the user environment, where the classifier is exposed to unknown samples and issues determinations as to whether it believes them to be benign or

malicious. It's important to note here that the samples, represented by yellow circles, are still yellow because there is no ground truth available to the learner or to the classifier model, and different classifiers might make different determinations on the same sample; this is a concept we'll be revisiting later. If you want to generate a new lock, it's fairly straightforward: just use a different set of data, push the new lock out to the user, and repeat the process as necessary. So the question is, what is the best way to instantiate a moving defense in practice across many user environments? We see that by using

different data inputs we can generate different classifiers, but there are many possibilities for how to actually vary that data input. The simplest approach is to just use the vendor as a data source: we could sample the vendor data in some intelligent manner, maybe with different algorithms, or age off old data. But the main problem is that we're drawing from the same source, so we're really lacking diversity; it's likely to produce locks that are not that different from each other. It's also unlikely that vendors are going to be willing or able to provide this service; they'd have to curate a

lot of data, and if they have the data anyway, why aren't they including it in the original base model? So we can do better. The second option is to use a local data source from the user environment, data uniquely seen by that deployment, feeding the data labeled by this classifier back into the data set. This does make the classifier unique, in particular to your local data, but it's only reinforcing what the classifier already thinks it knows; it's reinforcing imperfect knowledge. And if we actually peer under the hood here, we see that the classifier is getting some

of the classifications wrong, so it's just reinforcing the errors. Again, we can do better. The third option is to still use a local data source but feed back only the errors. This requires correctly labeled samples, so it's going to require trusted analyst review, but often these adjudications are already being made when using the security tool. What we would be doing is adding knowledge users are already generating but that has previously been untapped by the learning model. The resulting classifiers are going to be tailored to that user environment, stronger on the content they are seeing; but more importantly, these classifiers are unique, and now everyone

is using their own locks. We're going to select this option for the remainder of our talk, and we're going to call it in situ learning, named for the locality of the data source. With in situ learning, people are not just able to buy different, unique locks; we are actually making everyone a locksmith. This in situ concept is pretty simple, but there are many different factors to consider when operationally implementing the idea. Here we're going to discuss two dimensions, balanced versus unbalanced and replacement versus addition, to consider when determining how to incorporate additional local data into the training set. The first option is balanced replacement: an equal amount of the benign and malicious training set is

replaced with new data. You have unbalanced replacement, in which you only replace samples in one of the old training sets with new data; balanced addition, where equal amounts of new data are added to both the benign and the malicious training sets; and unbalanced addition, in which you only add new data to one of the old training sets. So why would you choose one versus the other? Operationally, you might want to balance the addition of your data to prevent the classifier from favoring one class over another just because there is more of it; however, slight unbalancing might not be a major issue, and you might be able to deal with this issue, you

But it definitely could be a concern as more and more data is added. You might choose a replacement approach if you want to more quickly adapt that model to your local samples, but you're doing that at the cost of losing part of the original training set. And if you choose addition, you're preserving the original training set, but you're increasing the time required to train your new models, and you might be increasing storage requirements for these new samples. So we tried all four approaches, and when you're only adding or replacing up to 20% for a single branch off the base model, we actually did not see a

significant difference in performance. So we're going to choose unbalanced addition for this experiment because it's the easiest to implement operationally: it doesn't require additional samples or feature vectors to balance the training set, which is something the vendor would likely have to ship to a user. So let's look at some efficacy results for this unbalanced addition approach. The top row of this table is the performance of the original base classifier; if you remember from earlier, it was trained on 20,000 benign and 20,000 malicious PE32 samples from the lab. In this local data column, the false positives are the files that were missed by the base classifier, so it's going to be 100% to start off with

because to start we've missed all of those files. If we add 1% to the benign training set, or 200 benign files, the false positive rate falls to 14%; adding 2%, or 400 benign files, it falls to 8%; and this trend continues as more and more local data is added to the original model. Therefore we have shown here that adding local data previously missed by the base classifier to an in situ model results in lower false positive rates on that local data.
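The feedback loop just described can be sketched in a few lines. This assumes a toy stand-in `train` function (the talk's actual learner and feature extraction are not specified); a real model generalizes over features rather than memorizing files, but the loop structure is the same: score local files, collect the false positives, add them to the benign set, retrain.

```python
# Toy sketch of the error-feedback (unbalanced addition) loop.

def train(benign_set, malicious_set):
    # Stand-in learner: flag anything not seen as benign in training.
    seen_benign = set(benign_set)
    return lambda f: f not in seen_benign

def in_situ_update(benign, malicious, local_benign, fraction):
    model = train(benign, malicious)
    # False positives: benign local files the current model flags.
    false_positives = [f for f in local_benign if model(f)]
    k = int(len(benign) * fraction)          # e.g. 1% of 20,000 = 200 files
    benign = benign + false_positives[:k]    # unbalanced addition
    return train(benign, malicious), benign

base_benign = [f"lab_b{i}" for i in range(200)]
base_malicious = [f"lab_m{i}" for i in range(200)]
local = [f"local_b{i}" for i in range(50)]

model, _ = in_situ_update(base_benign, base_malicious, local, 0.05)
print(sum(model(f) for f in local))  # 40: the 10 fed-back files are no longer FPs
```

Before the update every local file is a false positive; after feeding back 5% (10 files), only the remaining 40 are still flagged, mirroring the falling false positive rates in the table.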

So how far has this performance deviated on the original lab test set? A different classifier is not necessarily a better one; we want to make sure that we did not significantly bias away from our broader test set, and as the results here show, there was little to no degradation in performance on the lab data. Therefore the in situ models we generated are performing equal to or better than the base classifier. We can extend this by generating 10 random classifiers using 5% unbalanced addition and evaluating the performance; you can consider this similar to a cross-validation. We ultimately want to prove that we're not just getting lucky with a specific sampling of the new data set. So we can take the original lab data set and add a random sampling of local data missed by the base classifier, and this will produce classifier R1, or trial number one. We can repeat using a different

random sampling, which produces classifier R2, or trial number two, and so on and so forth for R3 through R10. It's important to note that while these classifiers were generated using the same local data source for this experiment, they represent 10 different operational environments, each with their own data sources. If we look across all 10 classifiers here, we're seeing fairly consistent performance; standard deviations for false positive and false negative rates are about half a percent. So the main takeaway here is that we're generating classifiers with equal or greater effectiveness compared to the base model. In other words, we are producing equally effective locks, but the question remains: are these locks fundamentally different from each

other? Okay, so we're going to use two different metrics to try and capture model similarity. The first metric we're going to use is what we call feature space commonality, and this is more specific to our implementation of machine learning models. When we're training our models, our total feature space was down-selected in the lab from a global feature set, so we're going from potentially millions of samples, or millions of features, down to tens of thousands of features, and only a subset of these features are actually going to be used by the learner when we're generating a new model. So we're defining the commonality here as the features used by both the base classifier model

and the in situ model, divided by the sum of the features used by each model. The results show about 30% of the features were used by both models, which means 70% of the features used by a model are different, or unique to that model, and we see the results are very consistent across all 10 of these models. The other metric we can use is what we call overlapping misclassifications, and this is a more general approach that can really be applied to any machine learning algorithm. Every classifier is going to make mistakes; what we're looking to capture is whether the in situ mistakes are the same or different compared to the base classifier. So we're

going to use a consistent test set for all these classifiers, and the subset misclassified by each model, which is about 3% of the files, is shown in the circles of this Venn diagram. The commonality here is going to be the samples missed by both the base and the in situ model, divided by the sum of samples missed by each model, and the results show only about 50% of the samples missed were the same between the two models. Again, these results are very consistent across all 10 models. So using these two metrics, we're pretty confident that our in situ models are significantly different from the base model.
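Both similarity metrics reduce to set arithmetic. A small sketch follows; interpreting "divided by the sum of the features used by each model" as intersection over union is our assumption (the exact denominator isn't spelled out in the talk), and the example sets are made up to mirror the reported roughly 30% and 50% figures.

```python
# Feature space commonality and overlapping misclassifications, both
# computed as intersection-over-union of sets.

def commonality(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hypothetical feature sets: 2 of 6 distinct features shared (~33%).
base_features   = {"f1", "f2", "f3", "f4"}
insitu_features = {"f3", "f4", "f5", "f6"}
print(round(commonality(base_features, insitu_features), 2))

# Hypothetical miss sets: 2 of 4 distinct misclassified samples shared (50%).
base_misses   = {"s1", "s2", "s3"}
insitu_misses = {"s2", "s3", "s4"}
print(commonality(base_misses, insitu_misses))
```

The misclassification version works for any learner, since it only needs each model's predictions on a shared test set, not any internals.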

But are we just forking these models in the same direction, or are they being forked in different directions? What we really want to look at is how the in situ models differ from each other. We can answer this question by simply looking at a pairwise comparison of all the randomly generated in situ models using our misclassification metric. There are a lot of numbers here, so let me explain first what's going on instead of focusing on exact numbers. Let's look at a comparison of classifier R1 versus R2. We established earlier that R1 is different from the base and R2 is different from the base; what we're trying to answer here is how R1 differs from R2. The forward slashes here represent the files missed by R1, which is

going to be about 3% of the total samples it analyzes, and the backward slashes are the files missed by R2, or about 3% of the total samples analyzed. The intersection of the slashes is the files missed by both classifiers, which ends up being about half of the 3%. We can repeat this comparison for R2 and R4, and so on and so forth. What we see is that for any two given in situ classifiers, we have roughly a 45 to 50% overlap in the misclassifications of the 3% that each of them miss. So half of the time they are missing the same files, but half of the time they are not.
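The pairwise comparison is a loop over all classifier pairs with the same overlap metric. The miss sets below are hypothetical (overlapping integer ranges, not real misclassifications), so the overlap values illustrate the computation rather than the 45-50% figure from the talk.

```python
# Sketch of the pairwise misclassification-overlap comparison across
# ten hypothetical in situ classifiers R1..R10.
from itertools import combinations

def overlap(a, b):
    return len(a & b) / len(a | b)

# Hypothetical miss sets; adjacent ranges overlap heavily on purpose.
misses = {f"R{i}": set(range(i, i + 30)) for i in range(1, 11)}

pairwise = {(r1, r2): overlap(misses[r1], misses[r2])
            for r1, r2 in combinations(misses, 2)}
print(pairwise[("R1", "R2")])  # adjacent classifiers share 29 of 31 distinct misses
```

With ten classifiers there are 45 pairs, which is why the slide is a dense grid of numbers.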

Again, it's important to reemphasize here that each in situ classifier represents a different deployment; therefore we are gaining diversity across all these deployments. These locks are not just different from the original lock but are significantly different from one another. So let's sum up the benefits of in situ learning. First, we have that diversity of defense, and this is both spatial, which is across deployments, and also temporal: if you got hacked on August 1st and they got access to your classifier model, you could retrain on August 2nd and incorporate even the data that was used to hack you. So this really increases the uncertainty for the attacker. Second, we're generating environment-specific classifiers, as we showed earlier, where you've increased

the performance on the type of data that's observed in that environment, and these results are consistent with performance improvements we've seen in enterprise environments. Third, we have increased responsiveness to new threats. Previously, if you missed a file, the only way to improve the classifier was to submit it to the vendor and cross your fingers that the sample gets incorporated into either a custom model from the vendor or, ideally, the global model. But often you would have to rely on short-term solutions, which could potentially be indefinite: these included things like brittle signatures or rule matching and whitelisting. An approach like in situ learning allows data to be incorporated into the model

immediately, as dictated by the user. And the fourth benefit is one that we didn't really touch on: with this approach, since we're doing it on-site where the machine learning model has been deployed, there is no need to share personal or proprietary data. And again, this is not an argument against data or threat intelligence sharing in general; regardless of the desire to share or whatever technical hurdles might exist, in some situations it is even illegal to share certain types of information without excessive modification. For example, you might have PDF files with personally identifiable information (PII), you might have health records, proprietary company info, or classified data. An approach like in situ learning allows this

data to be incorporated into the machine learning model rather than having to discard it and rely entirely on the vendor's data or the vendor's willingness to implement a custom solution for your environment. So, summarizing the big picture: improvements in machine learning methods for malware detection are weakened by the reliance on the traditional deployment paradigm; a dedicated attacker, as we showed earlier, only needs a copy of the security solution to break through the target with confidence. Secondly, the concept of a moving defense addresses the shared-model vulnerability and may be naturally applied to some machine learning solutions. We discussed just a few of the many different ways of achieving a moving defense using machine learning and settled on several design decisions

that were easiest to operationalize and yielded the best results. And third, the diversity offered by a moving defense is better for the herd, and by better for the herd we mean that if we are all using identical defenses, we are all worse off for it. Users should engage with their vendors about its implementation. So I'm going to leave you with this one thought: changes in the security industry don't just come about because of changes in the tactics of adversaries; users must demand change. So we challenge you to gain more understanding of the underlying technologies in our security products, in particular the problems with them, and to continually challenge commonly accepted practices across the industry. Thank you

very much. So now we're open for any questions.

So I remember the part where you were talking about the different ways you would change the training data to achieve the different models. I think I maybe misunderstood something about how you're choosing the different learning algorithms and winding up with different classifiers; could you touch on that? Are you referring to this slide here, or... Well, I guess at a high level I see how the differentiation of the training data could give people different models. Correct. But ultimately, because we're all still working from the same base and at least a subset of the attacks are the same, I am having trouble seeing how this doesn't just lead

to one person having really diesel protection if their attack surface is novel, versus everybody else who would have a more standard thing, because only the data is changing in some slight ways, maybe. I mean, that's true: you're going to get a variety of defenses, and the better-for-the-herd comment was more referring to the fact that we don't know an attack is happening until someone actually gets popped. The reality is that machine learning is a probabilistic approach, and so if we can at least move some of those defenses to catch that, it helps out the herd by educating them about

new threats. I don't know if that answers your question at all. I guess, just, how are you also switching up the learning algorithm? Yeah, we're not switching up the learning algorithms, because it would cause too many issues; there's a limited number of learning algorithms that are available. It's just training the model differently to look for different features, because the features of the data that was local to that environment have been recycled into that classifier on-site. So it's not just the data that's changing, it's the features. Well, the feature space that we're going to extract from that file or that sample is going to be the same, but

the features that are actually used by the learning algorithm are going to be different, and that's where you get a lot of your diversity; that's what really results in a lot of the changes in classification that you're seeing. So, did you generate any attacks against the machine learning models, and then test those, well, we'll call them vulnerabilities, against the different models that you generated, the in situ models? When the earlier speaker spoke, he said that in their experience, even changing the type of model, once a vulnerability is found, it tended to apply

across model types, from neural nets to decision trees to whatever. So did you actually test against any vulnerabilities to see if changing up the in situ model actually prevented the vulnerability? Yeah, so that malware variant that we showed earlier: we sort of shortened the demo part here, but we tested that against our in situ models and found that about half of the models now caught it, but half of them still missed it. So it's fairly consistent with what we've seen. Okay, so that was actually part of my question as well, how that same sample did with the... Yeah, we

should have, in hindsight we should have included that slide and sort of circled back. But then my other question is, and this is fascinating work, this is awesome, one of the things that occurred to me as well, and I want to pick your brains about it: what if you just had one model that you were incrementally, always updating, like with the unbalanced additions, but every time you got a new sample to scan, the model was always getting changed? Yeah, so there are a lot of continuous learning algorithms out there, and I

think a next step would certainly be looking into those algorithms; we haven't really spent much time looking at those. But that would also address your issue; again, it's the same barrier of an analyst having to adjudicate it so you can assign it labels. But I do think that's a good idea, looking at continuous learning algorithms. Thanks. So this is, in one way, a follow-up to the previous question. Sure. You talked about false negatives, malware that gets in but isn't detected. Yes. How does the analyst know to add it back in if it

hasn't been detected? We know from the Verizon studies, I think it's 18 to 24 months, that often malware is in the organization before someone notices that it's there. So have you thought about how to get those false negatives back into the loop? We've seen a few examples where large enterprises, after they've gotten popped, go do the forensics and curate a list of all the malware they've seen. So they kind of house that, and if we give them the capability to develop their own models, they can funnel that in from the get-go. But after they've been popped, it's

tough, yeah; that's the billion-dollar question there. So that's a hurdle. Had you thought about how to get over the lag between infiltration by false negatives and then the discovery of that false negative? That's a very tough question. This approach certainly lends itself better to false positive reduction, and that's really important because the analyst workload can be pretty hefty if you have high false positive rates on your classifier algorithm. But even just feeding, you know, we showed we were just feeding false positives in here; we

didn't do false negatives; we had separate experiments where we were doing false negatives, but in this case, even just feeding false positives, we got diversity in both the malicious files it missed and also the benign files it missed. So, up to the mic, did that answer your question? It just kind of occurred to me that maybe the key to making inroads on the billion-dollar question is sort of, I hate to say the VirusTotal approach, but crowd-sourcing: getting all the samples you can and just constantly feeding this beast, whatever it is. Yeah, I mean, you can think of some creative

sharing approaches that we could look into. If you had trusted groups that were willing to share these samples with each other, you could still get the diversity without having to actually communicate with the vendor. There are a lot of different, you know, this road goes pretty deep if we look into potential ways to share data. Thank you. Thank you for your talk, that was good. I have a question; I actually am a vendor who's trying to build a product in this area. Okay. And it's amazing how people are talking about it, and we are talking back in the

lab as well; I'm the founder of the startup, and my background is arite, which most of you probably know. The challenge that we see as a vendor is, like you said, that the algorithms can be built, but the data is different for each environment, and when we try to train the model we need access to that data, which becomes a bottleneck for us. Yeah. Because, like you said, you can't share everything, but when we talk to customers, they'll say, hey, show me the demo; and I can't see their data. So how do we break through that challenge and work together on that? Do you want to speak to feature vectors?

Yeah, I mean, one of the benefits of using the approach that we selected is that we can abstract our data as feature vectors. That's not to say that the features you extract might not have some personally identifiable information in them, but you could design them in a way where it's completely anonymized data. So in that way a user could share data back to the vendor without actually sending the samples, and you could avoid those issues. So I think that's one avenue that could be

used. Alternatively, you give them the capability to generate a model locally with their data without sending any back. Yeah, and that's one of the reasons why we talked about that distributed approach: they don't have to share any data, even feature vectors, back with the vendor, and it really becomes specialized for their environment. Great, so I am looking for people who want to participate; give us your data, we will provide our services and train the model to detect these attacks. If anybody's interested, let me know, I'm in the room. Cool, thank you. Thank you. All righty, I think that's it, thank

you guys very much, appreciate it. Thank you very much.


happen, so please visit their booths out in the chill-out area. I want to mention that we're live streaming this and recording it, so please turn off your cell phone ringers and such. And I think standing in the back is not an issue, but just so you all know, the back is a fire lane; please don't stand back there. Okay, so up next we have Pablo Breuer, who is the director at the Center for Information Warfare and Innovation. All right, thanks, give him some applause. Hey, thanks, thanks for coming out. So I'm going to talk to you a little bit about snake oil and

how we're going to live with snake oil, and the fact that it may be okay as long as you realize that you're buying a little snake oil. So, mandatory disclaimer: these views are my own, they're not those of my employer. Not-mandatory disclaimer: vendors are fine people, so if you're a vendor, please don't come at me with pitchforks and torches after I'm done with this talk. They're infosec professionals just like the rest of us; they do the best they can developing products, but at the end of the day they also need to make a buck. So the last time I gave this talk, at

CircleCityCon, a friend of mine who happened to be a foreigner came up to me and went, hey, great talk, but what does snake oil actually mean? So for the foreigners in the room, let me define snake oil. Snake oil is an American colloquialism, and it's based in the days of the Wild West, when they would sell these fraudulent tonics that may or may not have included snake extract and that had very, very little, if any, benefit whatsoever. It carried on and now refers to any product with questionable or unverifiable content or benefit. So that's what I mean when I talk about snake oil. So, brief scoping here, a

brief explanation of the history of exploitation. What I've got is the defensive measures in black and the quote-unquote offensive measures in red. In 1971 we've got the Creeper virus, developed not with any malicious intent but just because they could, to write some self-propagating code. Then in 1986 Dr. Dorothy Denning comes up with the idea of IDS; in '86 we get the Brain virus, two guys in Pakistan, literally in a mud hut; and in '87 John McAfee, before he decided to become a South American warlord, develops this antivirus thing. And then in 1991 Los

Alamos actually goes through and develops an IDS, and Check Point comes out with the first stateful packet inspection firewall. Then in 1996 we get Smashing the Stack for Fun and Profit; it is not the first time this topic was written about, but it's the one that most people know. You can still go out there and read Aleph One's paper; just remember it's in AT&T syntax, which is just maddening to me, but there you have it. And then here's kind of an interesting thing: in 1997 the return-to-libc paper is written, and it gets largely ignored, and it gets largely ignored because the

old smashing-the-stack thing still worked. So just remember that gets written in 1997. Then in 2003 IDS finally becomes commonplace, right? So it was suggested in 1986, developed in 1991, and by 2003 most companies finally have this thing. Also in 2003, Metasploit gets released publicly and people lose their minds. I think my favorite quote about Metasploit when it came out was, this is like the ice cream man handing out dynamite to kids. I still use that one.

And then in 2005 Intel develops the XD bit, which is to prevent what Aleph One talked about in Smashing the Stack for Fun and Profit, but it's only in the Itanium servers, right? These are about as close as you get to a mainframe without actually being a mainframe, so you only find it in big, expensive businesses. In 2006, the rise of 64-bit architecture means that we finally have stack smashing protection via the non-executable stack, that is, data execution prevention, and address space layout randomization, which is great. So in 2006 we finally mitigate the stack smashing stuff, and now all of a sudden the offensive folks go, hey, that return-to-libc guy was on to something. And so by the time we deploy DEP and ASLR, we already have a technique for bypassing them; shortly thereafter, the next year, we get return-oriented programming,

which is really just an extension of the return-to-libc attack. And so what we get is this nice ping-pong back and forth between the offense and the defense. So the good news is you're all going to be well employed; that's the good news. The bad news is we're going to keep buying products that aren't really going to protect us in the way that we would like them to protect us. So just keep that in mind as we go through this presentation. So why are we here? Most people that work in infosec are not computer scientists. Just a straw poll: how many of you have a computer

science degree? There's a couple of you, okay. If you had to suffer through automata and you don't know why, you'll learn why. Don't kill me, because I'm going to teach all these folks the salient points of automata in about 35 minutes. So why does it matter that most infosec pros aren't computer scientists? Well, computer scientists have to suffer through this class called automata. I say suffer because every major kind of has a weed-out class, and for computer scientists it's automata. It's a very math- and notation-intensive course about computational theory and complexity, and that computational theory allows us to very quickly figure out if something

is snake oil or not. If you pardon the language, it is the ultimate [ __ ] flag, and you get to throw it a lot when you're talking to vendors. Recognizing that snake oil and being able to apply that computational theory allows you to avoid this conversation: the boss comes in and goes, we can't do the breach thing, it's too expensive, we're losing too much, you've got to make it stop. And then what happens is Joe Vendor shows up and says, you need our next-generation box full of pew-pew magic proprietary technology, and it's full of 0-day

detection and it's going to cure all of your woes. And then people that aren't computer scientists allow themselves to hope briefly, and we go, really, you can detect 0-days? And then we get told, yes, DNC, your email will be totally secure. I made these slides up before that whole thing broke; it was just a happy coincidence, I promise. And then they go off and they tell us about their patented pew-pew and magic sauce and snake oil, and you get those conversations where you're not really sure how the product works, and we'll talk about some of that language. And then we throw a bunch of

money at them, right? And the vendor does this with our money, and then 90 to 180 days later an APT comes along, and there goes our network, where APT is defined as anybody that could bypass your magic box full of pew-pew technology. It very well may be a 12-year-old with a dial-up modem and an AOL account, but they got past your next-generation defense, so clearly they're an APT. So, who's been to a vendor page recently and read a product description? Let me know if you haven't seen one of these terms. Holy crap, what does any of that mean? State-of-the-art, adaptive defense, next-generation, threat analytics, multi-layer, hunt, end-to-end, cyber

anything, really; I feel dirty just saying the word cyber. Virtualized, cloud-enabled, threat-centric, digital DNA, big data, software-enabled. And I'm going to pick on SANS here: what the hell is a forensicator? I just want to know. (You forgot machine learning.) Yeah, well, there's the machine learning one, but we're going to get to that. The first time I gave this talk, I gave a very short version that unfortunately didn't go well at BSides San Francisco, and it happened to be at the same time as RSA, so I went to the RSA vendor area, and I couldn't have planned it better. I

saw, what in the wide world... First of all, the marketing people didn't think this through very well, because I don't know that I can say CoIP in polite company; it just sounds dirty. Second of all, cloud: somebody else's computer. Unless you're running Novell NetWare, it's probably going to be over IP, so it just seems a little redundant to me. Anybody here running NetWare, or IPX/SPX? I'm just curious. No? All right, got my BBS going. Don't laugh, I still have WWIV source code on a floppy disk; I'm that old. So when I see Cloud over

IP, the first thing I think is, yeah, the cloud is made of servers, and then I look at the vendor kind of like that, right? I just don't even have words for that kind of language. So how do we get past the marketing speak? Well, we're going to computer-science the hell out of it; we're going to talk about a little computational theory. So there's this guy, you may have heard of him, his name was Alan Turing, and he developed this thing called a Turing machine, which is an abstract machine, a mathematical model used to prove fundamental limitations on mechanical computing. And something is said to be Turing complete if

it's theoretically capable of expressing all tasks accomplished by a computer, and this is any computer: I don't care if it's an IBM mainframe, your iWatch, your iPhone, your Android, your PC, your Mac, this is all computers. And automata theory is this field of discrete math that studies computers and the problems that can and, more importantly, cannot be solved by computers. So, very roughly speaking, automata theory breaks up problems into three categories. There are solvable problems; these are considered easy, and they can be solved in what's known as polynomial time. There are intractable problems, which are hard, and there are undecidable problems, which are impossible; and it doesn't

matter how fast your computer is, it doesn't matter how much RAM or how much hard drive you have, you can use the entire AWS resource pool and you are not going to solve this problem. So I'm not going to talk about the easy problems; we're going to talk about a hard problem and an impossible problem, and then I'm going to show you how those are enough for you to tell that you're buying snake oil from vendors. So, one does not simply solve an NP-hard problem. Okay, so what is an NP-hard problem? I'm going to describe one to you. The most famous one is known as the traveling salesman problem, and traveling salesman, or TSP as it's known,

is explained this way: given a list of cities and distances, what is the shortest possible route where the salesman visits each city and returns to the original city? Seems fairly simple, right? Not that hard. But remember that there's only one optimal path; there is only one correct answer, and that answer is the most efficient. It's not sufficient to be efficient; you have to be the one most efficient answer. And it turns out that this problem is really hard for computers, and the reason it's hard is this. Let's pretend that we have a traveling salesman and just five cities. So when you start out, how many cities do

you have to choose from? You have five cities; you can start at any of the five. All right, I'm going to pick my starting point, I have five cities; then how many do you have left to choose from? We have four, and then three, and then two, and then one. What mathematical function does that look like? Looks like a factorial, right? So it turns out that the running time is factorial, and that is non-polynomial time: the number of cities is n, and the running time is a factorial of the sample size. So the problem seems relatively simple, but it's factorial time. So let's think about this:

if you have to generate the most efficient path to go from New York to San Francisco, there are 19,354 cities in the United States, and you have to try every possible combination in order to get the optimal answer. If you do that factorial and you do one calculation per millisecond, it takes you 110 centuries to solve, 11 millennia. Who wants to hang out for that? Anybody? No one. There's always one, right? So, a deceptively simple problem; it's certainly easy for me to describe, but there are a lot of calculations involved.
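The factorial blow-up is easy to see in code. Here's a tiny brute-force TSP solver, a sketch of why exhaustive search only works for a handful of cities; the distance matrix is made up for illustration.

```python
# Brute-force TSP: try every ordering of cities. The number of orderings
# grows as n!, which is why this is hopeless beyond tiny inputs.
import math
from itertools import permutations

def tour_cost(dist, tour):
    # Sum edge costs around the cycle, returning to the start city.
    return sum(dist[a][b] for a, b in zip(tour, tour[1:] + (tour[0],)))

def brute_force_tsp(dist):
    cities = range(len(dist))
    return min(permutations(cities), key=lambda t: tour_cost(dist, t))

dist = [[0, 1, 9, 9],
        [1, 0, 1, 9],
        [9, 1, 0, 1],
        [9, 9, 1, 0]]
best = brute_force_tsp(dist)
print(best, tour_cost(dist, best))

# 5 cities: 120 orderings; 20 cities: already ~2.4 quintillion.
print(math.factorial(5), math.factorial(20))
```

Even at one tour evaluation per millisecond, 20 cities would take tens of millions of years, which is the speaker's 110-centuries point at a much smaller scale.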

we want to solve in computation uh optimal routing right I want my packs to get from point A to point B in the most optimal path well that's a tsp problem uh detection of manin the- middle or race conditions that's a tsp problem resource use optimization search algorithms crypto problems uh basically anything where you want optimization is going to be reduced to ATS P problem yes sir if you don't know the optimal routing how do you know that you don't have a man in the middle attack yeah okay so uh if solving traveling salesman problem is hard then you know solving all of these problems is also hard to do via computation so right at this point
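As an aside, the assumption-driven shortcut that comes up next can be sketched as a greedy nearest-neighbor heuristic: quadratic work instead of factorial, at the price of a merely workable tour. The city distances here are invented, and this is one illustrative heuristic, not something from the talk itself.

```python
def nearest_neighbor_tour(dist, start=0):
    """Greedy heuristic: from each city, drive to the closest unvisited one.
    O(n^2) work instead of O(n!), but the tour is merely workable, not
    guaranteed optimal."""
    unvisited = set(range(len(dist))) - {start}
    tour = [start]
    while unvisited:
        here = tour[-1]
        nxt = min(unvisited, key=lambda city: dist[here][city])
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

# Made-up symmetric distances between five cities.
dist = [
    [0, 2, 9, 10, 7],
    [2, 0, 6, 4, 3],
    [9, 6, 0, 8, 5],
    [10, 4, 8, 0, 6],
    [7, 3, 5, 6, 0],
]
print(nearest_neighbor_tour(dist))   # [0, 1, 4, 2, 3]
```

On these particular numbers the greedy closed tour costs 28, while checking all 24 orderings finds 26: workable, not optimal.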

somebody goes: hey, look, I drove here from wherever, I used a GPS, and it got me here, and it definitely did not take 11 millennia. So let's examine how that happens. GPS works because it makes a bunch of assumptions on your behalf, and some of those assumptions make sense. Your starting point is going to be right where you are; it's not going to be any of those 19,000 cities. It's going to make assumptions that interstates are faster than rural routes, and rural routes are faster than farm roads. Then you're going to tell it: don't show me toll roads, and don't show me ferries. And it's going to assume things like loops are bad. Those assumptions drastically reduce your calculation space; they drastically reduce the sample space, and so you get a workable solution. Notice I said workable; it's not optimal. Well, in the security space, if it's not the most optimized solution, you're going to have type one and type two errors; that means you're going to have false positives and false negatives. So who's ever followed their GPS and ended up with something like this, where it goes, yeah, turn left here, and you're looking at a building? Right. That's when those assumptions fail. And it's okay most of the time, because most of us aren't playing Pokemon Go while we're following GPS directions; we can actually look left, see the building, and not turn into it. So traveling salesman is hard, but you can solve it if you make some assumptions. Those assumptions have to be good assumptions; but how does that saying go about assumptions? Yeah, I'm not going to say that because I'm being recorded, but yeah, something like that. So if a vendor tells you that they're solving a hard problem, they're making assumptions, and you probably need to figure out what those assumptions are. So ask the vendor. That's a hard problem. Who wants to see

an impossible problem? It turns out that the impossible problem is actually easier to explain, in theory. Now, remember, when I say impossible problem, this means that no computer following the Turing model will ever be able to solve this, period. It's not a Moore's law problem, it's not a resource problem, it's not a quantum computing problem; until we do something fundamentally different than what Turing suggested, we're not going to solve it. So here it is: given an arbitrary Turing machine M with input alphabet sigma, let omega be an element of the set of all possible strings over sigma; will M halt on input string omega? Who remembers this from automata theory? Awesome. The rest of you are
looking at me like this. All right, let me draw you a picture. Here it is. There's your Turing machine M and your input omega; you're going to compute, and then you've got a decide state. Given that, if the Turing machine's state is decidable, then this is a decidable problem. No? No standing ovation? All right, KISS principle, keep it simple, stupid. Here we go: given a computer program and an input, you will not be able to determine, via computing methods alone, whether it will ever finish running. I give you a program, I give you an input to the program, you feed it
to some analysis computing machine, and it's never going to be able to definitively tell you whether the program will finish running on that input. Why is that? Well, let's consider the options. You've got this big analysis computing machine running on the AWS cloud somewhere, you feed it the computer program, you feed it the input, and it finishes. Woohoo, success. But what if it doesn't finish? It might be stuck in an infinite loop, or it might be running properly and just need more computing time, and you never really know if it needs more computing time. So you wait an hour and you poll it; maybe it needed an hour and
ten minutes. You wait a week; maybe it needed a week and one day. You never really know. Or possibly it just hasn't accepted the upgrade to Windows 10, I'm not sure. Either way, you don't know why it didn't finish running; you just know that it hasn't finished running. So this is basically a logic problem. There's no way to get around it, and it's not a computing issue; I'm always going to be able to develop some program and some input that runs a little bit longer than you're willing to wait. Okay, so certain problems are provably undecidable, and therefore impossible by computing, and so they're going to
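The wait-an-hour, wait-a-week dilemma can be made concrete with a toy step-budget runner; the function and the two example programs are invented for illustration. The only honest answers it can give are "halted" and "unknown".

```python
def run_with_budget(step_fn, state, budget):
    """Advance a program one step at a time. Returns ('halted', steps) if it
    signals completion (state becomes None) within the budget, otherwise
    ('unknown', budget) -- which never distinguishes 'needs more time'
    from 'loops forever'."""
    for step in range(budget):
        state = step_fn(state)
        if state is None:
            return ('halted', step + 1)
    return ('unknown', budget)

countdown = lambda n: None if n == 0 else n - 1   # obviously halts
forever = lambda n: n + 1                          # obviously never halts

print(run_with_budget(countdown, 5, 100))   # ('halted', 6)
print(run_with_budget(forever, 0, 100))     # ('unknown', 100); a budget of a
                                            # million would say the same thing
```

No matter how large the budget, a non-halting program and a slow-but-halting program look identical from the outside, which is exactly the speaker's point.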

require substantial assumptions to be made, and the same goes for hard problems. So vendors are going to come to you and tell you that they solve one of these hard or impossible problems. Can anybody tell me what the halting problem is analogous to? Yeah, antivirus. This code is definitely bad; it's definitely malware. Well, if you've got a signature, then sure, it's malware; but if you don't, then I don't know. I can't tell if it's going to halt, let alone if it's bad. So when they claim they solved a difficult problem, hard or impossible, let's just assume that they have. I'll give you the benefit of the doubt; you're all fine
individuals, you're infosec professionals like I am. Then I'm going to imagine that I can take that vendor solution and package it in a chip, and I'm going to use that chip to see if I can solve an undecidable problem. And if I can use your proposed solution to solve an undecidable problem, then I'll have a proof by contradiction, because I know I can't solve an unsolvable problem; and if your algorithm solves an unsolvable problem, you also didn't solve your problem. So the vendor says: we detect 0-days. Now, let's be fair to the vendors. Actually, that's not true; I've heard one vendor
actually tell me: our new next-generation product finds all 0-days. Normally vendors are a little bit smarter than that; they leave themselves an out and say we detect 0-days. We want to hear we detect all 0-days, but they rarely say that. So let's assume it's true. Let's assume they've got this 0-day detector with this magic inside, and we're going to put that magic 0-day detector inside of the halting machine, and I'm going to define all halting states as safe code and all non-halting states as malware. Cool, I solved the halting problem. But I can't solve the
halting problem; and if I can't solve the halting problem, then you can't give me a 0-day detector. That's just not how it works. So that's what's called a proof by contradiction. Except that earlier this year, April 19th, MIT releases this thing saying MIT builds an artificial intelligence system that can detect 85% of attacks. MIT's got some smart folks. Who believes that? Anybody here believe that? All right, so I got forwarded this by, like, 600 people; maybe not quite that many, but I'd given this talk before, and I got forwarded this: hey, look, the MIT guys figured stuff out; they're smarter than you are. And I went: okay, yeah,

they're smart. I didn't go to MIT; I went to a different school. But let's read the fine print. So I went to the MIT homepage, because I thought surely the mission is accomplished and I'm out of a job now, and I read their assumptions, and it says: it presents its findings to a human analyst, and the human then identifies which events are actually real and which ones aren't. So I'm just imagining this big box with blinky lights, and there's a [ __ ] inside of it that takes a slip of paper and goes: no, this one's real, and this one's not. So,
really? You too, MIT? Come on. I just can't. But never mind the details; let's just go with that 85% thing. 85% is pretty good; that's a solid B. So let's say one hundredth of one percent of the people on the planet can write you one 0-day a year. Seems like a reasonable number: one hundredth of one percent, one 0-day a year. So let's look at China, which has 1.35, almost 1.36, billion people. One hundredth of one percent write you a 0-day, and 85% of them are detected. So they write 135,000 0-days a year; we detect 85%, that's about 114,750, and that leaves over 20,000 unidentified and unmitigated 0-days
a year. And that assumes that we block 100% of known attacks, and that those humans receiving the input make instant decisions and are always right, because humans are infallible; that's why we have computers. So 85% is really not so good. So what about malicious traffic and TCP/IP? I picked on AV; let's talk about TCP/IP. You know those next-generation firewalls? How many next generations are we up to? This is actually my third next generation: we had the first generation, then we had next generation, which was second generation; I can't wait for the next next generation. It turns out that TCP/IP is Turing-complete, and Turing-complete means the halting problem

applies; and if the halting problem applies, you can't look at packets and positively identify all bad traffic. You just can't; it's a halting problem. So that next-generation firewall is going to make things a little bit better, but it's just not a panacea; it's just different kinds of signatures that are going to be bypassed in new and different ways, or in old and known ways. But if that wasn't bad enough, we've got this thing called the von Neumann architecture. Most commercial systems, with one or two exceptions, have this key feature in their architecture where instructions and data share the same code space; and on top of that, instructions and data are
represented with the same assembly-level language. So given an arbitrary hex 41 on an x86 architecture, it may be a capital A, or it may increment the ECX register. And every exploit, every exploit, takes advantage of the fact that you're expecting data and I'm feeding you instructions. In the case of return-oriented programming, the instructions I'm feeding you are jump instructions, where I find arbitrary gadgets of assembly code already existing in your memory space, from completely legitimate programs, and dynamically build my payload on the fly. So even data execution prevention and address space layout randomization are not going to stop us; and that's just kind of where we're
at today. So it's pitch black, and you're likely to be eaten by a grue; for you old people, there was this text game called Zork, you should look it up. So here's where we're at. We can't optimize resources, because the traveling salesman problem says we can't. We can't identify malicious logic, because the halting problem says we can't. We can't tell if our program will even stop, let alone give us a correct answer, because that's also the halting problem. And it's worse: we can't even tell if what we're reading in memory space is data or an instruction. So, anybody want to give up? The bar's right over
there. So how do we live with no 0-day detection? The vendors will claim they can detect some zero-days using signatures and heuristics, which are really just statistical models, and that's fine. The way they do that is by making assumptions, and they make those assumptions to reduce the problem space so they can give us a workable solution. Again, remember that a workable solution is non-optimal, so you're still going to get those false positives and false negatives. So we, as consumers of those products, need to talk to the vendors and ask them: what assumptions are you making? Then we need to validate
that the assumptions they're making are sound and make sense in our operating environment. So let's walk through an example. A vendor walked into my office a couple of months ago and said they have this next-generation vector-based algorithm for behavior-based network modeling that can detect all malicious activity on the network. What does that mean? All right: next generation is a meaningless marketing term; it means nothing, it's a null. Behavior-based modeling means heuristics, or a statistical model. And vector-based algorithm means it's directional. Okay, so is it a valid claim? Absolutely not; they are not going to detect all malicious traffic. Can they detect some malicious traffic? Yeah, probably. Is it useful? I don't know; it depends on your operating environment
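Since the heuristics point keeps returning to type one and type two errors, here is a toy score-threshold detector; all scores and labels are invented. It illustrates why a non-optimal decision boundary produces both error types no matter where you set it.

```python
# Toy anomaly scores for traffic we secretly know the label of.
# (score, is_malicious) -- all numbers invented for illustration.
events = [
    (0.10, False), (0.20, False), (0.35, False), (0.55, False), (0.80, False),
    (0.30, True),  (0.60, True),  (0.75, True),  (0.90, True),  (0.95, True),
]

def confusion(threshold):
    """Flag anything scoring above the threshold and count the two error types."""
    fp = sum(1 for s, bad in events if s > threshold and not bad)  # type I
    fn = sum(1 for s, bad in events if s <= threshold and bad)     # type II
    return fp, fn

for t in (0.25, 0.50, 0.70):
    fp, fn = confusion(t)
    print(f"threshold {t}: {fp} false positives, {fn} false negatives")
```

Because one benign event (0.80) outscores one malicious event (0.30), no threshold zeroes both columns; shrinking one error type grows the other, which is the workable-but-not-optimal trade-off.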

but let's see. Let's talk about an ICS SCADA system. Why are we going to talk about an ICS SCADA system? They are physical systems, so there's a computer component and there's a physical component. I don't know anybody that has anything nice to say about the state of security in ICS SCADA; it is horribly, horribly broken. ICS SCADA systems are still learning rules that the PC industry figured out in the early 1980s. They're difficult to patch, if they're ever patched, and they're missing traditional defenses. When you look at ICS SCADA, these are all old Motorola controllers, in some places old ARM controllers. They don't have data
execution prevention, they don't have address space layout randomization, they don't have antivirus, they don't have IDS; they don't have any of that stuff that protects us on our Windows and Mac networks. And the protocol, while it's well understood, is still Turing-complete. So basically these ICS SCADA networks are about as bad as it gets; let's see if we can make it a little bit better. What are some protocol assumptions? Well, again, we understand the protocol, but it is Turing-complete, so the halting problem means we're not going to detect all bad traffic. But ICS is still a fairly robust protocol, and we are probably not going to use all of the ICS protocol
messages in our own unique implementation. So let's just look for the ones that we're not going to use and cast those out as malicious traffic. I'm also going to make the assumption that my ICS SCADA network is my ICS SCADA network, so I'm not going to see a whole lot of internet surfing. I shouldn't see any internet surfing; I shouldn't be seeing managed code like Java or Flash or Shockwave, or all of those things that get you pwned on a daily basis; and I'm going to assume that I'm not going to have any Pokemon Go on this thing. Those are my protocol assumptions. Then these ICS SCADA systems have physical components as well, and those physical
components are pumps and valves and pipes and generators. These are big honkin' industrial systems, engineered to very precise failure tolerances, and they have physical limitations. When they're engineered, they're engineered to work within certain heat ranges, within certain pressure ranges, within certain speed ranges, and to heat up and cool down at very, very precise rates. So anything outside of that would be anomalous, and I should keep track of it. And then I'm going to make some operating assumptions as well. I've got SOPs; the industrial process is going to be well understood. Again, these are large, very expensive factories with very
large and expensive machinery, so we have standard operating procedures and actions, and anything outside of those actions should be anomalous. Now, there are times when we go into emergency response procedures, and that allows us to break out of our normal SOPs, but again, we should know that we're doing something anomalous. And not every component in my industrial complex needs to talk to every other component; if the wrong valve is trying to send a message to the wrong pump, we've got a problem. So when you put those three operational circles together, your physical constraints, your valid protocol messages, and your procedural constraints, it turns out that their intersection is really the space that we
need to operate in. That is really all we need to monitor, which is a heck of a lot less space than all of this. So that vector-based heuristic engine? It might be okay; it might make things a little bit better. So, key lessons. Almost no security problem of real interest can be solved optimally by automation alone. We're going to need security analysts; we're going to need smart people that read our IDSes and IPSes and understand what's really going on. We need people that understand the technology and can actually make sense of what our sensors are telling us. We can solve portions of these problems by

making assumptions and understanding the risk; but when we make those assumptions, we need to understand that there is some inherent risk in making them. Different vendors are going to make different assumptions, and that's actually a good thing for us. I know we've all been hearing defense in depth for a long time, but if you buy from multiple vendors and they all make different assumptions, then maybe vendor B can catch what vendor A's assumptions miss. It makes things harder to manage, I understand that, but it does make things safer; and security is not free. Understanding the fundamental limits of computing is going to help you identify snake oil. Really, you can get by
on TSP, von Neumann, and the halting problem. If you understand those three, you can sit across from a vendor, and when they tell you that they detect all 0-days, you can go: really? Halting problem. And if they don't know what that means, tell them to bring you an engineer or computer scientist that does. Automation is great; it's going to help you reduce the noise, but it's just not going to solve the problem for you. So for the vendors: please, please, please be honest in your claims. Send me an engineer or a computer scientist; I have questions. Salespeople are great for touching base with your contacts, they're great for making new
connections, they're great for maintaining connections, but if I'm going to spend a lot of money on a product, I want to talk to somebody that can tell me what's under the hood, because I want to peek under it and I want to kick the tires. Your assumptions and methodologies are not secret sauce. If you need me to sign an NDA so that you can tell me how your product works and what assumptions you make, then by all means, I've got no problem signing an NDA. But if you're not telling me these things, I'm probably not going to buy your product. And for goodness' sakes, quit selling me marketing speak. I am tired of
reading an entire book full of literature and not understanding what your product actually does afterward. That next-generation, cloud-enabled, software-defined, heuristics-based, virtualized, threat-centric, multi-tier, end-to-end, analytics-adaptive, cyber-defense, cloud-over-IP, pew-pew product is not worth the glossy paper it's printed on. And if you think that's overkill, please go read some of these vendor pages; you will get really excited after reading about two pages of text and realize that you still have no idea what that product does. So, for those of us that buy these products: the vendors actually have some pretty smart solutions, but you have to understand what the limitations are. If they're making claims that are
too good to be true, they probably are. Ask them how they're reducing the problem space, ask them what assumptions they make, and then, more importantly, validate those assumptions; make sure that the assumptions they made make sense in your operating environment. Learn which problems can't be solved by computers, and which can't be solved easily. And realistically, human operators and hunting, combined with defense in depth, is still your best solution; it's going to be your best solution in 10 years, and it's going to be your best solution in 20 years. So the next time a vendor comes to you and tries to sell you Brawndo the Thirst Mutilator, ask them to tell
you again how they solve the halting problem. So, with that, are there any

questions? So, I think one of the previous speakers mentioned the benefit of combined machine and human analysis; how does that play into the assumptions, and how does it help balance out what the machine's doing and improve the overall outcomes? I absolutely agree. I was actually here for that presentation, and that's actually right here in this slide, where I talk about the good system. What he talked about was automation with a good system: here's your good system of understanding the assumptions and understanding what the machine's going to decide for you, and then those human operators actually going through and validating what the sensor is telling you. So I absolutely
couldn't agree more. So the idea would be that the machine is acting as a first-level filter to prioritize things for the human? Yeah, that'd be a way to articulate it. Absolutely, the machine is a first-layer filter to reduce the noise and allow the human operator to spend time looking at the things they really need to spend time looking at. The low-hanging fruit, by all means, let the automation take care of that, and then the people can spend their time chasing down worthwhile
threats. Do humans suffer from the same problems that the machines do, as far as the halting problem and such? Humans suffer from different problems; humans suffer from a lot of problems. So, no. A human can make the decision that if you haven't told me this thing has halted by now, I'm not interested in it; a human can make that decision, and the machine really can't. The machine can do it if a human tells it to. The problems humans have are that they suffer from boredom, in some cases a lack of education and training, and in some cases, for most of us, a lack of sleep. So again, it's one of those things where the assumptions of the machine and the assumptions of the human operator kind of back each other up, because we suffer from different problems. Yes, sir? Yeah, thanks for the talk. I think you're not really going far enough in your criticism of these products. One big aspect that you haven't mentioned at all is that these products can introduce risk, and those risks increase with complexity, because what we're seeing with a lot of these next-generation things is that the scanning engines become so complex

that they introduce a whole lot of vulnerabilities. There was some interesting stuff by Tavis Ormandy, where he found that some antivirus was emulating what a program would do, and if it couldn't emulate a function it would call the one from the system, and thereby you could get code execution through the antivirus. And also, you're totally right that you're hitting these theoretical borders if you're trying to get perfect detection, but I don't think you're hitting any theoretical borders in trying to build a system
that just doesn't execute foreign code. So my question would be whether the whole detection approach isn't the wrong way to go, and whether we should instead design a system that is secure by design; I think there's a lot of interesting research coming from the langsec community, which doesn't get nearly the amount of attention it should. Yeah, so I agree with just about everything you said. As code becomes more complex, and that's any code, not just antivirus, you're going to have more mistakes, because people write the code. It's funny: when you look at antivirus, it actually uses a lot of techniques used by exploit code, like library call hooking and kernel-level interrupts and those kinds of things. That was all developed originally by malware authors, and it's now being used by antivirus firms. A
lot of the things that EMET does for Microsoft, and I'm a big fan of EMET, use a lot of malware techniques in order to detect malicious code. To answer the other question about detection: the langsec community is doing some great things, but until I can tell the difference between an instruction and data, because I've got two entirely different memory setups for those, we're just barking up the wrong tree. As soon as I mix data and instructions, and they're both represented in the same language, I'm stuck; I can't tell the difference between data and instructions. So your program expects data, I actually feed it instructions and convince the kernel to
execute those instructions on my behalf. That's the real crux of the problem with the von Neumann architecture. So we'd just need to change the construction and architecture of every computing platform we currently use. But no, it's not unsolvable; none of these are. No, I agree with you; it's just a matter of, somewhere along the line we chose to do it this way for a reason, and we need to go back and

re-verify the architecture. When they designed it, they weren't really thinking about the future of infosec; I think they were just trying to think about how to crunch numbers as quickly as possible with as little hardware as possible. So you're right, this may be a time to go back and re-examine those assumptions. Please. So, sorry, I'm going to break up the tech talk with a sort of soft-skills people question related to this. Yes, please. So you mentioned demanding to talk to an engineer, and I'm just curious, again for tips for people: if you have to deal with this and the vendor is reluctant to do that, what's
considered a deal breaker? When do you pursue it harder? How do you deal with the pushback if they don't want to cooperate with that; more on the soft-skills side, how do you deal with that? Great question. I have found that as soon as I tell the vendor, okay, I'm done talking to you, I'm not going to consider your product, they're pretty much willing to give me whatever I want. Now, I have had some vendors go: look, I'll discuss this with you, but you need to sign an NDA; or I'll suggest: look, if you need me to sign an NDA, I'm happy to do that.
But normally vendors are pretty good once they realize they're not going to get any further with you. If you're a small business it's much harder, but if you're a larger corporation, yeah, it's amazing what they'll do when there are hundreds of thousands or millions of dollars on the line; and I've never had one significantly push back on that, not even in government, and that's saying something. Any other questions? Okay, well, with that, here's your vendor bingo card to take out there through the vendor space. Thank you so much, and I will see you
around.


Welcome back to the Ground Truth track at BSides Las Vegas 2016. I just wanted to take one more opportunity to thank all our sponsors, because we couldn't make this happen without them; please visit their booths out in the chill-out area. Let's see: we're live streaming and recording this, so please turn off your cell phone ringers and whatnot, and please don't stand in the back, because it's a fire lane. Up next we have Leila Powell, who is a security data scientist at Panaseer. Thank you. All right, thanks. So, I'm sure many of you have heard the term data science appearing more and more in an infosec context, and
at the moment, as we've heard from some other speakers already, the focus currently seems to be on machine learning. Now, machine learning is a great family of algorithms and can be really powerful, but that's all it is; unfortunately, we seem at the moment to be missing any coverage of the broader discipline of data science. This can be problematic for a number of reasons. First of all, similar to the talk we heard just before, people can get taken in by the hype of advertising around machine learning and just think it's a magic bullet for all their problems, if they don't understand the work required before and after to make the solutions
robust. Also, people may think that they can just start applying machine learning algorithms ad hoc to data, without the experience to handle the data properly. But for the purpose of this talk I want to focus on one of the other areas, which is the fact that not spending any time looking at data science as a discipline means we're missing out on some of the benefits that applying the discipline of data science to infosec can actually bring. In the last year I've been working with financial services companies, trying to help them bake a data science approach into their data analysis in infosec, and in particular we've tried to help them with a couple of problems.

First of all, communicating the data: it can be hard in infosec, because there are lots of different stakeholders, to get everyone to agree on what the truth of the situation is, and in that case you then lose trust in the data analysis. So today I want to talk to you about why data science is a discipline, what that involves, and how applying data science as a discipline to infosec can tackle some of the challenges I've seen people face over the last year. Okay. So, data science is a discipline. Essentially, what I mean by this is that it's a way of doing things; there are principles that govern how you should do stuff. It's not just a bunch of
algorithms that we throw at things. Data science, like many professions, is often misunderstood: people have one idea of it, but actually, when you get involved, there's a lot more to it; and I'm sure those of you in the audience that are more on the infosec side can sympathize with this problem. So let's talk about the principles of data science. I've broken them down into three areas, and the important thing here is that they all build on top of one another. The first one, data exploration and preparation, is required as a foundation for everything else that comes afterwards. So I want to give you a
little bit more information on each of these areas. Okay, data exploration and preparation. This involves a series of principles. The first one is understanding what questions can be answered by your data set. If you take any set of numbers, you can look at the distribution, you can calculate a median, you can do anything to it and some other numbers will come out; but you need to understand what information is contained in the data and whether the question you're asking can actually be answered. The second one is domain knowledge, and this was touched on in the first talk this morning, which suggested that sometimes data scientists working in security
might lack the domain knowledge to do the work. But actually, we should be getting this from the data: if we understand and work with the data properly, we can learn what we need to know to answer the questions from the data set, but we have to be careful to do it thoroughly. The next point is taking a look at the metadata, by which I just mean data about your data. When we take a set of data, say from a database, we don't just assume all of it's valid; we want to look at the timestamps governing when that record was updated, where it came from, whether it's still valid now, and whether it was

relevant six months ago, can we still use it. The next point is around quirks in the data. At some point, whichever database you're extracting data from was designed by another human who might not have thought exactly like you do. So when you export data and start to work with it yourself, you have to be careful not to make assumptions about how that data is structured, how it's updated and how it's stored, otherwise it can lead to misunderstandings in interpretation. Then we need to think about completeness. This is slightly related to the first point, what questions can we answer: if we're looking at a population of users or assets, we need to really

think about how well that population is represented in the data set we have. If we only have 10% of our assets in the data set, any conclusions we draw won't really be valuable. And finally, I've thrown in simple stats here. I think this is often overlooked; people often want to jump straight to some really advanced algorithms, but actually, if it's the first time you've analyzed a data set for something beyond its operational purpose, some basic statistics can be really valuable and reveal quite a lot, and you should probably be doing it anyway to check you have a feel for your data before you start to do anything

else. The next area of data science is applying the algorithms. Now, you will hear a lot of great talks at BSides today and tomorrow around some of the specific algorithms, machine learning and other stuff, how the things actually work, but today I want to focus on, as I said, the discipline of data science and the things you have to be careful of when you apply these algorithms. The main one I'd say is understanding which algorithm is appropriate, first of all, for the data set you have. If you want to apply a statistical test or a machine learning algorithm, you need to think: what does my data look like? Some things

will assume a normal distribution; do you have one of those? If your data set is skewed, are you applying the right machine learning algorithm to it? The next thing is your use case: what is the question you're trying to answer, and what is the appropriate algorithm to provide you with the information to answer that question? This is particularly important for the scenario I'm talking about, which is not doing data science as a researcher but actually working with an infosec team in a company. We can't just look at stuff that's interesting; we have to look at stuff that's useful. So the use case, and what we're actually going

to do with that afterwards, is really important. And finally, the thing we want to consider is the level of accuracy required: how accurate a number is appropriate, how much time do we have to reach that solution? We can't say we'll be 90% accurate in six months, because the infosec team we're working with needs something a bit sooner. And finally, on to communication, which is possibly one of the most neglected areas of data science in infosec. A few of the principles of doing communication well: first of all, balancing what I've called caveats versus usability. We'll talk about this a bit more later; essentially what I mean here is giving someone enough

information that they have relevant context for their use case and the decision they have to make, but not giving them so much information that they're completely overloaded and don't know what to do. Next, we need to look at the perspective on the data that's appropriate for the different stakeholders. Infosec data has a lot of different stakeholders with different roles and responsibilities, and you can't show one plot to all of them to help them do their job. Then we need our insights to be actionable, something I touched on earlier: when we present results from the data, it needs to be something that someone can do something about. There's no point just highlighting a bunch of

bad stuff and leaving them to it; it's not really useful. And finally, we need to be careful that the probability of someone misinterpreting the way we've presented the data is low. So it's our job as data scientists, and this is what you should expect from a data scientist if you're working with one: to make sure you understand what you're presented with. They shouldn't just be throwing stuff over the fence and leaving you to get on with it. I just wanted to jump back to machine learning briefly to say where that fits in. It's in this algorithms section, and what I

want you to take away from this slide is that if you're going to apply machine learning yourself, for example if you enjoyed the first talk this morning and you'd like to have a go with some of those algorithms, and you're actually going to use what you find to make decisions, make sure you do all this first bit, all the data exploration and preparation. You need to do all that as well, and you need to be able to communicate the results. And secondly, if you're looking at tools, back to the kind of crazy vendor claims again, look at what data they tell you they need to have for their solution to perform as they've claimed. Do they need

to collect data for six months, nine months? Which data sets do you need? How clean does the data have to be? Because your solution is only going to be as good as the data it's built on. So today I'm going to focus on the bottom and the top of this tower of data science, simply because there'll be a lot of focus on algorithms in other talks, and I think often we forget how crucial the foundations are, and actually doing something with the data afterwards. So I want to talk through how we can apply these principles of data science to infosec, and I'm just going to use the example of

vulnerability analysis. I want to start off with the data bit, if you like, getting the strong foundations. Now, if you're analyzing vulnerability data simply using the tools provided by your vendor, by the vulnerability scanner, so just logging into the web interface or using their reporting module, then you don't really need to worry so much about all this data science stuff, because you're just looking at pre-created plots made by the people that built the database, right? So this should all be fine. It's when you want to do something beyond what you can do in that tool, and I've seen a lot in the last year, people end up

exporting stuff into Excel and trying to do additional reporting, and the problem when you do that is you need to have some kind of framework and a way of handling that data, otherwise you get into all sorts of trouble. So if you're happy with your vendor tool, don't worry about it; if you want to export the data and do something with it yourself, here's how we can go about that. Starting with the strong foundations: I want to set the scene with an example problem that I've seen in the last year. We talked earlier about how there are a lot of stakeholders in infosec, and in vulnerability data there are a lot

of stakeholders too: you have the patching teams, the vulnerability manager, you've got the CISO, you probably need to report up to the board as well, and all those people need to pass on information about what's going on. So suppose your CISO has to report on the vulnerability situation to the board, and they've got a nice trend line here of the number of vulnerabilities over time. You can see something's happened, right? There's a big spike. So we know the what, but we don't know the why just by showing this plot. If they have to go and defend this in a board meeting and say something about it, from this data alone we have no

idea what's causing this spike. So one of the key points is to make sure we actually measure something meaningful. In vulnerability data we often start off seeing a number of vulnerabilities reported, whether that's in internal reporting, I've seen that, or you log into your vendor tool and that's the big number, the first thing you see. But there's a lot of complexity in this, so I wanted to walk through what goes into that. Imagine I were trying to help the CISO understand that trend so they can explain what they've seen, give everyone a bit of confidence so they're not panicking. We're going to have to

understand what builds up to make this number. Now, some of you in this audience might be very familiar with vulnerability data, so what I say next might be kind of obvious, but the point I'm trying to make is that anyone that's not working with a given data set day in, day out won't know all these hidden complexities. So if you're trying to communicate about your vulnerability data to your CISO, who might not have been hands-on with the data, they won't know all these things that seem obvious to you. Similarly, if one of your colleagues works in a different area of infosec but you've got

to work together, you might not know all the subtleties of their data. So I wanted to break this down so we're all on the same page about the kind of complexities that can be hidden in those sorts of trends and a basic number. The first thing I wanted to talk about was naming conventions. When I started working with vulnerability data a year ago, I was really surprised that two things were called vulnerabilities. If you're writing a bit of code you don't want to call two variables the same thing; same in algebra, same in a lot of stuff. So it turns out that, as many of

you probably already know, a vulnerability is a flaw in software that has a unique CVE ID or a unique vendor ID, but then if we're referring to an instance of that vulnerability on an asset, well, that's also a vulnerability. You get used to this, it's fine, but it's actually pretty confusing. It sounds like I'm being a pedant, and well, I definitely am, but there is a point to this. I was working with one vulnerability manager, and she would have to go to the CISO and explain the vulnerability situation, and you'd have these conversations where you'd say, right, we've got 32,000 vulnerabilities, but we've got less work than last time because we've actually

only got a hundred vulnerabilities. I mean, it just sounds crazy, and it's a really hard concept to explain. So for the sake of clarity I'm going to rename that and call it detections: vulnerabilities are things with a unique ID, a detection is an instance of that vulnerability on an asset, and this actually makes reporting on it and discussing it with lots of different people that don't work with the data a lot easier. So it's worth thinking about language as well as the whole maths and data handling thing when you're trying to communicate with people that aren't in the weeds of the data every day. So let's start to break down this

number. If you remember from the principles of data science, one of the first things we wanted to get was that domain knowledge, so we're going to start looking at what this number means. Okay, we've got 32,000 detections; let's break it down. We've got, say, 25,000 unchanged since the last time we scanned, so that's the big part of that number we saw on our original plot that we showed to the CISO. Then we've got a bunch of new ones, a bunch of open ones, we've accepted the risk on some so that's gone away, we've closed a bunch, and this all adds up to make that number. Okay, that's fine, but we still

can't really do anything with this, so let's break it down to another level of detail. It now starts to get a bit more complicated, and in the interest of time, since this is the last talk of the day, I'm just going to focus on one branch as an example; it's a bit easier to read. Let's take a look at the new detections that have come in. We'll have some that have come in because we've got newly published vulnerabilities: you've had Patch Tuesday, a bunch of stuff has been released, you run a scan, okay, that's all now detected on your machines. You have some, actually, we've

seen, that will come in because they're from an old vulnerability that's been newly detected on your estate, so maybe something from 2012 suddenly pops up on a bunch of assets; that's potentially interesting. Now, as we talked about before, a detection is a combination of a unique vulnerability and an asset. So we've looked at the causes related to the number of unique vulnerabilities changing; what about the assets? Well, suppose you're also rolling out a program to scan more of your estate, so you now scan a new subnet. There are more assets, so now you've got more detections of vulnerabilities, but that's actually a good thing, right? You're scanning more assets, you're

getting more coverage; this is a good thing, we're not worried about that. Also, maybe you actually buy some new workstations and bring those online in an area that was already being scanned. You've got more machines, so again this is feeding into this increase, but it's not something to worry about. So I think what we're going to focus on is the first two, the things related to the vulnerabilities. But before we go any further, there are a couple of other things we haven't done from the principles of data science, so let's pause here and check the validity of our data. Let's look at the metadata.
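The naming split and the breakdown just walked through can be sketched in a few lines. This is a minimal illustration, not the speaker's actual tooling; all IDs, hostnames and statuses below are made up:

```python
from collections import Counter

# A "vulnerability" has a unique ID (e.g. a CVE); a "detection" is that
# vulnerability observed on a specific asset. Sample data is invented.
detections = [
    {"vuln": "CVE-2012-0001", "asset": "host-a", "status": "new"},
    {"vuln": "CVE-2012-0001", "asset": "host-b", "status": "new"},
    {"vuln": "CVE-2016-1234", "asset": "host-a", "status": "unchanged"},
    {"vuln": "CVE-2016-1234", "asset": "host-c", "status": "closed"},
]

n_detections = len(detections)                  # four detections...
unique_vulns = {d["vuln"] for d in detections}  # ...but only two vulnerabilities

# Break the headline number down by status, as in the tree diagram;
# the branches must sum back to the headline number
breakdown = Counter(d["status"] for d in detections)
assert sum(breakdown.values()) == n_detections

print(n_detections, len(unique_vulns), dict(breakdown))
```

The point of the `assert` is the one made in the talk: every level of the breakdown should reconcile with the level above it.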

So we had the 32,000 detections, and there are a bunch of timestamps in vulnerability data which we should probably go and have a look at before we continue. Let's look at when the records in the database that we've exported data from were updated. I've picked 90 days as an arbitrary threshold; this is one of the complexities, actually, it's not clear what the cut-off is, when data is valid. That will depend on what the use case is, how frequently you expect the data to be updated, but in this case 90 days seemed reasonable. So we'll have most of the detections updated in the last 90 days,
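A minimal sketch of that staleness check, assuming each exported record carries a last-updated timestamp (the field names and dates here are hypothetical, not from any real export):

```python
from datetime import datetime, timedelta

# Invented detection records with last-updated timestamps
now = datetime(2016, 8, 1)
records = [
    {"id": 1, "last_updated": now - timedelta(days=12)},
    {"id": 2, "last_updated": now - timedelta(days=70)},
    {"id": 3, "last_updated": now - timedelta(days=200)},  # stale
]

# 90 days is an arbitrary cut-off; pick one that matches how often
# you expect the records to be refreshed for your use case
THRESHOLD = timedelta(days=90)
fresh = [r for r in records if now - r["last_updated"] <= THRESHOLD]
stale = [r for r in records if now - r["last_updated"] > THRESHOLD]

print(len(fresh), len(stale))
```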

so we say fine, that's recent, relevant information, keep that. We'll have some that haven't been updated in a long time, and again I've broken out for you here some of the reasons why, some of the complexity behind this number. Some will be detections that simply haven't been retested; others will have old update times because the asset they're on just hasn't been scanned. And then if we follow the left-hand branch, those that haven't been retested, why is that? Now we get down to real practical reasons. Maybe there's been an authentication failure: some vulnerabilities require authentication to test them, so if that fails we can't update the

record. Maybe the test hasn't been able to be replicated: you have to replicate the exact conditions to retest for the vulnerability, so if you removed a bit of software, maybe they couldn't do the test again. So again, just the complexity of assessing whether the data is valid, and making decisions about what you do with things that are on assets that haven't been scanned; you don't just want to be deleting information about vulnerabilities, so there's a whole issue around what the right decision is there. In this case we were going to focus on the new detections, so we're on that side of the tree diagram, so we're okay to

continue. The next kind of word of warning, I guess, is around the data quirks I mentioned earlier. When you're exporting the data and you see something called "last scan date", you think, great, I'll just stick that in my code, I'll use that, pull out some graphs, that'll be fine. Do you really know what "last scan date" means? Is it well documented? Possibly not. Is it the last time there was a vulnerability scan, or the last time there was a compliance scan? Is it when the scan kicked off or when it finished? Is it when the scan was last authenticated, or not authenticated, or the most recent

of either of those? You can obviously work this out eventually, but a lot of the time people will just take the first thing that seems sensible and go with it. Actually, we had a case where someone was exporting data and had got the wrong meaning for a timestamp; it wasn't updating any of their information. So this can be really crucial when you start to do data analysis off your own back. Those are the kind of warnings around other things you need to check. Just to remind you where we got to: we were looking at new detections to explain that spike in the trend graph

for the CISO, and we're going to focus on the things from the old and new vulnerabilities; this looks interesting, this appears to explain that spike. So the next thing we want to move on to is how we might actually communicate this. We're going up to the communication part of data science now; we're not applying any complex algorithms, as I said, we're just getting some initial value from some simple metrics. Now, I think communication has been one of the biggest issues I've seen people face. I touched on it before: there are a lot of stakeholders, they all have different areas of expertise, someone's dealing with vuln data every day, other people aren't, and the same applies

for all the different controls, and it can be really hard to get your message across. So I wanted to take a look at what we've called the data flow in infosec. You're going to start off down the bottom with your sort of tactical data: this will be everything from your logs, your controls, and in our vulnerability example this is going to be our 32,000 detections, 32,000 data points down this end. Then as you move up, sort of operational data, so maybe you're at vuln manager level now. That's going to be condensed down; the vuln manager is probably going to be looking at, I don't

know, histograms of age of detection, so they can see what needs to be patched next, how they're doing keeping in policy. So we're going down from 32,000 data points, the arrow is getting smaller, down to maybe ten bins in a histogram, compressing the data. And then we move along to strategic data; this is going up to the level of the CISO or even the board itself, and this needs to get even more compressed. In fact, one infosec team we worked with, at the bottom end they were producing maybe a 15-page report; at the top end they had a quarter of a PowerPoint slide to explain the whole

vulnerability situation in the business. So you really need to compress this as you go along, and it's not just people trying to be awkward, making you summarize the data; it simply has to be that way, because as we go along this chain we go up the levels of management, and the responsibility, the remit of the person at each level gets much broader. At the tactical level, those 32,000 detections, there'll be someone working on a patching team, maybe for Windows, maybe just for the UK. Okay, great, they can have all the raw data; they need it for their

job and that's all they're focused on. When we get up to the top, you'll have your CISO, who needs to have an eye on all the controls on their estate across all global regions. She's never going to have time to look at the raw data from all of them, and what's more, it's not her job, right? Someone needs to have made a decision before that. So what data we show someone is essentially based on what they need to do with it, and this is really tricky because it comes down to one important balancing act, which I mentioned as one of the principles at the beginning: this balance between caveats and usability. You need to provide someone

with enough information to do their job, but not so much that they don't have time to look at it or can't interpret it properly. I think this is one of the big challenges of trying to get to data-driven decision-making in infosec: this flow of information as it goes up, giving people the appropriate level of information to do their job, and recognizing that not everyone needs to have all the details all the time. So based on that, what we can see is that we need different perspectives on the data for different stakeholders, and what data scientists should be doing is taking some data, analyzing it and

then packaging up the results so that they're really usable for the use case that person is trying to solve. When you do that, when you make something really great for one person, it's not so great for another person; that's okay. But I would say, as a word of caution, if you see some analysis in the press, or being shown to someone else in your team that's at a different level, and you think it's too simple, where are all the really intricate details, just think: was that analysis intended for you? If it wasn't, you're probably not going to like it, you're probably not going to find it

useful, it'll probably be annoying. I'd love to see all the logs, all the raw details, but the person it's being presented to doesn't; they don't have time to look at it, they need to make a decision. What's important is that people get the right impression of the data, the impression you want them to have, so that the outcome is the right one. And often by giving people all the information you want to give them, you get the wrong outcome, because you've just given them a bunch of logs and they're like, what the hell is this? So effective data science will be pretty targeted, is what I'm trying to say. So

let's have a look, going back to the vulnerability example, at the different views that different stakeholders will have on vulnerability data, so you can see an example of what I'm talking about. For a CISO, as I mentioned before, they have a broad view; they're going to be looking across the globe, and they're probably going to be comparing how different business units are doing. In fact, they are comparing, because this is what the people we work with typically look at. So some important differences: we've gone away from raw numbers, and now we're looking at vulnerabilities per asset, so a ratio of the number of detections to the number of

assets. Otherwise, when you're trying to compare your different business units, a business unit with far more assets is obviously going to have far more detections of vulnerabilities, so if you look at the raw numbers, the biggest business unit is always going to apparently be doing the worst. The CISO's job here is to make sure all the business units are performing well, so we look at a relative number of vulnerabilities, we compare across the regions, and we're looking at an overall trend. A vulnerability manager, for example, needs a totally different view. Suppose they're just working in the Americas, this top bar here; they need completely different

information. They need to make sure that patching is being done on time, nothing's being missed. So the kind of information they might prefer to look at would be something like this histogram: we have the age of the detection, so how long it's been on your estate, across the x-axis, and the number of detections with that age on the y-axis. This is showing them, if your patching policy was 30 days, you can see you've got a bunch that are way past that policy, and plenty coming up. And then they'd probably also need an actual list of detections, to see what vulnerabilities are there, to pass on to

the patching teams so they can actually go and do something. And this is all the same data, but there's no point showing this to the CISO, or the previous plot to the vulnerability manager. It's important, though, that we can easily go from one to the other. Another of the problems I've seen is that people will have high-level plots which, when you break them down to the raw logs, the numbers don't add up. You seem to have lost some vulnerabilities; someone's filtered them out, someone's done some weird thing in Excel, and you can't get from the raw logs up to the summarized view and back down again. That's crucial:
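Two of those points, normalizing detections per asset and checking that a summarized view still reconciles with the raw data, can be sketched with made-up numbers (none of these figures come from the talk):

```python
from collections import Counter

# Hypothetical business-unit figures: raw counts penalize big units,
# so the CISO view normalizes detections per asset
units = {
    "EMEA":     {"detections": 12000, "assets": 3000},
    "Americas": {"detections": 18000, "assets": 9000},
}
per_asset = {name: u["detections"] / u["assets"] for name, u in units.items()}
# Americas has more raw detections, but EMEA is worse per asset

# Hypothetical detection ages for the vuln-manager histogram; the bins
# must sum back to the raw count or trust in the reporting is lost
raw_ages = [2, 5, 14, 28, 33, 40, 61, 120]
bins = Counter()
for age in raw_ages:
    bins["0-30" if age <= 30 else "31-60" if age <= 60 else "60+"] += 1

assert sum(bins.values()) == len(raw_ages)  # summary and raw logs agree
print(per_asset, dict(bins))
```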

everyone needs to be seeing the same data; it's just been repackaged, but it needs to be the same. The next point I want to make about communication is around how it's interpreted. If you're a data scientist plotting a chart, or you're an infosec professional presenting data, or you're showing something about a data set which you're really familiar with, it can seem really obvious when you make a plot what it's telling you. But all of us have to be careful to put ourselves in the shoes of the person viewing that plot and understand how they might misinterpret it, and again, if

you're a data scientist, it's basically your job to make sure that the audience has a low probability of misinterpreting. An example: okay, I'm going to show an average. One stakeholder we worked with liked to see the average time a detection was on the estate before patching: a nice quick measure, easy to get an idea, compare the business units. Another one we worked with was like, no, rubbish, I don't want to see that, that hides outliers, show me the full distribution. It takes a bit longer to process the information in that, but there is more information contained in it. So the question you have to ask

yourself is, if you're going to show someone the average, as in the first case: is that person aware that it masks outliers? Probably he is, but is he going to remember, is it going to be at the forefront of his mind when he has five minutes to check the report you've produced and has to make a quick decision on which business unit needs to be contacted and told to improve their process? This is again one of the responsibilities of data science: to make sure that the communication is clear and unambiguous. And it shows us that the way we present data is specific not

only to the role of the person, whether they're on the patching team, the vuln manager, the CISO, but also to the individual, especially in roles where people come from different backgrounds. For example, the CISO role: some will be very technical, others might come more from a management background, and they might have different skill sets when it comes to understanding data, so we have to bear all of this in mind. The next principle to apply is that of actionable insight, which I mentioned. As a data scientist when you produce a plot, or as an infosec professional when you're producing your report or when you're being shown something, you

should ask: so what? What do you actually do with that now? And remember, again, this isn't about doing data science as research; it's about working with an infosec team that actually needs to do stuff. So I wanted to take you back now to the original plot. We looked at that trend the CISO was trying to understand, and we saw that it seemed to be to do with new vulnerabilities coming in and old vulnerabilities coming onto the estate. Let's see what we can do with that; let's take a look at a way we can potentially provide some actionable insight. The trend we looked at before is on the left; we

can see there's been a sharp spike in the last month, and the CISO needs a reason to go and report to the board. From the analysis we did before, breaking down all the information, getting that domain knowledge, we saw that it appeared to be linked to old vulnerabilities being reintroduced on the estate. On the right, we've plotted those out separately. Notice we don't plot out all those things from the tree diagram; that would just be too many caveats, right? This is usable. We have in yellow the number of detections coming from old vulnerabilities being reintroduced, and

in blue the number of detections coming from newly published vulnerabilities. There will always be newly published vulnerabilities, but as you can see, this is pretty flat; it hasn't changed markedly in the last month. However, the number of old vulnerabilities has shot up, and one of the ways we've seen that can happen is if you have an out-of-date standard build. You put that back in, it's got old software on it, you suddenly introduce a load of old vulnerabilities, and you have to patch them again. So this is actionable; this hints at process. This isn't about patching all the time, it's about

looking at other aspects of the data: where are those vulnerabilities actually coming from? Maybe if we update the standard build we won't have all this stuff coming in. We actually saw an example with some people we were working with where an old version of, I think it was Skype, had been rolled out, and you can do this analysis broken down by software type as well, and then you see all these older vulnerabilities coming in. That's just poor process; you don't want to pass that on to your patching team, it shouldn't be there in the first place. So this kind of insight, and this is a pretty simple example, is what makes

something usable. The plot on the left gives an idea, but it's something like the thing on the right which indicates what someone should do next, right? Go and review your process, go and review your standard build. Yeah, so are these vulnerabilities that existed before that are now being redetected in your own environment?

Yeah, yes, you could have both, actually. So in the case of the standard build, there would be an old vulnerability that's new on that asset, but of course they could also have been detected on your estate before: when that standard build was up to date, things would have been detected on it, they would have all been patched, and now you're putting them back in, essentially. So yeah, you can have both; that's another one of the complexities, right, so that would be another avenue to go down, to split this out further. Okay, so I feel like we've gone through the kind of

framework of data science for the vulnerability example. We've seen the way to approach the data when you're handling it yourself, how to get some insight, and the things to bear in mind when you're trying to communicate it. What I wanted to talk about in this last section is going beyond your data set. All this stuff we've talked about is just for one data set, and that's quite common in infosec, right? We have all these siloed tools; we look at this one, then we look at that one. But actually, once you have this kind of framework in place, to make sure you're handling data carefully, handling it properly, and people

get on board, everyone's agreeing on the picture they're seeing, we can start to look outside the data set. One of the first things we should do next is start to think about the completeness of that data. We've just done all that analysis, but we haven't checked how complete the data is. So this circle represents all the hosts that we found in our vulnerability scanner database; that's what we used for all that analysis. If we look at all the hosts we have on our estate, represented by this gray circle, what we might find, and probably will find, is that there'll be hosts on the estate that aren't being

seen by the vulnerability scanner, and unless there's a good reason, they probably should be. We also might see some hosts, in the small section here, that are in the vulnerability scanner but have actually been decommissioned: no one's cleaned up the database, no one's purged the data. But the main thing we're worried about is this grey area, and the bigger this is, the less confidence we have in how relevant our findings about the vulnerabilities on the estate are. So what can we do about this? Well, we can start to look in other data sets to get visibility on these hosts. Now, it'd be great if there was a

really up-to-date CMDB, but in my experience that's not the case, so we can go to other data sets and use those. What we've tended to do in the last year is get data sets that will go some way to helping us solve the problem we have but are easy to access. Often security teams don't own all their data sets, they could be outsourced or owned by IT, and on some systems logging might not be switched on at a high enough level of detail to answer the use case you want. So you have to be a bit pragmatic and get the data sets that will help you but that you can get quickly. Again, we want to

get a quick return on investment so that people can actually do something with this data. So one of the things we've been using is AV data; AV and vuln data are both typically quite easily available. What you can do then is essentially join the data sets together and look for hosts that are in both, which is the overlap, and, in the case of trying to get a feel for how good our control coverage of the vulnerability scanner is, look for hosts that are only in the AV data, which is this section here. Then we can go and see: why aren't they in the

vulnerability scanner. We can also start to work towards some kind of percentage coverage of our vulnerability scanner as a control. Whenever you look at data from your controls, you should try and understand the percentage of your estate that control covers; if it's a really low percentage, your conclusion is basically useless, and you should try and improve the coverage first, then do the analysis afterwards. As with all the machine learning hype, it's one of those things that sounds really easy when you first mention it; actually, it's really hard to match the data sets together. So in the vuln data you'll typically work

with, you'll have an IP address, but often you won't have a host name resolved, often there'll be no DNS name resolved. So in the AV data you want to try and match it, but if you use the IP address, a lot of businesses obviously will have DHCP, so is it the same asset? You can't check the host names because most of the time you don't have them. And okay, if it's the same IP address and the AV scan was an hour after the vuln scan, okay, they're probably the same. But if it's three hours, or four, or five, or two days, at what point do you make that cut?
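The IP-plus-time-window matching described here can be sketched with a pandas as-of join. This is only an illustration, not the speaker's actual tooling; all DataFrame and column names are invented, and the 6-hour cut-off is an arbitrary choice standing in for the "where do you make that cut" decision.

```python
# Sketch of matching vuln-scan and AV records by IP within a time window.
# All data and column names are illustrative, not from any real tool.
import pandas as pd

vuln = pd.DataFrame({
    "ip": ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "scan_time": pd.to_datetime(
        ["2016-07-01 02:00", "2016-07-01 02:05", "2016-07-01 03:00"]),
})
av = pd.DataFrame({
    "ip": ["10.0.0.1", "10.0.0.2"],
    "av_time": pd.to_datetime(
        ["2016-07-01 03:10", "2016-07-03 09:00"]),  # 70 min vs ~2 days later
})

# merge_asof pairs each vuln record with the nearest AV record for the
# same IP, but only if the two timestamps fall within the tolerance --
# that tolerance is the "cut" being discussed (here: 6 hours).
vuln = vuln.sort_values("scan_time")
av = av.sort_values("av_time")
matched = pd.merge_asof(
    vuln, av,
    left_on="scan_time", right_on="av_time",
    by="ip",
    tolerance=pd.Timedelta(hours=6),
    direction="nearest",
)
# 10.0.0.1 matches (70 minutes apart); 10.0.0.2 does not (~2 days apart);
# 10.0.0.3 has no AV record at all, so av_time stays empty (NaT).
print(matched[["ip", "scan_time", "av_time"]])
```

Widening or narrowing the `tolerance` is exactly the judgment call in the talk: too tight and you miss genuine matches, too loose and DHCP churn pairs up different assets.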

But you can actually do a pretty good first pass with any host names, NetBIOS names for example, that you do have in the data. So there are complexities, and host resolution is probably another talk, but you can do a good first pass, you can learn something from this data, and then to learn more you can get more data. So maybe you can get the DHCP logs and start to see which assets were assigned which IP addresses. And at this point you're really glad you applied that really solid framework of how to handle data and how to communicate it, because if you

hadn't done it, this is an incredible mess by this point, and the more data you put in, the harder and harder it gets to manage it sensibly. So building up that kind of framework of data science and applying it to infosec data allows you to add more data sources in and build complexity without completely losing all the accuracy and trust in your data. The other thing I wanted to highlight is: once you start to get all this data and you're managing it well and you know what you're doing, why not use it for other things? We can start to get a bit more context around what's going on with our

assets, right? When we're finding a host in lots of different data sets, we can start to say: how does that host look in the vulnerability data, what vulnerabilities does it have, what's the situation in AV, what else is going on, maybe what users are logging on, what software is on there, have we had any alerts from other systems involving that host? So we start to get a lot more context that can make it easier to do some of the tactical and operational work. Another benefit is that if we start to bring in business context information as well, we can move towards having a wider view of security across the business. We

can start to make it something that is more relevant for the board; we can get a bit more buy-in from the board, right, because we're now starting to talk their language, we can communicate better with them, we can give them an overall picture of what our exposure to risk is. So the takeaway from this talk is essentially that, first of all, data science is more than just machine learning; there's a bunch of other stuff there, a whole framework for how you can approach analysis, and if we can apply this to the way infosec handles data, we can improve the trust in and

the communication of security data. The benefit of this, for those of you that are working in a tactical, operational way, is you can get more context, you can know more about what's going on; that might also be important when working out how to prioritize tasks, to understand which are the riskiest assets. But possibly the most valuable thing in the long run is that we can start to provide evidence to management, to the board, that demonstrates how the work of a security team is lowering risk and helping the business. At the moment the situation we're in is: if nothing bad happens, then it's okay. So no matter how much work you

do, there's no evidence to show what you're doing, just that nothing bad has happened, and it's hard to measure nothing, right? So this way you can really communicate to the board what's going on, what the security situation is, and finally demonstrate all the hard work the infosec teams are doing. Thanks.
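The coverage comparison from the talk (the grey-area hosts the scanner never sees, the decommissioned hosts never purged, and a percentage coverage figure for the control) can be sketched with plain set operations. The host names and numbers here are entirely made up for illustration.

```python
# Sketch of the control-coverage check: compare hosts the vulnerability
# scanner knows about against the wider estate (e.g. from AV/DHCP data).
estate = {"web01", "web02", "db01", "db02", "laptop17"}      # all known hosts
scanned = {"web01", "web02", "db01", "decommissioned-old"}   # vuln scanner DB

unscanned = estate - scanned     # the grey area: hosts the scanner never sees
stale = scanned - estate         # decommissioned hosts never purged
coverage = len(estate & scanned) / len(estate)

print(f"coverage: {coverage:.0%}")          # 3 of 5 estate hosts
print(f"not scanned: {sorted(unscanned)}")
print(f"stale scanner entries: {sorted(stale)}")
```

As the talk notes, if that coverage percentage is very low, conclusions drawn from the scanner data alone are close to useless, so the coverage gap is worth fixing before the analysis.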

Hi, I have a question slash comment. You touched on the importance of looking at basic statistics, and for me personally, count is the fundamental stat that I look at first, because that addresses coverage. If you don't have good counts, your means don't mean anything, you can't establish statistical significance. So yeah, I think that's absolutely true. A lot of the problems I've seen is that literally the numbers just don't add up. People think they have X number of hosts, and you look in the vulnerability data and you're like, we've got Y number of hosts, and then are you meant to be scanning these or are you not meant to be? You have problems with

simply numbers of vulnerabilities not summing to the total you expect to have, because people are exporting data into Excel and cutting it and changing it. So yeah, absolutely, there's a lot of value: count stuff first, then try summing it, then look at a median, and you have a much better feel for the data.
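The counts-first sanity checks being discussed can be sketched in a few lines; the per-host figures and the "reported total" below are invented for the example.

```python
# Sketch of "count first, then sum, then a median" on a toy export of
# per-host vulnerability counts (all numbers invented).
import statistics

vuln_counts = {"web01": 12, "web02": 7, "db01": 40, "db02": 3, "laptop17": 0}

n_hosts = len(vuln_counts)                        # do the counts add up?
total = sum(vuln_counts.values())                 # then try summing
median = statistics.median(vuln_counts.values())  # then a median

print(n_hosts, total, median)

# Compare against what the existing report claims; a mismatch means
# something was filtered or cut somewhere along the way.
reported_total = 62  # hypothetical figure from the existing reporting
assert total == reported_total, "export and report disagree"
```

The median is worth having alongside the total because one badly patched host (like `db01` here) can dominate a sum or a mean.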

Yeah, yeah, so the exclusions thing is actually a big issue, and I think it's something that people often do unintentionally. I talked about if you're not an expert in a data set, right: someone presents stuff to you, and they might assume it's obvious to you that they've cut out things that haven't scanned in the last 30 days, or they've cut out severity one vulnerabilities because who cares about those (you probably should care, by the way, but that's not advice). And they'll assume perhaps that it's obvious those things have been done, and as that goes up the chain, that flows through the levels of management, and no

one knows those cuts have been made, right? And then if that person leaves the business and the next person comes in, they're trying to replicate that reporting and the numbers just don't add up; they don't know that that guy chopped out those older detections or made that cut. So absolutely, I think that once you get this approach working, it needs to be something that's repeatable as well, repeatable and automated if possible, because then people can't start cutting bits and pieces out. You may have already touched on it a little bit, but can you share a couple of common mistakes

that you can make when interpreting or analyzing data, potentially arriving at the wrong conclusions? Because let's say, for example, certain data points are over-represented, and therefore, like the example you gave, of course North America is going to have the most vulnerabilities because they have the most assets. Do you have a couple of other examples that are really easy traps to fall into? I think a lot of it is just around making assumptions, actually. People will see the example I gave around the last scan date thing, right: people will be like, oh, that's obviously the last time

the scan was started, and then they'll develop some really nice plot based on that and find out later that it was actually the last time authentication failed, or something. So it's about testing your assumptions about the data, actually: if you can, actually run a test and see what happens when you change things, test your assumptions. Another example like that we saw was in a database where, when you had DHCP, you would have an IP address used by a Windows machine, so you get a NetBIOS name, and then a Linux machine would have that IP, and I would just assume, right, okay, that they're probably going

to nullify the NetBIOS field, right, that should be blank now. No, it stays there; that's how the data is stored. And if you had left that in and done all these plots based on numbers of unique NetBIOS names or something like that, they could be totally wrong, just because you've assumed that if you had built the database you would have done it like that. It's actually the really simple stuff that trips people up before they even get to attempting to apply black-box machine learning: like this gentleman said, the counts are wrong, or someone's filtered something out in a spreadsheet and not

told you. Great presentation; you're giving me flashbacks to a VM program and a lot of the problems we had. One of the things I struggled with was communication, and I think you kind of touched on that, so I wonder if you had any lessons learned in two different areas. So when I was doing it, I've been asked for metrics that weren't really very good to look at, like percentage of severities in your environment: if you clear out all your level ones, like you said, your level fives would increase as a percentage, when actually the overall risk would have decreased. So that's one example,

and then also with point-in-time data: taking your data at different times of the month may significantly change how your data looks, after Patch Tuesday or whatever. So I wondered if you had any suggestions or lessons learned around those two areas. Yeah, so to start with the time analysis thing, I think that's something you absolutely have to be careful with. For example, one of the things we do when we first go in and start working with people's data is try and match to their existing reporting, right, a bit of validation, and you can find that if you export from a database on

a Monday one week and then export on a Tuesday another week, the data doesn't agree. So it's important to bear that in mind. It's also important to think about whether you want to look at behavior in a kind of rolling window or if it's the current status that's important, and that depends, everything is quite dependent on what the use case is. So I think it's just being aware of that and thinking about how it might impact your results. Remind me of your other one? The other one was on kind of funny metrics being asked for: like, you present your data in

vulnerabilities per host, which is an excellent metric, but I've been asked to put it in as a percentage or a pie graph, which is horrible. Yeah, I think this essentially comes back to the kind of breakdown, right; with the vulnerability stuff you have to see what's feeding in. I think you have to try to guide people; I think doing data science is sometimes a bit of an education piece. So in the same way that you can take guidance from infosec professionals around the way that data is useful to them and what they need to do their job, they can

hopefully take guidance from data scientists around the best way to present data. And I think if you explain to people why they might actually be misleading themselves by showing that, and give them a valuable alternative that's useful, so maybe looking at, as you said, a ratio of things instead of percentages. I mean, percentages can be fine, it depends what you're doing, right. The coverage one can suffer as well: your coverage can be really good, and you can bring more assets online, and then it takes a while for the scanner to go and find them, so your

coverage plunges. It's about having reasons, about being able to explain the differences. How do you explain that sometimes an increase in vulnerabilities may actually be a decrease in your risk, if you added more hosts? Yeah, exactly, so that was the tree I had before, that was the point: some of the changes are actually positive, you've rolled out your scanner to a new business unit and you've got more vulnerabilities. So I think there's not one metric, right, that's going to give you the whole state of vulnerability, so it's about potentially having a bunch of simple

metrics that will flag up these areas. They'll flag up things around process: looking at the age of vulnerabilities and how that changes, whether they're old or whether they're new, you start to look at, as I said, the standard build process or other things. Also tracking the number of assets: one thing I've been working on, as a complement to looking at the number of vulnerabilities, is looking at how the number of assets on the estate is changing, and you can correlate that, right. So if you have a set of metrics around assets, you can see if that's shot up and you're going to expect a boost in your

vulnerabilities. It sits around grouping together metrics that inform each other, and for sure it's challenging, especially with vulnerability data, because there's so many factors that come into play, but I think anything is better than the current situation I've seen, where it's typically a number and a trend and no insight at all. Yeah, exactly. Hi, thanks, you gave a great presentation, and having done this for a while, I just have a lot of questions, but one of the biggest ones, and one that I grappled with a lot, is the last bullet you have here, effectively. So how do you measure,

how do you tell the business ownership, the leaders, the money people, what risk means to them? Is your definition of risk just lowering the number of vulnerabilities, is that it? Because risk to them, I think, is something else, and it has a dollar value associated with it. So do you have anything related to that that ties in the entire enchilada? Yeah, nice phrase. Yeah, absolutely, it's really complicated for sure. It's not the number of vulnerabilities, totally agree, that's not the point. I think this is something I'm working on with my colleagues at the moment, actually, looking into

kind of how we can really estimate what risk is. That might be looking at that kind of host-centric risk, seeing how that plays out when you look at your controls across the infrastructure. But I think as a first pass, when we're talking about these basic metrics I've shown today, it's simply about allowing the CISO to show that if he or she implements a strategy, something happens. So if they roll out vulnerability scanning across more of the estate, they can demonstrate they've done that, we have more visibility. So it's around educating the board on what security's doing as well, I think.

So it's being able to show an outcome of the work you do. And I think one example with vulnerability data that can be problematic is if you take a really simplistic view and you just see an increasing trend: if that CISO in my example had gone to the board and been like, oh god, look at the spike in vulnerabilities, people will infer things that aren't true; maybe they're like, oh, what's the patching team doing, this is rubbish, and actually you've improved coverage of the vulnerability scanning across your estate. You've done something good and it's represented as something bad if you have overly simplistic metrics. So

simply repackaging the data in a way that gets the proper insight and gives a reason why things are changing, and the CISO can go and point to it and be like, okay, vulnerabilities have gone up, but that's great because we've got awareness of these now, we didn't know these were here before; that's a totally different story to, this is a big spike and I've got no idea. So I think it's providing reasons, providing evidence for what you've done, and then eventually working up to, I mean, you say the million dollar question, right, is getting a measurement of

risk to the business. Okay.

Thanks, that was great, really; you should keep refining this and keep giving it.