
BSides Las Vegas 2019 Day Two - Ground Truth

BSides Las Vegas · 9:18:01 · 683 views · Published 2019-08
Transcript [en]

to be here for it to work.

Hello? Cool.

Hi, this is me. I'm doing a sound check. Sound check, checking the sound. Still checking the sound. I am checking the sound as I continue to check the sound. I'm continuing to talk as we check the sound. I have turned the mic on. I do see a green light. I am continuing to check the sound. Sound check, sound check, checking the sound. How many more ways can I work the word phrase sound check into a sentence without it being incredibly repetitive? I'm still checking the sound. Is the mic not close enough to my face? Do I need to crank my head down like this? I don't, okay, that's probably the problem.

I did have it on a second ago. Hey, there we go. That sounds like it's strong. Can I count to ten? One, two, three. Act like you're trying to talk to him without the mic. One, two, three, four, five, six, seven, eight, nine, ten, nine, eight, seven, six, five, four, three, two, one, two, three, four, five, six, seven, eight, nine, ten, nine, eight, seven, six, five, four, three, two, one, still doing a sound check. One, two, one, two, one, two, three, four, one, two, three, four. One, two, three, four, one, two, three, four, sound check, sound check, checking the sound. One, two, three, four, five, six, seven, eight, nine, ten, ten, nine, eight, seven, six, five, four, three, two, one.

sound check. We good? Good ones? Do you want me to keep talking? OK, is this good? Are we done? Can I stop? Sound check. As I continue to sound check the internet. This is amazing. The internet is listening to me sound check. What a time to be alive.

is good, my sound is good.

Welcome everyone to BSides Las Vegas 2019, day two. This talk is Security Data Science: Getting the Fundamentals Right, presented by Rich Harang. And before we begin, we just have a few quick announcements. First off, we'd like to thank our sponsors, especially our Inner Circle sponsors, Critical Stack and Mail Mail, as well as our stellar sponsors, Amazon, Microsoft, and Paranoids. It's support from these sponsors, as well as our other sponsors, donors, and volunteers, that makes this event possible. Now, this event will be live streamed, so we ask that, as a courtesy to the speaker and to the rest of the audience, you right now make sure to check that your phone is set to silent. Now, if you have any questions, we're gonna ask that you use the

audience mic so that our YouTube audiences can hear you. Just raise your hand and I'll be sure to bring that microphone over to you. And with that, we're ready to begin, so let's welcome Rich Harang.

Thanks, everyone. As he said, I'm Rich Harang. I am a director in Sophos' data science group. Today, I'm going to present you with a long list of lessons learned from the School of Hard Knocks. In addition to my academic background, I've been at the intersection of machine learning, network, and endpoint security for about eight years now. At the moment, I'm leading most of the new research within the Sophos Data Science team. These are projects that run on multiple scales from just one person all the way up to four or five people on a single project. Most importantly, for the purposes of this talk, I think I've stepped on just about every rake there is in terms of running a project. And so now I'm gonna tell you about them

so you don't have to do the same. So this is the Sophos data science team. That's me in the top row. If any of the stuff that we're talking about here sounds like it's up your alley, come talk to me. I am wearing a blue I'm hiring wristband. We can stick you down in the lower right hand corner. Exact position to be decided over beers. So this talk is going to be about sort of the process around doing a security data science project. So what, why, and how. What I am not going to talk about is a specific security data science project that we attacked, or a specific model, framework, or what the best ML framework is and why it's PyTorch.

So I gather I'm supposed to give you sort of a key takeaway that I want you to remember, and that has to be in the first three minutes, so here it is. What most people sort of think of as their job as a data scientist is spending a whole lot of time doing fun programming and fitting models and tinkering with stuff. And yeah, maybe there's a little bit of talking to the consumers of the model or doing a little bit of data wrangling or writing stuff up for publication and stuff like that. In reality, this is what you should be doing. Most of your time should be talking to people, talking to external customers, talking to people within your group, talking to other people working in the same

field. Unfortunately, most of the rest of your time is going to be wrangling data and cleaning it. And then you do actually get to do a little bit of the fun stuff, fitting models, programming them, using the latest, greatest academic research.

Over my eight years in this industry, this is the single best predictor of success for a security data science project that I've found. Know who's going to use it. Talk to those people. Have a continuous cycle of collaboration, feedback, and updating going on within your team and out to the end users of your project. So let's start off with the why of all of this. You see different versions of this diagram, and this one is a little ML-centric, but it applies to more general data science analysis projects as well. We have this notion that we have some data, we train a model, we do an analysis, we make a decision, we do some sort of validation

on that, check it against reality, we deploy it, we deliver it, we hand off the report, someone implements the decision, and we go around and we collect more data, and on and on and on. And so this is all you need to know, right? Start with the very first question: what are we actually trying to accomplish here? So how is this model going to be used? What kind of decision is this analysis going to drive? Can we just answer one question and then stop? Is this a problem that I could solve with a regex? I know it is our God-given right as data scientists to run up $50,000 in cloud costs to figure that out, but let's see if we can simplify the process. So

know what good enough is and then really be able to say in concrete terms, right, what are we trying to do with this project? So with that in mind, we sort of go back to the cycle and now we've got some questions that we want to be thinking about as we move through these steps, right? What features, what level of accuracy do we want to train it to? What metrics do we want to use? What ones have we agreed on with maybe our customers or internally? Where are we going to deploy it to? Who's going to receive this report? How do we have to like phrase this analysis in a way that they can understand

it and implement it? And where are we going to get the table from, the data from? How are we going to label it? How are we going to curate it and make sure it reflects what we need. So if you're coming to this from a data science point of view, very often when you're talking to external customers, they don't necessarily know what they want or they may just want some of that hot new ML pixie dust that they can sprinkle on all of their efforts. So in some cases, you kind of have to be an educator, right? What are the trade-offs you can offer? What's realistic, right? So yes, they want something that can detect

zero-day exploits in encrypted network traffic. Maybe bring them back a little bit and tell them what you might actually be able to deliver. Probably the most important thing is to get something concrete out of them, right? In order for this model or this project to be cost effective, I need to be able to get, you know, this detection rate with this false positive rate, or I need to be able to realize this much revenue from these customers, or whatever it is. But map it to something concrete and measurable and negotiate that up front. So before we get into the how part of this presentation, which is going to be the bulk of it, you really should have three questions nailed down. So you

need to have talked with your external partners enough to know how your project is going to fit into the business somehow, right? A project that nobody needs and nobody can implement isn't something that's adding value for your organization. The person who's receiving it should have a good idea of what exactly it is you're giving them and how to interpret or use it, right? What are the inputs that went into it? What's going to come out of this model, or what's the decision that you're recommending in the report, and how should they interpret it or use it? And then everyone needs to be on

the same page as far as how do we know when we're done? What does it mean to be successful, and how exactly are we going to measure it? Okay, so the how part of this. So these are kind of like six key things that I came up with as I was trying to frame up this process. They're roughly chronologically ordered. I know there's a lot of overlap between them. You can probably debate which, you know, maybe good features should go under "be scientific" or something, but at a high level, this is kind of how I thought it split out. So data, let's talk about data, right? This is data science after all. So I have exactly three quotes

in this slide deck. I have front-loaded all of them so you don't have to sit through them. Basically, John Tukey said that just because you've got data and you've got a question doesn't mean that you get an answer. Ronald Coase is the source of the follow-up to that, that if you torture the data long enough, you will indeed get an answer. And then finally, Norbert Wiener had the comment that the best model of a cat is the cat. So if you are dealing with some sort of real world data, which presumably you want to ingest at some point, the data you use in your project, be it your analysis or training your model or writing up your report, needs to be as close as possible to

the real world data that you're actually using. You need to get your hands on that up front, as soon as you can. So things to consider on this, especially if you're doing an ML project, are things like label distribution, label bias, and data collection bias. If you're doing a malware detection project and you're collecting malware from the internet, that may not look like the malware that's going to arise on your customers' endpoints. So you need to think about how you're going to align those two datasets in a way that gets you a model that answers the questions that you actually want to ask. So this is something I'm going to keep coming back to in these slides, the security difference. So

what extra things you have to think about when you're doing a security data science project versus a plain old data science project. And in security, the bad guy gets a vote. So they're trying to obscure labels. They are adapting constantly. There's a cat and mouse game going on all the time. And in a lot of cases, it might not even be worth using old data. I have a very lovely, carefully curated collection of DOS viruses. I'm not sure I want to train my malware model on those anymore. I'm not sure that they're relevant. So when it comes to data, what should you actually be doing? The gold standard is gonna be getting telemetry from deployment. If I can get

real world data, that's my gold standard. If you can't get that, if you've gotta mock up a proof of concept before they'll let out the precious bits, then there's some other things you can try, right? Are other people working on similar problems? Especially in academia, this is a criminally underutilized resource in this industry. Very often, if you talk to an academic researcher working on the same problem and say, hey, I'm working on this too. Want to collaborate? Want me to cite your paper? They will fire you the data almost as fast as you ask for it. So if you've got an academic paper that's tackled the same problem, ask. They'll very often

be willing to share. Sometimes you can buy the data outright. Again, for malware or maybe malicious HTML or malicious URL problems, there are a lot of commercial feeds that you can license and just start pulling in data from those. Sometimes there are existing datasets that, bearing in mind questions about label and sample bias, you may be able to pull in and use, at least to bootstrap a little bit. So again, in the malware space, Kaggle had Microsoft's malware dataset, which has the headers stripped from it, so it's not perfect, but at least it's a starting point. If you're doing something with web-based detection, you might look at the Common Crawl dataset, which basically has tried to spider

large chunks of the internet. So look for other sources that you could maybe sort of adapt to your own nefarious ends. Especially in the security space, labels are something to be very, very critical and careful about. A very common thing is doing vendor aggregation. So if you have a threat intelligence feed where multiple vendors are giving scores for multiple artifacts, doing a vote over those is a pretty common choice. But again, academic literature is your friend here. If you look up that process, you can find a lot of papers that talk about different ways you might want to think about doing that voting. Crowdsourcing is another option. It can be expensive. It can be difficult to implement, but once you knock the bugs out of

it, it can be a very good way to collect labels for less technically-oriented artifacts. You might not want to try it for malware samples, but for things like is this a phishing webpage or not, you can usually get pretty good results. Finally, you can go back to the machine learning literature and you can do things like self-supervised labeling from a seed corpus. This can be buggy, this can be a little bit tricky, but it often gives you sort of a good head start on labeling data if you can form a nice small set of very cleanly labeled data. When you're doing train and test splits or train and validation splits, because this is a cat and mouse game, because adversarial samples change over time, and what was a hot

malware threat six months ago might not be one in another six months, very often your train test splits need to mimic deployment, which means splitting it temporally. So train on the earliest stuff, validate on the middle chunk, test on the newest data. If you don't do this, if you mix everything up and you just do random k-fold (say, five-fold) cross-validation, you're gonna get wildly optimistic results, because when you deploy it in the field, you've basically trained the model to expect data from the future.
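A minimal sketch of what such a temporal split can look like in Python, assuming each sample carries a first_seen timestamp; the field name and the 70/15/15 proportions are illustrative, not from the talk:

```python
from datetime import datetime

def temporal_split(samples, train_frac=0.70, valid_frac=0.15):
    """Split samples chronologically: oldest for training, middle for
    validation, newest for testing. Avoids 'training on the future'."""
    ordered = sorted(samples, key=lambda s: s["first_seen"])
    n = len(ordered)
    train_end = int(n * train_frac)
    valid_end = int(n * (train_frac + valid_frac))
    return ordered[:train_end], ordered[train_end:valid_end], ordered[valid_end:]

# Toy usage: three fake samples with timestamps.
data = [
    {"sha256": "a" * 64, "first_seen": datetime(2019, 1, 5), "label": 0},
    {"sha256": "b" * 64, "first_seen": datetime(2019, 3, 2), "label": 1},
    {"sha256": "c" * 64, "first_seen": datetime(2019, 6, 9), "label": 1},
]
train, valid, test = temporal_split(data)
```

The point is simply that nothing in the training partition is newer than anything in the validation or test partitions.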

Finally, this is something I don't think I see talked about quite enough. Think about what data you actually want to keep, right? GDPR is a real concern. You should be worried about it. The risk of what happens in a data breach is a real concern. You should be worried about it. Everyone talks about this; I've seen different variations on the metaphor: security data is oil, security data is toxic waste. My own contribution to the ongoing "data is X" genre: think of it like plutonium, right? It can be really powerful. You can do a lot with it. It's got a shelf life. And you want to make sure that you store it and handle it really carefully. Talk with legal early and often.
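One concrete way to "handle it carefully," in the spirit of the tokenization and feature hashing mentioned just below, is to store only salted hashes of identifying fields rather than raw values. A minimal sketch; the field names and salt handling are illustrative assumptions, and none of this replaces the conversation with your legal team:

```python
import hashlib
import os

# Illustrative only: a per-project secret salt kept outside the dataset.
SALT = os.environ.get("FEATURE_SALT", "change-me")

def tokenize(value: str) -> str:
    """Replace a potentially identifying string (username, hostname, email)
    with a stable salted hash so analyses can still join on it."""
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()[:16]

record = {"user": "alice@example.com", "url_length": 83, "num_subdomains": 2}
stored = {**record, "user": tokenize(record["user"])}  # keep features, drop raw PII
```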

I am not a lawyer. I don't even play one on TV. But when you're talking about storing data, especially stuff that has potentially personally identifiable information on it, talk to your legal team early, often, and repeatedly. A lot of times you can transform data, so you can do stuff like tokenization or hash the features or store extracted features that might not have the PII. But remember, the only data that cannot ever possibly be breached is the data that you're not keeping. So think carefully about what you actually need to store. And again, talk with legal early, often, and frequently. Finally, and so this is one rake that was particularly painful to step on. When you have a good data set, protect it. It needs

to be read only. The danger here is not that you're accidentally going to delete it. No, that's like the best case, worst case scenario. What you really don't want to have happen is accidentally modify the data in place, not realize it for three months until you've been training on it for the past three months and realize that all of your results are crap because you've screwed up the data. So when you've got a good data set, you've got nice labels for it, it's pristine, it's beautiful, store it under a different account and make sure that every other account that has access to it only has read-only access. If you need to modify the data set somehow, you should make a copy of it, right? Local copy, work

with it separately. So the next thing I wanted to talk about a little bit was internal collaboration. Have one way of doing everything important, right? Never reinvent the wheel, and make sure everyone on the team is using the same tools, frameworks, and so forth. You can get a lot from open source code, especially if you're using R or Python. There are tons of really good, really well-vetted libraries out there. So avoid the not-invented-here syndrome. When you're doing a lot of these projects, you'll find that your code tends to split into three different levels, right? At the very bottom level, you'll have code that's specific to a single experiment within a given

project. And so that needs to live in one place. A little higher up the hierarchy, you'll have stuff that applies to a given project but might work for multiple experiments. So that might be things like labeling code. And then at the top level, you'll have utilities that your entire group uses across a range of projects. That top-level utility code that's gonna be used over and over and over again, you need to treat that as a mission critical software engineering project, right? That means, you know, JIRA issues, that means code reviews, that means change tracking, version control, the whole five yards. Metrics code deserves a special shout out. If you are computing accuracy or area under the curve or F1 score or

anything like that for any of your output statistics, everyone needs to use exactly the same code, right down to the same function, the same commit, every single time, absolutely no exceptions. The reason for this is that two different implementations of the same function can give you slightly different results. And when you want to compare the outputs of one experiment that Bob did to the outputs of another experiment that Alice did, and they're using two different functions to compute something like area under the ROC curve, you might actually end up concluding that, hey, this looks better than that, when the reverse is true. So documentation. I love this quote: the only difference between screwing around and science is writing it down. Documentation,

I like to think of it as enabling collaboration over time, not only within your team, but also with yourself. It's a continuous process. If you wait until the end of the project to do all of the documentation, you are going to hate your life. When you do your sort of group level utility code, that always needs to be well documented. When you have your final experiment that you're gonna run and do the final analysis, document that code as well when you pass it over. Even if you're just doing like a modeling project where you're like, oh, I'm just gonna hand them a black box and all they need to do is use it, plan on

writing an after action review so that in six months, when you come back to a similar problem, you know what you did, why you did it, what worked, and what didn't work. So keep some form of lab notebook as you go, so that you can actually write that after action report without going, oh God, what the hell did I do that one time? So I'm not gonna go through all of these. I'm just gonna say that these are questions that, in one project (sorry, obligatory plug here: the ALOHA paper that the Sophos Data Science Group is gonna be presenting at USENIX in a couple of weeks), we actually went through all of these and were very careful to make sure we had answers to all of

them. And the result was an exceptionally smooth sailing project. So common data set is probably the most important thing. If everyone is training models on the same data, on the same training data, and testing them on the same testing and validation data, you have no doubt that those results are directly commensurate and you can say, yes, this is definitely better than this. Other things like what are the right metrics that should come out of your conversations with your external partners. Just simple things like am I going to store stuff in JSON or CSV for my experimental results. Documentation and how to manage source code and how to structure your source tree. So from there, care and feeding of features. And you can think of this as also

feature combinations if you're doing a data science-y type report instead of an ML model. How often have you said one of these things to yourself? I can't get anything to work. I got it to work, but then I changed something in the code and it wasn't under version control and now I can't remember what I changed and everything's hot, everything's garbage. Or it worked great on my machine, I don't know what the heck's wrong with the folks over in production. Couple weeks of this and you kind of end up about here. The way to avoid that is to heed well the five commandments of feature engineering. So step one is always, always, always talk to

your domain experts, right? I think this was brought up yesterday as well. Talk to domain experts. Heed well their advice. Feature extraction is critical. You've got to track it.
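A minimal sketch of what tracking feature extraction can look like in practice, assuming the extraction code lives in a git repository; the manifest format and helper names are illustrative, not from the talk:

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone

def current_commit() -> str:
    """Refuse to run feature extraction from a dirty or uncommitted checkout."""
    dirty = subprocess.run(["git", "status", "--porcelain"],
                           capture_output=True, text=True, check=True).stdout.strip()
    if dirty:
        raise RuntimeError("Commit your feature extraction code before running it.")
    return subprocess.run(["git", "rev-parse", "HEAD"],
                          capture_output=True, text=True, check=True).stdout.strip()

def write_manifest(feature_file: str, manifest_path: str = "features.manifest.json"):
    """Link an extracted feature file to the exact code commit that produced it."""
    with open(feature_file, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    manifest = {
        "feature_file": feature_file,
        "sha256": digest,
        "extractor_commit": current_commit(),
        "extracted_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)
```

The same manifest can then be referenced from the experiment log, so an experiment, its features, and the extractor commit stay linked.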

So when you run feature extraction code, you are not allowed to run it unless it's checked into version control. When you actually extract features, link those features to that commit. When you run an experiment using those features, link that experiment to the features that are linked to that commit. And then finally, when you're writing research feature extraction code, have test vectors. Be able to say, I should put these three files in, and I should get these three exact feature vectors out. Write that down, record it somewhere, save it for later. Domain experts are generally pretty accurately named; they're experts. They are a great place to start for getting features. Typically, especially your really experienced analysts will have this

intuition of, that just didn't look right, right? Why did you flag this TCP stream? I don't know, there was something funky about it. Drill into that, right? Ask what the first thing they noticed was. Very often, they will have a well-established, carefully curated,

set of tools that they're extremely familiar with and can tell you exactly how to use. Very often these can be scripted. Very often these will be great sets of features. They also know, having been playing the cat and mouse game for years, how the bad guys are trying to hide from them. Ask them about that, and very often those will actually be very good features themselves, right? PowerShell with a whole bunch of Base64 with null characters inserted in it? Right, got it. Stick it into a feature. Then once you've, you know, gone forth and you've constructed features and you've built some models and you're getting good results, take that feature extraction

code back to them and say, hey, how would you mess with this? And prepare to cry bitter tears as they break it ten ways from Sunday. So the security difference here, especially for stuff like PEs or scripts or things like VBA or what have you, tends to fall under anti-forensics and the subsequent parsing headaches. Bad guys very often use anti-reversing or obfuscation techniques which are very common in their stuff, but which within your total corpus of samples that you're going to be extracting features from might be comparatively rare. So, you know, your feature extraction could be merrily chugging along through all of your files, hit one of these obfuscated things or one of these things that's treated

with anti-reversing techniques and it'll suddenly crash or hang or eat up all of the memory on your box or break in some other hilarious way. So, make sure that when you're doing bulk feature extraction, you've accounted for these weird edge cases, and you can recover gracefully. So you don't come back over the weekend and find that only 1% of your samples have been processed. Think about what you can do if feature extraction just cannot be done on some samples, right? Can you use models or do analyses that handle missing values gracefully? And then as new attacks, new evasions, new techniques come out, go back and constantly revisit your features in light of that. And finally, When you have samples which break your feature extraction, these are really

great to set aside as test vectors, right? And so your test should be, I cannot process this feature, or I cannot process this artifact. OK.
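A minimal sketch of that kind of defensive bulk extraction loop: failures are caught per sample, recorded rather than crashing the whole run, and the offending samples are kept as future test vectors. The function and file names are illustrative assumptions, not actual tooling from the talk:

```python
import json
import logging
import traceback

def extract_features(path):
    """Placeholder for the real feature extractor; assumed to raise on
    malformed, obfuscated, or anti-reversing-protected samples."""
    raise NotImplementedError

def bulk_extract(sample_paths, out_path="features.jsonl", failures_path="failures.jsonl"):
    # For hangs and memory blowups you would additionally want per-sample
    # process isolation (e.g. a worker pool with timeouts); this sketch only
    # handles ordinary exceptions.
    with open(out_path, "w") as out, open(failures_path, "w") as failed:
        for path in sample_paths:
            try:
                features = extract_features(path)
                out.write(json.dumps({"sample": path, "features": features}) + "\n")
            except Exception:
                logging.warning("feature extraction failed for %s", path)
                # Record the failure (and keep the sample!) as a future test vector.
                failed.write(json.dumps({"sample": path,
                                         "error": traceback.format_exc()}) + "\n")
```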

So once we're through that, we're going to put the science into data science. And the way we're going to do that is, step one, obey well Occam's razor. Start with simple baselines. Linear models are great. A simple naive Bayes analysis is an awesome starting point. And sometimes what you'll find is you can just stop. I've run my random forest on it. It works beautifully. I talked with my external stakeholders a while ago, so I know what my metrics for success are, and I have hit them. Great, move on to the next thing. Sorry you didn't get to play with your awesome new deep learning model. If you haven't hit that criterion, then at least having a good baseline lets you know what you are buying when you pay

the price of adding extra complexity in terms of modeling or feature extraction or something. So always start with the classics. They are classics for a reason. When you're building a model, naive Bayes, random forests, decision trees tend to work really well and give you a good starting baseline, so you can get a good idea of how you improve in the future. So if we're being scientific, there's a couple of important questions that you want to ask yourself. Can anyone who's working on this project, right, you've got a couple of people all working on the same thing, can anyone reproduce somebody else's results? Can anyone compare what they did with all of the results that have been logged, however it is you're logging? If you find a model that

seems to do suddenly amazingly well, or find an analysis that seems to be incredibly predictive, can you explain what's different between that and all of the stuff that didn't work quite so well? And then, when inevitably your project gets shelved and then six months later someone comes down and goes, oh no, we've got to do it now, no, no, no, no, now, can you pick up where you left off? Can you figure out what exactly you were doing, you know, have you saved enough state to be able to pick it up off the shelf and keep going with it? So if you've answered yes to all of this, then you need to

go back and ask yourself this again and make sure that you're really sure that this is true. So the key to all of this is some sort of experimental management framework. There's a bunch out there, Sacred and MLflow are two that we've experimented with and we've been happy with both of them. But really what you want is you want to be able to push one button, say, you know, run this experiment, you know, Python experiment.py and then save everything and save it in a consistent format and you've already agreed with the rest of your team what you're going to record and how you're going to record it. You get dependency information, you get critical source

code, you get, you know, which features went with which feature extraction function. And finally, make it force you to write down notes. I am doing this run to look at this question. And if you have all of that saved in one place, then coming back to it in six months should be pretty straightforward. So know when to shift gears. Again, when I'm doing these things, I tend to sort of think of them in three stages of escalating complexity. The first one is tinkering. I've got some problem. I've got some data. I'm going to fiddle around with it. I'm going to try some stuff, and I'm going to see just what seems to work, right? Good plots, some good visualizations, some quick off-the-shelf stuff I can try. So

this is like done at a read-evaluate print loop or in a Jupyter notebook or something like that. Once I've got some good candidates for what seems to be doing okay, that's when you move to testing. So you're going to scale it up. If I scale it up, does it keep working? Are there particular variants on this problem that seem like they're worth a full-scale test? So at this point, you should be splitting it out into scripts. You should have some basic logging in place. You should maybe be saving some artifacts manually. And then you get into sort of the rigorous study phase. And at this point, you're doing testing almost at full scale. This is

where your hyperparameter searches come in. This is where you're doing more detailed analysis and really pulling apart what you found in your exploration of the data. And at this point, what you really want is clean code, version controlled, in a framework. So, I am prepared to die on this hill: Jupyter notebooks are great for starting out. They're great for small-scale one-off projects. They're great for reporting, for writing reports. For actually doing large-scale collaborative data science research, I think they are kind of a disaster. They make things really difficult. So go ahead and argue with me. We'll have beers or coffee or tea and fight it out. So do your experiments support your conclusions? This is always a good question to ask, right? If I

change three things at once, I don't know which of those three things actually gave me an improvement. Very often there is a random element in a lot of analyses. So if you're doing k-means clustering, where those means start can sometimes affect what your final clustering is. If you are fitting a deep learning model, how the weights are initialized can affect your final performance. So when you've got something that seems to be really good, run it a couple of times, repeat that experiment, get some idea of the variance in the statistics that you're reporting. Finally, you need to be really critical of your own work before nature has a chance to do it to you. So think about how you could have possibly screwed something up. So

do your data support your conclusions? So not all metrics are meaningful. If you've been talking with external stakeholders, you should be beyond that. Not all measurements are precise, especially when you're dealing with label noise. And very often, you won't have enough data to measure what you want to measure. And I'm sorry for the quality of this, but I thought this photo was too perfect not to bring up. So this is an actual baseball game that I went to. And you can see at 5:31 in the afternoon, the right fielder had a batting average of zero, just before he went up to bat. At 5:35, after the right fielder has had his at bat, he now

has a batting average of 1.0. What's the right number, right? So that was, you know, if you go back and you look at his stats, it was obviously his first at bat, he got a hit, so, you know, 100% of his at bats, he's gotten a run, or he's gotten a hit, so, you know, sometimes you need a lot of data to get a really good estimate of something, and if you don't have it, you shouldn't be reporting those statistics. So again, the security difference here is noise and drift. You always in security are going to have, almost always, are going to have errors in your labels. So this can very easily lead you to

over or underestimating error rates, especially if you've got systematic errors in your ground truth labels, right? My machine learning models are really awesome at picking out these DOS viruses that I've had years and years and years to analyze and pull apart and disassemble. That may make me overly optimistic about the performance of my model because I'm never gonna see those in the wild, or almost never. If I've got a 1% error rate in my labels and I'm reporting a .01% difference between two models, is that really meaningful? You gotta, you know, make sure that your statistics are not being distorted by the noise that's in your labels. And again, adversarial drift or even just plain old drift is always a concern. So think about, you know, can you

measure how fast your data distribution changes over time? Think about if old data could be biasing your performance estimates. And then, especially in the security space, as you start implementing solutions or implementing detection models or taking action based on your security analyses, you're going to change the data distribution, right? If you find, you know, here's all of these misconfigurations, we can now detect these misconfigurations, we'll go tell them to fix these misconfigurations, those misconfigurations are no longer going to appear in your data. So you've sort of moved that distribution of data. Okay.
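One way to make "measure how fast your data distribution changes" concrete is to score labeled telemetry in successive time windows and watch the deployed metric decay. A minimal sketch; the month field, score threshold, and toy events are illustrative assumptions, not from the talk:

```python
from collections import defaultdict

def monthly_detection_rate(events, threshold=0.5):
    """Group labeled telemetry by month and compute the detection rate
    (true positive rate) per month, so decay over time becomes visible."""
    by_month = defaultdict(lambda: {"caught": 0, "malicious": 0})
    for e in events:  # each event: {"month": "2019-07", "score": float, "label": 0/1}
        if e["label"] == 1:
            bucket = by_month[e["month"]]
            bucket["malicious"] += 1
            bucket["caught"] += int(e["score"] >= threshold)
    return {m: b["caught"] / b["malicious"]
            for m, b in sorted(by_month.items()) if b["malicious"]}

rates = monthly_detection_rate([
    {"month": "2019-05", "score": 0.91, "label": 1},
    {"month": "2019-06", "score": 0.42, "label": 1},
    {"month": "2019-06", "score": 0.77, "label": 1},
])
# If the rate drops below the level agreed with stakeholders, it's time to retrain.
```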

So how to deploy without just throwing it over the fence. These next two sections are going to be kind of short, because they are sort of inherently company specific, but having been in a couple of places, I'm gonna try and talk about some common themes that have come up. So, I mean, you have been talking with your customer the whole time you've been working on this project, right? As long as you have, you should actually be in pretty good shape, right? They should know what the model or the analysis covers, what the assumptions going into it are, what sort of data is going into it, and what sort of results and conclusions are

coming out. And if you couldn't analyze some stuff or you just couldn't get access to some data or you think that there's a data bias, you should have told them about that, right? They should know sort of what the potential red flag areas are for this and where they might want to be a bit cautious about applying the results or the conclusions. So if needed, in your documentation, in whatever you pass over to them, in your nicely written reports and summaries and meeting minutes, you should be able to explain why you did what you did as far as modeling or analysis. This is also a great place to mention samples that don't parse well and you can't analyze, feature issues where you might have missing data or

you might want them to try and collect more data and stuff like that. Okay, so if you've got all that under control, then basically here's what's left. A good starting point, if you're delivering a model, is always to give them a mockup, a Docker container, a SageMaker instance, whatever it is, however you're deploying it. Give them something that just takes the right inputs and gives out the right outputs so they can do integration and smoke testing. Do whatever you've decided you're going to do to convert your model from research to production, be that putting it all in a DLL or putting it in a Docker container or just handing them over a bunch of scripts. As long as you've agreed on it, that's all good. Once

you've done that, remember I talked about test vectors? This is where they're going to save you an unimaginable amount of time, pain, and money.

Run your samples through your production feature extraction and through your research feature extraction and make sure that they match up. Run your production model on your research features. Run your production model on your production features. Make sure those match up. Make sure that your production feature extraction fails on the same stuff that your research feature extraction does. Make sure all of your test cases pass, basically. This is, you know, sort of a rule written in blood by us. Once you've finished all of that, package and do the handoff. You've got some sort of conversion procedure. Do it, toss it off, and you're done. So super easy, right? No. In practice, every time you do this, it's hard. It's especially hard the first time. So again,

I know I keep harping on this, but you gotta talk to the people that you're handing these reports or these models off to. Be flexible about how you deploy, but not too flexible. You wanna get this standardized and repeatable as fast as you can. Every time you go, okay, we're just gonna do it as a one-off this one time, just to get this out the door, you are accruing technical debt that will be harder and harder and harder to pay down. The faster you can get this standardized, the better off you're gonna be.
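A minimal sketch of the research-versus-production parity checks described above, written as tests over a shared set of test vectors. The module names (research_fx, prod_fx), paths, and tolerances are hypothetical stand-ins, not the actual tooling from the talk:

```python
import math

# Hypothetical stand-ins for the two pipelines; in practice these would be
# the research repo's extractor/model and the production build's equivalents.
import research_fx
import prod_fx

TEST_VECTORS = ["samples/benign_001.exe", "samples/packed_002.exe", "samples/corrupt_003.exe"]

def test_feature_parity():
    for path in TEST_VECTORS:
        r = research_fx.extract_features(path)
        p = prod_fx.extract_features(path)
        # Both pipelines must fail (return None) on exactly the same samples.
        assert (r is None) == (p is None), f"only one pipeline failed on {path}"
        if r is not None:
            assert all(math.isclose(a, b, rel_tol=1e-6) for a, b in zip(r, p)), path

def test_score_parity():
    for path in TEST_VECTORS:
        r = research_fx.extract_features(path)
        if r is None:
            continue
        # Same features through research and production models must agree.
        assert math.isclose(research_fx.model_score(r),
                            prod_fx.model_score(r), rel_tol=1e-5), path
```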

So again, the security difference here, this tends to be sort of a culture gap thing. The high level takeaway point is you're dealing with probability and aggregate statistics, not individual artifacts. So if your model gives the wrong conclusion on a single sample, it happens. If it gives the right conclusion on a brand new sample, it happens, right? Don't despair or celebrate either way. Machine learning models in particular always are going to have false negatives and false positives. You need to have some process in place to detect those and to work with those or to plan around them or to mitigate the problems that those errors might throw up.
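To see why the population-level view matters, it helps to do the aggregate arithmetic; the numbers here are made up purely for illustration:

```python
# Hypothetical deployment numbers, purely illustrative.
files_scanned_per_day = 5_000_000
false_positive_rate = 0.001          # 0.1%, which sounds tiny per sample
expected_false_positives = files_scanned_per_day * false_positive_rate
print(expected_false_positives)      # 5000 benign files flagged every single day
# Individual misses and false alarms are inevitable; what matters is whether
# the aggregate rates stay inside what was negotiated with stakeholders.
```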

Especially people rooted in a more traditional security background, thinking of reverse engineering malware or writing signatures, will very often try and drag it back down to a sample-level discussion. Your model missed the sample. Your model didn't catch this flow. And again, try to educate and communicate to them that really what you're looking at is population level. Yes, it missed that one, but it catches 98% of the other stuff. That's the message that you want to keep trying to communicate to them. So yeah, did it work? This is where telemetry comes in. So you are getting telemetry, right? If you deploy the model and you never hear anything else, in six months you will have no

idea if your analysis is still valid, if your model still works, if your process is still monitoring the correct statistics. Try it on real data. If you can't get the real data to come to you, send the model to them and run it in some sort of silent mode for a while and get feedback from that. Check and make sure that the real data actually looks like what you expected. Sometimes you might get a description of the data from your customers and then you go and you actually look at it and you have no idea what you've just been shown. Make sure that those kind of match up. If you're getting telemetry, you're getting results, you're running the data through your model or through your analysis,

you should be able to tell whether or not it's adding value. Are you catching new stuff? Are you mitigating errors? If you've got some sort of metric that you've deployed, is it tracking what you thought it was tracking? Can you use the model to collect real world data for later refinement and training, right? It's going through some process that you have some limited degree of control over already. Can you negotiate with them to say, well, yes, you know, this Docker container or whatever it is, it'll give you the results you need, but it's also gonna send some data back to us, right? So that's one avenue that you can collect real world data. But again,

go back to the earlier slide where I was talking about the risks of holding onto data, right? Make sure that you discuss it with everyone who's involved and then discuss it again and then go talk to legal. The security difference here is model decay. So, you know, it works great until it doesn't, basically. Over time, things change, metrics degrade, models fall apart, adversaries adapt and evolve. So you always want to be cross-checking your performance against new threats, new attacks, new campaigns. And again, this is talking to the domain experts, right? When they see a new family of malware pop up, grab some samples from them, run them through your model, and see what comes out. All of these

things eventually have to be retrained, reanalyzed, redeployed. You need telemetry to tell you when it's time to do that. Okay, so to wrap up,

what exactly did we learn from all of this? Hopefully this. At the end of the day, what you want to do is you want to deliver a data science project that somebody wants. The analysis should be useful to them. The metric should have a business impact. The model should detect something interesting, and so on. The way you find out what somebody wants is by talking to them. So talk to your stakeholders. You should do good science. That means a lot of things, but mostly it means being systematic, being reproducible. And then make sure that you've delivered what it is they wanted. They know what they're getting and how to use it. And make sure that you can track what you've delivered and be able to tell

when it's not working anymore. And above all, you want to build this into your daily routine. Build processes that let you do this so you don't have to think about it anymore. Yeah. So with that, I know I'm a little bit early, but that gives you time to ask me questions, and I am contractually obligated to make you stare at this slide for just a second.

Hi. I'm guessing that your team mostly works in the domain of file analysis. Is that accurate? Yes. Follow-up question. I guess with the new EDR product from Sophos, you have the same sort of process and network telemetry as most of the other products. Is your team doing anything with that sort of time series data set? So we are working, there are projects in the pipeline for machine learning on network traffic, and I think that's probably as much as I'm allowed to say.

So you spoke a lot about the importance of collaboration, and in particular collaboration within the team that's working on the particular project. And I'm curious if you have advice for people who maybe have a much younger security data science practice, and maybe they're the only one working on that problem at a given time. How do you suggest that they go about externally validating, so to speak, and getting that feedback from their peers when they don't really have a ton of peers? So, like, not a ton of peers within the organization? Yeah, yeah, not within the organization. So, I think at that point, presenting at conferences like this is great, right? So this is a

good place to sort of vet your ideas. You know, reaching out to other people that are working in the same space and talking with them is always a good idea. Very often you can sanity check yourself against the academic literature and just see what other people are working on and whether they seem to be having similar intuitions as you do. A lot of it, even within a one-person data science practice, though, is that you can still be systematic and rigorous and reproducible. And building those good habits in gives you a jumpstart on doing things like presenting or publishing papers and getting feedback from that. Another thing that can be, I know, intimidating for some people, but that I think is super useful, is actually

writing academic papers, right? Submitting to conferences and getting feedback on your work from that. Even if the paper gets rejected, very often you'll get very useful feedback on ways the analysis could have been improved or ways the presentation could have been improved. Thank you.

So you talked about reproducibility, which is one of my huge touch points.

But you also mentioned storing your data in, like, CSV or JSON. Have you looked at using any database with provenance at all, to systematically keep track of stuff? So a lot of times the results that we store tend to be very, very large. So, just as one example, in the paper that I alluded to that went to USENIX, our test set was seven.

do it internally and then deliver static reports. Jupyter might work great in that use case. I don't think we've ever tried it, though. OK. Thank you. How's it going? Is there a threshold of data that you use for your modeling that you consider the most optimal? Is it like one year or three years worth of data whereby five years would be too much because it gets too skewed because things change over time? And the second part of my question is regarding, do you collect at all or do you remove out things like seasonality trends or cyclical trends that maybe exist inside the data that you're collecting? So I'm going to give you the really weaselly answer to that question, and I'm going to say it depends on

the problem. So for stuff like malware, you very often, actually for a lot of security problems, what you're less interested in, there are some seasonal things you can see, like everyone getting busy as everyone goes on Christmas vacation or something like that. But in general, what you're more interested in is sort of weird one-off bursts that occur as the result of a new campaign, or someone finds a new exploit and drops a PoC for it and it goes everywhere. As far as how much data is too much data, again, it's really, really problem dependent. So when we're looking at things like URL data, just detecting malicious URLs, that seems to age out very quickly. So we've done sort

of our internal analyses of that, and I think we've come to the conclusion that we really only want to save a couple of months of that kind of data. Whereas for malware, it seems to have a much longer shelf life because you definitely do have some oldies but goodies that just keep coming back over and over. And so you want to keep those in your training set. Yeah, it depends. I think it's an analysis that needs to be done on a problem by problem basis, but it should be part of your analysis. Hi again. How much time is spent on the hyperparameter searching and testing phase, and is that automated? It depends, and yes. So

again, very often a lot of it comes down to what is our threshold for success. if something really simple works right off the bat or we have a couple of sort of, in addition to logistic regression and random forest and gradient boosted decision trees, we have a couple of stock architectures that have seemed to work really well on a large range of problems. And we tend to just throw them at them and see what pops out. And a lot of times, because we've sort of set our thresholds for success and our exit criteria, we run one of those, it works great, and we're like, good, ship it, done. If it needs hyperparameter search, then yeah,

there are several Python packages out there that will do hyperparameter optimization for you very effectively. And you just throw it at one of those. But again, all of this happens inside an experiment management framework so that when you run that hyperparameter optimization, you know what the range of hyperparameters you tried was and what it did and what the results are, and they're all automatically stored and tracked.
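A minimal sketch of that kind of automated hyperparameter search, using scikit-learn's randomized search over a random forest; the parameter ranges and toy data are illustrative, and in practice the run would sit inside whatever experiment management framework the team has standardized on:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=2000, n_features=40, random_state=0)  # toy data

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(50, 500),
        "max_depth": randint(3, 30),
        "min_samples_leaf": randint(1, 20),
    },
    n_iter=25,
    scoring="roc_auc",      # use the metric agreed on with stakeholders
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
# An experiment tracker (MLflow, Sacred, etc.) would log the searched ranges,
# the results, and the code version alongside this run.
```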

Hi, you mentioned earlier about regulations around data retention and stuff like that. How do you make solid cases for retaining for longer periods of time than what you would normally have to be held to by the decisions of your legal departments and stuff like that? Because sometimes some analyses really only have value if you have data outside of that scope. Yeah, and that's something that we have to negotiate with legal, right? We have to get their approval to be able to hold on to that kind of data. And again, what it comes down to is sort of doing the work behind it to show, no, if we restrict ourselves to three months of this particular data type, the results are going to be too bad for general use.

And so we really need six months or a year or two years of this in memory. Being able to show your work on that, and show that you actually have considered only holding it for six months, and that you're not just doing the standard data science thing of, no, we couldn't possibly throw away any data ever, tends to give you at least a little bit more of a firm ground to stand on when you're trying to negotiate these things. But at the end of the day, if the legal department comes through and says, no, you just cannot keep it, that's the end of the story, right? We just tell them this is the business cost of not keeping this data and someone decides that they're

gonna pay it. Alright, let's give another round of applause for you.

I should have just, BYOA, bring your own adapter at this point.

No, this is, yeah. Okay, is this like your lightning port? Which one's for your lightning port? Do you not know? Okay, cool.

Can you go get my bag for that? Thank you. Do you have a, so it has, is there a USB I can plug into then?

It's gonna be for an HDMI, that wouldn't matter. Here, you have to have a lightning port in VGA,

So I have a lightning port and I just need to be able to convert it. Okay, perfect, thank you so much.

Okay, no, you can touch it. Do your job. Oh, that's the wrong bag, but that's fine. I'm so sorry about that. Yeah, it's all set. Thank you so much. You can drop that off with the brown haired woman right over there in the front row if you can. The one in the white skirt.

Try this, go straight in. And then the blue HDMI.

Sure, make sure it comes up.

you're going to use...

Yeah, thank you so much. Thank you.

It's probably gonna be a little shorter, but, hi. Yeah, that'd be pretty fucked if I wasn't a speaker. It's all good. Okay.

Wait a couple seconds, one, two. One, two? One, two? One, two? Okay. Ready to go? It'll be a little early, right? It starts at 11.

Hi, everyone. My name is V. Weiss, I also go by Veronica Weiss, and welcome to my talk, Is This Magikarp a Gyarados? Using Machine Learning for Phishing Detection, and talking about the interplay between the two. So let's just start off the talk with, who am I? I'm just a college student who really likes gradient boosting and Pokemon.

This past summer I've been working on cloud application security, really dealing with credential compromise type of tooling, which is really funny that Capital One happened because, hey, what better way to show your value than an actual attack happening? So I've worked on data science teams in the past, security operations teams in the past, and this work was kind of inspired because I noticed a lot of workflows being not fully automated, which then got my interest in more machine learning and how to automate more of your workflows as a security operations team. And then I also was working for a computer vision lab for Skidmore College and MIT this year that really introduced me to the more

visual part of data analysis. So let's get the elephant in the room out of the way. Sometimes you click on a harmless link and it's like a Magikarp using Splash. And then sometimes you click on a link and hyperbeam.exe is downloaded, and your IP address is somewhere that's not just your personal ARP table, and some attacker has your network information. So that's just a quick way to describe phishing to some millennial kid if they don't know what that is. So the agenda today is a brief history of phishing detection that goes into the current state of research, the data collection process, if you're an independent researcher who wants to get into this, some

data processing and feature extractions that you could also do on that data, and modeling and evaluation, as well as a conclusion. So motivation is that I was just in college, wanted to do a machine learning project. I was thinking about maybe taking ELF files and looking at header data. And then my professor thought that was too ambitious for one semester. And honestly, matching file signatures in the year 2019 is kind of outdated. So phishing is pretty cool. It's a nice legitimate business case. But the issue is, as a college student, you don't have an institutional college phishing repository accessible to you. If I was some security analyst at a big financial institution, then I would be able to just ask someone for some phishing, and would be

able to get some. But as a college student, you are kind of just thrown into the abyss, with the internet and your brain as the only tools you have. So just thinking about phishing, I have some friends who are really smart at security, but they were asking me, why phishing? Can't someone look at an email and say, hey, I can tell that this is fake or not? But let's think about the vast audience of people who aren't accustomed to always thinking about what they're clicking on or thinking about how they're using the internet. So if you're just an employee, low level, you're thinking about just your day-to-day operations and you get an email from your

CISO saying, hey, update your credentials, it's mandatory, it'll look bad on you on your personal account with your manager if you don't update your information, of course you're going to click on a website such as this, see, that looks pretty legitimate. And boom, all your iCloud credentials are stolen. And now people can go look at your weird dog pictures that you've collected for many months. So that's really important to just think about as security professionals. We might think something's basic, but what about just a normal person? So common phishing attacks that happen right now are just DNS cache poisoning attacks, where an attacker can compromise a DNS cache server and then redirect to other IP addresses, or malicious open redirects. This is actually using the JavaScript and

HTML itself to have a redirect to a different website or a malicious attack. And then a botnet, of course. Everyone knows what this is. It's just having your host be compromised as part of a larger attack surface. So how has phishing detection been done previously, and how has it changed at all? So before, there were blacklists and then whitelists. And that's just a lot of manual effort. It's not scalable. A security analyst has to maintain this. And it also is based entirely on previous incidents. When you have larger phishing campaigns targeting your security team, especially if you're a smaller institution, you cannot just rely on static retrospective work. And that's why I think automation and machine learning can be really helpful to

solve this issue. So a newer way of phishing detection has been human driven and machine learning driven. And it's really great if you could build out your own machine learning detection, but as a security operations team, you usually have to buy a product instead. So this is the small part of the talk where we talk about products. Sorry about that. But Cofense is human driven phishing defense, which has human analysts observing phishing emails in real time, looking through your email repositories and kind of asking, hey, is this phishing or not? BehavioSec, Agari and Inky are all companies right now that are pretty prevalent. All are in series A funding or better. And they rely on machine learning in various ways. BehavioSec uses user

behavior analytics, which, as we'll talk about, is kind of a little suspicious: depending on weird user behaviors that you may or may not actually have to be detecting in order to detect fraud. Agari uses ML as well, but they don't really like to share their solutions. Inky is based on a textual analysis using NLP, a visual analysis using convolutional neural networks, as well as other probabilistic modeling that they use to extract features from the email directly. So that's just some ways that, if you're a security operations team, these are some good products that you could use, more so Inky. And so we're talking about, you know, BehavioSec's behavioral metrics, but in machine learning, a lot of products right now are

looking at the way that users type on a keyboard, the ways that they swipe on an iPhone, and that just kind of, to me, seems a little intruding on people's personal information and confidentiality. And hopefully at Ground Truth people have been talking about collecting the right data to prevent cybersecurity attacks as well. So if you're a company that kind of depends upon combating fraud, but you're also looking at people's scrolling techniques, like, that's exactly what Tinder uses to boost their metrics; you probably don't need that. That's kind of uncomfortable information that's slightly dystopian to have for your employees as well. So what if we had a better approach that was scalable, wasn't all based on a retrospective approach, was more automated, and had relevant

So doing this the right way depends on two data sources: emails and websites. To dynamically evaluate an email, we'll go into that, but we also have to be able to dynamically evaluate a website. And so this talk will go into what's been done before and the ways that you, as an independent researcher, can scrape phishing emails off the web and build your own phishing repository. So here's an example of a phishing email. You see the text, you see the paragraph, you see the colors, the spacing, the way that white space plays into this email. And

so the right way to look at this would be to have an NLP-based textual analysis as well as a visual analysis that looks at the coloration and anything that's odd. Because as humans, we can look at an email and see, oh, something's off. We need to be able to teach machines to do the same. So standardization for email classification should be built on seven categories. The first is Bitcoin and cryptocurrency extortion scams. A lot of phishing emails right now talk about ICOs and say, hey, if you reply to this with a certain amount of information, we'll send back some Bitcoin wallet or Coinbase account information. And that's just not...

So that should really be a prevalent category in an email classification scheme, as well as emails disguised as being from an educational or banking institution, or emails disguised as coming from an application or help desk type of sender asking, hey, click on this link to update your account. Those are really prevalent and should be accounted for as well. 419s are sometimes called Nigerian prince emails; that's when someone writes, hey, I'm in dire need of help, send information back. Those are obviously bad and should be accounted for too. And malicious PDFs are becoming more and more important, as DocuSign and electronic document signing can really target executives and people who aren't well trained

on how to use the internet safely but are very powerful. And then, of course, we want a phishing versus non-phishing email group as a control group for our classifiers. Websites can be broken down into four different parts: the URL, the deployment of the website, the content of the website, which includes the written context, the structured text, the HTML and the JavaScript itself, as well as the reputation, which we can estimate with algorithms like PageRank. So now let's talk a little bit about data collection, the feature extraction process, and modeling. There's such a lack of phishing email data sets, and that's probably because researchers in academia do not have access to corporate and institutional emails. And it's also really hard to

anonymize that information even once you have it, because you want to preserve the integrity of the email for your classifiers, but it's still really hard to anonymize the data. Thankfully, certain services, like Google Cloud's data loss prevention API, can help with anonymization if you're interested in this work; that's one recommendation. And part of building a phishing data set is deciding what the prominent features of phishing data sets are. What are we going to look at? There needs to be more of a discussion in the machine learning community about, hey, are we going to look more at the URL? Are we going to talk more about the JavaScript? We really need people who are going to

do a deep dive and say, hey, we have these features that are important, let's talk about the correlations between them, instead of arguing about what exactly the prominent features are. The Enron email data set is a really interesting data set from an energy company in 2002, and it can be used as a control group of just regular, non-phishing emails. What happened was that an academic from Amherst College purchased this email data set for about $10,000, and now it's used for academic research. The original copy was about 16 megabytes; now it's much smaller, thankfully. If you're interested in doing some phishing research, this is a good data set to start drawing from. I've used it before for my classifier building,

and it's a really good place to get a base control group. Public phishing repositories are also really good to know about. Cornell, UNC, and Texas all have IT departments that publish phishing repositories, and it's really easy to scrape the data off of them. I'll also be putting on GitHub the tools that I've written using Selenium and Docker to scrape phishing repositories. As long as you have VMware and a file system that you want to write out to, even just a file system or a PDF store, anything you can draw from and build an analytics base table off of, then you can use it. And so this will be going up on my personal GitHub soon, if you don't want to write your own Python scripting, or maybe you don't know how to write Python. That'll be up in a bit, too.
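A minimal sketch of what a scraper like the one described might look like, using Selenium; the repository URL and CSS selector below are placeholders, not any real university site's layout:

```python
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By

REPO_URL = "https://example.edu/phishing-repository"   # placeholder, not a real page

options = webdriver.ChromeOptions()
options.add_argument("--headless")                      # no visible browser window
driver = webdriver.Chrome(options=options)

try:
    driver.get(REPO_URL)
    rows = []
    # assume each reported phish is listed as a link; the selector is hypothetical
    for entry in driver.find_elements(By.CSS_SELECTOR, "a.phish-report"):
        rows.append({"subject": entry.text, "url": entry.get_attribute("href")})

    # write a simple analytics base table for downstream feature extraction
    with open("phishing_reports.csv", "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["subject", "url"])
        writer.writeheader()
        writer.writerows(rows)
finally:
    driver.quit()
```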

So for phishing website data sets that do exist, the one from Mohammad, McCluskey, and Thabtah is probably the most comprehensive phishing website data set. It has about 11,000 instances, and all the values are binary categorical data, so it's all one-hot-encoded values, which is really good if you have rule-based decision sets or a decision tree such as C4.5. You'll be able to get some nice splits because of the binary categorical nature of the data. And the data set is split about 43 to 57, so you won't have to jitter or rebalance the data too much. So now let's talk about what

Mohammad, McCluskey, and Thabtah did, and what it means to do feature extraction from a URL. Important things to think about for phishing detection in a URL: whether there's an IP address in the URL, because if so, it's probably harmful. The length of the URL: if it's 75 characters or more, it's probably more likely to be harmful. If there's an HTTPS token added manually inside the URL, that's also going to be harmful, of course. If there are two forward slashes in places there shouldn't be, or if there's an at sign in the URL, that's probably suspicious; an at sign can be used to get around browser mitigations. It's really important to look for these miscellaneous tokens that could be suspicious, and this data set also has them included in the data.
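As a rough illustration (not the dataset authors' code), the URL heuristics just described might be computed like this; the feature names and thresholds follow the talk, everything else is illustrative:

```python
import re
from urllib.parse import urlparse

def url_features(url: str) -> dict:
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    return {
        # a raw IP address instead of a domain name is suspicious
        "has_ip_host": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}(:\d+)?", host)),
        # very long URLs (the dataset uses roughly 75 characters as the cutoff)
        "long_url": len(url) >= 75,
        # "https" appearing inside the host itself, e.g. https-login.example.com
        "https_token_in_host": "https" in host,
        # "//" showing up again after the scheme suggests a redirect trick
        "double_slash_in_path": "//" in parsed.path,
        # "@" lets everything before it be ignored by the browser
        "at_sign": "@" in url,
        # extra dots in the host mean two or more subdomains
        "many_subdomains": host.count(".") > 3,
    }

print(url_features("http://192.168.0.1//login@verify-https-account.example.com"))
```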

So then you can put this into a classifier and see the correlations and relationships, as well as the frequency with which attacks like this happen in the data set. Now we can move on to deployment, which is about whether the issuer of the HTTPS certificate is a trusted certificate authority, whether any ports are open, and if so, which ones, how old the website is, and whether the website has a DNS record or not. Also, really small things about the website, such as the favicon in the address bar, can tell you if it's loaded from a different domain and whether it's harmful. And then we can move on to the content itself, which looks

at whether an iframe was used. An iframe, if you're not familiar with HTML, is a way to embed a web page inside another website. So you would be able to, in a subtle way, visually

slip a website inside a website without someone seeing it, and that would be harmful. Then, seeing if the website disables right clicking: you can use basic JavaScript to disable right clicking on a website, and if that's the case, it's probably harmful as well. Also, if the website asks the user to submit information to an email address, using mailto or any client-side mailing system, that's probably harmful, as is any javascript:void(0), skip, or hash in the anchors on the page itself. And if a popup window occurs, that's obviously probably harmful as well.
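A small illustrative sketch of those content-level checks using BeautifulSoup; real phishing kits vary, so the exact patterns here are only examples:

```python
from bs4 import BeautifulSoup

def content_features(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    anchors = soup.find_all("a", href=True)
    return {
        # an embedded page inside the page
        "has_iframe": len(soup.find_all("iframe")) > 0,
        # crude check for JavaScript that disables right-clicking
        "blocks_right_click": "event.button==2" in html or "oncontextmenu" in html.lower(),
        # asking the client to mail credentials somewhere
        "uses_mailto": any(a["href"].lower().startswith("mailto:") for a in anchors),
        # dead anchors: "#", "#skip", or javascript:void(0)
        "dead_anchors": sum(1 for a in anchors
                            if a["href"].strip().lower() in ("#", "#skip")
                            or "javascript:void(0)" in a["href"].lower()),
        # popup windows opened from inline script
        "has_popup": "window.open(" in html,
    }

sample = ('<html><body><iframe src="http://evil.example"></iframe>'
          '<a href="javascript:void(0)">log in</a></body></html>')
print(content_features(sample))
```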

So now we can move on to reputation, which is about how the website relates to other websites on the web, and how Google itself sees it. If you were to Google it, would it look like a safe website, or would it be considered suspicious? The features to look at here are whether the website has web traffic going to it, whether it appears in Google, whether the website is pointed to by other websites, and the PageRank, an algorithm from Google that ranks websites on a scale from zero to one. If a website's PageRank is around 0.2 or lower, closer to the zero end of the scale, that's a bad sign, because there's not much web traffic going to it. If the website is blacklisted

or flagged across other domains as harmful, then it will automatically look suspicious to a classifier as well. So if you do recursive feature elimination, you'll end up with the five most statistically significant coefficients. And if you do that, you find that the most statistically significant coefficients are: HTML that deals with redirects, meaning javascript:void, skip, or hash anchors; whether the URL has two subdomains or more, meaning there are dots in the URL in places there shouldn't be and the host splits into more than three sections; and whether the amount of web traffic to the website is negligible. If there's nothing really going to it, that would be considered suspicious,

which makes sense, and it's one of the most statistically significant variables. And if meta or script tags on the website contain a URL, that's also suspicious, which makes sense. But whether a website has a trusted certificate also came out as a statistically significant coefficient, and it's pretty easy to get a cert if you're ever setting up a honeypot or some fake websites. So we'll go into how context really matters when you're evaluating your classification models, and how you can't just rely on recursive feature elimination or go purely off certain metrics.
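For readers who want to try this, here is a minimal scikit-learn sketch of recursive feature elimination down to five features, on a synthetic stand-in rather than the actual phishing-website data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# synthetic 0/1 stand-in for the one-hot-encoded phishing-website features
X, y = make_classification(n_samples=2000, n_features=30, n_informative=8, random_state=0)
X = (X > 0).astype(int)

rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X, y)

print("indices of the five retained features:", np.where(rfe.support_)[0])
# on the real data these would be the features discussed above: redirect-style
# anchors, extra subdomains, negligible web traffic, URLs inside meta/script tags,
# and the trusted-certificate feature
```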

So, on to modeling. If you just use the Thabtah data set, you'll end up with something around a 95% accuracy and a 0.91 kappa statistic, meaning the classifier does quite a bit better than guessing at random. If you use a C4.5 decision tree, you end up with about a 97% accuracy and a kappa around 0.94. These are really high accuracies and kappa statistics, and that comes from the categorical, binary nature of the data. If you just use naive Bayes, you'll end up with about 70% if you don't do any feature selection, but you can obviously improve your metrics with it, and you can get to about a 91%

accuracy with a kappa around 0.85. However, the data set doesn't really address whether all these phishing attacks and campaigns are independent of each other, and a really important assumption in naive Bayes is that each instance is independent. If the people who released this research had really talked about the independence assumption, and how maybe it's not just one attacker making a phishing kit, then that would be a much more defensible assumption for a classifier. A neural network, just a basic RNN, shows about a 93% accuracy with a kappa near 0.90. And gradient boosting lends itself really well

to binary and categorical data, because a classifier like LightGBM depends on gradient-based one-side sampling, which drops instances whose gradients carry little information, as well as exclusive feature bundling, which bundles features that are mutually exclusive. So it makes sense that such a classifier would have a high accuracy, with a kappa around 0.90 as well. So the takeaway from modeling is that if you have binary categorical data for a really complex scenario, zeros and ones, then you're going to get really nice decision trees. You'll get really nice accuracies and metrics, but you also really need to think about how that will play out in a real-life context.
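A hedged sketch of that kind of model comparison, with scikit-learn's CART tree standing in for C4.5 and Bernoulli naive Bayes for the binary features, again on synthetic stand-in data rather than the real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score, cohen_kappa_score

X, y = make_classification(n_samples=5000, n_features=30, n_informative=10, random_state=0)
X = (X > 0).astype(int)        # force everything to 0/1, like the one-hot-encoded data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for name, model in [("decision tree", DecisionTreeClassifier(random_state=0)),
                    ("naive Bayes", BernoulliNB())]:
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(name,
          "accuracy:", round(accuracy_score(y_te, pred), 3),
          "kappa:", round(cohen_kappa_score(y_te, pred), 3))
```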

Also, context really does matter with recursive feature elimination. You can have really important, statistically significant variables, but if it's still pretty easy for an attacker to get a cert, then you need to think hard about how your classifier is going to hold up in the real world. The future work I want to do is to apply more of a visual analysis and textual analysis to the email repository that I'm building out. VirusTotal is really important for malware analysis, and I think it would be really interesting if people were to make a VirusTotal for phishing emails, with hashes associated with it, because these campaigns reuse some of their

same techniques and even the exact same emails. I think that would be a really important way for security analysts, as a community, to get together and build on each other's work. So the conclusions, or major takeaways: phishing repositories are incredibly underused as a resource, college phishing repositories are also underutilized, phishing detection should consider both the emails and the websites that attackers are using to target people, and feature extraction for harmful websites should be discussed more, including exactly which features we should be using. Thank you so much to David Reed for being a great mentor throughout this whole project, and to friends and mentors for their support. And

I'll be putting the scraper on GitHub in the next two weeks if anyone's interested in getting a scraper to have their own phishing repository research being done. And that's basically it. If you have any time for questions, just feel free.

Can you tell me a little bit more about why you chose the models you chose? Yeah. So the models that I chose were more... Can we go back so I can see them? I don't remember off the top of my head. I'm sorry. Oh, of course. Yeah. Sorry about that.

Yeah. So for the models I chose, I wanted to get an array of different types of models. I wanted to really make sure to start off with a decision tree or just rule-based learning, because I think you want to start simple. PCA and OneR were also involved, to really just start off simple, but then I figured it would be more interesting to put some more complex models into the research for the presentation as well. And obviously, naive Bayes is really different from gradient boosting, which is really different from a decision tree. So I think it's important to show that even with an array of different models, you can get somewhat similar metrics on this data set. Did

you consider other neural nets instead of an RNN? Like, was that just kind of the one you went to? Or what were the other options that you could have done? For the other options, you definitely could have used a convolutional neural network, and that's part of the future work that is being done currently. Cool. So looking at this, one thing that I've always wondered is how do you avoid the unsubscribe links? Avoid hitting those when you're processing through these data sets. The unsubscribe links. Yeah, so if somebody received a legitimate email, it has an unsubscribe link at the bottom. If you're assessing all of these web pages that get processed from there, how do you make sure you don't unsubscribe people from mailing lists? Well,

most mailing lists would look really different, so it would probably be negligible to the classifier itself. A classifier would be able to see that, hey, this unsubscribe link is different from the rest of the links on the mailing list. OK. Wait,

one sec. Yeah, this hand up and then it's you. Have you considered adding information from third-party sources as a feature, like the Google Safe Browsing API or the Site Transparency Report? Yeah, that's currently being added to the work right now. That's a great point, that Google has its own analytics, too, that could be involved. I was going to say, for the unsubscribe links, I'm pretty sure when you click unsubscribe, you usually have to input your email or something like that. I don't think just clicking it actually unsubscribes you. So that might not be a big concern. I was also going to ask: we recently saw some ML-based malware detection software be exploited because you could just append strings on the end

and it would read it as whitelisted software. So what ways do you think an attacker could craft a malicious email that might be able to get around machine learning based phishing detection? That's a great point, and especially timely. However, I think that when you look at news reports like that, you have to be aware that the data scientists are already aware that certain things, like vanilla Mimikatz or the byte appending you're talking about, were already known and should have been incorporated into the model. I think all you can do is have bug bounties for the classifiers that you're working on and just keep incorporating changes and improving your models

as you get more feedback from people who are in more adversarial work.

So where do you see tech trends moving with this sort of technique? I mean, you're using it here as an undergrad, and you have a number of companies coming up with this ML-based wave of detection. Where do you see attackers moving to get around these defenses? Yeah, I think a lot of that relates to the work being done with convolutional neural networks. If you look at any recent adversarial machine learning work, convolutional neural networks are a huge part, because you can change certain metrics, certain distances, certain hues in the image itself and add noise to break the classifier. Also, if you put anything that is

similar to an input into a classifier but changed slightly, then you could break the classifier that way. So I think as more companies start putting ML into their phishing detection, they're also going to have to think about how attackers will break it visually, and also in a textual way, which could be really simple: hey, we can capitalize something that we don't have to.

Can you go back to the categories you had? Yeah, of course. Possibly.

Like you had the seven categories. Okay. I'm just getting to it. It's just going slow. There it is. How did you come up with these categories? Did you cluster them, or did you choose them based off of what you've seen in the data? I did it based off of what I've seen in the data. I think if you go first on clustering and then try to put that into context, you can end up with some bias from reading the clusters the wrong way. So I really wanted to look into the research first, see what the frequencies of the most prevalent types of attacks are, and then come up with the categories that way. Did you go

through and then label the corpora from the educational sources this way and then run your classifiers, or did you have a different way of labeling the data? So the data was labeled through a classifier. But I first created these categories and a tag for each, and then I could assign those tags to the data and look at the frequencies. Okay, thank you.

My question was if you tried to actually do a multi-classifier given different categories, so a one versus many kind of thing. I mean, maybe after you detect, oh, it is a phishing, maybe you could run through some of these models. I'm just wondering if this is something that... Yeah, that's currently being done. Yeah, that's super cool. That should be super cool. And more of the results will be released too eventually for that.

So whatever is left over after the classifier is by default there for the user to catch. And there's research showing that as these become more and more uncommon in people's email streams, they actually get worse at finding them. So in the face of very good technology that makes an attack email a very rare event, how do you protect a user who might see this so infrequently that they don't realize it's a threat? And how do you protect against the possibility that the actual number of compromised individuals doesn't change or even goes up? Well, I think for your attack model, as this is becoming ever more important, you also need to have more education for your employees at an institution. You said that people are going to start

looking at emails like that and not being able to tell the difference. But if you have constant education as a security operations team, then you should have people who are mindful, because part of being a worker is also thinking about this. But if you're someone at a corporation that doesn't have a security team and is much smaller, then you just have to stay diligent and actually think, hey, is this actually something that I should be clicking on or not? Interesting. So you train your way out of people not seeing it in the wild. That would be the best way to do it, probably, if you have the resources.

And then a corollary question to that is, when the classifier has identified phishing emails, what do you do? Do you tell the user? Do we mark it? Do we disable the URL? What do you think would be the best? Well, the deployment would be a great next step if I were to put this into an actual application and say, hey, this is phishing. This is just my own research, working on the classification part of it, not really the deployment.

Hi. Awesome talk, by the way. I've heard that some scammers actually craft their emails poorly because the probability of someone reading a poor email and actually clicking through, they'll be more likely to be a victim at the end or something like that. Have you actually looked at the frequency of your emails to see if there's a difference in distribution of really well-crafted phishing emails and really poorly crafted emails? Yeah, there's an index that kind of encapsulates that. And most of them are...

written pretty badly but try to be written pretty nicely. So no one really tries to just write garbage and then see if that works, but that would be great if that happened in a higher distribution. That would make things way more fun to look at. All right, let's give another round of applause.

I actually saw you on

the podcast. Oh, really? Oh, that's awesome.

out my Python script that I use to write it, if you want to use that or if you're interested.

online research. For sure. Veronica Weiss, I go to school at Skidmore College in upstate New York. I just finished an internship in Manhattan in the finance industry. Oh, I work at MIT for the research labs. Yeah, cool.

Yeah, yeah. For sure. Well, if you find me after or anything, or if you have my Twitter or anything. Perfect.

It's a hexadecimal for the first three. So zero x, 56, Weiss, W-E-I-S-S. Hi, thank you so much.

That's awesome. That's great. That's a really interesting way to do it. I would definitely be interested in that. I know. I haven't done that for emails myself, but I've done it in the past.

I'm doing it in the next two weeks once I get back to New York, but if you just, I can write it down for you if you have it anywhere. Yeah, RMVEE8. It's like the RMV8 architecture with two E's next to this.

They're PDFs. Yeah. I've thought a lot about that, too.

I'm talking. I'm talking.

I see green light. Oh, that works. Okay.

I'm so sorry. Why don't you talk? I don't have anything to say. Okay, so we're trying to figure out if the mic works. And maybe because of the shirt, because the way it's hanging down. Oh. We want a bigger shirt and a nice collar. That's the reason to wear collars without. See, this isn't that hard. Like I'm doing, I'm 10 feet away. Also, the mic is actually pointed towards you, isn't it? Oh, yeah. Okay, blame the mic. Okay, we're good. All right. There we go.

All right. Welcome back, everyone. This is B-Sides Las Vegas 2019, day two. The name of this talk is Old Things Are New Again, presented by Hiram Anderson. And before we begin, we just have a few quick announcements. First off, we'd like to thank our sponsors, especially our inner circle sponsors, Critical Stack and Veil Mail, as well as our stellar sponsors, Amazon, Microsoft, and Paranoids. It's support from these sponsors, along with our other sponsors, donors, and volunteers that make this event possible. Now, this talk is being live streamed, so we ask that as a courtesy to our speaker and to the audience that you right now make sure to check that your phone is set on silent. Also, if you have any questions,

please use this audience mic so that our YouTube audience can hear you. Just raise your hands and I'll be sure to bring that mic over. And with that, we're ready to begin. So please let's welcome Hiram Anderson.

Thank you for the introduction. I'm delighted to be back at B-Sides Ground Truth. My name is Hiram Anderson, and I'm the chief scientist at an endpoint security company called Endgame. My secondary objective today is to be on somewhat of an apology tour for signatures; I spent the first five years of my career dissing them. And my primary objective actually is to get you to lunch. So I'll try to be brief in my remarks today and just hope to pique your interest, and hopefully you and I can chat during lunch in more detail, or you can find me at the Endgame booth at B-Sides. So I think it's fair to say that one thing that machine learning has done really well in security

is malware. So we're going to talk about malware today. In particular, it's good at detecting malware before it executes, with a low false positive rate. The reason it's sort of taken off, I think the primary reason, is because you can detect new, never-before-seen samples at a modest false positive rate. And that's the primary driver for the success of machine learning for static malware detection. I think today any serious endpoint security company has to say the words machine learning when they talk about their anti-malware solution because of this fact: it doesn't just memorize the past, it also projects a little bit into the future. One important element that I think is not often stated about what

machine learning has done to our industry, particularly for malware, is that it has actually brought about a paradigm shift. What I mean by that is, in the old days, you had malware experts writing signatures. Today you have a data set and somebody just turning the crank, who hardly has to know anything about malware in order to produce a new model. So the automation component of that is actually a huge benefit to modern anti-malware products. And in fact, this hand-cranking part is so easy that you don't really have to know much about malware. Even a data scientist can do it, right? So now contrast that with the quote-unquote old school way of signatures

on this side. So signatures do not generalize to the future. They're terrible at predicting new families. But you know what they do well? They do exactly what you told them to do. They are excellent at capturing the known malware at almost a 0% false positive rate. And in fact, that is better in most cases than machine learning. So what I wanted to start off with this exercise is understanding the strengths and the weaknesses of these two things. So, let me go back, that was a mistake. So,

with signatures, I can get a 100% true positive rate and almost a 0% false positive rate. And in fact, that does usually require domain expertise and some sort of manual intervention. But at the end of the day, when you're finished, this signature is totally interpretable in a way that machine learning isn't quite yet. So what do I mean by interpretable? Here's a rule. This is a YARA signature that detects LockerGoga, and it was written by Florian Roth. Ignore almost everything except this bottom condition, which I've highlighted. You know exactly why malware is being called LockerGoga: it happens for exactly two conditions. There's an OR right here. Either it has a PE header, so this is MZ,

and its file size is less than four megabytes and it has one of these five strings that begin with an x, or it contains this wide string, "this may lead to the impossibility of recovery of certain files". So this is a signature. It's imperfect, because by changing just a few things I can break the signature, but it exactly captures all of the LockerGoga samples already released in the wild. And you know exactly what it's doing. So where we want to live is actually in the sweet spot in the middle. We're going to let machine learning do its thing on the unseen stuff, but we also want to get away from

the old way and its scaling problem, the human scaling problem. We want to have the automation of machine learning, where I just point algorithms at data and out come signatures for all of the known bad samples. These rules are perfectly interpretable; you know exactly what they're doing. And it doesn't require much: even a data scientist can do this. They can turn the crank and make these signatures without totally understanding what's happening. Of course, we will never have good true positive rates on new samples and new families; that is the role for machine learning. So the goal is to use these two things together instead of those two things together. And for our

YouTube audience, since you can't see my laser pointer: we want to use automatic signature generation to bring the best of automation with

low false positive rates. So I just want to clearly point out that this desire to live in this sweet-spot middle is not at all a new thing. There are several excellent approaches and solutions out there already; I'm going to name three. The first is by Florian Roth, called yarGen, and it was actually not intended to be totally automatic, it's a semi-automated thing. You point yarGen at a batch of samples and it will create candidate YARA rules that you should review by hand and tweak until it works right for you. So it's meant to be refined by a human. The other two in the middle, VxSig and BASS, are based on the same work. VxSig was recently

announced and released by Google, Halvar Flake tweeted about it, and it and BASS are both based on Christian Blichmann's 2008 thesis. Essentially, it takes IDA Pro disassembly and uses a longest common subsequence algorithm to find signatures that describe opcode sequences for malware families. So for today, I'm going to be talking about more of the first thing. We don't want any reliance on a disassembler; we want to just use the raw byte strings for this work. So in my education, I was taught, and I believe, that a talk is an advertisement for a paper. So here's a paper, which was presented just on Monday at the KDD workshop for cybersecurity in Alaska, and now today at B-Sides. You'll note that on the author

list, I'm way down here. I do want to call out that my really smart colleagues from the Laboratory for Physical Sciences and UMBC, the University of Maryland, Baltimore County, have done far more work on this than I have. But please, you can find this paper today on arXiv, and it'll give a much more thorough explanation than I will today. Let's talk about n-grams real quick. N-grams have been the bread and butter for a whole host of machine learning applications, ranging from natural language processing to bioinformatics, and of course even to malware. In particular, an n-gram is a way to splice a long string up into little tokens, and the number of tokens included is the number n. So a unigram,

for "this is a sentence", would be each of the words individually. A bigram, or 2-gram, would be "this is", "is a", "a sentence", and so on. A trigram uses three tokens in each 3-gram. For malware, n-grams have been used on tokens that represent raw bytes, and that's what we're going to talk most about today. But all of the same techniques I'm discussing today could also be used, for example, on disassembly mnemonics, or Windows API function call sequences, or Windows security event sequences. All the same things apply, but we're going to focus on the raw bytes for today.
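A tiny illustration of byte n-grams (not from the paper): slide a window of n bytes across the file and count what you see.

```python
from collections import Counter

def byte_ngrams(data: bytes, n: int) -> Counter:
    # slide a window of n bytes over the file and count each window
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

blob = b"MZ\x90\x00Microsoft"
print(byte_ngrams(blob, 6).most_common(3))    # each 6-gram is 6 raw bytes
```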

So now, what makes malware raw bytes really interesting to me, and unique compared to most of these other machine learning and natural language processing domains, are really two things. The first is the sequence length. If you think about a document for natural language processing, 10,000 words is a giant document. But for malware, one million bytes is actually kind of average. So just the length of documents and tokens is different, and it makes you wonder if the n-grams that we developed for document parsing are maybe not totally well suited for malware parsing, where document length is so different. The second thing I find interesting is the size of the n-grams. For most of machine learning in natural language processing, compression, and speech, it's most common to use unigrams or bigrams or trigrams, but almost

never do you see n equal to four or more. In malware there have been a few efforts to use n equal to six, but almost nothing has exceeded that. And that's interesting for a number of reasons, which I'll bring up. The first is the sheer vocabulary size. We're talking about byte sequences, and the number of total possible byte sequences for six-grams is something like 280 trillion. That's the total number of possible combinations you could get, right? In order to extract six-grams on five terabytes of data, and we did this, it took a 12-machine cluster one month to find the six-grams.

256 to the n is painful, and we can barely do 6. And at the same time, if you think about malware, is 6 really enough if you want a gram to mean anything at all? With 6 bytes I can capture 3 wide characters. Right? I can get the "Mic" of "Microsoft" and that's it. With n equals 6 I can't even capture a full x86 instruction, which can go up to 15 bytes. So it makes you wonder, are we doing it wrong? If I want a gram to mean something, am I doing it wrong? So the question we asked ourselves was kind of a crazy idea. What if there was a world in which we could extract a 256-gram or a

512-gram or a 1024-gram? Spoiler: the 1024-gram, we're going to call it a kilogram. Do you get it? Okay, so if that were possible, would it even be useful? Intuitively, the data scientists, and you and me, are thinking, that's a terrible idea. You're overfitting to maybe a single sample, and it'll never be useful. But spoiler alert, it turns out that signatures work in malware. People reuse very long strings of bytes across a malware family, and we'll see that later. OK. So were it possible, we would like to explore whether or not we should even consider n-grams bigger than six. That has never been done before. And we're going to approach it in the following way. There is hope, and it's based on sort

of two observations. The first is that n-grams follow this power law. The power law means that there's a few six grams that are used everywhere but the proportionality drops off according to this linear power rate, right? So most n-grams are used in only a few samples and there's a few dominant ones. The second observation is that writing signatures for malware, we actually only care about the dominant ones. If I want to write a signature for LockerGoga, I want my n-gram to cover all of the LockerGoga samples, right? So I can throw away all the other n-grams. So think about that 256 to the n space. We only care about just a tiny pinpoint in that space, that pinpoint representing the very top of this power law distribution, right?

Okay, so given those two observations, let's sketch a solution. The first is that we're just gonna focus on the top K. Let's call K a million, right? A million out of 280 trillion. It's a pin drop. But a million is still a lot of grams to worry about. And we're gonna do an algorithm that does two passes in a very efficient way. The first pass, we're just gonna find what are candidates for K. We'll use a hash table for that. Hash tables have collisions, we'll deal with it. And in a second pass, we'll use that hash table as a filter to only extract the candidate k values and actually retain the actual n-grams. And I'm going to actually do this like three times, so it'll be

very clear how simple this is. And I'm not going to show this today in the talk, but if you care about such things and want to read the paper: because of the power law behavior, and because we only care about the top K, we can actually bound the error rate in recovering the true top K tokens in the presence of these hash table collisions, right? So we can say with certainty that we are actually giving you the real top K of the n-grams. Okay, so let me just walk through what this algorithm looks like with a very fancy simulation. So the blue thing is my hash table. Think of this hash table as having, like, 16 gigabytes, right? It can fit on your

laptop, laptop memory. And then I have a bunch, a batch of files on disk, and for the sake of this demonstration, we're gonna say there are three grams in my set of files. And I have a hash function, and the hash function is just gonna compute a location for this gram, so remember this is a sentence, this would get a location to look up in my hash table. So gram one says I'm going to increment, I'm going to go to that location in the hash table and increment. Gram two says okay you belong here. And let's say gram three is a collision and I go back to that location in the hash table. I've just

showed you the first pass of the kilograms algorithm. I just count things in a hash table, and I throw away all the grams. Think about it: I can't possibly keep hundreds of billions of six-grams in memory or on disk, but I can keep around this 16 gigabyte hash table with counts in it. So in the second pass, I'm now just going to use this hash table as a filter. Every time I pass over one of these grams again, gram one, I see that it belonged to one of the dominant buckets, and so I'll keep that gram. I'll keep the actual six characters and store it in some data structure or database. Gram 2

did not meet the threshold, so I discard it, and that will be the case for the majority of grams I encounter. And then gram 3, the collision, I'll keep it too, and guess what? I've recorded that collision and its counts in this new data structure, and I can tell the two apart now: even though their hashes were the same, I have the original byte sequences back. Okay, so that was the second description of the algorithm. The last time, I'm going to show you the 34 lines of Python code that will allow you to implement this on your own. You can't read this because it's too small, but if you zoom in with

your camera phone, maybe you can. There are four parts. I create a hash table that's a numpy array. Then I do a map-reduce to accumulate things into this hash table, and to make this fast, you can use clever hashing tricks like a rolling hash, like the Rabin-Karp hash. Then there's a third part where I do a numpy partition sort to find out which are the dominant K counts in my hash table. And the last part is pass two, where I use that hash table as a filter and store my true n-grams in, guess what, a Python dict. That's my fancy space-saving structure. Super easy.
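This is not the paper's 34 lines, but a compressed sketch of the same two-pass idea; Python's built-in hash() stands in for the Rabin-Karp rolling hash, and the table and corpus are tiny so the example stays readable:

```python
import numpy as np
from collections import Counter

TABLE_SIZE = 2 ** 20           # real runs use a table sized to fill laptop RAM
TOP_K = 1000                   # how many dominant buckets to keep
N = 8                          # the n in n-grams (the paper pushes this to 1024)
HASH_STRIDE = None             # e.g. 16 to downsample massively redundant grams

def bucket(gram: bytes) -> int:
    # hash() stands in for a Rabin-Karp rolling hash; it is consistent within one run
    return hash(gram) % TABLE_SIZE

def grams(data: bytes):
    for i in range(len(data) - N + 1):
        g = data[i:i + N]
        if HASH_STRIDE and hash(g) % HASH_STRIDE != 0:
            continue           # hash-stride downsampling mentioned in the Q&A
        yield g

def pass1(files):
    counts = np.zeros(TABLE_SIZE, dtype=np.int64)
    for data in files:
        for g in grams(data):
            counts[bucket(g)] += 1         # count hashes only; the grams are thrown away
    # threshold = count of the K-th largest bucket (np.partition finds it cheaply);
    # floor of 2 so this toy demo still filters singletons
    threshold = max(int(np.partition(counts, -TOP_K)[-TOP_K]), 2)
    return counts, threshold

def pass2(files, counts, threshold):
    kept = Counter()
    for data in files:
        for g in grams(data):
            if counts[bucket(g)] >= threshold:   # bucket looked dominant: keep the raw bytes
                kept[g] += 1                     # collisions get disambiguated here
    return kept

files = [b"\x00" * 64 + b"MicrosoftWindows" * 4, b"MicrosoftWindows" * 8]
counts, threshold = pass1(files)
print(pass2(files, counts, threshold).most_common(5))
```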

Now let's look at some experiments that I hope you'll find interesting: now that we can compute n-grams of ludicrous size, like 1024, are they at all useful? First I want to tell you about some data sets. We have four of them we tried against. One was provided by an industry partner and contained two million and change portable executable files from 2014 and 2015; the year is important, and I'll mention why later. Then we use the Ember PEs from 2017. We have a malicious, sorry, a PDF data set, and also we have some VirusShare data. The VirusShare set is just malware families, the top 20 Windows malware families, where the family is given by AVClass. And we're going to run the following experiments across them. The

first is, like, how large is n? How large should n be? I mean, our data science intuition says that you're silly to choose a large n, because you'll be overfitting. So what we did is we took the top 100,000 n-grams across each of these data set corpora, built an elastic net model, and tried values of n ranging from n equals 8 to n equals 1024. And the first thing is, your intuition and mine was correct: as you increase n, performance drops. The graph I'm showing here is a balanced accuracy; each of these data sets has a different size, so I'm weighting them and aggregating them in a balanced way into this accuracy number, and

the reason I use accuracy instead of another metric is because one of these has no benign in it, right, so accuracy is a way to wrap them all together. What you should notice is that as I increase n, my performance drops from near 100% down to near 75% or so balanced accuracy. That's intuitive. One thing I found really surprising is that n equals 32, which is a ludicrously large n-gram size for most things, is really no worse than n equals 8. That was kind of interesting. Totally unexpected and surprising is that when we went to n equals 1024, this worked at all. In fact, if you notice the accuracy numbers for n equals 1024 on the bottom, we're getting something like 92% accuracy on many of the data sets.
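A hedged sketch of that experimental setup, with made-up toy files standing in for the real corpora: binary presence features over a handful of "top" grams, an elastic-net logistic regression, and balanced accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

def featurize(file_bytes, top_grams):
    # one column per retained gram: 1 if the byte string occurs anywhere in the file
    return np.array([[1 if g in data else 0 for g in top_grams] for data in file_bytes])

# toy stand-ins: "malicious" files share a byte string, benign files are noise
rng = np.random.default_rng(1)
mal = [rng.integers(0, 256, 200, dtype=np.uint8).tobytes() + b"runkey-persistence"
       for _ in range(200)]
ben = [rng.integers(0, 256, 220, dtype=np.uint8).tobytes() for _ in range(200)]
top_grams = [b"runkey-persistence"[:8], b"\xde\xad\xbe\xef" * 2]   # pretend top-k grams

X = featurize(mal + ben, top_grams)
y = np.array([1] * len(mal) + [0] * len(ben))
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=0.5, max_iter=5000)
clf.fit(X_tr, y_tr)
print("balanced accuracy:", balanced_accuracy_score(y_te, clf.predict(X_te)))
```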

The other very unexpected thing is, you know, you and I have been taught since childhood that signatures are brittle and they'll only last so long. But in fact, for this last column here, we extracted these signatures on the 2014-2015 data and then applied them to the 2017 Ember data, and that's the accuracy number we're reporting. If you look, the accuracies are in the high 90s for the not-as-ludicrously-large n-grams, and actually 72% for the 1024-grams. So that's kind of interesting. What we did with this was the following. The intuition should be that if I use n equals 1024, I'm going to have low

recall but extremely high precision: there are going to be only a few samples that match this very specific byte string. So we're going to form the algorithm this way. We're going to create a YARA rule using these ludicrously large kilograms, and we're going to start on the extremely-high-precision, low-recall side. And until we have enough coverage, we'll just decrease n until we have covered our family enough. Does that make sense? In words: we start with n equals 1024, and we have a little while loop: while n is bigger than 64, extract signatures, make sure they don't exist in another family or in a benign set, and then

a very fancy OR statement. I'm just going to OR all of these: if this string exists, or this string, or this string. That's my YARA rule. Okay? And we keep going until I've met my target coverage. So this really simple algorithm, which is a few lines of code, if I compare this to yarGen, and I will caveat to Florian Roth that it was never meant to be fully automated, but that's exactly what we did, we fully automated it. Compared to fully automated yarGen, kilograms outperforms it across the board on this set of 20 malware families.
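Roughly, the signature-generation loop described here could be sketched like this; extract_kilograms and coverage are placeholders for the machinery above, not real APIs:

```python
def build_yara_strings(family_files, benign_grams, other_family_grams,
                       extract_kilograms, coverage, target=0.95):
    """Start with 1024-grams (high precision), halve n until coverage is met."""
    chosen, n = [], 1024
    while n >= 64 and coverage(chosen, family_files) < target:
        for gram in extract_kilograms(family_files, n):
            # keep only grams never seen in benign files or in other families
            if gram not in benign_grams and gram not in other_family_grams:
                chosen.append(gram)
        n //= 2                      # loosen precision to pick up more recall
    return chosen

def to_yara_rule(name, strings):
    # OR all of the retained byte strings together, as in the talk
    lines = [f"rule {name} {{", "  strings:"]
    lines += [f'    $s{i} = {{ {s.hex(" ")} }}' for i, s in enumerate(strings)]
    lines += ["  condition:", "    any of them", "}"]
    return "\n".join(lines)
```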

There are a few malware families which they all failed spectacularly at, but for the most part we have very high F1 scores. Okay, so why else would you want really large n-grams? Well, if they're really large, you know exactly what they're doing. I want to take you through a tour of some of the n-grams that were discovered in this malware corpus that I think are really interesting. So first, which n-grams are more common in malicious binaries than benign? Here's one. This is a gram that was discovered. It is a registry key, and you'll recognize it as a common registry key used for run-key persistence. And guess what? A lot of malware tries to persist using this run key. And

knowing nothing about the malware behavior, only that this existed in 12% of all malicious files in Ember, this signature was discovered totally automatically. It required no brains, no security brains, right? Here's another interesting one. This one did require brains, not mine, to understand what it did. We found, I think this was a 64-gram, maybe a 128-gram, a byte sequence that was very common in malware, and I handed it over to my friend Bill and said, Bill, what is this? And he disassembled it and told me, look at these bytes, that's VirtualAlloc, right? So this is assembling the string "VirtualAlloc" in machine code, which is used later with GetProcAddress so that I

can dynamically load DLLs in order to evade static machine learning detection. So this is really cool, because this was in 8% of malicious binaries and kilograms just found it. I wasn't looking for it, it just found it, right? That's pretty cool. Okay, this is from the PDF data set. This is from a 512-gram. It turns out that malware authors are lazy, just like you and I, and they reuse code, and here is an obfuscated JavaScript string used to exploit a particular version of a PDF reader. I honestly have no idea what it does, but it's used all over. You can find this exact string in hundreds and hundreds of malicious PDF samples. And kilograms just found it because it was popular.

That's it. This is a cool signature. Okay, last deep dive into signatures. This one's also kind of fun. There was another signature used all over, in thousands of malicious binaries in the Ember data set. I gave it to Bill, and he tried to disassemble it, and it wouldn't disassemble. Finally he asked, well, where did this signature occur? And it turned out it lit up everywhere in the resource sections. So he went to the resource sections, and it turns out this icon is used everywhere. If you see this icon on your desktop, do not click. Okay? So this is a totally brittle signature. I could change a pixel here and this signature

would fail. But guess what, malware authors are totally lazy and they're reusing this icon all over the place, right? Okay, I'm done. I promised you that my first objective was to get you to lunch on time. If you'd like to chat more later, please do. I have lots more slides that show caveats and more interesting findings, but I'll just provide a summary. I talked to you about kilograms, which is a clever play on words that I didn't think of and wish I did. It's simple and it's insanely efficient. Remember how it took a month on a 12-machine cluster to extract six-grams? The secrets to kilograms are a streaming hash algorithm and a hash table. That's it. And

43 lines of code, right? I can extract n-grams from one million samples on my laptop in less than a day. So it's pretty efficient, and it's actually surprisingly good and interpretable. It's never going to beat machine learning on never-before-seen samples, but for some malware families it has 100% detection and 0% false positive rates. So this is something you'd want to use in addition to your machine learning, for the known stuff, the known bad, right? And that's my final point. I guess the overall picture here in my apology tour is that signatures are terrible at detecting new things, but they are awesome for detecting the things that you know are already out there, and maybe you should use

them. Thank you.

We have time for about two, three questions. The more questions, the less lunch, just so you know. Do you see the antivirus machine learning industry starting to focus on classifying the malware that isn't kilogram-classifiable, like specializing on those? Well, I will say that I've often thought about that. Think about neurons in your neural network, or think about boosting stages in your gradient boosting classifier. Why am I spending three boosting rounds trying to catch Allaple when one signature could do it all? So I've thought about that. And I think that's why a lot of mature companies have both signatures and machine learning as heuristics. Sometimes you just want to get the easy ones right away and

let the machine learning focus on the hard ones. So I like that suggestion. I would agree with that. John. So you talk about the brittleness of the signatures still sticking around when using the n-grams. In NLP, skip-grams can be used to solve some of that brittleness for that sort of problem. Have you had a look at that sort of thing? Yeah, we didn't look at it in depth. I feel like there are multiple ways to use this. We actually explored two approaches in the paper: one is purely signature-based, and one uses the n-grams as features in a model. What I presented today was all the signature side, but the model side is definitely

much more robust, right? Yeah, but fundamentally at the n-gram level, instead of, say, "the quick brown fox jumped over", you'd have "the fox over", so a one-skip gram. Oh yeah. So one thing about malware is that it's positionally insensitive: my .text section could be at offset 400 hex or somewhere else. One thing we did, we never did any temporal downsampling in our set, but what we did do is, let me show you this. Here are the most common 32-grams. You notice something wrong about them, right? They're massively redundant. And this speaks to that. The naive thing to do here is to say, I'm going to skip every third byte and reduce my set, but you are in danger

because of the positional insensitivity, of missing out on something good. So instead we have this thing called a hash stride: if the hash modulo some hash stride number, say 16, is not equal to zero, I'm going to ignore that gram. That's a way to get around over-representing a single string. I guess I have not thought about how that helps the brittleness of the n-gram. And then one last question here. Just real quick, first, what hash function did you use? In all of our experiments we use a Rabin-Karp streaming hash. All that means is that, say I've got eight bytes coming in, I can put it through a shift register: I multiply every position by a number, and

when one comes out of the circular buffer, I know what to subtract from it and what to add. So it's very fast: every hash costs one multiply and one addition. I'm not super familiar with that, but is there any reason you didn't use any kind of locality-sensitive hash function? Yeah, well, exactly.

We want a strict hash function. We want to minimize the number of collisions. So things that are close to each other, we don't want in the same bin. Yeah, sure. Yeah. All right, let's give another round of applause.

Turn on, on the screen, one, two. One, two. You're good to go.

Thank you.

All right. Good afternoon, everyone. Welcome back. This is B-Sides Las Vegas Day 2 in the Ground Truth Room. The talk is Reduce Reuse and Recycle Machine Learning Solutions for Security, presented by ROM. And before we begin, we just have a few quick announcements. First off, we'd like to thank our sponsors, especially our Inner Circle sponsors, Veilmail and

Good start, I'm sorry. As well as our stellar sponsors, Amazon, Microsoft, and Paranoids. It's support from these sponsors, as well as our other sponsors, donors, and volunteers that make this event possible. So this talk is being live streamed, so we ask that as a courtesy to our speaker and to the audience that you right now check to make sure your phone is set to silent. Now if you have any questions, we're going to ask that you use this question microphone so that our YouTube audiences can hear you. Just raise your hand and I'll make sure to bring over the microphone. And with that, we're ready to proceed. So please let's welcome Ram, Ram Ramamurthy. Thank you.

Howdy. Hi. Thank you all for coming after the lunch talks. It's a super hard act to follow, you know, Hiram, John Seymour. So I'm just going to build on and stand on the shoulders of these giants. The way this talk is structured, my goal is not just to get through these slides; I hope we'll have a lot of opportunity to ask questions. And the second thing is, even though I'm the one presenting, this work represents the output of 32 applied ML engineers in our group. It's not just me who did all this work, so I want to make that clear. And Gabe is not here, but thank you, Gabe,

for inviting again. So the first thing I wanna talk about is how we think about the current state of security. And if you think of like the red team kill chain, this is something we're all familiar with. You've got attacker doing some sort of reconnaissance, establishing some sort of persistence by compromising a service account, they move laterally and they exfiltrate data. We all know about this. So what we did was, in 2012 when we were thinking about rejiggering the way our team thinks about security data science, we spoke to a lot of analysts. We spoke to a lot of blue team members, and we found out that they also have a discrete set of steps that they take to kind of

attain their goal. And this is what we call the blue team kill chain.

Blue team members gather data sets, they author atomic detections, there are alerts being written, and then comes triaging. This is when they're looking at multiple panes of glass and trying to see which is the one that's sticking out like a sore thumb. And once they find the alert, they gather context, they run their playbooks, and then they actually execute some sort of remediation or response procedure. What we found is that when the length of the blue team kill chain is the same as that of the red team kill chain, you are playing the role of an arson investigator. The house is already burned down and you are now going to figure out what caused the fire for

the house, which really doesn't add much value. So we asked a question: for attack detection, can we pivot the conversation to attack disruption? What that really means is that you're trying to play the role of a firefighter. The length of the blue team kill chain is now reduced; it's not the same size as the length of the red team kill chain. Can you disrupt the attacker before she attains her goal? That's going to be the theme for today's talk: how can we disrupt attacks? How can we pivot the conversation from attack detection to attack disruption? How do we do that today? Right, for us, the biggest roadblock that we kept hearing again and again was, hey,

there's too many false positives, you know, we get drowned in alerts, we're not able to get to the goal that we want to attain. And when we thought about this in the context of our red team and blue team kill chains, what really happens when people say false positives are annoying is that the blue team member is not able to go beyond alerting. They get all these alerts, but they're not able to triage, they're not able to gather context, and therefore they're not able to execute on the million dollar playbooks that they have authored. So, you know, being the bashful young folks that we all are, we were

like, let's solve this using visualizations. Let's give them graphs, let's give them some pew pew maps, to make it look awesome. So this is an alert from our automated accounts: this is when an automated service account logs in interactively, which should never happen, and we just wanted to alert on that sort of anomaly. And we did what everybody else would do. We had dashboards, we had Power BI charts, and we basically said, hey, if you click on this, just look at this graph and go do your thing. And the problem with it is, in the context

of Azure, there are API calls being made, thousands of them in the order of a second. So all of this becomes, instead of intelligent triaging, "where do I find Waldo?" The second problem is that our red team members have shown again and again that attacks do not stick out like a sore thumb. If I'm not able to find what is anomalous in thousands of API calls, there's no way I can expect my analysts to do it. The first thing that we had to come to terms with is the scale at which we operate. We see 630 billion authentications per month. Even with a modest false positive rate of 10%, that is still tens of billions of false positives.
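
A quick back-of-envelope check on that scale point (a minimal sketch; the only inputs are the two figures quoted in the talk):

```python
# Why even a "modest" false positive rate is unmanageable at cloud scale.
auths_per_month = 630e9        # authentications per month, as quoted in the talk
false_positive_rate = 0.10     # the "modest" 10% rate used as an example

false_positives = auths_per_month * false_positive_rate
per_second = false_positives / (30 * 24 * 3600)

print(f"{false_positives:,.0f} false positives per month")  # 63,000,000,000
print(f"~{per_second:,.0f} false positives per second")     # roughly 24,000
```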

So we needed methodologies that operate at the scale of the data we're seeing, and at a much lower false positive rate. So we started having a couple of mind shifts, and there are five of them — I'm just going to walk you through the different mind shifts that our team had. The first one is we want to focus on what we call successful detections. So one axis is the sophistication of the algorithm — it could be basic or something more advanced — and the security domain knowledge: are you putting a lot of effort into it, or only a little effort into getting domain knowledge? And the other axis is the

utility of those alerts. Given an alert, is it actually useful or not? And if you were to do some sort of basic outlier detection, this is like a number of failed logons greater than two times standard deviation. It's an outlier. It's a fairly straightforward method. And there isn't really much security domain knowledge that goes into it. And you should not expect a big utility in terms of your rate of return. Or that's what we found. So the immediate knee-jerk reaction is like, oh, let's bring in time series analysis. Let's really increase the complexity of our algorithm. And this is what I actually did right off the bat. We had a look at weird times that people check in code into their system. We just

built a very vanilla Holt-Winters time series system to look at failed logins when people are checking in code. But here's the deal: just blindly increasing the complexity only produces anomalies. The way my system failed is that it works really well in the North America region, but the concept of a weekend in the Middle East is extremely different. In Israel the weekend is Friday and Saturday, so people actually check in code on Sunday, because that's their Monday. So just blindly increasing the complexity really does not give you much value. The true goal is successful detections — security-interesting alerts — and that's what you really want to strive for.
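
A minimal sketch of the two baselines mentioned above — the two-standard-deviation outlier rule and a vanilla Holt-Winters forecast — assuming statsmodels and an entirely made-up series of failed-logon counts:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Hypothetical hourly failed-logon counts for one tenant over four weeks.
idx = pd.date_range("2019-06-01", periods=24 * 28, freq="H")
counts = pd.Series(np.random.poisson(3, len(idx)), index=idx)

# Naive outlier rule: anything more than two standard deviations above the mean.
naive_outliers = counts[counts > counts.mean() + 2 * counts.std()]

# "Vanilla" Holt-Winters with daily seasonality; flag large residuals.
fit = ExponentialSmoothing(counts, trend="add", seasonal="add",
                           seasonal_periods=24).fit()
residuals = counts - fit.fittedvalues
hw_outliers = counts[residuals > 2 * residuals.std()]

# Caveat from the talk: a seasonal model like this quietly bakes in one region's
# idea of a weekend; in Israel the weekend is Friday/Saturday, so Sunday activity is normal.
```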

So the second mind shift that we had: as data scientists, we have this insane curiosity towards feedback. Hey, I'm producing an alert — let me go talk to my analyst, let my analyst be my mechanical Turk, and reap the feedback that she provides to me. And it's really sad, because that's not an analyst's job. An analyst is there to secure the system, and just looking at them as a proxy mechanical Turk to get your feedback was not useful for us. So what we did was tap into the other assets that we have to get labels. So for instance, like bug

bounties. We had simple bug bounties internally within Microsoft and then soon outside, we said, hey, if you were to come and attack Azure, you'd probably get a monetary reward for that. Like, you know, red teams are an other great way for us to get label data. My favorite one is actually getting labels from other products, which I'll talk about in a case study going forward. The third thing is understanding how the cloud is very different from on-premise setting. So in on-premise, we all have this really nice comfy feeling of our network, of our private network. We exactly know the crown jewels, active directory, and anytime if your active directory is pwned, then you get keys to the kingdom. admin. But

if you think in terms of the cloud, it's a little bit different. And it's a little bit same as well. So it's same in the sense that anything that you see on premise, you can also, you will see like a cloud analog. So for instance, like you might see You might have on-premise SQL server to host your files. In the cloud, it's SQL Azure. You might have your domain controller, which would be Azure Active Directory. So there are a lot of analogs. Servers become services. It's also very inherently different. The whole point about the cloud is that there is no single point of failure. So what are the crown jewels of Azure? What are the crown jewels of

your cloud? So asking this question and really talking with our red team and our blue team kind of helped us identify, oh, okay, so the LSAS over here is kind of like the key vault in the cloud. Or like, protecting storage accounts is important because, hey, SQL Server is important. So we try to make these analogs and This was done by our red team to help us. When we took the knowledge that we had on premise and translated that to on the cloud. So things like the attacker, her goal on premise, she might be like, I want to be domain admin. But on the cloud, it really translates very nicely into becoming subscription admin. Or things like pass the hash

on premise is like a credential pivot, where an attacker, she might a subscription, she might get her certificates and pass those certificates and authenticate. So we try to do these translations to help us guide intelligent cloud security detection methodologies. The fourth one is we want to solve for classes of attacks. We kind of have a 33-member data science team, and we still have an extremely healthy backlog of things to get done every semester. And what this told, informed us is There's always going to be more security scenarios to solve than the resources that you have at hand. And this is an important insight because if you're going to have a unique custom solution for every

security scenario that your management asks you to solve, it's going to be hard. And this is like traditional machine learning where every task gets its own learning system. really want to pivot to is trying to learn from related tasks. If you've solved detecting geo-login anomalies in Azure Active Directory logs, try to use that methodology and we will see how it transfers to detecting unusual SSH logins. So this kind of helped us to keep our backlog and try to manage it in a healthy fashion. The last one, want to embrace empathy. I know this sounds a little frou-frou, but there's one thing that's really helped us to build I think is like solid ML detections is

to think and talk to our security analysts. We put them front and center and again like we do not look at them as mechanical turks. Like this is like a picture of what we call the grading fiesta. We host every bi-weekly when every member of our team comes together and we grade a person's detection. And if you're not able to grade it and if you're not able to like understand what the anomaly is, there's no way in hell like our analysts or like our customers are going to be able to do it. The second thing that we started doing is we really had like partner calls when we asked them like, hey, we're going to

private preview, would you like to test it? So really getting that feedback from our customers really helped us to think like, okay, we are building solutions for them and we want to be able to help them. So given these like five different things that I spoke about, I'm going to tell you, kind of ground them how we protect assets in the cloud, kind of like on the host, on the identity, and on the service level. So the remaining time is just going to be the case studies in each one of these areas. So I'm going to start off with the host and talk about detecting malicious PowerShell commands. So I don't have to say this out, PowerShell is like, Malicious usage

of PowerShell is on the rise. I love the graph from Symantec — it's a nice hockey stick. I think it was right after PowerSploit was released.

Today we're going to specifically think about like PowerShell obfuscation. We're going to run through this really fast, right? This is just a very simple like PowerShell command to kind of like download malware from this like malicious website. But first off, I can escape the commands. I can then kind of like escape the characters within the command. I can even escape the arguments. I could put a pointer to the new object. And all of these does the same thing. And then I could basically escape everything. And this is like, you know, classic Daniel Bohannon. who came and schooled us in blue hat. So essentially, if you were to use all of these things, do the same thing, which is go to the same website and get malware

from them. If you were to try to detect these PowerShell command lines with rules, it really doesn't work that well, because you have to write a regex for each one of them. And classical machine learning does not really work that well for us either, because every command line is unique — there might be no discernible pattern. So our previous approach actually used n-grams. It didn't use the cool KiloGrams approach from the talk in the last session; we basically used 6-grams, and our true positive rate was around 67%. We wanted to ask ourselves, can we do better? And our hypothesis was: hey, deep learning methods are really efficient

at capturing like semantic variations is it possible for us to use this in this particular problem domain. And the overview, you know, in the next couple of slides, what I'm gonna show is like how we try to capture semantic relationships using embeddings and how to use those embeddings to kind of classify those command lines. So, quick kind of like preview of an embedding. So it's really popular in natural language processing. I think there was a fantastic talk yesterday on embeddings and malware. I'm sure you know they probably do a much better job than the couple of slides I'm going to show you about. But essentially when somebody says embeddings, all you want to think is taking words and converting them to vectors. And if

The classic traditional approach would be like one-hot encoding, but really that doesn't give you much because you lose semantic information. So one-hot encoding is just like, if I see that word, put a one, everything else in my vocabulary is a zero. So extremely sparse matrix. But embeddings, in the other words, are dense vectors. So the meaning of a word is captured in, say, four bits of information, and it's smeared across that four bits of information. And once you use embeddings, essentially what it helps you do is it helps you capture these semantic relationships between words. This is like the classic hello world of embeddings. If you take the vector for the vector representation of queen, and then you subtract from the

vector representation of woman, and then you add the vector representation of man, you should land somewhere in the neighborhood of the vector representation of king. And the reason why this works is because the meaning of a word can be inferred from the company that it keeps. And our goal is: can we infer the meaning of a command from the company of the other commands next to it? That's the problem we want to solve. And we used word2vec. What you essentially get is the ability to distinguish things that don't match. So for instance, all of these are vanilla Boolean switches except one that isn't,

and you would be able to find that. All of these are window styles, except bypass — you would be able to figure that out using contextual embeddings. But the real power of contextual embeddings is that you can also learn linear relationships like the one that I showed you: Export-Csv minus CSV plus HTML gives you ConvertTo-Html. So what does this give you, other than the fact that we spent probably a day, if I'm generous, coming up with these slides? The thing is, if in your training set you only ever see Export-Csv, CSV, and HTML, you can still get to ConvertTo-Html. You do not have to see ConvertTo-Html in your training set, because

you would be able to get that by the previous things. And that's where the discernible pattern, that's how we circumvent the discernible pattern of the previous approach, is when an attacker, when she uses a command that we have not seen before, but we have a healthy corpus of training data, we're able to infer what's not seen. So this is basically the data set that we have. We get benign scripts from what our own environment has, but we also take from the PowerShell gallery and we tokenize them. We get 1.4 million distinct tokens. And essentially, we learn the embeddings of the unlabeled scripts. And we have a very, very tiny corpus of labeled scripts. And that is essentially what goes into our convolutional neural net for classifying,

for training our model. So I'll show you very fast. I'm going to pray to a lot of folks now, especially John Ho is right there. So this is like,

an embedding looks like using TensorFlow. So essentially, we've taken 10,000 command lines and we've projected them on three principal components. And imagine that an attacker, you've seen invoke web request in your training set, but an attacker, she now uses IWR. you know, same reference to invoke web request. So if I can search for IWR, something that I've not seen before in my embeddings, I'm not sure if you can see it, but the first one is like invoke web request. So that's essentially like trying to circumvent again the problem of like, hey, I've not seen this before, so I'm just going to check my embeddings, see the nearest one to my embedding, because, you know, just like how a word is,

defined by the company that it keeps, a command is defined by the company that it keeps. IWR is most likely Invoke-WebRequest. So I'm going to jump back to this.
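
A small sketch of that embedding-and-nearest-neighbor idea using word2vec (assuming gensim 4.x; the tokenized command lines below are hypothetical — in practice this is learned over a large corpus of unlabeled scripts):

```python
from gensim.models import Word2Vec

# Hypothetical pre-tokenized PowerShell command lines.
tokenized_commands = [
    ["invoke-webrequest", "-uri", "http://example.com/a", "-outfile", "a.exe"],
    ["iwr", "-uri", "http://example.com/b", "-outfile", "b.exe"],
    ["iex", "(", "new-object", "net.webclient", ")", ".downloadstring", "(", "http://example.com/c", ")"],
    # ... many more, from your own environment and the PowerShell Gallery
]

# Learn dense vectors for tokens from the company they keep.
model = Word2Vec(sentences=tokenized_commands, vector_size=64, window=5,
                 min_count=1, epochs=50, workers=4)

# An alias such as "iwr" should land near "invoke-webrequest" even if it never
# showed up in the small labeled set.
print(model.wv.most_similar("iwr", topn=3))

# The analogy trick from the slides would look like:
#   model.wv.most_similar(positive=["export-csv", "html"], negative=["csv"])
```

In the system described here, these embeddings are only the first stage; the small labeled corpus then goes into the convolutional neural network for classification.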

These are the results that we have. First off, the model can be retrained multiple times per day. The most important thing is that classification completes in the order of seconds — which comes back to shortening the blue team kill chain. We got a really good, hefty

increase in true positive rate while keeping the same false positive rate, which I think is a good way to show our management why our team should exist: we've not made the system worse, and if anything we're able to catch more attacks. We'll also put a link to the paper on arXiv, and I'll be distributing the slides anyway. The code is also published on GitHub — please feel free to play with it. The second case study — I'm going to stay in the realm of the host — is trying to detect compromised virtual machines. Our previous approach used rules and heuristics. So the problem at hand is: is a virtual machine

compromised or not? And our true positive rate was 55% — marginally better than tossing a coin and deciding if a VM is compromised. So we obviously wanted to do better. The solution we gravitated towards was leveraging the spam information from Office 365 alongside the NetFlow data from Azure. Essentially, our hypothesis is: if a VM is sending spam, it's most likely compromised. And we know that the VM is sending spam because we get the spam labels from Office 365 — they maintain a corpus of information that says this IP has been sending out spam. And if we see traffic from a particular virtual machine from that IP address, we use that as label data for

spam. And there are a lot of good reasons why network data is good for detections. First of all, you don't have to do any sort of installation — it comes for free, there's no overhead on the customers. And it's OS independent, which is important in Azure, where there's a good, healthy mix of operating systems actually running on top of it. And some of the features that we extract — IPFIX, by the way, is kind of like NetFlow — we extract from the NetFlow data: the ports of the traffic, the number of connections, which TCP flags are set. Those are our features; that's essentially the columns of our data set. And like I said, the spam tags come from building 34 in Redmond. So

this is what the training data looks like. We've got the handful of spam-labeled data, and we've got benign IPFIX data, and we run it through our random forest model. Then, given a new case — previously unseen NetFlow data — we run it through our ensemble and make a judgment: is this compromised or not? I love talking about ensembles, because they're my favorite, and this is essentially what it looks like. Imagine all the positive examples are benign traffic and the negative examples are malicious traffic, represented in two dimensions, and you have to find a hyperplane that divides them. But it's not really possible, because there's no

one single line that will cleanly divide both of them. So you need something with a nonlinear decision surface. The first time, you learn with one learner. You do some classification, and you see this learner gets some of these positive ones wrong — it calls them malicious. So you identify those and upweight them; in the next iteration, anything you got correct you downweight, and anything you got wrong you upweight. You want to think of it like a really strict teacher: she only looks at the things you got wrong, and she wants you to get those correct. So you upweight those, and then your second learner gets

those correct, but at the risk of getting the others wrong. Then you do one more layer — you boost those things now — and you iterate through different rounds of boosting. The final result is a combination of multiple learners, and you get this nice decision surface. The reason we like ensembles is that they're well proven — this was actually used in Kinect for pose estimation. So again, thinking about shortening the blue team kill chain: we're able to classify in the order of seconds, and our training is relatively fast.
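
A minimal sketch of that boosting idea with scikit-learn — each round reweights the examples the previous weak learner got wrong. The NetFlow-style features and labels below are invented for illustration; the talk describes the production model simply as a random forest / ensemble:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hypothetical per-VM flow features: [dst_port, n_connections, n_syn, n_rst, bytes_out]
rng = np.random.default_rng(0)
X = rng.integers(0, 1000, size=(5000, 5)).astype(float)
# Hypothetical labels: 1 = VM's IP showed up on the O365 spam list, 0 = benign.
y = (X[:, 1] + rng.normal(0, 50, 5000) > 800).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Boosting: successive weak learners focus on previously misclassified examples.
boosted = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Bagging-style ensemble, closer to the random forest mentioned in the talk.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

print(classification_report(y_test, boosted.predict(X_test)))
```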

We run it actually multiple times per day and it's completed within the order of minutes. And we got a 26-point improvement just by using an out-of-the-box ensemble method. And this is also a good lesson for us to kind of keep in mind. If there's an easy solution, please take it. No reason for you to look at fancier methods. In fact, we actually started off with logistic regression. We could have had a high school student come and solve this problem for us. But the important thing to keep in mind is if you can optimize for a business metric, which is like, hey, I want to be able to detect attacks faster,

most of the time out-of-the-box solutions are pretty good at scalability. The third case study is going to be identity focused. So we saw two case studies that were focused on host. Now I'm going to talk about a case study focused on identity. So one of the things we like to think about is the ML journey. And this is something I got a lot of help with my product marketing team, but they did a really good job. So on one hand, you've got folks who really don't have much machine learning power houses, right? But they still have the yearn to kind of like, I want to put machine learning. Probably their board wants them to do machine learning. But

the unfortunate truth is that getting security data scientists is like finding a unicorn, because they command NFL-style salaries — you can't have a big, enormous team. So our problem statement was: how do we empower businesses that are not able to have security data scientists to do machine learning? Specifically: we have a solution to detect anomalous Azure Active Directory logins — open that model up so anybody can detect anomalous logins for their own problem domain. It could be SSH, it could be Linux logins, doesn't matter. So that was the problem at hand: to help those folks who don't have any ML investments — we're not talking about you guys who have advanced ML investments. For people who have no

ML investments, we're taking the things that we have built and opening them up so they can build on top of that. And the challenge that was presented to us was anomalous logins. This is the classic 'impossible travel' case: Ram lives and works in Redmond. Minute one, he logs in from Redmond; minute two, he logs in from Hungary. It's not possible. So we have actually solved this problem in Azure Active Directory.
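
The naive version of that 'impossible travel' check is easy to sketch — flag consecutive logins whose implied speed no traveler could achieve. This is only the baseline intuition (coordinates and the speed threshold are made up); the production approach described next is peer-aware precisely because this simple check trips over VPNs, proxies, and real travel:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two points on Earth, in kilometres.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def impossible_travel(login_a, login_b, max_kmh=900):
    # login = (unix_timestamp, latitude, longitude); ~900 km/h is airliner speed.
    (t1, la1, lo1), (t2, la2, lo2) = login_a, login_b
    hours = max(abs(t2 - t1) / 3600, 1e-6)
    return haversine_km(la1, lo1, la2, lo2) / hours > max_kmh

redmond = (0, 47.67, -122.12)     # minute one
hungary = (60, 47.50, 19.04)      # minute two
print(impossible_travel(redmond, hungary))   # True
```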

The question was: can we take this and open it up so that other people can build on top of it for their own login space? There was no previous approach, but our solution was: can we reuse what we've already built to bootstrap a more generic model? So I'll first explain what we have, and then I'll tell you what modifications we made. This is the geo-login anomaly detection that is in production in Identity Protection. It has basically three straightforward steps. First off, we capture 45-day windows of your login data. Essentially, each user is mapped to some set of ISPs, and we use a geolocation service to look them up. Excuse me. So essentially, if you look, User 1 is kind of like in the

Washington area and User 4 seems to be in the Massachusetts area.

But we don't know that yet. So the first, once we capture this login data, we calculate user-user similarity metrics. So using custom mappings. So all this does is that it helps you to find how users are similar against each other. And this, you can imagine, and it's done within the context of a tenant, by the way. It's not done globally. But it's an extremely sparse matrix in the sense that users who tend to kind of have

the same sort of login patterns end up with similar mappings. And this is interesting, because geo-login anomaly detection trips up a lot of vanilla systems: people travel, people use VPNs or proxies. So the goal was to take this in the context of a peer group. If my manager travels to Israel, because we have a team there, then when there's a login for me from Israel, there should be some small probability accorded to it — and this is exactly what the second step does. And the third step is we basically run a random walk with restarts on this similarity

matrix, and that gives us a reachability score: given a point in time, what are the different locations this person's login history says they could plausibly be in? So this is great at Microsoft scale, but to open it up we had a lot of problems. First off, it's extremely heavyweight, in the sense that it's compute-intensive and, like I showed you, it needs training data in the order of billions, whereas not many organizations see hundreds of billions of logins every month. And it also uses handcrafted features, which means it really doesn't transfer that well to other domains.
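
A toy sketch of that random-walk-with-restarts step over a user-user similarity matrix (the matrix, restart probability, and iteration count are all invented; at production scale this is the heavyweight, compute-intensive part):

```python
import numpy as np

def random_walk_with_restart(similarity, seed_user, restart_prob=0.15, iters=100):
    """Reachability scores from seed_user over a user-user similarity matrix."""
    # Column-normalise the similarity matrix so it behaves like a transition matrix.
    col_sums = similarity.sum(axis=0, keepdims=True)
    transition = similarity / np.where(col_sums == 0, 1, col_sums)

    restart = np.zeros(similarity.shape[0])
    restart[seed_user] = 1.0
    scores = restart.copy()
    for _ in range(iters):
        scores = (1 - restart_prob) * transition @ scores + restart_prob * restart
    return scores

# Hypothetical 4-user tenant: users 0 and 1 log in alike, and so do users 2 and 3.
sim = np.array([[0.0, 0.9, 0.1, 0.0],
                [0.9, 0.0, 0.0, 0.1],
                [0.1, 0.0, 0.0, 0.8],
                [0.0, 0.1, 0.8, 0.0]])

print(random_walk_with_restart(sim, seed_user=0))
# A high score for user 1 means locations user 1 logs in from are plausible for user 0 too.
```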

It's really hard — we found it very challenging to add new data patterns that are specific to different log sources. So what we ended up doing was using recurrent neural nets. If anybody hasn't said the words 'deep learning' yet, I feel I should get the pity brownie. But we tried to do this thoughtfully. The reason

we went the LSTM route is, first of all, that LSTMs are built for sequential data, and if you're trying to detect login anomalies, nothing works better than that. It also deals well with scale invariance, which is a fancy way of saying that when people log in, they don't log in at fixed intervals — actually, it'd be really weird if I logged in every day at exactly 8 a.m. and logged off at 5 p.m. When you have that kind of irregular spacing, LSTMs work really well. And essentially what we did was take our bulky geo-login anomaly detection system and say: you're bulky, we can't transfer you. So we essentially used it as a teacher to

generate labels for our SSH logins, and LSTMs have a really good capacity to mimic — that's one interpretation of it. So we used the labels from our bulky GLAD system, and we extracted features like timestamp, user, and location. If you look at the features that we use, they're not really specific to SSH logs: they could be used on network logs, they could be used on other types of log sources. So we tried to look for fields that are common across multiple log sources and used an LSTM on them. And this is essentially what the data set looks like: we get two weeks of login data per user, we take those features, and we essentially

put them in sequence. And the task at hand is: the login at, say, time t2 — is that malicious or is that benign? That's essentially what you want to do in the context of SSH logs. So you take that sequence of logins, and this is the featurizer — that's where you extract user, location, timestamp. And you can see it's actually a pretty shallow LSTM model: we've got three layers, and then you get an output for that particular login, whether it's malicious or benign. My login at t2 is contingent on my login at t1, but also on my login at t3. So if I generally log in from Redmond, but then I see a login from Malaysia and then a login back from Redmond, Malaysia kind of should stand out. And that's what this scoring phase does.
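
A minimal sketch of that shallow sequence model in Keras (layer sizes, feature count, and data are all assumptions; in the talk the labels come from the heavyweight geo-login system acting as a teacher):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

seq_len, n_features = 50, 8   # hypothetical: 50 logins per window, 8 featurized fields
                              # (user, location, time deltas, ...)

# Shallow stack: featurized login sequence in, a malicious/benign score per login out.
model = keras.Sequential([
    layers.Masking(mask_value=0.0, input_shape=(seq_len, n_features)),   # pad-friendly
    layers.LSTM(64, return_sequences=True),                              # sequence model
    layers.TimeDistributed(layers.Dense(1, activation="sigmoid")),       # per-login score
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Hypothetical distilled training data.
X = np.random.rand(1024, seq_len, n_features).astype("float32")
y = (np.random.rand(1024, seq_len, 1) > 0.99).astype("float32")   # teacher-provided labels
model.fit(X, y, epochs=2, batch_size=64)
```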

So what does this look like in production? This is something I feel very passionate about. We had to do this in near real time, so we use a Databricks cluster, and we have a high-throughput event hub, which is a Kafka-like service — it's able to ingest billions of logins per minute. And the state of where the user logs in is stored in Azure Blob storage, so any time I see a user's login, I'm able to query Azure Blob very fast and get the state

in our production pipeline. And then it's written back to an event hub that a customer can kind of consume in. So, how does this kind of look like? First off, I want to say that this is still in private preview, which is a nice way of saying it's still in a little bit of a research phase. So I do not have metrics to share for this. But that will change very fast. So the cool thing about LSTMs is there's so many people have worked on optimizing for scale and for throughput that Essentially, training only takes in the order of seconds. And if you run on streaming mode, the mean time to detection is, again, in the order of seconds. Once again,

we really want to think about shortening the kill chain. And also, for this specific problem, logging in might be the first vector for an attacker, so you want to be really fast at blocking the login or making a determination — but at the same time, you don't want to lock people out. You don't want to be the thing standing between a person and the resource they're trying to reach. So keeping this in the order of seconds was super important. This is something you'll find in Azure Sentinel. Given that, I'm going to talk lastly about how we protect the service, using the mindsets that we had. Our rallying cry soon became: we do not want analysts to triage raw alerts; we want them to look at incidents.

And now I have three independent alerts. I've got one for anomalous DLL that's been loading, a new process that started, and there's another separate alert for some sort of large transfer from a SQL server. And it really doesn't tell a story, and you're just adding to the burden of triage to an analyst. What we really wanted to see was, Is there a way for us to extract, say a story with these alerts that happen? Hey, in the context of a host, there was anomalous DLL and there was a process, and here's now a SQL large transfer that happened on this particular SQL box. So when we want to show this, we tried taking Titans, like what Matt

Swann did, and tried to see, do this in the cloud security scale? So for us, our approaches that were out there did not work because of the volume of data that we had. And our entire goal was, can we do this not on a host level alone, but on a service level and a host level and network level together combined?

Unlike the previous case studies, where the dataset expertise was such that I could just talk to folks from different buildings and understand it, in Azure Sentinel, where we had to add product value, there are multiple different data sources. So obviously, first off, Microsoft products — hooray, I can go talk to people I know. But you know what? Customers also use AWS; that's a fact of life. And what that means is we had to understand how anomalous activity manifests in AWS logs. And if you just look at network-level instances, you've got Palo Alto, Cisco, Barracuda. And even on the endpoint, we not only looked at Windows Defender, which we love, but also Symantec and others like CrowdStrike.

So our entire cry was, given that a customer has got so many different products, a way for us to stitch multiple alerts from multiple products and say some sort of cohesive story. That is like what this case study tries to cover. So I'm first going to give you like the 50,000 foot overview and then I'm going to drill a little bit into the detail. So the input to our system is essentially like raw events from all of these products so and that by itself is actually in the trillions of ranges because like each product has like a voluminous amount of just raw events now those raw events get converted to like anomalous behaviors through either from an

out-of-the-box alert that comes from, say, Azure Information Protection — there are alerts that come from there — or it could be something custom that we built on top of that. That's now in the billions range, which is really not actionable. So you've got those alerts, you've got some sort of anomalous activities, and it's still not actionable. We use this concept called a probabilistic kill chain — I'm going to talk about that in just a minute — and that helps us drastically reduce the number of security-interesting cases that we want to ticket to customers. And we do one more round of scoring to only have a

handful of cases. I mean, all these metrics, by the way, are for a month. So the goal of this project is not a cheesecake factory. They've got multiple assortments. This is more like an artisanal, like handcrafted, like, I don't want to say hippie, but I'm Seattle, I could say that. It's customized for a particular tenant. And the expectation of this is like it's not going to fire every day, but when it fires, it's DEF CON 1. You want to put your biggest resources on top of this alert. want to focus on reducing the alert fatigue there. I know it has a lot of different drawbacks. For instance, it might alert lesser, but we made a decision where if we're

taking time away from an analyst to look at an alert, we want it to be productive and useful to them. So how does this work? This, by the way, is a deep dive on the last two boxes. The first thing we do is construct a graph. A vertex is essentially an entity — it could be a username, it could be an IP address, it could be a VM, it could be a host, it doesn't matter. And an edge is any sort of connection between them. So if Ram is logging into a VM, you'd see an edge connecting Ram and the VM, and another node for the IP address.
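
A sketch of that entity graph using networkx (library choice, entity names, and events are all assumptions for illustration):

```python
import networkx as nx

G = nx.MultiGraph()   # multiple alerts/events can connect the same pair of entities

def add_event(graph, user, vm, ip, event_type, timestamp):
    # Every entity becomes a vertex; every observed connection becomes an edge.
    graph.add_node(user, kind="user")
    graph.add_node(vm, kind="vm")
    graph.add_node(ip, kind="ip")
    graph.add_edge(user, vm, event=event_type, ts=timestamp)
    graph.add_edge(ip, vm, event=event_type, ts=timestamp)

# Hypothetical events from Microsoft and partner security products.
add_event(G, "user:ram", "vm:sql-01", "ip:203.0.113.7", "interactive_logon", 1565000000)
add_event(G, "user:ram", "vm:sql-01", "ip:203.0.113.7", "anomalous_dll_load", 1565000300)
add_event(G, "user:ram", "vm:sql-01", "ip:198.51.100.9", "sql_large_transfer", 1565000900)

# Pivoting on an entity is just looking at its neighbourhood in the graph.
print(list(G.neighbors("user:ram")))
print(G.number_of_nodes(), G.number_of_edges())
```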

And the events for this is just like we get from Microsoft and like partner security products. Really we haven't done anything over here other than we have constructed a graph. And this graph itself has like billions of nodes and edges, right? And what we essentially did is we wanted to prune this graph using like a probabilistic kill chain. What we understood by talking to analysts is that the blue team kill chain that I showed you at the beginning is a static kill chain. It assumes the attacker, after she compromises the box, she's only going to do lateral movement or she's going to go to the next step. It does not account for the fact that once she compromises a box, she

might compromise other boxes. So we wanted to bring an element of probability to it. And essentially we were thinking, what does it mean to have a probabilistic part to a kill chain? And we found out that blue team folks especially, not all kill chains are kind of like treated equally, right? So for instance,

if I see an alert wherein the kill chain has progressed all the way to exfiltration, That's probably a bigger bang for the buck than looking at an alert for just reconnaissance. You can just have one step for that kill chain. So complete kill chains are preferred over incomplete kill chains. That was our first criteria right there. And the second thing was we wanted to be time bound. If I see an event for the unusual process, but if all the preceding anomalous logins happened two years back, then it's really not that interesting. But if I see an anomalous process and right before that there was an alert for a malware on the host, that's way more interesting for an analyst. So

we put conditions on time on when we pruned this graph. And we also looked for commonalities, right? Even though this is constructed in the context of a particular tenant, we wanted to see if those kill chains manifest across different tenants of the same profile, that's probably interesting. Probably there is a campaign that's happening on a particular type of industry, which means that an analyst definitely needs to know about. So we essentially used these as priors to graph that we constructed to essentially regularize them. So we want to prune away any sort of uninteresting noise that's coming through the graph. And that would essentially give you

that step right there. So we made a decision to have one more round of scoring because we felt that just by including domain knowledge, it's fantastic, but we also wanted to include the labels that we had from our previous cases. So we basically took features like, you know, is this attack over different tenants, number of high impact activity across the graph, and we were able to do one more round of scoring on top of these interesting subgraphs to reduce the noise further.
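
A heavily simplified sketch of the pruning and scoring heuristics described above — prefer chains that reach later kill-chain stages, require the steps to be close together in time, and boost chains seen across similar tenants. Stage names, weights, and thresholds are invented for illustration:

```python
# Hypothetical ordering of kill-chain stages, earliest to latest.
STAGES = ["recon", "persistence", "lateral_movement", "exfiltration"]

def score_chain(alerts, max_gap_hours=48, seen_in_other_tenants=0):
    """alerts: list of (stage, unix_timestamp) tuples for one candidate subgraph."""
    alerts = sorted(alerts, key=lambda a: a[1])
    stages_hit = {stage for stage, _ in alerts}

    # 1. Complete (and later-stage) chains beat incomplete, early-stage ones.
    coverage = sum(STAGES.index(s) + 1 for s in stages_hit) / sum(range(1, len(STAGES) + 1))

    # 2. Time-bound: an unusual process two years after the anomalous logins is boring.
    gaps_ok = all((t2 - t1) <= max_gap_hours * 3600
                  for (_, t1), (_, t2) in zip(alerts, alerts[1:]))

    # 3. Commonality prior: the same chain across tenants of the same profile hints at a campaign.
    campaign_boost = 1.0 + 0.1 * seen_in_other_tenants

    return coverage * campaign_boost if gaps_ok else 0.0

chain = [("persistence", 1565000000), ("lateral_movement", 1565003600), ("exfiltration", 1565010000)]
print(score_chain(chain, seen_in_other_tenants=2))
```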

So this is how it looks today. If you look at the first thing to notice is that the classification takes still in the order of hours. It's not in the order of seconds, like how the previous approaches were. We're trying to run that debt down. We have billions of alerts per day, but our training time is for that particular volume from especially multiple data products. We are kind of gated on the throughput of those products, but we're still able to keep that within the order of hours. I have like five minutes, but I'll show you a very quick demo right here. So this is like what... Let me show you this. what a case would look like in Azure Sentinel.

So essentially, this is a case where there was an anomalous login followed by a suspicious O365 mailbox rule creation. What we found, at a bank headquartered on the East Coast, was first a suspicious login from Germany — which by itself is not that interesting; people travel. But subsequently, within a short period of time, we saw that person's mailbox being forwarded to an external Gmail address, which is super weird. And because these two events came from multiple different services, we were able to get them and stitch them together, and this is what an analyst would see. So it's kind of like the classic:

taking those two yellow alerts, which by themselves may not be interesting, and getting that high-severity red alert. And one of the big advantages of having that graphical representation is what it shows when the user clicks the investigate button.

These are the two alerts. This is the user entity. You could get any sort of related alerts for the user entity because we're pivoting. We already have that in our graph backend. I can switch over to timeline. Let me not do that. First, let me show you this.

And kind of like see for the user that's under investigation all the alerts that are related for that particular user. I can also run... This is like... We're able to display this in a relatively fast fashion because it's already part of our graph backend. And for a particular entity, you can also do things like show me... Let me show you this.

I can see the services they created, the user accounts that failed. So for any sort of investigator, it gives that experience from our backend when they click that big blue investigate button. So I'm going to wrap up in the interest of time so we can have questions. The most important thing that you want to take away is: when you're protecting the cloud, the mind shifts that we found useful are to think about the enormous volumes of data that you have, and to think about the differences in architecture between on-premise and the cloud. And, you know, there's a lot of interesting debate out there about whether

machine learning can even help in cloud security. We obviously feel very strongly and positively about it, and hopefully you've taken away how we were able to drive down false positives, increase the true positive rate, and scale into the billions. I've put some resources in here, and I'll send the slides out. If you have any questions, please let me know — you can always email me if you don't want to ask here — or I'll be hanging out over here, super happy to help with any of this. Well, thank you. So we have

about two, three minutes left so we can take about three, four questions.

So you mentioned that you were doing that probabilistic kill chain sort of graph pruning. I'm curious if you looked at other methods, like the unified kill chain, or more like the ATT&CK matrix and things like that? Any gaps that you found with the kill chain, especially given that you want to weight the later stages more than the earlier stages, and other ways you would have dealt with that? — That's a great question. I think one of the deficiencies of the ATT&CK techniques, for us, was that we did not find them mapping directly to cloud-based attacks and service-level attacks. I know that's rapidly changing — in fact, Sharon Shere from my team is working on that. But essentially, that was our

biggest deficiency: we could not find anything cloud-based that we could directly reuse.

So you mentioned that if a simple model works,

use that. You don't need to overcomplicate the problem. But how do, I've seen a lot of times where people who are getting started in machine learning, like they can run a model and the model produces something, but they may not know whether it was simple or whether the simple model was good or whether it was too good and they've got some overfitting problem. So how does someone who's starting out know whether the simple model is actually good? question. Well, I think the way we have, we kind of like convince ourselves that simple model works in terms of code maintenance, like legacy code, integration, like if you're able to do nothing like doing unit tests on your code, to code coverage and all of

that. So you get all this goodness by using simple models. But how do you know a simple model is good enough? We have really good program managers, and we get good business metrics from them. You know, if our system is able to meet or exceed their business metrics, check box. The overfitting problem, I feel like, you know, when you're evaluating, take into consideration the diversity of your tenants. Like if all your training data was only on, say, small SMBs, but then like, you know, you have to put this in production to say data from Fortune 500 companies. You know, you've got like, distribution disparity. I would say collect very different sort of like data sets and then test them.

That would be like my advice. Anyway, we've got to stop, but I'm gonna be hanging out outside for a couple of minutes, but I also want to attend the next talk. So yeah, I'm super happy to help though. Thank you. Yeah, thank you.

All right, everyone. Welcome back to B-Sides Las Vegas Day 2. This is the Ground Truth Room, and this talk is Scheming with Machines, presented by Will Pierce. Now, before we begin, we just have a few quick announcements. First of all, we'd like to thank our sponsors, especially our Inner Circle sponsors, Critical Stack and Veil Mail, as well as our stellar sponsors, Amazon, Microsoft, and Paranoids. It's the support of these sponsors, along with our other sponsors, and volunteers that make this event possible. Now, this event is being live streamed on YouTube, or rather this talk, so we ask that as a courtesy to our speaker and to the audience that you right now make sure that your phone is set to silent. And if you have any questions,

we're going to ask that you use this audience mic so that our YouTube audience can hear your question. Just raise your hand and I'll be sure to bring over that microphone. And with that, we're ready to begin, so let's please welcome Will Pierce.

I'm already coughing. Can you guys hear me OK? Cool. Well, welcome to my talk. This is technically the first talk that I've given. Scheming with Machines: we're going to talk about the offensive use cases of machine learning. We're not going to talk about adversarial machine learning; we're going to talk about using ML to support offensive teams. So who am I? Will Pierce. I work for Silent Break Security; we're a small security consulting firm in Lehi, Utah. We're kind of known for custom malware dev — we write all our own malware. We teach two courses at Black Hat, Dark Side Ops 1 and Dark Side Ops 2. They both focus on malware dev, and Dark Side Ops 2 focuses on research. Has

anyone taken the Dark Side Ops courses, out of curiosity? No? It's two days of training, so if my voice goes, I'm sorry. My primary function is as an operator: I phish people, I write reports. My job is effectively to breach organizations, write a report, and tell them how I did it. I'm not a data scientist. The second piece of my work is obviously research, so this is the research piece from the last year, eighteen months, that I've been putting together. There's a lot of excitement around ML, and I actually come from kind of a finance background. Can anyone use Excel without a mouse in here? Any keyboard monkeys? Yeah. So that's my claim to fame: I can use Excel without a mouse. If

you're at a very boring party, that's my party trick. And then we do, obviously, teach training. So we teach dark setups, one and two, and we write training and things like that. The agenda for today is we're just going to talk generally about ML and InfoSec, and it's going to be from my perspective as an operator. So whatever that means for the products that we see, the hype that we see, and what I actually see on networks versus what maybe vendors are claiming may or may not be true. I'm also going to talk offensive tooling. So I have three case studies that we're going to go through that I've sort of developed. here for any groundbreaking ML research, you're not going to find

it. But what you will find, I think, is some interesting use cases from a very lay person's view — at least someone who's trying to harness ML to become more efficient. So, as we know: vendors, old dogs, new tricks. From my research, what I know is that ML is not magic, it's math, no matter what anybody tells you. How many here are MLers, like data scientists? How many red teamers, pentesters, blue teamers? Cool, so this is a good even mix. So we all know it's math, not magic. I think a lot of the products aren't quite built yet, and the marketing engine is going, and so we're starting to look at it more critically. It has huge implications for the

red side, just in terms of the sheer detection capabilities and the sheer amount of data it can go through. So it's extremely important, I think, for offensive teams to at least take a look at it. My current thought is that maybe machine learning might go the way of app whitelisting. If you remember a few years ago, everyone was like, oh, wait, you can just block PowerShell.exe and no one can run it. And then the LOLBins project came out and they're like, oh, as it turns out, there's just a ton of stuff that can execute things. So it's going to be interesting to see how vendors try to implement this — they have an almost impossible task with an enormous amount of data. So it's interesting to listen

to these talks. So: vendor claims are generally overblown. Data requirements: in the organizations that we get into or consult with, most of them have insufficient logging and alerting. As we know, ML requires some amount of consistent data. Obviously there are techniques, that I'm not aware of or can't speak intelligently about, to fill in data sets, but I do know that ML requires data, and I know that organizations don't always have the logging required. ML is also a significant engineering challenge — there are just a ton more moving parts. There are models to keep updated, there's model drift, and they still suffer from the same problems that current SIEMs suffer from: false positives, poor system

implementation, open shares on the network. So if you're keeping your model in an open share, we can steal it. A lot of times what we do is find code or products written in .NET, which we can pretty easily decompile because it's not compiled to native code — and it's the same for models. I think the first neural network I found was in a business app, and it was legitimately sitting in an open share, so that model was just there for me to take. And if you know anything about adversarial machine learning, that opens the door to any number of things. And then we get to data scientists running the SIEM. What ends up happening is it's very expensive to have a full-time employee to

run your SOC or SIEM or monitor alerts, so what ended up happening is people went SOC-as-a-service. And what I know as an operator is that SOC-as-a-service is terrible, because it's just an average of everything — you're just a client, and they have an SLA with you. Probably the best one we've come up against: it was 24 hours later that our client got the notification

from the SOC, and it was too late. So if you're going to have data scientists running your SIEM, you're going to have a bad time — and we've seen it before. But look at the tutorials that are out there — I like the 'baby's first' ones, the DEF CON thing. From our perspective, I see all these tutorials saying, oh, we can detect malware: we have all these PE files, they're all sitting statically in this folder, we can read in all the bytes. Well, we don't — modern execution primarily happens in memory, so you're not going to be able to gather enough that way. You can gather samples

and you can view them in memory, but a static PE file alone, I think, is going to fall short of a detection in a real environment. Phishing: Proofpoint is a great example of this. Proofpoint has a pretty good product, but it ignores vectors like LinkedIn or Twitter. So instead of going through a difficult ML-enabled mail system, I'm just going to hit you up on Twitter with a document, or hit you up on LinkedIn with a document. Then I'm pushing the malware problem from the mail filter to the web gateway, and the mail filter isn't going to catch it. And this is something we already do because of increased detections. And then on the network: we see, OK, we can detect malicious DNS

traffic with machine learning. It's like, well, if you have a pair of eyes, you can detect malicious DNS traffic — malicious DNS traffic looks extremely malicious. When you're trying to shove as much data into a DNS packet as you possibly can to get that transfer back and forth, you don't need machine learning to spot it, and you might be introducing additional complexity that's just unnecessary. You're going from something that's certain to something that's less certain — something that generalizes versus something you know is true.
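
For the 'pair of eyes' point: the kind of DNS tunnelling described here — cramming data into query names — can be flagged with a trivial heuristic rather than a model. The thresholds and example names below are made up:

```python
import math
from collections import Counter

def shannon_entropy(s):
    counts = Counter(s)
    return -sum(c / len(s) * math.log2(c / len(s)) for c in counts.values())

def looks_like_dns_tunnelling(qname, max_label_len=40, min_entropy=3.5):
    # Tunnelling tools cram encoded data into query labels: long and high-entropy.
    labels = qname.rstrip(".").split(".")
    data_labels = labels[:-2] if len(labels) > 2 else labels   # ignore the registered domain
    return any(len(l) > max_label_len or (len(l) >= 20 and shannon_entropy(l) > min_entropy)
               for l in data_labels)

print(looks_like_dns_tunnelling("www.example.com"))                                     # False
print(looks_like_dns_tunnelling("dGhpcyBpcyBleGZpbHRyYXRlZCBkYXRhCg.c2.evil.example"))  # True
```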

But from our side, what we see as typical is sequence analysis of function calls. There's a paper out by a couple of guys out of Sandia about detecting malware using Win32 API call sequences. So for me, if we're living in memory and we rely on those API calls, that's something we should pay attention to from my side. Phishing: we have a much harder time phishing than we used to. We used to be able to send five or six emails and get maybe 50% clicks and shells. But now we're seeing context-aware spam filtering, so it's just much more intelligent. Anything with a link is getting picked up; anything with a link is getting visited and the payloads pulled down. So it's just an additional headache that, I think, our side of the industry is still kind of reeling from. But obviously

with things like going to LinkedIn, it kind of makes it a little easier. But we're just having to spread it out versus only email. And then on the network, we need information to perform our jobs, and so that requires querying any number of services or servers. And if we're on a host, Jim in HR, and we're querying some SMB server over in the accounting department, even if you don't have host isolation, that's going to look anomalous. Jim shouldn't be doing that. So if you have those detections in place, but at the same time, could implement a machine learning model to detect it, or you could implement host isolation. And in that case, just the traffic

doesn't exist anyway. So it's obviously give and take. So our perspective, or at least my perspective, of ML, at least in the short term, I think offensive teams actually benefit in the short term. So our data sets are smaller. Our expectations are lower. Let's be honest, none of my team expect anything useful to come out of ML. But that's on them, because it does work sometimes. But effectively, my job is to try and augment decision makers. I want to bring information to them so that they can make the decisions that they need to. And then as I slowly pick away at that, eventually I would like to get to a point where we have some semi-intelligent operation going on.

And just generally from the red side, we need to be able to create better tools. So later on in the presentation I'll talk about pushing models client side. So if we could ship malware with ML models in it and just never have comms, that'd be awesome. Or if it could do, rather than reaching out to communicate on an external domain, if it could just collect information on the internal network and just it back out whenever we need it to. It's going to be preferential. So if we can trust it on the internal network, not to destroy anything, then it's going to be preferential for us. And ultimately, we just want to be better operators. So every operator has like 10 commands they run. And if I could get to

command 5, and rather than going through my normal 6 through 10, I can just go straight to 10 because of some information I got in the previous commands, then that makes me a better operator. more efficient at my job.

So I actually wrestled with this title. I said new kid on the block, and it's not technically true. But it's new kid on our block, especially from the offensive side. There's not a lot of research out there. There's some pretty sweet projects. And actually, if you go way back, there's some really cool projects. But I just want to separate ourselves from adversarial ML. So adversarial ML. Obviously, let's make this model think this dog is a cat or classification bypasses. It's more than that, but basically we only care about classification bypasses because we're cavemen. But I want to talk about offensive ML and just that using ML to support offensive operations. So, awesome research, so the parcel sec, the timing attacks,

DeepExploit, the big language model — they said they didn't release their big model, but we found the smaller models are just as good at generating phishing emails.

There's MarkovObfuscate, which is pretty awesome: it basically obfuscates data using Markov models, and you can C2 over it. It's pretty crazy.

This one was Defcon last year, a year ago. But there's, it's coming out, DeepWordbug. But I have kind of started to aggregate these projects at my GitHub, which I'll post a link to. But it's just RedM. So I'm starting to collect these projects to start aggregating them. But when we're talking ML, so we have, I think, a unique set of challenges on our side. So our data sets are very sparse. We might have like 30 projects in the pipeline, each with 5,000 lines long. So we just, by traditional ML sounds, we just don't have the data so my epochs are higher than they need to be. Overfitting is, in my view, for us, not a problem, just because our use cases are much more focused. We're

not necessarily trying to generalize. We actually might be trying to be very specific. So in that way, I think of overfitting as a tolerance, if you imagine an engineering term. So how close to that tolerance do we want to get? I don't know if I have a professor liaison who's been helping me. And it basically means go like, hey, I think this is possible. She's like, yeah, I think that might be possible. I'm like, OK, sweet. So should I go and do it? She's like, yeah, go try it. See what happens. Transferability. So we are in a lot of different networks. So we obviously can't store data for the purposes of other networks. And a lot of that data is very sensitive. So social security

numbers. KRB TGT hashes. We just can't keep it for the sake of machine learning. So we have to find a way to make our models both useful but also agnostic to each network. So I think it's potentially a unique challenge. And then sharing data. So as offensive ML becomes a thing, We can't share data like we can share other TTPs. So I can tell you, oh, you should go look at install util, because you're going to be able to get execution, application whitelisting. I can't give you a data set, because it might contain some KRB, TGT. Or if you have enough domain knowledge, you might be able to reverse it and get something if you're clever enough. And that potentially could just come

from my lack of understanding. But there's a book. It's called Adversarial Machine Learning. It's a blue cover. And it's pretty great.

So when I start to think about the problems that we face, or the ones that I want to solve: ML on the blue side is speeding up, and people on our side are just behind on it. So I think unless we, the red side, start looking at it, we're going to be left behind. It's a bit like the frog in boiling water — it dies before it realizes it's too hot. So my hope with this, if you are a red teamer, is that you start looking at it, because it is useful. And beyond offensive ML, organizations are going to be implementing it, so you're going to need some basic understanding of it if you're going to want to

operate in networks in the future, assuming it takes off. But as I go through thinking about solving a problem, I have this empirical analysis: I have experience as an operator that leads me to a conclusion, so I know what an expected output should be for a given input — that's kind of what I bring as a human. I also have my own anecdotes, but those could be slightly skewed. In one network, I might have been able to do something; in another network, I might have done the same thing and it didn't work, and I might attribute the fact that it didn't work to something incorrect. So going forward, I'm going to

have some false assumption about why it didn't work until that gets corrected, and that's gonna come through in my ops. And then there's statistical analysis: how many commands did I run, what was the length of my commands, that kind of thing. I think the blended approach is really useful. We'll get into some statistics later. The first case study we're gonna look at is classifying a sandbox. I had a blog post out a while ago — it's been almost a year now — and I called it part one, and I figured if I put part one, I'm gonna have to put out part two. And I didn't, so this is part two, this is what you're looking at, this is

the talk. So hopefully there'll be an update — parts two through seven, I think.

So, playing in a sandbox. Our job is to package up a Word macro or an Excel sheet, whatever it is, and try to get some user to click on it so we can get execution. And part of that is that basically any email with an attachment these days is getting executed in a sandbox. Sandboxes are kind of outsourced analysts, in a way, and their job is effectively to make some determination about the safety of that payload. The traditional response has been, OK, let's change the strings in it, link it instead of attaching it, just to try and get it past the mail filter. But they're kind of dumb machines, they're very automated, I wouldn't say they do a great job, and they're very

easy to spot as a human. So, playtime is over. Why would we care about being in a sandbox? Well, we write our own malware, and it represents a significant portion of our IP. If that piece of malware gets burned, that represents a significant amount of time we're going to have to spend rewriting it. Slingshot is our main tool, and we've probably been devving on it for seven years — and actually we don't send Slingshot in, for that reason. We have a stage-one tool for that: a smaller piece of malware whose job is to protect our payloads. We just don't want to give them away for free. We don't want to be spraying

the internet. We don't want to be accidentally detonating it in someone's personal account. So we need to make sure we protect our payloads even from sandboxes. Alert fatigue: I saw some of you raise your hands for the blue side, so you're obviously familiar with alert fatigue, right? Little-known fact: so are we.

So we put image pingbacks in our macros. There's an insert-field code, and once you open the document, it reaches out to a remote web server and tries to pull down an image. We just put in an arbitrary URL, and it has a webhook, so when it gets hit we get a text. That's all well and good when you open your email and we just get one — oh, sweet, got a shell, or got someone to open it. But in the sandbox world they like to share, so you get one, and then two, and then five, and then 20. I don't want to be up at 3 a.m. responding to sandbox alerts. If I can push that job to ML and just say, you deal with it overnight, from 6 to 6, then that's worth the research for me, because I get more sleep.
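Purely to illustrate the mechanism — this is my own sketch in Python, not the team's actual setup, which uses a third-party webhook that sends a text — the receiving end of an image pingback can be a tiny endpoint that serves a pixel and fires whatever notification you like:

    # Hypothetical pingback receiver; file name, port, and notification are assumptions.
    from flask import Flask, send_file

    app = Flask(__name__)

    @app.route("/logo.png")
    def pingback():
        # A hit here means the document was opened (by a user or by a sandbox).
        print("pingback received")                            # swap in your own alerting
        return send_file("pixel.png", mimetype="image/png")   # any 1x1 image on disk

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8080)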

It's pretty easy to detect a sandbox. Does anyone have experience with sandboxes? Does anyone look at their process lists? If you look at the process list on a sandbox, or just on a host that you don't use very much — even a Windows 7 VM versus a real user's box — there's going to be a very clear difference. And I'd say human operators like our team can detect a sandbox probably 99.9% of the time.

But it gets laborious to do. So let me break it down. We get a process list back for OPSEC purposes: we use a stage-one tool, we get a process list back, and it's kind of a look before you leap. If in the process list we see that, say, Cylance is on the box, for us that means a handful of commands are off the table. So before we're even interacting with the target, we have some expectation of what commands we can run versus not, and we're not gonna be on the host generating unnecessary alerts just because we didn't have a process list to begin with. So I guess that's why we

were even looking at this process list to begin with. So there are clear differences. The other difference, obviously: users are gonna have things like Chrome and Word and Excel, and sandboxes aren't gonna have any of that — it's gonna be very basic. Typically the payloads get run as admin, it's a Windows 7 VM, so they just look very different. The chart up here is just a little PCA plot that I did — that's obviously the classification piece. But by the numbers: if we're looking to select features, I chose process count, user count, and then a process-to-user ratio — admittedly, they're terrible features — and obviously a label.

It illustrates the fact that you don't have to use any NLP to sort of break down the numbers. You can just use features. You can use the pure process count, the user count, any number of ratios. It doesn't have to be this really complicated thing. It's just data representations. It's like however you want to represent that. And obviously, I know there are techniques to pull out the best features.
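As a rough sketch of that idea — not the actual code behind the talk — extracting those three features from a posted-back process list in Python might look something like this, assuming each line is "image name, user":

    # Hypothetical sketch: turn a raw process list into the three features above
    # (process count, user count, process/user ratio). The "image, user" line
    # format is an assumption; real output from a stage-one tool will differ.
    def extract_features(process_list_text):
        lines = [l.strip() for l in process_list_text.splitlines() if l.strip()]
        users = set()
        for line in lines:
            parts = [p.strip() for p in line.split(",")]
            if len(parts) > 1:
                users.add(parts[1])
        process_count = len(lines)
        user_count = max(len(users), 1)
        return [process_count, user_count, process_count / user_count]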

For our purposes, this has worked pretty well. And you're not limited to those features. A lot of people have their own checks they want to make — the process list is ours, but some people look at recently used files, whether a keyboard is enabled, what version it is, what username is attached to it. Sandboxes often have names like Bob, admin, John, admin-PC.

They just don't try very hard. But you can attach whatever checks you want, and those become the representations of your sandbox, your data set. And any MLers in the room: if I say anything incorrect, talk to me afterwards and correct me — that would actually be extremely appreciated. So, decision trees. This was actually the second model we went to, because obviously when you learn ML you want to do the cool stuff, so we went to neural networks first. Then we were like, OK, this is all right, and we went back to decision trees: OK, this is actually way better, because, one, it's simpler, and I can explain it to my

operators — I can say, hey, here's why you lost your shell, and actually explain it to them. I like to think of it as twenty questions as an algorithm, so it's easy to explain to operators, who, as I've already mentioned, are a bit like cavemen. But you're all familiar with it: effectively you have a root node, it makes some determination, true or false, and you just go down the tree until you get to some conclusion — the classification. And the code's super simple. Even if you're not mathematical — I'm admittedly pretty terrible at math — as long as you understand the

concepts and the code, and you understand what goes in and what comes out, it's extremely accessible. It's not like the math you did in high school where you're grinding through problems. I wouldn't say I've fallen in love with math again — I never necessarily loved math — but the fact that it's more about concepts and expressing ideas in mathematical notation is kind of cool. So you don't have to be mathematical to implement these things; obviously, if you want to get into the details you'll have to learn more, but you can do the simple stuff.
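Not the model from the talk, but a minimal sketch of what a decision tree over those three features looks like with scikit-learn — the training rows here are invented placeholders:

    # Illustrative only: a decision tree over [process_count, user_count, ratio].
    from sklearn.tree import DecisionTreeClassifier

    X = [[25, 2, 12.5],    # sparse process list, few users  -> sandbox
         [180, 6, 30.0],   # busy host, several users        -> real user
         [30, 1, 30.0],    # sandbox
         [150, 5, 30.0]]   # real user
    y = [1, 0, 1, 0]       # 1 = sandbox, 0 = real user

    clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
    print(clf.predict([[28, 1, 28.0]]))   # -> [1] on this toy data, i.e. sandbox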

Then, obviously, the neural network. We have our inputs, our output, and the weights — the stronger the weight, the stronger the signal on the output — and we make a prediction from that. It's kind of basic ML stuff. I'm going to shy away from explaining the models, because I know there are some ML folks in here and I'll just do myself a disservice if I try to explain it in front of professors — you can Google it. I'll talk to the operations piece instead. But again, the code for it is super simple. There are libraries — I love Keras, and it gives you a couple of backends to choose from. I've been moving more into the PyTorch area, just because it seems to be more academic and maybe a little better supported.
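As an illustrative sketch only — same three features and the same invented labels as before, with deliberately crude scaling — a tiny Keras classifier might look like this:

    # Hypothetical sketch: a small Keras network over the same three features.
    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense

    X = np.array([[25, 2, 12.5], [180, 6, 30.0], [30, 1, 30.0], [150, 5, 30.0]])
    y = np.array([1, 0, 1, 0])       # 1 = sandbox, 0 = real user
    X = X / X.max(axis=0)            # scale features before feeding the network

    model = Sequential([
        Dense(8, activation="relu", input_shape=(3,)),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, y, epochs=50, verbose=0)
    print(model.predict(X[:1]))      # probability that the first host is a sandbox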

And then there's also ML.NET for client-side ML. Microsoft actually just said they use ML.NET in their Windows Defender product, so it's going to be supported going forward. It would be awesome to be able to use ML.NET in your malware and ship it to the client side — the same thing Microsoft uses, and it's going to be supported, so I don't see why you shouldn't or couldn't. We have some Microsoft people in here. So, the code is super simple. And once we've built the models, I have a server that I'll release — I called it DeepDrop. Effectively all it does is load the two trained models and expose a function.

There's a macro that goes with it, and all the macro does is post back a process list, just like we normally would. Then you effectively parse the process list as you would any other — instead of doing a lot of them, you're just doing one — gather your features, and make a prediction on those features. If you're using a neural network, you're obviously going to have to scale them to whatever is appropriate. Then, if the prediction is good, we drop our malware; otherwise we don't. It's very simple — tantamount to shipping off one more singular check.
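DeepDrop itself is the author's tool; purely as a sketch of the flow just described — parse, featurize, predict, decide — a minimal Flask endpoint could look like this. The route name and payload path are my assumptions, and extract_features and clf refer to the earlier sketches:

    # Sketch of the described flow, not the real DeepDrop source.
    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/check", methods=["POST"])
    def check():
        # The macro POSTs its process list; featurize and classify it.
        features = extract_features(request.get_data(as_text=True))
        if clf.predict([features])[0] == 1:
            return "", 204                          # looks like a sandbox: serve nothing
        return open("payload.bin", "rb").read()     # otherwise drop the payload

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8080)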

The dropper code itself is very basic — there's always going to be some dropper decision — and you could also push it client side. It's not all nice, though: you have model drift. Overnight, all the sandboxes could change. They could all decide, we're done with Windows 7, we're going to Windows 10 — and just from having looked at it, the Windows 10 process list is much larger and already much closer to a regular user host. So once we start seeing, for example, stock Windows 10, our model gets a little fuzzy; it's harder to make the determination. On defenses: we've obviously noticed a few

vendors whose sandboxes have stopped reaching out to the internet. If we don't have information, we can't build models, so we kind of need that feedback. And then data collection for phishing campaigns: you're going to need a separate process, because you can't collect data while you're phishing — those two pieces of code need to be separate. In a production phish or a production macro you want to keep it small and tidy, as small as possible, and data collection adds code weight. So you need a secondary effort, some sort of secondary phishing campaign, and it

could quite easily be automated, where you're just sending things to VirusTotal or anywhere else you know you can get execution in a sandbox. And that'd be separate: it feeds your model, and then your prod VBA or prod macros rely on that model to make the determination. And adversarial inputs — we've kind of come full circle here. If we were to make the features post back in a URL parameter, for example in a GET request, and parse those out, analysts could change those inputs fairly arbitrarily and trick our DeepDrop into deploying our malware so they can grab it.

So for example — just that top line there — if we were to send that in a URL, someone could intercept it, change those numbers arbitrarily, and force our dropper to deliver. So there's gonna have to be some check on the other side. What's funny is that I'm now dealing with the same problems the blue side is dealing with — it's two sides of the same coin. The other thing I mentioned is client-side models. Excel is built for data: you can build a neural network in Excel, you can build a decision tree in Excel. So you can package up your model with the Excel document itself, and it's

as if you've sent some sort of operator intelligence along with it. It never has to reach out to the internet, and it has everything it needs client side to make a decision about whether or not it's safe. This is kind of awesome.
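One way you might move a model client side — an assumption on my part, not necessarily how the speaker does it — is to dump a trained decision tree as plain if/else rules and hand-translate them into the macro:

    # Sketch: print the trained tree's rules so they can be re-implemented
    # client side (e.g., translated into VBA). Uses clf from the earlier sketch.
    from sklearn.tree import export_text

    print(export_text(clf, feature_names=["proc_count", "user_count", "ratio"]))
    # Example output (thresholds depend entirely on your training data):
    # |--- proc_count <= 90.00
    # |   |--- class: 1      (sandbox)
    # |--- proc_count >  90.00
    # |   |--- class: 0      (real user)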

There are some caveats, though. I haven't heard of anyone doing this, so your macro is going to stick out like a sore thumb. And you lose the ability to troubleshoot. In our macros, I mentioned the image pingback, so we know when a document's been opened, and we have the macro pingback, so we get a process list back and know when the macro has been executed. Normally, if we get that pingback and see a process list, we can say, oh, this product's on the box, we'll just try a different payload. But if you push it client side and never get anything back, you can't make any assumptions about what's on the box or even what happened. It effectively costs you something to send it client side, and we kind of need that information to make intelligent decisions on our side about the future of the op. So now I have a demo of

it. Let's see how this goes.

So obviously we have this terrible output from wherever. OK, so this is DeepDrop. I'll release it on GitHub, and it's super simple — if you have questions or whatever, ask. It has two pre-trained models, and I'll release a data set with it as well, maybe 200 samples. I can't release full process lists because there's client information in them, so you'll just get the

features — yeah, just the features that I've selected. So, starting it up: super simple, it just loads your models, and there's a macro that I'll release with it as well. Here we load our neural networks, we load our routes — it's just a Flask implementation, super simple. Then I run the macro, and in the background it runs that check — we're not gonna drop the payload — on a little VM just running in my lab. It's super simple. It's really not that difficult.

You guys probably do much cooler stuff between the hours of 6 and 8 than DeepDrop.

Any questions about that one? Really simple. Again, if you're here for breaking stuff, you're not going to find it, but this is how we're thinking about implementing ML on the offensive side. So, command recommendations — this is case study two. We went from this really easy, awesome thing to something that's actually extremely complicated. When I first got into it, I thought, you know what would be sweet? Command recommendations. That would be awesome, because then for any command I could just ask it, and it's going to tell me, and I don't have to think anymore. But that's not how it turned out.

So, old habits. We have existing knowledge that we want to take advantage of. I have a model in my brain: I need to run these ten commands, and based on this information I'm going to do this other stuff. We want to take advantage of that. We have a team of six or seven operators, I have logs for all of them, and we have parsing for all of it, so I get everything — I can see everybody's session logs, for good or for bad. Some of them are pretty dirty, in the sense that their OPSEC is terrible sometimes. So, sequences of commands:

we know there are models that can analyze sequences and give us probabilities. So we can say: based on all these logs, all these commands run, based on this sequence of the first ten commands, the eleventh command is probably gonna be x, or ls, or dir, whatever it is. But what's actually more interesting — and this is where we get into the statistics — is that you can start to see patterns in an op. Once we get initial access, there's a grouping of commands that we run initially, and then as the op progresses, like once we're privileged, there's a different set of commands. And even though we know it

and we do it, it's been interesting to see the subtle hints of transition throughout an op. These are just a few ops that I pulled, and it's just a simple graph, but effectively it's the number of commands, and each color represents a different command, so we get a pretty colorful distribution. The orange one, I think, is our PowerShell — we have an in-memory PowerShell that we run — and the other big one is getuid. And here are some basic metrics from our ops, for the last rolling 12 months. Obviously we don't keep data around, so some of it is deleted or just isn't useful anymore. But the six of us

ran almost 60,000 commands among us across 30 ops, and we averaged almost 2,000 commands per op. Aside from the ML stuff, looking at this data from an operations standpoint is extremely useful. How long do you think it takes to type a command — a few seconds, a second, half a second? Now multiply that by 2,000 and you get a reasonable estimate of the amount of time you're gonna spend on an op. Our project manager loves this — I haven't told him about it though, so don't tell him. We have 99 possible commands in our RAT, and we ran only 84 of them. So from a malware-dev standpoint or an OPSEC standpoint, we could probably

drop 10 commands and lose all that code, making the binary smaller. So not only does this kind of analysis help us shore up our ops, it helps us find bugs in code faster, it helps us see where operators are having issues faster, and it helps us remove unnecessary things that just get built into products. If you think of your malware as just another piece of software, this kind of analysis can be extremely useful. And then you also get stats like: most commands — obviously an outlier, and I don't know which client that was. Or fewest commands, where somebody started it up, exited really quickly, and nothing useful was actually done. Or the longest command, where someone uploaded

a DLL and that whole DLL or shellcode just gets put in their log. So it's not necessarily pretty, but I think going through this process helps you take that next step towards ML: it helps you think about your data and about structuring it. I think it's an excellent first step toward some sort of offensive ML — I don't think enough red teams spend enough time inside their data. So these are our top commands. If you look at the total here, these 15 commands are almost 90% of everything we run — but I already said we have 99 commands. So it's like, what? We could probably cut 50% of our commands and not

notice a difference in ops. And that potentially has a significant impact on our detection rate, on the number of APIs, on the code that gets loaded into memory. So it's extremely beneficial. PowerShell is obviously a bit of an interesting one, because it's effectively just a wrapper for other commands. And of the PowerShell commands, I think 85% was Get-DomainUser, Get-NetLocalGroup, and Get-DomainComputer, plus variations of those, and then some others. It has actually been pretty enlightening to go through this.
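This kind of rollup doesn't need any ML at all. A hedged sketch of how you might compute it with pandas — assuming the logs are already parsed into per-command rows; the column names here are made up:

    # Sketch only: basic op statistics once logs are parsed into tabular rows.
    import pandas as pd

    df = pd.read_csv("parsed_commands.csv")     # assumed columns: op, operator, command

    print(df.groupby("op").size().describe())   # commands per op: mean, min, max
    top = df["command"].value_counts()
    print(top.head(15))                         # the top-15 commands
    print(top.head(15).sum() / len(df))         # share of all activity they cover
    print(df["command"].nunique(), "distinct commands ever used")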

But getting to the data is extremely painful. Parsing is really difficult. I've written a database, but it's not in our product, so we have logs, and I have to go through unstructured text. Our regex is like 80 characters and took me a ridiculously long time. To get to the data you have to pull out the commands, and you have to somehow deal with the arguments, which I have not figured out yet — you'll see in the demo. The arguments are just impossible: there are like six ways you can run Get-DomainUser. You could write regexes for everything, but if you have 99 commands that's a significant amount of work, so I have to think about whether it's really worth the effort.
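Purely as an illustration of that kind of log regex — not the team's actual 80-character pattern, and the log format here is invented — something like this pulls a timestamp, command, and arguments out of a session-log line:

    # Hypothetical sketch: extract (timestamp, command, args) from an invented
    # log format; real session logs and the real regex will differ.
    import re

    LINE = re.compile(r"^\[(?P<ts>[\d\-: ]+)\]\s+\S+>\s+(?P<cmd>\S+)\s*(?P<args>.*)$")

    sample = "[2019-06-03 14:02:11] ops1> pshell Get-DomainUser -Identity bob"
    m = LINE.match(sample)
    if m:
        print(m.group("ts"), m.group("cmd"), m.group("args"))
    # The arguments are the hard part: six ways of running the same command
    # all need to normalize to one token before a sequence model sees them.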

Or maybe you can tell me: does NLP have an answer to this? No? It's regex, right? That's the problem. But once we have it all, we can start to model the user and build out particular profiles, or whatever it is. We talked a little bit about sequencing: based on the previous sequence of commands, we're gonna get some sort of distribution, some probability of the next command, and then we're gonna run that one. So, recurrent neural networks, and then long short-term memory — this is the one I'm not gonna explain, so you can go Google these. What I do know is that there's an input, there's a sequence, and then there's some distribution, some probability, that gets

put out on the other side that gives me the next command, based on the previous sequence and the analysis of prior runs. So, demo.

The project I used is actually just textgenrnn. It has an interactive mode, it seems pretty clean, it's decent. Here we're loading up our GPUs, and then it has an interactive mode, so it gives us — let's see.

So here you can see it just goes one by one. If you think of getuid — getuid gets the user on the box, and that's usually one of the first commands we run if we don't already know it; if it's not that, it's ps. So you can see,

it gives us options. These are the ten commands most likely to run next. I think we run into issues because we only have 99 commands, so there are only 99 possibilities, and if we don't train enough, for example, you'll get loops: pshell, getuid, pshell, getuid, pshell, getuid, and it just doesn't go anywhere else. So it's been somewhat painful to deal with — actually, not as painful as the arguments. But it's a start.
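For reference, textgenrnn's basic workflow looks roughly like this — the file of one-command-per-line logs and the epoch count are assumptions, and exact keyword arguments may vary by version:

    # Rough sketch of the textgenrnn workflow used in the demo.
    from textgenrnn import textgenrnn

    textgen = textgenrnn()
    textgen.train_from_file("commands.txt", num_epochs=20)   # one command per line

    # Interactive mode: pick from the top-N suggested next outputs, as in the demo.
    textgen.generate(interactive=True, top_n=10)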

I wish this would go away. Do you

guys have any favorite projects, like for RNNs or LSTMs?

Yeah, that was the challenge. Right. So you could do that, but we have our own shell — we use the cmd2 library — so my next stop is actually to look at its parser and see how it does it. We have some modifications there; I just need to dig deeper into it. I was hoping there'd be a more elegant solution than regex, but that's where I'm at currently. This is kind of where we started at a high: it was very easy to implement some binary classification, but it very quickly devolves into this very challenging, very architected solution where you need to

have multiple data sets and regexes that you're keeping offline. And suddenly, for someone like me who is interested in ML but is primarily an operator — if I think about my teammates, they just don't have the patience for it. But it's almost like, if they don't learn the basics of ML, in three years they won't be able to op on a network — depending on the adoption rate of ML in networks. If you're a red teamer you're obviously here because you're interested in it, but hopefully you get the idea: you need to start looking at it. You need to start thinking about your commands. You need to start thinking about the processes you inject into. You

need to start thinking about the sequences of commands, and about the sequences of APIs that your commands call. Inject arbitrary API calls — in between, say, VirtualAlloc, CreateRemoteThread, or VirtualLock, just put a ton of different APIs. Microsoft has put in these monotonic models that supposedly — I'm not clear on the details, but what I've heard is that they prevent you from getting cleaner scores just by adding good features. So then we're back to where we started, where we're just removing malicious features or trying to doppelgang something like explorer.exe. It almost feels like AV is starting over — like traditional AV is standing still and we're just in this run-up phase until people have

figured ML out on the defensive side. But my experience has been that it's not as easy as it seems. There are a lot of nuances to the data — you already know this — a lot of nuances that at scale become almost impossible. I do not envy anybody who's trying to solve that challenge. And actually, I wish I had saved that tweet: there's a guy, I forget his name, who I think worked at some AV vendor, and he said, I've been doing this for six years and the best thing we've come up with is network alerts. If that tells me anything, it's that that's the data you have. Companies

paid vendors to take in all their logs, and the vendors just threw it all in a pile. There's no structure to it, and I think now they're realizing their mistake and having to go back through and structure it properly — but I still don't know how they get to a consistent product when all of their clients have different levels of logging. Do you have models for different clients? Do you aggregate everything? What do you do? Probably the only company who gets away with it is Microsoft, and that's just because of their scale and because they're on the OS. But if you think about the AMSI stuff, where AV vendors can have hooks into AMSI, that provides you limited visibility — like the

Cylance research. Did you see the one where they had that Cylance "bypass"? I put it in air quotes because you're still not going to get that past AV.

Cylance did that as a way to sort of backdoor their own model, because even legitimate EDRs now have a really, really difficult time deciding what is legitimate execution and what isn't. If you're living in Explorer, you can basically do what you want — explorer.exe is an extremely noisy process. But if you take that away and put it all through an ML model, it becomes — not impossible, these are super smart people — but an additional challenge that you're going to have to solve. And I kind of feel like we're going to be in the same place five years from now. Okay. So

the challenge is obviously arguments and then assuming a human expert. Thank you.

Sorry — some panic from the back there. So, assuming a human expert: when I'm on an op and something isn't working, I have to troubleshoot it, and that's going to increase my command count and throw those sequences off. So implicitly I'm going to have bad data in my data set. There could be a sequence I get into where the model is telling me to troubleshoot something, but it doesn't know any better. So it's kind of difficult. You also get just dumb commands, fat fingers, misspellings — all the classic NLP stuff. But as I was doing my research, I found some actually really, really old papers from the 80s and the 90s. As Windows researchers, we really

love subsystems. Actually, one of my prized possessions is a COM book from the Microsoft library that I bought for $3 on Amazon — so old that Microsoft was getting rid of it from their library. Anyway, there's a paper on predicting Unix command lines, and in it they basically say, it turns out this is really difficult, and there's no real conclusion to it. So it's obviously not a new problem, and I felt better about myself when I read that paper: oh, cool, they couldn't figure it out either. OK, so now we're getting to the last case study, and it's the reinforcement learning piece. This is like the golden goose.

This is what everyone wants right now — auto-hack. All these adversary simulation products that bring in the MITRE ATT&CK framework are trying to build some sort of intelligence into it, and they're having a really difficult time. It's actually really interesting to see how different products are dealing with it — I think the MITRE guys use pre-conditions and post-conditions and things like that. And we're not different