
that I showed you earlier and so cluster 1 and cluster 7 again just two separate groups of users and they these these two clusters had the highest rates of adoption of two-factor authentication across all of those clusters so there's something and then these clusters four and five were the clusters that had the lowest across-the-board adoption of two-factor and cluster four and five let me just go back for a second so they're not the largest but they're somewhat similarly sized so that's kind of interesting as well this this is a really I'm so sorry for the quality of this this is really embarrassing but there is also an extremely extremely extremely weak but positive correlation between whether
someone pays us monthly so has a nonzero average monthly payment and whether they have some type of two-factor you actually can't even see it on the projector because it's so slight but it's there so I feel like I'm on the right track okay but ultimately these results are still kind of imbalanced so what am I gonna do and I so to be clear from this randomly sampled data these these were 10,000 user accounts it shook out to be about I think 13,000 like accounts that could log into these accounts there's a chance that the actual target class wasn't even represented in this data so there's a couple of ways to deal with this with the imbalance class issue so one is
oversampling so this would be taking all the normal users and excuse me the not so normal users and basically creating copies of these not so normal like these bad users and inserting the copies into a duplicate data set so I would have the same or almost the same number of observations of the class that I'm really curious about and also just the normal user base I could also do something where I undersample so I could just cut this large normal user set and make it equivalent to the smaller more interesting target set that I want to look into I will say that I have a little bit of a hunch that this is
probably going to work better that over sampling will probably work better just because I think giving my giving whatever model a chance to see this this much will probably help out a lot as far as giving extra visibility into kind of what these accounts look like and being able to recognize them but I'm probably gonna try this too just because this whole thing is you know in the spirit of just let's randomly try stuff so I'll probably do that and then I kind of want to I've touched on the fact that this isn't finished this is still like where I stopped just there is essentially how far I've gotten in this research so I want to talk about a couple of questions
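The two rebalancing options described above can be sketched in a few lines. This is only a minimal illustration using the Python standard library, not the speaker's actual code: the function names, the `risky` field, and the toy account records are all assumptions made for the example, and it assumes the minority class is non-empty.

```python
import random


def oversample(majority, minority, seed=0):
    """Pad the minority class with random duplicates until it matches the majority."""
    rng = random.Random(seed)
    shortfall = max(0, len(majority) - len(minority))
    copies = [rng.choice(minority) for _ in range(shortfall)]
    return list(majority), list(minority) + copies


def undersample(majority, minority, seed=0):
    """Randomly keep only as many majority rows as there are minority rows."""
    rng = random.Random(seed)
    keep = min(len(majority), len(minority))
    return rng.sample(list(majority), keep), list(minority)


# Hypothetical toy data: 1000 normal accounts and 12 accounts of interest.
normal = [{"account": i, "risky": False} for i in range(1000)]
risky = [{"account": 1000 + i, "risky": True} for i in range(12)]

over_normal, over_risky = oversample(normal, risky)
under_normal, under_risky = undersample(normal, risky)
print(len(over_normal), len(over_risky))    # 1000 1000
print(len(under_normal), len(under_risky))  # 12 12
```

Oversampling keeps every normal account and repeats the rare ones, which matches the hunch above that the model benefits from seeing the rare class more often; undersampling throws most of the normal accounts away.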
I presented this to the data science folks at my organization a week or so ago I got some really good questions and I want to talk about those so one of the big big questions that I've gotten is how are you gonna evaluate the results of this particularly given that you're using unsupervised learning this is a really hard problem just like in general across any domain across any problem where you're using unsupervised or excuse me using clustering clustering isn't supervised but a lot of it really depends on why you're using unsupervised clustering techniques to begin with I keep saying unsupervised clustering clustering is unsupervised there's not really a type that's supervised so it really depends
on why you're using this method in the first place and unfortunately most techniques that allow you to evaluate like formally evaluate the success of a clustering model require access to some sort of you know ground truth some sort of labeled data and so that's kind of tough I could generate some labeled data myself because going back to the whole idea of domain expertise I think I have a good idea of what constitutes a fairly secure versus a not so secure account but again that kind of almost defeats the purpose of feeding it to a clustering model to begin with so I'm not really sure I think this is probably gonna take some manual review I think I'll probably
have to go through and just look to see you know do the clusters make sense which doesn't really sound super cool and like advanced but that's the reality of what I'm probably gonna do and then another common question I get is you know what other models have you considered so all the models that I talked about here are centroid based clustering and so I'm probably gonna try DBSCAN or some sort of density based clustering model at some point again I'm probably gonna throw a bunch of things at the wall and see what sticks just because I'm really curious how this will shake out and what I'd really like to do is eventually long term put this data
into a supervised model use this to train a supervised classifier and you know I'll probably do something like look into a random forest or some other type of ensemble model so and I think this is really the crux of this whole thing is what is the point of this and how does this actually fit into the defensive security program at MailChimp wait how does this fit into what you already have so ultimately my goal for this is to at the simplest layer if you are a high risk account if we deem you high risk or you're a security risk maybe we don't let you fail that login so many times or maybe there are certain
things around the perimeter we don't let you do quite as much as we would an account that we deem not so much of a security risk and that's super broad I realize that but part of it's gonna be figuring out what that looks like but really what I want to do is provide a heuristic for my team and for our anti-abuse team anyone else who touches this type of problem to focus monitoring efforts on the accounts that really need it because we have millions and millions of users and it's impossible to catch everything so if I can help carve out these groups so one of them would be like these kind of the accounts that I
showed on the left side of the spectrum earlier if I can help provide attention on those and give a spotlight to those then I feel like I'm doing something to help my team and that's really what I want to do and no you know this isn't perfect at all this is just one tiny piece of the puzzle that you know security is a defensive security requires a layered approach and this is just one more layer on top of other things that we are doing that we will be doing and that I hope will help both my team and other teams get get to the important stuff quicker [Applause] and I think we definitely have time for
questions so I am up for so you looked at 2FA to see if there is anything special about it did you look at the other variables as well to see if there's special things about them or not yet yes so I didn't I can't really share all of it but yes there are definitely other pieces as well and so it looks like you're using Jupyter notebooks and that's been something that I think several of the speakers today have actually kind of demonstrated so could you tell us a little bit about how Jupyter notebooks have facilitated doing this kind of analysis for you yeah for sure in fact let me go back to one of
the screenshots just as an example so Jupyter notebooks if you've attended any of the talks in here today you've probably seen some examples of this but essentially it's an interactive Python environment that allows you to break up your code into these neat little cells and it provides kind of like a living document you can reuse it so I can plug in any type of other data set like I could just change what actually what data this is and still run this analysis again so it makes it really easy to continue to like hand analyses off it also has the nice benefit of like I can create notebooks that have these types of
visuals and then along with markdown and other types of text annotation so that I can pass this off to my boss and like I can pass this off essentially to an executive with this kind of content in it so yeah interactive environment but also one thing that I really like about it is that it makes it easy to share that kind of stuff so did looking at either the medians or the prototypes or the means when you did your clustering give any insight as to what the clusters actually meant yes so unfortunately the way that the way they kind of took shape was based on account attributes that could be found really evenly spread
across different different people which doesn't I realize that doesn't make sense I'm not explaining that well yes but it wasn't super helpful for discerning why an account would be in one bucket versus another if that makes sense
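One simple way to ask what a cluster means, in the spirit of the question above and the per-cluster two-factor comparison from earlier in the talk, is to compute the adoption rate of a single attribute inside each cluster. The sketch below is illustrative only: the function name, the cluster labels, and the flags are made-up stand-ins, not the speaker's data or code.

```python
from collections import defaultdict


def adoption_rate_by_cluster(labels, has_2fa):
    """Return {cluster label: fraction of accounts in it with two-factor enabled}."""
    totals = defaultdict(int)
    enabled = defaultdict(int)
    for cluster, flag in zip(labels, has_2fa):
        totals[cluster] += 1
        enabled[cluster] += int(flag)
    return {cluster: enabled[cluster] / totals[cluster] for cluster in totals}


# Hypothetical cluster assignments and 2FA flags for six accounts.
labels = [1, 1, 4, 4, 7, 7]
has_2fa = [True, True, False, False, True, False]
print(adoption_rate_by_cluster(labels, has_2fa))  # {1: 1.0, 4: 0.0, 7: 0.5}
```

Running the same summary over several attributes gives a quick table to support the manual review of whether the clusters make sense.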
can you talk about security education for the users and I mean is there an outcome of the research or retraining users are you communicating with them changing the variables yeah so this is actually something I've been discussing with some folks very very recently we historically have been really careful about reaching out to users or suggesting proactive things partly because you know let's say we decide we want to email everyone who doesn't have two-factor authentication on to be like hey lock your stuff up like what are you doing that can also kind of have a different interpretation people can think they will see that and go oh no MailChimp sent a security thing
we're really concerned this looks like it could they might have been hacked oh no so there's a really there's a fine line and we're still trying to decide where that line is we're talking about it though and I am fighting very much for us to be able to somehow talk to these users all of the days and say please do yourself a favor and turn on these types of security features that we offer so yeah it's a really hard problem you know there's the whole like policy versus what we want so yeah first of all great talk I love seeing machine learning applied to like data science problems in the security world and it
kind of brought up for me like this question of like what is the dependent variable in this model like it seems like you have a lot of independent variables about users and not a ton that the model would even be able to pick up on in terms of outcomes and so maybe just to build on the previous question like do you have metadata about the kinds of outcomes you're interested in or is that just not available or like what are you hoping to see at the end of this model aside from identifying user accounts that are nice yes so ideally what I would do and I didn't really get into this in the talk because there was a lot to cover
we have a known data set of accounts that have been attacked and taken over I know what these look like and so those are things that I want to go back and be able to eventually feed to some of the future models and say can you discern whether these are one interesting for I'm sorry I'm like going on about this but one interesting problem that we have is when an account is attacked or taken over or abused in some way almost immediately we have a team that reaches out to that user to say hey something happened you need to change your password you need to turn on two-factor you need to set up these notifications
and so it's it's great for the user but it kind of is unfortunate for this research because what happens is we don't get a picture of what that attacked user looks like at the time of attack so that's actually an initiative I'm working on with some of our data engineers right now to get those account snapshots so that we'll be able to have that source of okay this is what they look like so I hope that answers your question so a follow-on to that no you're good you're good so it seems as though you would have two types of malicious users so the first type is an account that wasn't malicious where their account got hacked taken over and
now is being used for nefarious activity and then you'd have the other type of malicious user account that was set up with the premeditated idea to be to use that account maliciously did you find anything from the clustering that would help identify or correlate with the first those premeditated ones I'd call yes so this is actually another thing that I am working on kind of in conjunction with this this is this is completely different so the way those accounts look those accounts that are created to be bad they're just garbage they look different they there are specific things about them that unfortunately I can't go into here but there are definitely things about them that make
them they can be very easy to detect and so I don't wanna say that's an easier problem but in some ways it can be because particularly if it's something automated if we see bot signups or things like that we have some methods in place to deal with that but this is a little bit harder because or at least it's harder and it's hard in a different way and we haven't really spent a lot of time on this particular piece of it which is why I'm focusing on that that's also another really a really big vector though for sure yeah
so it sounds like this is something that you kind of you had the idea and then you kind of have learned it and evolved your knowledge of it as you've researched it what kind of tips would you have for other people that are in the organization and might want to try to apply that kind of to learn how to apply data science to their own internal problems starting from kind of the same position you're starting at and trying to kind of follow in your footsteps yeah so that's a really good question so one thing I will say is that I think I've been very lucky because MailChimp encourages a culture of experimentation and so if I do all of this and it turns
out to be you know we can't really use this in production we can't really do anything with this that's still okay because we've learned something so I will say I will put that out as like as a caveat but I would say you know think about your user base or think about you know what questions keep you up at night and is there a way to somehow start thinking analytically about this and I realize this a lot easier said than done but thinking through the lens of an analyst or a data scientist what types of data do you have available what can you query easily what can you get through get your hands on data yet even
if it's I mean if it's in a Google sheet if it's in a spreadsheet doesn't matter start getting familiar with it because the only reason I knew to go to some of these things was because I lived in an analyst role for so long I knew what what the user account landscape looked like so I would say get really familiar with that and then just start playing with things like start getting your hands dirty like don't be afraid like I got so much of this failed but I learned a lot from that so
I'm sorry but I don't know if this is covered earlier before I was able to get in but I did you have did you you're doing this work did you discover that there was a gap in data so you had to go back and add new data sources in order to be able to get it the correct sort of information yes sorry I wasn't sure if you had a second part to your question um yes absolutely so one of those situations was in going through the different models realizing that I needed not just account security related data I needed just normal what does your account look like how much do you pay us how much how
many users do you have like other types of just account attributes but there's one piece so I touched on this just a minute or two ago we don't have data around what an account looks like at the time of attack and that's that so those values aren't they're not time-stamped and stored essentially when the user turns on two-factor that gets updated but we don't necessarily like timestamp that and say this is the history this is the change log of when a user enabled these things and so that's something that I'm working on like right now with our data team to get that to get the name because that's been huge that's actually created a whole lot of
extra work for this because that's something that I don't have that source of truth so does that answer your question okay
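The missing piece described in this answer is a timestamped change log from which an account's settings at the time of an attack can be rebuilt. Below is a minimal sketch of that idea, not a description of MailChimp's systems: the `SettingChange` record, its field names, and the sample dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SettingChange:
    """One row of a hypothetical append-only change log."""
    account_id: str
    setting: str
    value: bool
    changed_at: datetime


def snapshot_at(changes, account_id, when):
    """Replay the log in time order and return the account's settings as of `when`."""
    state = {}
    for change in sorted(changes, key=lambda c: c.changed_at):
        if change.account_id == account_id and change.changed_at <= when:
            state[change.setting] = change.value
    return state


log = [
    SettingChange("acct-1", "two_factor", False, datetime(2018, 1, 1)),
    # Two-factor is switched on only after the (hypothetical) attack on March 1.
    SettingChange("acct-1", "two_factor", True, datetime(2018, 3, 2)),
]
print(snapshot_at(log, "acct-1", datetime(2018, 3, 1)))  # {'two_factor': False}
print(snapshot_at(log, "acct-1", datetime(2018, 3, 3)))  # {'two_factor': True}
```

With only the latest value stored, the account would look protected in both cases, which is the problem the answer describes.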
[Applause] thank you all so much I appreciate it
thank you very much
yeah yeah
interesting
so when you say small how small do you a couple thousand or is there any way you can just so when I talked about like the oversampling like this kind of situation is there any way maybe you could create copies of the classes that you're interested in and maybe do something like that that's kind of hacky but that's the first thing that comes to my mind I might try something like so yeah yes yes yes we do
but there are definitely attributes to the account so like once I kind of zoomed out instead of just as yet security related features and zoomed out I will say they're definitely account attributes that end up having I think a little more bearing on that then through your example time zone like I can't go into it a whole lot more than that and I'm sorry I wish I could I don't know that's a dangerous question
sorry say that one more time has so they don't recall off the top of my head
yeah so I feel like that's definitely where we are like I've kind of been able to say hey this is important I think we should try this and I've got I've been loud enough about it that I think they have been like well we either need to take her seriously or placate her so she'll shut up so that's kind of where that's gone I think a lot of it is dependent on organizations like it so much of this you know and whether this ever sees the light of production I don't know this is still literally just you know in a Jupyter notebook in git so yeah so I've seen a ton of people do
it internally and that's kind of why I was interested in looking at it specifically user-facing just that was such a big a big problem we have what's up yeah yeah exactly no I totally get it we have to we have to be careful because people will thank you yeah okay
hi my name is Gabriel Bassett my voice is my passport verified
[Applause]
cool cool can everybody hear me awesome hi everyone welcome to my talk reducing inactionable alerts via a policy layer my name is John Seymour aka delta0 and I'm a lead data scientist at Salesforce where I work on the detection and response team performing machine learning on security logs to alert to new attacks to improve our existing alerts and rules and to find and make new contextual data for use in investigations my goal here today is to inspire you to be creative in where you apply data science and machine learning techniques a major impact can be had with very small amounts of effort if you come out of here with a nagging feeling that machine learning would help you
with somewhere it's not normally applied then I'd call this talk a success right so like a lot of good presentations let's start with definitions oftentimes humans analyzing model generated alerts will throw an alert away immediately right and we found as we've deployed our models there are two main reasons we've seen for how this happens first when there are issues with the data pipeline so these are things like when necessary logs are missing when parsers fail when joins between host and network artifacts behave unexpectedly when third-party information is bad or corrupted when there are deployment inconsistencies throughout the fleet when added contextual information like host names is wrong when added contextual information is stale where it used to be right but now
it's wrong when added contextual information is right in an unexpected way this list goes on and on and on right contrast this to like obvious false positives where we mean the model is not able to capture the complexity of the instance where the activity can easily be determined to be a low priority or non-existent threat to the business such as a model alerting on beaconing to a company-owned resource generally we've seen these handled after the fact a whitelist is added which says simply even if the rule or the model says this is bad don't alert on it you can think of whitelists as a simplistic example of a policy layer which addresses the two causes for
inactionable alerts in this talk we'll demonstrate how even simple modeling improves upon whitelists and further we'll argue how modeling whitelists separately from modeling suspicious events is actually a natural approach to the problem so here are some reasonable examples for whitelisting we've seen in the past right for a large number of alerts generated we don't actually care if a connection is completely internal to the network so for these it might actually make sense to whitelist anything where both the source and the destination are internal or we might only care about a connection attempt that's successful right um so we might think filtering alerts where the connection was ultimately unsuccessful like maybe the firewall blocked the connection we might
think that's a good idea or another widespread use for whitelists is in filtering if the domain is obviously benign so like take the top you know popular domains and just shove that in as a whitelist and say don't alert on any of these things right and these are obviously common and widespread ideas but whitelists create some unintended side effects down the road they're extremely rigid if a whitelisting rule matches an event then the event is just completely discarded even if the activity is extremely suspicious for other reasons um they're also challenging to maintain right if you've worked with whitelists you've probably noticed issues with updating them you know since each problem set tends to
have different whitelisting requirements for example take an extremely loud attacker moving laterally through the network you can't whitelist the internal connections for that or take failed connections they could be NXDOMAINs today but future command and control instances where the malware just hasn't activated yet a large number of these might also you know indicate an infected host or something like that and benign domains a major way to exfiltrate information is through standard services that are likely whitelisted not even considering the fact that domains are static and out of the most popular domains not all of them are always benign so I bet that most of you in the room are thinking okay we'll just include those
as features in our models and that's definitely a reasonable position to take but here are some reasons why you might actually want a separate policy layer at least at first let's start easy whitelists are already accepted by human analysts and stating this alert has whitelistable characteristics but we think it's suspicious even given those is actually still generally well received and you know useful information to the analysts however model centric reasons exist for such a separation too Google's rules of machine learning advocate for separating spam filtering and quality ranking in a policy layer and that quality ranking should focus on ranking content that's actually posted in good faith the main idea behind that rule is
that spammers the adversaries in that context will attempt to emulate high quality posts so features used to indicate you know high quality posts today might actually be indicative of spam tomorrow that concept of adversarial drift also applies here and the independence of the two types of models actually really helps with tuning training frequency we all know that adversaries adapt they're likely to adapt faster than what makes a model inaccurate and an alert inactionable also separation gives us generalizability which allows for centralization reducing code duplication most of these features will be present in a large number of the rules and models that you use so even if most models exclude some of these you know heuristics you can still have one
whitelist model which applies to a large number of rules or models another is good data is actually the limiting factor for model quality in the intrusion detection space a lot of the time separating actually reduces the problem space so that models don't have to learn both what's a good you know alert what's likely to be thrown away and what's actually suspicious at the same time and that actually means you can label you know data for different tasks and be a lot more effective with your labeling and then finally some of our team's models have multiple consumers with contradicting preferences these might actually be unsatisfiable if you try to push them into the data detection model
so how do we actually enforce the heuristics commonly used to filter false positives in a better way well here's a simple machine learning based approach that we've deployed at scale for doing so so we actually combine you know these heuristics using a function which penalizes alerts where any such issues are found without completely filtering the behavior concretely if we let X be a list of binary heuristics which are commonly used to filter you know false positives then let your policy score be some number lambda between zero and one raised to the count of how many of these heuristics are true for a given event for example if lambda is 0.5 and you have two issues surrounding the data
then your policy score would be 0.25 or if no issues are detected your policy score would be 1 and we just reweight the alert by this value right so this is a really really simple approach and it's also simple to integrate with the models actually integrating this score with rules and models is very straightforward you just multiply the two different scores together um it also has a lot of other benefits though the model can be reused for many use cases so even in instances where the heuristics conflict with the actual model generated such as that exfil detection model um using good domains you can actually just reweight the final alerting threshold for the combined score another benefit
is the model's completely unsupervised requiring you know zero effort to train and the model requires very little data science background to actually understand right but perhaps most importantly it allows you to aggregate all of your different heuristics in one place which makes it much much easier to maintain but we know there's no free lunch when it comes to these sorts of things so what do we actually lose when we're moving to this sort of model for whitelisting well to start the main change is that we're now allowing a small number of events through that we wouldn't previously so there is definitely by definition going to be a nonzero number of false positives we've mitigated this quite a bit through
thresholding but obviously some of the new alerts that you get are going to be true positives and some are going to be false positives um a piece of low-hanging fruit that you might think of is that the model is actually very very simplistic right now right you could for example train it on historical data like previous case adjudications that's definitely something we're looking into but probably the primary drawback here is how the model handles repeated alerts with whitelisting you can completely negate an alert from being generated but you can't do that here you can reduce them by adding sort of features based on prior case adjudications but that makes the model complex um you could also eliminate them
by you know adding additional hopefully temporary whitelists but that brings us back to the initial whitelisting problem and it reduces a lot of the impact of this approach but here are the main issues that we're trying to improve upon first again the most obvious being supervised learning which would allow us to more finely tune that you know policy layer after that we plan to try out different stacking methods so using like the output of the policy model as a feature when training the suspiciousness model and trying different configurations for that and then finally we're trying to more formally incorporate the consumer preferences like we stated earlier such
as different outputs for threat intel versus our SOC or for the different responders for risks vulnerabilities and policy violations so that's actually all I had does anybody have any questions
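The policy score described in the talk (lambda raised to the number of heuristics that are true, multiplied into the model's score) can be sketched as follows. This is a minimal illustration of the formula as stated, not the team's deployed code: the function names and the heuristic names in the example are assumptions.

```python
def policy_score(heuristics, lam=0.5):
    """Return lam ** (number of heuristics that are true); 1.0 when none fire."""
    if not 0.0 < lam < 1.0:
        raise ValueError("lam is expected to lie strictly between 0 and 1")
    return lam ** sum(1 for fired in heuristics.values() if fired)


def reweight(model_score, heuristics, lam=0.5):
    """Combine a rule or model score with the policy score by multiplication."""
    return model_score * policy_score(heuristics, lam)


def fired_heuristics(heuristics):
    """List which heuristics were true, so an analyst can see why a score dropped."""
    return sorted(name for name, fired in heuristics.items() if fired)


# Hypothetical heuristic names for one event.
event = {
    "internal_only_connection": True,
    "hostname_mapping_known_faulty": True,
    "domain_in_popular_list": False,
}
print(policy_score(event))      # 0.25
print(policy_score({}))         # 1.0
print(fired_heuristics(event))  # ['hostname_mapping_known_faulty', 'internal_only_connection']
```

Because the penalty only scales the score, an event that matches several heuristics can still alert if the model's own score is high enough to clear the final threshold, which is the difference from a hard whitelist.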
hi thank you a quick question for you one of the big problems that I've experienced in the past with trying to kind of work with ML systems in general is the problem of introspectability and you know coming from an ops background and having been the person getting woken up at like 3:00 in the morning nothing is worse than having something like sometimes tell you one thing and sometimes tell you another thing and like you can't figure out what the hell's going on how do you kind of how have you dealt with while working with your customers like dealing with that kind of problem of not necessarily being able to introspect the system in an easy
to understand way yeah that's definitely a great question um so for the actual whitelisting technique here it's actually very introspectable the values of the different variables you know how many are set to one you know etc you can just return those to the ops people adding in something like a supervised method or something like that yeah that would definitely obfuscate for example and would harm the introspectability right so for this um like we have issues for example um with our hostname mapping for example IP hostname mapping and for those we actually encode that into our policy layer by saying okay like this is known to be faulty you know IP hostname mapper right um and
we can surface those to whoever is investigating the alert
oh yes
hi so to someone who is completely unaccustomed to ml in general how would you what strategy would you say is best for determining a proper lambda and how would you really tune that appropriately okay would probably do it the same way we actually did it which is start at just lambda equal to 0.5 um so that means like you know for every extra alert that you generate then you have basically the score right the output score um if you wanted to tune it further you'd probably need actually to get data and collect data about how how many alerts are good and how many are bad coming out of your system and that's when you want to actually move to
something like a supervised approach anyways to sort of tune sort of that lambda also have say a different weight for every single feature if you did that technique um whereas here we only have a single lambda for efficiency sort of related to that did you as maybe a middle ground look at talking to the analysts and seeing what they thought about the different alert types and assigning different weights based on their feedback yes so we actually we did so we did definitely talk to the analyst and we did considered assigning different weights to each you know different heuristic that's being true um the only issue with assigning weights here is basically that sort of explodes
the dimensionality of the problem right like you have a different weight for every feature right whereas right now we only have one sort of parameter we're tweaking um and so we decided that basically if we did sort of attempt that route we would wait until we actually had a collection of data that we could use to sort of label things as being you know well generate it or not and then just go supervised to approach anyways
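For contrast, the per-heuristic weighting that the speaker says they considered but did not adopt could look roughly like the sketch below. It is purely hypothetical: the function name, the weights, and the heuristic names are invented for illustration, and the default of 0.5 simply mirrors the single lambda from the talk.

```python
def weighted_policy_score(fired, weights, default=0.5):
    """Multiply together one penalty per true heuristic instead of a shared lambda."""
    score = 1.0
    for name, is_true in fired.items():
        if is_true:
            score *= weights.get(name, default)
    return score


# Hypothetical heuristics and analyst-suggested weights.
fired = {"domain_in_whitelist": True, "hostname_missing": True, "parent_log_missing": False}
weights = {"domain_in_whitelist": 0.9, "hostname_missing": 0.5}

print(weighted_policy_score(fired, weights))  # 0.45
print(weighted_policy_score(fired, {}))       # 0.25, same as a single lambda of 0.5
```

Every entry in `weights` is one more parameter to tune, which is the dimensionality concern raised in the answer; with all weights equal the function collapses back to the single-lambda score.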
so just to make sure I'm understanding right each feature is a or each element and a whitelist is going to be a feature in the policy layer sort of so for example if we have like a rule that says it's a it's a domain that's in a whitelist or we have a rule that says okay the IP the hostname is missing from this particular log or we're missing the parent log or things like that those are all rules that we would use as separate features in that so do you have any thoughts on how to handle the model expanding as you come up with new heuristics you know every time yes so that's also comes back to the the sort
of supervised training approach. If you start with something simple like this, really simple, then you can get a lot of those ideas out and know a lot of the different rules that you have. Even so, I guess you're probably going after the idea that even if you deploy a model today, then later down the road you're going to discover some new inactionable rule or something. Okay, well, I was going to say, we're kind of assuming here that the drift in terms of what makes an alert inactionable changes much, much more slowly than the suspiciousness score, so
adversaries adapt, but your system doesn't adapt that fast. Cool. [Applause]
And if you have any questions, we ask that you use the audience mic so that our YouTube audience can hear you. If you have a question, just raise your hand and I'll be sure to bring over the microphone for you.
All right, so, last talk of the day. That means I get to go till, what, 8, 9, 10? But whatever. Anyway, a quick overview of what I'm going to be talking about. So, like I said, my name's Rob Brandon. I've been working in the field for 20-plus years, largely in incident response and malware analysis, but I did my PhD with the University of Maryland in machine learning and deep learning applied to security problems. I currently work for the threat hunting
team with Booz Allen Hamilton Dark Labs, and semantic representations of machine code are really my jam. So, why even mess with this whole embedding thing? A lot of malware analysis and reverse engineering problems are basically similarity questions. As an incident responder and network defender, when somebody gives me a new binary and says, hey, we think this was used in an attack, it'll usually take me about 20 seconds to say, okay, yep, this is definitely malware. So the really interesting question to me is not so much "is this malware?" It's going to be: okay, how does this work? Is it ransomware? Is this some kind of remote access trojan? And similarly,
you've got the vulnerability discovery problem: I've got this new binary, are there any vulnerable code paths in here? So one of the things I'm going to really care about there is, does it import libraries from other things, does it have code that has vulnerabilities that we already know how to exploit? So once again it's a similarity problem. A lot of the tools we have out there for measuring code similarity, things like BinDiff and graph comparisons, really don't scale to an N number of binaries. Doing an n-by-n graph comparison is really not computationally feasible. And similarly, things like sdhash and ssdeep work for some things, but, you know,
they don't really encode any of the semantic similarity data that we want for doing these kinds of tasks. So this is where we move to embeddings. What embeddings are: these are inspired by the concept of word embeddings in natural language processing, where you take your information, whether it's words, paragraphs, or whatever, and you move it into some kind of high-dimensional space where similar things are located close to each other. So in this case the goal is to take a function out of an executable binary and somehow learn a high-dimensional representation of it, so that functions that do similar things are located right next to each other in the
space. And that allows us to do things like this: say we've got 5 million functions and we want to figure out which of the functions in our new binary are similar to those. We can use some kind of locality-based search rather than having to do a linear search through the entire database. Another interesting thing with this particular type of technique is that a lot of problems in the machine learning space aren't really so much machine learning as representation learning. The real challenge is to figure out how you can come up with a representation of your data that you can then throw a linear regression on top of to get the answer.
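The similarity lookup described above can be sketched roughly as follows. This is only a brute-force baseline under made-up assumptions: the function names, the three-dimensional vectors, and the helper names `cosine_similarity` and `nearest_functions` are all hypothetical, and a real system over millions of functions would swap the linear scan for a locality-based index.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def nearest_functions(query, corpus, k=3):
    """Return the k corpus entries most similar to the query embedding.

    This is a linear scan; it only illustrates what the search returns,
    not how a locality-based index would find it quickly.
    """
    scored = [(cosine_similarity(query, vec), name) for name, vec in corpus.items()]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:k]]

# Hypothetical embeddings of already-known functions (values are invented).
corpus = {
    "memcpy_like": [0.9, 0.1, 0.0],
    "strlen_like": [0.1, 0.9, 0.1],
    "crypto_like": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a function pulled from a new binary.
query = [0.8, 0.2, 0.1]
top = nearest_functions(query, corpus, k=2)
for name, score in top:
    print(f"{name}: {score:.3f}")
```

The point of the sketch is only that "is this function similar to something we know" becomes a nearest-neighbor query once every function is a vector.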
So when we look at function embeddings, there are a couple of broad ways that we can look at classifying them. One of the ways I've started to look at function embeddings is to look at how much feature engineering is required to build the function embedding. We have some methods, such as the one that I created when I was in grad school, which require no feature engineering; they just take the raw data to generate your embedding. There are a few others that have come out over the last year, that other people have come up with, that use some more sophisticated domain knowledge in
pre-processing the embeddings. And this is really a new area of research. Like I said, I finished my PhD in 2017 in this area, and since then, really just over the past year, there have been other people that have independently come up with the idea and come up with their own methods to build embeddings. So there really hasn't been any kind of comparison as to what kinds of things these embeddings are learning, or what method we should be using to create code embeddings. So these are just a few of the methodologies that have come out over the
past year. The generative RNN is a method that I came up with, and I presented this at the DEF CON AI Village last year. This particular method works off of just raw bytes. You don't need to disassemble the function; you don't need to do any kind of analysis of the function. When I was doing my initial work, I did do some normalization to take, say, API calls and normalize those out from being a sequence of bytes into some kind of hash, but the initial experiments I was working with showed that there was no noticeable improvement, so I left that particular technique out of here. So basically, what these embeddings are:
you take something called a character RNN, which attempts to generate new information based off of the previous information seen. So, kind of like what's shown in the diagram over there, you take your function, you show it to the network one byte at a time, and after each byte the network tries to predict what the next byte it's going to see is. You give it feedback at the end of each training cycle, based off what it got right and got wrong, and at the end you have a network that has created its own internal representation of the particular machine code that it's been trained on. So what you do with these embeddings is
you take your trained network, you show it a sequence of bytes from the function, and at the end of that sequence you just take the state of the network, and that's your embedding. So the embedding is basically what information the network found useful in order to predict each byte in that function.
word2vec is a slightly more complicated embedding method that was presented by [name unclear; captioned as "ab and bruce"] at CAMLIS last year, and what he's doing here is extending the word2vec algorithm, which is largely used for natural language processing, to executable code. Now, my interpretation of this is largely based off of a couple of conversations with him and his slides, so there may be some subtle differences in the way I implemented it versus the way he did, but I think I've been largely faithful to the methodology. So basically, any issues with it are definitely my fault, not his. As you can see, this is an example of some assembly code that's been normalized into the modeling format that he's using. Registers are replaced, from EBX, EDX, RAX, whatever, into just a token indicating that this is a register. Same way with memory accesses and dereferences: those are just replaced with, you know, this is a
memory address, this is a dereference. One of the challenges with this type of model is that it requires disassembling the code. Many people working in the disassembly space will tell you that disassembly is still far from a solved problem, so, as I'll show later, having to do that does lead to potential issues when you're doing this kind of work. And this falls into what I would call a moderate level of feature engineering: you need to do some disassembly, maybe replace a couple of tokens, but you aren't doing any significant code analysis. You're not trying to compute the call graph or anything like that.
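The token replacement described above might look something like the following. This is a minimal sketch, not the presenter's or the original author's actual preprocessing: the token names `REG`, `MEM`, `DEREF`, and `IMM`, the register list, and the address threshold are all assumptions made for illustration, and the input is assumed to be Intel-syntax disassembly text.

```python
import re

# Hypothetical, deliberately incomplete register list for x86/x86-64.
REGISTERS = {
    "eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp",
    "rax", "rbx", "rcx", "rdx", "rsi", "rdi", "rbp", "rsp",
}

# Assumption: constants at or above this value are treated as addresses.
ADDRESS_THRESHOLD = 0x1000

def normalize_operand(operand):
    """Map one operand to a coarse token so the vocabulary stays small."""
    op = operand.strip().lower()
    if "[" in op:
        return "DEREF"          # any bracketed memory dereference
    if op in REGISTERS:
        return "REG"            # any general-purpose register
    if re.fullmatch(r"0x[0-9a-f]+", op):
        value = int(op, 16)
    elif re.fullmatch(r"[0-9]+", op):
        value = int(op, 10)
    else:
        return op               # leave anything unrecognized untouched
    return "MEM" if value >= ADDRESS_THRESHOLD else "IMM"

def normalize_instruction(line):
    """Turn 'mov eax, [ebx+8]' into 'mov REG DEREF'."""
    mnemonic, _, rest = line.strip().partition(" ")
    operands = [normalize_operand(o) for o in rest.split(",")] if rest.strip() else []
    return " ".join([mnemonic.lower()] + operands)

for raw in ["mov eax, [ebx+8]", "call 0x401000", "add esp, 8", "ret"]:
    print(f"{raw!r} -> {normalize_instruction(raw)!r}")
```

Each normalized instruction then becomes one "word" for the embedding model, which is why the quality of the disassembly matters so much for this family of methods.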
The next method I'm going to discuss is called SAFE, Self-Attentive Function Embeddings, created by Luca Massarelli and another group; I believe they were based out of Italy. This is a little more complicated model than the previous two, but in a lot of ways it's a more sophisticated version of word2vec, and you can get the modeling code for this particular methodology at their GitHub site. Basically, the way this model works is: in the previous model, word2vec, you basically do word2vec on each of the instructions, and then you take all the vectors for an entire function that were computed
for each of the instructions, then you just add them together, and basically the sum of all of them gives you your vector. The difference between that and SAFE: SAFE also takes each of the disassembled instructions. They have a slightly more complicated tokenization scheme, where they're looking at what memory range each memory access was in and doing some other slightly more complicated stuff, but it's largely a very similar thing. They're then also using word2vec to compute an embedding for each instruction, but then they're feeding these sequences of instructions into a self-attentive recurrent network in order to compute the final embedding. So it's very similar to word2vec, but a little more
sophisticated. This falls into what I would consider slightly more significant feature engineering.
And finally there's asm2vec, which was presented last year at the Jailbreak Security Summit by Sophia d'Antoine, and this is probably the most complex embedding methodology that I've seen anybody attempt to use. She included a lot of features besides the bytes of the function, such as its neighbors in the control flow graph, and a lot of other very computationally intensive and very domain-specific things, in the embedding method that she was doing. Unfortunately the code for this wasn't available, and I wasn't able to figure out how to recreate it just from the data
that's out there, so I wasn't able to include that in this particular experiment.
So finally, on to the data set I was using. I really wanted to find a data set where you had some very easy-to-define classes that were going to be present in the data set, but I also wanted some code that was going to have some relation over time. So a natural fit for this, for me, was the code base of the various BSD Unixes. If you look at the history of things, in 1992 you had 386BSD. Later that got forked into NetBSD and FreeBSD, and then back in '95 NetBSD got forked into OpenBSD. So we have this code base that had a similar start, a similar origin, 20 years ago or so, but that has evolved significantly in the years since. So, trying to come up with some kind of code similarity analysis, this looked like a really interesting corpus of code to take a look at.
So what I ended up getting out of this data set was 222 binaries. Each one of the binaries was present in either /bin or /usr/bin across all three operating systems, so your ls, your sh, whatever other binaries had a common name. The FreeBSD binaries I compiled with -O0 and -O2. For NetBSD and OpenBSD, I unfortunately was not able to get the make conf files to work for changing the optimization flags from the default, and after a couple of days of banging my head on that I just kind of gave up. Additionally, I only took functions that were greater than 16 bytes, because I really didn't want to have just a bunch
of thunks in there, because of course those are all going to be the exact same code, and that's going to throw off a lot of similarity metrics. So basically the criterion was that only functions greater than 16 bytes and less than 2048 bytes were used, mainly because anything over 2048 bytes is really just an outlier size-wise, and to keep it easy I wanted to have some boundary there. The initial parsing of all the functions was done with Binary Ninja, so I used it to determine where the function boundaries were, and all of these were compiled with debug symbols, so I didn't have to do any significant work and there
wasn't the problem of how you determine where the function begins and ends. So Binary Ninja was used to extract the binary code and the disassembly for all of the code, except for the SAFE model. One of the things I really wanted to try out here: the SAFE group trained the model that they released on x64 binaries from Linux, so several user-space library binaries from Linux. So the question here is, can you really just release a model trained on one set of code that generates embeddings, and still have those embeddings be comparable to embeddings that are trained using other code? And I got some really, really interesting
results with that, as I'll show you in a few slides. But the other problem was that SAFE uses radare2 to disassemble the code that it feeds into their model. Unfortunately, radare2 is not always the most reliable binary analysis system, so it broke on me trying to get the functions out of some of those binaries, and I ended up with about 19,000 functions that went into the model. So, some of the parameters used to train the models: word2vec, I trained that model for 40 epochs. The RNN model trained for 20 epochs; at that point it looked like it had largely stopped changing in perplexity. The SAFE model I didn't train, simply because I wanted to see
how well that generalizes. Yes, I know that this is definitely an experiment where I'm throwing a lot of different things out there, and over the next few years there are definitely a lot of interesting things to follow up on and maybe get more rigorous about, but like I said, I still got some really interesting results out of this. And all the models used a dimensionality of 100. Whenever you're comparing any kind of embedding, you really want to make sure you use a constant dimensionality, because different dimensionalities are going to have different capacities for encoding data. So you really don't want to say, okay, we're going to
trust this one with ten dimensions and this one with a thousand, because obviously the thousand dimensions can hold a lot more data than ten. The other thing: each data set used a different compiler. FreeBSD was compiled with Clang 7.0.1, OpenBSD was built with Clang 6.0.1, and NetBSD was compiled with GCC 5.5.0. As far as the class breakdown, FreeBSD -O0 had significantly more functions, so it was a significantly larger part of the data set. I think it's still not large enough for any data set imbalances to really cause issues, especially since most of the issues here are more with model generation rather than with
classification. Another interesting thing, as a side note, is that both FreeBSD sets came from the same source code, so there's obviously a lot of function inlining and other optimizations going on between -O0 and -O2, which is something to keep in mind when you're looking at trying to take two binaries and determine whether they came from a similar code base. So one of the things I like to do whenever I'm doing any kind of unsupervised or cluster analysis is throw up a kind of scatter plot of the data that I'm looking at. So in this case these are just basic scatter plots of the
data set for the three models, and just looking at it off the bat you can see there are some really interesting things in the data right here. You'll notice in the SAFE plot the classes are a lot more intermixed, when projected down to two dimensions via t-SNE, than in the other two. You get the cleanest separation within the RNN model, and the word2vec one is kind of similar to the SAFE model. Granted, in a lot of cases you really don't get much more than some intuitive ideas of what kind of class separation you might have in the data, and some pretty
pictures for a presentation, but I've still found this to be a fairly useful technique. So one method I've come up with over the past couple of years for determining whether two embedding methods work equally well is heavily inspired by natural language processing. One of the main ways of testing natural language embeddings is to take your embedding and see how well it works for determining what words are synonyms. If you look at English and other languages, we've got plenty of lists out there that say, you know, loves, likes, appreciates, those are all similar words, they're all synonyms for each other, so they should show up in similar parts of the
vector space. With binary code we really don't have that common knowledge, that common ground truth, where we know that when we see these two functions we should say that they're similar. So that's where the concept of consistency comes in. All that hard consistency means is that if I take function F and find its nearest neighbor with embedding method A, and I take that same function with embedding method B, then out of the entire data set it should have the same nearest neighbor. So, if I embed a function in SAFE, hard consistency with word2vec would be that that same function has the same
nearest neighbor in word2vec as it does in SAFE. Soft consistency just means that the nearest neighbor of, say, function F in SAFE is within the ten nearest neighbors, or the n nearest neighbors, of the same function in word2vec, for some value of n. And actually I got some really good numbers on that. You'll notice the consistency between the RNN methodology and word2vec was basically non-existent; it basically showed that there is no consistency. But there was a remarkable amount of consistency between the word2vec method and the SAFE method: on the order of 43% of the functions had hard consistency, and
then even with 10, which is obviously kind of a small number for n for the soft consistency, that still brought it up to about 68% consistent functions from one embedding to another. The takeaway from here is that, at least for this experiment, models that had similar feature engineering ended up being largely consistent with each other. And I thought that was very interesting, because simply adding a few vectors together is very different from creating a self-attentive network over the same set of vectors. So the much simpler model with the same featurization ended up learning the same stuff
as the more complicated model. One of the things I'm interested in trying later is upping the value of n in this experiment to see just how far that increases the consistency, maybe try with an n of a hundred, or 20, and see how that changes things. So, like I said, the models built on the same features tend to be consistent; models built on different features tend to be consistently inconsistent. The additional test I did, which is also common in natural language, is: given a supervised learning task, how well does the embedding work? In this case, most of the times that I've tried this on code embeddings it's been kind of a
waste of time, because all of the supervised data we have ends up being trivially easy to model with whichever vector. So in this case the supervised learning task was: which data set did it come from? Is it FreeBSD, is it OpenBSD, is it NetBSD? Every single embedding methodology got 99-point-something percent on that. So in order to really exercise code embeddings, we need to try and come up with some more challenging tasks and data sets to get some more useful comparisons, because saying everything works great isn't necessarily the most useful thing to say. So some of the conclusions and takeaways from
this: tooling can definitely make things complicated. The radare2 dependency in SAFE definitely makes that particular methodology hard to just take off the shelf and put into use without doing some significant engineering to, say, put a different disassembler on the front. Another thing was that even if consistency between models is poor, that doesn't mean that the embedding is not going to work for your particular use case. As I saw, even though the two different embedding feature sets learned different things, they still worked when you used them at the end of the day. And getting up to where things were going with asm2vec,
features besides the compositional features of code are definitely an area that's ripe for research. There are a lot of really interesting things that we could be doing, say with trying to incorporate the control flow graph and other features from the binary into the model. One of the things I think would be interesting to do from this in the future: in most of the work so far, people have naturally looked at functions as being the best semantic block of code to measure things in, because we humans tend to look at things in terms of functions. But looking at things from the computer side,
maybe looking at basic blocks would be easier. A basic block is just a section of code where you don't have any control flow changes, so that sidesteps a lot of the problems you have when you're doing program analysis, as far as determining, you know, have we ended the function, or did we just jump somewhere else due to something else, or is this just a piece of code that has nothing but jumps in order to get around function analysis? So basic block embeddings are definitely a really interesting area that I plan on taking this into in the future. Other than that, there's my Twitter and
email. Anybody have any questions? [Applause]
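The hard and soft consistency measures defined in the talk can be sketched as follows. This is one reading of the spoken definition, not the presenter's code: the function names, the toy two-dimensional vectors, and the use of cosine similarity as the distance are all assumptions. With `n=1` the check is hard consistency; with a larger `n` it is soft consistency.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nearest_neighbors(embeddings, name, n):
    """Names of the n functions closest to `name`, excluding itself."""
    scored = [
        (cosine(embeddings[name], vec), other)
        for other, vec in embeddings.items()
        if other != name
    ]
    # Highest similarity first; ties broken by name for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:n]]

def consistency(emb_a, emb_b, n=1):
    """Fraction of functions whose nearest neighbor under method A
    appears among their n nearest neighbors under method B."""
    shared = sorted(set(emb_a) & set(emb_b))
    hits = 0
    for name in shared:
        nearest_in_a = nearest_neighbors(emb_a, name, 1)[0]
        if nearest_in_a in nearest_neighbors(emb_b, name, n):
            hits += 1
    return hits / len(shared)

# Invented embeddings of the same four functions under two hypothetical methods.
method_a = {"f1": [1.0, 0.0], "f2": [0.9, 0.1], "f3": [0.0, 1.0], "f4": [0.1, 0.9]}
method_b = {"f1": [1.0, 0.0], "f2": [0.95, 0.05], "f3": [0.8, 0.6], "f4": [0.0, 1.0]}

hard = consistency(method_a, method_b, n=1)
soft = consistency(method_a, method_b, n=3)
print(f"hard consistency: {hard:.2f}, soft consistency (n=3): {soft:.2f}")
```

In this toy data one function, `f3`, changes its nearest neighbor between the two methods, so hard consistency drops below 1 while a wide enough soft window still recovers it.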
Thanks, awesome talk. This whole area is very much my jam, so I really enjoyed that, thank you. Having said that, I'm now going to hit you with kind of a sneaky question. One of the things that's come up in looking at embeddings for natural language models is this question of bias. I think most of us are familiar at this point with the notion that if you translate from Hungarian, which doesn't have gendered pronouns, you get different gendered pronouns in English for nurse versus doctor. So to what extent do you think, based on your experience with this, that might be an issue going forward? Because obviously a lot of the code bases that we have available to us,
from malware aggregation sites or large code bases or stuff like this, are all skewed from normal behavior. So if you get into actually using this for reverse engineering, at that point the quality of those embeddings, and whether or not they're actually mapping to the right functionality, might be biased by that data and might lead you down the wrong path. So do you have any thoughts on that?
Yeah, so I think that's definitely a very heavy potential issue, but I think one of the key differences that we can leverage in terms of code, as opposed to natural language, is that ultimately we're not
really looking at what the semantic intent is; we're looking at what the compiler generated. That's one of the reasons I think moving towards a basic block approach, where we're looking at how we combine basic blocks and then derive meaning from those, can potentially help with a lot of the issues with data set bias. Because, I mean, a compiler is only going to generate so many basic blocks. Humans are only going to code so many different things before they put an if in there, or some other thing that's going to cause a control flow change. So I think that's definitely one possible mitigation, but yeah, that's basically a huge area
of potential concern, and one that I don't have a good answer for.
Cool. So I have a question as well, on a similar track. Have you looked at function embeddings when you, say, inject a single line or something into the function, and seeing whether the function embeddings would also be similar there?
No, I haven't. That would be interesting to do. I'm just reminded of a news post from last week about Cylance and malware classification. So yeah, that would be interesting. I think these should largely be robust against that particular type of technique, just because the code that you inject would have to, in some fashion, outweigh the code that
you want, that's already there. I mean, if you look at, say, the word2vec or SAFE model, whatever line of code you're going to put in, first off you have to know what vector is going to be generated by the instruction vectorizer, and then that vector would have to be of high enough magnitude to throw off everything else that it's already seen. So I don't know, but I think that's definitely interesting to do.
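The magnitude argument in that answer can be illustrated for the simple sum-of-instruction-vectors embedding described earlier in the talk. This is a toy sketch only: the two-dimensional instruction vectors are invented, the helper names are hypothetical, and it says nothing about how the SAFE or RNN models would actually respond to injected code.

```python
import math

def embed_function(instruction_vectors):
    """Sum per-instruction vectors into one function embedding."""
    dims = len(instruction_vectors[0])
    return [sum(vec[i] for vec in instruction_vectors) for i in range(dims)]

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Invented instruction vectors for an existing function.
original = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [1.0, 0.1]]

# One injected instruction with an ordinary-sized vector, and one with an
# unrealistically large vector pointing in a different direction.
ordinary_injection = [0.0, 1.0]
oversized_injection = [0.0, 50.0]

baseline = embed_function(original)
sim_ordinary = cosine(baseline, embed_function(original + [ordinary_injection]))
sim_oversized = cosine(baseline, embed_function(original + [oversized_injection]))

print(f"similarity after ordinary injection:  {sim_ordinary:.3f}")
print(f"similarity after oversized injection: {sim_oversized:.3f}")
```

Under these made-up numbers, a single ordinary instruction barely moves the summed embedding, while only a vector far larger than the rest drags it away, which is the intuition behind the "high enough magnitude" remark.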
all right [Applause]