
So, birthday hunting. We're looking at applying some of the ideas behind the maths of the birthday paradox to some of the work that goes on around the CSIRT. For anyone who may be unfamiliar with CERT work or the birthday paradox, we're going to go through a little of each before moving on to how it all fits together.

About me: I'm a security data scientist. I've worked at a number of CERTs over the years, and I have a formal background in physics and risk before that.

One of the problems that can develop in a CERT over time is a kind of tunnel vision. A few signatures tend to dominate all the attention: IDS, DLP, phishing
reports, that sort of thing. In a way, the map you're looking at of what's happening in your network becomes the terrain, and this is what everyone's skills become: we're very good at closing phishing tickets, very good at closing DLP tickets. Then, when something serious happens, under duress these are the skills we fall back on, and it's not so common that these are the incidents that cause the big crises in the SOC.

One of the other problems is evaluating machine learning products. There are a lot of them out there, and they all do some fantastic stuff, but on the ground, for the analysts,
often it's very much just that you get a list, you have a look at it, and you say, "Oh, this looks pretty good, maybe this looks a bit funny, maybe there are some false positives in there," and then you have to make a decision.

So, the birthday paradox itself. The classical presentation is: how many people do you need in a room before it's 50 percent likely that two of them share a birthday? With about 23 people it's around 50%, and if you go up to 40 it's suddenly close to 90%. To represent it, it's much easier with the maths to consider the likelihood for a single pair and then count the pairs: two people chosen combinatorially out of N.
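The figures quoted for the classical problem can be checked with a short sketch. This is an illustration rather than anything shown in the talk: the function names are made up here, and it assumes 365 equally likely birthdays.

```python
def birthday_exact(people, days=365):
    """Exact probability that at least two of `people` share a birthday,
    assuming every day is equally likely."""
    p_all_distinct = 1.0
    for k in range(people):
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct


def birthday_pairs(people, days=365):
    """Pair-counting approximation: treat each of the N-choose-2 pairs
    as an independent chance of a match."""
    pairs = people * (people - 1) // 2
    return 1.0 - (1.0 - 1.0 / days) ** pairs


for n in (23, 40):
    print(n, round(birthday_exact(n), 3), round(birthday_pairs(n), 3))
```

Under these assumptions, 23 people comes out at roughly 50% and 40 people at nearly 90%, and the pair-counting shortcut stays within a couple of percentage points of the exact value.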
Then there are 365 days, so what are the odds that a pair doesn't land on the same one of those 365? Take that away from one, and now you have a probability you can work out for the number of people in the room.

To get a more intuitive feel for how this works: there are only 365 days to go around, rather than the one-in-365 chance that you picture in your head. Think of it as a grid with about 300 squares in it, and you're trying to throw, say, 20 stones at it. As you get to, say, the nth stone, if it's not going to hit anything else, it starts to
run out of space pretty quickly. You can imagine that as the number gets higher, you very quickly run out of places for it to land.

So, threat hunting. Just for the purposes of the talk, this is the CSIRT: it goes in and investigates indicators. The two main ways for this to happen are matching, which is your signatures, your IDS, your DLP, that sort of thing, and hunting, which is more hypothesis-based: "I think this bad thing is out there," maybe a little intel-driven, that sort of stuff. Both of those feed into the same sort of process, so we roll
over to containment: we park the van out front, we go and have a look at the pixies down the bottom of the garden, see what's going on, circle around. That then follows into the usual remediation steps: call HR, legal, the network and infrastructure teams, that sort of thing.

In practice, what often ends up happening is that the cycle becomes very front-heavy. In the incident response cycle, you do a lot of identification and then a lot of just closing tickets. Particularly, as discussed, around phishing, DLP and malware, these become the heavy focus, and suddenly our DFIR process becomes a lot closer to just
pulling a machine from the network and reimaging it than the process you might need for a big, serious incident. The question then becomes: how do you unbias this in a nice, structured way, so everyone gets good experience of what goes on in the more critical incidents that are going to be damaging for the SOC or the company?

So, for the maths, the first thing we're going to do is generalize the birthday paradox. We started with 365 days; we can do a Taylor expansion and some tricky triangle-number maths and end up with a much more general form, this one. Here we can say: okay, we have n checks of something going on
for each of our assets, and we have M assets, and then the probability of finding something interesting comes out something like this. You can approximate it a little further if you want to do another Taylor expansion on that exponential, but as it is, it's easy enough to put into a spreadsheet.

As for what this looks like from the adversary's perspective: first, there isn't a time dimension to this equation, so it all sort of happens at once. You can make an argument that, in time, this is roughly the dwell time. At, say, 90 days, you can certainly look through a lot of assets for WMI, services, scheduled tasks, that sort
of thing, the greatest hits, to find something interesting. If you're doing all this over a 90-day period, you can certainly get a high number in there. Where can the adversary possibly compromise that isn't going to be looked at in that sort of time period? In the same way as we had with the birthdays, where you were running out of space, now in your network the adversary is running out of space as well.

In terms of actual numbers on the board, there are a couple of interesting things to look at here. One is that for 1k assets, which is maybe a pretty small organization, it really doesn't take much to get
that up to a pretty reasonable likelihood of finding something interesting. The other interesting one is 100k assets, so maybe you're a very large organization now. Finding, say, 500 hours across all your staff in a 90-day period doesn't strike me as that challenging, and if you want to narrow it down to just the critical stuff, it seems pretty reasonable.

The other thing that's interesting here is that this is only talking about a single compromised host. I don't know about your organizations, and I'm not so much speaking for mine, but it seems like the likelihood of only a single compromised host out of, say, a hundred thousand assets is
maybe unlikely. In that case, those count as an extra hunt, if you like. So that's one more that you can find as well, which takes you back to sort of one of these guys.

As for how valuable it is to the organization, the first thing is working out an acceptable likelihood. This won't work every time, there's a statistical element to it, so what's acceptable for you and for the organization? As for putting people on this, you don't necessarily need your super forensicators; the regular people, maybe with some guided learning going on, can get a pretty reasonable crack at it. The other thing is that this does a lot better at looking for
existential risks. It's not hard to imagine some kind of incident which, however unlikely, could certainly happen, and that's the end of the company: maybe all your investors leave, maybe all your computers get wiped, maybe your mask and your domain controllers are gone. These sorts of things can happen, but they're also not that likely to show up as an IDS alert straight away, or as malware, or as phishing. Maybe they'll start that way, but something that's been lurking there for a little while is much more susceptible to being found through this sort of approach.

The other nice thing about this is that, much like some of the cyber products out there, this one's self-learning, it's
agentless, there's no end of life. It's all just driven by your people, so it's the sort of thing that's relatively easy to implement: maybe it's just setting a calendar reminder in Outlook and you're good to go. It's also a very good low-pressure environment. We don't know that anything in there has happened, but that's always true for SOC or CERT work anyway, so having people on that is good experience: you get out there in the network, you're pulling logs, you're seeing what's really going on in your network rather than viewing it through the lens of your SIEM or your IDS.

There are a couple of tricks you can do to shrink the number
of assets you need to worry about. You can show, through a two-player game of good guy versus bad guy, with two strategies each, hack something or don't hack something, and investigate the alert or don't investigate the alert, that the likelihood of the bad guy doing something is roughly proportional to the value of the asset. So it's your high-value assets, which intuitively makes sense, and it's nice that we have the maths to back it up. What that means for your approach here is that maybe you don't need to go and look at your low-value server, you can probably leave that one alone, while maybe the
Exchange server is higher up on the list. There are a couple of great log reduction tools to pull out the interesting stuff, so the machines that don't have anything interesting on them you don't necessarily need to worry about. David J. Bianco's Clearcut is pretty good, and it's on GitHub. You can also knock together a couple of Python scripts looking at things like Markov chain transitions for the log lines, you can summarize entropy, and you can look for small anomalies in those with just simple thresholding, that sort of thing.

Another good one is shadow IT. Through this process you start to find things like: I don't know what this VM is, I don't know what this
asset is, I don't know what this asset is. Maybe that's a good one to handball to your network or your endpoint team, to get those taken off the network or shifted to the proper channels, and make it somebody else's problem.

There are also some things you can do to get the number of investigations you can do up. Talking about investigations, there's a lot of great guidance around for InfoSec education: there are cheat sheets that walk you through it, and there are great books on malware analysis, network analysis, that sort of thing. You can also cut the scope down, so you don't necessarily have to look for absolutely everything. Things like lateral movement, things like
persistence and privilege escalation, those sorts of things leave a lot of forensic residue that sticks around, so they make a good target for this sort of thing. Then they can be cut down further to just the hits: WMI stuff, maybe services, scheduled tasks.

Another one: getting compromised more is a reasonable strategy. It's certainly going to work. Each additional server with something suspicious going on means that, if you're just looking for one, the maths doesn't care about that one any more, so you can exclude it from your pile.

Then some interesting stuff happens if you get to the end of this process and you haven't necessarily
found anything. There are a few ways to look at that. One is that maybe you were just unlucky, and that happens. But if it's happening consistently, there's some stuff you can infer from that. One: maybe you're not compromised at all, and you have a beautiful, secure network, which is feasible. Or the maths could be wrong, which is probably unlikely. But then there are other ones, around the hypotheses that you're generating for your hunting and the places that you're looking. If there's a significant difference between what those turn up and what just happens randomly through this process, maybe that's a good signal to review those processes and have a look at what's
going on there. So that's one way to apply this idea.

What we have now is a baseline for how often you should expect to find something if you're just going out there and pulling boxes, pulling the memory images, getting the core forensic stuff out of them and having a squiz through that. You can now also use this to evaluate the big, expensive security products you've bought. For the purposes of the talk, anyway, we'll say it's a black box: you don't necessarily know what's going on inside, they won't necessarily tell you, and it'll just give you a list of the suspicious assets you should have a look at. The
other property of these is that they're usually capital-E Expensive. The marketing for these is often wild: you've got some mythical bird god that's going to swoop in and find your stuff. One of the core problems here is that the company is only going to make a sale if the POC is successful. That's fine for the company, but not so much for you, because we know there's a high likelihood of just finding something anyway. And this company certainly isn't selling just to your company, they're selling to everybody at once. That means that while you may or may not find something, they're certainly going to make their revenue, potentially just off luck,
depending on how many things they pulled out and how great the marketing is. This random number generator might end up being cheaper anyway than you implementing some kind of random process where you go out and have a look yourself. Maybe you want to offload that, so maybe it's worth it in that sense. But this is the sort of baseline you want to be comparing it to, rather than something like, "We found something very interesting in our logs, and this is why we should buy the product."

So take this idea that we're going to validate the answers that come out of it. Then spending a
lot of time figuring out its exact inner workings doesn't necessarily matter, unlike what a lot of "how to buy a machine learning product" guides will tell you: figure out how they're training it, figure out the sort of models they're using, figure out how often they're retraining it. For this sort of process, it doesn't matter so much even if they're doing some kind of Mechanical Turk thing where they just pipe the numbers off to Amazon and have someone else look at them. What you're really interested in is: what stuff are they telling us, how much better are they doing than a simple random number generator, and then, is this
cheaper than implementing a random number generator myself?

So, yeah, from the equations from before you can just calculate those. There's not too much to it, you can do it in a spreadsheet, and maybe even simplify it again if you're spooked by having an exponential in there.

Coming to the precision that we're looking for here: it's not so much dieter's, it's not one in however many assets you have, it's just one in the assets that you have. There aren't any more than that, and there certainly aren't any less than that. So if you catalog these and have a good understanding of your assets, while it's a boring answer that you get from every security talk,
it's also a great foundation on which to build this more sophisticated ML sort of thing. The nice thing here is that if you know how many you should be expecting to find in there, or at least a portion of that from the red team, you can use that as the baseline. Now you have an M, and you know how many you went and had a look at, so you can solve for the percentage that comes out. You're solving it the other way around, and now you're looking at a measure of success for how good your, not necessarily threat hunting team's, but regular analyst-type "we found something, let's go have a look at it" work is.

What this also
gets us is just looking at the logs, rather than depending on whatever comes out of the SIEM at the time, and you get a nice, structured way of analyzing that. So it's a nice drill, and it's a detection method in itself, because the things that you find aren't going to be part of your regular coverage. Maybe you've mapped to the MITRE ATT&CK framework, and there's, say, a thirty percent likelihood of finding the things that are well covered; this potentially gives you much broader coverage than that for not too much, well, not really any, infrastructure investment. Maybe you also get to say you're doing chaos engineering on the back of it: with some probability, you're
working through the whole process and finding out about the long-tail stuff that happens when an incident goes down.

Probably the last one is that even a broken machine learning product is right 22 percent of the day. The POCs aren't ergodic: you doing 100 POCs is different to the company selling to a hundred people. So don't just pull out the good stuff and have a look at it; you really want to compare it to this baseline. So, thanks.
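The slide with the generalized formula is not captured in the transcript, so the sketch below assumes the standard birthday approximation, P ≈ 1 − exp(−n(n−1)/(2M)), which is what a Taylor expansion combined with the triangle number n(n−1)/2 produces. Treat it as a stand-in for the speaker's equation rather than a reproduction of it; `hit_probability`, `hit_probability_small` and `checks_needed` are hypothetical names, with n the number of checks and M the number of assets as in the talk.

```python
import math


def hit_probability(n_checks, m_assets):
    """Assumed generalized birthday form: 1 - exp(-n(n-1) / (2M))."""
    return 1.0 - math.exp(-n_checks * (n_checks - 1) / (2.0 * m_assets))


def hit_probability_small(n_checks, m_assets):
    """Further Taylor expansion of the exponential; only reasonable
    while the result is well below 1."""
    return n_checks * (n_checks - 1) / (2.0 * m_assets)


def checks_needed(m_assets, target):
    """Smallest n with hit_probability(n, M) >= target (0 < target < 1),
    found by solving n(n-1)/2 = -M * ln(1 - target) for n."""
    rhs = -2.0 * m_assets * math.log(1.0 - target)
    n = math.ceil((1.0 + math.sqrt(1.0 + 4.0 * rhs)) / 2.0)
    # Guard against floating-point rounding at the boundary.
    while hit_probability(n, m_assets) < target:
        n += 1
    return n


for m in (1_000, 100_000):
    for target in (0.5, 0.9):
        n = checks_needed(m, target)
        print(f"M={m:>7}  target={target:.0%}  checks needed={n}")
```

As a sanity check on the assumed form, plugging in M = 365 and a 50% target gives 23, which recovers the classical birthday figure. The same two functions translate directly into spreadsheet formulas.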
All right, do we have any questions?

So, I know this was a half-hour talk, so it went really quick. Would it be possible for you to put out a Jupyter notebook laying out, here's an example of testing an RNG against some basic classifier, a linear regression or something?

Yep.

Just so that we can see the work-through. Not even against any real data, just against a simulation, so that we can see how the calculations would occur.

So, you could. It's really around this: you have your thing that'll give you a certain number. It looked at a hundred thousand assets and came back with ten, twenty, a hundred,
whatever. Then you would just send maybe a couple of analysts on a Friday: grab an hour, go look for persistence in as many VMs as you can get to, that sort of thing. Then, because you know how many assets you have and you know how many they looked at, whether those numbers are the same, just compared to this, is the measure of the thing.

It makes sense, but for the purposes of remembering this after I go drinking tonight, it would be very helpful to have it laid out in a little notebook or something.

Yeah, sure.

We've got one in the back.
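A notebook of the kind requested in that exchange might start along these lines. This is a rough sketch on simulated numbers only, not the speaker's material: it compares a black-box product's flagged list against a uniformly random list of the same length using a hypergeometric tail, which is a conventional choice made for this sketch rather than something stated in the talk. The asset counts, the function names and the POC figures are all made up for illustration.

```python
import math
import random


def random_list_tail(total, compromised, flagged, hits):
    """P(a uniformly random list of `flagged` assets contains at least
    `hits` of the `compromised` ones): a hypergeometric tail."""
    ways = sum(
        math.comb(compromised, k) * math.comb(total - compromised, flagged - k)
        for k in range(hits, min(compromised, flagged) + 1)
    )
    return ways / math.comb(total, flagged)


def simulate_random_lists(total, compromised, flagged, trials, seed=0):
    """Monte Carlo check of the same baseline: draw random lists and
    count how many compromised assets each one contains."""
    rng = random.Random(seed)
    counts = []
    for _ in range(trials):
        picks = rng.sample(range(total), flagged)
        # Label assets 0..compromised-1 as the compromised ones.
        counts.append(sum(1 for asset in picks if asset < compromised))
    return counts


# Hypothetical POC: 10,000 assets, 50 planted by a red team,
# the product flags 100 assets and 3 of them are real.
TOTAL, COMPROMISED, FLAGGED, HITS = 10_000, 50, 100, 3
p_luck = random_list_tail(TOTAL, COMPROMISED, FLAGGED, HITS)
sims = simulate_random_lists(TOTAL, COMPROMISED, FLAGGED, trials=2_000)
p_sim = sum(1 for c in sims if c >= HITS) / len(sims)
print(f"chance a random list does at least as well: {p_luck:.4f} "
      f"(simulated {p_sim:.4f})")
```

A small value here means the product's list is doing noticeably better than picking assets at random; a large one means the result could plausibly be luck.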
So, one of the underlying assumptions for the birthday paradox is that you basically have uniform birthdays, right? Because everybody has roughly the same probability. That doesn't strike me as the case here.

Yeah, so it's addressed a little bit back in the M stuff. The key thing you assume is that at least one asset is compromised, so it starts from assumed breach, if you really believe that's a good assumption, which may or may not be true. So that's fair. The idea here, though, is that it doesn't necessarily matter if it's equal or not, it's just a question of more or less. So there's
some minimum amount, and they're more likely to be in high-value assets, so that's absolutely going to fluctuate around a bit. To a first-order sort of approximation, maybe you can do some stuff like this, but you don't really have anything to say that something would be more or less likely to be compromised, so over ten thousand assets it's probably going to average out.

Any other questions? All right, thanks again, then.

Thanks.

[Applause]
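On the uniformity question raised in the Q&A, a known property of the birthday problem is that uneven probabilities make a match more likely, not less, so the uniform case is the conservative one. A small simulation sketch of that point follows. The weights are invented for illustration (one tenth of the slots, standing in for high-value assets, are ten times as likely to be picked), and `collision_rate` is a hypothetical helper, not something from the talk.

```python
import random


def collision_rate(weights, draws, trials, seed=0):
    """Estimate P(at least two of `draws` land on the same slot) when
    slots are chosen with the given relative weights."""
    rng = random.Random(seed)
    slots = range(len(weights))
    collisions = 0
    for _ in range(trials):
        picks = rng.choices(slots, weights=weights, k=draws)
        if len(set(picks)) < draws:
            collisions += 1
    return collisions / trials


SLOTS, DRAWS, TRIALS = 365, 23, 5_000
uniform = [1.0] * SLOTS
# Assumed skew for illustration: the first 10% of slots ("high value")
# are ten times as likely to be picked as the rest.
skewed = [10.0 if i < SLOTS // 10 else 1.0 for i in range(SLOTS)]

p_uniform = collision_rate(uniform, DRAWS, TRIALS)
p_skewed = collision_rate(skewed, DRAWS, TRIALS)
print(f"uniform: {p_uniform:.3f}  skewed: {p_skewed:.3f}")
```

With these made-up weights the skewed estimate comes out well above the uniform one, which is consistent with treating the uniform formula as a floor rather than an exact figure.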