
The Obfuscation Situation by John Atkinson

BSides Leeds · 24:49 · 85 views · Published 2023-07
Tags: Team: Blue · Style: Talk

Hello everyone. I'm John Atkinson, and this is The Obfuscation Situation, which seemed like a good idea for a talk title before I had to say it out loud. I'm a security data science person. I'm currently a data science lead for one of the larger US-based DDR vendors, and previously I worked as head of security data science for one of the government departments down the road, and I've done some other stuff before that, of course. All the code used in this talk is on GitHub for people who are interested. Let's begin. The message for this talk is basically: if you're a blue teamer and you want to share some information,

handling that data and obfuscating it the way you want is really difficult, and sometimes as researchers or investigators we have a reliance on this sort of data. But the fun part is, if you work in cyber security, the bad guys have this exact same problem too. So this talk is about fundamentals. It's not about ML, it's not about AI magic. There's a tiny bit of stats, but I've covered it in bright colors so it's fine. So, we're building a data set and we want to somehow remove some information so it's safe to share internally, externally, with whoever you want to give it to. Our friends down the road, the ODI, the

Open Data Institute, not so long ago published an article called "Anonymizing data in times of crisis", and this is the techie bit. You'll note it's step four. Step one is ethical and legal considerations, so if you're considering doing this, basically step one is get help. Don't try and do this on your own just for some data you've got. But there are some things we can do. We can keep it, that's really obvious. We can suppress it, i.e. remove it, throw it away. We can generalize things, so remove some detail: maybe you don't need to know my address exactly, maybe it's fine that I live in Leeds, or live in England.

We can randomize, and I'll show some examples of that in a minute, where we add some noise. Or we can pseudonymize it, which from now on we're going to call adding fakery, because that's really hard to say. So this is the boring techie bit to get out of the way. We've got a data set and we want to make it realistic looking but not like the original data. Originally we had a title, your Mr, Mrs, Ms, whatever. We're going to throw that away. We also have a first name and a last name. What we're going to do is generate some new ones, but we're not just going to generate them randomly: every John

is going to be replaced by the same value, so if John goes to Chris or whatever, every John will become a Chris. Same for last names: Atkinsons will be called Thompsons or whatever. Gender we're throwing away. PIN we're throwing away, because that's secret. We're basically throwing away the information, but instead of deleting it we're just going to overwrite it with the same static string, which is the same as completely removing it, but if you've got systems that are expecting that as an input, there's still something there and they won't break. Favorite color: again, we're going to generate some colors. Phone numbers: we're going to generate legitimate-looking phone numbers. Email, IPs, you get it.

They're going to be private IPv4, and one column we're going to pass straight through, because that's just the column I use to try and work out how things map together. So we've done that, and we've got our cast of Cluedo. You'll see, if you look at the blue bit, those highlighted phone numbers have been consistently replaced. If you look at the green IPs, they've been nicely replaced by different IPs. It looks like it's going okay, so maybe this is safe to share. There's a bit of weirdness though. If you look at favorite colors: if you ask a human what their favorite color is, they'd probably say red, green, blue.
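The consistent replacement described above (every John becomes the same fake name, dropped columns disappear, secret columns are overwritten with one static string) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code from the talk: the name pools, the `row` column, the sample records and the helper names `consistent_mapping` and `pseudonymize` are all hypothetical, and a small hard-coded list stands in for a fake-data library.

```python
import random
from collections import defaultdict

# Hypothetical replacement pools; a real pipeline would draw from a much
# larger source such as a fake-data library.
FAKE_FIRST_NAMES = ["Chris", "Ashley", "Tracy", "Kevin", "Morgan", "Sam"]
FAKE_LAST_NAMES = ["Thompson", "Sanderson", "Williams", "Bell", "Patel", "Khan"]
STATIC_PLACEHOLDER = "REDACTED"


def consistent_mapping(pool, rng):
    """Return a defaultdict that gives each new original value an unused fake value."""
    unused = list(pool)
    rng.shuffle(unused)

    def next_fake():
        if not unused:
            # A small pool runs out quickly; larger pools run out later.
            raise RuntimeError("replacement pool exhausted")
        return unused.pop()

    return defaultdict(next_fake)


def pseudonymize(records, seed=0):
    """Return (safe_records, mappings) for a list of dict records."""
    rng = random.Random(seed)
    first_names = consistent_mapping(FAKE_FIRST_NAMES, rng)
    last_names = consistent_mapping(FAKE_LAST_NAMES, rng)
    safe = []
    for record in records:
        safe.append({
            "row": record["row"],                      # passed straight through
            "first_name": first_names[record["first_name"]],
            "last_name": last_names[record["last_name"]],
            "pin": STATIC_PLACEHOLDER,                 # same static string for everyone
            # "title" is deliberately not copied, so it is thrown away
        })
    mappings = {"first_name": dict(first_names), "last_name": dict(last_names)}
    return safe, mappings


# Hypothetical sample data for illustration only.
RECORDS = [
    {"row": 1, "title": "Mr", "first_name": "John", "last_name": "Atkinson", "pin": "1234"},
    {"row": 2, "title": "Ms", "first_name": "Jane", "last_name": "Atkinson", "pin": "9876"},
    {"row": 3, "title": "Mr", "first_name": "John", "last_name": "Smith", "pin": "5555"},
]

SAFE_RECORDS, MAPPINGS = pseudonymize(RECORDS)
```

The returned `MAPPINGS` dictionary is the sensitive part: discarding it makes the substitution one-way, while keeping it somewhere safe lets the same transform be reused on another data set.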

They probably don't say Old Lace, unless they work for Dulux. Similarly, if your name is Ashley Sanderson, it's possible you have the email address williamskevin@example.net, but it's going to be unlikely, particularly if this is a work email rather than something that's likely to be shared. So even though we've only started to obfuscate, to anonymize, this data, we've already, even with really simple transformations, started to break relationships that we as humans might expect to be in there. This can be a good thing or a bad thing depending on whether you want to keep them in, and it depends on who you're giving this information to. This might be really bad

if you're later trying to work out what the username or email address for a particular person might be based on the data. But of course the bad guys have this problem too. So Tracy here has very kindly sent me a spam email in French, which is a language I don't speak, and it's a bit weird that she has the email address jbell@ndfcube.com. So this is just something that's inconsistent. It sticks out, plus, you know, the URL shortener and everything else that made Google decide this is spam. But it's another data point that's hard to manage unless you're really looking for it. So yeah, doing a substitution is

actually quite easy. Consistency is hard, and it's really hard if you want to try and keep it consistent across different variables. You can also go too real, so dangerous defaults is a fun one. If you've got things of the form, maybe it's IDs, ABC-XYZ, there's the really fun example of a US wallet company that was very big, and to sell their wallets they put in a fake ID card, and on there they made up a social security number and just put it in. Unfortunately that was a real person's social security number, and some people, because there were lots of people with these wallets, would look in their

wallet, see this printed number, and then use that as their own social security number. So the people that designed the wallets never expected this to be a problem, never really thought about it, but it was, and I think they ended up being sued, so that was a fun time for them, I'm sure, trying to explain that. Similarly, the library we're using here to do all these substitutions will run out of names eventually. There are far more names in the real world than there will be in the list that it has to draw from. And relatedly, if we had kept titles in, your Misters, your Mrs, etc., and hadn't thrown them away, if we replaced

those, and if you had a big enough data set, say for the whole of the UK, you'd still be able to work out which were the Misters originally, because they will always be the biggest group, assuming you've got a big enough sample of that sort. So you can leak information that way too. Randomizing things: this is a bit more of a subtle one than just replacing them. So that's the red dot, more or less. Maybe we don't want to give our exact information. Maybe we want to lie slightly about exactly where we are. So maybe we can add a bit of randomization when we're giving our coordinates, and instead

of being in this room, we're going to be having a nice walk down by the river. And this works, right, as long as you do it once. If you start doing this again and again, because your GPS does it every 30 seconds or whatever, aggregation will completely unravel this, because you don't need to be a genius to work out the actual point: the real location is the midpoint, the average, of all these additional points that have been generated. So this is one where you can detect spoofing. There are guides online to do this on your phone, but interestingly the main use doesn't seem to be fraud, although that is a reason to do it. The main reason is

actually to cheat at Pokémon Go, so there you go. Random is not undetectable, in fact the opposite. So we're going to talk briefly about domain generation algorithms, which are a technique that malware uses to work out what its command and control domain is. Which of these looks more legit to you as humans: is it goodhousekeeping.com, or is it this thing that I'm not going to try and pronounce? You don't need to be a malware analyst, if given these two options, to tell me which one of these is the WannaCry kill switch. You don't need to be a cyber professional to work that out. But that's interesting, because if we

think about why we know that, even without knowing, it's interesting to think about how we would teach a computer the same logic that our brain is using to go "that's weird". One way, a very simple way that you can do in some code very quickly, which is why I've chosen it, is to look at the combinations of letters in these domains. So if you've got "google", your bigrams, as we're calling them, are two letters: "go", "oo", "og", etc. So what you can do is take the top million domains that are in use across the internet from this data set called the Majestic Million, do this analysis, work

out the probabilities of how often they occur, and then when you get a new domain in, you can get the computer to score it based on the likelihood of each pair within that domain. As you can see from the graph here, this works pretty well. So with the Majestic Million domains of length 16 and Emotet, which is a constantly recurring email campaign that sometimes uses DGAs, you can separate the two, not perfectly but quite nicely, considering the code for this is really short. That's something you can teach machines to do that a human sort of knows through intuition but might find hard to articulate. You probably wouldn't use that on its

own. So the other thing you've got to wonder about when sharing data is external aggregation. That DGA thing's all very nice, but there will always be some overlap, so you can do what the domain reputation services do, which is pile in a whole load of other data as well: things from WHOIS, information about who registered the domain, where it was registered, how long it's been live, any scanning that's gone on to work out what's been on that domain previously. That's super powerful if you're trying to work out whether you should allow this connection out of your environment. But as defenders, this can of course also work against us.
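The bigram scoring idea from the DGA discussion above can be sketched briefly. This is an illustration under stated assumptions rather than the code from the talk: a tiny hard-coded list of labels stands in for the Majestic Million, the random-looking label is made up, add-one smoothing is one assumed way to handle unseen pairs, and the function names are hypothetical.

```python
import math
from collections import Counter

# Assumption: domain labels use a-z, 0-9 and hyphen, so 37 * 37 possible bigrams.
ALPHABET_SIZE = 37

# Hypothetical stand-in for a large list of popular domains.
TRAINING_LABELS = [
    "google", "youtube", "facebook", "wikipedia", "twitter", "instagram",
    "linkedin", "amazon", "microsoft", "github", "wordpress", "reddit",
    "netflix", "yahoo", "pinterest", "booking",
]


def bigrams(label):
    """Split a label into overlapping two-character pairs: google -> go, oo, og, gl, le."""
    label = label.lower()
    return [label[i:i + 2] for i in range(len(label) - 1)]


def train(labels):
    """Count how often each bigram occurs across the training labels."""
    counts = Counter()
    for label in labels:
        counts.update(bigrams(label))
    return counts


def score(label, counts):
    """Mean log-probability per bigram; higher means more like the training data."""
    pairs = bigrams(label)
    if not pairs:
        return float("-inf")
    total = sum(counts.values())
    vocab = ALPHABET_SIZE ** 2
    # Add-one smoothing so unseen bigrams get a small, non-zero probability.
    log_probs = [math.log((counts[p] + 1) / (total + vocab)) for p in pairs]
    return sum(log_probs) / len(log_probs)


COUNTS = train(TRAINING_LABELS)
LEGIT_SCORE = score("goodhousekeeping", COUNTS)
RANDOM_SCORE = score("xqzjvkwpfhgqzxvb", COUNTS)  # made-up random-looking label
```

Averaging over the number of bigrams keeps scores comparable between labels of different lengths. With a realistic training set the two populations still overlap, which is why a score like this is better suited to ranking than to a hard block decision.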

OPM is the US Office of Personnel Management. For the federal government, basically, it's what manages all their staff. And Anthem was, in fact still is, a US health insurer. These two were breached in 2015. And yes, it's bad that both these places were breached, obviously, that's medical insurance information and staffing information, but the worry was also that if you combine these two things, different people in different parts of the US government are going to have different types of health insurance based on what they're doing. So that was the worry there, back in 2015. And just because you've curated your data set to be, you know, consistent within itself, it's always worth thinking about what else it could

be joined with that might unravel all of your good work. Oh yeah, this is a bit of fun. So Bedep was a piece of malware that used domain generation, and the thing about domain generation is it looks really noisy, because it tries all the various combinations that are possible at that time, which looks quite noisy on the network. So what they decided to do was, instead of trying all these different combinations, connect to the European Central Bank, look at one of their rates, and then use that as the seed to know exactly what domain was valid for that day. Which is quite clever, but of course then

a connection to the European Central Bank becomes the indicator, if you're not the sort of organization that does that every five minutes. The other thing is your reliance on the library, in this case Faker, for whatever you're anonymizing. Rather hilariously, with the JavaScript Node version of this, the author decided one day he didn't want to do this anymore, and that was the end of the library. He just overwrote it, there were some swear words I recall, and if that's part of your data sanitization pipeline and you don't catch that, that's going to be a problem. So, like everything else, you have a reliance on a huge stack of tools to, you know, format the data the way you've

expected, the way you want the output to be. So, walking the balance. Attackers have this problem: there are your simple identifiers, your IOCs, your really well-known domains, your hashes on one end, and then random signals and detectable variance on the other side, where everything just looks a bit too weird and becomes detectable. So you've got to constantly balance, and you're constantly fighting against the blue team's attempts to detect this. And the data set curators have the exact same problem but for a different reason. You can give out your information with simple identifiers, you can give out the raw data if you really want to and you think it's safe, but that probably isn't what you want

to do. There are probably things that can be misused in there. But if you go too far the other way, particularly if you don't know exactly what this information is going to be used for, say it's just going into ML, you don't know what relationships you might be breaking by applying these transformations. So again, you can go too far either way: you'll end up with a random signal and have no useful information for whoever you're giving this data to. Practicalities: collaboration is key. Number one, please get specialist advice. Don't just decide to share some data on your own, you will not have a fun time. But more practically, if you're going to share some data with

someone, try and work out what they're trying to do and what exactly they need. This seems obvious, but often that's not the way it works. In particular, try and work out what patterns you do and don't want to persist, and if you can, try and either get rid of them or ensure that they're retained in a safe way. The other thing you want to do is try and obfuscate at source, which sounds really easy: you know, take the data set out of wherever it's stored and have it immediately safe in the anonymized form. This is actually very difficult, as if you're dealing with quantities of data of any decent size, or, you know, it's in a protected

environment, you probably just can't rock up with a laptop, apply your transformations and take it out. There's going to be some sort of pipeline, some sort of interface, and often that ends up being, you know, something that's a connected export, and then there's a stage in between. So that means whoever's doing this, if you're a researcher or data scientist or whatever, you become a target, because you have all this valuable original data in the same place where you can run a load of code to transform it. So this is what we're looking to try and achieve if you're a blue team: some representative, useful data sets where you've minimized the likelihood of

misuse. There is code for all these things if you're curious, all running in Jupyter notebooks or on Google Colab. If you're a blue teamer, maybe this is some ideas for detections, or if not detections, because this sort of logic can be quite noisy, maybe for ranking things if you're mid-investigation. If you're a red teamer, maybe it's some ideas for how to be sneaky, or perhaps more likely just how to make a lot of noise and do something else on the side. And that's it. Any questions?

Thank you. Yes? [audience question, partly inaudible]

So there's a lot of stuff. There's the sharing part, which is basically get advice from people who've either done it before and/or the people who are legally in charge of assuring that data. In terms of practically managing the data within your own environment, the main thing is just to keep what you need, and that will be dependent on whatever your business context is. You can set up various kinds of, you know, transforms, but again it comes down to keeping what you need. Sometimes it will be the simple version of this, which is just get rid of it, and sometimes it will be worth

transforming it, because if you've got people that want a data set to be usable, you know, it needs to look like a real record, it needs to look like a real person, because they're testing the system and they want to make sure that works.

So the interesting thing for data scientists is they need to sit between the two. Your developers, in theory, can have an entirely fake data set, as long as it's within the normal boundaries of what is in the real data set, so your production data can be completely separate. When you've got data scientists, if they're trying to get, for example, machine learning algorithms to learn features from data, you need the real data. So you've got this really awkward scenario where you've got live code, potentially hands-on-keyboard if you've not got some sort of CI/CD system set up, where you've got code and live data in the same place, and then it's about how you manage the risk of doing

that. Yes?

and say I'm sorry

problems thou shall never go because

That's certainly one way. I think with many of these battles there are different ways you can get around it. Ideally, if you're going to be doing this on an ongoing basis, you want to be able to support the use case for data scientists, i.e. code and data. And it's always going to be risk assessments, because everything is risk acceptance, but it means having a proper environment for that, where everything is decided beforehand and put through a proper project, rather than just doing it for two weeks and then a year from now also just doing it for two weeks. It's one of those types of decisions and conversation points.

So there are obfuscation products, and yes, they will help avoid that. The problem is you probably want a data scientist, or someone who knows the data, whatever their job title is, to make the decisions about which relationships are being broken, because the tool can't know that. Maybe consultants who have worked on similar data can help, but you do really need someone who knows the data set and what they're going to try and do with it after it's obfuscated.

Solutions more so you can to understand what you've got access to business

[Music] so

they're great

um

So I didn't note it on the slide, but this default dictionary is what stores the mapping from the original color or name or whatever to the replaced version. So if you're doing it one way, you just throw that dictionary away after you've done the conversion. If you want to keep it, to use on a different data set that needs the same transform, you can keep it around somewhere safe. It's just the mapping between the two. On the colors thing, yeah, you could try and use more sensible colors so you don't get stupid Dulux paint stuff, but then you're restricting your input, so when someone in the real world does say something weird like

tangerine, it's, you know, going to run out of colors.

Yeah, if you save it.

So, I mean, it doesn't have to be; it's done in Python, so you can do that where you want.

[Applause]

Right, thank you very much.