
Robot Codebreakers and Hunting Nazis: Machine Learning and InfoSec

BSides Scotland · 2018 · 33:59 · 273 views · Published 2018-05
Category: Technical
Style: Talk
About this talk
An exploration of machine learning fundamentals and practical security applications, from intrusion detection and fuzzing to automated identification of extremist content online. The talk demystifies neural networks and genetic algorithms, discusses adversarial attacks on ML models, and showcases a real-world project using ML to detect and track white supremacist propaganda on social media.
Original YouTube description
Talk delivered at BSides Glasgow 2018 on the 27th of April. Abstract - "Machine learning promises to revolutionise everything, including infosec – but can it deliver? This talk will cover what machine learning is, how it works, what it can do, and the kinds of things it has been used for so far. We’ll talk about intrusion detection, fuzzing, and hunting Nazis on the internet, as well as all the things you can do to mistreat robots and trick AIs. It’ll cover what frameworks there are and how to use them, as well as some broader concepts, giving you everything you need to go out and build your own helpful (or harmful) robots."
Transcript [en]

Excellent [inaudible]. Um, so today I'm gonna be... it's difficult. So when I submitted this talk I gave it the title "Machine Learning in InfoSec", and then realized that machine learning in InfoSec is actually like a 12-week university course. So what I'm gonna do today is talk a little bit about some of the things me and friends have done to apply machine learning in a security context to do some half-interesting things, one interesting project and one very boring one, as well as give a flavor of what machine learning is and try and demystify it, because if I can do it, you can do it. So, about me: I'm a security consultant with NCC, along with

probably about a third of this room, I guess. I was working on my PhD applying some AI techniques to security problems before I dropped out to be a pen tester, and prior to that I was a software engineer and I did some blue team work for the university I was working at. So I've kind of seen InfoSec from a bunch of different sides, despite the fact that I haven't actually been here for all that long. So what we're gonna do is talk a little bit about what machine learning is, what it can do, what it can't do, how it works, and how you can do the thing, because it's super interesting and it's

um, it's useful for the kind of projects, and to tackle the kind of problems, where traditional approaches don't really work. So it's not magic. It doesn't fix everything, and if you're going to use machine learning in a project, if you're gonna try and use it to solve a problem, your solution kind of needs to be about it. You can't just sprinkle some machine learning onto your solution. You can't say, okay, we'll have like fifty thousand man-hours working on this one giant system and then we'll bolt on a neural network at the end. It's very much the engine that drives the solution, and

it generally breaks down into a bunch of different parts. You have data, lots of data; the more data you have, the better. You have some kind of model, which is what's going to be interpreting the data that you've collected. You need to have some way of changing the model based on the data that you've got. And then once you have this trained model, that's when you can apply it to new data that you gather as stuff comes in. The devil is in the details in terms of what your model looks like, how your model changes, and especially the data, in terms of labeling stuff, and I'll get into that.
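The four parts listed above (data, a model, a way of changing the model, and applying it to new data) can be sketched in a few lines of plain Python. This is only a toy illustration, not anything shown in the talk: the feature names, the labels and the simple perceptron-style update rule are all assumptions chosen to keep the example small.

```python
# Toy illustration of the four parts: data, model, update rule, application.
# All feature names and numbers are made up for illustration only.

# 1. Data: (features, label) pairs.
#    Features: [failed_login_rate, outbound_traffic_rate], both scaled to 0..1.
#    Label: 1 = suspicious, 0 = benign (hypothetical labels).
training_data = [
    ([0.0, 0.1], 0),
    ([0.1, 0.2], 0),
    ([0.2, 0.1], 0),
    ([0.9, 0.8], 1),
    ([0.8, 0.9], 1),
    ([1.0, 0.7], 1),
]

# 2. Model: one weight per feature plus a bias.
weights = [0.0, 0.0]
bias = 0.0


def predict(features):
    """Return 1 if the weighted sum of the features is positive, else 0."""
    total = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if total > 0 else 0


# 3. Update rule: nudge the model whenever it gets a training example wrong.
learning_rate = 0.1
for _ in range(20):  # several passes over the training data
    for features, label in training_data:
        error = label - predict(features)  # -1, 0 or +1
        for i, x in enumerate(features):
            weights[i] += learning_rate * error * x
        bias += learning_rate * error

# 4. Application: use the trained model on data it has not seen before.
new_observations = [[0.05, 0.15], [0.95, 0.85]]
predictions = [predict(obs) for obs in new_observations]
```

The labeling step is hidden in the `training_data` list here; in a real project, producing those labels is usually where most of the effort goes.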

There are a bunch of different flavors of machine learning. It's a little bit like when people say, oh, I work in AI, which is what they used to say before machine learning became cool, and AI gets treated as this giant monolithic thing when really it's about two dozen different, vaguely related approaches. In terms of the stuff that I'm going to be talking about: I will talk about neural networks, because that's what's currently most fashionable, and also talk about genetic algorithms, which is, um, a weird one. I personally love the biology analogies that people have for machine learning, because it's a little bit like saying, oh, I can be a vet working on canines

because I sold hot dogs for a while. Yeah, it's that kind of illusion. Anyway, it's rampant. You can also get rule-based systems. So the neural networks and the genetic algorithms, they're very much a case of: we're going to put together some stuff, we're going to throw some data at it, it will change in some way that we can't predict, and it'll be fine, maybe, probably. The rule-based systems and other things like that, they're a lot less, I guess the technical word is stochastic; they're a lot less random, there's a lot less swishing. But you generally get those for things like

expert systems for diagnosing conditions, or building navigation systems for autonomous robots, that kind of stuff. I unfortunately know absolutely nothing about that kind of thing, so I'll stick to what I do know. In terms of what version of machine learning is best for you, that depends entirely on your problem. I am a security person; I am not your security person. I highly recommend the book Machine Learning and Security, and it kind of highlights why my talk can't cover everything I'd like it to, because I have 50 minutes and I can't just read it out. That book is highly recommended; it's [inaudible], which is helpful. So, genetic algorithms. I'm

covering this first because it's less interesting than the neural networks, and that way, you know, hopefully some of you had time to get coffee, you're currently riding the caffeine wave, now is a good time to get through this. The basic idea is: we have a problem that we'd like to solve, and we can generate random solutions to that problem, and we know how good a solution is, like we can measure it in some way, but we can't take that knowledge of what a good solution looks like and use it to build one. We can just be like: we need a car. I

can't tell you exactly what kind of car we need, but I can tell you that this one is wrong, and so is that one, this one's okay. It's that kind of situation that this kind of technique is useful for. So what you do is you generate a whole bunch of random programs [inaudible], you see how good they are, and you select maybe a proportion of those to go through to the next round. Then you fill up the population with a combination of, like, the babies if you combine two or more of the other solutions together; you can take winners from the previous round and just change them

slightly; and you can also fill in the gaps with more randomly generated solutions to try and add variety to the gene pool, so to speak. If there are actually any biologists in the room, I sincerely apologize on behalf of computer science. So you're generating random solutions, you're keeping track of the best ones; when do you stop? Normally when your AWS account runs out, or you have a threshold of how good your solution is and you find one that matches it, or you just give up after a certain number of rounds and say, okay, well, this is as good as we get. Robot codebreakers. So I mentioned my PhD work, and I was

trying to apply genetic algorithms to solve crypto problems, the idea being that cryptosystems are designed in such a way that even if you guess close to the right key, it's still wrong, and if you try and decrypt stuff with a slightly incorrect key you're gonna get very wrong output, which is terrible for things like genetic algorithms and other search-based techniques, because what you want is a gradient: you want to know when you're getting close to the right solution. So I was like, okay, fine, I won't try and do that. And I will disturb my microphone, sorry. So what I decided to do was, rather than trying to look for keys, I would try and

look for attacks. So who here has looked at linear cryptanalysis? Nobody, cool. So how that works is: you have your plaintext, the message you're trying to encrypt, and you have your ciphertext, which comes out at the end of the last round. What you try and do is, rather than guessing every individual round key, you guess the last round key, or a bit of the last round key, and you push the ciphertext back through the cipher until you get the input to the final round. Then there's a relationship between the plaintext and the input to that round that you can exploit, because that way you can guess part of your key, see whether your guess

is correct, and if your guess is correct then you move on to the next bit. So it's the difference between trying to guess every bit correctly all at once and guessing each bit correctly in sequence; the latter is a lot simpler than the former. So what I wanted to do was to try and build a system that, rather than trying to guess a key, tried to generate attacks. I could tell how good an attack was, but I couldn't say, I want you to generate something like this, I want you to generate a linear approximation between the plaintext and ciphertext. Instead,

what I did was I gave it the tools to build that. So the language that it used to build the programs was built of a bunch of different components: P for plaintext, I for the input to the final round, C for ciphertext, and K for a round key. I had some operations like XOR, AND and OR, general logical operations, and then I would get it to generate statements that involved those kinds of operations and those kinds of inputs, and just see what happened. Profit, kind of. So, I don't know how easy that is to read from here, but what this is saying is that if you

take the number one twelve thousand seven hundred and sixteen and you get the bit at position plaintext modulo sixteen, that is equal to the zeroth bit of the input to the final round, I think. It's true enough of the time that you could use that to extract information from this cipher, which is cool, it's a useful thing. Not sure why it works, but it kind of did. This was another kind of answer I got; yeah, I don't know what that simplifies to. And this is a problem with machine learning, because you ask the computer for an answer and it gives you one, and it looks like this, and you've got no idea what

this means. You've got no idea how you can use this as a design factor for improving stuff; it just is a thing. Genetic algorithms are really cool for certain circumstances, but deriving explanatory power about problems is not one of them. Neural networks: this is kind of where you decide, okay, we don't care about explanatory power, we want a tool that's useful. Neural networks work a little bit like this: you have your inputs, you have your hidden layer, and then you have your output. You split your data up and feed it to the inputs. They're called neurons, again, sorry biologists, they're called neurons;

you feed your input data to those input neurons, they fire probabilistically based on the input, and those signals get sent to hidden layers. There might be more than one, depending on the kind of neural network you're making. Then those hidden neurons maybe fire, and eventually you end up with output neurons, which again maybe fire, and that allows you to gather some kind of information about your inputs, which might be "this is a picture of a dog" or "this packet is malicious". It's that kind of classification problem that you tend to use things like neural networks for. So, neurons: neural network neurons

are not the same as brain neurons. They are either on or off, they fire, maybe, and there's a bunch of different strategies for figuring out whether or not they should fire. Some use thresholds, like: I am listening to three other neurons, and if at least two of them fire at me I will fire onwards. Or: if the combined signal from the neurons that I receive is above a certain level, then I'll roll a dice and try and match that probability, and then I'll fire. Or it might be: I'll fire if inputs one, three and

seven all fire. There's a bunch of different rules, and exactly what rules you use is more important for people doing academic research than for people like us, especially when it comes to explaining to clients. Like, I might have a job coming up in the next week or two to help build a learning tool for a very, very, very big company; I'm gonna find it a lot easier to explain this kind of diagram than I'm going to find it to explain, oh well, we have like fifty thousand neurons with a threshold for firing based on whatever. In terms of the learning: so you have neurons

which might fire or might not, based on what they see, but that's fixed, that's static. So how does the actual learning happen? It's a technique called backpropagation, which is where, during the training process, you feed your data into the network, the signal propagates, and you see whether or not it gets classified as the right thing, or, you know, you see what the output signal is like. You figure out what the error is between what you were expecting and what you've got, and you use that information to correct the probabilities of firing for all the rest of the neurons. Exactly how you adjust that depends. Ideally, if you want to get a really

accurate model that classifies things really well, you want to make small changes to the model based on your training data; you don't want any one training element to have a disproportionate effect. The problem with that is it takes forever and you need masses of data in order to properly tune your model. You can cheat and make big changes, and it becomes very fast to train, but your model isn't necessarily going to be very accurate, which is great if what you're trying to learn is really distinctive, really straightforward, but less so if you're trying to learn something that's a little bit more subtle, that requires a little bit more thinking. Um, I did mention that I

was gonna, like, in my submission, that I was going to talk about attacking these. It seems mysterious, but it's not. One of the very first things I learned in IT class at GCSE was the phrase "garbage in, garbage out". If you can influence the training pipeline for a model, you can feed it garbage. You can say, oh, this picture of a tractor, it's actually a cat, I promise it's a cat; or, this packet is totally benign, the fact that it's like PsExec for malware is highly irrelevant, definitely fine. And that way you can get the model to learn things that are false, which ruins its ability to be a

useful tool for defenders. And if you have access to the model itself but you can't influence the training, you can use it as essentially an oracle: you can make very small changes that don't affect the nature of your input, the thing that you're trying to get past unnoticed, wait for it to pass the model, and then send that instead. This is a little bit relevant for the next thing that I'm talking about. So there are applications like detecting malicious traffic, identifying malware, and also hunting Nazis, which is what me and a couple of friends have been doing in our spare time. So part of this is going to be

showing you the kind of things we've found. Nothing is graphic, nothing is like racialized, but they're Nazis and Nazis suck, so apologies in advance for some very poorly constructed memes. So, the A-Team, which is what we band of people have been calling ourselves. I am the least interesting person on this team. We have Emily, who is former NSA, former CIA, US Army, works as a threat hunter for some private company and is based in the US. We have Ashanti, who is a Rust developer and rapper, and they're really good at it, like really good. How can you be good at so many different things? It's really not fair. And then you have me. Um, anyway, um,

you can follow us on Twitter if you are interested. We've been working on a tool called Nemesis, which is a tool made to automatically find and classify white supremacist propaganda. It's aimed at Twitter at the moment. So we have a section called Catch Me If You Can, written by our Rust developer, which looks at Twitter for certain keywords and finds images that come up in tweets, and keeps them in a database along with the person who sent it and the time and various other bits of metadata. We have the Nemesis model itself, which Emily trained, and bless her, she spent hours and hours and

hours looking at the most awful garbage you can find and highlighting bits, like: that's a swastika, that's a Black Sun, that's an Odal rune. That has got to be one of the worst jobs I have ever heard of. Anyway, so she built the actual model, which is trained to recognize Nazi propaganda. I did a bunch of plumbing. So we are now in a position where Catch Me If You Can runs and gathers images, and the model, written with TensorFlow, which is Google's platform for machine learning, it's like a library, will classify the images. The information about bad images is then cross-referenced, and the

people who are spreading really bad stuff get identified. Okay, so I'm gonna show you examples of the stuff we found. Again, content warning: white supremacist propaganda. I am not a white supremacist, I don't condone being a bigot, and here goes. So here's a pretty easy thing that we found. It's pretty dark, both metaphorically and literally. It's a poster saying that Nazi youth are operating in this area, with a fairly easy to identify swastika. Straightforward. We have this: who's seen Roseanne, the TV show? Yeah, so Roseanne is Jewish, among other things, and she did this photo shoot as part of a satirical magazine, if I remember

correctly. And this is another important point about machine learning: machine learning has no clue about cultural context or any other kind of context. You get it to train on a particular thing and it will do that, and it will only do that. There's a story, and I don't know whether it's real or not, about the US Army using neural networks to identify camouflaged Russian bases in, like, spy satellite photos. They did it, and it worked really well in training, but when they used actual data it failed. It turns out that all the photos of the Russian bases they trained it with were taken from one satellite, and all the

photos of benign areas were taken with another satellite, and what the model that they trained was really good at was spotting the flaws in the lenses of the satellite that generated the Russian base stuff. So when they had other satellites, it didn't work anymore. You can never be quite sure what exactly about your training data the thing's picking up on, which is why diversity is extremely important, both the kind that this project is about and also the kind that is a bit more technical. So here's an advanced one. That doesn't look like a swastika, except that it is. So the fact that the model was able to pick out that shape

independent of things like being black on a white background, or being surrounded by redness: it did actually identify the shape, and that's super exciting from a machine learning point of view. From a let's-battle-the-Nazis point of view, I'm just glad that it's effective. Here's another advanced one. That symbol at the back is the Black Sun symbol. An unfortunate side effect of doing this project is that I'm now au fait with white supremacist symbols, which I really wish I wasn't. That symbol is normally black on a white background, but the model was able to pick up the arrangement of the shape with something in

front of it. By the way, if you are on Twitter, avoid the hashtag art and our WDS, because it's full of these people, unless you want to exercise your block button and also exercise your report button. One of the things which we're wanting to do with this tool is potentially add some features like automatically reporting people. I'm not sure Twitter has an API for that, but we're going to make one. Here's another example. So this is kind of what it normally looks like, except it's been slanted and covered with fire because, um, I don't know, their masculinity is under threat. We were able to pick that out still, which is quite cool. And we have

this clown, definitely an alpha male, with a very, very small KKK icon on their, um, bathrobe. Anyway, it's small, but we noticed it, which is great. Um, yeah, so we accidentally classified this. Luckily this is still in development, so we haven't done anything automatically with the stuff yet, but machine learning is not perfect. So what we're kind of hoping with this project is to do a bunch of things. Me, Emily and Ashanti, between us we are like every letter in LGBT, and we're multicultural; we would be first against the wall if the Nazis were to take over. And we have no budget, we have no time, this is not sponsored by anyone,

we're just doing this. You know, we're relatively successful in our own fields, but this is hacked together in stolen hours, and we're able to automatically find stuff like this, and this, and keep track of who posted it and who their friends are and what time they posted it and whether they have location data on. So imagine what Twitter could do. I am not saying that Twitter supports white supremacy or tacitly allows it to happen; I'm just saying if they wanted to they could fix it, and they haven't. Anyway, you know, none of us are machine learning experts; we just put together stuff with tools and read tutorials and

stuff like that. So if we can do that, you can do that. If you had four people working on the project rather than three, you could probably do a better job than we did. In terms of Nemesis itself, what we need is more training data, because it's just been Emily trying to put this stuff together in her spare time, and some of the classification is not perfect, hence, like, saying that Police Scotland is, anyway. It's currently supported by Python scripts and cron. With my software engineering background, I used to work on middleware for investment banks, what I really want is a message-based system with lots of different

independent parts, and much more failure resistant. So that's the next plan. In terms of the actual data we're collecting, we're trying to do more to map out hate networks, and we know some people who do more political stuff who would be very interested in things like that. And maybe by demonstrating just how easy it is, we might actually convince media groups to do something about the goddamn Nazis. Anyway, here's hoping. So if you want to do the thing: we've been using TensorFlow, the Google framework, and there's another library that sits on top of TensorFlow called Keras. Both TensorFlow itself and Keras are super straightforward to use, and the tutorials, they're

really good. There's a little bit of difficulty in terms of getting your input data to be consumable by the model, but they have example input data files and it's fairly easy to reverse engineer that. Yeah, so here's an example bit of code, because this is a technical talk, I promise. So this is some Keras code, and that little snippet there, model equals Sequential, that's defining a neural network. It's saying, okay, we have a densely connected set of nodes, it's expecting an input of, like, a list of roughly this size, and the neurons activate according to a ReLU model. I recommend reading up on that; it's not

actually important, but it helps you kind of understand what's going on. It's not important for this talk, so I'm not going to get into it. So you have one input layer, we have an activation rule, we have another layer that's also densely connected, so every node is connected to every other, and it activates with the softmax strategy. And that's it, that's a neural network in one line. Okay, it's a few lines for easy readability, but it's super straightforward, and I highly recommend it to you. You don't need to be an expert in machine learning to pick up these tools and apply them to interesting problems, some of which are,

you know, super pointless and boring, like my PhD work, and others have actual tangible results in the real world, like this thing we hacked together in our spare time. I highly recommend the TensorFlow tutorials. Again, I'm paid by NCC, not Google, so this is entirely because I like it and I think that you should too. Caveat again about figuring out the input data, but it is reverse-engineerable, it's not too bad. Um, as like a general gist of this whole talk: machine learning is huge, and so I'm kind of sorry I didn't talk more about it, but I understand that we're overrunning a little bit. It's really

useful when you know the kinds of things you want but you don't know how to get them, so you can tell a computer "this is good, good computer". Not so great when you don't know what you want. And if you can come up with a classically programmed solution rather than use machine learning, that is probably a better choice, because then you have a little bit more explanation; it gives you a little bit more understanding about the problem you're actually trying to solve. Also, you can do the thing. I believe in you, all of you. I can do it,

which means any of you can. Any questions?
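The network described on the code slide (a densely connected layer with ReLU activation, followed by a densely connected layer with softmax) can be sketched without any framework to show what such a model computes on one input. This is a minimal sketch under stated assumptions, not the code from the slide: the layer sizes (4 inputs, 8 hidden nodes, 3 output classes), the helper `random_layer` and the random untrained weights are all hypothetical. In Keras the same shape would roughly correspond to a `Sequential` model holding two `Dense` layers.

```python
import math
import random


def dense(inputs, weights, biases):
    """Densely connected layer: every input feeds every output node."""
    return [
        sum(w * x for w, x in zip(node_weights, inputs)) + b
        for node_weights, b in zip(weights, biases)
    ]


def relu(values):
    """ReLU activation: negative signals are cut to zero."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Softmax activation: turn raw scores into probabilities summing to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(n_inputs, n_outputs, rng):
    """Hypothetical helper: small random weights, zero biases (untrained)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
               for _ in range(n_outputs)]
    return weights, [0.0] * n_outputs


rng = random.Random(0)
hidden_w, hidden_b = random_layer(4, 8, rng)  # 4 inputs -> 8 hidden nodes
output_w, output_b = random_layer(8, 3, rng)  # 8 hidden nodes -> 3 classes


def forward(inputs):
    """Push one input list through both layers and return class probabilities."""
    hidden = relu(dense(inputs, hidden_w, hidden_b))
    return softmax(dense(hidden, output_w, output_b))


class_probabilities = forward([0.2, 0.7, 0.1, 0.9])
```

Training, which the talk covers under backpropagation, would adjust `hidden_w`, `hidden_b`, `output_w` and `output_b`; the sketch only shows the forward pass.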

So the question was about adversarial images in terms of Nemesis, like have we seen people giving us malicious input. So far, no, but that's because our training is done offline, and the input images are not selected automatically. The way it works is you take an image like this, say, you draw a little box around the KKK icon and say "this is a KKK icon", and then the model gradually figures out what that looks like. By not giving attackers exposure to our training data flow, we're hopefully mitigating a bunch of these attacks. Also, I don't know if you've met any Nazis; they're not

actually very smart. I know that weev, who's heard of weev? Yeah, a few people. Don't Google him, he likes it when you do that. He was an administrator for a neo-Nazi website, and maybe still is; he left the US to go and live in Ukraine, I think, and he is apparently working on an AI tool too. And intelligence is a little bit thin on the ground as far as this is concerned. I believe he's trying to build a tool to identify people of color who are based in the US and get Immigration and Customs Enforcement to go and deport them. So we're hunting memes, they're hunting people, and that's probably the

best description of, like, Nazis versus everyone else, I think. Um, any other questions? Thank you. [Applause]
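As a footnote to the genetic algorithm part of the talk, the loop described there (generate random candidates, score them, let the winners through, then refill the population with combined winners, slightly changed winners and fresh random candidates, stopping at a threshold or after a fixed number of rounds) can be sketched as below. The problem is a made-up toy, matching a hidden bit pattern, chosen because its score gives the kind of gradient the talk says key search lacks. `TARGET`, the population split and every function name are assumptions for illustration, not the speaker's PhD code.

```python
import random

TARGET = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]  # made-up goal


def fitness(candidate):
    """Score a candidate: how many positions agree with the target."""
    return sum(1 for c, t in zip(candidate, TARGET) if c == t)


def random_candidate(rng):
    return [rng.randint(0, 1) for _ in TARGET]


def crossover(a, b, rng):
    """Combine two parents: head of one, tail of the other."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(candidate, rng):
    """Copy a winner and flip one randomly chosen bit."""
    child = list(candidate)
    i = rng.randrange(len(child))
    child[i] = 1 - child[i]
    return child


def evolve(pop_size=30, max_rounds=200, seed=1):
    rng = random.Random(seed)
    population = [random_candidate(rng) for _ in range(pop_size)]
    best_per_round = []
    for _ in range(max_rounds):  # give up after a fixed number of rounds
        population.sort(key=fitness, reverse=True)
        best_per_round.append(fitness(population[0]))
        if best_per_round[-1] == len(TARGET):  # threshold reached: stop
            break
        survivors = population[: pop_size // 3]  # winners go through unchanged
        children = [crossover(rng.choice(survivors), rng.choice(survivors), rng)
                    for _ in range(pop_size // 3)]
        mutants = [mutate(rng.choice(survivors), rng)
                   for _ in range(pop_size // 6)]
        n_fresh = pop_size - len(survivors) - len(children) - len(mutants)
        fresh = [random_candidate(rng) for _ in range(n_fresh)]  # new variety
        population = survivors + children + mutants + fresh
    return population[0], best_per_round


best, history = evolve()
```

Because the winners are carried over unchanged, the best score in `history` never goes down from one round to the next. With a cipher key as the target there is no such partial credit, which is the reason the talk gives for searching over attacks instead of keys.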