← All talks

2015 - Edward Bowles Artificial intelligence and security

BSides Manchester37:13113 viewsPublished 2015-10Watch on YouTube ↗
About this talk
Within the field artificial intelligence, there are many tools which we can use in the security world. On the offensive side, you have fuzzing and guessing credentials; defensively, you can have a smarter IDS. The tools are not difficult to understand or learn, either; this talk will introduce two main ideas, metaheuristic search and neural nets, show you how you can code them up, and highlight some examples of how they've been used in the real world to break and fix things.
Show transcript [en]

next up we've got Edward who's going to be talking about artificial intelligence in spirit Ian the way you break things using an AI n is a university student at York so yeah my name's Edward and poles I am currently in the middle of doing my PhD in New York but I stopped for about a six-month period to work full-time so fantastic for sec one name ET and going to are going to be moving to part time what my PhD is as far as using some of the techniques which i'm going to be talking about to find interesting and cool and ultimately totally useless attacks against program however the techniques themselves are quite useful and they're the ones which you don't

really tend to didn't come across in the industry quite so much one slide summary so that some of you can still make the patient old and some problems in security are really hard there are many of them finding folks to me to a finding in somewhere really difficult to put decent ids rules really really difficult some of the techniques which I'm going to be talking about can help you the techniques aren't really not hard they there's some there are some long words and some of the explanations that you're gonna find a wine already using ultimately and once you kind of dig deep also there are lots of tools you can use admittedly most of them were written by academics however

they have been used a lot I'm not saying contest against itself I suppose and why why should you be here are still artificial intelligence it's a very big deal it's unfortunately not it's rather a collection of various interesting techniques some of which are clickable security some of which are not I'm gonna be talking about ones apart ultimately think about artificial intelligence is it real it really boils down to a problem of search and you you have some kind of problem you you want a tool to do and think and you you can kind of recognize when your tool is doing the right thing okay yourself get a tool to generate one for you some people have done exactly

that that's a really interesting stuff I'll talk about those to give people some inspiration and also all and I won't quite give you pseudocode but I'll give you enough information but among technical people understand other technical people team building actually influenced chebula is everything more balanced artificial intelligence in general for those who don't know and AI is essentially the science of it'll do there are really difficult problems we don't know how to solve we don't know how we might be able to solve small variants of those problems but the kind of instances of those problems that you get in real life are just too complex some some of these problems are and they're in the complexity space of NP

which means they can't be solved in biomatter monistic Turing machine in polynomials and polynomial time what that means is you can't solve them for big and the big instances not possible whatever you see NP just replace it with those two words make sense for example and one is called the Traveling Salesman who's heard of the Traveling Salesman problem yeah who knows how it works as in who who doesn't need them to that explanation okay well for those of you that do imagine you've got I mean you've got a map but bunch of the decent bunch of roads connecting them and you want to start off from the city or any go to every other city once and come back to

where they started in the most efficient way possible if you have we can generally solve them for about a hundred thousand cities but we're talking massive amount of computing power required to find ideal solutions however we don't necessarily need the best solution especially if it's gonna take you decades to come up with you can use some of these search techniques to find what our goal here is the exclusions ie they roughly approximate the best solution they they're pretty efficient maybe only like two or three percents of difference between the best solution and you can use the pen search techniques I'm going to talk about to find some them you depending on how long you look

you might end up with one which is just really bad similarly you might also get some good ones ultimately you're trading off time spent looking for a solution I mean a memory compute power and all that stuff for the quality of your solution ok so that's why a reasonable solution for problem how do we find heuristics the answer is with of course medicine heuristics there they're also different types and ultimately what they belong to is these are ways of searching a really big search space imagine Ivica all the possible solutions to a really difficult problem like a book finding algorithm or fuzzing or IDs rule to stop people from sending viruses across the wire or something like that if you can

if you can just represent them as a point on the line and you can vaguely measure how good of a solution that is you can end up with a with a really awful ground sorry I suck at paint in real life obviously unless you've got like how many is that what seven different possible solutions it's gonna be a lot more volatile right anyway imagine being carving weirdly a weirdly nice-looking search space like this you can see you know some solutions suck one solution is the best and there are a few others which are vaguely good right so what you can do is you can use technique called hill climbing you pick a random solution on this graph say pick this one

here what you do is you take a look at your neighbors okay what is the solution over here like what is the solution is this one better than this one I'm also better than where I am currently it's so we start moving up this hill and eventually you keep running that algorithm until you end up here it's like well the neighbors next to me both suck harder than out of you therefore return this is the solution of Windows there you know and and you have a reasonable solution won't be the best but yeah it's fine however what if you're stuck you know here Brady I've got this solution wonderful I have to drink what you can do is use a variation

of hill climbing all simulated annealing which is a little bit like that except you have have another variable called temperature and what that does is the higher temperature so inversely proportional to see you temperature you might jump from randomly just like well also check out that guy at the far side or over here or whatever and then eventually what you end up doing is you might start off here and then rapidly jump there and they may be here and then back there and then gradually along here you run the simulation for the likelihood of you jumping somewhere else random drops and so it ends up evolving climbing but you're relying on the odds of actually ending up

or Peaks being quite low that unfortunately is not always the case and besides we have multi-core right and phones have multiple persisters these days so why just look at one particular solution why not look at one over here and one here and one over there and see how good they are so just generate a thousand maybe a million billion see how good they are and then make them fight to the death that was actually a really difficult to find I have this image actually mean is get the really big ones and then gradually fill out like your of your next your next generation so generate a thousand ones paint the best ones fill out the next thousand with and with

twitch versions and bread versions and the baby generates a new rapid ones to repeat it's a really simple algorithm come up with a bunch of random solutions see how good they are and pick the best ones combine the best ones you take the best ones fill out with surrounded ones repeat until you've until you've got bored run out of time or you or you found a solution which is sufficiently good that you don't really care about getting a better one and there's a whole bunch of things that you can do with like come up with uses crypto vulnerabilities but in terms of the real world it's actually really good for and for fuzzing and by fussing what I

actually mean is this is an American fuzzy look which is a type of rabbit it's also a really good tool and it uses this testing technique which is through around the garbage of API until the application crashes I mean exercise the applications internal state you you want to come up with some kind of test or input or some set of environmental conditions or whatever that causes the program to enter some unusual or never-before-seen state and you can measure that and you can generate test cases which means you can generate test cases that give you a higher degree of new status which means you can optimize for it which means you can use genetic algorithms and that's cool because if

you come up with application crashes that are novel that's an O date possibly you might come up with what's really cool exploit American car swap uses genetic algorithms and the whole bunch of other techniques to to develop these kind of crashing custom aces trophies in the American fuzzy walk you know programs they mount on the wall in terms of we found the critical vulnerabilities in these software includes things like open SSL might be new PG good new TLS you know major major programs that are supposed to be developed with some kind of quality so I heard anyway if you want to do this yourself and there are whole bunch of libraries I use this or

it's pretty heavyweight but it's really robust and has a whole of options the tutorials are really good and if you're doing academic related stuff this is a good one to use because everyone uses it and so when used to make your paper and you get a review one of the things I'm not gonna complain is you use the library heard of therefore it's rubbish use this other one instead which which is fine it's um it's on the fastest and it's Java based which may or may not be a deal-breaker for some of you it's pretty reasonable or you could use place them I really like pison and I used inspired which and which is quite cool I

don't really know anything about pi ball or page a PI other than people seem to like it because it's Python it's really simple to get up and running it's pretty fun or rather using someone else's implementation because the whole you know generate mutate thing all these tools will deal with all that for you you just have to say I'm looking for this kind of thing this is how you see how good it is and I want you to generate a thousand of them and run through twenty generations see what you find so you don't necessarily need to understand very much about how genetic algorithms work and trust me there's a lot of theory known which is interesting

but the tools are really easy to use so yeah you have some kind of problem you know what the solution might look like you can tell the computer how to generate it go and then you wait and you wait and eventually it's which is fine but what if you don't really know what the solution is supposed to look like you know you and maybe when it comes to and keep coming back to this idea of ideas rules because I'm who went to steal come earlier this game okay who sold a talk by David day and I like Sheffield about generating ideas rules to love you excellent which means all this will be new so anyway

well one thing you can do is there techniques where you and you produce a program and you teach it how to recognize the kind of stuff you want to recognize you want some kind of classifier this is good this is bad here is a mystery set is it good or bad you know that's a really good for that and so as this diagram very clearly states you've got and you've got three different layers maybe one set of input neurons one or more hidden layers and then some public neurons and so let's say you wanted a classifier that could give you it just gives you a feedback one or zero good or bad maybe you'd only

have one of these on Fitness because the way it works is you you take your input your image or your peacock or whatever you feed it's here in the notes like one bit at one bit at a time those input nodes far off a water zero of their own based on some probability which you said I say some probability because what kind of probabilities is research question in a minute those and see those arrows coming off the input that's when they fire they'll transmit their signal down each one of those of those arrows to what's called the hidden layer and those hidden layers will take those ones are juniors based on their input and again far off a 1 or 0 of

their own based on another probability that's in a dimension and then finally after potentially many hidden layers you get output notes who will get some collection of signals and we'll go 1 or 0 or 0 the point and output based on based on their input just before we get started there's a pattern in the whole field of AI and evolutionary computing and all this kind of stuff genetic algorithms is to actual Jetix as hot dogs are to an actual dog who's over warm they share terminology but the way they're actually likely very very different neural nets are a little bit different in that you know a neuron that which might not look very different this

is to our actual brains is much like my bookcase in the British Library except at the British Library ^ something they and this is very much about being inspired by nature rather than trying to just imitate it so you want to build something that replicates a brain because you're dissatisfied with your own the thinking about the ordinance is while they're really good for classifying stuff it is kind of difficult to know but it's actually classifying on just as a great deed for my mentor this talk was was excellent right here he gave me lips I'm really good feedback I had a line in here I'm called the Russian based problem I said it's really important I

actually explaining but that is a problem just having you care so I'm going to explain it now first today's problem problem the US military the English National Reconnaissance Office they wanted to to identify camouflage ocean bases from satellite imagery really important so and they builded me on that and I had a set of images that they knew contained hidden Russian bases and they have another set of images that didn't and they fed it through to the to the classifier and so I have both sets it's like okay it's doing pretty well here is and here's notice that where we know what the answer is but the deal network doesn't how does it behave with this and with this third set of data

okay fine and then they tried to run it and it was rubbish absolutely awful and couldn't hit in Russian faces and it turned out what the neo network was actually doing was looking past the images and spalting which satellite took the photo they had the stop lights over Russia where the universe spaces were and that has a lens wall which made the images look a certain way and ones which you didn't have basis or over the US and I have a different satellite with a different lens pool and so it's important to make sure that you actually having clear ideas to what your classifier is doing and to people who didn't know what the classifier was

doing it was and David day and action field from Sheffield talent and their problem was we how can we get support to tell the difference between shellcode coming over the wire from presumably people doing this actual workouts and things that look like show code like JPEGs with your house any thing coming they came up with this my idea of just getting a computer to figure her so they generated an eel Network gave them you know the known bad peacocks and no good caps good you see how they even given the unknown set of packets they knew but it was a mix so I haven't you know did with that third set picture and then decided what they needed to change

anything here repeat in terms of actually coming up with numbered input neurons number of output neurons the waiting's probabilities of firing and the number of hidden layers they actually used they're coming mechanistic stuff they're talking about earlier algorithms to generate the neural networks to then do the summer stuff so these techniques work really well together and so you know you're happy to go back you have you known good you have mystery pockets magic those here and the nectar falls out how this actually works I'm not an expert in new networks but my understanding is we don't know they seem to work and you know we're all carrying around their heads and mostly but the in

terms of how they are how to actually construct them to solve is the big problem that we know about is very much an open research question but ultimately all of this kind of stuff pulls down on to two things can you tell if you've got the answer mostly right and can you compare two solutions and see which one is better even if they're both wrong if they are if you can do that my Tara sticks might be for you which is great terms of actually figuring out which one to use that's a little bit more difficult it requires and it requires one information about about what the problem we are trying to solve is if you have two very similar looking

solutions with very different Fitness characters things like they was really good it was really bad but they're almost identical that you probably don't want to use something like genetic algorithms which relies mostly on like small tweaks and combining into big ones because if you've got two similar solutions that they hate very differently why do you think that would give you something better so what you can't do is use something like genetic algorithms to try a bunch of different and metaheuristics to generate solutions and then see which solution what solution finding algorithm works best at finding solutions yeah difficult also another area a very very active research so you don't really need to worry about it and seriously I have

been working on this stuff for about two years now I have never found a single case where this seems like a good idea anyway so journal takeaway is hopefully this talk has convinced you that some of these techniques are not too difficult to understand all this with me all the technique this one sorry but if you do have any questions you can either ask me by email or order of these but

since obviously a lot of tech companies out there sponsors and they're all doing threat intelligence they're doing all sorts of women think it sounds like the feels you're looking out there is absolutely what they should be applying in there back then just think about threats in so this is the sideline participation you're pulling the patrons more over the show and you don't know what that data is bad things about dates and what it's about right actor you don't know not your openly classified later in the middle so she gonna make yet make it out for you anything exactly what you're talking about its industry actually taking that seriously usually well my understanding is that and David

Alex we're doing this the answer is almost certainly not really but at least a few people are and I'm not so important because ultimately a lot of generating a lot of this stuff and just requires knowing how to configure digital something that's really straightforward so really anyone

eight-year-old her sticks for antivirus this is exactly exploring I mean part of the problem is you can't guarantee how good your solution is actually going to be you can just say keep going until it's good so but it's at least worth a shot to research your pores yeah as far as threat intelligence is concerned that's interesting because and if you're trying to spawn delicious eight years you don't necessarily know what a malicious behavior looks like I knew all Network chances are they are going to be able to spot things which we as humans will because we just don't work it could be literally a case of malicious traffic tends to have the bits in these

positions XOR the bits from the immediately preceding packet in these other positions all legs work together equals there I don't know maybe that's true that's something which I'm a really stupid and naive classifier so this is actually my second security conference ever like industry secure difference there's been a couple and academic ones they're a lot less interesting all this a lot fewer t-shirts and beer but okay so this is the homepage of a CJ it looks like an academic webpage I'm sorry however this book essential submitter hair sticks it's actually really really good the guy who wrote it is he is not clinic but he's like he's a very good communicator and plus it's free if you want Naumann free

resources

and anyway anything by this kind John noser is going to be this book is about this thing and tells you everything means about genetic algorithms and want to study you know and so if you really want a resource for learning programming that looks good on the desk that's why however in terms of actually learning stuff I hold and pull hardly recommend going to the ECJ website and doing their tutorials and reading these

this is an interesting I have the kind of research I've been doing and this is brand-new and published research so it's not the kind of stuff I've been doing my PhD in polls and attacking attacking ciphers the way that classically you attack a block cipher which is just a sector that takes to fix input just stuff to it the way they eat classes they took those using techniques called linear or differential cryptanalysis is essentially you have some kind of relationship between the message and the final round of doing stuff to your message because what you then do is you guess part of the key party secret you kind of partially decrypt the encrypted message that you have and then you see

whether that relationship between the message

and if this relationship that you already know exists holes or doesn't hold with some probability that's different from just random chance that means you have the correct round because what I'm trying to do is use genetic algorithms to come up with types of relationship so it just has I have these functions which I can change like oh the message relationship one way when you guessed it came correctly I'm a different way when you guessed the key in currently is that way you can actually try and apply that relationship try and get some key and if and you should be able to tell whether your key guess is correct or not because if you can tell the your key guess is correct

before that one small part it means that it's having to guess all of the key ones you can guess one part so that's that's the kind of stuff I've been doing and the end result is this is kind of an example of why I mentioned her socks are weird I rediscovered a linear cryptanalysis I also haven't found that the late-night program curtain works is it seems to get fixated on a statement where this relationship always evaluates to true message this ciphertext if he is in this set and then always equates the falsification mass which gives you one bit of information I need your keys and so I'm currently currently kind of stuck because I can either try and change the

codes of the more general-purpose approximation this like it gives you or come up with the whole sequence of statements where it's like okay well this table tells me he's in that pumpkin and of that bucket this statement tells me something sound this ha and so on in turn the search tricky into and what's called a binary search which again is a lot faster it's like looking at key

yeah and in terms of actually coming up with actually useful security results underneath but other people have and hopefully some of you Facebook any other questions I understand the only thing between you and also me and coffee