
Good afternoon, everyone; I hope you all stay awake in the post-lunch slot. I'm Pete Burnham, I lead cyber security research at Cardiff University, and I'm giving this talk with Matilda Rhodes, a colleague and security researcher in the team. It's not an academic talk, don't worry; it's really about what we're doing in the area of cyber security analytics. I'll give you a bit of an overview of what that is and what we've been doing in the area of active cyber defence, Matilda will give a bit of a demo as well, and hopefully you'll get an idea of the importance of some of the work we're putting into practice.
A really quick bit of context on the cyber security landscape. The UK has a cyber security strategy, which is to keep the UK safe and secure, and it also has an industrial strategy, one pillar of which is artificial intelligence and the data economy. Part of the National Cyber Security Centre's remit is active cyber defence: seeking out attacks and doing something about them. Essentially, our research at the university is about pulling all of this together. We do a lot around AI, particularly machine learning, but we're also moving into more genuinely cognitively aware AI for detecting and mitigating cyber attacks.

A quick nutshell overview; this is a slide I normally use to introduce the themes of the research. I won't go into all of it, but it gives you an idea of what cyber security analytics is in our view. Starting at the end towards me, you've got risk assessment and modelling. Typically, for a risk assessment you'd find your assets, look at the threats to those assets and the vulnerabilities those threats could exploit, and put them all together, maybe with an impact number that might be historic or just a finger in the air, and you get your risk assessment. You can use that for compliance, maybe a bit of operations, but generally it's not particularly dynamic. So we've done some research that flips that on its head, thinking more about goals and goal-oriented approaches, which are perhaps more intuitive: you don't need to know everything that could go wrong, which is a big issue with risk assessment, and goals tend to have dependencies, so you can better capture the dependencies between your assets and the goals built on them. The bit in the middle is about communication, governance and decision making; that's where we tend to draw in experience from psychology and criminology: who talks to whom around attacks. And the bit on the end is what we're going to talk about today: AI and machine learning based attack detection, using different types of behaviour in different environments. The bit cutting through it all, mobile, IoT, cloud, desktop, operational technology, and going forward connected vehicles and even the security of AI itself, varies as technology emerges. The bit across the top is all about human factors: why people do things and how people do things. One key factor is that if you're using data to drive machine learning algorithms to detect attacks, you need an understanding of what you'd actually be looking for, bringing a little bit of human knowledge into the equation so you can interpret the output of those machine learning algorithms.

I won't go into the details of everything we're involved in, but in the bottom right-hand corner there you've got the Airbus Centre of Excellence in Cyber Security Analytics. The work we're going to talk about next is how we've transitioned some of our academic work into practical applications at Airbus, really driving forward innovation in security operations centres. One of the problems we're working on is moving beyond traditional AV. We all know that signature-based detection has its limitations, as does static analysis. You've got things like APTs, but even more general malware is becoming sophisticated: obfuscating code, flipping code around, code encryption, hiding API calls, in-memory execution, all the kinds of things that make it quite difficult to detect with a static or signature-based approach. That's what we're trying to address: what does a future SOC look like when you involve some sort of AI or machine learning in that environment? So we're moving towards data that's a little bit harder to obfuscate: system-level activity metrics. There was a paper that came out, following what it called the 'big four' APTs, which basically suggested we need to do more with the things an attacker can't hide. If you want to do something as an APT, you have to interact with the underlying machinery, and that leaves a footprint in the trace: CPU, RAM, data I/O, processes created and so on all have to happen on the machine in order for you to achieve your goal as an attacker. So we've been working on using that data as a behavioural footprint, so that we can use data-driven approaches to detect malware and distinguish between malicious and benign. If you think about a whole bunch of behaviours that are malicious, and a whole bunch that are benign and normal, we're trying to create DNA profiles of both, to (a) distinguish malicious from benign, but also (b) work out which known DNA a sample is most similar to.
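As a rough illustration of the kind of per-second trace this implies, a collector loop might look like the sketch below. The metric names are illustrative, not the exact set used in the work; on a real host the collector callable might wrap something like psutil.

```python
import time
from dataclasses import dataclass, asdict

@dataclass
class ActivitySnapshot:
    # Illustrative field names; the talk mentions CPU, RAM, data I/O
    # and process creation among roughly ten per-second metrics.
    cpu_percent: float
    ram_used_mb: float
    bytes_read: int
    bytes_written: int
    packets_sent: int
    packets_received: int
    process_count: int

def sample_trace(collect, seconds=5, interval=1.0):
    """Collect one snapshot per interval while a sample runs.

    `collect` is any callable returning an ActivitySnapshot, so the
    sampling loop stays independent of how the metrics are gathered.
    """
    trace = []
    for _ in range(seconds):
        trace.append(asdict(collect()))
        time.sleep(interval)
    return trace
```

Each element of the returned trace is one second's worth of behaviour, which is exactly the unit the maps described next consume.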
That opens up things like campaign profiling: better understanding what's actually going on in a particular piece of malware when you detect it. So, the starting point of our innovation in cyber security analytics: we took thousands of samples of malicious executable code, PDFs and PE32s, and ran them in a sandbox environment, fully aware that there are limitations on what you can capture in those environments, given it's not bare metal and there may be evasive features within the malware. We then took thousands of samples of clean, normal activity that you'd expect to see on a day-to-day basis, derived a whole bunch of behavioural features, and created two of what we call maps. The method is called the self-organising feature map, and what you can see here is two maps: one representing all the good behaviour, one representing all the bad behaviour. Within that sandbox environment we collected the system metrics, the trace left behind while the sample was running, every second for five minutes, and every second's data is placed onto one of these maps. The starting point is an empty map, totally blank: you put the first bit of data wherever you like, then you take the second sample and place it, and it ends up quite close to the first because it's fairly similar; then the third, the fourth, the fifth. Over time, with thousands, really millions, of data points, you end up with similar behaviour clustering together and different behaviour pushing apart, which is why it ends up looking like a dispersed set of nodes. The lighter the colour, the more frequently we see that behaviour; the darker the colour, the newer that behaviour is. What it leads to is this concept of fuzzy neighbourhoods, where you can see that something is similar to what we've seen before but a little bit different. So even if you're flipping the code around and doing things differently, if you're still achieving the same goal, we can still model it in this environment. If you put this into practice, what you'd see is a new bit of behaviour coming in every second; you work out which map it sits on better, the good behaviour or the bad behaviour, and once you've worked that out you plot it on that map, and it tells you which node it fits best. So you get good, bad, good, bad occurring, and you can see by eyeball initially that it's landing in a lot of the frequently seen bad-activity area. Over time it builds up a profile of whether the sample is good or bad. What it actually gives us is an (x, y) coordinate: we transform all the continuous behavioural data into essentially an (x, y) coordinate representing either a good or a bad behaviour, which you can then use in other algorithms to tell you whether the sample is malicious or benign. This has worked out pretty well: it's around 94% accurate when tested on a whole bunch of samples we hadn't used in training, across different file types, over time and in different environments. So this was the first innovation we got to with this cyber security analytics approach.
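For intuition, here is a minimal self-organising map in the spirit described above. The grid size, learning-rate and neighbourhood schedules are illustrative, not the ones used in the actual system; the point is just that similar inputs end up at nearby grid coordinates, and the best-matching unit gives you the (x, y) coordinate mentioned in the talk.

```python
import numpy as np

def best_matching_unit(w, x):
    """Grid coordinate of the node whose weight vector is closest to x."""
    d = ((w - x) ** 2).sum(axis=-1)
    return np.unravel_index(np.argmin(d), d.shape)

def train_som(data, grid=(10, 10), epochs=20, seed=0):
    """Train a minimal self-organising map; weights have shape (gx, gy, d)."""
    rng = np.random.default_rng(seed)
    gx, gy = grid
    w = rng.random((gx, gy, data.shape[1]))
    coords = np.stack(np.meshgrid(np.arange(gx), np.arange(gy),
                                  indexing="ij"), axis=-1)
    for epoch in range(epochs):
        lr = 0.5 * (1 - epoch / epochs)                      # decaying learning rate
        sigma = max(gx, gy) / 2 * (1 - epoch / epochs) + 0.5  # shrinking neighbourhood
        for x in data:
            bmu = np.array(best_matching_unit(w, x))
            d2 = ((coords - bmu) ** 2).sum(axis=-1)
            h = np.exp(-d2 / (2 * sigma ** 2))[..., None]    # neighbourhood kernel
            w += lr * h * (x - w)                            # pull nodes towards x
    return w
```

After training on a mix of behaviours, `best_matching_unit(w, sample)` returns the grid cell, i.e. the (x, y) coordinate, that sample lands on.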
Its limitation, of course, is that you have to wait for all of the behaviour to be processed before a result comes out of the end. If you imagine we run it for 30 seconds, we get 30 (x, y) coordinates, which gives us a vector we can use with a machine learning algorithm to tell us whether the behaviour is malicious. Thirty seconds in, though, if it's something like ransomware, that's not particularly great, because you've lost half your files by that point. So the next step was innovation in trying to detect this earlier, and this is where Matilda's work comes in: she's taken the idea forward, using the same type of data but with different models that can detect these attacks earlier. I'll hand over to Matilda.

Hi. Great, it's working. So, as Pete said, the previous model using the SOM is great for analytics if you're in a controlled lab environment and want to see how a sample is behaving, but what if you wanted to use this on an endpoint? As we know, static data, as Pete mentioned, is easily obfuscated: even known malware can easily be disguised to evade detection, and researchers have long been in favour of dynamic analysis. But dynamic analysis isn't well suited to endpoint detection, because in the literature it typically takes about five minutes per file, and nobody's going to wait that long to find out whether they're allowed to open their email attachment. So we wanted to see how early you can tell that a file is going to be malicious. Can we do it in seconds? Because if we can, maybe people are willing to wait that long. We use machine activity metrics (I'll play this video again, because it shows them at the beginning): things like the total number of processes running, the CPU usage, and the number of packets received and sent by each application. Every single second we monitor these ten different behaviours; if you want the details, I'm happy to discuss them afterwards. We used a recurrent neural network, a machine learning algorithm that's really good at processing time-series data, to try to predict whether a sample was malicious or benign, and what we found was that we could get a prediction every second. The graphic on the far side over here shows how confident the model is as the wedge of the pie, and typically the wedge gets narrower over time, which is what you'd expect: the model sees more data and becomes more confident. So what was the result of this work? We found that if you looked at an application for five minutes, we could correctly classify 98% of samples as malicious or benign; but if we looked for just 20 seconds, we could also detect with 98% accuracy, and after just five seconds we could already get 94% of the predictions correct. So this gives us a kind of cost-benefit curve for how long you want to wait before making a decision. In a strict corporate environment you might say everyone has to wait the full 20 seconds, but that wouldn't work for every type of workplace.
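A hypothetical sketch of that cost-benefit trade-off: given a per-second stream of p(malicious) from a sequential model such as an RNN, return a verdict at the first second confidence clears a threshold either way. The threshold and wait cap below are illustrative values, not the ones used in the research.

```python
def earliest_verdict(probs, threshold=0.9, max_wait=20):
    """Decide as early as confidence allows.

    probs: per-second p(malicious) for one sample, in time order.
    Returns (second, label): the first second at which the model is
    confident either way, or the best guess once max_wait is reached.
    """
    for t, p in enumerate(probs[:max_wait], start=1):
        if p >= threshold:          # confidently malicious
            return t, "malicious"
        if p <= 1 - threshold:      # confidently benign
            return t, "benign"
    # Ran out of patience: take the latest estimate as the verdict.
    last = min(max_wait, len(probs))
    return last, "malicious" if probs[last - 1] >= 0.5 else "benign"
```

Raising the threshold trades waiting time for accuracy, which is exactly the curve described above.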
About the time we finished this research, the WannaCry attack had just hit the NHS, so we thought that's exactly the scenario in which you really want early detection. As Pete said, if your files have all been encrypted and you don't have a backup, you're basically stuck, or you can gamble on paying the ransom. What we found was that 99% of the 2,000 ransomware samples we tested were detected within one second of execution, and I think that's basically because ransomware is incredibly unsubtle in the way it works.

There are two security flaws in this approach, if anyone spotted them. One is that if we only use the first five seconds of behaviour to decide, and you knew that as the attacker, you could just run Microsoft Word for five seconds at the beginning, or sleep your application. The other is that it's based on virtual machines, and, although there's debate about whether or not this is decreasing, malware can often detect that it's running in a virtual machine. There are ever more inventive ways of disguising that your VM is a VM, but because this cat and mouse of disguising the VM and detecting the VM is ongoing, we thought the only way we'll ever solve the problem is just to watch what the sample does on the target endpoint. You could try to replicate your target endpoint exactly, but if you have an enterprise with lots of different endpoints, a really heterogeneous network, you're going to be building a lot of virtual machines, and they're going to change over time. So we wanted to see: can you detect malicious processes as they're running on the machine? This might sound a bit mad, because why would you let malware run on your machine? But the point is that you don't know it's malware at the time. You've still got your basic static-signature antivirus in place, so you're not letting through anything you definitely know is malicious, but you're watching how everything else behaves on your endpoint.

What good is it to a user, though, to get an alert saying we think there's ransomware happening and all your files are being encrypted right now? One, they have to be sitting at their computer when that happens, and two, they have to know what to do. The malware might wait until the middle of the night, or rather until it detects there's no user activity, or you might be busy, or you might just not know how to deal with that alert. So we thought the only sensible way to deal with it is to automatically kill the processes. You could also quarantine them or put security restrictions on them, but we thought let's try killing them first, because it's the most dramatic option, and see how that goes.

This is a little bit at odds with machine learning models, because a machine learning model will try to do the best it can at matching the data you've given it across the whole data set. When you're looking at time, if you have a process that runs for an hour and for just one second you thought it was malicious, you're going to kill that program, even though, taking one snapshot every second, that's 3,600 points in time and only one of those 3,600 guesses was wrong. The machine learning model thinks it's really good, it's 0.9999 accurate, but because it's killed your process that doesn't really matter. So we had to make some amendments to the loss function, basically to tell the model what it was doing: if you kill a process, you never get another chance to reclassify it, it's gone, and not only that, all its child processes are gone as well. This is done with a modified recurrent neural network; this is a diagram of how it works, and I'm happy to talk about it more afterwards.
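The exact amendment to the loss function isn't spelled out here, but as an illustrative stand-in, a per-process penalty that treats a kill as irreversible might look like this. The kill threshold and weighting are assumptions for the sketch, not the published modification.

```python
import numpy as np

def kill_aware_loss(pred, label, kill_threshold=0.5, fp_weight=None):
    """Illustrative loss for one process observed over time.

    pred: per-second p(malicious); label: 1 = malicious, 0 = benign.
    A plain per-step loss would average errors over all seconds; here,
    once pred crosses the kill threshold the process is dead, so a kill
    on a benign process forfeits every remaining second, not just one.
    """
    pred = np.asarray(pred, dtype=float)
    kill_steps = np.nonzero(pred >= kill_threshold)[0]
    killed_at = int(kill_steps[0]) if kill_steps.size else None
    if label == 1:
        # Malicious: penalise every second it was allowed to keep running.
        survived = len(pred) if killed_at is None else killed_at
        return survived / len(pred)
    # Benign: an early kill wastes all len(pred) - killed_at remaining steps.
    if killed_at is None:
        return 0.0
    w = fp_weight if fp_weight is not None else (len(pred) - killed_at)
    return w / len(pred)
```

Under this scoring, one stray malicious-looking second early in an hour-long benign process costs nearly the whole trace, which is the behaviour the standard averaged loss fails to capture.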
So what does it mean to kill a process? A piece of malware is made up of multiple processes, and maybe only one or two of them are doing the thing you deem to be causing the most damage. I wasn't going to sit and reverse engineer the 4,000 malware samples in our data set, nor would I claim to have the skills, so we took ransomware again, because it's really easy to see the tangible impact on a user: what percentage of your files were corrupted. What we found was that 50% fewer files were encrypted by fast-acting ransomware when we used our model, and that's running live, including all the analysis and process-killing time. By fast-acting I mean it starts encryption within five seconds: it doesn't need internet access, it's not waiting for information from a C2 server, it just goes off and ruins your life as quickly as it can. The model did detect 90% of the ransomware samples. What we found is that the model is quite clunky and takes up a bunch of CPU, because we're running this on a normal laptop as if a regular person were using it, so at the moment we're working on distilling it into a more efficient machine learning model.

I'm just going to show you a quick live demo of it working. In a second, somebody's desktop is going to pop up, an unfortunate person. I can't speed it up, because it just needs to load; and I'm not connected to the internet, it's all in a virtual machine, so don't worry about anything leaking out. So this is somebody running Windows 7. You might think, Windows 7, who uses that any more? The majority of people in the world still use Windows 7. You can see we're getting a graph up already, with predictions as to whether or not the file is malicious. It's already predicting malicious; it didn't in the first second, then quickly started to say it's malicious, and, I don't know if any of you saw, but all the files on the desktop just got encrypted. So it can work in real time, and what we're doing at the moment is making it into something that could actually function day to day, and that won't kill Microsoft Word when you haven't saved whatever you're working on.
lost the slides
There we go; it's just a bit confused. So, part of the work we do as an academic research team, with very close connections to Airbus and often working on site with them, is thinking about how we translate research into the operational environment (sorry, I think it needs to recreate the VirtualBox window, maybe). We'll collect these data sets and train these machine learning models, but often we'll get hold of the data sets in one particular place and take the test sets from the same location. A large organisation, though, is going to see different types of malware and benignware than you might capture in the wild. The garden-variety malware a user comes across might be things like phishing emails, whereas a company might see more things trying to, I don't know, steal corporate secrets. So we wanted to look at the robustness of our model. We collected a load of malicious and benign data, kindly given to us by the Airbus SOC, and wanted to see how one of the models we'd trained in the lab environment fared on it. The most common data used for malware detection with machine learning, which is a very popular area, is API calls: the system calls made to the OS to get hold of hardware resources. Any process that runs needs to make API calls to get access to memory and CPU, to create files, and so on. There are a number of different ways to represent them: you might have a binary representation saying yes, these API calls occurred; you might count how many times they occurred; or you might record which short sequences of API calls occurred. The data we use is slightly different, as I described before: things like CPU usage, packets received and sent, and the amount of memory in use. The main difference between the two is that the machine metrics are invoked by almost every process that runs, constantly: if a process is running, it's usually using some CPU, or at least producing some tangible metric the model can understand. API calls, on the other hand, occasionally get deprecated, and more often than not their usage follows trends in software engineering, and the evolution of malware in response to the vulnerabilities that appear over time or in different organisations. So we hypothesised that machine metrics would basically work across two data sets from different places, and for executable files we found that was true.
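The three API-call representations just described can be sketched as follows; the API names used in the example are made up for illustration.

```python
from collections import Counter

def api_binary(calls, vocab):
    """1/0 per vocabulary entry: did this API call occur at all?"""
    seen = set(calls)
    return [1 if api in seen else 0 for api in vocab]

def api_counts(calls, vocab):
    """How many times each API call in the vocabulary occurred."""
    c = Counter(calls)
    return [c[api] for api in vocab]

def api_ngrams(calls, n=2):
    """Counts of short consecutive sequences (n-grams) of API calls."""
    return Counter(tuple(calls[i:i + n]) for i in range(len(calls) - n + 1))
```

For example, for a hypothetical trace `["OpenFile", "WriteFile", "WriteFile", "CloseHandle"]` with vocabulary `["OpenFile", "WriteFile", "ReadFile", "CloseHandle"]`, the binary form is `[1, 1, 0, 1]` and the count form is `[1, 2, 0, 1]`. All three representations depend on which calls the malware author chose to make, which is what makes them more brittle across environments than the machine metrics.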
This graph at the bottom shows the accuracy of a bunch of models at detecting malicious executables. The blue dots are from samples collected in the wild: publicly available malware, stuff you can get hold of on the internet quite easily. If you were delivering from a research team into an organisation, you'd say this model is 90% accurate, and then the people you've given it to have no way of testing whether the model is still performing up to scratch. The orange dots are how well the models worked on the data set we were given by Airbus. What you ideally want to see is the blue and orange dots close together: the accuracy score you tell the company you're getting is similar to what they'll see in operation. Across these three graphs are three different machine learning models, a random forest, a support vector machine and a neural network; we used three simply to show it wasn't a problem with any one model. What you can see is that the middle three sets of dots, which are the machine activity metrics, are much more tightly clustered than the API-call dots, which show a much lower performance once you put them into production. And interestingly, when you combine the two data sets, it doesn't make the model stronger overall, because machine learning models are quite lazy.
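The blue-versus-orange comparison amounts to measuring a generalisation gap between the lab test set and the operational one; a trivial sketch:

```python
def accuracy(predicted, actual):
    """Fraction of labels the model got right."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def generalisation_gap(lab_pred, lab_true, ops_pred, ops_true):
    """Positive gap: the model looks better on the lab test set than it
    will in the customer's SOC (blue dots sitting above orange dots)."""
    return accuracy(lab_pred, lab_true) - accuracy(ops_pred, ops_true)
```

A near-zero gap is what the machine-metric models show on the slide; a large positive gap is what happens to the API-call models in production.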
They easily get distracted by what they think are easy markers for telling between the two. So that's what we found. This is all work we've done around detection, slowly moving from research towards how it would get pushed into operation. But what does that really mean for the company? I'm going to pass back to Pete to talk about how we can translate this information into something useful.

Thanks. So, just to wrap up the cyber security analytics bit: I opened at the start by saying we're including not just the detection work, but actually doing something with it and informing decision-making.
The work we've done around goal-oriented risk modelling has also been translated into a methodology and a tool. The tool you can see here is called the ICF model. What it basically represents is the ultimate goal of an organisation, let's say, for example, building wings. That depends on certain things, for example an operational factory production floor, which in turn depends on people, supplies of materials, electricity and so on. It's very much abstracted, but underneath it you can measure all sorts of things, from your networks right through to the political state of Brexit; it doesn't really matter what it is, and you can plug it into these models. The reason you can do that is that this is basically a directed graph: you can say this depends on this and this, or that, which makes it useful for conditional probability, which means you can do some Bayesian analysis around it. If I change the probability of one node, it updates the probability of the nodes above it achieving their goals. So if you imagine we plug in a piece of work like Matilda's, where you identify an attack on, say, an IoT device very early on, then as the confidence in that detection grows you can keep updating the model: it'll start at 95% likely to be operational, then drop to 90, 80, 70 and so on.
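A hypothetical sketch of that propagation over a dependency graph, assuming independent children and a simple AND/OR goal decomposition; the real tool's semantics may well differ.

```python
def goal_probability(node, graph, leaf_probs, cache=None):
    """Probability a goal is achieved, propagated up a directed graph.

    graph: node -> (op, children); op "and" needs every child, "or" any.
    leaf_probs: measured probabilities for the leaves (e.g. from a
    machine learning detector's confidence).
    Assuming independence: P(and) = product of child probabilities,
    P(or) = 1 - product of (1 - p) over children.
    """
    if cache is None:
        cache = {}
    if node in cache:
        return cache[node]
    if node not in graph:                      # leaf: measured directly
        p = leaf_probs[node]
    else:
        op, children = graph[node]
        probs = [goal_probability(c, graph, leaf_probs, cache) for c in children]
        if op == "and":
            p = 1.0
            for q in probs:
                p *= q
        else:                                  # "or"
            miss = 1.0
            for q in probs:
                miss *= (1 - q)
            p = 1 - miss
    cache[node] = p
    return p
```

Feeding in a dropping detector confidence for one leaf (say, the IoT device behind "electricity") then lowers the probability of every goal that depends on it, which is the 95, 90, 80, 70 sequence described above.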
We just keep feeding that in. So this is how the work on confidence and labelling from machine learning turns into a probability of a goal being achieved. If there are any questions beyond that, I can answer them later, but this is how we're actually linking the machine learning work to the goal-oriented risk assessment work: by translating the outputs of machine learning into probabilities over goals. That's ultimately it, other than to say, if you enjoy this type of thing, there are two PhD studentships opening at the university very soon. That's not really a plug for any one company; it's for the greater good, everyone needs these people in their organisations. So that's it from us, thank you very much.
Thank you, guys. [Applause] Thank you, that was a very interesting talk. We do have a few minutes for questions, and we'll try to be around afterwards as well. Speak up.

Hi, thanks very much for a very interesting talk. What I'm quite interested in is that you do a lot of detection of malware, and the machine learning is detecting malware, but things like Microsoft Word are also quite good at detecting problems and shutting their own processes down. Did you find you had quite a lot of false positives in your work, or were you really focused on the malware side of things?
Well, we've been focused on the malware really, but we've been using malware that would affect Microsoft Office, Word and Excel and doc files; that's been our main focus.
It's really hard. Hi there. You've trained on your benign data; how would that transfer to my company? Would I need to train a completely new data set if I'm running different processes?

Well, it depends, I suppose. My view would be that the data are fairly generic, you know, Office files, Word files, but the process would be the same anyway: even if you had new data, you could just retrain on that.
Is there code to play with? Can we start playing with this code?

Sorry, is there code to play with? Yes, there's code on GitHub, which I can point you to. Okay, thank you very much. Thank you.