
Chances and Challenges of Machine Learning in Cybersecurity

BSides Barcelona 2021 · 45:56 · Published 2022-01
Speaker: Claudia Ully
Category: Technical
Style: Talk
About this talk
BSidesBCN21 · Day 1 · Sagrada Família Track · Chances and Challenges of Machine Learning in Cybersecurity (Claudia Ully)

Artificial Intelligence, Machine Learning, Deep Learning: these terms are often used interchangeably and for marketing purposes. If we were to believe some of the colourful marketing claims, machine learning could solve many problems that cybersecurity has been struggling with, such as detecting new and unknown attack attempts and automatically taking defensive measures before any harm is done. By understanding how machine learning actually works, we will be able to understand why it is no silver bullet for cybersecurity but only yet another tool with its own strengths and weaknesses. The talk will explain the basic principles of machine learning and show scenarios and tools where it can be applied to cybersecurity. The challenges and dangers of machine learning will also be considered: how attackers can target machine learning algorithms or use machine learning for their own malicious purposes.

About Claudia Ully
Claudia Ully is a penetration tester working in the NVISO Software and Security Assessments team. Apart from spotting vulnerabilities in applications, she enjoys helping and training developers and IT staff to better understand and prevent security issues. (cit. NVISO)
Transcript [en]

...without further ado, Claudia, the floor is yours. Thanks for the introduction, and let's get started; we have a lot to cover today. As announced, I will talk about the chances, in the sense of opportunities, and challenges of machine learning in cybersecurity. What I'm going to cover: I want to give you a basic understanding of what machine learning is, we will learn about the different types of machine learning and how they work, and see areas where machine learning is already applied in cybersecurity. But we also want to discuss the challenges and problems related to machine learning in cybersecurity. The goal is that by the end of this talk you should have a

solid overview of how cybersecurity can benefit from machine learning, but also which challenges, threats, and problems are related to it. It's meant more as an introductory talk to the topic, not a real technical deep dive, but I hope you can take away a few inspirations that help you dig deeper into this topic. Just two sentences about me; Damien already introduced me. I'm a senior security consultant at NVISO, where I mainly focus on penetration testing, web and mobile application security. My interest in machine learning actually started during my university studies, where I dug deeper into reinforcement learning especially; we will come to that in a second, and I'm

now really curious to see and explore how I can combine these two really interesting topics, machine learning and cybersecurity. So first: is machine learning in cybersecurity at all a thing, or is it just a hot topic, some propaganda or marketing stuff? If you think back, security problems have traditionally been aided by mathematical models, think of cryptography, but those models have limits when there is a very large number of variables to consider and a huge amount of data to process. These are areas that are actually the strengths of machine learning, and this is why it is indeed a thing we can

use, or should cover, when talking about cybersecurity, also because a lot of companies are adopting it. But a word of caution: not everywhere machine learning is written on the outside is it actually on the inside. Because there are overlaps between machine learning and simple statistics, some vendors might claim they are using machine learning when actually they are just applying simple statistics in the back. So it is indeed sometimes more of a marketing thing, a way to make things look nice and shiny. With this in mind, let's first get our terminology straight. There are terms like artificial intelligence, machine learning, and deep

learning floating around, often used interchangeably, although they aren't exactly the same thing. Artificial intelligence is actually the broad field and study of simulating human intelligence, while machine learning is a subset of it: an approach to enable computers to make decisions on their own and do things they were not explicitly programmed to do. The phrase was coined back in 1952 already by Arthur Samuel, a pioneer in the field of computer gaming and artificial intelligence. So it has been around longer than one might think, but it formerly had limitations due to the computational power it requires, which is also the reason why it has only been gaining momentum during the last decades.

Now, deep learning again is just a subset of machine learning: certain techniques applied within machine learning that are about simulating the behavior of the human brain, and we will come back to this in more detail later in the talk. Machine learning is also not just one thing; there are different types of machine learning, all suited for different types of tasks we want to achieve. We usually distinguish three different types. Supervised learning, which might be the most commonly known, can be compared to learning with a teacher: you have someone or something that tells you exactly what is what and

what to make of the things you see, and it's mainly used for prediction: we want to map a certain input to a certain output. The most commonly known subcategory here is classification, where the predicted output is categorical; think of image recognition, where we give it a picture and want to know whether we see, for example, a cat or a dog. If we want to predict some real-valued output, we are talking about regression, which can be applied, for example, to weather forecasts or market forecasting. In unsupervised learning we also have different subtypes. Unsupervised learning is comparable to doing self-study:

so you don't have anyone telling you what to make of the things you read and study; you have to organize and structure it on your own, and you usually use it for things like pattern detection and structuring data. Clustering, for example, is about grouping input into segments based on certain similarities; the machine learning algorithm has to find those similarities and group the data accordingly. We also have association rule learning, which is about finding correlations. You have all seen this, for example, when you get products proposed on Amazon, or on Spotify with your weekly Discover playlist or

something like that. It's based on the probability of two events occurring together: if you like song A, there is an eighty percent chance that you also like song B, because the two have been observed together very often. As a third type of unsupervised learning we have dimensionality reduction, which is often more of a pre-processing or intermediate step. It's for filtering and reducing data, reducing the dimensionality, for example if we want to create visualizations, or if we want to process the data further and reduce it for that purpose. Last but not least, we have reinforcement learning, which can be compared to learning with a critic:

someone who doesn't tell you upfront what is what and what is right or wrong, but gives you feedback based on what you did before. It's used for learning procedures of actions based on the feedback you get. We will have a closer look at all three types and how they are or can be applied in the field of cybersecurity during this talk. But first, let's see the basic machine learning process. It all starts with gathering data, because data is the fuel for machine learning: we need it for our training process, and, like with fuel for a car, the

quality and quantity we have has a direct impact on the results we get. If you put bad fuel in your car, it might destroy it, or it might not get you very far; the same goes for data and a machine learning algorithm. Next we want to prepare the data: for example, randomize the data samples, but also, very importantly, check for biases. Is a certain type of data very dominant in your dataset? If yes, your result will also be biased toward this type of data and will struggle with other types of data. The famous, or infamous, example for that is

the early phase of face recognition, which was mainly trained on images of white people, and then of course the algorithm struggled with the faces of people of color. Something else we want to do is split our dataset into a training and an evaluation dataset. The usual ratio here is something like 80/20 or 70/30, so you have a larger training than evaluation dataset, but we need the evaluation dataset a bit later, as we will see in a minute. Next we choose a model. A model is the raw material we have: it's a generic program that is then made specific

by data. Think of it like clay, which gets molded through your hands by repeatedly shaping it or turning the wheel. We of course have to choose which learning algorithm we want to use, and there are multiple algorithms out there; there's not just one for classification and one for clustering, but a lot of different types. What you have to choose is the right algorithm suited for your purpose: is it a classification task at all, or more a clustering task, and which algorithm for this task is, or could be, the best

for your case. Next we start training. This is the learning process that forms the model until we finally have the result we want, and it's an iterative process: we do a lot of training loops, during which we repeatedly make small changes to our model until we finally get the result we want. To check whether we've reached that state, we of course need to evaluate our model, so we run it on the evaluation dataset, the one we set aside in step two. We give it data it hasn't seen before to check its performance on unknown inputs and see whether it can work with that.

There are problems here, and we will come back to them later: you might not have trained enough, and the model doesn't perform well on the dataset, or you might have trained too much, and then it also doesn't perform well on the evaluation dataset. Based on the outcome, you start tuning your algorithm, the parameters you use, and try to optimize the performance, for example to get a greater prediction accuracy. Last but not least, you can finally apply the model. This step is also called inference: putting the machine learning model into production and using the trained model to actually solve the task you would like to solve.
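The split-then-evaluate steps just described can be sketched in a few lines of plain Python. The dataset and the "model" below are invented toys purely for illustration, not anything shown in the talk:

```python
import random

def train_eval_split(samples, train_ratio=0.8, seed=7):
    """Shuffle labeled samples and split them into a training and an
    evaluation set (the 80/20 ratio mentioned above)."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)            # randomize to avoid ordering bias
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

def accuracy(model, eval_set):
    """Evaluate a model (any callable) on data it has never seen."""
    hits = sum(1 for features, label in eval_set if model(features) == label)
    return hits / len(eval_set)

# Hypothetical labeled data: (feature_value, label) pairs.
data = [(i, i >= 50) for i in range(100)]
train_set, eval_set = train_eval_split(data)
# A "perfect" toy model, standing in for whatever came out of training.
score = accuracy(lambda x: x >= 50, eval_set)
```

The point of the held-out evaluation set is visible in the `accuracy` call: the model is scored only on samples it was never trained on.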

So let's look at this process in a bit more detail for the different types of machine learning we've seen. First, supervised learning, and here let's choose an actual example from the area of cybersecurity, a very typical one: spam detection. This is a classification task; we want to predict an output that's categorical: is it spam, is it not spam. What we need first in the training process is labeled data, or at least two things: the labels, which are what we want to predict, spam or not spam, and labeled data, in this example a lot of emails for which we

already know "this is a spam email" and "this is not a spam email". Now we describe each data point through a feature or attribute vector. These are the criteria based on which we will train our model and later make our predictions, so these are the features that are probably relevant for our classification. In the example of emails, this could be the sender address or contained links or attachments, while something like the date and time of receipt might not be a feature that's really relevant for predicting whether something is spam or not. Next we start the training process. As I said, it's an iterative process where the

learner develops, or shapes, the model based on the training data, and once we have finished the training process, or also in between, we run some evaluation to see whether we first have to change and tune some things. What we could change, for example, are parameters in the algorithm itself, so in the training part, but we could also make changes to the feature vector: add a feature that might be helpful, or remove a feature that we don't think has an impact. All of this can change the outcome. After completing the training process, we can use the model for prediction, which

means we get an email, we again have to extract the feature or attribute vector as we did for the training data, and now we give it to the model we've trained. As a result we get back a certain probability that the mail we gave it is a spam or a non-spam email. The same principle can be applied, or is already applied, for malware detection, where you then of course extract features from a binary to see whether you're dealing with a good or a bad executable, for example.
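The feature/attribute vector idea can be illustrated with a toy extractor. The features and the example emails below are made up for this sketch; a real spam filter would use many more, and better, signals:

```python
def extract_features(email):
    """Map an email (a dict with 'sender', 'body', 'attachments')
    onto a numeric feature vector for a spam classifier."""
    body = email["body"].lower()
    return [
        1 if email["sender"].endswith((".xyz", ".top")) else 0,  # suspicious TLD
        body.count("http"),                                      # number of links
        len(email.get("attachments", [])),                       # attachment count
        1 if "winner" in body or "free" in body else 0,          # trigger words
    ]

# Two hypothetical emails, one spammy and one benign.
spam_mail = {"sender": "promo@deals.xyz",
             "body": "You are a WINNER! Click http://example.test now",
             "attachments": []}
ham_mail = {"sender": "colleague@example.com",
            "body": "See you at the meeting tomorrow",
            "attachments": []}
spam_vector = extract_features(spam_mail)
ham_vector = extract_features(ham_mail)
```

Both the training emails and any new incoming email go through the same `extract_features` step, so the model always compares like with like.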

Now, for those who thought "this is really interesting, I want to do this myself, get my hands dirty and give it a try": what you need, of course, is some data, and there are datasets out there you can use, for example for training your own spam detection algorithms, like the Spambase dataset from the University of California, Irvine, which has about 5,500 emails, of which roughly 1,300 are spam, so you have labeled data there you can use. Also for malware there are datasets like Microsoft's malware dataset or Endgame's EMBER malware dataset that you could use if you just want to try it out

and try training a model on your own for these tasks. What does the process look like, or what are the actual differences in the process, for unsupervised learning? As we remember, this was more for organizing and structuring data. The first difference is that we have a completely different input here: we get unlabeled data. As I said, we don't have something that already tells us what is what; we just get the data as it is. Again we describe it through a feature vector, start the training process, and shape the model, and as an outcome there is usually some human intervention needed to interpret the output we get

at the end from the model. The algorithm will give us back some type of clustering, for example, and we then actually have to work out the reason the data was clustered that way. Taking the example here, what we could end up with is, like on the left, two clusters based on shape, and on the right, three clusters based on color. This is why we still need some brainpower here to interpret what we get as a result. Now, where could this be used? Two areas of application are user

behavior analytics and forensic analysis. In user behavior analytics, what we want to do is find anomalies in user behavior to detect insider and outsider attacks. Parameters you usually look at are, for example, IP addresses and their geolocations, timestamps, user agents, or certain actions that are performed. As a very simplified example, you can say that the behavior of admins has some typical traits putting them into one group, like the blue cluster you can see here, and the typical behavior of your users forms a different group, like the yellow cluster. If now an admin, for example, logs in

from an unusual IP address, one that's usually associated with a normal user, he moves out of his admin cluster. What we can do very well with unsupervised learning here is detect outliers: things that do not fit into their usual cluster, or into any cluster at all. The same goes for forensic analysis: if you try finding anomalies in log files, which are usually huge, this really helps to process that amount of data and to find things that stick out and can be indicators of attack or compromise. Outliers could be, for example, data that doesn't match any other data, or something that's unusual in terms of

frequency, correlation, or sequence. There are tools out there that already use unsupervised learning, usually with a deep learning approach, which I will explain in a second. For example, Splunk and Darktrace are tools for user behavior analysis using machine learning, unsupervised and supervised, for this task. For forensic analysis you have the Elastic Stack, which is free and open source and helps you create your own machine learning jobs to analyze and monitor metrics from logs. So we also see machine learning becoming more easily available to end users: you don't have to really understand the deep technical stuff anymore and the

mathematics behind it; it's getting more user-friendly and easier to use. Now, last but not least among the types of machine learning: reinforcement learning, which differs a little from the two we've seen so far, because here the agent, let's call it the learner, the one who learns, creates its own learning experience through interaction with an environment. We don't have data beforehand; it's learning through trial and error. The best example to think about is a cleaning robot. If you think about a cleaning robot running around in your house, it has certain sensors

attached to it which give it a certain state. For example, "my front proximity sensors detect something very close to me" could be translated to "I'm standing in front of a wall", or at least in front of something. Based on the observation it has, it tries to choose an action, and it has a certain policy, which is its current knowledge, that helps the agent, in this case the robot, select and execute an action. Possible options here would be, for example, move forward or turn sideways. As a result of taking an action we get another observation: for example, after moving forward, the

agent will detect "my bumpers hit the table", meaning it hit something, or when it turns sideways it will see "my front proximity sensor says there's nothing in front of me anymore". Based on the observation we had first, the action we took, and the new state and observation we are in, the agent receives a certain reward, which indicates whether the action was good or bad. If we bump against the wall, that's bad, so we give it a negative reward, basically a negative number, and if we detect that we now have a free path, we give it a positive reward, a positive number, to

say this was a good result. Over time, this shapes the policy: the reinforcement learning algorithm modifies the behavior over time, because the agent's goal is to maximize its rewards and get as many positive rewards as possible, and this filters out the bad behavior over time. But of course it's time-consuming: your agent will probably bump into the wall around a thousand times before it starts learning that what it's trying to do is actually bad. Now, for the application of this in the area of cybersecurity, there are some challenges. As I said, it is time-consuming, but the other thing is that real-world

problems usually have a large state and action space: a lot of possible states and a lot of possible actions that could be taken. As a consequence, reinforcement learning is now usually combined with deep learning approaches; we come back to this in a bit. This is still quite a new thing, and the first great success with it was in about 2015 with, maybe you've heard of it, AlphaGo by Google: a machine that taught itself to play the board game Go and then became so good that it defeated not just the best human players but also the best machine players around at

that time. So far the greatest successes have been in other domains like video and board games, or even robotics and autonomous driving, but there are some scenarios where it's being considered and experimented with in the field of cybersecurity. One is cyber-physical systems, meaning IoT, the industrial internet, smart grids, smart cities, basically everything that's smart. You take the sensor input of the device to define its states, and by training it on certain changes in its state, it might learn to detect that it is, for example, under attack, and to select the appropriate action.
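The observation, action, reward loop described for the cleaning robot can be sketched as minimal tabular Q-learning. The states, actions, and reward values below are invented for illustration; a real agent would have a far larger state and action space, which is exactly the problem mentioned above:

```python
import random

STATES = ["wall_ahead", "clear"]
ACTIONS = ["forward", "turn"]

def step(state, action):
    """Toy environment: bumping into the wall gets a negative reward,
    moving forward in the clear gets a positive one."""
    if state == "wall_ahead":
        return ("wall_ahead", -1.0) if action == "forward" else ("clear", 0.0)
    return ("wall_ahead", 1.0) if action == "forward" else ("clear", 0.0)

q = {(s, a): 0.0 for s in STATES for a in ACTIONS}  # the policy's "knowledge"
rng = random.Random(0)
state = "clear"
for _ in range(2000):
    # epsilon-greedy: mostly exploit current knowledge, sometimes explore
    if rng.random() < 0.2:
        action = rng.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: q[(state, a)])
    next_state, reward = step(state, action)
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    # nudge the estimate toward the reward plus discounted future reward
    q[(state, action)] += 0.1 * (reward + 0.9 * best_next - q[(state, action)])
    state = next_state
```

After enough iterations, the learned Q-values prefer turning at a wall and moving forward in the clear: the repeated negative rewards have filtered out the bumping behavior.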

So you get a device that can defend itself, because it knows how to react and which actions to take if it detects something malicious going on with its sensors. Another area where it's applied is intrusion detection systems, but more as an addition to other learning approaches, so combined with supervised and unsupervised learning. The aim is to enable systems to choose more appropriate log files, for example, when looking for traces of an attack, by adding an additional reinforcement loop which gives a reward or a penalty when the system selects a log file that does or does not contain anomalies or signs of an attack, so that over time the system will

get better at selecting those log files. A really interesting area of application is game theory. With game theory we can model the conflict and cooperation between intelligent decision makers, basically describing, for example, an attacker and a defender fighting over an asset, and this is exactly what was experimented with: you have two agents, the attacker and the defender, both acting against each other and trying to learn the optimal attack and defense strategy. The problem with this is that it of course simplifies real-world cybersecurity problems into a very limited game of just two players and one or two assets, which is

usually not the case in the real world. Now, the most common approach, as I said, is to combine reinforcement learning with deep learning, which is why I want to give deep learning a closer look as well. Deep learning usually refers to the use of neural networks, and neural networks are based on the working principle of neurons in our brain. If we look at a model of an artificial neuron, we receive a certain input, in real life a certain input from our senses, and that goes through synapses into our neuron, and those synapses have a certain strength that amplifies or

weakens the input. These are called weights in the artificial neuron, and they determine which input is more or less significant for our case. Now the neuron sums up all the input it gets and compares it to a threshold value, and if this threshold value is exceeded, the neuron fires, meaning it produces a positive output; this is also called the activation function. In deep learning we have multiple layers: a simple neural network has just one neural layer in between, while in deep learning we have multiple, so the output of one layer is the input of the next layer, and each layer brings

more complexity to the table. Thinking about image recognition: the first layer might be able to identify edges, the second layer might identify combinations of edges, the third might identify certain features like eyes or a mouth, and the last identifies combinations of features, like a certain face. Now, how does learning work in those networks? Training means that we again have an iterative process where we tune those weights, the strengths of the synapses, and the biases, the threshold values within the neurons, in each neuron in each layer, until we finally get close to what we want.
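The artificial neuron just described, weighted inputs, a sum, and a threshold, can be written directly. The weights and threshold below are arbitrary example values, and the step activation here is the simplest possible choice:

```python
def neuron(inputs, weights, threshold):
    """One artificial neuron: weight each input by its synapse strength,
    sum everything up, and fire (output 1) if the sum exceeds the
    threshold -- a simple step activation function."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum > threshold else 0

# Hypothetical weights: the second input is far more significant.
fired_weak = neuron([1, 0], weights=[0.2, 0.9], threshold=0.5)
fired_strong = neuron([0, 1], weights=[0.2, 0.9], threshold=0.5)
```

Stacking layers of such neurons, where each layer's outputs become the next layer's inputs, gives the deep networks described above; training then means adjusting the weights and thresholds.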

At the end there is always a so-called cost function, which determines the degree to which the output we get differs from the expected output, and the result is then pushed back across all the neurons to tune those values. I see I'm getting a bit dark, let me enlighten myself a bit. Okay. Now, having learned about the basic principles, let's sum up the opportunities, how we can actually use this for our purposes. We've seen some examples already, but the main areas where it's applied now, and indeed promising, are, for example, automation, so that it takes over repetitive tasks such as

triaging or log analysis, so that human analysts can concentrate on the really difficult stuff. Also very promising is, again, to actually combine it with humans, to enhance human analysis, for example to reduce human error but also to increase efficiency. There was a very promising experiment at MIT's Computer Science and Artificial Intelligence Laboratory called AI² for log analysis. What they did is first have unsupervised learning algorithms find extreme and rare events in the logs; those events then get reviewed by human analysts, who label them as "this is normal" or "this is indeed an attack indicator". Now we have labeled data and can feed this back in a

supervised learning loop, and based on this input, together with the one from the unsupervised learning part, it can predict whether an attack is imminent or not. During this experiment, the attack detection rate actually rose to 85 percent, and the false positives decreased heavily. It can also help us improve threat response time by pulling the really relevant data from log files and preparing it for analysis, so that cybersecurity teams get more simplified reports highlighting the most important things and can then make decisions faster. And then of course there's the

dream that maybe, at a certain point in time, machine learning might be able to detect zero-days. There have been some studies and experiments on this, but I'd say we are not entirely there: usually, if something is completely new and your model has never seen it, it will of course have problems dealing with it. So there are some problems and challenges, as I already mentioned. You have general problems during training, like overfitting: you train on your dataset like learning for a test by getting sample questions and learning exactly those sample questions and their

answers, and then you get different questions in the actual test and perform badly; that's the problem of overfitting. Underfitting means we have not trained enough, we have not studied enough, and then we also will not perform well. Another problem is always dimensionality: the more dimensions, factors, and features we have to learn, the more data is required. And again, there's inductive bias: there are certain assumptions we have to make, for example when extracting the feature vector, and these of course have an impact on the results we get. Especially in cybersecurity, a huge problem is also the shortage of data, or especially the shortage of labeled

data, and then the imbalance of the datasets we have. Usually your ratio should be something like 1 to 10, so for every 10 normal emails you have one spam email, for example, but in cybersecurity this is usually a lot higher: you have, for example, a lot of benign traffic and just very few malicious samples. Also think about the consequences: if your model produces false positives, you lose work time and confidence in the system; but think of the false negatives if your model isn't working well enough: in the worst case you can have a full system compromise. Another great problem is that

often we want results to be interpretable. We want to know why a certain output was produced: why did the algorithm choose, for example, a certain log file, or why did it suggest a certain category in classification? It's difficult to see under the hood on which basis the decision is made, which makes it difficult to explain. And of course the world usually changes quicker than models do; there is always time required, as I said, from changes happening until a model can adapt to them. You have new protocols coming up, new types of activities and malware, and this is of course also a challenge.
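The false-positive/false-negative trade-off can be made concrete with precision and recall. The detector counts below are invented for this sketch of an imbalanced security dataset:

```python
def precision_recall(tp, fp, fn):
    """Precision: how many of the raised alerts were real attacks.
    Recall: how many of the real attacks were caught. On imbalanced
    security data, plain accuracy hides both numbers."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical detector on 1000 benign and 10 malicious samples:
# it flags 8 attacks correctly, misses 2, and raises 12 false alarms.
p, r = precision_recall(tp=8, fp=12, fn=2)
```

Here the detector is right on roughly 99 percent of all samples, yet only 40 percent of its alerts are real (precision) and it misses a fifth of the attacks (recall), which is exactly the lost-work-time versus system-compromise trade-off described above.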

For my last five minutes, let's look at the other side of the coin. We've seen a lot of cases where we could use machine learning, but let's also have a look at how machine learning can be abused. The first CVE for a machine learning component was published in 2019, for an email protection system, so you see the attackers never sleep. By the end of 2020, MITRE, in cooperation with Microsoft and other companies, published the Adversarial Machine Learning Threat Matrix as a framework for detecting, responding to, and remediating threats. Highlighted in white are techniques specific to machine learning systems, and as you can see there are quite a lot, and

in grey are techniques that are applicable to both machine learning and non-machine-learning systems. Due to time, I just picked two very small examples to demonstrate two types of attacks here: poisoning and evasion. Poisoning is something that usually happens during training, meaning the attacker is somehow able, for example, to modify labels, if you want to do some classification, or to inject adversarial samples into the training dataset. The best example for this is actually from 2016: Microsoft's chatbot Tay; who remembers this? This was a chatbot programmed to learn from interactions with users, so the user text it got was

actually the input, and it had to be taken down 24 hours later, as it had become sexist and racist. That means, of course: if you give it poisoned input, you will get poisoned output. Another very famous, let's say, example is evasion; it's the most common type of attack you have. You try to design input that is then wrongly classified by the machine learning model, and I've chosen here an example from DEF CON 2019, the stealth shirt. You have an object detection camera; whenever you see the green square, it detected an object, and as you can see, whenever he doesn't cover the front of his shirt,

he is not detected. So this is how the stealth shirt works: it confuses the algorithm so that it no longer detects him. This has also been shown in research, for example by putting stickers on stop signs, which then get misclassified, no longer as stop signs but as speed limit 30 or 50 signs, which is, if you think of autonomous vehicles, really bad if this were a successful attack. And last but not least, of course, we cannot just attack machine learning; we can also use machine learning for our attacks. As with everything, everything can kind of be used

as a weapon, and there have been some examples showing this successfully: breaking the Google CAPTCHA with machine learning; generating passwords, or creating more concise wordlists based on data about the user, which makes password cracking more successful; creating malware that can evade anti-malware engines; or advanced phishing attacks, for example a bot that tweets phishing posts targeting a specific user and seeds those posts dynamically based on topics extracted from the timeline and posts of the target and the users they interact with, retweet and follow, which makes it really hard to detect that this is indeed phishing. Okay, this concludes my presentation. I hope

you gained a good overview of the topic, and I'm happy to answer questions now.

Thank you very much, Claudia. Let's wait a little bit to see if people have any questions; again, you can write your question here in Zoom or even in Slack, and we will answer those. In the meanwhile, I want to ask something, Claudia. First and foremost: great presentation, thank you very much. I felt it was very informative, especially for quite a complex topic; it has a lot of stuff going on, a lot of moving parts, but I felt it was really well broken down and, you

know, delivered to us, so that was great. I have a question then: have you surveyed, or do you have an understanding of, how much machine learning has been applied in today's cybersecurity solutions, the ones that claim they have machine learning in their products? And if you do, what do you think about that? Is there anything you want to comment on?

Just let me repeat your question to make sure I understood it right: where has it actually been applied, and how well, basically?

In general, I was asking if you had a

look at other software or services on the internet that claim machine learning as part of their product, and if you have, what you think about their implementations.

Of course, this is usually the problem: you can't really look under the hood of the product, because this is kind of what their business is built on. But for sure you can see that it's mostly products for the blue team, like I said, like Splunk or Darktrace. They couldn't do what they do if it wasn't real machine learning, or more concretely, really deep learning, that

they apply, so you can be pretty sure they're actually using it. The question is, you get very little information on what exactly they're using. They usually say, okay, we have some deep learning applied here, but they won't give you details on the models, and often also not on the training data they use to actually get their models trained. This is also something you should maybe ask the vendor when you consider using their services, but don't expect them to really give you a very informative answer on that, because that's kind of part of their secret.

Yeah, I can imagine

that; they can, you know, play in this regard with what they have. And I also understand that for some of them it's not easy to figure out if someone is really using it; but it's good to have some pointers, and especially to understand whether the outcome is expected to come from machine learning or just a bunch of, I don't know, cron scripts that analyze the data. So that's why I was asking if you had any experience with that.

Yeah, you can actually be pretty sure: if you have those really, really large amounts of data, the best way to handle them, for those

use cases I showed, is machine learning; you couldn't do it otherwise, or it would take a lot more processing power. So it's also the economic option to use machine learning in those cases.

We have a question from Pablo: do you know an example of the use of machine learning for risk estimation? I think he's asking more about the business side, about governance and policies.

Not off the top of my head, but let me have a closer look at this and get the answer to you maybe a bit later. Yeah,

in the channel, yeah; you can reply there or tweet about it. Let's see... okay, so I think there are no more questions. Again, thank you very much, Claudia.
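To make the poisoning attack described in the talk concrete, here is a minimal label-flipping sketch against a toy spam filter. Everything here (the midpoint-threshold "model", the feature values, the poison samples) is hypothetical and chosen only for illustration; it is not from the talk or any real product.

```python
def train_threshold(samples):
    """Learn a decision threshold: the midpoint between the class means."""
    spam = [x for x, label in samples if label == "spam"]
    ham = [x for x, label in samples if label == "ham"]
    return (sum(spam) / len(spam) + sum(ham) / len(ham)) / 2

def classify(x, threshold):
    return "spam" if x >= threshold else "ham"

# Clean training data: the feature is a made-up "suspicious token" score (0..1).
clean = [(0.90, "spam"), (0.80, "spam"), (0.85, "spam"),
         (0.10, "ham"), (0.20, "ham"), (0.15, "ham")]
t_clean = train_threshold(clean)
print(classify(0.70, t_clean))       # spam: a borderline mail is caught

# Poisoning: the attacker injects high-scoring samples mislabelled as "ham",
# dragging the ham class mean (and with it the threshold) upwards.
poison = [(0.97, "ham"), (0.99, "ham"), (0.98, "ham")]
t_poisoned = train_threshold(clean + poison)
print(classify(0.70, t_poisoned))    # ham: the same mail now slips through
```

Note that the attacker never touches the model itself; a handful of confidently mislabelled training samples is enough to shift the learned decision boundary so that borderline spam slips through, just as Tay's poisoned conversations shifted its output.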
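The evasion examples from the talk (the stealth shirt, the stop-sign stickers) follow the same principle as this toy sketch: nudge each input feature a small step in the direction that lowers the model's score, in the style of the fast gradient sign method (FGSM). The linear "detector", its weights, and the feature values are all made up for illustration.

```python
import math

# Hypothetical linear malware detector: flag the input if w.x + b > 0.
w = [2.0, -1.5, 1.0]            # made-up feature weights
b = -0.5

def score(x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def is_flagged(x):
    return score(x) > 0

sample = [0.4, 0.2, 0.5]        # a sample the detector flags
print(is_flagged(sample))       # True

# Evasion, FGSM-style: step each feature against the sign of its weight
# (for a linear model, the gradient of the score is exactly w).
eps = 0.15
adversarial = [xi - eps * math.copysign(1.0, wi) for xi, wi in zip(sample, w)]
print(is_flagged(adversarial))  # False: small per-feature shifts evade detection
```

Against a linear model the gradient is just the weight vector, so the attack is exact; against deep models the same sign-of-gradient step often still transfers, which is what makes physical-world evasion such as the stealth shirt possible.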