
How to make sure your data science isn't vulnerable to attack - Leila Powell

BSides Las Vegas · 57:19 · Published 2016-08
About this talk
How to make sure your data science isn't vulnerable to attack - Leila Powell. Ground Truth track, BSidesLV 2016, Tuscany Hotel, Aug 02, 2016.
Transcript

Welcome back to the Ground Truth track at BSides Las Vegas 2016. I just wanted to take one more opportunity to thank all our sponsors, because we couldn't make this happen without them; please visit their booths out in the chill-out area. We're live streaming and recording this, so please turn off your cell phone ringers, and please don't stand in the back because it's a fire lane. Up next we have Leila Powell, who is a security data scientist at Panaseer.

Thanks. So, I'm sure many of you have heard the term data science appearing more and more in an infosec context, and at the moment, as we've heard from some other speakers already, the focus seems to be on machine learning. Now, machine learning is a great family of algorithms and can be really powerful, but that's all it is. Unfortunately we seem to be missing any coverage of the broader discipline of data science, and this can be problematic for a number of reasons. First of all, similar to the talk we heard just before, people can get taken in by the hype and advertising around machine learning and think it's a magic bullet for all their problems, if they don't understand the work required before and after to make the solution robust. People may also think that they can just start applying machine learning algorithms ad hoc to data, without the experience to handle the data properly.

For the purpose of this talk, though, I want to focus on another area: the fact that we're not spending any time looking at data science as a discipline means we're missing out on some of the benefits that applying the discipline of data science to infosec can bring. In the last year I've been working with financial services companies, trying to help them bake a data science approach into their data analysis in infosec. In particular, we've tried to help them with a couple of problems.

First of all, communicating the data. It can be hard in infosec, because there are lots of different stakeholders, to get everyone to agree on what the truth of the situation is, and when that happens you lose trust in the data analysis. So today I want to talk to you about why data science is a discipline, what that involves, and how applying that discipline to infosec can tackle some of the challenges I've seen people face over the last year.

OK, so data science is a discipline. Essentially what I mean by this is that it's a way of doing things: there are principles that govern how you should do stuff. It's not just a bunch of algorithms that we throw things at. Data science, like many professions, is often misunderstood; people have one idea of it, but when you actually get involved there's a lot more to it, and I'm sure those of you on the infosec side of the audience can sympathize with that problem.

So let's talk about the principles of data science. I've broken them down into three areas, and the important thing is that they all build on top of one another: the first one, data exploration and preparation, is required as a foundation for everything that comes afterwards. I want to give you a little more information on each of these areas.

Data exploration and preparation involves a series of principles. The first is understanding what questions can be answered by your data set. If you take any set of numbers you can look at the distribution, calculate a median, do anything to it, and some other numbers will come out; but you need to understand what information is contained in the data and whether the question you're asking can actually be answered. The second is domain knowledge. This was touched on in the first talk this morning, with the suggestion that data scientists working in security might lack the domain knowledge to do the work. But actually we should be getting this from the data: if we work with the data properly, we can learn what we need to know to answer the questions from the data set itself, though we have to be careful to do it thoroughly.

The next point is taking a look at the metadata, by which I just mean data about your data. When we take a set of data, say from a database, we don't just assume all of it's valid. We want to look at the timestamps governing when each record was updated, where it came from, whether it's still valid now, whether it was relevant six months ago, and whether we can still use it.

The next point is around quirks in the data. Whichever database you're extracting data from was designed by another human, who might not have thought exactly like you do. So when you export data and start to work with it yourself, you have to be careful not to make assumptions about how that data is structured, updated and stored; otherwise it can lead to misunderstandings and misinterpretation. Then we need to think about completeness, which is slightly related to the first point about what questions we can answer: if we're looking at a population of users or assets, we need to think about how well that population is represented in the data set we have. If we only have 10% of our assets in the data set, any conclusions we draw won't really be valuable.

And finally I've thrown in simple stats, which I think is often overlooked. People often want to jump straight to some really advanced algorithms, but if it's the first time you've analyzed a data set for something beyond its operational purpose, some basic statistics can be really valuable and reveal quite a lot. You should probably be doing it anyway, to check you have a feel for your data before you do anything else.
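[Editor's note: the "simple stats first" step can be sketched in a few lines of pandas. The data frame, column names and numbers below are invented purely for illustration; they are not from the talk.]

```python
# A minimal first look at a data set before any modelling:
# count, central tendency, spread and category frequencies.
import pandas as pd

detections = pd.DataFrame({
    "age_days": [3, 10, 14, 30, 45, 45, 160, 400],  # how long each detection has been open
    "severity": [5, 7, 7, 9, 5, 3, 7, 9],           # scanner severity score
})

summary = detections.describe()               # count, mean, std, quartiles per column
median_age = summary.loc["50%", "age_days"]   # the median is robust to the 400-day outlier
severity_counts = detections["severity"].value_counts()

print(median_age)        # 37.5
print(severity_counts)   # how many detections at each severity
```

Even this much is often enough to spot a skewed distribution or a suspicious cluster before reaching for anything more advanced.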

The next area of data science is applying the algorithms. You'll hear a lot of great talks at BSides today and tomorrow about specific algorithms, machine learning and other things, and how they actually work, but today I want to focus on the discipline of data science and what you have to be careful of when you apply these algorithms. The main one, I'd say, is understanding which algorithm is appropriate. First of all, for the data set you have: if you want to apply a statistical test or a machine learning algorithm, you need to think about what your data looks like. Some methods assume a normal distribution; do you have one of those? If your data set is skewed, are you applying the right algorithm to it? The next thing is your use case: what is the question you're trying to answer, and what is the appropriate algorithm to provide the information to answer it? This is particularly important for the scenario I'm talking about, which is not doing data science as a researcher but actually working with an infosec team in a company. We can't just look at stuff that's interesting; we have to look at stuff that's useful, so the use case, and what we're actually going to do with the result afterwards, is really important. And finally, we want to consider the level of accuracy required: how accurate a number is appropriate, and how much time do we have to reach that solution? We can't say we'll be 90% accurate in six months, because the infosec team we're working with needs something sooner.

And finally, on to communication, which is possibly one of the most neglected areas of data science in infosec. A few principles of doing communication well. First of all, balancing what I've called caveats versus usability; we'll talk about this more later, but essentially what I mean is giving someone enough information that they have relevant context for their use case and the decision they have to make, but not so much that they're completely overloaded and don't know what to do next. Next, we need the perspective on the data that's appropriate for the different stakeholders. Infosec data has a lot of different stakeholders with different roles and responsibilities, and you can't show one plot to all of them to help them do their jobs. Then we need our insights to be actionable, something I touched on earlier: when we present results from the data, it needs to be something that someone can do something about. There's no point just highlighting a bunch of bad stuff and leaving them to it; that's not really useful. And finally, we need to be careful that the probability of someone misinterpreting the way we've presented the data is low. Our job as data scientists, and this is what you should expect from a data scientist if you're working with one, is to make sure you understand what they've presented to you; they shouldn't just be throwing stuff over the fence and leaving you to get on with it.

I just wanted to jump back to machine learning briefly, to say where it fits in. It sits in this algorithms section, and what I want you to take away from this slide is that if you're going to apply machine learning yourself, for example if you enjoyed the first talk this morning and you'd like to have a go with some of those algorithms, and you're actually going to use what you find to make decisions, make sure you do all of the first bit, the data exploration and preparation, as well, and that you can communicate the results. And secondly, if you're looking at tools, back to the crazy vendor claims again: look at what data they tell you they need for their solution to perform as claimed. Do they need to collect data for six months, nine months? Which data sets do they need? How clean does the data have to be? Because your solution is only going to be as good as the data it's built on.

So today I'm going to focus on the bottom and the top of this tower of data science, simply because there will be a lot of focus on algorithms in other talks, and I think we often forget how crucial the foundations are, and actually doing something with the data afterwards. I want to talk through how we can apply these principles of data science to infosec.
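[Editor's note: the "is this algorithm appropriate for this data" check can be sketched with a simple skewness test. The numbers and the |skew| > 1 rule of thumb are illustrative only, not from the talk.]

```python
# Before applying a method that assumes roughly normal data,
# check the shape of the distribution first.
import pandas as pd

ages = pd.Series([1, 2, 2, 3, 3, 4, 5, 8, 30, 365])  # heavily right-skewed toy data

skew = ages.skew()
if abs(skew) > 1:           # rule-of-thumb threshold, not a formal normality test
    centre = ages.median()  # robust summary for skewed data
else:
    centre = ages.mean()

print(round(skew, 2), centre)
```

The point is only that the choice of summary (or test, or model) should follow from the data's actual shape, not be assumed.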

I'm going to use the example of vulnerability analysis, and I want to start with the data bit, getting the strong foundations. Now, if you're analyzing vulnerability data simply using the tools provided by your vendor, the vulnerability scanner, just logging into their web interface or using their reporting module, then you don't really need to worry so much about all this data science stuff, because you're just looking at pre-created plots made by the people who built the database, and that should all be fine. It's when you want to do something beyond what you can do in that tool. I've seen a lot of people in the last year end up exporting stuff into Excel and trying to do additional reporting, and the problem when you do that is that you need some kind of framework and a way of handling the data, otherwise you get into all sorts of trouble. So if you're happy with your vendor tool, don't worry about it; if you want to export the data and do something with it yourself, here's how we can go about that, starting with the strong foundations.
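[Editor's note: if you do go down the export route, loading the data with explicit types and parsed dates is the kind of framework being described, as opposed to ad hoc Excel. The CSV layout and column names below are invented; a real scanner export will differ.]

```python
# Load a scanner export with explicit parsing, so dates are dates
# and IDs are strings from the start, rather than whatever Excel guesses.
import io
import pandas as pd

csv = io.StringIO(
    "vuln_id,asset,first_found,last_found\n"
    "CVE-2016-0001,host-a,2016-01-10,2016-07-01\n"
    "CVE-2016-0001,host-b,2016-03-02,2016-07-01\n"
)
df = pd.read_csv(
    csv,
    parse_dates=["first_found", "last_found"],
    dtype={"vuln_id": "string", "asset": "string"},
)

print(df.dtypes["first_found"])  # a real datetime column, not free text
```

With typed columns, later steps (age calculations, freshness filters) can't silently operate on strings.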

I want to set the scene with an example problem that we've seen in the last year. We talked earlier about how there are a lot of stakeholders in infosec, and in vulnerability data there are a lot of stakeholders too: you have the patching teams, the vulnerability manager, and the CISO, who probably needs to report up to the board as well, and all those people need to pass on information about what's going on. So suppose your CISO has to report on the vulnerability situation to the board, and they've got a nice trend line of the number of vulnerabilities over time. You can see something's happened: there's a big spike. So we know the what, but not the why. From this plot alone, if they have to go and defend it in a board meeting and say something about it, we have no idea what's causing that spike. So one of the key points is to make sure we actually measure something meaningful. In vulnerability data we often start off with a number of vulnerabilities reported, whether that's in internal reporting, which I've seen, or the big number that's the first thing you see when you log into your vendor tool, but there's a lot of complexity in it, so I want to walk through what goes into that number.

Imagine now we're trying to help the CISO understand that trend, so they can explain what they've seen and give everyone a bit of confidence that there's no need to panic. We're going to have to understand what builds up to make this number. Some of you in this audience might be very familiar with vulnerability data, so what I say next might be obvious, but the point I'm trying to make is that anyone who's not working with a given data set day in, day out won't know all these hidden complexities. If you're trying to communicate about your vulnerability data to your CISO, who might not have been hands-on with the data, they won't know all the things that seem obvious to you. Similarly, if one of your colleagues works in a different area of infosec but you have to work together, you might not know all the subtleties of their data. So I wanted to break this down, so we're all on the same page about the kinds of complexity that can be hidden behind those trends and a basic number.

The first thing I want to talk about is naming conventions. When I started working with vulnerability data a year ago, I was really surprised that two different things were called vulnerabilities. If you're writing a bit of code, you don't want to call two variables the same thing; same in algebra, same in a lot of other places. It turns out, as many of you probably already know, that a vulnerability is a flaw in software with a unique CVE ID or a unique vendor ID, but if we're referring to an instance of that vulnerability on an asset, well, that's also called a vulnerability. You get used to this, and it's fine, but it's actually pretty confusing. It sounds like I'm being a pedant, and I definitely am, but there is a point to this. I was working with one vulnerability manager, and she would have to go to the CISO and explain the vulnerability situation, and you'd have these conversations where she'd say: right, we've got 32,000 vulnerabilities, but we've got less work than last time because we've actually only got 100 vulnerabilities. It sounds crazy, and it's a really hard concept to explain. So for the sake of clarity I'm going to rename one of them and call it detections: vulnerabilities are things with a unique ID, and a detection is an instance of that vulnerability on an asset. This actually makes reporting on it, and discussing it with lots of different people who don't work with the data, a lot easier. So it's worth thinking about language, as well as the whole maths and data handling thing, when you're trying to communicate with people who aren't in the weeds of the data every day.
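[Editor's note: the vulnerability-versus-detection distinction is easy to make concrete in code. The IDs and hosts below are made up; the point is only that one table yields two very different counts.]

```python
# Each row is one detection: one unique vulnerability observed on one asset.
import pandas as pd

detections = pd.DataFrame({
    "cve":   ["CVE-A", "CVE-A", "CVE-A", "CVE-B", "CVE-B"],
    "asset": ["h1",    "h2",    "h3",    "h1",    "h2"],
})

n_detections = len(detections)         # instances on assets: the "32,000" style number
n_vulns = detections["cve"].nunique()  # unique flaws: the "100" style number

print(n_detections, n_vulns)  # 5 2
```

Two CVEs, five detections: both are honest answers to "how many vulnerabilities do we have?", which is exactly why the terms need separating.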

So let's start to break down this number. If you remember, from the principles of data science, one of the first things we wanted was domain knowledge, so we're going to start looking at what this number actually means. OK, we've got 32,000 detections; let's break it down. Say 25,000 are unchanged since the last time we scanned, so that's the big chunk that explains the jump we saw on the original plot we showed to the CISO. We've got a bunch of new ones and a bunch of reopened ones; we've accepted the risk on some, so those have gone away; and we've closed a bunch. This all adds up to make that number.
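[Editor's note: the first level of the breakdown is just a count by status. The status labels and toy counts below are illustrative, not the talk's real 32,000/25,000 figures.]

```python
# Break a flat list of detections down by why each one is in the total.
import pandas as pd

statuses = pd.Series(
    ["unchanged"] * 6 + ["new"] * 2 + ["reopened"]
    + ["risk_accepted"] + ["closed"] * 2
)

breakdown = statuses.value_counts()
total = int(breakdown.sum())

print(breakdown.to_dict(), total)
```

The headline number is the sum; the breakdown is what makes it explainable.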

OK, that's fine, but we still can't really do anything with this, so let's break it down to another level of detail. It now starts to get a bit more complicated, and in the interest of time, since this is the last talk of the day, I'm just going to focus on one branch as an example: the new detections that have come in. We'll have some that have come in because there are newly published vulnerabilities: you've had Patch Tuesday, a bunch of stuff has been released, you run a scan, and it's all now detected on your machines. You'll also have some, and we've actually seen this, that come in because an old vulnerability has been newly detected on your estate; maybe something from 2012 suddenly pops up on a bunch of assets, and that's potentially interesting. Now, as we said before, a detection is a combination of a unique vulnerability and an asset. We've looked at the causes related to the number of unique vulnerabilities changing; what about the assets? Well, suppose you're also rolling out a program to scan more of your estate, so you now scan a new subnet. There are more assets, so you've got more detections of vulnerabilities, but that's actually a good thing: you're scanning more assets, you're getting more coverage, and we're not worried about that. Maybe you also buy some new workstations and bring them online in an area that was already being scanned; you've got more machines, so again this feeds into the increase, but it's not something to worry about. So I think we're going to focus on the first two branches, the things related to the vulnerabilities.

But before we go any further, there are a couple of things we haven't done from the principles of data science, so let's pause here and check the validity of our data by looking at the metadata. We have the 32,000 detections, and there are a bunch of timestamps in vulnerability data which we should probably have a look at before we continue. Let's look at when the records in the database we've exported from were last updated. I've picked 90 days as an arbitrary threshold, and this is one of the complexities: it's not clear what the cut-off is, when data is still valid. That will depend on the use case and how frequently you expect the data to be updated, but in this case 90 days seemed reasonable. Most of the detections will have been updated in the last 90 days, so we say fine, that's recent, relevant information, keep it. We'll have some that haven't been updated in a long time, and again I've broken out some of the reasons why, some of the complexity behind that number. Some will be detections that simply haven't been retested; others will have old update times because the asset they're on just hasn't been scanned. If we follow the left-hand branch, those that haven't been retested, why is that? Now we get down to real practical reasons. Maybe there's been an authentication failure: some vulnerabilities require authentication to test for, so if that fails we can't update the record. Maybe the test couldn't be replicated: you have to replicate the exact conditions to retest for the vulnerability, so if you've moved a bit of software, maybe they couldn't do the test again. So again, there's complexity in assessing whether the data is valid, and in making decisions about what to do with detections on assets that haven't been scanned; you don't just want to be deleting information about vulnerabilities, so there's a whole issue around what the right decision is. In this case we were going to focus on the new detections, so we're on that side of the tree diagram and we're OK to continue.
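[Editor's note: the 90-day staleness check described above could be sketched as below. Dates and the column name are made up; the threshold is, as the talk says, an arbitrary choice that depends on the use case.]

```python
# Split detections into fresh and stale relative to an as-of date,
# so stale rows can be investigated rather than silently trusted or deleted.
import pandas as pd

detections = pd.DataFrame({
    "cve": ["CVE-A", "CVE-B", "CVE-C"],
    "last_updated": pd.to_datetime(["2016-07-20", "2016-03-01", "2016-07-01"]),
})
as_of = pd.Timestamp("2016-08-01")

age = as_of - detections["last_updated"]
fresh = detections[age <= pd.Timedelta(days=90)]
stale = detections[age > pd.Timedelta(days=90)]

print(len(fresh), len(stale))  # 2 1
```

The stale subset is a list of questions (not retested? asset not scanned? auth failure?), not a list of rows to drop.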

The next word of warning is around the data quirks I mentioned earlier. When you're exporting the data and you see something called last scan date, you might think: great, I'll just stick that in my code and plot some graphs, and that'll be fine. But do you really know what last scan date means? Is it well documented? Possibly not. Is it the last time there was a vulnerability scan, or the last time there was a compliance scan? Is it when the scan kicked off, or when it finished? Is it when the scan was last authenticated, or not authenticated, or the most recent of either of those? You can obviously work this out eventually, but a lot of the time people will just take the first interpretation that seems sensible and go with it. We actually had a case where someone exporting data had the wrong meaning for a timestamp, and it wasn't updating any of their information, so this can be really crucial when you start to do data analysis off your own back. Those are the kinds of warnings around the other things you need to check.
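[Editor's note: one cheap sanity check for an ambiguous timestamp column, in the spirit of the case just described, is to compare two exports taken some time apart and see whether the column actually moves. The data here is invented to show the failure mode.]

```python
# If scans ran between two exports, a "last scan" column should have
# changed for at least some rows. If it never moves, you may have the
# wrong column, or the wrong meaning for it.
import pandas as pd

week1 = pd.Series(pd.to_datetime(["2016-07-01", "2016-07-02"]))
week2 = pd.Series(pd.to_datetime(["2016-07-01", "2016-07-02"]))  # a week later, identical

changed = (week2 != week1).mean()  # fraction of rows whose timestamp updated

print(f"{changed:.0%} of rows updated")  # 0% here: a red flag worth chasing
```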

So, just to remind you where we got to: we were looking at new detections to explain that spike in the trend graph for the CISO, and we're going to focus on the contributions from old and new vulnerabilities, because that appears to explain the spike. The next thing we want to move on to is how we might actually communicate this, so we're going up to the communication part of data science. We're not applying any complex algorithms; as I said, we're just getting some initial value from some simple metrics. Now, I think communication has been one of the biggest issues I've seen people face. I touched on it before: there are a lot of stakeholders.

They all have different areas of expertise: someone is dealing with vulnerabilities every day, other people aren't, and the same applies for all the different controls, so it can be really hard to get your message across. So I wanted to take a look at what we've called the data flow in infosec. You start off down at the bottom with your tactical data: everything from your logs and controls, and in our vulnerability example this is our 32,000 detections, so 32,000 data points at this end. Then you move up to operational data, maybe at the vulnerability manager level, and that gets condensed: the vulnerability manager is probably looking at, say, histograms of the age of detections, so they can see what needs to be patched next and how they're doing at keeping in policy. So we're going from 32,000 data points down to maybe ten bins in a histogram; we're compressing the data. Then we move along to strategic data, going up to the level of the CISO or even the board itself, and this needs to get even more compressed. In fact, in one infosec team we worked with, at the bottom end they were producing maybe a 15-page report, and at the top end they had a quarter of a PowerPoint slide to explain the whole vulnerability situation in the business. So you really need to compress this as you go along, and it's not just people trying to be awkward by making you summarize the data; it simply has to be that way, because as we go along this chain, up the levels of management, the responsibility and remit of the person at each level gets much broader. At the tactical level, with those 32,000 detections, there might be someone working on a patching team, maybe just for Windows, maybe just for the UK; great, they can have all the raw data, they need it for their job, and that's all they're focused on. When we get up to the top, you have your CISO, who needs to keep an eye on all the controls on the estate across all global regions. She's never going to have time to look at the raw data from all of them, and what's more, it's not her job: someone needs to have made a decision before it reaches her.

So what data we show someone is essentially based on what they need to do with it, and this is really tricky, because it comes down to one important balancing act, which I mentioned as one of the principles at the beginning: the balance between caveats and usability. You need to provide someone with enough information to do their job, but not so much that they don't have time to look at it or can't interpret it properly. I think this is one of the big challenges of trying to get to data-driven decision-making in infosec: this flow of information as it goes up, giving people the appropriate level of information to do their job, and recognizing that not everyone needs all the details all the time. Based on that, what we can see is that we need different perspectives on the data for different stakeholders, and what a data scientist should be doing is taking some data, analyzing it, and then packaging up the results so they're really usable for the use case of the person they're trying to help. When you do that, when you make something really great for one person, it's not so great for another person, and that's OK. But a word of caution: if you see some analysis, in the press or being shown to someone else in your team, that's at a different level, and you think, that's too simple, where are all the intricate details? Just ask: was that analysis intended for you? If it wasn't, you're probably not going to like it or find it useful; it will probably be annoying. I'd love to see all the logs, all the raw details, but the person it's being presented to doesn't; they don't have time to look at it, they need to make a decision. What's important is that people get the right impression of the data, the impression you want them to have, so that the outcome is the right one. Often, by giving people all the information you want to give them, you get the wrong outcome, because you've just given them a bunch of logs and they're thinking: what the hell is this? So effective data science will be pretty targeted, is what I'm trying to say.

So let's go back to the vulnerability example and have a quick look at the different views that different stakeholders have on vulnerability data, so you can see an example of what I'm talking about. A CISO, as I mentioned before, has a broad view: they're going to be looking across the globe and probably comparing how different business units are doing. In fact they are comparing, because this is what the people we work with typically look at. Some important differences: we've moved away from raw numbers, and we're now looking at vulnerabilities per asset, a ratio of the number of detections to the number of assets. Otherwise, when you're trying to compare different business units, a business unit with far more assets is obviously going to have far more detections of vulnerabilities, so if you look at the raw numbers, the biggest business unit will always appear to be doing the worst. The CISO's job here is to make sure all the business units are performing well, so we look at a relative number of vulnerabilities, we compare across regions, and we look at the overall trend. A vulnerability manager, for example, needs a totally different view. Suppose they're just working in the Americas, this top bar here.
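[Editor's note: the per-asset normalization just described is a one-line calculation. Business unit names and counts below are invented to show why the raw number misleads.]

```python
# Normalize detections by asset count so business units are comparable.
import pandas as pd

bu = pd.DataFrame({
    "business_unit": ["Americas", "EMEA", "APAC"],
    "detections":    [32000, 9000, 4000],
    "assets":        [16000, 3000, 1000],
})
bu["per_asset"] = bu["detections"] / bu["assets"]

worst_raw = bu.loc[bu["detections"].idxmax(), "business_unit"]
worst_rate = bu.loc[bu["per_asset"].idxmax(), "business_unit"]

print(worst_raw, worst_rate)  # Americas APAC
```

By raw count the biggest unit (Americas) looks worst; per asset, the smallest unit (APAC) actually is.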

They need completely different information: they need to make sure that patching is being done on time and that everything's being managed. The kind of information they might prefer to look at would be something like a histogram, with the age of the detection, how long it's been on your estate, across the x axis, and the number of detections of that age on the y axis. This shows them, if your patching policy is 30 days, that you've got a bunch that are way past that policy, and plenty coming up. Then they'd probably also need an actual list of detections, to see which vulnerabilities to pass on to the patching teams so they can actually go and do something. This is all the same data, but there's no point showing this to the CISO, or the previous plot to the vulnerability manager. It's important, though, that we can easily go from one view to the other. Another of the problems I've seen is that people have high-level plots which, when you break them down to the raw logs, don't add up: we seem to have lost some vulnerabilities, someone's filtered them out, someone's done some weird thing in Excel, and you can't get from the raw logs up to the summarized view.
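[Editor's note: the vulnerability manager's age histogram amounts to binning detection ages. Ages, bin edges and the 30-day policy below are illustrative.]

```python
# Bin detection ages into policy-relevant buckets.
import pandas as pd

ages = pd.Series([2, 5, 12, 28, 31, 40, 75, 120])  # detection age in days

bins = [0, 30, 60, 90, float("inf")]
labels = ["0-30", "31-60", "61-90", "90+"]
hist = pd.cut(ages, bins=bins, labels=labels).value_counts().sort_index()

overdue = int(hist["31-60"] + hist["61-90"] + hist["90+"])

print(hist.to_dict(), overdue)  # 4 detections past a 30-day policy
```

Because the bins are computed straight from the row-level data, the summarized view can always be reconciled back down to the raw list, which is exactly the consistency being asked for.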

And back down again. That's crucial: everyone needs to be seeing the same data. It's just been repackaged, but it needs to be the same. The next point I want to make about communication is around how it's interpreted. If you're a data scientist plotting a chart, or an infosec professional handling data and showing something about a data set you're really familiar with, it can seem really obvious what a plot is telling you. But all of us have to be careful to put ourselves in the shoes of the person viewing that plot and understand how they might misinterpret it; if you're a data scientist, it's basically your job to make sure the audience has a low probability of misinterpreting. An example: showing an average. One CISO we worked with liked to see the average time a detection sat on the estate before patching: a nice quick measure, easy to get an idea and compare the business units. Another one we worked with said: no, rubbish, I don't want to see that, it hides outliers, show me the full distribution. The distribution takes a bit longer to process, but there's more information contained in it. So the question you have to ask yourself, if you're going to show someone the average, is: is that person aware that it masks outliers? He probably is, but is he going to remember it? Is it going to be at the forefront of his mind when he has five minutes to check the report you've produced and has to make a quick decision on which business unit needs to be contacted and told to improve their process? This is again one of the responsibilities of data science: to make sure the communication is clear and unambiguous.
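[Editor's note: the average-hides-outliers point is easy to demonstrate. The patch times below are invented; one forgotten host is enough to distort the mean.]

```python
# One extreme value drags the mean far from the typical case.
import pandas as pd

days_to_patch = pd.Series([5, 6, 7, 8, 400])  # one long-forgotten host

mean_days = days_to_patch.mean()
median_days = days_to_patch.median()

print(mean_days, median_days)  # 85.2 7.0
```

An "average of 85 days to patch" and "most hosts patched within a week" describe the same data; which one you show depends on what the viewer needs to decide.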

And it shows that the way we present data is specific not only to the role of the person, whether they're on the patching team, a vulnerability manager or the CISO, but also to the individual, especially in roles where people come from different backgrounds. For example, some CISOs will be very technical, while others might come from more of a management background, and they might have different skill sets when it comes to understanding data; we have to bear all of this in mind. The next principle to apply is that of actionable insight, which I mentioned earlier. As a data scientist when you produce a plot, or as an infosec professional when you're producing a report or being shown something, you should ask: so what? What do you actually do with that now? And remember, this isn't about doing data science as research; it's about working with an infosec team that actually needs to do stuff. So I wanted to take you back to the original plot, the trend the CISO was trying to understand. We saw that it seemed to be to do with new vulnerabilities coming in and old vulnerabilities coming onto the estate; let's see what we can do with that, and take a look at a way we can potentially provide some actionable insight.

been a sharp spike in the last month. The CISO needs a reason to go and report to the board, and from the kind of analysis we did before, breaking down all the information and getting that domain knowledge, we saw it appeared to be linked to old vulnerabilities being reintroduced onto the estate. On the right we see those plotted out separately. Notice we don't plot all those branches from the tree diagram; that would just be too many caveats. This is usable: we have in yellow the number of detections coming from old vulnerabilities being reintroduced, and in blue the number of detections coming from newly published vulnerabilities. There will always be newly published vulnerabilities, but as you can see, that line is pretty flat; it hasn't changed markedly in the last month. However, the number of old vulnerabilities has shot up, and one of the ways we've seen that happen is if you have an out-of-date standard build: you put that back in, it's got old software on it, and you suddenly introduce a load of old vulnerabilities and have to patch them again.

So this is actionable; this hints at process. This isn't about patching all the time; it's about looking at other aspects of the data. Where are those vulnerabilities actually coming from? Maybe if we update the standard build, we won't have all this stuff coming in. We actually saw an example with some people we were working with where an old version of, I think it was Skype, had been rolled out. You can do this analysis broken down by software type as well, and then you see all these older vulnerabilities coming in, and that's just poor process. You don't have to pass that on to your patching team; it shouldn't be there in the first place. So this kind of insight, and this is a pretty simple example, is what makes something usable. The plot on the

left gives an idea, but it's something like the plot on the right that indicates what someone should do next: go and review your process, go and review your standard build. [An audience member asks whether "old" means vulnerabilities that existed on the estate before, or just old, long-published vulnerabilities.] Yes, you could have both, actually. In the case of the standard build, there would be an old vulnerability that's new on that asset, but of course it could also have been detected on your estate before: when that standard build was up to date, things would have been detected on it and all patched, and now you're putting them back in, essentially. So yeah, you can have both; that's another one of the complexities, and another avenue to go down, to split this out further.

OK, so I feel like we've gone through the kind of framework of data science for our example of vulnerability data. We've seen the way to approach the data when you're handling it yourself, how to get some insight, and the things to bear in mind when you're trying to communicate it. What I wanted to talk about in this last section is going beyond your data set. All this stuff we've talked about is just for one

data set, and that's quite common in infosec, right? We have all these siloed tools; we look at this one, then we look at that one. But actually, once you have this kind of framework in place to make sure you're handling data carefully and properly, and people get on board and everyone's agreeing on the picture they're seeing, we can start to look outside the data set.

One of the first things we should do next is think about the completeness of the data we just looked at. We've done all that analysis, but we haven't checked how complete the data is. So this circle represents all the hosts that we found in our vulnerability scanner database; that's what we used for all that analysis. If we look at all the hosts we have on our estate, represented by this gray circle, what we might find, and probably will find, is that there'll be hosts on the estate that aren't being seen by the vulnerability scanner, and unless there's a good reason, they probably should be. We also might see some hosts, in the small section here, that are in the vulnerability scanner but have actually been decommissioned, and no one's cleaned up the database or purged the data. But the main thing we're worried about is this gray area: the bigger it is, the less confidence we have in how relevant our findings about what's driving vulnerabilities on the estate are.

So what can we do about this? Well, we can start to look in other data sets to get visibility on these hosts. Now, it'd be great if there were a really up-to-date CMDB, but in my experience that's not the case, so we can go to other data sets and use those. What we've tended to do in the last year is get data sets that will go some way to helping us solve the problem we have but are easy to access. Often security teams don't own all their data sets; they could be outsourced, owned by

IT. On some systems, logging might not be switched on at a high enough level of detail to answer the use case you want. So you have to be a bit pragmatic and get the data sets that will help you but that you can get quickly; again, we want a quick return on investment so that people can actually do something with this data. One of the things we've been using is AV data: AV and vulnerability data are both typically quite easily available. What we can do then is essentially join the data sets together and look for hosts that are in both, which is the overlap, and, in the case of trying to get a feel for how good the coverage of our vulnerability scanner is as a control, look for hosts that are only in the AV data, this section here. Then we can go and see: why aren't they in the vulnerability scanner? We can also start to work towards some kind of percentage coverage of our vulnerability scanner as a control. Whenever you look at data from your controls, you should try to understand the percentage of your estate that the control covers; if it's a really low percentage, your conclusion is basically useless. You should try to improve the coverage first, then do

the analysis afterwards. As with all the machine learning hype, this is one of those things that sounds really easy when you first mention it; actually, it's really hard to match the data sets together. In the vulnerability data I typically work with, you'll have an IP address, but often there'll be no host name resolved, no DNS name resolved. Then in the AV data you want to try to match it, but if you use the IP address, a lot of businesses obviously will have DHCP, so is it the same asset? You can't check the host names because most of the time you don't have them. And OK, if it's the same IP address and the AV scan was an hour after the vulnerability scan, they're probably the same; but if it's three hours, or four, or five, or two days, at what point do you make that cut? You can actually do a pretty good first pass with any host names, NetBIOS names for example, that you do have in the data. So there are complexities, and host resolution is probably another talk, but you can do a good first pass, you can learn something from this data, and then to learn more you can get more data. Maybe you can get the DHCP logs and start to see which assets are assigned which IP addresses.

And at this point you're really glad you applied that really solid framework of how to handle data and how to communicate it, because if you hadn't, this would be an incredible mess by now, and the more data you put in, the harder it gets to manage sensibly. So building up that kind of framework of data science and applying it to infosec data allows you to add more data sources and build complexity without completely losing all the accuracy and trust in your data. The other thing I wanted to

highlight is that once you start to get all this data, and you're managing it well and you know what you're doing, why not use it for other things? We can start to get a bit more context around what's going on with our assets. When we find a host in lots of different data sets, we can start to ask: how does that host look in the vulnerability data? What vulnerabilities does it have? What's the situation in AV? What else is going on? What users are logging on, what software is on there, have we had any alerts from our other systems involving that host? So we start to get a lot more context, and that can make it easier to do some of the tactical and operational work. Another benefit is that if we start to bring in business context information as well, we can move towards a wider view of security across the business. We can start to make it something that is more relevant for the board, and we can get a bit more buy-in from the board, because we're now starting to talk their language; we can communicate better with them and give them an overall picture of what the exposure to risk is.

So the takeaway from this talk is essentially, first of all, that data science is more than just machine learning. There's a bunch of other stuff; there's a whole framework for how you can approach analysis, and if we can apply this to the way infosec handles data, we can improve the trust in, and the communication of, security data. The benefit for those of you working in a tactical, operational way is that you can get more context and know more about what's going on, which might also be important when working out how to prioritize tasks and understand which are the riskiest assets. But possibly the most valuable thing in the long run is that we can start to provide evidence to management, to the board, that demonstrates how the work of the security team is lowering risk and helping the business. At the moment, the situation we're in is: if nothing bad happens, then it's OK. So no matter how much work you do, there's no evidence to show what you're doing; nothing bad has happened, and it's hard to measure nothing, right? This way we can really communicate to the board what's going on, what the security situation is, and finally demonstrate all the hard work the infosec teams are doing.

thanks
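The average-versus-distribution point from the talk can be made concrete with a small sketch. The time-to-patch figures below are invented for illustration:

```python
import statistics

# Hypothetical time-to-patch figures (days) for two business units.
# Unit B has one badly neglected host -- an outlier.
unit_a = [7, 9, 8, 10, 9, 8]
unit_b = [7, 8, 9, 8, 7, 120]

# The mean alone makes unit B look uniformly worse...
mean_a = statistics.mean(unit_a)  # 8.5
mean_b = statistics.mean(unit_b)  # 26.5

# ...but the median and max tell the fuller story: unit B is fine
# apart from one host that has sat unpatched for months.
median_b = statistics.median(unit_b)  # 8.0
worst_b = max(unit_b)                 # 120
```

If the reader only sees `mean_b`, they may conclude unit B's whole process is broken, when the actionable finding is a single outlying host.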
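The coverage check described in the talk, joining the AV and vulnerability-scanner host lists, might look something like this minimal sketch. The host names are made up, and using the AV data as a proxy for the full estate is itself an assumption:

```python
# Which hosts appear in the AV data but not in the vulnerability scanner?
av_hosts = {"host-01", "host-02", "host-03", "host-04", "host-05"}
scanner_hosts = {"host-01", "host-02", "host-06"}  # host-06 was decommissioned

seen_by_both = av_hosts & scanner_hosts          # the overlap
missing_from_scanner = av_hosts - scanner_hosts  # go and ask why
stale_in_scanner = scanner_hosts - av_hosts      # candidates for purging

# A rough coverage figure for the scanner as a control, treating the
# AV host list as the estate:
coverage = len(seen_by_both) / len(av_hosts)
print(sorted(missing_from_scanner), f"{coverage:.0%}")
```

With these toy sets, coverage comes out at 40%, low enough that, per the talk, you would improve coverage before drawing conclusions from the scan data.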

So, I have a question-slash-comment. You touched on the importance of looking at basic statistics, and for me personally, count is the fundamental stat I look at first, because that addresses coverage. If you don't have good counts, your means don't mean anything; you can't establish statistical significance.

Yeah, I think that's absolutely true. A lot of the problems I've seen are that the numbers literally just don't add up. People think they have X number of hosts, you look in the vulnerability data and you've got Y number of hosts, and then: are you meant to be scanning these or not? You have problems with simple numbers of vulnerabilities not summing to the total you expect, because people are exporting data into Excel and cutting it and changing it. So absolutely, there's a lot of value there: count stuff first, then try summing it, then look at a median; you get a much better feel for the data.

On the other point: yeah, exclusions are actually a big issue, and I think it's something people often do unintentionally. As I talked about, if you're not an expert in a data set and someone presents stuff to you, they might assume it's obvious that they've cut out things that haven't been scanned in the last 30 days, or that they've cut out severity-one vulnerabilities, because who cares about those? You probably should care, by the way, but that's not advice. They'll assume perhaps that it's obvious those things have been done, and as that goes up the chain and flows through the levels of management, no one knows those cuts have been made. And then if that person leaves the business, the next person comes in and tries to replicate that reporting, and the numbers just don't add up; they don't know that that guy chopped out those older detections or made that cut. So absolutely, I think that once you get this approach working, you need to move to something that's repeatable as well, repeatable and automated if possible, because then people can't start cutting bits and pieces out.

You may have already touched on it a little, but can you share a couple of common mistakes you can make when interpreting and analyzing data, potentially arriving at the wrong conclusions? For example, certain data points are over-represented, so, like the example you gave, of course North America is going to have the most vulnerabilities, because they have the most assets. Do you have a

couple of other examples of traps that are really easy to fall into? I think a lot of it is just around making assumptions, actually. Take the example I gave around the last-scan-date field: people will think, oh, that's obviously the last time a scan started, and then they'll develop some really nice plot based on that and find out later it was actually the last time authentication failed, or something. So it's about testing your assumptions about the data: if you can actually run a test and see what happens when you change things, test your assumptions.
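That kind of assumption test can be sketched in a few lines. The records and the `last_scan`/`auth_ok` field names here are invented; in practice you would pull them from the scanner's own export:

```python
# Testing an assumption about a field rather than trusting it: does
# "last_scan" really mean "last completed scan"?
rows = [
    {"host": "a", "last_scan": "2016-07-01", "auth_ok": True},
    {"host": "b", "last_scan": "2016-07-02", "auth_ok": False},
    {"host": "c", "last_scan": None,         "auth_ok": True},
]

# Quick profile before plotting anything: nulls and suspicious combinations.
nulls = sum(1 for r in rows if r["last_scan"] is None)
failed_auth_with_date = [r["host"] for r in rows
                         if r["last_scan"] and not r["auth_ok"]]
# If this list is non-empty, "last_scan" is NOT "last successful scan".
```

Here host "b" has a scan date despite failed authentication, which falsifies the comfortable assumption before any plot is built on it.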

Another example like that we saw was in a database where, with DHCP, you'd have an IP address used by a Windows machine, so you'd get a NetBIOS name, and then a Linux machine would take over that IP. I would just have assumed they'd nullify the NetBIOS field, that it should be blank now. No, it stays there; that's just how the data is stored. And if you'd left that in and done all these plots based on numbers of unique NetBIOS names or something like that, they could be totally wrong, just because you assumed that if you had built the database, you'd have done it like that. It's actually the really simple stuff that trips people up, before they even get to attempting to apply black-box machine learning: like this gentleman said, the counts are wrong, or someone's filtered something out in a spreadsheet and not told you.

Great presentation; you're giving me flashbacks to a VM program, and a lot of the problems we had. One of the things I struggled with was communication, and I think you touched on that, so I wondered if you had any lessons learned in two different areas. I've been asked for metrics that weren't

really very good to look at, like the percentage breakdown of severities in your environment: if you clear out all your level ones, your level fives increase as a share of the total, when actually security has improved overall. That's one example. And then there's the timing of the data: taking your data at different times of the month may significantly change how it looks, after Patch Tuesday or whatever. So I wondered if you had any suggestions or lessons learned around those two areas.

Yeah. To start with the time-analysis thing, that's something you absolutely have to be careful with. For example, one of the things we do when we first go in and start working with people's data is try to match it to their existing reporting, as a bit of validation, and you can find that if you export from a database on a Monday one week and on a Tuesday another week, the data doesn't agree. So it's important to bear that in mind. It's also important to think about whether you want to look at behavior in a kind of rolling window, or whether it's the current status that's important, and that depends; everything is quite dependent on what the use case is. So I think it's just being aware of that and thinking about how it might impact your results.

Remind me of your other one? Ah, funny metrics being asked for. Like, you present your data as vulnerabilities per host, which is an excellent metric, but you've been asked to turn it into a percentage, which is horrible, of course, and beyond a hundred percent it isn't even accurate. Yeah, I think this essentially comes back to the kind of breakdown: with the vulnerability stuff, you have to see what's feeding in. I think you have to try to

guide people. I think doing data science is sometimes a bit of an education piece. In the same way that you can take guidance from infosec professionals about how data is useful to them and what they need to do their job, they can hopefully take guidance from data scientists about the best way to present data. If you explain to people why they might actually be misleading themselves by showing that, and give them a valuable alternative, that's useful: maybe looking at a ratio of things instead of percentages. I mean, percentages can be fine; it depends what you're doing. The coverage one can suffer as well: your coverage can be really good, then you bring more assets online, and it takes a while for the scanner to go and find them, so your coverage plunges. It's about having reasons, about being able to explain the differences.

And how do you explain that sometimes an increase in vulnerabilities may actually be a decrease in your risk, if you added more hosts?

Yeah, exactly, that was the point of the tree I had before: some of the changes are actually positive. You've rolled out your scanner to a new business unit and you've got loads more vulnerabilities. I think there isn't one metric that's going to give you the whole state of vulnerability, so it's about potentially having a bunch of simple metrics that will flag up these areas. They'll flag up things around process, looking at the age of vulnerabilities and how that changes, whether they're old or whether they're new; you start to look at, as I said, the standard-build process or other things. Also tracking the number of assets: one thing I've been working on, as a complement to looking at the number of vulnerabilities, is looking at how the number of assets on the estate is changing, and you can correlate that. If you have a set of metrics around assets and you see that's shot up, you'd expect a boost in your vulnerabilities. So it's around grouping together metrics that inform each other. And for sure it's challenging, especially with vulnerability data, because there are so many factors that come into play, but I think anything is better than the current situation I've seen, where it's typically a number and a trend and no insight at all. What does that even mean?

Yeah, exactly.

Hi, thanks, you gave a great presentation. Having done this for a while, I just have a lot of questions, but one of the biggest, and one that I grapple with a lot, is effectively the last bullet you have up here. How do you tell the business owners, the leaders, the money people, what risk means to them? Is your definition of risk just lowering the number of vulnerabilities? Because risk to them, I think, is something else, and it has a dollar value associated with it. Do you have anything related to that, that ties in the entire enchilada?

Nice phrase. Yeah, absolutely, it's really complicated, and for sure it's not the number of vulnerabilities; I totally agree that's not the point. This is something I'm working on with my colleagues at the moment, actually, looking into how we can really estimate what risk is; that might be looking at a kind of host-centric risk and seeing how that plays out when you look at your controls across the infrastructure. But as a first pass, when we're talking about these basic metrics I've shown today, it's simply about allowing the CISO to show that if he or she implements a strategy, something happens. If they roll out vulnerability scanning across more of the estate, they can demonstrate they've done that and that we have more visibility. So it's around educating the board on what security is doing as well; it's being able to show an outcome of the work you do.

One of the ways vulnerability data can be problematic is if you take a really simplistic view and just see an increasing trend. If that CISO in my example had gone to the board and said, oh God, look at the spike in vulnerabilities, people would infer things that aren't true: oh well, what's the patching team doing, this is rubbish, when actually you've improved coverage of the vulnerability scan across your estate. You've actually done something good, and it reads as something bad if you have overly simplistic metrics. Simply repackaging the data in a way that gets the proper insight, and gives a reason why things are changing, means the CISO can go and point to it and say: OK, vulnerabilities have gone up, but that's great, because we've got awareness of these now; we didn't know they were here before. That's a totally different story from: there's a big spike, I've got no idea. So I think it's providing reasons, providing evidence for what you've done, and then eventually working up to, I mean, as you say, the million-dollar question, which is getting a measurement of risk to the business.

Thank you. OK, thanks.
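The old-versus-new breakdown behind the talk's yellow/blue plot could be sketched like this. The detection records are invented, and the 90-day threshold for "old" is an assumption:

```python
from datetime import date

# Split detections into "newly published" vs "old vulnerability
# reintroduced" relative to the scan date.
OLD_AFTER_DAYS = 90
scan_date = date(2016, 7, 1)

detections = [
    {"cve": "CVE-2016-0101", "published": date(2016, 6, 20)},  # new
    {"cve": "CVE-2014-0224", "published": date(2014, 6, 5)},   # old: stale build?
    {"cve": "CVE-2013-1347", "published": date(2013, 5, 3)},   # old
]

old = [d for d in detections
       if (scan_date - d["published"]).days > OLD_AFTER_DAYS]
new = [d for d in detections
       if (scan_date - d["published"]).days <= OLD_AFTER_DAYS]
# A spike in `old` with a flat `new` points at process (for example,
# an out-of-date standard build), not at the patching team.
```

Plotting `len(old)` and `len(new)` per month gives the two lines from the talk.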
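The first-pass host matching discussed in the talk, preferring NetBIOS names and falling back to same-IP-within-a-time-window, might be sketched like this. The field names and the four-hour window are assumptions, not recommendations:

```python
from datetime import datetime, timedelta

# DHCP makes a bare IP match unreliable over time, so only accept an
# IP match when the two observations are close together.
WINDOW = timedelta(hours=4)

def same_asset(vuln_rec, av_rec):
    """Crude first pass: NetBIOS name match, else IP + time proximity."""
    if vuln_rec.get("netbios") and vuln_rec["netbios"] == av_rec.get("netbios"):
        return True
    if vuln_rec["ip"] != av_rec["ip"]:
        return False
    return abs(vuln_rec["seen"] - av_rec["seen"]) <= WINDOW

v = {"ip": "10.0.0.5", "netbios": None,
     "seen": datetime(2016, 8, 2, 9, 0)}
a = {"ip": "10.0.0.5", "netbios": "WS-042",
     "seen": datetime(2016, 8, 2, 11, 30)}
print(same_asset(v, a))  # same IP, 2.5 h apart -> matched
```

As the talk notes, DHCP logs would let you replace the time-window guess with actual lease assignments.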