
Dominating the DBIR Data

BSides Las Vegas · 2016 · 56:15 · 95 views · Published 2016-08 · Watch on YouTube ↗
About this talk
Gabriel Bassett and Anastasia Atanasoff from Verizon's DBIR team explore the unglamorous data-preparation work behind the Verizon Data Breach Investigations Report. They walk through incident curation, data cleaning, anomaly handling, and statistical methodology, revealing how messy real-world data becomes actionable security intelligence.
Original YouTube description
Dominating the DBIR Data - Anastasia Atanasoff, Gabriel Bassett - Ground Truth - BSidesLV 2016 - Tuscany Hotel - Aug 03, 2016
Transcript [en]

Okay, so next up we have Gabriel Bassett and Anastasia Atanasoff from Verizon. They're on the DBIR team, so put your hands

together. Okay, so we're not going to be talking about the DBIR report; in fact, we're not even going to be talking about the data in the DBIR. This is all going to be about the process that goes into the data, the data janitor work. It's like 80% of a data scientist's work is cleaning and preparing the data, and the other 20% is complaining about how the data isn't clean enough. By the way, I'm going to apologize ahead of time: this is a really high content-per-minute talk, it's very dense, and frankly the slides are not that good. They're kind of boring, there's just a lot of words, so don't look

at them. We're much better to look at; we're not hard on the eyes, so look at us and listen to us, it's a lot better. The reason we wanted to do this talk was threefold. First, this is not something that people normally talk about: we talk about the models we create and the cool things and how well they did, we don't talk about all the process we went through to get the data where it needs to be. Second, we wanted to help people understand how we got to the numbers, so that when you read the DBIR or whatever report, you can look at a number and understand

the process that went into getting to that number. And finally, for those people who are new to data science, we wanted to give them an idea of what happens in reality with data science: here's what they spend all their time on data-wise, how they clean, and the entire process behind it. So this is really the agenda, but it's also the data flow; it's the process our data goes through. We're not going to really talk about how we store the data, but we store it in a git repo and it just

kind of lives there, so unless we mention otherwise, that's where all this stuff is happening. I'll just skip the formal stuff: I'm the mad scientist on the team. I'm Anastasia Atanasoff, I'm a data scientist, I have a background in security and pure and applied mathematics, I'm one of the co-authors of the report, and I do exploratory and statistical analysis on the data. And I guess I should point out I am not Anastasia, I'm Gabriel, in case there was any confusion between the two of us. So, talking about the report: if you aren't familiar with it, it's a report on like 100,000 incidents that happen each year, and the goal is for it to

be data-driven. We want it to be done with academic rigor, but we give it a really cheeky voice so that it's actually fun to read, because it's a report and reports aren't fun to read. You're going to try again? I'm sorry, it's not on. You told me before it was on. Well, I guess I wasn't live earlier; off switches, you know. Should I go back? No, I'm not going to go back, I'm not doing it. Okay, so the first thing we need to think about is: what's data? And the clear way to answer that is to Google it,

and I think this is kind of what we know data to be: it's something that's solid, it's unchanging, it looks the same from every single angle. By the way, did you know you can buy a ball of onyx on eBay for five bucks and get it delivered from China in like three days? That's impressive. I was looking for a ball and thought I was going to get some concrete thing, and I got this ball of onyx; I feel like I probably stole it from some ancient museum or something. So that's kind of what we know, that's the definition that

we know. But the reality is, this is what we feel data is like: it's a puzzle piece, and when we look at it we don't really know where it fits or what it means, but we're confident that if we analyze it and figure out its place in the puzzle, we can put this together and come up with a picture that really conveys some knowledge. Now, unfortunately, this is what data looks like. This is the floor of my rec room; I mean, not my rec room, it's my kid's. Well, no, it's actually both, we share the rec room, some of these are my toys. And so when we analyze data, the reality

is a bunch of pieces that don't actually fit together; they're not designed to easily connect, and so part of our job is to make the pieces fit together. Also, they can fit together in many ways. With a puzzle, it only fits together one way, you only get one picture out of it. If you're playing with Duplo, you can build anything you want: you can build a car or a boat or a car-boat, you can come up with whatever picture you want. And so the reason we need to talk about our methodology, talk about our process, is so that when we put the pieces together into a picture, we do it in a

repeatable, verifiable, maintainable manner, and preferably move from a manual process, where we can introduce errors, to an automated process. So the first step is getting and preparing the data; I'm going to hand it over to Anna for that section. All right, so when we get the data, we always look for a very robust and diverse data set, so we make a continuous effort to recruit partners from all over the world with various focuses, and we also want to maintain a good working relationship with current partners from year to year. These different focuses of course introduce some sort of bias, as bias is inherent in any kind of research. It's not dichotomous, so it could occur in any phase

of research, whether it be planning, data collection, analysis, or reporting, because it's independent of sample size and variance. You do have to care about random error as well, so it's always a good idea to identify the different types of bias that you may encounter during your research, such as selection bias, detection bias, exclusion bias, or even observer bias. We consider the bias-variance tradeoff, and then with random error, that's when it's a good idea to start looking into hypothesis testing and our confidence intervals. Some potential partners may not meet our data standards, so this is where we will review and assist in their data collection methods if need be.

Also, every partner stores their data in their own way, so what we do is provide them with a standard Excel file to use, or we give them a web form to enter their data into, and then we take the various XLS, CSV, text file, or whatever, and we hand-convert it into the standard Excel format in order to maintain consistency. For those cases where we receive non-standard source data, we build mappings per partner: we take their enumerations and map them to ours, and we reuse these mappings each year. Also, to complete some of those mappings we actually have to hand-code various features, such as adding in

NAICS codes for industries, and doing any kind of necessary conversion, such as timestamps, in order to maintain that consistency. Many times data is missing complementary encodings, so we either manually edit or we apply heuristic rules using a script. For example, if you have an incident or record that has a malware vector of web application in it, the rules will also add a web application asset into that record in order to help complete the data. If there's missing data, we contact the partner to see if they can provide it; if not, we automatically fill in the required field: we put Unknown in that required field.
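
A minimal sketch of what such a heuristic completion rule could look like in Python, assuming simplified VERIS-style field names; the exact paths and the "S - Web application" label are illustrative assumptions, not the team's actual script:

```python
# Hypothetical heuristic completion rule: if an incident records a
# web-application malware vector, ensure a web-application asset exists.
def apply_web_app_rule(incident: dict) -> dict:
    vectors = incident.get("action", {}).get("malware", {}).get("vector", [])
    if "Web application" in vectors:
        assets = incident.setdefault("asset", {}).setdefault("assets", [])
        # Only add the asset if it is not already present.
        if not any(a.get("variety") == "S - Web application" for a in assets):
            assets.append({"variety": "S - Web application"})
    return incident

example = {"action": {"malware": {"vector": ["Web application"]}}}
print(apply_web_app_rule(example))
```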

Also, because the data is anonymized, we add data to account for traceability. For example, there are fields that the partner does not provide but that we add in, such as DBIR year, source, or master ID, in order to help with that traceability. Sometimes we have to remove things from the raw data. For example, we look at the source data and we have to ensure that it's actually deemed an incident. How we define an incident is that it is a security event that compromises the confidentiality, integrity, or availability of an asset, so we don't include things where nothing was compromised, there's no impact, or it's just informational. We also remove unintended duplication, for those cases where an incident can be represented

multiple times in the raw data. And we also remove already-aggregated data. For example, if we get a text file and it just has counts by industry or counts by attack type or counts by data type, it's very difficult to make individual correlations with aggregated data, because we don't have access to the individual incidents. So we make a habit of reaching out to our partners and requesting that they provide not aggregated data but the actual individual records. For storing data we have three main directories: source data, first pass data, and review data. Source data is our original, unchanged raw data; we never delete it, and we always keep copies for each

iteration. We have our first pass data, which is where we just put the data into the DBIR format (our collection; well, I'll get into that later), so the data is examined and then prepared for import. And then for review data, we just take that first pass data, review it for completeness, and then it's ready for import. Typically we have a couple of members of the team look at the review data to make sure that it's consistent from this year compared to previous years, and that we're actually working off the same data set. Next we convert the data into JSON, which is what our schema is built upon. We use CSVs so we don't have to play

around with worksheets, and we don't have to play around with programming language parsers or anything like that. We use import scripts to convert the input CSV to JSON; each CSV is unique and requires its own import script. Overall we have three different Python conversion scripts to import those formats. The problem with these scripts is that they've been written by multiple authors in the past, so we have run into consistency challenges, and they're sometimes hard to maintain. One solution we implemented was to modularize those import scripts: we moved common portions, such as the heuristic rules, into separate modules, and then we have a single control script that imports all the modules and runs the actual conversion process.
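
A rough sketch of that "single control script" idea, with per-partner conversion logic in its own function and one driver looping over the input files; the partner names, paths, and field mappings here are hypothetical, not the DBIR codebase:

```python
# Hypothetical modular import driver: one converter per partner format,
# one control loop that turns every first-pass CSV into JSON.
import csv
import json
from pathlib import Path

def convert_partner_a(row: dict) -> dict:
    # Partner-specific mapping from their column names to schema fields.
    return {"incident_id": row["id"],
            "action": {"hacking": {"variety": [row["attack"]]}}}

CONVERTERS = {"partner_a": convert_partner_a}  # one entry per partner format

def run_imports(in_dir: str, out_dir: str) -> None:
    for csv_path in Path(in_dir).glob("*.csv"):
        convert = CONVERTERS[csv_path.stem]           # pick the right module
        with open(csv_path, newline="") as f:
            records = [convert(row) for row in csv.DictReader(f)]
        out_path = Path(out_dir) / (csv_path.stem + ".json")
        out_path.write_text(json.dumps(records, indent=2))

# run_imports("first_pass", "json_out")
```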

Okay, I get the picture. And do you like the animation? It took me a long time to do that. I think it's very good; it's the only one in the chart. I promise I didn't, like, animate from here on out, it's all animations... so, no. We don't need to talk a whole lot about the schema, because we're not going to be talking about the data, but what we do need to understand about the schema is what it means for the structure of the data. We needed a schema that could bring in both structured and unstructured data, because we have data out of SIEMs that is very structured,

tabular data, but we also have case reports that have to be coded up. So our schema is a JSON schema, which is actually a literal thing; there's a standard for JSON Schema, and what it allows us to do is get a hierarchical structure for our data. The problem is this introduces a couple of things that later on we're going to have to account for. The first is the hierarchical nature: we have, like, an action, and the action could be hacking, but we have varieties of hacking, so action.hacking.variety SQL injection, action.hacking.variety cross-site scripting, action.hacking.variety use of stolen

credentials. So we need to be able to capture at multiple levels of the hierarchy, and that's going to cause us parsing problems later on that we know how to solve, because I wouldn't tell you about them if I couldn't solve them. The other problem we have is that it creates multinomial data, non-exclusive multinomial data in fact. Normally what you'd have is a feature, and that feature would have a set of values it could take on, while our features can be multiple values at the same time: action.hacking.variety can be cross-site scripting at the same time as it's SQL injection, at the same time as

it's use of stolen credentials. The hierarchy, the JSON schema, is fine; it knows how to handle that, but again, as we get into parsing we'll see how we have to accommodate those challenges. So, a word about the schema, the technical realities of it. The schema itself was good, but we realized there were some fields we wanted for the DBIR, and there were other fields that we wanted for our open data set, VCDB, and then there were some fields that really were just exclusive to those two things, so the community schema didn't have them. So now we're up to three schemas. And by the way, each

schema isn't really represented by a single file; it takes multiple files to represent it, because the parsing scripts use different pieces of the schema. So now we've got five files for each of three types, and of course it's not a static thing, we change it over time, so now there are iterations of it. So now we have five files by three schemas by multiple iterations, and it turns out that if you take that and store it in a single repository under the same name and just hope that people understand which directories are which, you don't know what the hell is what. So instead we said, let's go

find a better way to do that. So we've broken it out by the repositories that the schemas live in: our DBIR version of the schema lives in our DBIR repository, and VERIS Community, the community schema, lives in the VERIS repository. Now we have a very clearly defined set of files, what they do, and what they're used for, and that's defined in the repository with them. And instead of adding directories within the repository for the different iterations, we just iterate the entire repository, so we have good traceability. These are things that are obvious, they're clear, no one would ever make these mistakes, right? Except us, you know, last year. Now the

next fun thing about this is that in reality there's the schema, and then there's the conversion script, and then there's the input file, and the theory is that the input file will be perfectly parsed by the conversion script, which will perfectly generate the schema. The reality is that physics does not require the same fields to be in each one of those; it's not like magically, because you said it was that way, that's what happens. What happens is that you have this and it all agrees the first time someone writes it, but then someone over here on the input file decides that they want a couple more enumerations, so they put them in there, but they

don't tell the guy who wrote the import script, so now it's only here. And maybe they even put it in the schema, so the schema had something, but no one actually bothered to come back and tell the guys that wrote these two other things to put it in there. Or they thought they were parsing it here, but someone commented that line of code out because it was really weird and they didn't understand it. And all of a sudden, over a couple of years, these things become inconsistent. So we spent a lot of time this summer going back and making them consistent, and the reality was the issues were not in any of the major

places; they were minor things that really had low coverage, enumeration-wise. But this is one of those challenges we run into: you have to have a plan for how you're going to keep all of those consistent. So the next step, once you've gone through that and you've parsed the data into the schema, is that you want to validate, because you want to trust but verify. We get different anomalies: missing values or typos are the most common. Sometimes there's always that one file that you look at and you're like, how could any of our tools have

output that raw data? Like, that is not physically possible. So you end up spending hours going and fixing that one up. And like Anna said, we have heuristic rules that we're applying, and we want to double-check those heuristic rules, so the validation script checks the heuristics. It also checks the business logic: we use NAICS codes for industries, and we want to check that even though that's a six-digit or five-digit number, it's one of the valid codes, because not all numbers are valid. The validation script does that.
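
A hedged sketch of what that validation step could look like, combining JSON Schema validation with a business-logic check on the industry code; the schema file name, the record layout, and the tiny NAICS set are placeholders rather than the real tooling:

```python
# Illustrative validation pass: structural check against the JSON schema
# plus a business-logic check that the industry code is a known NAICS code.
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

with open("veris_schema.json") as f:
    SCHEMA = json.load(f)

VALID_NAICS = {"541511", "522110", "62211"}  # in practice, the full Census NAICS list

def validate_record(record: dict) -> list[str]:
    errors = []
    try:
        validate(instance=record, schema=SCHEMA)   # structural validation
    except ValidationError as e:
        errors.append(f"schema: {e.message}")
    industry = record.get("victim", {}).get("industry")
    if industry is not None and industry not in VALID_NAICS:
        errors.append(f"business logic: {industry} is not a valid NAICS code")
    return errors
```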

This is another place where we ran into an interesting technical challenge, because it turns out validating 100,000 records isn't always fast, and our Python script ran really, really slow at it. So we're like, you know what, we can do this: we'll turn it into Node.js and we'll validate in Node.js. And that worked, it was fast, but it added a huge amount of technical overhead, because it added an entirely new toolchain to our process. And when you add that, it makes scripting a lot worse, because now you're calling Node.js from Python. On top of that, you tell your team members, hey, do you want to install Node.js, and they go no, and so they're like, you run it, and now you're the only person who gets to run it. We fixed that: we went back and

looked at the Python code and figured it out; now the Python is actually faster than Node.js was, but better, we condensed it down to a more maintainable product. And in fact, because it's all Python, we can import the validation script and run it along with the import script, so it's a single-step process. Of course, if we're validating, that kind of implies that if we find anything, we have to revalidate or reimport. We also get a lot of new data: partners give us data over time, they give us this set and they go, well, we've got some more, and they keep adding. And it turns

out that if you have 40 partners providing incident data (we had 70 partners, 40 or so of them provide incident data) and you have to do these iterations multiple times, it's a really great opportunity to look for speed, because I'm the one that has to do this. It turns out we had it down to four steps last year, but every hour I had to come back to my computer, type in some line of code, hit the button, and leave for an hour. And it turns out, when you do that four times at an hour each step, it takes four hours to go through each freaking iteration. It's my time, you know, and you

can't, like, have a beer while you're doing that. So the team finishes up the coding at like 5:00, they're like, great, Gabe, we're ready for you to start importing, and at like 9:00 I can finally have my freaking beer. And I want to have my beer earlier than that. So this year we're moving to a more push-button style. We're keeping state in the system so that we can just hit a button; it will realize, okay, these are the files that changed, those are the ones that need to be reimported and revalidated, and it will rerun that, assuming it gets no validation errors. If it gets them, it will let us

know, and that way we're done. It gives us speed, it gets us consistency, it gets us repeatability. In fact, it lets us parallelize this now, so that each individual partner coming in gets its own process, and we're running five or ten processes at the same time, which makes this so much faster. So now we've got 100,000 JSON records, right, and the way we analyze these is we print every one out, put them in front of a camera, and flip them really quick and look for... no, what we do is we put them in R; the whole flip-book picture thing isn't happening. So, who here has used R?

Yeah, a couple, okay. So R likes tables, right, it likes tabular data. Does R like JSON? No, R doesn't like JSON. It'll work with it, but it's not what it's designed for. So now we need to cram a tree, a hierarchical structure, into a data frame, and this is where we start to run into those problems with the hierarchical nature and the non-exclusive multinomial structure of the data. The way we handle this is we make every leaf in the JSON schema its own column. So there is a column for action.hacking.variety cross-site scripting, and it's a Boolean column, it's true or false; every record that has it is true,

and every record that doesn't is false. And we have another column for action.hacking.variety SQL injection, another one for lost and stolen assets, and so on. Now we've solved the problem of storing our non-exclusive multinomial data, but that doesn't solve the hierarchical problem, so we're going to have to add in some surrogate columns to deal with that. We're going to have to add in a column for action.hacking, and the way we're going to fill that in is that anything that has a variety under action.hacking is going to have action.hacking true as well. So now we've accounted for our hierarchical data, and that column is pretty easy to parse; it doesn't have the parsing problems that the non-exclusive data has.
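
To make the idea concrete, here is a small sketch of flattening one hierarchical, non-exclusive record into Boolean columns plus surrogate parent columns. The record shape and column naming are simplified assumptions, not the actual verisr output:

```python
# Illustrative flattening: every leaf enumeration becomes its own Boolean
# column, and every parent level gets a surrogate column set to True.
import pandas as pd

record = {
    "action": {"hacking": {"variety": ["SQLi", "Use of stolen creds"]}},
    "asset": {"assets": [{"variety": "S - Web application"}]},
}

def flatten(rec: dict, prefix: str = "") -> dict:
    cols = {}
    for key, value in rec.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            cols.update(flatten(value, path + "."))
            cols[path] = True                      # surrogate parent column
        elif isinstance(value, list):
            cols[path] = True                      # surrogate for the enumeration's parent
            for item in value:
                if isinstance(item, dict):
                    cols.update(flatten(item, path + "."))
                else:
                    cols[f"{path}.{item}"] = True  # one Boolean column per leaf value
        else:
            cols[f"{path}.{value}"] = True
    return cols

df = pd.DataFrame([flatten(record)]).fillna(False)
print(df.columns.tolist())
```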

Now, a final thing: this is all your current data set. You did all this process, you brought in your 100,000 records, but we don't want to just know about what happened today; we want to know about what happened last year and what happened the year before. So it's important to keep your data in the same format from prior years and keep it someplace accessible, so that when you import you're not just importing your current data set, you're also importing those historic ones. And you need to be able to upgrade them: if I use schema version 2 today and last year I used

schema version 1, I need some way to take those other 200,000 records and bring them up to date. This is all straightforward stuff, it's all easy, but it's all things to think about; if you don't think about them, you run into them when you try to do the process. So once we've got it all into the data frame, nicely into a place to analyze it, it's time to start exploring it. I'll hand it over to Anna to do that. So we've made the exploration process more... playing with the audio, I think I'm louder than you are, possibly; hello out there, okay. So we made the exploration process more efficient by

actually pre-generating a lot of the more common exploratory analysis. We auto-generate three reports where we slice and dice the data in different ways; by the way, we use knitr, which is an R Markdown-based reporting tool. How we slice and dice the data is that we look at the overall summary, we look at trends over time, and then also any significant changes since the previous year. The reason why we look at it from different perspectives is that we don't always know if a feature's value or its trend over time is actually going to be more interesting. And with these reports, we actually run them twice: we

run them for incidents and then for breaches as well. We also run these reports for the overall data set and then also for subsets, so this includes patterns, industries, or partners. Overall there are probably a total of around 8,500 figures generated. And this is another place, by the way, where we parallelize: these reports take a long time to run because there's so much data in them, so we end up running five or ten reports at the same time as well. In our workflow for exploring the data, we also don't wait until the data is complete to start doing the actual exploratory analysis; we can already gain insight

before the data is complete. We also start exploratory analysis early because it helps us to find any kind of anomalies for which we may need to go back to the raw data or the source, so we run these reports early and we run them often. Also, as a note, it's an easy mistake to only look at the current version of your data. Gabe and I make it a habit to actually explore the data frame manually, because sometimes we may be able to gain insight that the pre-generated reports don't capture, and the key thing is to always go back to your raw data. So our main objective is to make novel

findings and to start developing our hypotheses. These hypotheses help us to decide, all right, what do we actually need to talk about in the report. We look for things that are big, that have changed significantly between years, things that are commonly seen together, and things where we've seen patterns and they're almost predictable. But we also want to take care to see what's unexpected and what may cause that. It's also not enough to just find something that's novel; we have to answer the question, okay, why is it relevant? This is where we put the finding into some sort of context. First we have to see: is the finding actually true, or do we need to

fix something in the source data or the import process? We use tools such as association rules to help answer the question of why a finding is relevant. These association rules show us how enumerations relate to other features and what occurred most pertaining to the novel finding. We use some key attributes for association rules, such as support, confidence, and lift. We also look at the distribution of a feature with and without an enumeration; for example, if we want to look at action, we will look at the distribution with actor.internal as well as with not actor.internal. So we end up looking at how the entire distribution shifts, and we don't look at just one enumeration but at a set.
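
For readers unfamiliar with those metrics, here is a hand-rolled illustration of support, confidence, and lift over Boolean incident columns; the team does this in R with association-rule tooling, so this Python snippet only shows the arithmetic, and the column names are made up:

```python
# Support, confidence, and lift for the rule A -> B over Boolean columns.
import pandas as pd

df = pd.DataFrame({
    "action.malware": [True, True, False, True, False, True],
    "action.social":  [True, True, False, False, False, True],
})

def rule_metrics(df: pd.DataFrame, a: str, b: str) -> dict:
    support_ab = (df[a] & df[b]).mean()          # P(A and B)
    confidence = support_ab / df[a].mean()       # P(B | A)
    lift = confidence / df[b].mean()             # how much A raises the rate of B
    return {"support": support_ab, "confidence": confidence, "lift": lift}

print(rule_metrics(df, "action.malware", "action.social"))
```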

Not only do we look at incident data, we also look at non-incident data. It helps to answer the question of why a key finding is relevant, and it also helps to put our incident and breach data into some sort of context. Non-incident data is not as refined as incident data: we may not necessarily get what we need or want from the partner, there's no standard schema, and it takes a lot of time to do exploratory analysis because that's all manual for non-incident data. And then we make sure to communicate with the partner to make any kind of clarifications, or maybe

they'll be able to provide some additional data to us. The other nice thing about non-incident data is that we can correlate findings across multiple data sets. Say a partner sends us malware data and we find something that's interesting, a key finding; what we actually do is collect multiple data sets, and then we're able to increase our sample size, which decreases our random error, and we're able to verify those key findings. Non-incident data grows quickly and it's quite big, so what we use is 32 or 64 gigs of RAM and an SSD, which is very sufficient, so we didn't spin up any Spark cluster or

anything. And we also spun up a NAS to share our data files; this really helps to ensure consistency, so if Gabe generates a stat and I validate it, we're actually working off the same cleaned version, which saves a lot of time. Then, from exploring, we actually have to start thinking about moving into the writing phase. First we have to decide what actually constitutes the data set. Not all records are made the same; some records have a lot of unknowns in the required fields. So what we do is we score all the records for complexity and then we discard all those records that are below a certain threshold; to date, our threshold is set at seven enumerations.
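
A rough sketch of that scoring idea, assuming the flat Boolean-column representation from earlier; the real DBIR scoring is more involved, but the threshold of seven mirrors what's said here:

```python
# Illustrative completeness scoring: count informative enumerations per
# record (ignoring *.Unknown columns) and keep records at or above a threshold.
import pandas as pd

def enumeration_count(df: pd.DataFrame) -> pd.Series:
    informative = [c for c in df.columns if not c.endswith(".Unknown")]
    return df[informative].sum(axis=1)   # number of True enumerations per row

def keep_complete(df: pd.DataFrame, threshold: int = 7) -> pd.DataFrame:
    return df[enumeration_count(df) >= threshold]
```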

This helps us to answer, okay, what actually encompasses the data set, because the goal is that we want to represent all legitimate data. The data set will end up being a combination of subsets of the full data corpus; for example, in 2016 we had over 300,000 records, but the data set actually ended up being a little over 230,000 records after removing those low-complexity records. The next thing is to identify and review all the major subsets of the data; this is normally in the several hundred breaches to several thousand incidents.

For example, in 2016 our major subsets of the data were Dridex and web app attacks. We have to understand what causes those major subsets in the data, and we need to consider several things. First, did a limited number of sources report a lot of common low-fidelity incidents? This was reflected in web app attacks. Did a limited number of sources have a unique view that added a lot of high-quality incidents? This was reflected in Dridex. And then finally, did something unique happen this year that made something stand out across sources? This is where we actually started getting into our patterns, where we see the same types of attacks happening over and over. So then we have to answer the

question: okay, how do these major subsets affect the overall data set? Does it look like they don't affect it at all, do they skew it in a way, do they eclipse it entirely? We have to decide how to handle these. First, we can analyze the subset as part of the overall data set and just caveat the skew, which is what we did with Dridex in our report. We could also analyze the subset separately, which is what we did for web app attacks. And then we could also do both, which is where we start getting into our patterns. Overall, before writing we need to work off a single subset of the

data. This is our subset in VERIS, and by this point we've decided what to include and exclude from the data set. We're also looking at approaches for accommodating skewness. Specifically, you can reduce skewness by applying a transformation, where you replace a variable with a function of that variable. If you want to reduce right or positive skewness, where the extreme data results are larger (you'll have frequent small losses and few extreme gains), that's where you want to start taking your roots, logs, and reciprocals. If you want to reduce left or negative skewness, where the extreme data results are smaller (frequent small gains and few extreme losses), that's where you want to start taking your squares, cubes, and higher powers.
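
A quick sketch of those transformations on made-up numbers, just to show the direction of each fix (the data values here are invented):

```python
# Roots, logs, and reciprocals pull in a long right tail;
# squares and cubes pull in a long left tail.
import numpy as np

right_skewed = np.array([1, 2, 2, 3, 4, 5, 60, 400])   # few extreme large values
log_x   = np.log(right_skewed)        # common right-skew reducer
sqrt_x  = np.sqrt(right_skewed)
recip_x = 1.0 / right_skewed

left_skewed = np.array([0.1, 5, 8, 9, 9, 10, 10])       # few extreme small values
sq_x   = left_skewed ** 2                                # reduces left skew
cube_x = left_skewed ** 3
```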

Then the fun begins when we start actually generating our stats and developing our hypotheses. First, we differentiate between an incident and a breach in our analysis and when we report on the stats in the report. I already said how we define an incident, and we define a breach as a security incident that has confirmed data disclosure of an asset to an unauthorized party. We also need to be aware of what constitutes a filter. For example, if you take the field year, it can be interpreted three different ways. First, incident year: that's the year that the incident occurred. Second, we have DBIR year:

this is the DBIR year the data was assigned to, so our collection period is from November to November. And then finally we can look at it as data set year, and this is the year that the data was actually imported. So for coding up incidents in VCDB, say we code them up in 2016 but the actual incidents occurred several years ago; it would be inappropriate to include them in this year as current incidents in the report, because they occurred in prior years. The key here is that those incidents are assigned to the DBIR year associated with the actual incident year. We also have to be careful when people request stats regarding year. If

they ask for a stat across multiple years, we usually assume calendar year; if they ask for this year, we usually assume data since the last report. Again, we always communicate and clarify that we're talking about the same thing, and also, again, we exclude those old incidents that we just happened to code up this year but that actually occurred in previous years. Then we also have to be careful with catch-alls, for example Unknown, Other, and NA. What's measured, what's not measured, do we count it or do we not? Unknown we consider as not measured; it doesn't add any additional knowledge when we encode the unknowns, because it adds no data, so we typically exclude it. For

Other, we consider it measured, but it's just not covered in our schema, so it does add data and we do include it. Finally, with NA it actually depends on the question. If we're asking what percent of breaches involved a financial motive, we would typically include it, because some breaches just have no motive; but if we're asking what percent of motives are financial, we exclude it, because NA is not a motive. Those questions sound like they're asking similar things, but they're actually quite different.

Then comes counting n. How do we count n? The normal way to calculate it is to count how many of each enumeration there are, sum up the counts, and get our n. This doesn't work for those records and incidents that have multiple enumerations: for example, if you have one incident that is both social and malware, it will be counted more than once. So instead, what we do is subset the records that we care about or want to analyze, exclude our unknowns and potentially NAs, and then count the number of records, i.e. the number of rows, rather than summing up those features.
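
A small sketch of that counting rule, with invented column names: n is the number of qualifying rows, not the sum of enumeration columns, so an incident that is both social and malware is counted once.

```python
# Count n by rows after excluding unknowns, not by summing enumeration columns.
import pandas as pd

df = pd.DataFrame({
    "action.social":  [True,  True, False, False],
    "action.malware": [True, False,  True, False],
    "action.Unknown": [False, False, False, True],
})

subset = df[~df["action.Unknown"]]          # exclude unknowns (and NAs if present)
n = len(subset)                             # 3 records, not the column sum of 4
pct_social = subset["action.social"].sum() / n
print(n, pct_social)
```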

Next we look at confidence. This is very important when we're subsetting data, because our sample sizes decrease quite quickly; this is where you want to start looking into your random error. We look at confidence intervals and we want to see if they overlap, for example when we're working with statements like hacking was greater than malware this year, or hacking increased from last year. Just to recall: if there's no overlap in the confidence intervals, this implies that the difference is statistically significant; if there is overlap, it doesn't necessarily mean that the difference is not statistically significant. Depending on who you talk to, people define confidence intervals slightly differently. It is correct to say that there is a 95% chance that the confidence interval you calculated includes the true population mean; some other people define it as there's a 95% chance that the population mean lies

within the interval, which is not quite correct. Also, if you want to be more confident that an interval contains the true parameter, for instance if instead of 95% we want to look at 99%, the interval will become wider. If there's more variance in our data, that also creates wider intervals, and smaller sample sizes create wider intervals as well. And we know that there's an inverse square root relationship between confidence interval width and sample size, so if you want to decrease your margin of error by half, you technically have to quadruple your sample size. Overall, we calculate these confidence intervals, but we don't include them in the report, because a majority of our readership couldn't care less about the actual upper and lower limits.
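
For illustration, a normal-approximation confidence interval for a proportion and the inverse-square-root effect mentioned here (the counts are made up; this is not necessarily the interval method the team uses):

```python
# Quadrupling n roughly halves the margin of error for a proportion.
import math

def proportion_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float, float]:
    p = successes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin, margin

lo1, hi1, m1 = proportion_ci(400, 1000)    # e.g. 40% of 1,000 breaches
lo2, hi2, m2 = proportion_ci(1600, 4000)   # same 40%, four times the sample
print(round(m1, 4), round(m2, 4), round(m1 / m2, 2))  # ratio is about 2.0
```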

We can treat our data in two main ways. We can look at it from the binomial perspective, which is what we have done in previous years and to date; it depends on what question you're asking, binomial versus multinomial. For binomial, we work primarily with Boolean columns: for an incident, was it a social incident, true or false; malware, true or false; was there data disclosure, true or false. So again, this was our primary one, but the problem here is that it doesn't model interdependencies between features, so that's where it's nice to

start looking at it from a multinomial perspective, especially for those records that have more than one enumeration, because you're not only looking at a singular enumeration, you're actually looking at a set. We have a crap ton of stats in the DBIR, lots of percents. We can't ever say that there were X number of breaches in the past year; we never know the population size, we always work with sample sizes. So for our percentages we break it down by action or by pattern or whatever it may be. The problem is that when one enumeration increases, it may look like another enumeration is decreasing. For example, if you have 100 hacking breaches in 2015

and in 2016, and then you look at malware and you have 10 in 2015 and a thousand in 2016, malware is going to look like it's increasing, but it's going to make hacking look like it's decreasing, when in fact it was actually consistent per year. So with things like that, when we generate figures and they look a little bit misleading, we are very careful to annotate that for the readership to avoid any kind of confusion. Then, after we generate all these stats, we have to actually validate those hypotheses, which Gabe will get into.
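
The arithmetic behind that caveat, using the numbers from the example: hacking is flat at 100 breaches, but because malware jumps from 10 to 1,000, hacking's share of the total collapses.

```python
# Hacking's count is constant, yet its percentage drops from ~91% to ~9%.
counts = {2015: {"hacking": 100, "malware": 10},
          2016: {"hacking": 100, "malware": 1000}}

for year, c in counts.items():
    total = sum(c.values())
    share = c["hacking"] / total
    print(year, f"hacking share = {share:.0%}")   # 2015: 91%, 2016: 9%
```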

So I got to talk about validating the data; now I get to talk about validating those hypotheses. Well, we'll start with the figures. This is one of the things where, in reality, when people say they read the DBIR, what they really mean is, I looked at the figures, because that's the standard way. In fact, when we're briefing the DBIR, if I ask that question, who read the DBIR, there's like a few people; if you ask who read it first, a bunch of people raise their hand, and if you ask who looked at the figures first, you get a lot of hands, and then no one raises their hand for the reading question, because they weren't really sure where you were

going when you asked about reading first, and they don't want to not raise their hand and make it sound like they didn't even look at it. But people look at the figures; that's how we communicate, and so we want the figures to communicate our points clearly. We use the common ones, we use bar graphs and we use line charts, very standard ways to go, but you can communicate clearly without necessarily being simple; you can have complex figures that communicate clearly, and we try to put a little bit of variety in. We had a parallel coordinates chart, we actually had an arc plot in there, for everyone

that read all the way into the appendix. But the important thing is how you track these figures for validation. We talked about how we run the exploratory reports; every one of those exploratory reports, with every one of those figures, has a unique ID on it, and when someone from our team goes, okay, we're going to put this figure into the report or into the section I'm writing, we capture that unique ID and all the code associated with that figure, and it goes into a figure report for us. That figure report has the unique ID, it has the title, which is effectively the hypothesis for that figure, and then it has the entire code

to generate it. That way we can track it, and we can track it as things change, and if someone requests one of us to do a figure for them, we give them the unique ID. That way, as the report is being written and put together, we have the IDs of the figures that are in it and we have a report that can generate them all, so we have confidence that we can repeat those figures and do them consistently. We do this all with knitr, the same thing we do the exploratory reports with, and we do it with ggplot. Has anyone here used ggplot? Has anyone written, like, a ggplot figure

that was really nice and, like, under ten lines? It's big, but you get really nice, layout-quality figures from it; it's the way to go. And by the way, we wrote the report, and then, like they did last year, we went on a podcast with a professor who specializes in data figures and had him critique it, because we want to continuously get better. And you know what, he didn't say a whole lot, but you know what he said (and it doesn't count if you actually listen to the podcast): do you remember the flow graphs, the really bright ones that had lines and stuff in them? He's like, yeah, you

can't use those ever again, people can't read those. And we kind of knew it, but we included them for consistency; next year we're probably going to replace them with small-multiples charts. Also a great idea: he took the figures and he said, look, you've got this figure that's got this one really big bar and these little bars; why don't you put, like, a big red letter A on that bar, and then in your text you can put that same big red letter A. That way people that want to know what that bar is or why it's big can find the spot easily in your text. It's a great

recommendation, and it's one we'll hopefully see in the DBIR next year. And what we do for figures, we want to do for all of our hypotheses. So you see statements like hacking was greater than malware, or malware increased since last year, or phishing was 40% of breaches; every single one of those statements in the report this year was in a validation report. And we probably made a mistake in that this year, like last year, we waited till the report was written and then we wrote those; next year, every single time we give out a stat, we're going to give out a unique ID with it. In our validation report we're going to put that unique ID, we're

going to put the hypothesis (we've got this report for this year), we're going to put the code to generate it, and we're going to put the outcome. That way, every single number, every single statement in there can be tracked back to the logic which arrived at it, so it's very similar to the figure report. The important thing, or the reason that you need to do this, is because anomalies creep in, but there's something you can do about keeping those anomalies out. The first is to start from a single codebase. It's just bad juju if I'm running one version of R

and Anna is running another version, because we may come up with the same numbers, we may not; it depends what's changed between the versions. In fact, it's really bad if I actually change what version of R I'm running during the process, so we have a cutoff somewhere in the fall where we stop updating our statistical software until after the last iteration of that DBIR is published, because we don't want to break our workflow or change our workflow midstream. The next is you want a single data starting point. Like Anna said, we have a lot of records and we have to ultimately choose what

we're going to analyze as the main set and what's going to be analyzed as unique subsets separately, because we want to represent all valid data but we don't want one set of data to eclipse the others. So it's important to clearly label what that starting point is; all of the analysts that work with the data know where to start with their data, we all start at the exact same point, and that helps us with repeatability. Additionally, we also have a common analysis process: we all use dplyr. Who here uses dplyr, for our R people? Oh, come on. Wait, so we had a lot of people who used R, and very

few hands went up for dplyr; there we go, at least one person. Because if you're using R and not using dplyr, you are going to be amazed at how logical R can actually be, as opposed to feeding stuff backwards the way you normally do; you actually get to feed things forward. So we use this standard process for doing it, and because of that, it's very easy for us to all arrive at the same answer every time. On the other hand, there are places where we've tried to do things like the association rules; we talked about those. With the association rules, we've had issues where we

generated stats, and we generate interesting stuff in the association rules, but they're very hard to replicate. You have to start with the exact same data and the exact same columns (we can't run it on all of our columns, so we have to subset the same way), and then, like any machine learning algorithm, the arguments you give to it strongly affect what the actual outcome is. So unless you got all of that exactly the same as the first time you ran it, you're not going to get the same answer. So instead, for something like association rules, we're going to go back and recreate those stats using our normal dplyr process; that way we have confidence

in their repeatability. And by the way, like we talked about in this track yesterday, it doesn't matter what your stats are if there's not a "so what" to them. You're going to do all these great stats, you're going to do all this analysis, great confidence intervals, and when it comes time to write, tons of it is going to get cut out, just line after line. You would not believe how much; it's always our favorite, it's always the best analysis, isn't it, the stuff that we love that gets cut out. So it's important to just be prepared for that, because if there's no "so what" behind your data, it

doesn't matter how good the stats or the analysis were, it doesn't go in. So, looking forward, we're doing a lot of work. The first is on VERIS: we're currently iterating versions to increase the enumerations, to get more coverage on what people want. We're also going to fix some consistency issues in the hierarchy in the next version, and then we're going to go to sequencing, so that in VERIS you can record this action happened, then this one, then this one, or this action happened and it compromised this, which led to this. That will let us capture some additional granularity as well as being able to represent things

like pen tests. We're also doing a lot of work on verisr, which is the R package that parses VERIS data. We're doing some internal changes to make it easier to work with for the people who have to deal with this kind of stuff. It currently uses data.tables, which sound really great until they break everything you do about once a month, so we're going to move back off those; it sounded good when we did it. We're also putting in additional helper code and functions and more analysis tools within it; it's stuff that currently resides in a separate repository, but we want to make sure it's available to everyone. And we're putting

in more statistical tools and adding things so that other people can run confidence intervals when they're working with verisr. By the way, all this stuff you can download off GitHub. We're also working on our workflow, and when we're updating this stuff we're updating it in a consistent manner, especially updating the VERIS schema; we have a huge chart for everything, so that every single thing in the entire workflow gets updated at the same time and we don't have half the workflow on one thing and half on the other. That's hard to do, but it's something you need to plan for if you're going to maintain a tool like this, a data tool.

So we're updating our workflow, making it easier, more repeatable, faster, and generally trying to make the whole system easy to run, consistent, and repeatable. And before we get to the conclusion, there's one thing I want to talk about here. We all generate data products, and we've come to terms with the data, but not everyone in the world, in fact not everyone in information security, has. Because it's hard when someone's telling you something with data and your brain is telling you something different, and because you look at the data and it feels a bit like magic, that analysis process; even when the entire methodology is

disclosed, it still feels a little like magic, because your brain is saying, I don't know if I believe that data. But there's a sign in the UK, it's like art on the side of a building, and it says, your mind is crazy and tells you lies. And there's a lot of truth to that, because contrast that with what happens with your brain: your brain can organize all this; it'd be great if you had a process, but you could put this all into a single picture. But it takes a lot of skill and a lot of thought to turn that into something interesting, all

those disparate parts, and that's what your brain has to do. And the downside is there's no documentation of what your mind did, how those ideas were formed; you don't know what data went into it. Maybe your mind excluded all the little cardboard bricks, maybe it brought in a bunch of Lincoln Logs and added those because they were somewhere in your mind, and so it decided to include them in your model. You have no way of telling that, you have no way of questioning or validating the process that went into creating the picture, and you have no way of maintaining consistency. So, like, today I might look at this and know that it

looks like a flamingo, and a few months down the road I get the same problem with the same data and, like, great, that's a Tonka truck. You have no way of maintaining consistency over time; your mind is a black box. Andy Ellis, the CISO at Akamai, I think is planning on giving a presentation on this called the complexity apocalypse, and his basic thesis is that information security risks are now too complex for System 1, for gut-instinct analysis. And so it's important that we, as data-driven people, encourage people to build a mental model that satisfies all the facts, those that

come from their mind and those that come from the data. So, in conclusion: the data is not perfect, it never is, it never will be, but you're here to help it tell its story, not to torture it. Have a consistent process, and if possible automate for repeatability. More importantly, document everything you do, put a unique ID on everything you generate, question everything of interest, and communicate. Because when you communicate, for the people that consume your report it helps them understand the process that went into the numbers, and for you it helps you understand and learn from the people who are using your information. And so with that, I think

I left, good for me, I left like five minutes for questions. Do people have

questions? So, you mentioned the fun stuff that doesn't make the report because it failed the "so what" test; any favorites, or items you were disappointed by? It was all the malware analysis. We pretty much slice and dice specific malware names, and we wanted to see what families were very prominent; we did a couple more things with malware, but it just didn't make it in. Yeah, there was a lot around that. So we talked about the subsets of the data, and we talked about how non-incident data adds context. Well, part of the reason we left the Dridex subset in the overall data set,

as opposed to analyzing it separately, was because when we looked at the malware data you could see the impact of Dridex in it. There was just a significant amount of botnet data, and what it really was, was executables coming by email, or infected Word documents or Office documents coming by email, and so you could see how these bots were actually happening significantly, and it helped inform how we handled the data. And we had some really pretty cool pictures; they were all done, we did them up to final quality, and they just didn't make the final cut. But if you

ever get a chance to see either of our actual DBIR presentations, they're ours, they can't tell us not to use those figures, and we've got them, so I've got them all in my presentation. You're welcome. Two quick questions, somewhat related: you mentioned three different schemas; that's the first I've heard of multiple schemas. When it comes to the analysis and the storage of the data, is this something new to this year's report, or... We only operate on a single one of those schemas; we always use the DBIR schema, but there are some things, like if you look at the schema you'll notice there's a plus section,

and that plus section is marked to allow additional features in it. The idea is that if you as an organization have things that are relevant to you but not to other people, you can put them in there. Like, we used to deal with a lot of PCI data, so we have specific PCI enumerations that really aren't part of the overall schema but that we have. And then we use a DBIR one: DBIR year is actually a plus thing, it's in the DBIR schema but it's not in the standard schema, and the reason is because most people don't care about what year the data was imported into the DBIR, because they're not importing into the

DBIR. So it's just little helper stuff like that that's specific to us; we always use the DBIR schema, and the DBIR schema is 95% the community schema. That gets to my other question: have you published what the different schemas are and disclosed those differences within the report? Because I don't recall reading about them, so that as other data scientists, other analysts, take a look at the community schema, what's up on the community site, they can reproduce your results using your methods. Well, should they be able to reproduce... this gets back partially to the data. Because of the data sources, our agreements are that we

are not allowed to share the data. If you create any analysis on VCDB and send it to us, we'll run it on our data set and give you back the analysis, but our agreements don't allow us to share the actual data, and so ultimately there will never be perfect reproducibility. But I've got no problem sharing the additional features that are in the DBIR stuff we talked about a lot; you're going to be sorely disappointed, because there's nothing exciting in there. All the cool stuff happens in the community schema, because really what people care about, we call them the four A's, and in fact it was on the schema

slide, the thing with the circles; that's where 99% of our parsing comes from. And the stuff that gets into the plus section, in fact, unfortunately is a lot of stuff that they kind of tried; they're like, let's try to see if this feature gets coded in, or if we can fill stuff into it, and it turns out no, it didn't get used at all. So even in the plus section, other than the metadata ones like DBIR year, it's just all columns and no data; it's in the schema, but there's no actual data in it. But there's no problem sharing it, it doesn't really affect... Has it

been shared? I don't... so I'm trying to think what the best way to share it would be. On the VERIS website? But this is what I'm worried about: if I put that on the VERIS website, now I've put two different schemas side by side, and I don't want people looking and going, well, which one of these should I be using? The answer is you should be using the community schema, because that one's designed to meet the needs of the community, and if you use the additional features, potentially it would confuse you. And also we're not as good about keeping those additional ones as clean as we keep

the community ones. Like, there are some enumerations I was cleaning up this year and I'm like, hey, what is this field, why is this in here, and they're like, oh yeah, we tried that out like three years ago and it didn't work and we never took it out of the private schema. So I'd also be worried about confusing people, but I'd be happy to email it to you. We've got a blog, but it turns out that getting stuff onto the blog is a whole lot of work, so there's got to be some way; maybe in the... there's

a DBIR repo that we use that's got all the figures. In fact, if you didn't know this, last year we posted all the figures there; this year we posted the figures plus the data behind the figures in an RDA file, and next year I'm going to try to see if I can post the validation report in there. I could probably dump it in there; I'll need to be reminded, I forget things very quickly, but if you email me at db.com I'll make sure it happens. I'm just thinking this would go a long way toward reducing some of the criticism that the report has gotten over the years,

especially this past year regarding the vulnerability data; openness about the sausage-making. I think this presentation is great, and I think this should be shared more widely. Sure, if you think it would help, I'd be happy to do it. Yeah, I think this would reduce a lot of the cries and surprises, God knows, that we get set for with these reports. Yeah, cool. Thank

you