
Um, door bouncer, are you still letting people in? Okay, thank you. Okay.

Are you going to give me a thumbs up when the camera's rolling? Oh, we're going — we're live. Okay, we're doing it live.
All right, good morning. Thank you all for joining me here today. When I say so, I'm going to ask all of you — even our folks joining us remotely — to close your eyes for 30 seconds. I want you to think about your memories, experiences, and interactions in a library. Ready? Close your eyes. The timer has started.

Open your eyes. Now, I know some of you are spread out, so do your best, but I'd like you to turn to a person next to you. Ask them to share their memory of using or visiting a library, then share your memory with them. For those of you joining us remotely, please participate as well: just think of a memory and say it out loud, or tell it to a pet or a human who's nearby. Again, I'll give you 30 seconds. The timer has started — talk amongst yourselves.

All right, the timer is up. 30 seconds is up.
Okay, wrap up your conversations. Thank you. I didn't realize people were going to be so chatty — normally with this crowd, they're not. Okay, so by show of hands: who had a positive or inspirational memory? Show of hands, positive or inspirational. Excellent, hands down, thank you. Now, by a show of hands: who had a negative memory? Don't be shy, I won't be offended. Anybody have a negative memory? All right, you need to leave — no, I'm just kidding. No, no, thank you for your honesty, I appreciate it. Okay, so did you see? Most of the room had positive experiences. Now, if we were to ask people about their experiences with information security, do you think most people would have warm fuzzies? No, right? Librarians and infosec professionals have a lot in common. We are both subject matter experts. We both provide knowledge, guidance, and instruction. We both understand something that can be confusing or overwhelming at first. So why do libraries rank so favorably, and information security so unfavorably? Well, that's what I'm going to try and change. Welcome to Long Overdue: Making InfoSec Better Through Library Science. I'm Tracy Maleeff.
[Applause] So before I became InfoSec Sherpa, I was Library Sherpa. I have a Master of Library and Information Science degree from the University of Pittsburgh — go Pitt. I began volunteering at my local library when I was 13, and I worked as an academic, corporate, and law firm librarian before bringing my skills to information security. I'm here today to show you how we can learn from the successes and failures of librarians, libraries, and library science as a discipline to help make information security better, using my unique perspective of having a foot in both worlds.

Libraries have been around a lot longer than information security. This is obviously a very brief sampling, but I'm using it to demonstrate how young information security is compared to other professions and industries. We are still growing and changing, and we still have a lot of work to do on ourselves. Now just take a look at that for a second. Look how long ago some of the oldest libraries were established; some of the more quote-unquote modern libraries, from the 1600s and 1700s, are still operational today, like the Bodleian Library at Oxford University in England. But for information security, I was kind of grasping at straws here. I decided to pick the first digital computer, and then you've got ARPANET, and then you have Cliff Stoll's book. Look how recent that is — the second two are within most of our lifetimes. So I want you to understand that I feel this comparison is just, because library science has been around for a really long time and we can learn from it, and we have to remember that we are still a young, growing industry.

So to get everybody up to speed, I'd like to give you a super quick overview of modern library science — and I'm not kidding, this is a super quick overview. First, French librarian and scholar Gabriel Naudé wrote Advice on Establishing a Library in 1627, and that work is considered to be one of the earliest foundations of librarianship. Next, we found out whatever the hell
Thomas Jefferson was doing in Monticello — pause for obscure Hamilton reference, applause break. Anyone? Anyone? Okay, thank you. He created a classification system for his personal library. Then, in the year 1800, U.S. President John Adams approved — not improv, approved — an act of Congress which is considered to be the beginning of the Library of Congress. Quick show of hands: who's been to the Library of Congress in person? Excellent. Go — it is so cool. It's really cool. But then cut to 1814, when those pesky Brits burned down a lot of Washington, D.C. Do we have any British people here right now? Oh, okay, you know what, we're gonna have a word later. Those pesky Brits burned down Washington, D.C., including the books of the fledgling Library of Congress collection. So what did Thomas Jefferson miss? Second pause for Hamilton reference, applause break. Okay, no? We'll keep going. He missed nothing. He was on top of it, and quote-unquote generously offered to sell his private library collection to the U.S. government. Yes, you heard me correctly: not donate, but sell. In 1815, the U.S. government paid him roughly half a million dollars in today's value for about 6,500 books and the classification system he had devised. Once he did that, his private collection became the foundation for what we know today as the Library of Congress, and you can see a lot of his original books there.

So this is how we got started with libraries in America, basically: Thomas Jefferson collected books, devised his own classification system, and then generously got half a million dollars in today's value for that collection. Moving on, I do want to say that yes, I am very well aware that this is a super-duper American-centric timeline, but I'm afraid that is what I learned in library school. If you want to learn more, I encourage you to investigate on your own and do some research, but I just wanted to give you a quick snippet to bring you up to speed to where we are today. I am acknowledging that this is very American-centric, so apologies to our non-U.S. folks — you do have your own very robust and strong library histories.

Another quick show of hands: who here has heard of the Dewey Decimal System? Okay, well, sorry to burst your bubble — I'm glad that you've heard of it, but I need to share some harsh realities with you today: he was a jerk. Unfortunately, one of the things library science, IT, and infosec have in common is the presence of inappropriate behavior by professionals in the field. Yeah, I went there. So yes, while Melvil Dewey's classification system did revolutionize
library science, it was also very flawed because of Dewey's racist and misogynist beliefs. Librarians have been working tirelessly, as early as 1939, to correct all the racism and other biases that were built into his system. Presently, groups like the American Library Association have been removing Dewey's name from awards, and they continue to remediate the long-standing wrongdoings of his classification system. On a happier note, there are other classification systems that aren't racist or misogynist, so I want to take a minute to talk about controlled vocabularies in information security. Dewey is not the only classification system, but I understand why it's the one you're most familiar with, because it's prevalent in schools and public libraries. But as a law firm librarian, I used the Library of Congress classification, so to me the letter K means law — and I could not tell you what law is in Dewey Decimal, because I never had to use it as a professional. There are other ones, too: the medical librarians at the National Library of Medicine have MeSH. So this is something I'm very passionate about. My wish for information security is that we have an industry-wide infosec classification system, a controlled vocabulary. Some of what we use comes from the military, and we've incorporated it into our own infosec lexicon, but we also have some very unique
phrasing that, in my opinion, really should have a ruling on whether or not it should be used — I'm looking at you, thrunting and smishing. I'm looking at the two of yous. Yes, we have lots of resources, and we publish glossaries defining these terms, but what I feel we're really lacking in information security is a centralized authority that makes determinations on the validity of phrases. In library science, there's a publication called the RDA, Resource Description and Access. When you catalog a resource — which basically means applying descriptive metadata to that item — you consult an authority like the RDA. So my wish for infosec is that we develop a robust authority for the language that we use. It will be less confusing for users, for professionals, for journalists, for everyone. And please don't be thinking, or say to me afterwards, "Well, why don't you just do it?" This is not a one-person job. I think there might be a small group of individuals, who contacted me once, who are trying to work towards this. This is definitely a committee, organization, and institution kind of thing; it is not something a single person can do. You might argue, "Well, Melvil Dewey did it" — but he was a racist and a misogynist, and he had sole control over the vocabulary, and look where that got us. So we need a diverse group of people to give input and to be an authoritative body that decides what sort of language we use. And keep in mind, this is also going to help you communicate better with your stakeholders. If you start throwing around words like thrunting and smishing — I really hate those; that's why I'm publicly shaming them — do you think the CFO or the board that you want to approve your security budget is going to understand what you're saying, if you're using slang or words that can't easily be understood or defined? We need to come together on this, and that's something that I really, really want to happen, so I'm putting that out
there. All right, now I'd like to get to our main course. In 1924, a librarian and mathematician in India, S. R. Ranganathan, wrote the Five Laws of Library Science. For the updated interpretation, I like to keep in mind that the word "book" really goes beyond just a paperback or a hardback: it can mean a digital resource or any other offering that a library may have. Case in point: did you know that many libraries have 3D printers that you can schedule time with at no charge? It's true. You may have to bring your own filament — is that what it's called, the stuff you put in to make the thing? — but the machine itself is available to be used for free at many libraries. You can also borrow gardening tools — you can borrow a rake if you need one — and household equipment. Libraries are lending all sorts of things, so "book" really doesn't exist anymore; it's just "resource." So in 2022, consider a book to be some sort of resource or tool of any kind. Now you've had a chance to see what these five laws of library science are. I'm going to go through each one, dive in a little further, and draw some correlations to information security, because I want to emphasize how much we have in common and how much we can learn from them. So let's dig in a little bit to the
first one: books are for use. Resources are for use. How about: technology is for use, apps are for use, devices are for use. Wherever you see "library," I'm going to try to swap in a security term. Your security team should be a welcoming environment — insert chuckles here. Your security should provide easy access and convenience for the users. Do you do that? Do people know how to contact you? Simple as that. Are you squirreled away in a secret room in a building somewhere, or are you all distributed so that nobody knows where the centralized security team is? The primary duty of staff is as a curator. Do you know all of the security components, what's going on in your enterprise? Are you working with your IT teams? Are they using insecure software without you even knowing it? Are you curating all the security within your enterprise? Books are for all; security is for all; technology is for all. The security team should provide education for every person. Now, you may think, "Well, that's just security awareness." Okay, but do you really do a good job with your security awareness? Do you treat it as a throwaway thing that people don't care about? Do you mock it and put little value on it, or just spend time rolling your eyes and making fun of all the users? The security team needs to be knowledgeable. You need to stay on top of what's going on in the news. I actually said this to one of my bosses, Chris Krebs, the other day: I tell people all the time that they need to be on top of the breaking news stories, and you should be able, if asked, to translate them into layman's terms. If you're in front of a board, or some sort of governing body that's trying to give you money or support for your security team, and you can't explain to them in easy terms what happened with the Colonial Pipeline, then you need to go back and hit the books, as it were,
and stay knowledgeable. You need to anticipate the customer's needs. Is there new tech coming out — a new, I don't know, watch coming out? Are there new issues coming out, new attacks coming out? Anticipate the needs of your users. Don't have knee-jerk reactions; get ahead of it — it'll be so much easier to manage. This last one — if this does not ring any bells inside your head, then I don't know if I can help you. Does access control sound familiar, for information security? Libraries have access control too; it just looks different. There might be books behind the desk that are not in circulation. Did you ever go to a library catalog and see something listed as "no circ" or "non-circ"? It means you can't take it out, because they have access control: the book may be valuable, it may be stolen a lot, it may be rare. A lot of times there are very old books that you can't handle yourself — you have to get a skilled archivist or librarian. Libraries have access control; they've always had it. We have access control. We can learn from them. Every book has a reader; every security control has a user; every piece of tech has a user. Again, a classification system: we really need to get on board, as an industry and as a community, with one common language. And please, for the love of Jeff, stop coming up with these new marketing terms — smishing, phishing, squishing, whatever. All it does is confuse the people using the technology. It's phishing. Just call it phishing; that's all it is. Let's give the end users one term to remember and stop dividing it up. I have yet to hear an argument for why all these separate terms are necessary — if you have one, talk to me later, but I have yet to hear one — other than that it just causes confusion. And what else do we have? Awareness, person to person. Do you actually talk to your users? Do you interact with them? A famous story I
tell is from when I first started as a SOC analyst at a global pharmaceutical company. I replied to a user's email, to which she wrote back, "Oh, there's humans in security? I thought everything was automated." And I turned to my manager at the time and said, "Don't we talk to anybody here?" You need to interact with people. They're humans; we're human. Security, at the root of it, is a human problem. Sure, there's ones and zeros, and sure, there's networks and OSI models and all that stuff, but at the very root of it, it's a human problem, so don't disregard the humans in this equation. And again, education offerings: not only for the users, but for the staff as well. Is the staff up to speed? Are you given some sort of budget to take classes? Is it enough of a budget? These are really critical issues, and I know a lot of companies feel differently about this, but it's something you really need to stress. Or, if you're trying to negotiate for a new job, make sure you get some education credit in there, because you need to stay on top of things; otherwise you're just chasing after them and not really getting ahead. And use that terminology if you're getting pushback: do you want a security staff that's behind the times, or do you want a security staff that's ahead of the wave? Save time for the readers and the staff. Just like I was saying before, you need to save time for yourselves to learn, not only for educating the users. Increase visibility. Maybe have a security fair. I know a lot of people love to roll their eyes at me when I say October, Cybersecurity Awareness Month: get a cake. People love cake. Have a cake with a lock on it or something. Do something. Make it visible. Make security visible. We live with it in our lives every day.
Don't make it this secret-squirrel thing that you can't see and can't touch. Make it known to your customers, your users, your employees — whatever you want to call them — that it's a living organism that can grow with their needs and with the problems in the world, and that it's something to be celebrated, not feared, and not hidden away. Assess readers and understand their needs. Ask questions. Ask them what they know, what they don't know, why they do things — and you can do that in a way that doesn't sound condescending. A story that I love to tell: there was a woman who every single day reported the company-wide newsletter as a phish. Every single day. And all my co-workers would just roll their eyes — "there's Betty again" — blah blah. So I finally said, "Has anybody asked her why she does this?" No; they'd rather complain about it. So I called her one day: "Hi Betty, I'm Tracy from the security team. Just a question for you. These emails that you get, the company-wide newsletter — I see that you report them as a phish every day, and I'm just curious as to why you do that." Her response was, "I thought that's what I was supposed to do." So somewhere there was a breakdown in communication or instruction along the way, and rather than rectify it, the team I was on at the time thought it was more fun to just make fun of it. So I explained to her that this particular email is safe — "but yes, if you do see something suspicious about it, then report it; just don't automatically report it because you see it coming into your inbox." I spent at most maybe five minutes of my day going through that with her, and the emails disappeared, because she finally got the information she needed. She had been misdirected, misguided. It took five minutes out of my life, and in the long run it caused us less work, because what did the team have to do every day? Deal with that email. So I was just beside myself: you need to talk to people. That's the librarian in me — or you might want to call that HUMINT, if you want to be real infosec-y about it. Sometimes you can get an answer just by talking to someone, and I know that picking up the phone is anathema to many people, but you need to get over it. A lot of times, the way you get to the root of security problems is just by talking to someone.
So, in the original Five Laws of Library Science, the line for this one was "the library is a growing organism." That was updated to "the library is always changing." What's the only constant in infosec? Change, right? I often refer to infosec as a Sisyphean task. If you're not familiar with the tale of Sisyphus, he's the person who was doomed to roll a rock up a hill for all eternity. That's kind of what it feels like, right? And yeah, there's a downside — I'm not trying to bum us all out right now — but there are so many positives we can bring out of that task. We can anticipate threats coming, and warn people or batten down the hatches of our network as necessary. Consider the needs of your organization. Do you know whether your company is looking to grow into another country, or another state? Get ahead of that. Do some easy OSINT or threat intelligence: what sort of challenges might you face if you try to open up a call center in Qatar? Just to throw a country out there — not picking on them; watch the World Cup this fall. It's important to plan ahead. Are there going to be layoffs that could indicate possible insider threats? I know the security team isn't always on top of this, but this is where your person-to-person contact and your HUMINT come into play. Maybe put the onus on your upper-level bosses: "Hey, can you get looped in with HR? I'd like to know if there's a mass layoff coming, because that could really be a problem for us." Or if you start to see a lot of exfiltration of data, maybe run it up the chain: "Hey, are we about to let go of a bunch of people? Because I see a lot of people taking stuff out of the network." Things are always changing, so you need to be on top of it. And if you're familiar with the Ferris Bueller movie, you know: life moves pretty fast. You need to stay on top of it.

All right, let's take a minute. Deep breath. That was a lot of information I threw at you, right? So we've looked at how we can apply the Five Laws of Library Science to infosec, but I have a proposal for five laws of information security. Let's have at it. Now, some of you may be looking at these five and thinking to yourself, "Well, this isn't anything new; this is what we do now." Great — then you're better than most of the other people in
this space. But do you apply this consistently? Is it updated regularly? Are your SOPs and your playbooks updated regularly, or does one person hold the keys to all the playbooks — and then they leave the company and nobody knows how to do anything, because Bob was the only one who knew? Don't hoard knowledge. Hoarding knowledge doesn't make you more powerful. I think a lot of people confuse hoarding knowledge with power, when all it does is make your enterprise weaker if you don't share it. And do you actually practice what you preach? Does the security team have one set of rules — always logging in as admin — while admonishing others for doing the same? These five are just my suggestion, and they can always be improved upon, but the bottom line is: make sure your information security program has a clearly defined purpose and guidance that is adaptable to your users' needs, your team's needs, your organization's needs, and the changing world around us. You may have pages and pages of those SOPs or playbooks or guidelines stored on your intranet, but can people actually find them? Do they know where they are? Have they been updated? Does one reference an outdated version of an iPhone? Do you think anybody's going to take security advice seriously if you're referring to a Zune on your site? And do you have something succinct, in layman's terms, that people can follow? Sure, you may have more advanced users — then guess what, you can create a second set of instructions. How many times have I run into this: you try to discuss security with one of the IT folks who, because they're in IT, think they know as much as or more than you — especially me, being a woman in security. Well, they don't, because security is my job, not theirs. So fine: if you need to create a more technical explanation to account for those people, do it. Don't fight them; they're going to look down on you no matter what. If you want them to be secure, speak to them in their language, and if you want the less tech-savvy people to be more secure, speak to them in their language. It's not one-size-fits-all. Library science, information security: growing organisms. You need to grow and morph and change to serve the different needs of the users you have. So what do you think, thumbs up, thumbs down — do you think the five laws of information security could work? Right, awesome. Any thumbs down? Anybody who wants to, like, huff out of here? You're kind of on
the fence — I get it. No, I get you. It's hard. I'm sorry, I can't hear you. Okay, thank you. Yes. And the ones that I came up with may not be perfect; I'm just spitballing here, because this goes back to: we need that authoritative body. We need more input. This needs more work. But I'm introducing something brand new here — this is a first draft — and I want you to think about it. And if we can't do this as a community, do it for your own enterprise, your own organization. But wait, there's more. One other thing I want to cover is the reference interview — but spoiler alert, I have given this talk many times, and I am not going to go into it in depth again today, because it is itself an hour-long talk. This is actually a slide from one of the early versions of that talk. Most notably, I gave this talk, Empathy as a Service to Create a Culture of Security, at DerbyCon 2019. I was one of the speakers at the very last DerbyCon — which, side note, I was stunned when they accepted my talk, because I thought that was a very highly technical conference. But as it turns out, highly technical people need to know what empathy is and how to apply it, so I was very pleased with how well my talk went over there. I was also a keynote speaker for the Diana Initiative in 2020 with this talk. So you can view my full talk — either the DerbyCon or Diana Initiative version — on YouTube. On my last slide I have a link to my Linktree, where all my talks are posted. Don't watch it now; watch it later. Make some popcorn. It's good. But for those of you who have not seen this talk or heard about the reference interview before, I'm going to give a quick rundown of what each of these steps means. This is a seven-step procedure that is often
used in libraries, called the reference interview, as a way to really get to the root of what a user needs and wants. One: approachability. Are people in your organization scared to report that they clicked on a phish, or scared to approach the security team in general? Guess what: you have an approachability problem, and you need to fix it. Now, I'm not saying you go out and give hugs and unicorns and pink bunnies and all that stuff — you don't have to swing the pendulum that far. Be a human. Be empathetic. Understand that security is not their job, it's yours, and these things can be very scary and intimidating. You need to have an approachability factor. There are so many stories I could tell you, and I'll just tell you one very quickly, about how approachability saved one of the companies I worked for. Somebody from the marketing department said, "We're going to have the CMO of" — insert really popular app here — "talk to our marketing team of like 200 folks, and I want them all to download the new app they're going to release, so that we can make a good impression on the person," blah blah blah. But they were too afraid to submit it to the general security email, so this came to me directly, and I said, "Let me look into that app for you." Oh, it took me all of like two minutes to go, "Oh hell no. No. This is a very insecure product, and we are not doing this." I did not admonish her; I just laid it out to her. I said, "Look, these are the security problems. I do not advise this. Please do not encourage people to download this." Long story short, I got in trouble for handling something independently, but I made the right call — that's a talk for drinks another time. But I was approachable, and I solved a potential problem that could
have been huge. We would have had 250 people downloading an insecure product, using it, and then probably forgetting about it and leaving that insecure app on the network. Because I was approachable, I was able to nip that in the bud. I still take pride in that, even though I got in trouble. Interest: care about your job. If you don't like being in this industry, there are so many other jobs out there. I really have disdain for people with a bad attitude; it just irritates me. I made a choice to come into infosec. I want to be here. I want to help people, and I think you should too. Take an interest in solving the problems, not pushing them aside. I mentioned Betty earlier, right? She just kept blindly submitting the company's daily wrap-up because she thought she was supposed to. I took an interest in her problem, solved it, and also gave her some other anti-phishing tips along the way. You'll be surprised how much taking an interest in your users can eliminate a lot of the problems you see on your network. Listening: train yourself to listen for what people don't say. Yes, what they don't say. For example, oftentimes I'll have an employee say to me, "I got an email, I clicked on a link, I went to this weird-looking website, and now I'm calling you." Okay, let's rewind the tape there for a second. So I ask a very pointed question: "Well, what happened between you looking at the weird website and you calling me?" "Oh, well, it asked me to fill in my username and password, and then it didn't do anything, so I called you." Okay — well, your password is now compromised, so I'm glad you called. I did not say those words exactly, of course. But why didn't they say that up front? Either they didn't know it was important — because security is my job and not theirs — or maybe they were embarrassed; maybe they knew they did something wrong and were too ashamed, or didn't know how to phrase it.
but again, through approachability and through listening to what they didn't say, i was able to piece together what happened and was very quickly able to block off their account and change the password, so we didn't have an issue. you need to understand that a lot of times users don't know what is important to us, and this is why you need to train yourself to listen for what they don't say. a lot of people struggle with this and ask me for assistance, so here is a guideline i can offer that might help: keep the checklist in your head of how to remediate an issue, and if you can't keep it in your head, then physically write it down or type it somewhere. tick off the boxes when they mention things. if there's a box left unticked, that's when you circle back and you ask the specific question to get that box ticked. and it may be nothing; they may just say, oh yeah, i just closed it, i didn't do anything, okay. but they didn't say that, so rather than have an unknown, just ask. interviewing: so while it's good to be good at listening, again, circle back and ask those questions in a non-threatening way, a non-mocking way. you're not out to humiliate someone, you're there to troubleshoot
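the "tick off the boxes" idea above can be sketched in a few lines of python. this is a toy illustration added for clarity; the checklist items and the helper name are invented, not from the talk:

```python
# Toy sketch of the remediation checklist described above: tick off items
# as the user mentions them, then circle back on anything left unticked.
# The checklist contents here are invented examples, not an official list.

def boxes_left_unticked(checklist, mentioned):
    """Return checklist items the caller has not addressed yet,
    i.e. the specific questions to circle back and ask."""
    return [item for item in checklist if item not in mentioned]

phishing_checklist = [
    "clicked the link",
    "entered credentials",
    "downloaded a file",
    "closed the page",
]

# what the caller volunteered: "i got an email, i clicked on a link,
# i went to this weird looking website, and now i'm calling you"
mentioned = {"clicked the link"}

for item in boxes_left_unticked(phishing_checklist, mentioned):
    print("circle back and ask about:", item)
```

the point is only the shape of the workflow: anything left unticked becomes a specific, non-mocking follow-up question rather than an unknown.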
ask specific questions to get the answers you need to fix the problem. and sometimes you do get resistance. one time i knew for a fact that a user had given up his banking username and password; because of, you know, the blinky-box tools that we have, i was able to clearly see the POST request where he was giving up his information on a fake bank site. but he swore up and down that he didn't do it, that he didn't do it, that he didn't do it. and i'm like, i got receipts, dude (inside my head i'm saying that). um, so i left him with all the contact information for his bank. i was like, okay, well, you know what,
i think you might want to reach out to your bank, here's all their legitimate information. so i get it, i get that not everybody's going to be willing to work with you. leave them with something that they can do in their power, and then you've done everything you can, and then document, document, document. searching: i've already mentioned it a bunch. ask a human, do whatever you need to do to get a second opinion if you're not sure about something, or to get help solving a problem. i can't stress this enough; sometimes a single phone call can solve something. um, just another quick story, i always like to have story time, remember the librarian aspect, we'll do all these little tidbits of story time. i was working late one night and i saw one of the computers in europe was going nuts, you know, lighting up the enterprise and lighting up the network, and because of the time of day it was over there, i was like, oh, this is sketchy. um, i looked up who the user was immediately. it was a security guard, okay, it's probably an overnight security guard. they were downloading a bunch of stuff. i reached out using the company internal chat system, not getting any answers, and it dawns on me, this person might not speak english. so i went to an online translator; the very first sentence was, i am using an online translator, and then i proceeded to ask my questions. i got a response right away in that non-english language, threw it into the translator, went back and forth. he was just bored, downloading photos. i asked him to stop, he did, he was so sorry, and i then walked him through, using the translator, how to run the antivirus and all that stuff, all these things done remotely. i didn't panic, i didn't think, this is it, the russians took over the network. you know, look into stuff, and then i had that thought of, oh, he might not speak english
you know, do things like that. but also, i stress, if you're going to use an online translator, tell the person you're using an online translator, because it doesn't really translate exactly well, and it might look really strange to them, and that might also give them pause. that's also why i used the internal messaging system. but again, he knew, okay, she's using an online translator, this is why the language looks very strange to me. answering: what does your correspondence, your messages, look like to your consumers? do you sign off? do you have a sig file? do they know that they're corresponding with the manager or a SOC analyst? and that's not an elitist thing, but if you're an end user experiencing a problem, i kind of like to know who i'm dealing with. i know i'm gonna make a bad joke here, but, yeah, i need to speak to a manager. well, yeah, because maybe you're not really getting anywhere with someone, and maybe that's a fair assessment, maybe that analyst doesn't care about their job and isn't doing all these steps that we're talking about today. so again, how do you tell your internal customers that you're working on things for them? in my talk, and again i don't give too many spoilers because i want you to watch my original talk, i give a sample of an email where two sentences convey four very important pieces of information: it acknowledges the person's problem, it tells them when to expect an answer, it acknowledges what the issue was, things like that. so your correspondence doesn't have to be pages and pages long; it just has to be succinct and give information. i know that i as a customer would feel so much better having a two-sentence email with four pieces of information, to know, oh, i can expect to hear back from them by the end of the day tomorrow, they know that i reported a suspicious email, they know
who i am, i know who i'm talking to. that would put me at ease. and then follow-up: are these solutions you put into place actually working? no, i don't expect you to follow up with every single betty in your organization, but can you just tell by looking at your logs that the things you put in place are working? are you still having the same amount of false positives, or all the other indicators; are these things working, do you really check up on this? or if you do have an individual frequent flyer who always contacts security, do you check up with them periodically? those things are important because some people might need a little extra hand-holding, and yes, some of those people might be executives, okay. if you have an executive who travels around the world a lot, going to different countries: oh, have we scanned your phone since you got back from vietnam? we should probably do that. are you following up on things? this is another way that you can get ahead of potential problems and really nip them in the bud, through follow-up. so i'm going to start to wrap this up now. what did we learn today? we can learn from library science that there are so many applicable
lessons, and failures, to be learned, but let's talk about the lessons and the positives. quick show of hands: do you feel like you've learned something new today that you can apply to information security? you feel like you've learned something? excellent. there's a lot we can learn, and there's still more we can learn. please go back and watch my Empathy as a Service to Create a Culture of Security talk; like i said, that whole talk is just about the reference interview, and i go a lot deeper into it there. we could really use five laws of information security as a guideline, and we could really use a standard vocabulary; these are things we need to work towards as an industry and as a community. and just look at all these things that we went through today (yes, i'll figure out how to get my slides posted), but use the reference interview techniques and adapt them to your particular situation. i know it's not one size fits all. remember, like the library, security is a growing organism: change them as you need to, but have some guidelines in mind. at the very root of it, security is human-centric, and we need to think about it that way. that's what the library did. the library knows that it's dependent upon its users; they keep all kinds of stats about their circulation and who comes through the door and all that stuff, because that's how they get their money a lot of the time, based on grants. so think of it that way: you need to think about the human element and how you can improve security. i kind of already asked this, but just one last show of hands: everybody feel good about this? you feel like library science can help us in the future? everybody feel good? excellent. all right, you've been a fantastic audience, thank you very much. i'm Tracy Maleeff, InfoSecSherpa. you can see my talks and articles and things like that on my Linktree. thanks for coming
and i'm happy to take questions and whatnot. yes, in the back. can you come to the mic? i'm sorry, it's really hard to hear
so, i feel there's a term, which one isn't relevant, but translating the term from a technical definition, for people who are working in the space, to a non-technical definition for executives and policy people: there's a big disconnect in what they actually end up hearing. like, they don't understand; they are thinking one thing and we're thinking another. what's the best way to reconcile that with the technical definition? i mean, how big of an organization are we talking? is it something where you can actually have, like, face-to-face meetings, or too big? too big, okay. um, do you have a carved-out piece on the internet that people can go to, where you can establish these explanations and terms, and make a big deal about drawing people to it? there is, well, there's a lot of competing people, but they're all trying to explain it to executives, and the executives are like, we got it, and you're like, you didn't get it. um, is there somebody at the company who has the authority to oversee, like, language and stuff like that? sorry, i don't have all the answers; i know this is a thing the community as a whole has, this problem. yeah, it's at every single company, the government's got it. i'm probably gonna have to ruminate on this a little bit. i do apologize, i don't have an answer off the top of my head, but if you wanna, like, maybe just tweet that question at me, i can give it some thought, maybe write a blog post about it or something. thank you. sorry i couldn't answer off the top of my head, i don't have all the answers, so thanks for coming
yeah, sure, i can, especially if there's another speaker coming, i can get out of here. all right, okay. so if you have any other questions, we're going to be getting another speaker going, but tracy is going to be out in the hallway if you have anything. thank you so much for coming, and thank you to our sponsors as well [Applause]
hello everybody, thank you for coming to BSides. we've got russell and chris here; they're going to be doing cyber threat modeling, so it sounds pretty cool. once again, thanks, and i'm going to hand it off
hi there everyone. okay, we are from RMS. next slide. um, my name is russell thomas, and i've been at RMS about three years, give or take. i come to the organization with a background in data science at a regional bank, phd work in computational social science, and before that, high-tech r&d, marketing, and manufacturing, with a bachelor of science in electrical engineering and management. chris? my name's chris voss, and i've been at RMS for about seven years. i've been working on our cyber risk model since the outset. i have a background in mathematical modeling, particularly in the context of natural catastrophes, so i have a master's in risk and environmental
hazards and a bachelor's in physical geography, both from universities in the uk, as you can probably tell from my accent. so, as russell mentioned, we do cat modeling, and for many of you in the audience, when we say those words, this might be what comes to mind. however, unfortunately, today we won't be spending the next 45 minutes checking out different designs of denim jackets for cats and that sort of stuff. instead, what we'll be talking about is catastrophe modelling, which is a field within mathematical modelling focusing on quantifying the risk associated with rare, severe phenomena: things like hurricanes, earthquakes, floods, pandemics, and of course cyber attacks. what we have on this lovely slide here, that one
please, russell. it's just an example of some realistic but synthetic hurricane tracks from our north atlantic hurricane model. so, when we say rare catastrophic events: one of the features of this talk is going to be some Seinfeld memes, and i think we would all agree this particular episode, where Frank Costanza fell backwards on the little Fusilli Jerry statue and ended up in the proctologist's office, was a very rare, very catastrophic event. so, as we mentioned, russell and i focus on cyber risk modeling, but this is just one of a whole suite of different catastrophe models that RMS, our firm, builds: everything from north atlantic hurricane models to european flood models
asian earthquake models, and, you know, terrorism, pandemic, and of course cyber risk models. our primary clients sit within the insurance sphere. this little slide here just gives an overview of the insurance value chain, starting from individual companies, who, if they want to buy insurance, will speak with insurance brokers and get a policy off an insurance company. but those insurance companies typically don't want to hold all the risk on their balance sheet, so typically they interact with reinsurers to shift some of that off, and you can see the flow of risk from left to right here. now, RMS as a company provides risk quantification tools and
services throughout this value chain; in the context of cyber risk, we're primarily on the right-hand side: insurers, reinsurance brokers, and reinsurers themselves. so what do we actually mean by risk? well, very simply, risk is often defined as the product of likelihood and impact. in our context, impact actually means the direct financial losses experienced by companies that unfortunately are on the bad end of a cyber incident. on the right-hand side here you can see some of the types of impacts that are considered by our modeling, everything from lost revenue that occurs during an incident; of course, as we know, when a bad ransomware incident occurs
often that means a company can't operate at 100 percent, so we'll be quantifying that sort of thing: quantifying forensics costs, incident-response costs, some kinds of fines, notification costs, ransom payments, and so on. however, what we don't quantify is things like post-incident upgrades, loss of share value, and that sort of thing, primarily because these are the sorts of losses that are not covered by insurance contracts in the cyber insurance sphere. as a result, we really focus on incidents that are above a particular severity threshold. we're really only interested in stuff that causes realized financial pain to companies, not so much the other kinds of incidents that network security folks are concerned about, intrusions that they want to follow up on, etc. if it results in a financial loss, we're interested; if it doesn't, not so much. yes, please. and as part of this, we model a diverse range of different types of cyber incidents, everything from data breaches to ransomware attacks, wipers, cloud outages, and this sort of thing, and each of those different, what we would call, sub-perils of cyber have different likelihoods and different associated financial losses. and of course, depending on what type of company we're talking about, a small company versus a large company, the likelihoods might be different; depending on what industry we're talking about, the likelihood also might be different. next
please. okay, a quick show of hands: does anybody here work in the insurance industry, cyber risk? okay, interesting. anybody here do risk modeling as a business, as opposed to in-house? okay. so what i want to contrast in this next series of slides is how the different perspectives on risk vary and overlap but are significantly different; that's really critical to understanding how we approach modeling and how it may be different, especially from the enterprise view. if you're a risk manager in an enterprise, essentially all risks, all bad things that can happen to your business, are important. so if you get hit with a really bad event that causes big financial losses or big reputation damage and you've got to go in front of your board or something, that could be a catastrophe as you define it. but from a population standpoint, if you're the only organization that gets hit by that, the people that look at populations, like governments and regulators, may not see this as an extraordinary event; it's bad for one, but not necessarily for the population. but if you start seeing events hitting many organizations roughly at the same time, especially the same type of attack and attack severity, now we've got a population-level catastrophe. now, it's critical to understand that insurance companies view and manage cyber risk in the context of a portfolio: that's how they decide what the premiums are going to be, what the rules for coverage are going to be, how much capital to allocate, and how to acquire reinsurance and report to regulators. so portfolios have boundaries, who's in and who's out; portfolios have rules of coverage, in terms and conditions; and every customer is going to buy different levels of coverage. so while a single insurer may look at the population and take that vantage point, they're always looking at a subset, and a key part of the RMS product is to help insurance customers go from the population or macro view down to their particular portfolio view and say, what does this mean for the types of customers we cover? so what RMS does in our model, in this version six that we've just introduced, is model a synthetic population of all firms above a certain size threshold, and further, we separate this synthetic population by industrial sector as well as geographic jurisdiction. critical to our modeling is the footprint of given attacks: we're concerned about how many threat actors there are, their campaigns, and what their footprints are. different campaigns can have different footprints, and that can affect who's affected and how many. some may be horizontal, some may be vertical, and in the worst-case scenarios they may cover a very, very large portion of the population, and this is really a prime concern to our
customers and to our models. cool, thanks russell. so to formalize this, i think it's helpful to reiterate a couple of things. the first is that we can think about risk in two categories. first, what's called attritional risk: essentially these are incidents, like russell mentioned at the very beginning, which are sort of independent, which can still be substantial in scale but are not associated with many, many companies being hit. an example of that would be something like the 2017 Equifax data breach, which was brutal for Equifax, but it's not like it hit thousands and thousands of companies simultaneously. then we have tail risk, which in our parlance is really focusing on low-probability, high-severity events that hit many, many companies simultaneously; WannaCry and NotPetya might be examples of tail risk, and of course we can all imagine much more terrifying examples than any of those that might potentially occur. so we wanted to touch on how this influences insurance premium, because that might be a touch point that you folks have directly with the insurance industry. insurance premium really is covering the average loss, or the mean loss, that your company might experience in a given year, and this includes both attritional and tail
risk. on the right-hand side i've got a very simplified example of an imaginary company, and let's say an insurance company comes up with a $10,000 technical premium. a technical premium essentially is defined as the amount they're charging directly to cover the claims that you might bring; it doesn't include profits or other sorts of costs. next one please, russell. so again, these numbers are made up, but hopefully it will help us follow through. the attritional component you can see here is about $7,500, and the way this might be computed, again, this is a simplification, is that you take the mean loss that the company might experience conditional on an incident: given that they've experienced an incident, on average, what's the dollar cost? then you combine that with information about the likelihood, or probability, of that happening, in this case five percent for easy numbers, and you get a probability-weighted loss, which in this case is $7,500. but that only covers the attritional component, those sort of independent events. what you also need to consider is the tail component, often called the catastrophe load, and this is then considering, okay, what about all those events which might hit loads of companies simultaneously? in this case our example says that, on average, if our company gets caught up in one of those events, the loss is going to be $250,000, but there's only about a one percent chance in a given year that this company gets caught up in this, and when we do the probability-weighted loss we get $2,500. sum them and we get $10,000. so one thing that we thought was useful to mention is that russell spoke about the different scales of risk, and what's catastrophic in the eyes of an
insurance company or the eyes of a population. it's worth mentioning that for an individual company, you could imagine that perhaps a brutal double-extortion event might be the worst-case scenario, right, where all of their commercially confidential information gets stolen, the personal information they hold gets stolen and leaked, and at the same time operations grind to a halt because everything's encrypted. horrendous. however, due to the scaling properties of double extortion, those sorts of attacks might not be the drivers of population-level catastrophic risk; it might instead be something like a wiper that is rolled out through a worm or something like that. ultimately, depending on the angle at which you're approaching risk management, you might be concerned about certain types of incidents over others. next please. so before we really get into the meat of the presentation, we thought it'd be helpful to spell out what catastrophe models, in particular our cyber risk model, do and don't do. what it does is assess the likelihood of different loss outcomes; we're not trying to make predictions of exactly what will happen. to use a simple, maybe appropriate, las vegas analogy: if we imagine a die, what our model is saying is that the die has six sides and each side has a one-in-six probability of being rolled
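that dice framing is the same probability-weighted arithmetic behind the technical-premium example a couple of slides back. here is a minimal sketch in python, my own illustration rather than RMS code; note the $150,000 attritional conditional mean is an assumption back-solved from the stated 5 percent and $7,500, it isn't quoted in the talk:

```python
# Technical premium = expected annual loss = sum over components of
# P(event in a year) x E[loss | event], per the made-up slide numbers.
# The 150_000 conditional mean is back-solved from 5% x ? = $7,500 (assumed).

def probability_weighted_loss(annual_probability, mean_loss_given_event):
    """Expected annual loss contribution of one risk component."""
    return annual_probability * mean_loss_given_event

attritional_load = probability_weighted_loss(0.05, 150_000)  # independent incidents
cat_load = probability_weighted_loss(0.01, 250_000)          # the "catastrophe load"
technical_premium = attritional_load + cat_load

print(attritional_load, cat_load, technical_premium)  # 7500.0 2500.0 10000.0
```

the same decomposition extends to many sub-perils: one probability-weighted term per peril, summed, which is why the catastrophe load can matter in pricing even when the tail probability is tiny.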
what we're not saying is that the next roll is gonna be a two. if we did know that, then we wouldn't be here, we'd be in the casino instead. what else does it do? what it really aims to do is capture the key drivers of risk. what we're not trying to do is reflect all of the complexity of the real world, all of the huge depth of technical complexity that you're all super aware of, and all the complexity of the decision making of threat actors. ultimately, we're trying to identify what is really driving risk, and ultimately this comes back to the fact that all
mathematical models are simplifications of the world. they are helpful decision-making tools, but they're not supposed to reflect all of the ugly details of the real world. and finally, what our models do is complement expert judgment as a decision-making tool. they're not supposed to replace humans, because, as i mentioned before, these models are simplifications, so we need expert judgment on top of them; some people might think the output is too high, too low, etc. so now that we've laid the groundwork, we'll move on to our first lesson, which is the benefits of causal risk modelling. a causal model is a model that represents the causal or mechanistic
relationships in a system. on the flip side, statistical models are models that reflect the mathematical or statistical relationships between different variables. next slide, please. so here we have a toy example of a statistical model that you might use in the context of cyber risk. this follows a very popular framework, the frequency-severity framework. on the left-hand side we have a probability distribution showing the likelihood of a company experiencing an incident: here you can see there's about a 75 percent chance that it doesn't experience an incident, just over a 20 percent chance that it experiences one incident, and a small percentage that it hits two, three, or four incidents. and then on the right-hand side you have the severity side, which basically says, given that a company has experienced an incident, what is the range of potential dollar-loss outcomes and their associated likelihoods? on the right-hand side that's a probability density function, so essentially where the curve is highest you're most likely to see an outcome, but we can see that, all the way out to very large numbers, there's a chance that the loss plays out that way. one thing these kinds of models are good for is when you have a lot of data, but what they're not very good for is trying to quantify very extreme outcomes that
you haven't observed. if i want to ask the question, what's the likelihood that next year 25 percent of companies get nailed with a wiper, this sort of model can't help me, because we haven't observed anything like that in the past; we need to use different techniques. yeah, so just to underline that last point, i want to share a brief story, a conversation i had in 2008 or nine with a famous security consultant and keynote speaker. (is that 15 minutes? did you say 15 minutes? thank you, you scared me.) so anyway, he was arguing against the possibility of ever quantifying low-probability, high-magnitude loss. his argument was, if you take infinitesimal probabilities against incredibly large numbers, with your margin of error you can end up with any result. so he was just trying to dissuade me and other people from going down that path. so, coming back to our Seinfeld meme, how would any of you estimate the likelihood, the probability, the risk associated with this particular loss event? well, i would challenge you to take a standard statistical model of frequency and severity and apply it to this; it's very hard to get off the ground and have any credible information. so what we at RMS do in our version six model is model a synthetic world that has all of the key elements, firms, software, vulnerabilities, campaigns, threat actors, and we connect them in a mechanistic or causal chain. so in this case, if we wanted to include this, we'd have to have a threat actor of the type Kramer, and Kramer would have to have the capability of building weird things like little statues, and his attack pattern would be leaving that statue on the ground where somebody might fall on it. and then the causal mechanism is, how is somebody like Frank, the potential victim with the vulnerability here, likely to fall, and if he falls, what's the likelihood that he's going to fall in a particular way such that he's going to have to visit a proctologist? so, to go a little bit deeper into this synthetic world as russell described: we here are really trying to call out, as i
mentioned earlier on, the key drivers of risk in the cyber risk ecosystem. so here what you'll see is components: everything from individual threat actors that we spawn, which have various different characteristics (size, skill, motivation, etc.); things like software, different kinds of software that exist with different market share, and what kinds of companies use those pieces of software; vulnerabilities, of course, which are crucial to understand, the rate at which those vulnerabilities spawn, their characteristics, and of course what their exploitation characteristics look like; but then also things like the different ways in which threat actors can gain initial access into corporates. things like social engineering have extremely
different scaling characteristics to worms, for example, and modeling the specific ways in which those play out is super important. so those of you who work in a field a little bit closer to ours might be wondering, how on earth do you operationalize all of that? well, this is a bit of a simplification, but ultimately the way this works is, for 50,000 synthetic years, we simulate: we initialize a world in which you have software with different market-share characteristics and different vulnerabilities that have been spawned with various different characteristics, and then of course we put into place different ways in which threat actors can get into companies and different bad things they can do once they're inside. those threat actors essentially are able to evaluate their different options each year (and russell will talk a little bit later on about how we go about that), and then these threat actors choose what nefarious thing they decide to do, and we essentially rinse and repeat this process, exploring lots of different states of the world, because none of us can write down on a piece of paper now exactly what's going to happen, from a vulnerability perspective, over the coming year. so we need to explore a broad range of potential outcomes, and once you repeat this process, essentially what happens is you get synthetic attacks occurring, some of which are much larger in scale than others. anything else? yeah, just, for those of you who might be familiar with epidemiological models of infectious disease, especially across a network or a geography: similar structure. those models take the bacteria point of view or the virus point of view, and the susceptible hosts provide the backdrop for the virus to move around. in this case, the threat actors take the primary active role in decision making and determine what campaigns happen, and the firms, the enterprises, by their policies and practices, more or less determine the environment in which they operate. cool. so as you can imagine,
building this sort of model as opposed to a statistical model is a huge amount of work both from a time effort and data requirements perspective so why on earth do we you know put ourselves to all of that well as we mentioned earlier on it's our firm belief that you cannot robustly quantify extreme cyber risk through simple statistical models and through this kind of model that we've described extreme events are emergent from this particular model and the reason why that's the case is you will get years inside of this 50 000 year simulation where you might have an extremely large skilled and aggressive threat threat group who are operating in a year in which there are a slew of you know very very
high severity broad reaching vulnerabilities and it's essentially the superposition of many bad things occurring at once that ultimately can result in catastrophic activity there's also a bunch of other different uh benefits i think really importantly is that assumptions in this sort of model can be directly interrogated by domain experts if a domain expert wants to ask how do you consider patching and you know how do you consider that some companies are slower patching than others if you ask that question to the statistical model that we mentioned at the beginning that question doesn't make any sense there's no notion of patching inside of that model whatsoever whilst in the context of a causal model such as
our own, there are particular components directly addressing that real-life phenomenon, and you can make sure you are capturing the process in a robust manner, both quantitatively and qualitatively. Okay, let's move on. Next slide. This is taking a little longer than we'd hoped, but: good news on empirical data. There's a lot of good news on the empirical data front. Things like the frameworks MITRE has produced, the ATT&CK, CAPEC, and D3FEND frameworks, have helped us absolutely enormously, both in providing a common language to describe different phenomena and as a way to homogenize data from different sources. If source A and source B are both tracking threat actor activity and describing persistence techniques, lateral movement techniques, all these sorts of things, in the same language, that makes things a million times easier for us. So this is really great. Next, incident data is becoming much more readily reported. In the past, things like data breaches were pretty commonly reported for regulatory reasons, but more and more companies are becoming less ashamed of announcing that they've experienced a ransomware incident or other kinds of incidents, and for us more data means better models. Just quickly on regulatory influence: the Securities and Exchange Commission and privacy reporting requirements, and obviously the European requirements, are all positive here, and from our standpoint and our customers' standpoint we look forward to this becoming more comprehensive. On top of that, there's a whole series of publicly available, extremely well structured data on vulnerabilities, and to some extent on exploits too; there's perhaps still some work to be done there, but again it's incredibly useful from our perspective. Something else we really want to call out is the substantial body of high quality, statistically robust work out in the literature. A big shout out to the folks at Cyentia, but also to the folks at Verizon and various other reports. This sort of information really is positive for the discipline of cyber risk quantification in general. For historical context, some of you may have been around the industry long enough to remember cost-of-a-data-breach reports by an organization beginning with the letter P; we have advanced significantly beyond that. And finally, there's some really great work going on in the broader community, where very talented, dedicated professionals are putting their data science, statistics, and cyber security knowledge together to build products such as EPSS, the Exploit Prediction Scoring System. This is super interesting from our perspective at the macro level, but also for practitioners, since it provides valuable insight into the probability of individual vulnerabilities being exploited, to help with things like prioritization; a big improvement over the CVSS severity scores. All of this foundation of data was really a prerequisite for us to even consider this sort of model. Certain areas definitely feel more tractable with current data than others. We're very comfortable with where we are with
things like software vulnerability spawning and patching, the cost of individual attacks, and the mechanisms around initial access vectors. On the flip side, there's some not-so-good news. In general, in this arena we'll always be suffering with relatively short time series compared to natural phenomena like earthquakes and floods. Even if you can collect 20 or 30 years of data, really only the last 10 years, and sometimes five, is applicable, because only then is the environment more or less consistent with where you're operating today. So with any empirical data source we're constantly going through a data analysis and evaluation process: how far back in time do we go, and how do we make use of it? And as we look forward, the dynamic nature of what we're dealing with is crucial. I'll mention an area of my own work, modeling high-level financial theft from banks, which was a very active area for certain nation states three to five years ago; those nation states have since shifted to cryptocurrency thefts. So one of the things we have to evaluate, in a practical sense, if we're putting out a model our customers will use for the next year, is how much we anticipate things that may be in the process of changing. Will the prevalence of ransomware continue over the next 12, 18, or 24 months the same as now? Crucially, we're still struggling for information around incidents, especially breaking the loss event down into categories we can use; what's publicized is not always the most important or the most useful. And the biggest gap of all is information about threat actors. I'll give a shout out here: I've had some conversations over the last couple of days, and put a proposal out over Twitter, for a collaborative project around modeling threat actor capability evolution and change. If anybody is interested in getting involved, think of it as something like an ATT&CK framework but focused on threat actors; not a threat intel view of indicators of compromise and TTPs, but their organizational capabilities and value chains. Come up and see me afterward. This leaves us with the need to fill the gaps in the data with expert knowledge. We mentioned threat actors here, and one area I worked on was how threat actors make decisions around campaigns, given the options available. I just got the ten-minute notice, so I'm going to pick up the pace a little; I hope that's okay. We're going to walk through a simplified example, from the end result backwards. The end result we have in our choice
model is choice probabilities. Here six campaigns are available to this threat actor; what we end up with is a choice probability for each, and a random draw based on those probabilities makes the campaign choice. The choice probabilities are simply the normalized outcome scores. So how do we get the outcome scores, and how do we come up with six campaigns? In this particular case it's three initial access vectors crossed with two attack types, or execution techniques: ransomware and data exfiltration, times three different ways of getting into the population, and their weighted attractiveness determines the outcome score for each. Now, weighted attractiveness is sort of where the magic happens. We need to discern the relative attractiveness of firms to these different types of attack: a firm that's highly attractive to a ransomware actor is not necessarily as attractive to a different attacker, or even to the same attacker trying to exfiltrate a bunch of confidential health information.
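As a rough sketch of this choice mechanism (every attractiveness number below is invented for illustration; the real model's weighted attractiveness is firm- and actor-specific, and combining the two weights by simple multiplication is an assumption, not the described method):

```python
import random
from itertools import product

# Invented weighted-attractiveness scores; not taken from the actual model.
initial_access = {"phishing": 0.9, "remote_exploit": 0.6, "worm": 0.2}
execution = {"ransomware": 1.0, "data_exfiltration": 0.7}

# The cross product of 3 initial access vectors x 2 execution techniques
# gives the six candidate campaigns; here the outcome score is simply
# the product of the two attractiveness terms.
scores = {
    (ia, ex): w_ia * w_ex
    for (ia, w_ia), (ex, w_ex) in product(initial_access.items(), execution.items())
}

# Choice probabilities are the normalized outcome scores.
total = sum(scores.values())
probs = {campaign: s / total for campaign, s in scores.items()}

# The campaign choice is a random draw weighted by those probabilities.
rng = random.Random(0)
campaign = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
print(len(scores), campaign)
```

With these toy numbers, phishing plus ransomware gets the largest share of the probability mass, but any of the six campaigns can be drawn; that is exactly the "normalized scores then random draw" step described above.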
So we'll move on to the next thing. This next section is a bit more of a tutorial, for those of you interested in applying simple statistical methods to your risk quantification work. We're going to talk about probability distributions. A probability distribution is essentially a mathematical function listing a range of possible outcomes for a particular process and their corresponding likelihoods. On the right-hand side here we have a normal, or Gaussian, distribution, the bell-shaped curve, for adult male height in the United States. Using observations, we can come up with reasonable parameters for what this mathematical function might look like.
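A minimal Python sketch of this whole workflow, fitting a normal distribution to observations and then querying it (the talk's own one-liner was in Julia; the generating parameters of mean 69 inches and standard deviation 3 inches are illustrative stand-ins for real survey data):

```python
import random
from statistics import NormalDist

# Pretend field data: 100 height measurements in inches. The generating
# parameters (69, 3) are illustrative, not actual survey values.
rng = random.Random(3)
observations = [rng.gauss(69, 3) for _ in range(100)]

# Fitting the distribution is a one-liner, much like the Julia example.
height = NormalDist.from_samples(observations)

# Likelihood of outcomes: P(taller than 6'3" = 75") and P(5'0" to 5'5").
p_tall = 1 - height.cdf(75)
p_mid = height.cdf(65) - height.cdf(60)

# Synthetic data: five draws, proportionate to the probability density.
synthetic = height.samples(5, seed=1)

print(f"mu={height.mean:.1f} sigma={height.stdev:.1f}")
print(f"P(>6ft3)={p_tall:.3f}  P(5ft0-5ft5)={p_mid:.3f}")
print(len(synthetic))
```

The fitted `NormalDist` answers exactly the questions mentioned next: tail probabilities, interval probabilities, and sampling for synthetic data.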
The benefits of fitting these functions are quite numerous. First, it enables you to estimate the likelihood of different outcomes: if you want to know the chance that someone is more than six foot three tall, you can ask that of this mathematical function, and you can likewise ask the likelihood that they're between five foot and five foot five. It also enables you to determine the likelihood of unobserved outcomes, and to generate synthetic data through sampling; in this case, as Russell just showed, if I ask this function for five numbers, it will draw them in a way that's proportionate to the probability density. Next: how would you actually go about doing this? In this imaginary example, if I go around my local town and measure the height of 100 different men, I might get a histogram like the one on the top right (sorry, it's a little small). If I'm thinking about what sort of probability distribution to fit to this, there are lots of considerations, but one of them should be my prior knowledge about this particular process. You can see the histogram is bimodal, meaning it has two humps; however, having lived as a human for a number of years, I know that in general there's a smooth spectrum of heights, and there isn't a valley in the male height distribution, so we're going to choose a unimodal distribution. Next: to actually fit this, there are a lot of tools that can make your life easy. We use various languages in our team; I quite like Julia, and as you can see here, you can fit that red line on the right-hand side with a single line of code. It's not too terrifying. There is of course a rabbit hole you can go down from the statistics perspective, but to start getting familiar with this sort of stuff, there are a lot of tools out there that can really help. Then there's the question of how good a job I did, and there's a whole bunch of diagnostic plots you can make, which are pretty straightforward to follow, but I'm running out of time, so I'll pass the baton back to Russell. So, a huge lesson we learned through our experience in modeling, through the iterative process of looking at the data, generating the model, and looking at the output, is our definition of tail risk: what drives that curve, whether it's in the critical area or not. And a critical factor is threat
actor capabilities. I want to give you a window into this process and how we arrived there. One critical factor in this tail risk is the addressable population for any attack. The addressable population for the NotPetya attack was huge, because it was a wormable Windows vulnerability and therefore could spread very widely. So what factors determine that addressable population? Obviously attacker strategy, exploit capability, and the products they're going after and those products' coverage in the population, and, skipping to the bottom, interdependencies between the vulnerabilities they exploit: a given attack may exploit two, three, five, maybe even more vulnerabilities, conditional on the victim that they run into.
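The interdependency point, which the talk develops next in terms of OR versus AND relationships between exploited vulnerabilities, can be sketched numerically (the prevalence figures are invented, and treating exposure to each vulnerability as independent across the firm population is an assumption):

```python
# Invented prevalence figures: the fraction of the firm population
# exposed to each vulnerability, assumed independent across firms.
prevalence = [0.30, 0.20, 0.10]

def addressable_or(ps):
    """OR-chained exploits: any one suffices, so a victim is out of
    reach only if it is exposed to none of them."""
    out_of_reach = 1.0
    for p in ps:
        out_of_reach *= 1 - p
    return 1 - out_of_reach

def addressable_and(ps):
    """AND-chained exploits: every step must succeed, so each extra
    vulnerability is another point of failure."""
    reachable = 1.0
    for p in ps:
        reachable *= p
    return reachable

# ORs expand the market beyond any single vulnerability; ANDs shrink
# it below even the rarest single link in the chain.
print(f"OR  chain: {addressable_or(prevalence):.3f}")
print(f"AND chain: {addressable_and(prevalence):.3f}")
```

With these toy numbers the OR chain reaches roughly half the population while the AND chain reaches well under one percent, which is the asymmetry the talk is pointing at.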
If those relationships are OR relationships, that expands their market and gives them options; but if they're AND relationships, those are points of failure, because if any exploit in that sequence of ANDs fails, the whole sequence fails. So we believe that attackers, when defining their attack strategy, whether by conscious planning or by experience, essentially assemble it to arrive at an addressable population that fits their goals and their capabilities. One lens on this is the initial access vector of worms. Here we've modeled just the behavior of worms and what proportion of the population they would be able to access; this is an example of a sub-analysis within our broader simulation. You can see the vast majority of worms are down in the 20 percent range, and what's interesting is the range above 50 percent: this is the worst case of the worst case. This slide here is a bit of an eye chart, but it's a diagnostic output from our simulation where we look in detail at the characteristics of the simulated attacks in the far part of the tail. We went in and estimated how prevalent multiple vulnerabilities and cross-OS vulnerabilities are, and whether they're wormable or not, and we created our own simulated world of these. This helps us evaluate whether those assumptions are realistic and whether they correspond to expert opinion, and we can go back and refine them. I'm not going to read all of these points, but this iterative process around how bad things can get, and what it takes for an attacker to attack a very high portion of the population, led us to produce not only the quantitative output but the reasoning behind why this is relatively rare; more rare than some Cassandras might be claiming. Cool. So, almost wrapped up. We've been speaking about doom and gloom, so we thought we'd share some of the output of our modeling in terms of how bad bad can be. We can ask the
question of our model: what are the characteristics of the most rare, most broadly hitting events? Here we're talking about events we think have a less than one percent chance of happening next year; you could also say less than a one-in-a-hundred chance. Our model suggests that in this realm, roughly our worst case, about 9.4 percent or more of firms experience an incident from a common cause, and we think this is most likely to be driven by worms. There's also a chance of supply chain attacks on very large vendors, and we think the payload is very likely to be wipers, partly because of the threat actor motivation on the state-sponsored side, but also because ransomware attacks require the threat actors, on the business side, to process a large number of transactions, whilst a wiper you can just drop and walk away from. And back over to Russell for the last point. So, I've been in this business for quite a while. In 2007 I presented to an angel investor group and incubator with the bright idea of a research agenda and investment opportunity for quantifying information security risk and making it financially material to executives, and I got blank stares and polite nods. Then I invited myself back; I said I want to give another presentation because I want to get something going, and they said, I think our calendar is full. But I kept going, and I know people in this room, and people in the audience watching on YouTube, have been on this journey for similar periods of time. 2022 is a different time. We wanted to show a whole bunch of corporate and organization logos of all the people participating in this world now, some of them here in this room. I kind of feel like we're in a golden age of risk quantification in information security: there are so many more participants now, so many more good platforms and initiatives, academic conferences and resources. So if anybody is interested in getting involved in this for a start, I'd be happy to talk with you; if anybody's been in it for a while and wants to up their game or get more involved, I'd be happy to talk. Last but not least, this event in the Ground Truth track is evidence of how far this community has come in the past 15 or so years. I told my UK colleague Chris that the American audience is very active in asking questions and engaging in conversation, so now is your opportunity, as we get our microphones set up.
Hello, Russ. I presume that that slide at the beginning, where you're modeling risk as impact times probability, was just a straw man argument you threw out, since the rest of your slides didn't back up that approach? Or am I completely misreading what you've done here? Because thankfully it doesn't seem like you've gone down that dark road of multiplying apples times oranges, and you might want to educate everybody on exactly how you are modeling risk. Thank you for that question. For everybody who didn't hear it: he was saying that one of our early slides introduced risk as likelihood times impact or severity, and that seemed in contrast to the rest of what we're doing. In the early 2000s there was a lot of, I will say, expert commentary driving people to do quantitative risk modeling using that simplistic formula alone, trying to get you to think about every single asset and every single vulnerability and apply that formula to every single one. I was once at a conference where a high-level person at a bank told a war story of taking his team through such an exercise, with carte blanche to work on it until it was done for some board review, and after two weeks they gave up in exhaustion. The trouble is that it makes sense at a conceptual level; at a very broad definitional level the definitions are consistent, but how you get there is crucial, and it's not the same. Our whole presentation about causal modeling, getting to the probability of certain events happening and their conditional severity, is, we're arguing, a viable, supportable path to getting to that risk definition. Next question. There are lots of domains modeling risk, as you mentioned earlier, like finance and floods, et cetera. Are there similar approaches to causal modeling in all of those domains, and is there a common methodology between them?
Yeah, I can try to speak to that a bit. I think there are certainly overlaps, in that the process of creating a broad catalogue of scenarios of bad things that could happen, with associated likelihoods, is common across the board. We described our methodology here; for something like a hurricane model, what they do is have a model that spawns low-pressure systems inside the North Atlantic and tracks how they intensify, and for earthquakes they have models for different fault probabilities and so on. So at a very high level, this notion of having a catalogue of bad things that can happen, which you then combine with the economy or an insurance portfolio to see how it comes out, is shared. However, in the context of cyber, what is very different is the human decision-making angle. We often talk about how you could build the most perfect synthetic world of vulnerabilities and software and so on, but your assumptions about how threat actors make decisions are ultimately what drives the likelihood of really bad things happening. So there are certainly areas of commonality, but having a man-made risk adds an additional level of uncertainty that is perhaps not in place for those natural catastrophe models. Anything else? Yeah: terrorism risk modeling, which is something that RMS does, is the closest analog, and they use similar approaches; not identical, but similar.
Hi there. I work for a very small, pre-seed startup; I'm in QA, and I'm the closest thing they have to security. How do I get my very small engineering team, very new to the startup world and very young, thinking about these sorts of war games, these possible things that could happen and how we'd react to them, without scaring the pants off of them? That is a courageous, huge, incredibly important question and challenge, and anybody in this room who claims they know the answer, or that there's a foolproof path, is kidding themselves. This is one of the most challenging things in the whole security and quant risk world. I'm going to give a try at answering, but it's such a big question it deserves more time. I invite you to join the Society of Information Risk Analysts, because I assume you're interested in a quantitative risk approach to this question, as opposed to hand waving or framework bashing or, please no, jargon hurling. The Society of Information Risk Analysts is a pretty grassroots community, and there are lots of people and resources you can connect to who will help break that down and give you tools to make progress on it. Awesome, thank you.
Thank you for the presentation. It looks like you were mostly looking at a large population, from the perspective of insurers and the like. I'm wondering whether you've had any experience with how well this kind of causal modeling would or wouldn't work for a company trying to predict catastrophic events happening specifically to them; for instance, to decide whether to invest in some sort of barrier, or how much they should be paying for insurance. Very good question, and very forward-looking, because that is a logical progression for firms like ourselves and models like this. I will say the transition between the population perspective and a firm-specific one, or even a value chain, which is sort of in between, is tricky from a technical and evidentiary standpoint. Notice how much granularity we have here: we abstract attacks into two phases, initial access and everything else, and firms care a lot about that everything else; that's the whole elevation of privilege, horizontal movement, partitioning your networks, and so on. I would be very interested in talking to anybody who's interested in that path. Official announcement: RMS has been purchased by Moody's Analytics, a well-known publicly held company modeling risk across a broad portfolio, and I know they're interested in this topic.
I will definitely come and talk to you about that afterwards, thank you. Hello, thanks for the talk, it was really interesting. You talked a lot about external threat actors, and I was wondering how you quantify insider attacks, the likelihood and impact, given that initial access is kind of taken care of. So the question is about insider attacks as threat actors. Our current model does not include insider attacks; maybe Chris can comment. It's my understanding that insider attacks are not normally covered as part of normal cyber insurance, though that may well vary from company to company. What we described throughout this presentation is our framework for trying to quantify catastrophic cyber risk, and it's our assessment that although insider threat is definitely a big thing, the likelihood of armies of people simultaneously deciding to take up the mantle against their corporate oppressors is extremely low, and that the sorts of attacks we're trying to model here, big supply chain attacks and worms from malicious external threat groups, are likely to be the drivers. That's not to say we don't consider insider threat implicitly: I mentioned earlier that we have this notion of a statistical model with the likelihood of certain things happening, and we actually model the attritional risk through that sort of method, and we have quite a lot of data showing that these sorts of attacks do happen. So inside our model we are implicitly capturing this, but we're not calling it out explicitly. In principle we could add insiders as a new attack category, and I don't think anything in our framework would fundamentally change; we'd have to add TTPs appropriate to insiders and so forth, but as Chris said, it probably wouldn't affect our overall model output. Now, my own personal opinion, getting back to this gentleman's question about the enterprise level: I think potentially one of the most catastrophic types of attacks for an enterprise would be an executive-level insider attack, executives with enough technical savvy to go back and change your financial systems; think of WorldCom-type scandals and so forth. If I were an enterprise risk manager, I would have executive or privileged access causing major financial fraud in my threat model. Thank you. Hi, Stefan from SDX. I have a question. We have cat models from the past, for hurricanes, strong wind, and so on, with lots of experience; that's all natural science, where we can do controlled experiments. Now with cyber we are more in social science: the environment
changes permanently, and we know, quoting Nassim Taleb, that the bigger the event, the less of a clue we have. Where or how do you draw the boundary between stuff we can model and stuff we fundamentally cannot model? I'll give us both a chance to answer this; Chris can think about it while I'm babbling on. You've raised a crucial point. I mentioned at the start that my PhD work was in computational social science. Social science is a very broad, loose, not tightly integrated set of theories and methods compared to physics, for example, or chemistry, but there is a lot to draw on. I mentioned that the model of threat actor decision making draws upon consumer choice, firm investment, so-called maximization of subjective expected utility, and so forth. We can apply these especially in narrow contexts: we define a context of certain human behavior, for example peer influence. To what extent do threat actors think through their own ROI, versus just following the crowd, or following the habits of somebody else, or acquiring a tool set from the dark web and trying it until it doesn't work and then doing something else? So there's a fairly good body of knowledge, methods, and theory to inform that. The trick, as you said, and this is a matter of expert professional judgment, is at what level we apply it and how granular we get. Is it better to treat this as random dice throwing, because we don't know a lot of details and there's so much uncertainty, or can we treat it as a rational decision-making process where the agents have a lot of information and might even make strategic choices in a game theory sense: I'm going to trick my opponent, I'm going to fake this and do that? That's my answer; Chris? Yeah, I'll just give some slightly broader comments. Ultimately, however good your model is, there needs to be an acknowledgement that the uncertainty in this field is likely to always be greater than in the context of a hurricane, for example, for many reasons. As we mentioned, you've got the whole human decision-making side of things, and the threat landscape changes far quicker than the climate changes hurricanes. And then there's the observability challenge: with a hurricane you can use satellite imagery and ground anemometers to measure the wind speed and make sure your model is producing realistic output; with cyber it's extremely difficult to scan the internet at scale to see which companies actually had a ransomware strain successfully deployed on their servers. That uncertainty does not mean, in my view, that this pursuit is not worth doing, because certainly from the insurance perspective, as soon as you insure a company you need to start thinking about this sort of thing, and it's either your finger in the air or techniques like the ones we're trying to use. But I'd endorse what Russell was saying; there needs to be an acknowledgement of greater uncertainty in this sphere. Thank you. Hopefully just a quick one, which was on one of your points around the lack of incident information.
You mentioned that incident information is hard to come by. Since you work so closely with insurers, and they're clients of yours who receive a lot of this modeling, wouldn't they have a lot of incident information from where they've paid out, and have you been able to leverage that? So the question is whether we get and use information from insurers' claims data. It depends. My personal take, not that of RMS, is that insurance companies view their claims data as a form of intellectual property. Especially if you're an insurance company that insures many, many companies, that data on the frequency and types of events is really valuable to you. Certain insurance companies take the view that they'd rather not share that information with vendors such as ourselves, because they don't want their competitors to benefit from that intelligence, whilst other companies take a longer view: that having robust quantitative methods is necessary for a healthier cyber insurance industry more broadly. So we do have some claims data from some of our clients, and we'd always like more, but we understand that it's a business decision, among other things. There is also an informal process where some of that information gets exchanged implicitly. It involves them running the model and looking at their own data: wait a minute, you're saying this and we're seeing that. Whenever we get bones of contention, irritation that we're too high or too low compared to what they think, that informs us indirectly, not about the details, but about the overall summary of where they sit on the risk curve. It's also worth noting that some insurance companies release claims statistics reports, which are pretty interesting: they're not releasing the names of the companies that were hit, but they're saying, for example, that for companies with less than five million dollars of revenue, two percent of their portfolio claimed, or something like that. So data is being shared, but not as openly as some of us would like. Thank you.
All right, I think that's a wrap. Thanks, everybody. [Applause] Good job.
All right, good afternoon. Welcome to BSides Las Vegas, Ground Truth. This talk is "Attack Flow: From Data Points to Data Paths" by Gabriel Bassett. A few announcements before we begin. We'd like to thank our sponsors, especially our diamond sponsors LastPass and Palo Alto Networks, and our gold sponsors Amazon, PlexTrac, and BlueCat. It is their support, along with our other sponsors, donors, and volunteers, that makes this event possible. These talks are being streamed live, except of course in Underground, and as a courtesy to our speakers and audience, we ask that you make sure your cell phones are set to silent. Questions will be at the end; if you have a question, use the audience
microphone so YouTube can hear you. I am holding that mic; I will put it back on the stand in the middle of the room. As a reminder, the BSides LV photo policy prohibits taking pictures without the explicit permission of everyone in frame. All of the talks are being recorded, again except in Underground, and will be available on YouTube in the future. Please keep your masks on at all times. If you want to hear better or see better, feel free to move closer to the center and front of the room, keeping social distancing in mind. With that, let's get started. Please welcome Gabriel Bassett. [Applause] It's good that we get the clapping in now, because you don't know what I'm
going to say, and if it goes bad, at least I've gotten one clap for the presentation. So, I'm Gabriel Bassett, and we're going to talk about Attack Flow. We'll do a quick introduction to it, and then we're going to do this in a data-driven way: I'm going to show you a bunch of data, particularly data we use in information security, before attack flow and what it looks like now. Then we're going to walk through the process of taking data and turning it into an attack flow, and then we'll look at the data structured as attack flows, and we'll wrap things up a little bit. And so,
who am I? This is me; this is my Twitter handle. I have done a lot of graph things, and I want to take a second to talk about where that comes from, because I've been doing graph stuff for about 15 years now, and it started when I was in the government. When I was in the government, we were doing risk models, and this is way back. What we did is, we came in to our executive and said, hey, we ran Nessus, and we showed the output of Nessus, all of the output of Nessus. It was a little too much, and she said, go back and find a better idea.
So we came back again, and this time we grouped things. We said, this high risk happened a hundred times, this one happened 50 times, and she goes, that's great, I don't know what that means. So what? Okay, we go back to the drawing board. We come back with the five-by-five chart, right, the one that has likelihood and impact on it, with some red, yellow, and green areas, and we go, now we know this risk is right here. The problem was, we never could agree, because it wasn't just one person making this decision. I was representing the government, I
had a contractor doing the testing, then there were the people that built the system and their government representative, and we all disagreed. And why did we disagree? Why can't we all just get along? Well, it turned out it mattered how you made assumptions. When my tester made assumptions, they would assume: hey, here's the server, the power button isn't covered by a door or anything, what if someone walked in and turned it off? This is a really important server; that's a high risk. Understandably, the other side, the system's contractor, said, yes, but that server is inside of a room that has its own
access system and set of credentials, locked down to like 20 people, and that's inside of this portion of the building that is only open to the people who have access to it, which is inside of that half of the building, which is locked down, which is inside the building, which has its own access, which is inside the fenced area of the base, which is inside the base. And so really, the threat over here is probably not what we're worried about; we're worried about the one out here. And see, the idea is there's a path here. The narrative for the threat outside the building was very different from the threat inside. And so we realized, or I realized,
hey, we need to be writing that down. And so I did what everyone does when they first start to do this: I got an Excel spreadsheet. Is the threat inside the building or not? If you answer yes, it's high, or we add, you know, 10, and if it's no, we add one. We come up with a bunch of questions like that, and then we add them up, and maybe we multiply by some random number we came up with, because then it looks prettier, and then we say any number between this and this is a high risk, and these are the lows. And that doesn't work. And I apologize if anyone in the room is
actually still doing this; a lot of people do, it's not just you. But the problem is, people already know what they want the risk to be, and after a few times using these kinds of tables, they know how to get it out. So they come up with some narrative that fits their mental model for what that risk should be: I want it to be high, so I know to select this, this, and this thing to make it high. And now not only are you not getting the narrative you needed, you're getting some false narrative. So I said, well, let's scrap that. We'll go back to the five-by-five, but when you
put that dot on that five-by-five matrix, you're going to have to tell me why. You're going to have to write out a little paragraph that says, I think the threat does this, and then this thing happens, and then this. And when I realized what we were writing out, that's where I started to diverge a little from what we normally do today. We realized that what we were really documenting was the attack path: the attacker does this to this system, they do this and it has this effect, they do this and it has this effect. A sequence of things. And so we started to build paths, and that was actually really cool,
because once you start building paths, you start asking: what does that mean in context? It's cool if you've got one; what if I have 10? What if my 10 all include the same thing? How can I combine those together? And I remember this moment really clearly. I'm grappling with this idea, it's the end of the day, everyone's tired, it's me and this one guy left, and as I'm walking past he says, hey, I know you've been working on that problem. My company had this project a couple of years ago that involved these things called graphs. I'm like, cool, what's a graph? Because I
didn't know. He had no clue either, but he understood enough to know it might apply to my situation. So I go back home and start searching for it, and it turns out it's a really great solution. By the way, if you don't know what graphs are, we're going to talk about that in three slides. And so I go back and I build Bayesian inference networks in the back of an Excel spreadsheet, which sounds like a bad place to do that, until you realize that I was in the government, and they don't let you install things in the government. They really don't let you install programming languages, because then you
can run whatever you want, but they do let you write Visual Basic, because it comes with Microsoft Office, and so that's where I built this. I left the government after a while, I got a patent around graphs, I wrote a bunch of blogs, I published some stuff, I've done a couple of talks, and that brings us to today. So it's been clear for a long time that atomic infosec data is not cutting it. We need to be able to describe the paths and graphs that attackers take, flows so to speak, but we lack the common language to do that. Attack Flow is that common language: a schema for storing path and graph data,
and it's really cool, because it's incredibly simple and it's incredibly strong. And if anyone wants to clap and tell me I was great at doing this, now's the time, because on the next slide I'm going to tell you there was actually a team of people that did it. This was something done through MITRE's Center for Threat-Informed Defense, with folks from AttackIQ, Anomali, Fortinet, Citi, and Fujitsu, and Andy, who's now at Apple. And we were lucky; we had many of the right people. Ryu worked on this stuff back in his PhD years, in the early 2000s, and this guy is so smart
that, well, he could be wandering around here, and he's so mild-mannered that he'd pass by and we would never see him, but he's so smart and so awesome; I love Ryu. Andy was instrumental in writing CALDERA, Mark was instrumental in writing STIX, and of course I've worked with graphs for a while, and I also maintain the VERIS schema used by Verizon for the Data Breach Investigations Report. So it was this team of people that did this. And so, what's it look like? Graphs are these mathematical things that are actually pretty simple. They're made up of two things. They're made up of,
let me see, I just push buttons to figure out which one's the laser; ah, there we are, here we go. They're made up of dots and lines. The idea is that the dots are nodes and the lines are edges. Every edge has to have a node connected to each of its ends; you can't draw a line in a graph and have nothing on the end. But nodes themselves can have multiple lines, or even none. So this is a graph, this is a graph, and this is not quite a graph, because there's an edge with no node on the end. And there are a lot of different types of
graphs. There are simple graphs, where there's no direction, so you don't know which end of the edge to start at. There are directed graphs, which go from something to something; that's why we have the little arrow on this one. There are tree graphs, acyclic graphs, cyclic graphs, and directed acyclic graphs, or DAGs, which come up in a lot of different situations. There are also hypergraphs and property graphs. But we're going to talk about a very specific kind of graph: we're going to be talking about linked data, and that means a couple of things. The first is that every node and edge is defined by a single string, a single URI
specifically. So every edge has a URI; in this one it's rdf:type, and I'll explain the rdf thing in a second. Every node has a URI; this one is, I think, attack-flow:action-1, or veris:Phishing. The only time you don't have those is when a node is a literal, an actual string or a number, something like that. And you combine those into what are called triples. Here we have the triple attack-flow:action-1, rdf:type, veris:Phishing, and it's just three things: you can see a table with three columns, you put edges in there, and you have your triples, and you have
your graph. But you'll notice that I didn't spell out "attack flow"; I defined "af:" over here to mean this namespace. Namespaces are one of the cool things about linked data, because you don't have to reinvent the wheel. If someone else has an entire definition of how to describe things such as actions or assets, you can just use it; you don't have to come up with your own. In fact, a lot of it is predefined. Things like rdf define "type": action is a type of, or actually, that should go the other way, phishing is a type of action.
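A minimal sketch of these ideas in code, with triples as plain tuples. The af: and veris: namespace URIs here are illustrative placeholders, not official ones; only the rdf: URI is the real RDF namespace:

```python
# A linked-data graph as (subject, predicate, object) triples. Namespace
# prefixes expand to full URIs, so every node and edge is one string.
NAMESPACES = {
    "af": "https://example.org/attack-flow#",              # illustrative
    "veris": "https://example.org/veris#",                 # illustrative
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",  # real RDF namespace
}

def expand(qname):
    """Expand a prefixed name like 'af:action-1' into a full URI."""
    prefix, _, local = qname.partition(":")
    return NAMESPACES[prefix] + local

# The triple from the slide: attack-flow action-1 is of type veris phishing.
triples = [
    (expand("af:action-1"), expand("rdf:type"), expand("veris:Phishing")),
]
```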
There are other namespaces you'll see a lot, and I bring these up for a specific reason. rdfs is used for labels, so you can give your node a common name, something that reads a lot better than "uri blah blah blah." There's Dublin Core, which adds things like a description for the node. There's time, for timestamps. And there's OWL, the Web Ontology Language, which has a lot of useful things: this node is the same as this node, this node has an object property or a data property of this thing, or this is a named individual. There's phishing the concept, and then
there's the actual phishing action that happened, as a named individual. And the reason I bring these up is that if you go google this, or you're googling it right now, it's going to scare the vegetables out of you, because OWL does scary things. The people who invented OWL were a bunch of academics who said, you know what's cool? We can do actual reasoning over this, we can do first-order logic, we can do all these fancy things. And we come in and go, that's cool, and when we need that, we can use it. What we want right now is just to say that
one node is the same as the other node. We're using this much of that big thing, and we don't have to understand the whole big thing; it doesn't affect us in any way. But the nice thing is, you have this namespace, and, I'll explain this a little later, if you have a data set that includes something like the city of Sydney, you don't also have to say that Sydney is in Australia, that Australia is a continent, that it has this location. Someone else has already built a graph that explains all that; if you reference their Sydney node, anyone that wants to know that stuff will be
able to find it. So now we know what graphs are. What about attack flow? Attack flow is five things: actions, assets, properties, relationships, and the flow. The action is the thing that happened. The asset is the thing that has its state changed. Properties are nodes that describe actions, assets, other properties, things like that; they're the descriptions. Relationships are just the edges between them. And the flow is the set of all this together. When we look at it as a graph, it looks something like this. We can see this causal path through the middle of it, and one thing you'll notice here is that it goes action, asset, action, asset, and
that's very, very intentional. We'll get into it a little later, but in security, different people think about causality in different ways. If I'm a blue teamer, I think that I have an asset and this action happened to it, because that's what my logs look like. If I'm on the red team, I might say, I did this, which is an action, then I did this, which is an action, and implicitly it's the assets that are in between. To be able to capture all the ways we think about security, it's important to have both and have them alternate back and forth. We'll see some visualizations that show that later.
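The five parts and the alternating action/asset causal path might be sketched like this. The node names and the "affects"/"enables" edge names are illustrative, not the official schema:

```python
# Attack Flow's five parts as plain triples (names are illustrative).
triples = [
    ("flow-1",   "type",     "attack-flow"),  # the flow groups everything
    ("flow-1",   "contains", "action-1"),     # relationship: flow -> action
    ("action-1", "type",     "action"),       # the thing that happened
    ("action-1", "name",     "phish user"),   # a property describing it
    ("asset-1",  "type",     "asset"),        # the thing whose state changed
    ("action-1", "affects",  "asset-1"),      # relationship: action -> asset
    ("asset-1",  "enables",  "action-2"),     # relationship: asset -> action
    ("action-2", "type",     "action"),
]

def causal_path(start, triples, follow=("affects", "enables")):
    """Walk the alternating action -> asset -> action chain from a node."""
    path, node = [start], start
    while True:
        nxt = [o for s, p, o in triples if s == node and p in follow]
        if not nxt:
            return path
        node = nxt[0]
        path.append(node)
```

Walking `causal_path("action-1", triples)` gives the action, asset, action sequence through the middle of the graph.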
So now we're coming to the fun part: what does our data look like today? This is the opportunity to look at a bunch of data, but not use it. Starting with a red team report: this is just something I downloaded off of pentestreports.com, and what do we see? We see a lot of freeform text, not a lot of structure, and really not a lot of solid grouping either. You can't say, here's a concept and here's its grouping; it's all over the place. So this is what our red team data looks like today. Now, attack
simulation data is a little bit better. It's structured, it's a tree, but a lot of times in attack simulation data you get a lot of raw code. It's hard to describe anything more than a single thing without clear-cut effects; this one single thing just points to some code. And what we don't want is to have to describe our attack simulations as arbitrary code. We don't want to be saying, hey, I need you to go run this attack simulation in your environment, with your tools; here's some arbitrary code, have fun. We wanted something a little bit better than that. Signatures: again structured, and this looks very similar to the attack simulation data, because it's hierarchical data, but there are no real links. We're seeing one thing happening in an instant,
and we want to be able to detect more: to test subtler things, to detect multiple things together. And then we get into intelligence data. This is a subset of an eye chart, a very small subset: these are two records straight off of Shodan, and there are some substantial problems with this. It's super dense. You also have duplicate data: if I have the same vulnerability down here, I'm going to get duplication of references, duplication of summaries; those are going to be in every single record that references this vulnerability, and that's going to take up a lot of
space. But the biggest problem is that when we use these data structures, we don't know what's in them. So over here, this is an HTTP test; this one, I don't think, is actually HTTP, it's a different test, the HTTPS test, and there's nothing that says they need to have the same information. So if I'm looking in my dictionaries for a key that's five layers down, it's not there in every one of my records, and this becomes a huge problem as you look through data, if you don't know exactly what the entire structure of the data looks like. And who wants
to know the structure of all the data just to find the few things they want to look for? I say this from profoundly depressing experience. Okay, moving on. Incident response data: this is all very textual, all these text files. By the way, big thanks to Chris Sanders and his training; he donated the text files for this, and he's also helped out a lot. These are text files of indicators and information that were collected during this simulated IR, and the problem is, it's all text. Even though some of this is structured data, it's being stored as text, and there's nothing that links 222 to
something in this file, or right there, or over here, or even in the same file, other than wrapping the raw text. That's not very searchable, that's not very usable; you're never going to find that in six months. Moving on to threat intel: this is some data donated by GreyNoise, and it's actually pretty nicely structured. It's relatively clean, this one is, but you can end up with raw data. In this scan they scanned two ports; what if they scanned every port? Then you get a record that's this big. Also, what if you have 40,000 of these
and you need to find every place that port 23 is mentioned? Do you really want to look through 40,000 pieces of data to find every one that's got port 23 in it? And even when we aggregate things like threat intelligence: I like to think that this looks good. I like to think that because I made it; you can all tell me it looks good, you don't have to believe it though. It's all aggregated, it's pretty, it communicates, it's clean, but it hides a lot of nuance because of that aggregation, and we'll see that a little bit later. But keep moving up the chain. Let's say that we're making
decisions about our architecture, about what we're going to invest in. How do we do that these days? We make a 1-to-n list. At an organization I was at, the way we did it was: the entire security department could propose projects, saying, hey, we think it'll cost this much, and here's some text, maybe a paragraph, explaining the benefit we expect from it. We build a 1-to-n list, and at some point the money runs out, you draw a line, and you fund the stuff above the line and not the stuff below it. The problem is, there's no context to it. The email filter plus
the phishing reporting button: do we really want to buy those together? Are those complementary? Are they duplicative? And what about the discussions at the executive level? A new exploit is in the news; software X has been used in breaches, and those companies look like us, and the executive team says, well, are we at risk? We say, we could be, we really don't know. We know that we run the software, but we don't know if we're at risk, because we don't understand our context. So there are a lot of problems here. Often our data is just blocks of text. It's hard to parse, hard
for people and hard for machines, and hard to get value from, because a person has to look at it, pull out the pieces, and organize them in their mind, since they're not organized in the text. And if you can't do that in real time when you're looking at it, what are the chances you're going to be able to find it later and put it to use? We've got a lot of structured data, most of it structured either tabularly, for logs, or hierarchically, and that's better, but a lot of times it lacks links. Hierarchical data is not good at building sequential links: it can tell you something, and then tell
you things about that thing, and things about those things, but it doesn't say this thing is related to this thing is related to this thing. It also lacks context: either you're putting too much context in, because you're duplicating it in every one of your records, or you have too little, because you can't go find the additional pieces of information elsewhere; they have to be in the record. And if you do have all that context and all that structure in your data, it becomes so complex that you can't find anything; you need a search engine just to find minor things in your data. And so attack flow improves infosec data, and we're going to go through this as an
example to see how that works; I like examples. So, we said attack flow is a schema. It's not a tool, it's not a solution; it's how you structure your data, and the schema itself lives as either a JSON schema or as a graph. We also have data, and this is a good example, because any time you have actions and assets in order, you can create a flow. It can be in incident response reports, red team reports, really any of the data we just looked at, but these are two good examples. If we look at what's happening in the IR report, it's very asset-centric:
the mail logs show no proxy, and host logs detect. Whereas a red team report starts with "we phished," "we installed"; it starts with the action happening. So it's two different ways of looking at the exact same thing. But now we have our schema and we have our data, and we need something to put them together, which is nice, because we actually have that tool. Michael Carenzo, I believe with Williams, put this together as part of the MITRE project. You define your action and the mandatory properties for it, you define your asset and its mandatory property, and then you can add additional
properties down below, and these are all just point-and-click in the user interface. Once you've built it, you can export your data as an attack flow. We can export as the JSON schema, and the nice thing there is that the actions are grouped together, the assets are grouped together, and the properties are grouped together. But I really prefer the data as a graph. So, still a graph stored in JSON, but it's a little bit different: it's just a list; that's what the little square bracket tells us. Each thing in the list is a node, so the id of this node is action-1, and then for every one of the properties,
the key is the relationship and the values are the other nodes. So action-1 is a named individual, it's a type of phishing, and it's a type of attack-flow action. We can also say the vector for it was email, or the description is "the actor phished a victim," things like that. So if we want to know what a node is connected to, we can just go to that spot in the file. And then we take that graph, because, let's be honest, none of us want to work with JSON directly; we want our machines to work with JSON, and we put it into a graph database. This is Ontotext
GraphDB, one of the many, many tools you can use for this; there are tons of graph databases out there. I like this one because it has a nice UI, it's got a nice API, there's a good free version of it, and it works great for this kind of data. But we said these are just triples; you can put them into a relational database or any other type. And we can see what happened: we still have our causal path, we have actions that lead to assets, which lead to actions, which lead to assets, we have the properties, and we have
our flow node up here. Looking at it is fun, and I probably could spend a bunch of time up here just showing pretty pictures, but the reality is, we want to use our data. So we take that data, and the cool thing is, it's in a database; it's not just living on its own in some file. We can now query it. We can say, on the left, what assets have been compromised, or on the right, and I know this isn't readable, what actions have been taken. But we don't just stop there, because we really want to know things like what assets were compromised in a certain window, or which action is taken most
often. That query is in a language called SPARQL, which is a standard graph query language, but if you were storing this in a relational database, you'd be querying with SQL, or you could be querying with Gremlin or Cypher or KQL, depending on where you're storing your data. The nice thing is how flexible the data is. You go find what you need, and not just in individual pieces: you find it in all your data, because all your data is living in the same place.
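A sketch of both halves: a made-up node list in the graph-as-JSON style described above, flattened to triples, and a toy pattern match standing in for the SPARQL query. All field names are illustrative, not the official schema:

```python
import json

# A node list in the "graph stored as JSON" style: "id" names the node and
# every other key is a relationship to other nodes or literals (illustrative).
doc = json.loads("""
[
  {"id": "action-1",
   "type": ["named-individual", "phishing", "attack-flow-action"],
   "vector": "email",
   "compromised": "asset-1",
   "description": "The actor phished a victim."},
  {"id": "asset-1", "type": ["attack-flow-asset"], "name": "user laptop"}
]
""")

def to_triples(nodes):
    """Flatten the node list into (subject, predicate, object) triples."""
    for node in nodes:
        for key, value in node.items():
            if key == "id":
                continue
            for obj in value if isinstance(value, list) else [value]:
                yield (node["id"], key, obj)

triples = list(to_triples(doc))

def match(triples, s=None, p=None, o=None):
    """Toy triple-pattern query; None is a wildcard, as in a SPARQL variable."""
    return [(ts, tp, to) for ts, tp, to in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

# "What assets have been compromised?"
compromised = [o for _, _, o in match(triples, p="compromised")]
```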
So, going back to our challenges: instead of a lot of text, we have nodes that represent a single concept. Instead of complex JSON, attack flow lets us represent just the parts we need; if there's more context we want, we simply link to it. Somewhere, someone has probably described it, and we can go bring in their knowledge as well. The relationships are clear and explicit: we know how things connect, and we can show causality, the paths the attacks take, something we weren't able to do before. So let's go back through our data and see what it looks like now that we've turned it into attack flow. We start with red team data: it looked like this before, and now it looks like this. No one in this room was going
to look at the old version and understand what's going on; this version, everyone can read. The red team started with email and phone; they vished one and phished the other. That compromised the person, who made the error of giving out their personal information or installing malware on the user device. This led to credentials and installed malware. Over here, they exploited a vulnerability internally on these two systems; one allowed for additional directory traversal, which compromised PHI. And the cool thing is, not only is this much more legible, this gets fed into whatever tool you're using to store this data. This data isn't looked at once and then left on the cutting room floor; it lives in your context, for your organization to use
this nice structured data. Did you get it? Okay. Looking at attack simulation data: this is a really fun area, because right now we can simulate basic things; we can write a bunch of programs and run a bunch of stuff in sequence. What if we could do it dynamically? Let's say the attackers start with a domain and an email. What you do is match against your graph: which little sub-graphs, which patterns, match the data I have? Well, network discovery does, so we do the discovery, and maybe we discover an RDP server over here. If we have email, well, the action that matches that, or one of our
actions matching that, is going to be phishing, and so that compromises a person, who makes a mistake and sends their credentials back to the attacker. And the cool thing is, after this happens and this happens, you go back through your list of potential actions and see which other ones now match in the graph. Because we have compromised credentials and a compromised RDP server, we have a match, and so we take another action: we use those stolen credentials to compromise the RDP server. The same way we think as people, we can now document and have machines do. Okay, I'm sure everyone's thinking: but do we really want to do this?
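That match-then-act loop might be sketched like this; the action names and their preconditions are made up for illustration:

```python
# Sketch of the dynamic-simulation idea: each candidate action declares the
# facts (its little sub-graph) it needs and what it yields (illustrative names).
actions = {
    "network-discovery": {"needs": {"domain"}, "yields": {"rdp-server"}},
    "phishing":          {"needs": {"email"}, "yields": {"credentials"}},
    "use-stolen-creds":  {"needs": {"credentials", "rdp-server"},
                          "yields": {"rdp-access"}},
}

def simulate(start):
    """After every step, re-check which actions now match the attacker's state."""
    state, taken = set(start), []
    while True:
        runnable = [name for name, a in actions.items()
                    if a["needs"] <= state and name not in taken]
        if not runnable:
            return taken, state
        name = runnable[0]
        taken.append(name)
        state |= actions[name]["yields"]
```

Starting from a domain and an email, the stolen-credentials step only becomes runnable once discovery and phishing have added the facts it needs.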
And you're absolutely right, you probably don't want to do this, but you can. I'm going to get in trouble for this one. Signatures: our signatures are pretty static right now. What if they could be a little more dynamic? A process creates a file; that same process runs that file; the process is a browser; and the file isn't living in a place a browser should put files. Now we don't have to base detections on individual events. And the reality is, there are a lot of organizations doing this already, but a lot of it's done in raw code, in things that are not shareable. What we want is something that can be communicated:
we want a rule that I can send to someone else and say, hey, can you test this to see if it works, can you test this in your system? Our intelligence data is convoluted; now we can take it and look for just the parts we want. Let's say I want to know about this vulnerability right here. I search for it, I double-click it, and it says, here are the instances of that vulnerability that exist. I say, great, show me the interfaces those are found on. It's found on this interface and that interface and that interface, which are found through these scans on these assets, and we can
click through we can do it and the first couple of times that's going to be really cool and you're going to enjoy that it's going to be a lot of fun after the first couple times you're just going to write a query that says hey tell me every system that has this vulnerability on it and it's going to dump it out for you it's going to find that context and it can link in context from outside your organization if you want to know the physical locations or things like that our incident response data very kind of messy useful from a human perspective but not well useful later on what if we take that data we put it into
markdown we're defining an edge right here because we have a source a relationship and a target it's just in markdown now we have that we have our incident response in our graph in our structured data this could potentially be living with your intelligence with your renting data now when we go look we can go find the things you want in fact we could bring in that structured data that we had during our incident response and bring it into the graph directly and then output from the graph back into our markdown allowing us to both work as kind of text as well as as structured data oh and this is um let's see this on the left this is onto
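One way a source→relationship→target markdown convention might be parsed into graph edges; the exact bullet syntax here is an assumption for the sketch, not the format shown in the talk:

```python
import re

# Hypothetical convention: one edge per bullet, written as
# "- source -> relationship -> target" in plain markdown notes.

EDGE = re.compile(r"-\s*(.+?)\s*->\s*(.+?)\s*->\s*(.+)")

def parse_edges(markdown_text):
    edges = []
    for line in markdown_text.splitlines():
        m = EDGE.match(line.strip())
        if m:
            edges.append(tuple(g.strip() for g in m.groups()))
    return edges

notes = """
# Incident 42
- attacker -> phished -> alice
- alice -> logged_into -> rdp-server-7
"""
edges = parse_edges(notes)
```

The same notes stay readable to a human while round-tripping into and out of a graph store.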
Oh, and let's see: on the left, this is a text graph viewed in Obsidian, and I'm trying to point out tools as I go, because I want to really impress on you that there is so much ecosystem that already exists around this stuff. Looking at some additional data, the GreyNoise intel, there's a lot; I think I parsed through an hour of data, something like 46,000 records. Now we can go and explore it: I can search for the vulnerability I want, find every threat actor using it, and click through those things. And then of course my data: it's pretty, but it's hiding some things. This is what it looks like at the top, and
this is just what a subset of it looks like when you start looking at the paths the attackers are taking. This is not fun to look at either; it's very messy. We'll instead look at a subset of it. Well, now we're starting to see: hey, if I'm an engineer trying to figure out where to apply my mitigations, maybe I'm starting to understand that I want to put things at my server layer, because it looks like my aggregated attacks are mostly coming through my server layer, either as a malicious action, probably using credentials, or as an error. But this is still a little dirty. What if instead I look at the paths, even as a couple of bar charts: what action happened first, what actions happen in the middle, and what happens last? I'm adding additional context I didn't have, context that was obfuscated when I looked at this as a single bar graph. Moving on to the budgeting: we just kind of picked our one-to-n list. What if we go and build out the attack flows for it? Now we're saying: this mitigation right here helps with this. Vulnerability management probably helps with all of our servers; two-factor authentication probably helps where we authenticate; email filtering helps where we handle email. Now we can put this together.
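The first/middle/last path summary from a moment ago can be sketched like this, with invented example paths:

```python
from collections import Counter

# Toy sketch: summarize attack paths as three separate tallies (what each
# bar chart would show) by position in the path, instead of one aggregate.

paths = [
    ["phishing", "credential_use", "data_theft"],
    ["phishing", "exploit", "data_theft"],
    ["scan", "exploit", "ransomware"],
]

first = Counter(p[0] for p in paths)
middle = Counter(a for p in paths for a in p[1:-1])
last = Counter(p[-1] for p in paths)
```

Even this tiny set surfaces ordering context a single bar chart would hide: phishing dominates the start, exploits the middle.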
We want to do a couple of things together as a group, because they add together to form kind of a single boundary; then we want to do these two things together, because they form the next boundary; and then the next things form the boundary after that. We can make decisions about mitigations that complement each other. And finally, that discussion, that kind of awkward discussion many of us have had in some form or fashion: what if instead of saying "this vulnerability is happening," we said "this attack path is happening"? And when we're asked, are we at risk, we say: well, we went back and looked at that path, we pulled that path out of our data, and we've tested our phishing defenses and our malware response. The likelihood of malware staying on the system is this number; the likelihood of someone being phished is this. When we combine these, even though we know we're vulnerable to the exploit, the overall path likelihood is low. And we can say that confidently, because we queried a data set that's storing information about the structure of our organization, about our red team results, about what assets exist, and about the likelihood from our internal phishing results. And there are even more benefits we get from the ecosystem itself.
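A hedged sketch of combining tested per-step likelihoods into one path likelihood; all numbers are invented, and independence between steps is an assumption:

```python
# Toy path-risk combination: each defense the attacker must get past
# multiplies into the overall likelihood (assuming independent steps).

path_steps = {
    "phishing_succeeds": 0.10,  # e.g. from internal phishing tests (made up)
    "malware_persists":  0.20,  # e.g. from tested malware response (made up)
    "exploit_succeeds":  1.00,  # known vulnerable, so assume worst case
}

def path_likelihood(steps):
    p = 1.0
    for prob in steps.values():
        p *= prob
    return p

overall = path_likelihood(path_steps)
```

Even with the exploit assumed to always succeed, the path as a whole comes out at 0.02, which is the shape of the "we're vulnerable, but the path is unlikely" argument.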
All of this I've done with open, free stuff. The nice thing is, academics have been working on RDF for years and years, a decade plus now, and they've built a bunch of stuff; that's the good news. The bad news is they built it all in Java, but you can live with that. There are databases: graph databases and every other type of database. There are huge numbers of file formats: if you like your data tabular, you can do that; if you like your data in XML, I'd like to introduce you to other data formats, but you can use it; there's JSON; there are tabulated, human-readable things. You name it, there's a format that fits how you want to use the data. And because it can be structured however and stored however, there are API tools, validation tools, programming tools, visualization tools, editing tools, querying tools, analysis tools; all these things already exist for us, because we're working in a known ecosystem. And there are huge stores of data. There aren't great structures for security data, but there are great structures for every other type of data in the world. In fact, this is the data structure that Google uses to build structured data into web pages. It's all machine-to-machine readable; we don't have to be in the middle of it, and that's good, because the reality is not everyone
wants to be looking at this kind of stuff. And there's more we can add: we can define subflows, so I can say, this subset of things happens often to me; does it happen often to you? We can both define those and say what things are common between us. So, going back to visualization: this is important, because the first place people want to approach this data is as a visual graph. If you want to know more about visualization, I gave a talk at Microsoft BlueHat 2019 on graph visualization, and one of the takeaways is that after a while you don't want to look at graphs, you just want to use the data. But it's important to go through an example here. This is something I pulled out of a threat report; it has a nice graph, right? But you read this and you think, hey, I know more things. If I were to ask you to go put those into practice, you'd go, well, I guess I patch this? But how are you going to apply this to your system? Well, no: I can't apply this graph; I can apply knowledge I gained from it, but I can't apply this. But what if we structure it as Attack Flow? The attacker exploits something over here, and I've left out the properties, but you can define the type of exploit
on a server, which then, using data that it also sent, runs a loader, which then adds code. We can look at this from the attack team perspective; there's the other side later you can take a picture of. Here we're looking at just the actions, with the server, the asset, as the relationship between them, and so this might be how a red team person wants to think about it. The blue side might want to think in terms of assets, with the actions on the edges: I'm thinking about what's happening to the server, the data, and the code being loaded. And then finally you can look at it all back together, and one of the things I want to add is that if you look at it all together, you get to carry this additional information. You can say not just that these things are related, but: here is the requirement for this exploit to work, here is the state change imposed on this server, and this server being compromised is a requirement for the loader to run. You can put this data into a structured data set where you can analyze it and use it with all your other data; you can share it with other people; you can get their data; and we can all actually put things to use.
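Illustratively, a flow fragment carrying requirements and state changes on its edges might look like this as plain dicts; the field names are assumptions for the sketch, not the published Attack Flow schema:

```python
# Toy attack-flow fragment: edges carry a requirement and, optionally,
# a state change imposed on the target, not just "these are related".

flow = {
    "nodes": {
        "exploit": {"type": "action"},
        "server":  {"type": "asset", "state": "nominal"},
        "loader":  {"type": "action"},
    },
    "edges": [
        {"src": "exploit", "dst": "server",
         "requirement": "unpatched CVE present",
         "state_change": "compromised"},
        {"src": "server", "dst": "loader",
         "requirement": "server compromised"},
    ],
}

def apply_state_changes(flow):
    """Propagate each edge's state change onto its target node."""
    for e in flow["edges"]:
        change = e.get("state_change")
        if change:
            flow["nodes"][e["dst"]]["state"] = change
    return flow

flow = apply_state_changes(flow)
```

Because the requirements and state changes are data, they can be queried and shared alongside everything else, not just drawn in a picture.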
But you may not always want to look at it: you start to get complex attacks, and they start to get busy. There are a few things that are important for visualizing this kind of data, like grouping nodes together or putting them in blocks, but you know what I want to do? I want to display it like this: I've got my assets, I've got my actions as lines, and I can build this as an interactive visualization I can scroll around in. It's an easy way to show the same thing; just because we're using graphs doesn't mean we have to look at graphs. And that's important, because as we go into the future: this is a schema. This is not your single pane of glass; it doesn't solve all your problems. It's a foundation to build on, and that means building. And it's possible to build bad things on a good foundation. I'm excited about this, and you may be too, but I don't want to set you up to feel disappointed: there are things to be built, and some people don't want to build things. Some people just want to buy the finished house, and that's okay; that's how home builders make their money, as my home builder knows. And not everyone's going to get it, because the Pregel paper that Google published years and years ago described thinking like a node,
and that's not something everyone wants to do, or that everyone can do, and that's okay: Attack Flow can meet you where you are. To help people with that, we're working on things like documentation. We showed you what the data looks like before and after; I'm working on documentation so you can do that process yourself. We're also looking at building training: I said, hey, you're going to enter information into this web app; I want there to be videos so you can watch someone else do it, to help train you. And we're working on improved tooling, which means more and quicker ways to enter data and easier ways to query it. If you've ever seen the Vertex Project's Storm query language, I really like it; it's a great query language, and I'd like to see something like that usable for Attack Flow. It also means better visualizations: easy, push-button stuff for whatever kind of user you are, whether you're red team, blue team, or an executive. And while we're talking about the future, we want to talk about maybe some of the more advanced things. We said "action"; there's nothing in there that says it's attacker actions. We're using it that way,
but what if I describe normal behavior as actions? I can start documenting how my system normally operates. Or what about interdependencies? The keynote (I assume; I was running Pros versus Joes and missed it) talked about what he calls resilience and what I would call blast radius: the effect of one system in an adverse state on other systems. That's just an action: this system changing state has the action of changing the state of another system. This one's availability is compromised; that one's integrity is compromised; this one is unavailable. We can start to build that blast radius as a structured set of data. Or even response: I can say, if this system is in this state, I'm going to take this response action, and document that in the exact same language as we've documented everything else. And that opens up a lot of opportunities. The first is just structuring path data: right now all of our data is logs, and it's all sequential. What would be great is if we were actually given the causal paths between our log events, and I know some organizations that are working on that. Additionally, what about companies that provide enrichment? I mentioned Vertex; they sell enrichment. What if you had your organization's data structured, and you didn't want to store, say, all of the attack techniques yourself? What
if you just wanted to call into someone else's database of all the threat actors, so when you saw a TTP you'd just query theirs and link their data into yours? It looks like a seamless graph database from your perspective, but on the back end it's all being linked together from disparate sources. Then we need better analysis: there's opportunity to write analysis against the data, and there's opportunity to build automation, automation of red teams and automation of defense and blue teams. And finally, for incident response, there's the ability to apply things like link prediction. In graphs, you can use machine learning to say: given the graph we see, there may be edges we don't see. So we can say, we've done this incident response, this is what we've seen so far, this is our attack flow; machine learning model, are there any edges we maybe missed? And it goes: have you considered this edge, that this action was taken against this asset? That gives you another place to look in your response for things the attacker may have done that you didn't know about. So Attack Flow really is the next step in information security data, and if you want to get involved or go look at it, these two repos are the place to start.
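As a toy stand-in for the machine-learning link prediction mentioned, here's a classic common-neighbors baseline on a tiny, invented incident graph:

```python
# Toy link prediction: score non-adjacent node pairs by how many
# neighbors they share; high scores suggest edges the response may
# have missed. (Real systems would use learned models, not this.)

graph = {
    "phishing":   {"alice"},
    "cred_use":   {"alice", "rdp_server"},
    "persist":    {"alice", "rdp_server"},
    "alice":      {"phishing", "cred_use", "persist"},
    "rdp_server": {"cred_use", "persist"},
}

def common_neighbors(u, v):
    return len(graph[u] & graph[v])

# Rank every non-adjacent pair by shared-neighbor count.
pairs = sorted(
    ((common_neighbors(u, v), u, v)
     for u in graph for v in graph
     if u < v and v not in graph[u]),
    reverse=True,
)
best = pairs[0]
```

Here the two actions that touch the same victim and server score highest, which is the kind of "have you considered this edge?" hint described.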
MITRE's repo has the JSON schema, the builder, and a little bit of markdown to explain it; the bottom one has the graph schema as well as the code to convert between the two. The reality is this is probably going to leave you with more questions, or many questions, because this is very new; it was published, I think, in April. Reach out; I'm here to help. I think it is really important for the industry to move forward on how we structure data. I get a lot of security data, and so the opportunity to improve how we structure data is very central to my interests. So reach out, I'm looking to help, and I would love to open it up for questions, but I've got ten minutes to get back to Pros versus Joes to help in the game. Thank you very much. [Applause] No questions? If you find me, or if you would like to walk towards Common Ground, I would be happy to answer questions along the way.
i should have left the mic
okay
Did you switch it over? Okay, all right. Good afternoon, and welcome to the BSides Las Vegas Ground Truth track. This talk is "Weeding Out Living Off the Land Attacks at Scale," given by, oh god, I forgot your name, by Adarsh. A few announcements before we begin. We'd like to thank our sponsors, especially our diamond sponsors LastPass and Palo Alto Networks, and our gold sponsors Amazon, InVisium, and Google for their support, along with our other sponsors, donors, and volunteers who make this event possible. These talks are being streamed live, except in the Underground track, and as a courtesy to our speakers and audience we ask that you check to make sure your cell phones are set to silent or vibrate. If you have a question, we're doing questions at the end; if you do have one, use this microphone I'm standing at in the middle of the room, so YouTube can hear you. As a reminder, the BSides LV photo policy prohibits taking pictures without the explicit permission of everyone in frame. These talks are all being recorded and will be available on YouTube in the future, again with the exception of Underground. Please keep your mask on at all times, and if you want to move closer to the middle of the room, please keep social distancing in mind as well. With that, let's get started. Welcome, Adarsh. [Applause]
Thank you. Can everyone hear me okay, all the way at the back? All right, sounds good. So hi everyone, I'm Adarsh, I'm a research manager at Sophos AI, and today I'll be talking about weeding out living off the land attacks at scale. A little bit about me: I have been working at the intersection of security and machine learning for, is it cutting out? I feel like it's cutting out. Okay, sorry about that. So I've been working at the intersection of data science and machine learning for four and a half years, all of that at Sophos AI; I joined right out of grad school. I completed a master's degree in computer science with a specialization in artificial intelligence and machine learning at UC San Diego. Currently I live in Denver with my dog, and I'm originally from India; I grew up in Bangalore and moved to the U.S. for grad school. Before I begin talking about the technical details of this talk, I'd like to mention that the work behind this talk is a huge collaborative effort across several teams in Sophos AI.
This involves data scientists, data engineers, threat analysts, software engineers, and program managers, and I'd like to take a minute to acknowledge their contributions and thank them. Anyway, now let's get to the technical specifics. What is this talk about? This talk details the development of a machine learning system that detects living-off-the-land binary attacks. This system is supposed to be another detector, or sensor, that feeds alerts into the security operations center. If you attended Ben's talk yesterday, Ben Gelman from Sophos AI, you got some context on how we have different sensors that feed alerts into our security operations center, and how those alerts are then manually reviewed by our SOC analysts. So we'll look into the technical details of how we designed the system that surfaces living-off-the-land binary attack alerts, the challenges we faced during its development, the strategies we used to mitigate them, and finally the generic lessons we learned along the way. If you attended Josh's talk yesterday, you'll remember this slide. Spoiler alert: I come to pretty much the same conclusion at the end of this talk, I just take a different path; I use this project as a case study to demonstrate each of the points he made yesterday. Okay, so what are the goals of this system? Because it's important to define the goals before we start working on it.
The first goal is to surface additional living-off-the-land attacks to the security operations center that are not yet being detected by existing methodologies. But more importantly, the second goal is that since this system is going to surface alerts to a security operations center, it's not supposed to flood the SOC with alerts, especially not false positives. Now that we have the goals squared away, let's move on to understanding the actual problem: what are living-off-the-land binary attacks? Living-off-the-land attacks, or LOLBin attacks for short, are attacks that involve the use of binaries that are either pre-installed by the user on a system or are existing system binaries. These binaries are used in a malicious way, and that's what constitutes a LOLBin attack; in general, the attack vector is a command line that is executed against said binary. These attacks tend to be extremely hard to detect and defend against for many reasons, some of which are that they have a very small footprint on the user's system, and that it's sometimes really hard to differentiate between legitimate sysadmin activity and LOLBin attacks, among several other reasons. So our problem is that we want to detect LOLBin attacks with machine learning. Is this a good nail for the hammer of machine learning? In order to determine whether this is a
good candidate to be solved by machine learning, we need a couple of things. We've established that the artifact we can use to detect a LOLBin attack is usually the command line that is executed, so we need a lot of representative command line data that is very similar in distribution to the actual command line data we'll be predicting against. Second, we need a lot of good labels for this data: in order to teach a machine learning model what is a malicious command and what is a benign command, we need plenty of examples that span a wide variety of use cases. Well, the first problem is not really an issue, because working at a large security vendor, we have plenty of access to the actual distribution of command line data we'll be predicting against. In this work we mainly focus on detecting and surfacing alerts to the MDR, the managed detection and response service, and there we have about 1.5 billion command line invocations per day across all our customers, which feed into the existing detection and alerting system. This is the data we actually want to plug our model into to surface additional alerts. A machine learning based detector is actually the perfect addition here considering the incoming data volume, because invariably, when you have one and a half billion command lines, there are a
few attacks getting through that are probably being missed right now. So having data is not an issue, but we often run into snags when we need enough labels. Finding labels for command line data can be a little harder, and it also often becomes a catch-22 problem: if you have a reliable, quick, and accurate way to label your command lines, then why do you need a machine learning model to detect LOLBins? However, we don't have any such quick, reliable system, so we resort to a three-pronged strategy to label the command line data we've got. Here are the three different strategies we use to label our data. The simplest source of labels just lies in past data: some LOLBin attacks have already been seen and investigated by the security operations center, and we have that information stored, so we can use it directly. The second source, where we got really creative, is indirect labels. Sophos has a lot of different products, and across all these products we have labeled data for several artifacts like files and URLs, and we can use this information to indirectly label command lines. For example, if a malicious file on a user's system also triggers the execution of a command line, then we have an example of a known malicious command line that we can use to augment our training data. And the third strategy is that we now have a dedicated group of threat analysts, part of Sophos AI, who distill all the information from the first two sources plus all the knowledge they have and create rules; these rules can be used for labeling our data. Here's an example of the indirect labeling through a Sophos product that I talked about: Intercept X has something called root cause analysis, where when it finds a malicious file, it constructs a graph of all the operations the triggering file performs on the user's system. This
involves touching files, deleting files, creating processes, and, more importantly, running command lines. So we collect a lot of potentially malicious command lines from this data, and here on the slide you can see an example of a root cause analysis graph and a command line we obtained from it that could potentially be malicious. For other sources of indirect labeling, we use sandbox behavior reports. One good thing about sandbox behavior reports is that the detection engine hooks into the Windows Antimalware Scan Interface, which means it gets access to the actual blob of code that executes when a command line invocation happens. That means it can get around a lot of issues like obfuscation, or cases where code is executed from a .ps1 file and we can't see the contents of the .ps1 file from the command line itself; we get a lot of indirect labels from that source. We also do URL lookups: through the next-gen web product we have a repository of URLs labeled as malicious or benign, and if there is an embedded URL within a command line, we can look it up and see if it's malicious. If a command line is operating on a malicious URL, it's more likely that the command line itself is malicious. And then there is the final approach,
which is manual review and labeling through incident investigation data. Here we have our analysts go through incident investigation reports and manually review whether the case was actionable and whether there are associated LOLBin command lines in the incident, and then they create rules using regular expressions to capture future occurrences of such commands. These rules are categorized into four different types. There are strong malicious rules: a malicious rule is given to a known malicious command line or a known attack, and only in cases of really high confidence; you can treat this basically as a block list. An example is when we have a known attack pattern, like a PowerShell command performing an unsecured download through a very specific port; that's a good way to identify that something we've already seen before is happening on a user's system. Then there are suspicious rules: if command line activity is seen as potentially malicious but we don't have high enough confidence to completely convict it, it receives a suspicious label. An example is listing credentials, which is suspicious behavior but could also be legitimate in a different context. Then we have the other side of the coin, where we have weak benign labels, or debatable labels: potentially benign activity is given this weak benign label. We think this activity is more likely to be benign, but a machine learning system could potentially false-positive on it because it's doing something strange, like carrying a big Base64-encoded chunk that a model could see as potentially malicious. And then we have strong benign labels, which is basically known benign activity. For our machine learning experiments we take the first and last categories of rules, the block list and the allow list, and augment our labels using those rules; the suspicious and debatable rules are used in a different way, which I'll get to later.
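A hedged sketch of using the strong rules as block and allow lists when augmenting labels; the regexes here are invented stand-ins, not real Sophos rules:

```python
import re

# Toy rule-based labeler: strong-malicious rules act as a block list,
# strong-benign rules as an allow list; anything else stays unlabeled
# (suspicious/debatable rules are deliberately kept out of training labels).

RULES = [
    ("malicious", re.compile(r"powershell.*-enc\b", re.I)),    # hypothetical
    ("malicious", re.compile(r"certutil.*-urlcache", re.I)),   # hypothetical
    ("benign",    re.compile(r"^ipconfig\s*(/all)?$", re.I)),  # hypothetical
]

def label(command_line):
    for verdict, pattern in RULES:
        if pattern.search(command_line):
            return verdict
    return None  # no rule hit: leave unlabeled

lab1 = label("powershell -enc SQBFAFgA...")
lab2 = label("ipconfig /all")
lab3 = label("notepad.exe report.txt")
```

Only the high-confidence hits flow into the training set; everything that returns `None` needs a label from one of the other sources.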
So, in a nutshell, here's the entire labeling strategy. There are two label sources: indirect labels obtained by cross-referencing command lines, and command lines found to be malicious based on incident investigation reports. And then there are human analysts looking at both these data sources and creating rules, which are also used to augment the labels we have. At the end of this labeling strategy, we end up with a labeled data set we can finally use to experiment with different machine learning models. Here are some statistics for the data sets we used to train our machine learning model: we have about 76 million samples in our training set, of which about 1.6 million are malicious; 7 million samples in the validation set, of which 70,000 are malicious; and then about 12 million samples in our test set. Importantly, here we use time splitting, and I think my colleague Ben talked about this in his talk too: it's really important to time-split, where you're essentially simulating new data. In this case, the model has seen data up to 1 April 2022 in the training set, so when you're validating the model and testing its performance, you're not giving it any command lines it has already seen. The validation set is created in such a way that there are no overlapping commands, and it contains only command lines seen after 1 April 2022 in our systems. It's the same with the test set, where we collect data from 1 May 2022 to 1 July 2022.
And then we first train a baseline model on this data. We use baseline models as a simple benchmark before training more complicated models, so that we can keep track of how well the actual models are doing and make sure they're not performing worse than the simplest possible model we could use. For this baseline model we pre-process the command lines, remove whitespace, decode any Base64-encoded chunks, and then develop a feature representation. This feature representation consists of three different parts. We generate summary statistics for the number of transitions between six different character classes: digits, whitespace, uppercase vowels, lowercase vowels, uppercase consonants, and lowercase consonants. We count the number of times a command line transitions between those character classes, so we basically get a six-by-six grid of counts, which we then flatten into 36 different features. We also count occurrences of special tokens, like a PowerShell invoke expression, or a dollar sign that could indicate a variable is being used inside a command line. And the final kind of feature is the number of special symbols used in the command line, like opening and closing braces, or whether any URLs were used, and finally the length of the command line.
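A hedged reconstruction of that baseline featurizer; the exact special-token lists are assumptions, but the 6x6 transition grid and length feature follow the description:

```python
# Toy baseline featurizer: 36 character-class transition counts,
# a couple of special-token counts (token lists are made up), and length.

CLASSES = ["digit", "space", "upper_vowel", "lower_vowel",
           "upper_cons", "lower_cons"]

def char_class(c):
    if c.isdigit():
        return "digit"
    if c.isspace():
        return "space"
    if c.isalpha():
        vowel = c.lower() in "aeiou"
        if c.isupper():
            return "upper_vowel" if vowel else "upper_cons"
        return "lower_vowel" if vowel else "lower_cons"
    return None  # other characters are ignored for the transition grid

def featurize(cmd):
    grid = {(a, b): 0 for a in CLASSES for b in CLASSES}
    prev = None
    for c in cmd:
        cur = char_class(c)
        if prev is not None and cur is not None:
            grid[(prev, cur)] += 1
        prev = cur
    features = [grid[(a, b)] for a in CLASSES for b in CLASSES]
    features.append(cmd.count("$"))            # variable-use token
    features.append(cmd.lower().count("iex"))  # Invoke-Expression alias
    features.append(len(cmd))
    return features

feats = featurize("Run123 $x")
```

The result is a short fixed-length numeric vector, which is what a tree model like the one described below can consume directly.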
For the machine learning model, we use a simple XGBoost model for our experiments. So that was the baseline model, and then we also conduct experiments with two convolutional neural networks. In order to provide the command line as an input to these models, though, we need a different kind of feature representation: we first need to convert the characters into integers, in a form the model can consume. On the slide I have a simple example of how we do this with just uppercase alphabets: if I create an integer representation for the alphabet A to Z, giving the letters numbers from 1 to 26, then for the string "bsides las vegas" I can replace each character with its associated integer, replace whitespace with 0, and I have a list of numbers representing that string. The only difference between the feature representation we developed and this example is that the size of the vocabulary, the number of characters we consider, is a lot higher; for the command lines we basically consider all printable characters. These are the two convolutional neural network models we experiment with: the one on the left is a simpler architecture with a single convolutional layer and a small dense layer, and the one on the right is a deeper network.
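The toy character-to-integer encoding just described can be sketched like this:

```python
# Toy encoding from the slide: A-Z map to 1-26, whitespace maps to 0.
# The real featurizer uses the same idea with a vocabulary over all
# printable characters instead of just the alphabet.

def encode(text):
    out = []
    for c in text.upper():
        if c.isspace():
            out.append(0)
        elif "A" <= c <= "Z":
            out.append(ord(c) - ord("A") + 1)
        else:
            out.append(0)  # out-of-vocabulary in this toy version
    return out

ids = encode("BSides Las Vegas")
```

Each command line becomes a sequence of small integers, which is the input format a character-level convolutional network expects.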
It's an architecture we have used in the past to detect many different string artifacts, like URLs, file paths, or registry keys; any string artifact, basically. You can read about this model architecture on our site, where one of our data scientists, Salma Taoufiq, has written a great explainer blog post, so I won't go into details of the network architecture here in the interest of time. These are the model results we got for the three different models; these are the best results we obtained after hyperparameter optimization. If you're unfamiliar with this plot, let me take a quick aside to explain. It's called the receiver operating
characteristic curve or the raw curve for short um it basically plots the true positive rate on the y-axis against the false positive rate on the x-axis for different thresholds of model score let me break that break that down a little bit so the true positive rate is the number of correct malicious predictions of a model against the total number of true malicious commands in the data the false positive rate is the number of false positives as a fraction of the total number of benign samples in the data so the machine learning models that we train here basically output a score between zero and one with a higher score denoting a greater chance of a command line to be malicious
so when i talk about deciding what threshold to use, that's the value i'm talking about: if i decide that anything above 0.5 is what i'm going to consider malicious, then i'm going to get a different subset of malicious commands detected by the model than if i used 0.6. so that's my threshold. so why do false positives happen, and why do the true positive and false positive rates change based on the threshold that we use for a model? when we train a machine learning model, what it is doing is learning an association between the inputs and the outputs. as we saw in the case of command lines, machine learning models can only consume
numeric inputs, so we need to create numeric representations for the artifacts that we want the model to predict on. we can't just supply it with a command line; we need to create a representation for that command line. in an ideal case, like on the left, the representation that we chose for our data might make it easy for the model to separate out the good from the bad. instead of coming up with, say, 50 numbers, if you came up with two numbers that represented your data and plotted them on a scatter plot like that, then if you chose your two numbers correctly you should ideally be able to
separate out your good from your bad samples. but unfortunately this is only an ideal case, and in a real case the nature of the data itself often does not let you do this. the real case is much closer to something on the right, which is why if you change your threshold, if you change the location where you draw the line, you get a different false positive rate and a different true positive rate. the optimization problem that you're trying to solve here is whether you are going to bias towards creating more false positives or more false negatives. and like i mentioned in the goals section before, we
definitely do not want to create a lot of false positives, even if it is at the cost of a few false negatives; we want the system to be more usable, and creating a lot of false positives just makes the whole system useless. so that was a lot of information, so i'm just gonna pause here for a few seconds. this is my dog, his name is hobbes. i named him after a character in my favorite comic book, calvin and hobbes. when i got him he was supposed to be a five-month-old lab puppy, but it turns out the rescue made a mistake and he's actually a lab beagle mix, so
he's actually an adult that looks like a lab puppy. now there's a false positive i don't mind
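to make the character-to-integer featurization from earlier concrete, here is a minimal python sketch; the vocabulary and padding length are illustrative, the real model uses all printable characters and a much longer fixed input:

```python
import string

# illustrative vocabulary: 0 is reserved for whitespace and unknown characters,
# every other printable character gets its own integer id
VOCAB = {ch: i + 1 for i, ch in enumerate(string.printable) if not ch.isspace()}

def encode(command_line, max_len=20):
    """map each character to its integer id (0 for whitespace/unknown),
    padding or truncating to a fixed length for the neural network"""
    ids = [VOCAB.get(ch, 0) for ch in command_line[:max_len]]
    return ids + [0] * (max_len - len(ids))

encoded = encode("bsides las vegas")
```

the fixed-length integer list is what gets fed into the embedding layer of the convolutional network.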
okay, so we described the machine learning model that we trained and the results that we got. a simple question: is this deployable? what we did was look at the output of this model when we plugged it into a one percent sample of our one and a half billion command line event stream, and this is what happened. this plot basically tells us the number of detections by the machine learning model on a daily basis for the past three weeks. you can see that it's creating hundreds of alerts, which practically makes it unusable, because it's very unlikely that there are actually hundreds of living off the land binary attacks happening. so
we had a model with reasonably good accuracy that is doing really poorly. why is this happening? this doesn't seem like expected behavior, right? well, this is known as the false positive paradox, or the base rate fallacy, and it is commonly seen in a lot of machine learning applications, especially in security. let's understand this with an example. let's say that we have a near perfect ml model, even better than the ml models that i've trained so far, like i could conduct research for another five years and get the best possible model out there. let's say it has a true positive rate of
100 percent, which means that if it ever sees a malicious command it's always going to detect it as malicious, and the only downside is that out of every 10,000 benign command lines it sees, it's going to say that one of them is actually a positive; so it's going to make one false positive out of every 10,000 samples. this is a really good model, right? but if you actually compute the math when plugging it into our event stream, assume that there are 150 malicious
command lines a day and one and a half billion benign command lines a day, which is probably close to the real number. it generates 150 true positives: it has a 100 percent true positive rate, so it's going to detect every one of those malicious commands. but what happens to the false positives? since we have one and a half billion benign commands, even a model that creates only one false positive for every 10,000 samples still creates a hundred and fifty thousand false positives, and at that point the machine learning model is really not helping the SOC analyst, who is just going to drown in alerts. and if you do the math, that is
one true positive per thousand false positives, which is a terrible model accuracy number. so why does this actually happen? if you look at the distribution of malicious versus benign, that is the main culprit here: we have very few malicious commands, or a very low base rate of malicious commands, compared to the number of benign commands. so hypothetically, if you had a 50-50 distribution, if you had 750 million malicious commands and 750 million benign commands and you ran them through the same model, that model would produce 750 million malicious detections for 75,000 benign detections, which is really great: it's like 10,000 true positives for every false
positive, right? so yeah, this is the main problem that we face when trying to apply machine learning to this use case. so how do we work around this problem? what is the solution? is the system completely useless? have we failed our goals? well, not quite; there are options. as josh and others have mentioned before in their talks, deploying ml as a standalone system almost never works, especially in security, because we always have this kind of skewed data distribution. so what we need are guardrails: we need to control when the machine learning model actually comes into play so that it can contribute where it performs best. so we don't apply it to every single
command line that we see, we apply it selectively. what guardrails should we use? earlier we talked about using analyst-defined regex rules to augment our labeling, and i alluded to the fact that we also have suspicious and debatable labels, which is exactly what we use for guardrails in this case. so just to detail the system a little bit: if we have no rules matching for a given command line, then it doesn't matter what the ml score is, we are just not going to create an alert. here are some examples where the machine learning model thinks that something is malicious but there are no regex matches, so we're
not going to alert, and as you can see, it seems like these are not actual malicious commands. we also notice that our machine learning model has a bias towards detecting smaller commands as malicious, because it has very little information, so it's behaving in an unstable way. and then if any of our strong rules, our high-confidence rules, matches, then we don't feel the need to go to the machine learning model to tell us whether something is malicious or benign; we just directly take the output of the rule and decide whether to surface an alert or not. but the real contribution of the machine learning model comes when a
suspicious rule is triggered. if you remember, a suspicious rule is when there is potentially malicious activity but we are not entirely sure, because in some contexts it could be used in a benign way. so here we use the model: if the model scores highly on a command line that triggers a suspicious rule, then we generate an alert, and if not, then we don't generate an alert; we actually create something called a silent detection that we want to review later, so we won't surface an alert but we'll store it in our telemetry for further review. and then we have debatable rules, and again we do something similar:
if a debatable rule is triggered, then we look at the ml verdict. if the machine learning model says that a command line that triggers a debatable rule is malicious, then we create a silent detection and wait for further review, but if the machine learning model says that something is benign and it triggers a debatable rule, then we don't surface an alert. so that was a lot of information, and i want to put it all together and provide some high-level context, just so you understand all the pieces that go into the system. during machine learning training,
we collect data from the event stream, we clean and sample it, and then we source labels from three different locations: we have indirect labels, we have past data, and then we have allowlist and blocklist rules that analysts have created. using this labeled data set we then train a machine learning model, and when it comes to prediction, we plug both the machine learning model and the analyst-created rules into our system, and we only create alerts in two separate scenarios: one of them is if a blocklist rule is triggered, and the other is if suspicious activity is detected and the machine learning model also detects the same suspicious command line.
we started off saying that we wanted to surface additional alerts into the system while not flooding the SOC; did we actually meet this goal? this bar plot shows the total number of unique alerts that we generated per day once the system is deployed, and it looks like we certainly create a manageable number of alerts now, at least compared to the previous system. the blue bars here denote the number of alerts that are generated by the blocklists, and the orange bars denote the number of alerts that are created by suspicious command lines that are detected by the machine learning model. so it looks like on most days the system does not create more than 50 alerts.
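the guardrail logic described above can be sketched as a small decision function; the verdict names and the 0.5 cutoff here are illustrative assumptions, not details from the talk:

```python
def triage(rule_match, ml_score, threshold=0.5):
    """combine an analyst rule verdict with an ML score (hypothetical sketch).
    rule_match: None, "high_confidence", "suspicious", or "debatable"."""
    if rule_match is None:
        return "no_alert"              # no rule fired: ignore the ML score entirely
    if rule_match == "high_confidence":
        return "alert"                 # trust the rule, skip the model
    if rule_match == "suspicious":
        # model decides between a real alert and a silent detection
        return "alert" if ml_score >= threshold else "silent_detection"
    if rule_match == "debatable":
        # at most a silent detection, never a surfaced alert
        return "silent_detection" if ml_score >= threshold else "no_alert"
    raise ValueError(f"unknown rule verdict: {rule_match}")
```

the key design choice is that the model never creates an alert on its own: it only upgrades or downgrades verdicts inside the band the rules have already marked as uncertain.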
a manageable number of alerts is well and good, but how good are the alerts actually? it's important to create alerts that are very likely to be malicious, because we want to build the SOC analysts' trust in our detector, so we want it to be as precise as possible. this plot shows the true positive rate and the false positive rate for the alerts generated by the entire system when plugged into the event stream, computed on a weekly basis for the past four weeks. we see that the true positive rate hovers between 0.9 and 1, which is a pretty good rate of detections, a pretty good precision. combine this with a manageable number of
alerts and we think we have a very deployable system. the second question: did we actually add value? the plot here shows the number of alerts that were seen in the entire system on a weekly basis for the past three weeks. the teal block shows the number of alerts that were created by the existing system, the yellow section shows the intersection, where the existing system detected these attacks and our system detected them too, and the blue section, which is the most encouraging, is the new alerts that were surfaced by just our system and weren't surfaced by the existing system. and although this number is small right now, we have nowhere to go but up, and
over time we think that our system should improve significantly and produce more actionable alerts. a big part of our system is the involvement of our SOC and AI analyst teams, who have created dashboards that continuously monitor the performance of this system. they then use this information to tweak the rules, catch false positives, and improve the guardrails as a continuous process. and again, this is in line with what josh talked about and what ben talked about: it's involving humans in a feedback loop in order to continuously improve the performance of our systems. so, a quick summary of everything that we talked about in this talk. living-off-the-land attack detection is a hard
problem with several challenges: there is a large data volume, there is a low base rate of malicious activity, and labels are hard to come by. we've demonstrated some strategies that we've used to mitigate these problems and work around them. another important lesson that we learned here is that good data engineering really pays dividends: we basically went fishing through our data lake and got as much data as possible in order to label our command lines and increase label coverage. we got representative data, we had a good labeling strategy, and we also created good dashboards, performed principled validation and analysis, to ensure continuous improvement of our system
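that low base rate is exactly the false positive paradox from earlier, and the arithmetic is easy to reproduce:

```python
# the back-of-the-envelope numbers from the talk: 150 malicious and
# 1.5 billion benign command lines per day, and a near-perfect model with a
# 100 percent true positive rate and a 1-in-10,000 false positive rate
malicious_per_day = 150
benign_per_day = 1_500_000_000

true_positives = malicious_per_day          # 100% TPR: every attack is caught
false_positives = benign_per_day // 10_000  # one false positive per 10,000 benign
precision = true_positives / (true_positives + false_positives)
# precision comes out to roughly 0.001:
# about one true positive for every thousand false positives
```

swap in a hypothetical 50-50 class balance and the same model looks excellent, which is why the base rate, not the model, is the culprit.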
another lesson that we learned here is that even the best machine learning models cannot perform well as a standalone system; they need guardrails. and the way forward basically is to integrate rule-based detection and human analysis into the machine learning system and create continuous improvement and feedback loops. and that's all i've got for today, thank you. any questions? [Applause]
all right i'll be around in the room if you have any questions you can come up to me and ask me more [Applause] thank you
we'd like to thank our sponsors especially our diamond sponsors lastpass and palo alto networks and our gold sponsors amazon invisium and intel it is their support along with our other sponsors donors and volunteers that make this event possible these talks are being streamed live except in the underground track and as a courtesy to our speakers and audience we ask that you check to make sure your cell phones are set to silent or vibrate uh questions will be at the end of this talk um if you have questions use the audience mic in the center of the room uh near the projector so youtube can hear you as a reminder the b-sides lv photo policy prohibits taking pictures without
the explicit permission of everyone in frame these talks are all being recorded again except in underground and will be available on youtube in the future please keep your masks on at all times and if you want to come closer to the front or to other people please be respectful of social distancing with that let's get started welcome brittany [Applause] hello and welcome to my talk my name is brittany bach and i'm here to share notes on my project: repurposing vulnerability tickets to predict severity levels, an introduction to natural language processing and classification algorithms
thank you. so, how i got here: i started my career mostly on the receiving end of information as a tech writer, and then transitioned into infosec as a security analyst working on the compliance side of things. afterwards i moved into a hands-on role as a security engineer, where i found my niche as an operations process lead. it's this role that allowed me to work with multiple security workflows and continue to ask the question: how can we do it better? from my experience, doing something better, or process improvement, starts with a lot of talking. typically teams get together and discuss how something is supposed to go. afterwards there is some testing to work out issues. once enough tests have been run and a
working process is in place then commitment in the form of documentation occurs the documentation signals there is an agreed-upon method of carrying out a process the expectations at this point are set eventually the process is completed enough times that consistency of expected input and output are generated and then the process is automated which simply speeds up the rate of completion it's at this stage a lot of data is collected there's always the possibility of going back with new information found and running through these steps again but eventually you start asking the next question which is what do we do with all this data so in the context of this project the data we're talking about is in the form
of vulnerability tickets. vulnerability tickets are created to track the remediation of detected vulnerabilities. they typically hold a lot of sensitive information about the impacted product or service, including associated teams, remediation instructions, published descriptions, and completed resolutions, to name a few. they are usually maintained in a repository where they can be tracked and recalled as needed. they are prioritized based on severity. they can provide descriptive statistics, where the gathered information can be quantitatively described or summarized. vulnerability tickets are generated in natural language
natural language processing is under the artificial intelligence umbrella. since there is not a one-to-one match between natural language and computer language, you must convert natural language into a format the computer can understand. once this conversion is completed, your data set can be run through various machine learning algorithms that can detect patterns in your data. natural language processing is a two-fold process: first there's the data pre-processing, and second there's the algorithm development, both of which we will explore shortly
so checkpoint one before we go any further we'll just take a moment to pause and reflect on why all this might be important to you as a recap there's a lot of tribal knowledge and accepted analysis within vulnerability tickets by repurposing the information process improvement can continue following automation this process can then lessen human dependency and increase accuracy you're also creating a nice transition from descriptive analytics to predictive analytics predicting severity levels is just a start any information that is consistently tracked within the vulnerability tickets is fair game to be analyzed
project methodology: there are three main steps to this project. first, data preprocessing, which is the collection and preparation of data to run through machine learning algorithms; the end of this step will result in a final data set for step two. second, there's algorithm development; we're going to cover five common classification algorithms and how they help to detect patterns in the pre-processed data set, and the end of this step will result in observing the accuracy percentage of each algorithm and selecting the best one for the prediction task. finally, we combine the data set and the selected algorithm to complete the prediction task; the end of this step will result in a right or wrong prediction from the test
input
okay, step one: data pre-processing. the first thing you need is data, and normally you would already have this internally within whatever ticket repository your company uses. in this case, because we're using public data, the exercise of web scraping came in handy to gather the needed information. as an alternative you can try finding a data set in a community repository like kaggle, but for me i wanted the experience of building and cleaning my own data set, so i went the python route. the other steps of pre-processing include removing dupes, replacing missing values with nan, removing irrelevant words (using a stop word list to help with the removal), and adding a custom header. and just as a side note, a stop word
list is basically a list of non-essential or no-impact words like pronouns, articles, prepositions, and conjunctions; examples of these would be words like a, and, the, your, are, and there. it's basically words that offer no value for the prediction task. the final step was applying count vectorization to the natural language data set to make it algorithm-ready. this is basically running a script on the data set to convert the text to numerical form; in this case we are counting the frequency of pre-selected words mentioned in each cve vulnerability description. all of the above changes were completed using
python so here's a quick side-by-side comparison of the natural language data set next to the count vectorized data set this is considered the completion of step one data pre-processing of natural language processing as mentioned the data set on the left includes the cve id severity level and most importantly the vulnerability description in natural language upon count vectorization the frequented words become the header and the number of times they are mentioned in each vulnerability description are catalogued underneath the severity level is still maintained because it is what the algorithms will use for the classification the data set is now ready to be added to the classification scripts
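as a hedged illustration of that count vectorization step, here is a hand-rolled sketch (a real project would more likely use scikit-learn's CountVectorizer); the stop word list and the two descriptions are made up:

```python
# illustrative stop word list; real ones are much longer
STOP_WORDS = {"a", "an", "the", "is", "in", "of", "and", "or", "to"}

def count_vectorize(descriptions):
    """build a word-count matrix from raw text, skipping stop words."""
    # build the vocabulary from all non-stop words, in first-seen order
    vocab = []
    for text in descriptions:
        for word in text.lower().split():
            if word not in STOP_WORDS and word not in vocab:
                vocab.append(word)
    # one row of word counts per description
    rows = []
    for text in descriptions:
        words = [w for w in text.lower().split() if w not in STOP_WORDS]
        rows.append([words.count(v) for v in vocab])
    return vocab, rows

vocab, rows = count_vectorize([
    "buffer overflow in the parser",
    "overflow leads to remote code execution",
])
```

the vocabulary becomes the header row and each description becomes a row of counts, which matches the side-by-side comparison described above.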
okay, step two is algorithm development. as mentioned before, the second step in natural language processing is algorithm development. the classification task is figuring out whether the frequency of certain words across multiple vulnerability descriptions is more likely to appear in one severity level over another, therefore allowing us to predict the severity level for a discovered vulnerability. for this project i opted to focus on five common classification algorithms. these are going to be helpers in detecting patterns in the data set. for example, perhaps there is a certain word or cluster of words that is more frequent in, say, a critical vulnerability description; detecting this pattern would allow us to predict critical vulnerabilities in the
future. now there are a lot more than five classification algorithms, but when you begin studying machine learning, without a doubt these five will be encountered. so for the sake of an introductory-level talk we will use logistic regression, k nearest neighbors, support vector machines, decision trees, and gaussian naive bayes. we are going to review these algorithms and then we will make a selection based on results from the cleansed data set
all right, so logistic regression is the first algorithm, or model, that we're going to be looking into. this is considered a probability-driven model: logistic regression finds patterns in data sets by identifying the probability of an outcome, in this case the severity level, based on the features that are present, which for us is going to be the list of our most frequented words. out of the five, this is considered the most straightforward model
the next is k nearest neighbors, or knn. this one is a distance-based model: when new data is introduced to the data set, the nearest data points are identified, and these are considered the neighbors. using the value of k you can adjust the perimeter of the neighborhood, and although this is set to 3 in the actual script, you can adjust that value. so if you look at the image to the right, you can see there are three closest data points selected; of these three, the one with the shortest distance to the new data point will be selected, and the new data point will then inherit the nearest neighbor's class
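a tiny sketch of that idea with k=3, using the standard majority-vote variant of knn and made-up 2-d points (not the project's code):

```python
from collections import Counter

def knn_predict(points, labels, new_point, k=3):
    """classify new_point by majority vote among its k nearest points."""
    # sort indices by squared euclidean distance to the new point
    order = sorted(range(len(points)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(points[i], new_point)))
    nearest = [labels[i] for i in order[:k]]
    return Counter(nearest).most_common(1)[0][0]

# made-up data: three points in one cluster, two in another
points = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5)]
labels = ["low", "low", "low", "high", "high"]
pred = knn_predict(points, labels, (0.5, 0.5))
```

here the three nearest neighbors of (0.5, 0.5) all carry the "low" label, so the new point inherits it.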
okay, so the next one is support vector machines, or svm. this one is a multi-dimensional model, and it's generally really helpful for classifying when you have a really complex data set. its goal is to find the maximum margin distance in order to make the classification. in this case it's a two-step process that happens iteratively until the maximum margin is detected. for step one, svm generates multiple hyperplanes, which are basically lines that separate the data points into classes. in step two, the model then picks the most optimal hyperplane, which is basically the one with the maximum margin. the maximum margin tells us svm has found the line that makes the clearest distinction
between the two classes. the support vectors are simply data points that help the model accomplish classification; in this case that would be the blue circles and the red squares. from personal experience, when i was running the data set for this project through this model, it took the longest to process, and i'm thinking that occurred because, based on this definition, finding the maximum margin was probably quite difficult. when i ran it through the script, i think it took almost two hours to complete processing
all right, so up next we have decision trees. this is a flowchart model. an interesting requirement with decision trees is that the data set must be labeled, and this is known as a supervised learning algorithm; basically, you cannot have a data set that doesn't have labeled features. in our case we have features that are very distinctive of what we're looking for. if you had a data set that did not have that, and you were doing more exploratory data analysis, then this might not be the best algorithm for that particular project, but in this case it was okay. with decision trees, the way it works is it classifies by starting at the top or trunk of a tree
and then it moves its way down with more specific questions to figure out similarities and differences within the data set so in this case you are classifying based on the most finite shared category
all right, last one: gaussian naive bayes. this is another probability model. there are actually three types of naive bayes models, but we're going to choose this one because frankly it's the simplest. this model is a derivative of the bayes theorem, which means the probability of one event is determined based on the probability of a previously observed event. gaussian classifies based on a normal distribution: if you look at the image, the model is calculating the probability of observing each data point based on the class. okay, so next we'll talk about algorithm selection
so to figure out which algorithm to select for the prediction task, you run the cleansed data set through a script that holds the libraries and classes for the selected algorithms. the output provides an accuracy percentage for each of the models; that's the screenshot on the right. based on the results, it's clear that support vector machine returns the highest accuracy percentage on the provided data set. however, as i mentioned, it also takes the longest to process, so for the purpose of the demo we're going to go ahead and use the second best, which is logistic regression, but i will still show you output on how the support vector machine fared
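a sketch of what such a comparison script might look like using scikit-learn, shown here on a synthetic toy data set rather than the project's actual count-vectorized CVE data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# synthetic stand-in for the vectorized vulnerability descriptions
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "k_nearest_neighbors": KNeighborsClassifier(n_neighbors=3),
    "support_vector_machine": SVC(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "gaussian_naive_bayes": GaussianNB(),
}
# fit each model on the same split and record its held-out accuracy
accuracies = {name: m.fit(X_train, y_train).score(X_test, y_test)
              for name, m in models.items()}
```

printing `accuracies` gives the per-model percentage comparison described above, from which the best (or fastest acceptable) model is selected.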
all right second checkpoint um so as a recap we know natural language processing is a two-step process of data pre-processing and algorithm development we covered each of the classification algorithms that will be used to make an algorithm selection and then next we're going to walk through the prediction expectations and then input test data into a prediction script to observe the results of the prediction task
so this project is derivative of two other projects of a similar nature. during the initial research i wanted to get an idea of expectations and what the definition of success should be. if you look at the table to the right, there are important data points for both of the example projects as well as the third one, which is mine. as you can see, there are both variances and overlap across all three projects. the general methodology of data pre-processing, feature extraction (again, that's the number of word frequencies), and algorithm development was also completed; the only difference is that the number of vulnerabilities and features collected is much higher in project three. in thinking of what i was aiming for
with this project, i wanted to gain an accuracy percentage that realistically could get presented for user buy-in. i knew if i had anything less than 90 percent, then the likelihood that users would feel confident in trying out this tool would not be very high. so fortunately, after an intensive journey of developing the size and the quality of the data set, the goal was reached with the logistic regression and the svm models respectively
next we're going to talk about the impact of data to classification accuracy the relationship is absolute when i first started this project i was gathering cve data including severity levels and vulnerability descriptions manually i was hoping an even distribution across the four severity levels would result in adequate classification accuracy percentages unfortunately the process was cumbersome and unsuccessful with such a small matrix you can see as the data set iterations progress and the number of vulnerabilities and features increase the classification accuracy percentage increases in tandem for example if we follow the first row of this data set the relatively small number of vulnerabilities and features results in less than optimal accuracy percentages across the four algorithms
the other interesting part about this table is how mistakes in cleaning the data set impact the accuracy results this is evident in rows four nine and ten you can see either an errant column or row was left in the data set before running the classification script and that does impact the accuracy results
so here's a visual of the accuracy trend. this is evident with the linear trend line, which is in hot pink, and you can also see the random peak with data set 4 where the column id was left in; it should have been removed, but it basically created a false positive where i thought i had hit accuracy gold when in reality i just forgot to remove the id column. so one key takeaway from this is that if you see your results being really, really good, even though the goal is to get as close to 1, or 100 percent, as possible, if it doesn't make sense with the rest of the testing then it's probably a good
idea to check your data set and in this case you know just not removing a single column that had no value completely skewed the results but aside from that i mean you can see that as the vulnerability numbers um through web scraping increased and also as the number of features increased which is the frequency of words as that count increased um you know there was just a steady trend up where the accuracy percentage also increased so data matters okay so a couple notes before we get to the demo um i did record the demo i did it this morning because frankly i don't trust my coordination to do this live but we're going to cover preparing the
data set using a randomizer script. Then we'll cover data preprocessing, which means cleaning up the data both in natural-language form and post-vectorization. Last, we'll run the cleansed, vectorized test input through the prediction script and observe the results. Okay, movie time.
This works great. All right, so the first tab just shows the classification accuracy. This is the percentage output that was in the screenshot, and that's what it looks like once you run the script.
All right, in this tab we're going to create our natural-language data set by randomly generating a list of CVEs. These will be preprocessed for the algorithm development. This step isn't a requirement for the project; it's added for effect, to show that if a new CVE is published and your organization becomes aware of it, you obviously won't have that information right off the bat. Okay, so yeah, that's the data input. I selected five for the demo. You can do a lot more, and I have done a lot more, but since time is
important, I went ahead and randomized five. You can see the id, severity level, and vulnerability description, all in natural language. Next we're going to open up the generated CSV.
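The randomizer step can be sketched in a few lines of Python. This is an illustration only, not the talk's actual script, and the record tuples are hypothetical placeholders standing in for the scraped CVE data:

```python
import random

def sample_cves(records, n=5, seed=None):
    """Randomly pick n CVE records (id, severity, description) to use as
    natural-language test input for the prediction script."""
    rng = random.Random(seed)  # seed only to make the demo repeatable
    return rng.sample(records, n)

# Hypothetical records standing in for the full scraped data set
records = [
    ("CVE-A", "CRITICAL", "remote code execution in parser"),
    ("CVE-B", "MEDIUM", "information disclosure via logs"),
    ("CVE-C", "LOW", "minor denial of service"),
    ("CVE-D", "HIGH", "privilege escalation via symlink"),
    ("CVE-E", "CRITICAL", "buffer overflow in driver"),
    ("CVE-F", "MEDIUM", "cross-site scripting in form"),
]

for cve_id, severity, description in sample_cves(records, n=5, seed=7):
    print(cve_id, severity, description)
```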
Okay, so that's all the natural-language data up front in the CSV. The first thing I'm doing here is removing the header. We don't want to include it, because it would also get vectorized and skew our results. The other thing I'm doing is adding a placeholder column: the count vectorization code will stop at that placeholder and know to iterate to the next row. So I add an arbitrary term, which I called "starter column," because that's where I want the iteration to start.
Okay, the other thing I'm doing is changing the file extension from .csv to .txt.
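Those manual cleanup steps (dropping the header, appending the placeholder column, and saving with a .txt extension) could be scripted roughly like this. The file names and the `prepare_for_vectorization` helper are illustrative, not the project's own code:

```python
import csv
from pathlib import Path

def prepare_for_vectorization(csv_path, txt_path, placeholder="startercolumn"):
    """Drop the CSV header so it isn't vectorized, append a placeholder
    token to each record to mark the row boundary, and save as .txt."""
    with open(csv_path, newline="") as f:
        rows = list(csv.reader(f))
    data_rows = rows[1:]  # skip the header row
    lines = [",".join(row + [placeholder]) for row in data_rows]
    Path(txt_path).write_text("\n".join(lines) + "\n")
```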
Okay, next tab. Now we're going to run that text file through the CountVectorizer script. For some reason, this script creates a useless first row of all zeros, which has to be removed, and that's what I'm showing here. The other thing it creates is a column of identifiers, which also gets removed as part of the data processing.
All right, that's what it looks like vectorized. You can see all of the features at the top (there are 592 of them), and you can see how the occurrence of each of these features is listed per row. Again, we're removing the first row, the one that's all zeros; it's not valuable.
Okay, as a side note, you can actually do all of this through code, but I'm showing the manual side for transparency. This is essentially what has to happen before you can run the data through your prediction script.
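To give a sense of what the vectorization step produces, here is a minimal pure-Python stand-in for a count vectorizer. The project's own CountVectorizer script will differ; this sketch only illustrates the idea of one feature column per word and one count row per description:

```python
import re
from collections import Counter

def count_vectorize(descriptions):
    """Build a sorted vocabulary across all documents, then emit one row
    of per-word counts for each document (one row per CVE description)."""
    def tokenize(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    vocab = sorted({word for d in descriptions for word in tokenize(d)})
    rows = [[Counter(tokenize(d)).get(word, 0) for word in vocab]
            for d in descriptions]
    return vocab, rows

vocab, rows = count_vectorize([
    "buffer overflow in parser",
    "overflow in kernel driver",
])
print(vocab)  # the features, one column per word
print(rows)   # word counts, one row per description
```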
Okay, that step is just removing the identifier column, and then there's one small formatting change left before copying and pasting into the script.
Okay, so that's the vectorization for all five CVEs. We're going to copy that, go to the script, and paste it in. Again, we're doing the stemmed run of logistic regression, which, if you recall, was roughly 90 percent accurate, so we're looking for pretty good results.
Okay, now we're going to run the script, and we'll be able to see the actual prediction made for all five CVEs, and also how it compares with the natural-language output.
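Under the hood, a trained multinomial logistic regression model predicts by computing one linear score per severity class and picking the largest. Here is a toy sketch of just that decision step; the weights, biases, and classes are made-up numbers for illustration, not the trained model's parameters:

```python
def predict_severity(count_vector, weights, biases, classes):
    """Return the class whose linear score (w . x + b) is largest, which
    is how a fitted logistic regression model makes its hard prediction."""
    scores = {
        c: sum(w * x for w, x in zip(weights[c], count_vector)) + biases[c]
        for c in classes
    }
    return max(classes, key=scores.get)

# Made-up two-feature example: feature 0 ~ "overflow", feature 1 ~ "verbose"
classes = ["critical", "low"]
weights = {"critical": [2.0, -1.0], "low": [-1.0, 2.0]}
biases = {"critical": 0.0, "low": 0.0}
print(predict_severity([3, 0], weights, biases, classes))  # critical
print(predict_severity([0, 2], weights, biases, classes))  # low
```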
Okay, so that is the prediction made on all five CVEs. It's read left to right, which, if you're looking back at the table, corresponds to top down: critical, medium, low, critical, critical. So it looks like there's a one-to-one match on all of those. As a side note, I have done this with a lot more test data, and it's not always a hundred percent, but it's pretty close. Even when I ran it with 12 different test inputs, I still got nine out of 12 using logistic regression. That was about the same with SVM as well; it just takes much longer to process. But yeah, it was pretty consistent with the measured accuracy percentage. All right.
So this is how the other models fared. The only one that came back with a hiccup was Gaussian naive Bayes, but that was actually expected, and still getting four out of five is pretty good: I've run it with a lot more test data, and there have been times when I only got three right out of 12. If you look at its accuracy percentage, it's relatively low, so it's not the best model for this project. The other models are relatively close in accuracy percentage, and the match was pretty spot-on. That was also the case when I ran with a lot more test data, so that distribution is pretty consistent.
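The match rates quoted above are just the fraction of predicted labels that agree with the known natural-language labels; for example, nine correct out of 12 test inputs is 0.75. A small sketch, with illustrative labels only:

```python
def match_rate(predicted, actual):
    """Fraction of predictions that agree with the known severity labels."""
    assert len(predicted) == len(actual)
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Illustrative labels only: 9 of 12 predictions agree
actual    = ["critical"] * 6 + ["medium"] * 3 + ["low"] * 3
predicted = ["critical"] * 6 + ["medium"] * 3 + ["medium"] * 3
print(match_rate(predicted, actual))  # 0.75
```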
Okay, a few key takeaways before we wrap things up. First, you have to keep processing time in mind when the scripts run. I found, as my data set grew, some