
Keeping CTI on Track: An Easier Way to Map to MITRE ATT&CK

BSides DC · 2019 · 41:19 · 368 views · Published 2019-10
About this talk
Organizations across the globe are looking for ways to use MITRE ATT&CK™ in their environment, but are unsure how or where to start. Our team constantly receives requests to teach analysts how to relate reports to ATT&CK; it's a time-consuming process with a steep learning curve. We get it: reading reports line-by-line to search for adversarial tactics, techniques, and procedures (TTPs), and then verifying whether those behaviors align to documented TTPs in ATT&CK, is challenging. As ATT&CK team members who do this daily, we thought that there has to be a better way. Using our ATT&CK experience, Python, Natural Language Processing (NLP), and some WebDev, we created the Threat Report ATT&CK Mapping (TRAM) tool to automatically extract TTPs from a prose report. So why should you care? Not only will this tool help the ATT&CK team keep our public repository of Groups and Software updated with the latest cyber trends and attacker methodologies, but it can help your internal organizations too. TRAM can analyze multiple reports, map them to ATT&CK, and provide insight into your overall threat landscape and security posture. This enables defenders to test whether current tools and procedures are effectively defending against their most common threats, while also allowing red teams to develop prioritized adversary emulation plans with these TTPs in mind. This talk will review our methodology, discuss our challenges, and demonstrate how you can use the tool for your own organization. With the open-source release of this tool, anyone can go from an ATT&CK zero to hero in no time!

Jackie Lasky (Cyber Security Engineer at The MITRE Corporation)

Jackie Lasky is a Cyber Security Engineer at MITRE. She is a member of MITRE's ATT&CK team where she focuses on cyber threat intelligence. Prior to joining MITRE, she interned with the Department of Defense where she gained experience with malware analysis, data analytics, and machine learning. Jackie holds a B.S. in Computer Science from George Mason University.
Transcript [en]

…teaming. A couple of little random fun facts: this past summer I did a triathlon, I like chai tea, and "the lost file", you can look that one up. And I'm Jackie Lasky, I'm also a cyber security engineer here at MITRE, so I do a lot of the same things as Sarah: working on CTI for our ATT&CK team, and then also some threat hunting. In my spare time I love to travel, I've been to 20 countries, I love to do drone photography, things like that, and if I'm home I love to play with my dog. So today we're going to be talking about what ATT&CK is. So if you haven't heard of it, no problem, we'll catch up

and then we're gonna show that mapping to ATT&CK can be hard and time-consuming, but we hope that we have a solution, which we call TRAM, and we'll end with how this can help you. So I like to start with David Bianco's Pyramid of Pain; I think it gives a good description as we start to get into ATT&CK. I'm sure many of you have seen this many times, but we'll go ahead and start there. Things like hash values are easy for adversaries to change, right? They're also easy for defenders to detect. But the problem becomes that if you're a defender looking only for the hash values, and your adversary changes one bit in their

malware file, well, that hash value is no longer valid. As we move up the pyramid, same concept for IP addresses and domain names: these are pretty easy for attackers to change. But if we look at the tippy-top, in the red here, the TTPs, or tactics, techniques, and procedures, these things are at the behavior level, so they're much harder for adversaries to change, because, like all humans, right, we get stuck in our ways, we have habits. So this is where ATT&CK falls: we hope that we can frame these TTPs with the hope that they can be detected in your environment. So what is ATT&CK? In simple terms, it's a knowledge base of

adversary behaviors. I almost like to think of it as an encyclopedia of things that we've seen adversaries do. It's based on real-world observations: everything that we have in ATT&CK links back to an open-source report, so that you can go read more about a technique or software yourself, as well as confirm that we're on the right track, right? It is free, open, and globally accessible. This is important; it falls in with MITRE's overall vision of being in the public interest. We hope that students and other organizations can use ATT&CK as a common language, depending on your purpose, so please feel free to check it out at

attack.mitre.org. ATT&CK is driven by the community, so, as I tried to mention, we don't have all the answers; we solely rely on public reporting, and a big part of that is people like maybe you in the audience, who write reports that we can use for ATT&CK. A lot of the information there is because somebody mentioned, "hey, you don't have this technique," or "hey, I wrote this report that I think would be added value to the community." So if you're interested in contributing, please reach out. So we'll go ahead and break down the matrix. This is how it's commonly represented: we have the tactics across the top there, the column headers, and

these are the adversary's technical goals. So their goal is to initially access their victim's environment; from there, maybe another goal is to laterally move, or, you know, collect data from that system. Under each column are the techniques, and this is how the goals are achieved. So, for example, if their goal is initial access, the technique, how they might achieve that, might be spearphishing. Within each cell we have more information: these are the procedures, and here we have specific technique implementations. So, as you can see, a couple of examples for spearphishing would be that, you know, APT12 sent some PDF attachments, whereas APT19 sent Excel-based documents, right? So an

added level of detail there. So this is an example of one of our techniques. For each technique on the website there is a description, and there are kind of two main parts here: one is just an overview of what the technique is, and the next is how adversaries can manipulate it for their advantage, because a lot of times adversaries are using normal system methods but abusing them in some way. Next we have what platforms this runs on; a recent update is that we now include a lot of cloud platforms as well. Next we have data sources for every technique. These are high-level, a starting place, but it's a good place to start if

you're trying to detect these techniques: these are some data sources where maybe you would consider looking for logs for a given technique. Additionally, we have mitigations and detections. Kind of like data sources, these are high-level, because they're going to be very dependent on your own organization and what tools you have in place, but we hope it's a good start on ways that you can not only mitigate a given technique but also detect it after maybe it's been done. As I showed in the matrix, we also have the procedure examples; these are the most granular bits of information for any technique, which map to what software and/or groups use that technique. We also have references.
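Everything on a technique page described here (description, platforms, data sources, mitigations, detection, procedure examples) can be pictured as one small record. The sketch below is purely illustrative, using the spearphishing examples from earlier in the talk; the field names are our own, not ATT&CK's actual schema:

```python
# Illustrative shape of one technique entry, mirroring the page anatomy
# described in the talk. Field names are not ATT&CK's real data model.
technique = {
    "name": "Spearphishing Attachment",
    "tactic": "Initial Access",
    "description": "Adversaries send a malicious file attached to an email.",
    "platforms": ["Windows", "macOS", "Linux"],
    "data_sources": ["File monitoring", "Email gateway"],  # high-level starting places
    "mitigations": ["User training"],                      # organization-dependent
    "procedure_examples": [                                # most granular level
        {"group": "APT12", "procedure": "sent PDF attachments"},
        {"group": "APT19", "procedure": "sent Excel-based documents"},
    ],
    "references": ["https://attack.mitre.org"],
}

def groups_using(tech):
    """List which groups appear in a technique's procedure examples."""
    return [ex["group"] for ex in tech["procedure_examples"]]

print(groups_using(technique))  # → ['APT12', 'APT19']
```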

All of this is open-source reporting; we want you to go and check it out yourselves as well. Very similarly, we have group pages. Again, you'll get a short description of things that we know about that group, as well as associated groups. We realize that these group names are not always a one-for-one, because they often come from different data sources and different views of what the group is doing, but they seem to overlap enough that it makes sense to group them together. We have what techniques they've used, similar to the technique page, so that you can see exactly how that group uses a given technique. We also include software that a given group has used, and

again, references, so please check it out yourself. So how does this information get into ATT&CK? This is a big part of what Jackie and I do for the ATT&CK team: we find reliable open-source threat reporting, and there are many different sources that this comes from, but from there we find the behaviors in the report. So, thinking back to the ATT&CK structure, right, the why, how, and what of what the adversary is doing is generally the part of the report that we're looking for. So how does this look? Here I have a blob of text from a report that I'm reading, and I like to initially just kind of read through it, and anytime

I see a verb, that's really a good indicator of the adversary doing something, right? If it says that APT-whatever is doing something, I like to just start highlighting that, and I'll map it to ATT&CK. Next, when getting started with mapping to ATT&CK, sometimes it's easy to start at the tactic level, right? These are the high-level adversary goals, so it gives you a narrower scope to look in. So, starting with the tactic, I'm gonna go through and map all of these. Starting with that first one, I'll just focus on the first bullet: "the Trojan obfuscates executable code." As I'm reading that and looking across the different tactics, I can pretty

easily gauge that it's not initial access, that doesn't sound right, and they're not collecting anything, and as I move along, when I get to Defense Evasion, it's like, yeah, okay, it sounds like defense evasion. From there I can move on to the technique level. This can be really challenging, because there are a lot of techniques, but since we've already found the tactic, we've narrowed the scope of our options, right? You now only have to look in a given column. So as I read that sentence again, "obfuscates," okay, it's probably gonna be Obfuscated Files or Information. That one's pretty straightforward, but a lot of these can be pretty tricky. So a few tips

that I use when mapping to ATT&CK: use the search bar on the website for semi-unique words in a given sentence. So in this example, maybe I would have searched for "obfuscate" and just seen what techniques show up. Also, Ctrl+F is super helpful, right, just FYI. So now that we have a list of techniques that we've mapped to ATT&CK from this report, what can you do with that information? We have four main use cases. First is detection, and this is probably one of our most common use cases: people using ATT&CK to build out analytics for their environment, to search for this type of behavior. As

you're doing that, you can build out assessment and engineering plans. You can see where maybe the gaps in your organization are: are you detecting what you said you are? And from there, that can help gauge whether you need new tools or a new configuration in your environment. Threat intelligence can help you track what APTs or techniques you care about; you know, you can see if groups are evolving over time and adding more techniques to their tool bag. Lastly is adversary emulation. This is a specific type of red teaming where a red team would use a given adversary, maybe APT3 or APT29, to test against your

environment using those selected techniques. So ATT&CK has a lot of great use cases, but there's clearly a problem: the process for getting our data onto the ATT&CK site is done completely manually. We often have a huge backlog of unanalyzed reports that we have to prioritize, some over others, and also, since we have humans analyzing it, we have human error, so this leads to potentially inaccurate information. Two analysts might have two different interpretations of the same text. There are also things like availability bias, which basically means we're more likely to think of the ATT&CK techniques that we know or that we've seen before, which could cause us to overlook ones that we're less familiar with. The process for

training new team members is also really difficult. The process, as you've seen from Sarah, can get kind of complex: it takes time to learn all the different adversaries that we track, the malware that we have, and the techniques that we have, and we're constantly adding more as well, and changing things as we go. So that's where our solution, the Threat Report ATT&CK Mapper, TRAM for short, comes in. This is basically just going to be an easier way to help people map their threat reports to the ATT&CK framework. So this is just a glimpse into the process, kind of like Sarah showed you before, the process

for mapping reports to ATT&CK. We usually start with our backlog of reports, and we kind of make our way from the high-priority reports to the medium-priority reports and then to the low ones, and we either get assigned one or we choose one ourselves. Then, after we have the report we want to look at, we go through it like you saw Sarah do, and you highlight along the way, finding the ATT&CK techniques that you know, or that you don't know, just using Ctrl+F, things like that. And then afterwards you copy and paste whatever you've typed out into the ATT&CK site. So it's all pretty manual, and you can probably imagine it's a really

labor-intensive process. Sometimes it can take anywhere from two to six hours to go through a report, and a report might be maybe one or two pages long, or it might be 30 or 40 pages long, depending on the vendor and the group. So yeah, it can take a long time. So, life with TRAM: how it works is, basically, an analyst submits a URL to a publicly available threat report, and then TRAM does some magic behind the scenes, which we'll get into in a few slides, to work through and find these techniques for you. So it's saving you that time of having to manually search for these in the report, and having to read it

really thoroughly as well. So basically, what it does is, at the end it'll have a printout for the analyst of the techniques that were found in the report, and you can decide whether you approve them or not; you can go through and either accept or reject, which we'll show you when we do our demo of how it works. Basically, this is helping to make a more streamlined and efficient process for adding data into the site. It'll also help people that are outside of ATT&CK, because they'll be able to analyze reports and map them without having a huge, diverse knowledge of the ATT&CK framework. So, before we show you the tool, we kind of wanted to

give you an in-depth look at how we really started this project, from start to finish. We always like to preface with the fact that we're two CTI analysts: we do have backgrounds in Python development and computer science, but we don't have PhDs in AI or machine learning. So there might be problems with our approach, maybe not, but it's definitely a step in the right direction, and we're really excited to share it with the community and see what feedback we get and how we can improve it. So, the very first step of our process was to get data, and we do this using the

STIX and TAXII server, so we have an automated way of being able to grab the data and use it for our project. What we focus on is, for each ATT&CK technique there's an examples section, so basically we take down these examples and we feed them to our tool. These are all going to be what we use for our project; we're not using the description or other parts of the ATT&CK technique, just the procedure examples. And since we do this using the STIX and TAXII server, we always have an automated way of being able to get the most up-to-date version of ATT&CK, so whenever there are updates we can just easily grab them, which is

great. So after we've gotten our data, the next step is the pre-processing stage. This is where we kind of clean up and prepare our data for our tool. A lot of the text in the world is unstructured raw text in a human language, and we have to be able to make it understandable by a computer. So basically, all this step is doing is our data normalization, things like removing stop words from text. Words like "and" or "the" or "it," those don't really carry any weight in a sentence, so we just focus on getting rid of that stuff, so that our tools can learn from the actual

verbs and words that are in the sentences that we want to teach with. And then we also get rid of capitalization as well, getting rid of any less useful parts of the text that we don't need. In this step we also do our natural language processing: we use Python's Natural Language Toolkit (NLTK), which is really great, reducing words to their roots using stemming and lemmatization, and then taking our sentences and splitting them up into different words so that we can do more analysis with them as well. So after we've cleaned and prepared our data, the next step was to build and

train our models. For this project we used logistic regression in Python, which is a supervised-learning classification method; basically, how it works is you have to feed it data to learn with. You can also use things like TF-IDF and CountVectorizer to extract your features; this is all pretty standard, out of the box from Python's libraries. And then we also do some tuning of our models with cross-validation. How it works is, each technique has to have its own model built, and for this we teach each model using its examples section as the positive class, and then the negative class, which is all the false-labeled data, is just examples that don't relate to that technique. For

this we might use things like true negatives, which are sentences that are from a report but don't relate to any technique at all; we can use those to further differentiate between techniques. And then we also use any false positives as well: if our tool goes through and says one technique is one thing, and we agree that it's not, we'll feed this back into our negative class to help it learn better. But depending on the feature-extraction method that you use, the time it takes to actually build these models out may vary, so we use a serialized format in Python called a pickle file, so that we can easily

rebuild these models, or rather, not have to rebuild these models, upon every single analysis of a report. So the next step, after we've built and trained our models, is to test them. But before we could test, we had to collect a good data set for testing. Our team currently has a Feedly CTI RSS feed that we use to look through and search for new reports that are out in the wild. Feedly has a really cool feature where you can create a board and save all these reports that you know are gonna have good data in them, and then you can use their Python API to bring those down,

extract the text as JSON, and feed it to your model. This worked really well for us, because we already knew that these reports were gonna have lots of ATT&CK techniques in them. So after we have that, we can start the testing phase. This is where we're just extracting out the sentences from the reports and then feeding them to our models, basically doing a lot of the same NLP stuff that we did before on the new text, and then transforming our test dataset with the models that we created in the previous step. We have to run the sentences from the reports through each model, because a sentence

can have more than one ATT&CK technique. So basically, how it worked before is, we gave it training data, so it was learning "this sentence equals Obfuscated Files or Information," but this time we're giving it no data at all except for just the unlabeled sentence from a report, and then it makes these predictions on its own. In this step we also do a lot of examining of our model performance, looking at things like cross-validation scores for the different methods that we're using, and then also things like averages and standard deviations. All of that happens in this stage. So after the report has gone through the model, we can go through and get a

printout of the techniques that were found in the report, which we're gonna show you in a little bit. So basically, how it works is, you go through all the labeled data: there'll be a sentence, and then there will be a technique next to it if one was found, and you can either accept the technique that was found or you can reject it. If you accept it, it'll get tracked as a true positive; obviously, if you reject it, it'll be a false positive. So this is really useful to be able to go through a report a lot quicker and be able to map it to the ATT&CK framework. So after you kind of repeat that process for every technique

that was found, you can go through and take care of the rest of the unlabeled data. This is gonna be any techniques that you see that may have gotten missed, or something like that; if you want to add those in, we have an easy way to do that as well, and those get tracked as false negatives. And then everything else from the report that doesn't have a label, that doesn't have a sentence-to-technique match, just gets labeled as a true negative, and we can use those to rebuild our negative class for our models again. And the very last step in our process was to create a feedback loop to improve our model. So we

used a database to track the different true positives, false negatives, all that good stuff. We had different tables to help keep our annotations organized, so that we could reuse them and make our model better each time. So when you're done with a report, you can export it as a PDF, and then, if you want to rebuild the models, say you have a lot of new annotations that you want to add, you can go ahead and create a new pickle file and run your new reports through that. So there were a lot of challenges that we had along the way. One of the most obvious challenges is that extracting meaning from text is still

really hard to do. Humans are, time and time again, still better at getting meaning from a text than a computer is, and we're just starting to bridge the gap between teaching our classifiers how to learn and interpret meaning from text like a human does. There are also things like prediction error; this can come from having an imbalanced dataset or an incomplete data sample, or from having noise in your data, or anomalies or outliers. So if we had an example for a technique that didn't really make sense, that was kind of an outlier, we had to take that on a case-by-case basis and figure out if that was really a useful way of

finding that technique out in the wild. But almost always, the problem comes back to not having enough data to train with. This leads to things like imbalanced datasets, where you might have a really small positive class that you're training with and a really large negative class, which is definitely our case, because not every ATT&CK technique has a lot of examples. Which also brings me to my next point: some techniques have no examples at all. If it's a new technique, you might have zero, or you might have one or two examples in our data, and we can't really build a model with that, so we had to have some sort

of a backup plan for those techniques. For those, we just use regular expressions and string matching to find the technique, until we have enough data to train with in the future. Great, so that was a pretty in-depth view of what's happening under the hood, but now we want to show it to you. Everything is run locally on the user's system, and once you load it up: all right, so let's say I'm on the internet looking around and find this FIN7 report. I'm like, okay, this seems interesting, I think they're telling me about some techniques in here. I can go ahead and copy the URL, go into TRAM, hit "New Report," and

just paste that in there. I'm also gonna grab the title, so that as I have lots of reports I know where this came from. Then I'll go ahead and submit. You can see, kind of up in the tab, the circle running; it's doing all the NLP stuff that Jackie just described in the background. So this is a report I actually did manually many months ago, shortly after it came out, so I just wanted to go through it, since I already did it, right, and see if TRAM can find some of the techniques I found. So I see some spearphishing, some scripting, file deletion, scheduled tasks. This just gives us a preview of what

I'm hoping TRAM will find when it's done analyzing the report. Let's see, anything else? Yeah, Obfuscated Files, and PowerShell, cool. So I think that's it. Go back to TRAM, and it looks like it's done. From here, it's created this new card with the title, and if I hit "view source" it'll just take me back to the original URL, but we're here to analyze. So as you can see, it puts it into this format; I'm gonna zoom in and start scrolling. All right, I see a highlight, cool, so it found something. So now it's giving me two techniques, spearphishing link and attachment. So now I can go back, reading

the sentence again: it's talking about malicious attachments. So I reject the spearphishing link and go ahead and accept attachment. Now you can see it's down in the confirmed techniques; only accepted techniques will go down there. Like Jackie mentioned, this is important in differentiating when techniques are right or wrong, so that we can give the model that information, so that hopefully next time we run this report it won't tag that as a spearphishing link. This was a very similar example for the computer to differentiate, so we're working on getting more data in there so it can make those distinctions, but it's a start. So I'll go ahead and look for other techniques in this report.
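Each of those highlight suggestions comes from a per-technique model trained on positive and negative example sentences, as described earlier. Below is a minimal sketch of one such model: the talk mentions logistic regression, TF-IDF, and pickle files, and we assume scikit-learn as the "out of the box" Python library (the talk doesn't name it explicitly); the example sentences are made up.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Positive class: procedure-example sentences for one technique
# (Spearphishing Attachment). Negative class: sentences that don't
# relate to it. All sentences here are illustrative.
sentences = [
    "the group sent spearphishing emails with a malicious attachment",
    "victims received a PDF attachment via a phishing email",
    "an Excel document was attached to the spearphishing message",
    "the malware deletes its files to remove traces",
    "a scheduled task was created for persistence",
    "the trojan obfuscates its executable code",
]
labels = [1, 1, 1, 0, 0, 0]

# TF-IDF feature extraction feeding a logistic regression classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(sentences, labels)

# Serialize to a pickle so the model isn't rebuilt on every report analysis.
blob = pickle.dumps(model)
restored = pickle.loads(blob)

print(restored.predict(
    ["the attackers sent a phishing email with a malicious attachment"])[0])
```

In TRAM, one such model exists per technique, and every report sentence is scored against every model, since a sentence can map to more than one technique.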

Demos are fun, cool. All right, so I'll go through, you know, just a few examples here and there, but essentially, at this point it is still a manual process for the analyst to go in and accept or reject the techniques. But we hope it's a much better starting place, right; the model has at least given you some places where you can start and go confirm whether these are right or wrong yourself, and over time we're hoping there will be plenty more rights than wrongs. So I'll just accept a handful here. So this is an example of deobfuscate, but to show you what happens if I reject it, you can see that it actually removed the

highlighting from the report, to remove that visual indicator, since there's no technique there. If I had accidentally done that, like, oh no, I need to add it back, we do have the "add missing technique" button, where users can go ahead and add in additional techniques that maybe the model didn't catch at all, or, yeah, human error; if you need to add it back in, no problem. So once I go through the rest of the report, I'll do another example of the reject, just to show that that highlighting is removed.
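Those accept, reject, and add-missing actions are what feed the annotation database mentioned earlier, where true positives, false positives, false negatives, and true negatives are tracked for retraining. Here's a minimal stdlib sketch; the table layout and the example rows are our own, not TRAM's actual schema:

```python
import sqlite3

# In-memory database standing in for TRAM's annotation store.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE annotations (
    report_url TEXT,
    sentence   TEXT,
    technique  TEXT,
    outcome    TEXT  -- 'tp', 'fp', 'fn', or 'tn'
)""")

def record(report_url, sentence, technique, outcome):
    """Log one analyst decision about a (sentence, technique) pair."""
    db.execute("INSERT INTO annotations VALUES (?, ?, ?, ?)",
               (report_url, sentence, technique, outcome))

# Analyst accepts one suggestion, rejects another, adds a missed one.
record("https://example.com/report", "a malicious attachment was sent",
       "Spearphishing Attachment", "tp")
record("https://example.com/report", "a malicious attachment was sent",
       "Spearphishing Link", "fp")
record("https://example.com/report", "two scheduled task entries were created",
       "Scheduled Task", "fn")

# Rejected sentences later rebuild the negative class, and missed ones
# the positive class, when the pickled models are retrained.
fp_rows = db.execute("SELECT sentence FROM annotations WHERE outcome = 'fp'").fetchall()
print(len(fp_rows))  # → 1
```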

All right, great. So once I'm done (I would have done that more thoroughly in an actual session, right), I'll go ahead and export the PDF. Give that a second, and we'll open it up. Right now the report itself is just the raw sentences, there's no visual indicator in the body of the report, but my favorite feature is, if we scroll down, we get a table that has the techniques and what sentences those techniques came from, based on the confirmed techniques. So as we can see, there's spearphishing attachment, and there's that sentence where it was the malicious attachment; scheduled tasks, same thing, I can then go and look at that sentence

and see that there were two daily scheduled-task entries there. All right, so that is it for the demo. So we realize this is a work in progress, right, and like Jackie mentioned, we don't have a lot of experience with web dev or NLP and all of that, so we're really happy with the progress we've made, but there's definitely a lot of work that needs to get done. We're working on cleaning up the code, there are definitely some little bugs in there, and we're hoping to make it more user-friendly. We realize that right now, due to the way the NLP tokenization happens, it strips out all the HTML and formatting stuff, right, to clean the data
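That HTML stripping can be illustrated with a small parser from Python's standard library; this is a sketch of the general idea, not TRAM's actual cleaning code:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text from an HTML report, skipping script/style."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    """Strip markup, keeping only the prose to be tokenized."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

# In TRAM the HTML would come from the submitted report URL;
# here we just feed a snippet directly.
snippet = ("<html><body><p>APT19 sent Excel-based documents.</p>"
           "<script>x()</script></body></html>")
print(extract_text(snippet))  # → APT19 sent Excel-based documents.
```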

but we hope in the future to maintain how the original report looked, to have more of that formatting, like the paragraph formatting, the headers, images, that kind of thing. And as I showed you with those accept/reject buttons, the more times we do that, the more we're hoping to increase the accuracy of our models, so that'll be an ongoing process. We're thinking about things for the future too: maybe pulling in multiple reports at once, maybe we can feed it an RSS feed, or give it a directory of reports that you have on your local system. So maybe you're sitting there like, all right, that's cool, but why does this matter to me? Well, we hope that this makes it easier to get started

with ATT&CK. Our team has had several organizations and teams ask, you know, "hey, do you guys have training to teach our analysts how to map to ATT&CK?" And although we don't have a formal training, we're hoping that maybe a tool like this will make it easier to get started if it seems like an overwhelming task. Additionally, we hope that it finds techniques that we forget about, or that we haven't really even heard of. We recently added a bunch of new techniques, and I'm not sure I could tell you what they all are, but good thing for computers, that's not a problem. So we hope it will catch the ones we forget about. Lastly, we hope that you can use

reporting that's important to you. So maybe you're in the finance industry, and you don't care as much about healthcare reporting, or whatever it may be; you can focus on reports that are important to you and your organization, to get more TTPs related to you. Just a quick refresher of some things you can do with ATT&CK: these go back to those four main use cases, detections and writing out analytics; making security assessments and engineering decisions; doing red-team exercises based on, you know, the techniques in reports that you found interesting; and so on. So, some takeaways. We hope that today you understand kind of what the

adversary's TTPs are, and that ATT&CK can help frame these behaviors. You can then use this information in a variety of ways. I think we touched on the fact that mapping data to ATT&CK is hard and time-consuming, but we hope that TRAM makes it easier. And best of all, we do plan to open-source this; we're aiming for the end of the year, so that the community can not only use it but hopefully help us make it better, and help the rest of the community as well. So with that, I think we have plenty of time for questions. Thank you, that is it.
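For reference, the pre-processing stage described in the talk (lowercasing, tokenizing, removing stop words, and stemming) looks roughly like the sketch below. The speakers use NLTK for these steps; the tiny stop-word list and suffix stemmer here are simplified stand-ins so the example stays self-contained:

```python
import re

# Stand-in stop-word list; NLTK ships a much fuller one
# (nltk.corpus.stopwords.words("english")).
STOP_WORDS = {"and", "or", "the", "it", "a", "an", "of", "to", "is"}

def crude_stem(word):
    """Very rough stemmer standing in for NLTK's stemmers/lemmatizers."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(sentence):
    """Lowercase, tokenize, drop stop words, and stem a report sentence."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The Trojan obfuscates the executable code"))
# → ['trojan', 'obfuscat', 'executable', 'code']
```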

And we do have stickers as well. I saw a hand over there, yeah. [Question inaudible.] Yeah, we were looking at that. We haven't made a PDF extractor or anything yet, we've only looked at just text, but doing OCR for getting images out of reports or PDFs would be great too. Yep, those are definitely on our to-do list. [Question inaudible.] Yeah, like you mentioned, with the bias, one thing that we were thinking is tracking which analyst accepted or denied different techniques, so that then maybe the rest of the team could review that and give their own inputs. But that's on the long to-do list that we have. Yes?

That is not something we've thought about, but that's a great suggestion, so yes, something to think about.

We don't have the exact rate, because every technique is gonna have a different model; they all have their own logistic regression model. So we're using different things, like the built-in scoring with logistic regression and cross-validation scores, kind of to tune things along the way. There are obviously some techniques that have a lot more false positives than others, but those are the ones we can work with on a case-by-case basis, to try to tune and change things up, like the different folds that we're using, and the threshold, things like that. So, great.
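The per-technique tuning described in this answer, scoring each model with cross-validation and adjusting folds and thresholds, can be sketched in plain Python. The keyword "model" below is a toy stand-in for the real per-technique logistic regression:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Split n sample indices into k shuffled folds for cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_accuracy(samples, labels, train_fn, k=3):
    """Average held-out accuracy over k folds for one technique's model."""
    scores = []
    for held_out in k_fold_indices(len(samples), k):
        train = [i for i in range(len(samples)) if i not in held_out]
        model = train_fn([samples[i] for i in train], [labels[i] for i in train])
        correct = sum(model(samples[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / k

# Toy "trainer": flags a sentence if it mentions an attachment; a stand-in
# for fitting logistic regression on the technique's positive/negative classes.
def train_keyword_model(train_sents, train_labels):
    return lambda s: 1 if "attachment" in s else 0

sents = ["a malicious attachment was sent", "the malware deleted its files",
         "a pdf attachment was attached", "a scheduled task was created",
         "an excel attachment was used", "the code was obfuscated"]
labs = [1, 0, 1, 0, 1, 0]
print(cross_val_accuracy(sents, labs, train_keyword_model, k=3))  # → 1.0
```

Changing `k` varies the folds, and a real model would expose a probability so the decision threshold could be tuned per technique.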

No, currently we rely a lot on the vendors to provide accurate data, but I would say, go back to the references and maybe ask them, right? We do take their word for it at this point. Yeah, we have a few vendors that we trust more than others, which obviously introduces some bias too.

Right, so you could recognize that the same techniques were used, like chaining them together. Sure. Just for the folks in the back, the suggestion was having a way to make some sort of signature on a given report from those techniques so that they could be tracked over time. Yes?

That's a good point. No, we currently aren't using the descriptions of the techniques for the models, but we are using how each technique is described in a sentence. So based on that, I don't know that reviewing how the techniques are written would play a huge part in our model for this use case. Yes? Yeah, we have different unique identifiers for reports that are released, so we'll use those to differentiate between ones that have been analyzed and ones that haven't. But if there's another report that talks about the same adversary doing the same thing, we still use that data, because it's described in a different way. For

us, that's good, because then we can add it back to our models and have more data to work with. So just keeping reports unique is fine, but we always want more data. And just to add on to that, something we have thought about: right now the tool runs locally, but if it were run in an environment shared among multiple analysts, there could be a way to tag that Sarah's working on this report and Jackie's working on this other one, so that two analysts aren't doing the same report twice. Again, it's on the to-do list; it seems like small, low-hanging fruit.

Well, a lot of the data we're using is already on the ATT&CK site, so if it's made it that far, it's usually from a pretty good source already. Yeah, and actually, as I was putting the demo together, I wanted to find a report with a lot of techniques to show everybody, so I went to some common vendor sites that I thought would have good reporting, and whether it was our model or their reporting, only a couple of techniques showed up. So I would say that's not necessarily a bad thing,

though, just depending on the type of report; maybe I was selecting a report that wasn't talking about software, or talking about it in a way that wouldn't make sense for ATT&CK. That was while I was putting the demo video together. If we look individually at some techniques that did kind of a bad job, I think one of them was Security Software Discovery. That was a really hard technique, because a lot of reports talk about different antivirus tools or whatever, which is what that technique sounds like from the examples, so the model can get thrown

off pretty easily. For those, we had to go back and make sure the data we were feeding to the models was correct and in the right format, and didn't have any extra characters, text, or outliers that were throwing it off, things like that.
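A cleanup pass like the one just described might look like this minimal sketch. The specific patterns are assumptions for illustration, not TRAM's actual preprocessing: strip markup remnants and odd bytes, then collapse whitespace before training.

```python
# Hypothetical pre-training cleanup for report sentences.
import re

def clean_sentence(text: str) -> str:
    """Strip markup remnants and stray characters, then normalize spaces."""
    text = re.sub(r"<[^>]+>", " ", text)       # leftover HTML tags
    text = re.sub(r"[^\x20-\x7e]", " ", text)  # non-printable / odd bytes
    text = re.sub(r"\s+", " ", text)           # collapse whitespace runs
    return text.strip()

cleaned = clean_sentence("The\u00a0malware  ran <b>ipconfig</b>\n/all")
# cleaned == "The malware ran ipconfig /all"
```

Running every training sentence through a pass like this is one way to catch the "extra characters or text" the speakers mention before it skews a technique's model.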

It's kind of just based on what we know is a big group right now, or a big malware family, or a big campaign that's happening, things like that. It's not really a set methodology. Yeah, it's more like, "oh, this is a new one on APT41, it's a huge report, let's prioritize our efforts on that over something smaller."

That's one of our hopes with this project, yeah, absolutely. We hope that once it's open source, especially with those tables at the bottom where it pulls out the technique and the sentence it came from, that would be a great way to say, "hey, ATT&CK team, I did a report that's not there yet; here's the table, here's the original source, can you add that data in?" We have people from the community who regularly reach out to us with reports that they find and map themselves, so this would just make it easier for them to share that with us.

Right, not at this time. It's an idea we've had before, but with the current workflow for ATT&CK, I don't know that that's going to be a great option. I think there's one in the back.

Not at this time, because the models are based on the database, which includes all the data from the reports. But we do recognize, since it will be open source, that if it made sense for you to keep the tool internal for your own use case, we hope there's a benefit there as well. Yeah, so you could use it on your closed-source reporting, but if that reporting is classified, then we wouldn't have access to it anyway. All right.

Yeah, definitely, thank you. All right, well, with that, thank you all.

Feel free to get a sticker or to ask us additional questions, but thank you for your time. Thanks, everyone.