
Okay, is this better? Okay. I'm Giacomo Bergamo, and I'm with Invincea Labs. I wanted to demo, for maybe 10 to 15 minutes, our Cynomix malware triage tool, invite you at the end to register to use it if it's of interest, and then answer any questions. Cynomix is an offshoot of about four years of DARPA research; myself, Josh Saxe, Alex Long, Siegfried Gold, and Robert Gove were involved in this project. The purpose of the Cyber Genome Project was to come up with technologies that would help analysts deal with malware at scale, because you're being hit with hundreds,
thousands, millions of malware samples and you need some way of reducing the workload. Right now the standard way of dealing with this is reverse engineering one file at a time, and that's not really scalable. So we developed several technologies to do a first-pass automatic analysis of the files and to cluster similar samples together to help the analyst, and we built several visualizations and workflows that can be used to reduce your workload. Here's the landing page for Cynomix. What you see is what we call our ciscape, or cyber landscape. Imagine a 2D plane divided into 256 by 256 bins; the spikes represent the count of
malware samples in each bin. When our system processes malware samples, samples that are very similar end up in the same bin or in nearby bins. This way, even if you're looking at a data set of thousands, hundreds of thousands, or millions of malware samples, you can still observe trends. Say you have a spike emerging somewhere; maybe you should go and look at what is happening there and what those malware samples are. Or, if you filter the data, you could look at the landscape for two different intrusion sets and determine, at a very high level, whether they're similar or different.
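As a rough illustration of the landscape just described, here is a minimal sketch in Python. It assumes each sample has already been projected to a point (x, y) in the unit square; the talk does not say how that projection is computed, so `bin_of`, `landscape_counts`, and the coordinates below are hypothetical and only show the binning and counting step.

```python
from collections import Counter

GRID = 256  # the landscape is a 256 x 256 grid of bins


def bin_of(x, y, grid=GRID):
    """Map a point in the unit square [0, 1) x [0, 1) to its (column, row) bin."""
    col = min(int(x * grid), grid - 1)
    row = min(int(y * grid), grid - 1)
    return col, row


def landscape_counts(points):
    """Count samples per bin; the count is the height of that bin's spike."""
    return Counter(bin_of(x, y) for x, y in points)


# Hypothetical projected coordinates for five samples (not real data).
points = [(0.10, 0.20), (0.101, 0.201), (0.1005, 0.2002), (0.75, 0.40), (0.90, 0.90)]
counts = landscape_counts(points)

# The tallest spike is the bin an analyst might look at first.
tallest_bin, height = counts.most_common(1)[0]
```

Similar samples land in the same bin only if the projection places them close together, which is the part this sketch leaves out.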
So in addition to being a pretty 3D thing, it actually has some use. I can click on one of these bins and get summary information about the malware that falls in that bin, for example the top capabilities, and then navigate to another section of our tool to look at those samples in more depth. Let me click on this; it takes you to our faceted search view. Just to give you an example of how these bins work (I'll come back to this in more detail a little later), what we see here is that most of these samples have
this "compresses or decompresses files and uses gzip" capability that we automatically detected, and that's one of the primary things that made these malware samples fall into the same bin. In addition to the capabilities, the file type and the rare strings that we extracted from the malware samples all go into the calculation of which bin a sample should fall into. So this is a fun thing to play with, but there are other entry points into the app on this landing page. There's this search box: if you already have a file that you're interested in, you can enter the hash and go to what we call the home page for that file, which gives you all the analysis that we
performed on it. If you're logged in, you can upload new files and label them as a group. If you've been uploading files, you can view reports on the analysis that was performed. I uploaded a couple of groups; this is the report for one of them. For each file it gives me the capabilities that were detected, and then it gives me shortcut links to other parts of the system, like the home page for that file. If there were any similar samples, it takes me to the comparison view for that sample, and then to the exploratory views like the faceted search and the Signet, which I'll get into. So this is just a way of keeping track
of what you've uploaded and looking back at reports about them. Let me go ahead and go into the faceted search mode of using our site. This is a faceted search like you might see on amazon.com, but instead of searching for televisions, 1080p, Sony, you're filtering down the data based on things pertinent to malware. Maybe you want to filter down by the group labels that you gave when you uploaded files. Maybe you want to perform a full-text search, which searches everything from file names, to names of resources and DLLs inside the binary, to any string features (interesting tokens) that we extracted when we broke apart the binary.
And then there are filters: we have our automatically detected capabilities, things like strings, IP addresses, and host names, and then tags that you or other people may have added to files. You can use any combination of these to filter down to the malware samples that are of interest to you. If I filter down to a particular group that I uploaded (I uploaded some samples from Putter Panda, which is a Chinese set of malware), you can start to see the power that we have even in this simple search view. Every row here represents a malware sample. You have the name of the malware sample; if there were any images
extracted, you'd have a little icon; and then you have what we call our capability fingerprint. Every color represents a particular capability that we've currently trained our system to detect, and the width of the bar represents the confidence we have that the capability exists. If you hover over these, it tells you what that color actually means, and other samples in the same view that share that capability light up. This is useful not only for getting a first-pass idea of what these things are doing; you can also clearly start to tell which samples are similar. We sort of expected these samples to be similar
because they're all part of the same intrusion set, and in fact that seems to be corroborated. But we can also see that there are two that are slightly different: instead of modifying Windows services, they engage the registry in a more specific way. What this lets me do as an analyst is, instead of having to individually reverse engineer each one of these and run each one in a sandbox to try to determine what's going on, I can already see that they're pretty similar. So maybe I could select a couple of these and label them in some way, so that I know I can, for example, pick one representative sample
from this group to reverse engineer or run in the sandbox, and then apply that analysis to the rest. Now, this is very high level, but even in this view I can get more detailed information. If I hover over one of these rows, and maybe I'm interested in exactly why the system believes these capabilities exist, I can click on capabilities and open up one of these detected capabilities. I opened up the one for "reads/writes from Windows console", and I can see the tokens we extracted from the binary that make our system believe that capability exists. The way our machine learning system works is, we took
millions of Stack Overflow posts and looked for string tokens within those posts that occurred frequently in posts about a particular capability X and infrequently in all the other posts. That way, if we saw that string token again in a binary, it would be likely that the capability exists in the binary, especially if we saw multiple tokens related to that capability. So here we're seeing the string tokens that we extracted, the probability that the capability exists according to our system given that token, and a representative Stack Overflow post where we saw it. In this case the capability is reads/writes from Windows
console. Some of these strings are very human readable, like WriteConsoleW, but maybe there are others that are more rare or strange, or that you haven't seen before, like GetACP or PeekNamedPipe. So maybe what you want to do is go and look at the Stack Overflow post and try to understand, in context, why that string is related to this capability, and whether you believe the system or not. I can go to the Stack Overflow post, read the whole thread, and try to understand what's going on. This is very useful for an analyst in general, but especially a
junior analyst who might not have decades of stored knowledge about DLLs, function calls, and parameters that are used for different capabilities. What this also means is that, although we trained our system on Windows binaries, we could very easily train it to look for capabilities in Android, iOS, or OS X, and it would continue to work; the system would continue to evolve even as programming languages and platforms evolve. So that's our basic search view, but there are other ways of looking at samples too. I can switch over to what we call our Signet, or network visualization. When you arrive
here, you can select between groups that you've uploaded and any tag groups that you or other people have tagged, and you can also enter particular hashes of files that you're interested in. Let me pick a couple of groups of samples that have been tagged and drop them in. What we see happen is that every little gray node represents a malware sample (or software sample in general), and the larger nodes represent labels. I dropped in three groups, so that's why there are three labeled groups up here. I'm not sure how well you can see this on the projector, but similar samples become connected via edges and cluster together. So if I was just, as a
malware analyst, given a bucket of stuff to look at, or maybe given buckets of malware samples over several days, I can drop them in here and see which malware samples are similar and can be looked at together, or I can pick one representative sample to reverse engineer or run in a sandbox. If I click on one of them, at the bottom here I see rows, like in the faceted search, representing that one malware sample and its neighboring samples. So I dropped in a couple of groups and saw that there are some clusters of similar malware samples. In addition, our system has a large database of
right now about 35,000 other samples that we've uploaded. If I click on one of these samples, I can see that there are other similar neighbors in the system as a whole, so I can tell it to go and fetch those similar samples and drop them in here. So even though they weren't part of the labeled intrusion group I was looking at, I can still go and find malware samples that maybe someone else looked at and did some analysis on, and I can reuse that analysis. Now these are dropped in here, and I can do the same sort of tagging and labeling that I could do in the
search view. In addition, there are a couple of other things I can do. On the right-hand side here I have a panel showing me attributes related to all the samples in view, and the counts of the number of malware samples that have each attribute. When I hover over these, they light up. So maybe, for some reason, the fact that they share a hostname or an IP address or some other attribute is of interest and makes me think I should look at those files together. So I can actually drop in additional label nodes.
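The attribute panel and its per-attribute sample counts can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the `host:` and `ip:` prefixes, and the sample data are all invented here, and the real panel may compute its counts differently.

```python
from collections import defaultdict


def attribute_index(samples):
    """Map each attribute to the set of sample names that carry it."""
    index = defaultdict(set)
    for name, attributes in samples.items():
        for attribute in attributes:
            index[attribute].add(name)
    return index


def shared_attributes(samples, min_samples=2):
    """Return (attribute, sample_count) pairs for attributes found on several samples."""
    index = attribute_index(samples)
    shared = [(attr, len(names)) for attr, names in index.items()
              if len(names) >= min_samples]
    # Most widely shared first; ties broken alphabetically for stable output.
    return sorted(shared, key=lambda item: (-item[1], item[0]))


# Hypothetical samples in view; hostnames and addresses are documentation values.
samples_in_view = {
    "sample_a": {"host:update.example.com", "ip:192.0.2.10"},
    "sample_b": {"host:update.example.com", "ip:192.0.2.10"},
    "sample_c": {"host:update.example.com"},
    "sample_d": {"ip:198.51.100.7"},
}
panel = shared_attributes(samples_in_view)

# Samples that a label node for one shared attribute would connect to.
label_members = sorted(attribute_index(samples_in_view)["ip:192.0.2.10"])
```

An attribute that appears on several samples is a candidate for one of the additional label nodes mentioned above.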
Let's see where that dropped in. This will stick around as I do my exploratory analysis, so maybe I would want to go and look at these two files together. Okay, so far we've gone from a really high-level landscape view of the data, to searching through the data with attributes, to looking at it in this network visualization, and I'm able to glean some idea of what these files do and which ones are similar or not. But maybe I need to look at a very low level at some files that I think are similar and understand exactly how they are similar or different. So what I can do here
is jump into a comparison view. On the left-hand side of these little blocks is the malware sample that I just clicked on, and going from left to right are the neighboring samples that were clustered together in the network visualization, from most similar to least similar to that first one. What I've done here is break apart the data that we extracted and the analysis that we've done into different data dimensions: capabilities, strings, DLLs, function calls, images, tags, IP addresses, host names, registry keys, etc. What I have on the far left-hand side are these
similarity histograms that show me how many of those neighboring samples are, in that particular data dimension, anywhere from 100 percent similar down to zero percent similar. For example, of these 17 samples, there are 11 that are 100 percent similar in capabilities, and only two that are 100 percent similar in the extracted string tokens. I can then narrow down, based on this similarity histogram, which ones I'm interested in comparing. Maybe I'm only interested in looking at files that are 75 percent similar in capabilities or more. So I've narrowed it down, and then what I can do is open up these sections
and look at the low-level features. Here I can clearly see, for our automatically detected capabilities, that 10 of these are absolutely identical, and one has an additional capability for use of Silverlight. When it comes to the strings, you can see I have 137 pages of strings that I could go through, and I can sort these in different ways. By clicking on each of these little blocky Venn diagrams that represent each malware sample, I can sort that particular data such that, in this case, the string features that are present in this particular malware sample get
sorted up to the top or to the bottom, and the other malware samples that share that feature also have a block in that row. By going through this, I can clearly start to see which of these samples are actually similar or not and, if they differ, how they differ. I can see that there's one additional sample that is absolutely identical in everything, including an IP address that we extracted, and then they start to become more and more different. So I can see how the software is evolving, or whether the samples actually have a similar provenance, and make some determination about what I can look at together and what I
have to look at separately. Just like in the other views, I can select some of these malware samples and tag them in some way so that I can find them again in the system. And finally, there's what we call the home page view, which is accessible from many different points in the system. This is a general report on one individual malware sample. It shows me the neighborhood of similar samples, and from here I can jump into that comparison view again. It shows me all the different groups that this file has ever been uploaded as a part of, and all the different observations of this file. An interesting thing that our system can do is that it can
automatically generate YARA signatures. As part of the analysis process, in addition to doing that machine learning to figure out the capabilities, we break apart the binary into string tokens and keep an approximate global count of how often every string token is seen in our entire system. So by definition we know which string tokens are actually very unique, very rare, and we can use that to generate a principled YARA signature, as opposed to a human thinking, "Oh, this string looks interesting, let me use that." You can also see all the images that we extracted, you can see the
similar samples here in the row view, and you can see the capabilities with all the evidence for each one. Just like I showed you in the popover in the search view, you can jump to the Stack Overflow document for any one of those pieces of evidence and try to understand it better. Then we have a list of the most rare strings, in case they're of interest. One thing you can do, since a lot of our site is hooked up, is search for samples that share any of these attributes that you're seeing. So I have some crazy string token here, and I can go and search for all samples that share that token.
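The rare-string idea behind the generated signatures and the shared-token search can be sketched like this. It is a minimal illustration, not the system's implementation: it uses an exact in-memory count where the talk describes an approximate global count, and the function names, token values, and rule name are hypothetical.

```python
from collections import Counter


def count_tokens(corpus):
    """Count, for every token, how many samples contain it."""
    counts = Counter()
    for tokens in corpus.values():
        counts.update(set(tokens))
    return counts


def rarest_tokens(sample_tokens, counts, k=3):
    """Pick the k tokens of one sample that are seen least often overall."""
    return sorted(set(sample_tokens), key=lambda t: (counts[t], t))[:k]


def samples_sharing(corpus, token):
    """Find every sample that contains the given token."""
    return sorted(name for name, tokens in corpus.items() if token in tokens)


def yara_rule(rule_name, tokens):
    """Render a YARA rule that requires every chosen token to be present."""
    lines = [f"rule {rule_name}", "{", "    strings:"]
    for i, token in enumerate(tokens):
        escaped = token.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'        $s{i} = "{escaped}"')
    lines += ["    condition:", "        all of them", "}"]
    return "\n".join(lines)


# Hypothetical extracted string tokens per sample (invented for illustration).
corpus = {
    "sample_a": ["kernel32.dll", "GetProcAddress", "zx_updater_91", "qqmutex77"],
    "sample_b": ["kernel32.dll", "GetProcAddress", "zx_updater_91"],
    "sample_c": ["kernel32.dll", "GetProcAddress"],
}
counts = count_tokens(corpus)
chosen = rarest_tokens(corpus["sample_a"], counts, k=2)
rule_text = yara_rule("sample_a_rare_strings", chosen)
sharing = samples_sharing(corpus, "zx_updater_91")
```

Common tokens such as `kernel32.dll` appear in every sample and are never chosen, which is the point of ranking by global count rather than by what looks interesting to a person.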
Now, do I know why these three have that token? I don't know, but it may be of interest to me to dig further and understand these. So that's the basic walkthrough of the system. Does anyone have questions?
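Looking back at the capability detection from the walkthrough, the idea of scoring string tokens against tagged posts can be sketched as below. This is an assumption-heavy toy: the estimate of the probability of a capability given a token is a plain ratio of post counts, the noisy-OR used to combine several tokens is a choice made for this sketch (the talk only says multiple tokens raise the likelihood), and the posts are invented rather than taken from Stack Overflow.

```python
from collections import Counter


def token_probabilities(posts, capability):
    """Estimate P(capability | token) as the share of posts containing the
    token that are about the capability."""
    with_token = Counter()
    with_both = Counter()
    for topics, tokens in posts:
        for token in set(tokens):
            with_token[token] += 1
            if capability in topics:
                with_both[token] += 1
    return {token: with_both[token] / with_token[token] for token in with_token}


def capability_confidence(binary_tokens, probabilities):
    """Combine per-token evidence with a noisy-OR (an assumed combination rule)."""
    miss = 1.0
    evidence = {}
    for token in set(binary_tokens):
        p = probabilities.get(token, 0.0)
        if p > 0.0:
            evidence[token] = p
            miss *= 1.0 - p
    return 1.0 - miss, evidence


# Hypothetical posts: (topics of the post, string tokens found in it).
posts = [
    ({"console"}, ["WriteConsoleW", "GetStdHandle"]),
    ({"console"}, ["WriteConsoleW", "printf"]),
    ({"console"}, ["WriteConsoleW", "GetStdHandle"]),
    ({"networking"}, ["connect", "printf"]),
    ({"networking"}, ["connect", "WriteConsoleW"]),
    ({"networking"}, ["connect", "printf", "GetStdHandle"]),
]
probabilities = token_probabilities(posts, "console")

# Tokens extracted from a hypothetical binary.
confidence, evidence = capability_confidence(
    ["WriteConsoleW", "GetStdHandle", "connect"], probabilities
)
```

The `evidence` dictionary plays the role of the per-token probabilities shown in the popover: each token that supports the capability is listed with its own estimate, so an analyst can judge whether to believe the combined score.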
Yes?
Right, right. So what we have right now is a beta site up where people can go and play with it, so we don't have it hooked up to do the dynamic analysis. But the idea would be that, if we were running a private instance of this, in addition to being able to select some things and add a tag, you would select some of these samples and hit a "run in sandbox" button, and that information would percolate back into that sample's home page and into the features that you see here in the filters. So right now we're doing our similarity based on static information, but during our research process we had
an ensemble method for doing the similarity that included doing dynamic runs and generating fingerprints like this, but for the dynamic sequences of events that happened.
So we have looked at it. That was one reason we wanted to do the ensemble similarity, because running a sample in a sandbox would help get around some of that. We have also looked at integrating unpackers, but honestly we haven't found any good ones that can actually unpack custom packers. So right now, what you'd end up seeing in our system is that things are similar because they use the same packer, until we can integrate unpackers.
No, what would end up happening is that it would end up attached to both groups. You'd have one of those label nodes, it would have an edge connecting it to some samples in that group, and then there would be an edge connecting it to some samples in the other group, which is a good way of seeing that there actually is some shared code between these different intrusion sets.
And if any of you are actually malware analysts and are interested, you can go to www.cynomix.org and register, and I can create an account for you to play with the system and upload files.
All right, thank you very much. I'm going to be over at the Zachary Piper / Invincea booth for the rest of today and tomorrow if any of you have extra questions.