
malware family is, and therefore do what we need to do about it. You know, a researcher is usually looking for something new and interesting, and then the responder is trying to just find something that's similar, so they get a shortcut to identifying what the malware family is.

I also wanted to pause here and talk about the nomenclature problem. This is a huge debate in the malware analysis, reverse engineering, and DFIR community, and there have been attempts at making a unified language to talk about malware. I'm not going to go into details about some of them; they're fairly complex. But one of the things that I have
thought about myself, and this is just my own idea and my own opinion on the problem, is that I think there is a way to solve the marketing problem and the nomenclature problem and still satisfy a marketing department's need to be famous, because that's the driving factor there: to be famous in their environment. This same problem was already solved in biology. My background is actually in biology, and the Latin name for an organism does have a factor in there for the fame of the scientist. If you look at the Latin names of many organisms, they'll have either the scientist's name
or something funny that the scientist wanted to put in there. There's all sorts of fun factor in there, but it's still a uniform nomenclature for organisms. So in a way, I think the sociological problem that we have, the part that's the marketing problem, has already been solved. We just need to agree on a single nomenclature and then allow it to grow from there. But I digress a little bit.

The second thing that you want to do, once you've identified the malware family, is take a large body of samples, something like the body of samples that
might be at VirusTotal, for example, or the body of samples at ReversingLabs (I think, if you were here for the talk earlier, he talked about that as well). So there are large corpuses of data, and within a set of samples I want to find other samples that are related in some way to the sample that I'm looking at right now. That's a way to find other samples, and from there you can jump from those other samples: perhaps you look at the hash values and you'll find that those hash values show up in an intelligence report, or in some company's blog that has developed a good piece on the malware, so you can use that
information as well.

So I wanted to begin talking about identification. The first identification method I want to talk about is sort of the worst one: it's the one that you will get the least value from, but it's also probably the most accessible. It's the easiest to get the data for, and it's nearly always freely available. You can look it up on VirusTotal or other websites and see the AV scanner results. If you're not working on a piece of malware which is sensitive, you can always upload it to one of these sites and get the scanner results back. So I'm going to walk through my methodology for interpreting
AV scanner results. I'm also working on some code to turn this into an algorithm, and it should be on GitHub fairly soon. If you have any ideas, or you see any problems in my method here, please let me know; I'd love to collaborate on this. One of the main problems you'll notice is that many of the AV scanner companies share a scanner engine. Instead of actually writing their own scanner engine, they white-label or license another company's scanner engine, call it their own, and sell it. So when you get a body of AV scanner
results, this actually causes a little bit of a mathematical problem for you, because it looks like you're getting the same result from lots of scanners, but actually you're only getting one result, and you need to discount these, or at least count them as one scanner. So you'll see the scanner result here. This is one I just picked because it had this random-looking number at the end which, from looking across the other scanners, is very unique. This same result came back from Ad-Aware, Arcabit, BitDefender, Emsisoft, F-Secure, GData, and MicroWorld-eScan. So these companies, at least at this point in time, were sharing a
shared engine, or components of the engine, and have the same results. Sorry,
actually, these results are from my own scanners: I have a multi-AV scanner project, so these are actually scanner results of my own. But that's the hash of the sample, and you don't need special access to VirusTotal to see the scanner results there, so you'll be able to look it up and verify these results.

Another thing that you see with AV scanner results: there are basically two diametrically opposed forces at work in AV companies, and they're actually driven by the bottom line. So this is
actually a money problem. It's not necessarily a conscious problem, where people have said, "All right, I don't want to do this, I want to do that." No, it's more a money thing. So as you can see, you've got the generic here, and then you have "EC brow" (and actually, an advertisement for TC browser, a great company; my friends Brandon and Jeff, go check them out). So you get a specific detection as opposed to a generic detection. When you have a generic detection, you're basically covering as much malware as you possibly can with one generic signature, and when you do that, the number of people and the number of resources that it
takes to develop a signature that's generic, that covers lots of malware, is cheaper than developing a specific detection that goes down to a malware family or a specific actor group or something like this. So you've got very specific scanner results and then you've got generic scanner results, and it's cheaper to come up with generic scanner results, so you've got a force pulling in that direction. There are a few AV companies that I'm going to call out (your results will vary, and there are generic scan results in there as well), but for the most part Microsoft is probably the best. So if you
put malware into Microsoft, they're going to give you a very specific scanner result, and it's also the name of the malware. A lot of times they have their own nomenclature as well, so you have to know what the name they come up with for this particular malware actually means to the rest of the world, but they're very accurate and very specific with their AV scan results. Also ESET, Sophos, Kaspersky. If you have any other suggestions, please let me know, like if you're seeing good results from one or another. I've also put on here that most of these have threat encyclopedias, and, you
know, your mileage will also vary on the encyclopedias, on just how good or how in-depth the information is there.

But here's the process that I use for boiling down AV scanner results. If you want to follow along on your own, this is the MD5 of the sample that I'm using for this. What I've actually done is remove a lot of the scanner results that were generic, so anything that was generic. Also, there are a few like Zeus, Zbot, Zusy: these are not actually specific results, these are also generics. If you see Zeus or Zbot, that's probably not Zeus, not these days,
unless you're in a time machine. So actually all of these are non-generic; these are specific in some way or another. What you want to do is look for some of the patterns that I noted earlier. So remember that set of AV scanners that all have the same result? (I've noticed that little "B" down there; I'm just going to call that the same. Let's ignore that one; I don't know what it means.) They're definitely sharing the scanner engine, though they might be using their own little fairy dust, but
let's call those all the same, because that's basically one scanner engine. So we've got another one: this pair down here also appears to be sharing a scanner engine, Avast and AVG. Again, you want to count that as one. And by the way, if you're a mathematician, this is very much finger-in-the-wind kind of stuff here, okay? So basically, since I've boiled down the Ad-Aware group to one, now I've got a word that repeats itself somewhere in those results: Symmi. Symmi is actually a malware family. And then Isbar also: you find Isbar in these three.
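
The boiling-down steps described so far (drop generic labels, count scanners that share an engine as one vote, then look for family words that repeat) could be sketched roughly as follows. This is not the speaker's code: the engine groups, the generic-word list, the scanner names `ScannerX`/`ScannerY`/`ScannerZ`, and the family names `Alphafam` and `Betafam` are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical engine groups: scanners assumed to share one engine get one vote.
ENGINE_GROUPS = {
    "Ad-Aware": "engine_a", "BitDefender": "engine_a", "F-Secure": "engine_a",
    "Avast": "engine_b", "AVG": "engine_b",
}

# Words treated as generic; this list is an assumption, tune it for your data.
GENERIC_WORDS = {"generic", "variant", "trojan", "malware", "heur", "agent",
                 "zeus", "zbot", "zusy"}


def family_votes(results):
    """Count candidate family words, one vote per engine group."""
    seen = set()          # (engine, word) pairs that were already counted
    votes = Counter()
    for scanner, label in results.items():
        engine = ENGINE_GROUPS.get(scanner, scanner)
        # Split the label on anything that is not a letter.
        for word in re.split(r"[^a-z]+", label.lower()):
            if len(word) < 4 or word in GENERIC_WORDS:
                continue
            if (engine, word) not in seen:
                seen.add((engine, word))
                votes[word] += 1
    return votes


# Made-up labels, for illustration only.
sample_results = {
    "Ad-Aware": "Gen:Variant.Alphafam.12345",
    "BitDefender": "Gen:Variant.Alphafam.12345",
    "F-Secure": "Gen:Variant.Alphafam.12345",
    "Avast": "Win32:Betafam-A",
    "AVG": "Win32:Betafam-A",
    "ScannerX": "Trojan.Betafam.C",
    "ScannerY": "Win32/Betafam.D",
    "ScannerZ": "Trojan.Generic.998877",
}

votes = family_votes(sample_results)
print(votes.most_common())
```

With these made-up inputs, three scanners report `Alphafam` but they collapse into a single vote, so `Betafam` comes out ahead. As in the talk, the result is only a pair of candidate words to go research, not an identification.
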
And so Isbar is another malware family, and these are kind of the same. I'm going to lean towards this being Isbar, but, I mean, this is totally not very scientific, a little bit approximate. But at the end of this process, at least I've boiled it down to two words, and I can then go and look for research that's been done on these malware families. Then I can ask: based on research that's known to be about these malware families, is the thing that I'm looking at actually related to them? So I can do things like dynamic analysis,
static analysis, look at other features of the malware, to see, through a different pathway of association, whether they are associated with these malware families. And once I say, all right, maybe this is part of one of these malware families, that kind of shortens the time it may take, just by boiling down AV scan results. But of course your mileage will vary.

So there's another identification method, and this one actually debuted last December, or shortly before last December. At Botconf in France, the folks that run Malpedia had a talk (I was actually there for it) where they debuted Malpedia,
which is actually, a lot of the data is open to the public, but it's also a closed, trust-based system. So if you are a malware researcher and someone else can vouch for you being a malware researcher, you can join the community so that you can add signatures and help with detections in there. Malpedia's goal is to provide a single resource for rapid identification of malware. It's basically an encyclopedia with a few different data points. There's a huge number of malware families in there, but the coverage as to signatures and the coverage as to samples varies depending on which
one you're talking about. So they do need more people working on the project, donating time and effort to fleshing out the whole corpus. But it basically provides a couple of different things: it provides a YARA rule, potentially, or multiple YARA rules; it provides samples of the malware; and it potentially provides background information on what the malware family is and where it's been seen, and other information like this. It can also point to different reports that are about that particular malware family. I did have to go backwards a little bit, because the set of YARA rules that are in there do need some help;
there isn't coverage for many of the families that are in there. So to get a slide to show you all, I actually had to go backwards: I had to grab a sample that I know is one of the families, plug it in there, and then get the matched results. But again, it's a very good place to start, and the project is, I think, only about one year old. So I encourage you all to join the project, work on it, help out if you can.

And then another identification method: my former employer, ThreatConnect. There's a free account at ThreatConnect; I
still recommend that you use the free data. A lot of the stuff that's in there was built by my team, and they're still maintaining it. One of the things I wanted to point out that we did (there's a blog post and a knowledge base article on how to use this) is that, once you have the free account, there is a repository in there that collects all the technical blogs and reports. Any company that has published a blog that has indicators in it and is on an RSS or an Atom feed, all those are pulled in, and the indicators are pulled out. So you can read the blog and
then you can actually have the indicators that are associated with the blog: URL indicators, IP address indicators, hostname indicators, et cetera. So that's all there. One of the things you can do is, if you have a particular piece of malware that calls back to a domain name, you can then look for other pieces of malware that call back to that same domain name. So you could try putting in that domain name or IP address, and then you can see whether there are reports out there about malware families that call back to that same particular piece of infrastructure.

And then also, I like to say, there's the
xkcd, the appropriate xkcd. So when all else fails (and not only when all else fails), Google it anyway. Whatever you're doing, Google whatever you've got, because, you know, how to become an IT expert: the expertise is to Google the name of the program plus a few words related to what you want to do, and then follow the instructions. Very much related to this: if you are researching some particular piece of malware and you've got the imphash or the hash (SHA-1, SHA-256, or whatever), throw all of that into Google and just see if it comes back as
being in, you know, occasionally you'll find malware sandbox results from Hybrid Analysis and other public sandboxes, and all sorts of interesting things that are out there indexed on Google. So always use it.

So that's the end of the basics. Now we're going to get into more esoteric malware analysis techniques that can be used as association methods. So, static analysis. If anyone is familiar with malware analysis, there are two basic categories of analysis: static analysis and dynamic analysis. Static analysis is where you're basically looking at features of the malware file without actually running the malware file. So you're basically analyzing maybe the source
code (well, not source code, but the assembly code), or you might be analyzing some of the features of a document: the author of the document, the modified dates, and all sorts of information. There's a lot of metadata in a file that you can access without running that particular file. So this is all about static analysis.

A couple of interesting hashes. First, ssdeep. ssdeep is a context-triggered piecewise hash. If you think about it, if you have an MD5 hash or a cryptographic hash like that, and you change even a small piece of a file, like flip one bit or something like that, you're going
to change the entire hash value, and that's what it's designed to do: it's designed to detect small changes like that. But what ssdeep does, if you think about it, is break a file into pieces and then look at the hashes of the pieces. So you can take a whole set of different files, each of them broken down into pieces, each piece looking like a little patch, and when you compare those you'll be able to see that, for instance, this file is the same up until here, and only that part of the ssdeep hash actually changed.
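
To make the "hash the pieces, not the whole file" idea concrete, here is a deliberately simplified sketch. It is not the ssdeep algorithm: real ssdeep chooses piece boundaries with a rolling hash (the "context-triggered" part), whereas this sketch assumes fixed 64-byte pieces, and the piece size and stand-in data are arbitrary.

```python
import hashlib

PIECE_SIZE = 64  # bytes per piece; an arbitrary choice for this sketch


def piecewise_hashes(data, piece_size=PIECE_SIZE):
    """Hash each fixed-size piece of the data separately."""
    return [
        hashlib.md5(data[i:i + piece_size]).hexdigest()
        for i in range(0, len(data), piece_size)
    ]


def matching_pieces(a, b):
    """Count positions where the two piece-hash lists agree."""
    return sum(1 for x, y in zip(a, b) if x == y)


original = bytes(range(256)) * 4      # 1024 bytes of stand-in file content
changed = bytearray(original)
changed[500] ^= 0x01                  # flip a single bit in one byte
modified = bytes(changed)

# A whole-file cryptographic hash changes completely...
whole_same = hashlib.md5(original).hexdigest() == hashlib.md5(modified).hexdigest()

# ...while only the piece containing the flipped bit changes here.
pieces_a = piecewise_hashes(original)
pieces_b = piecewise_hashes(modified)
same = matching_pieces(pieces_a, pieces_b)

print(whole_same, same, len(pieces_a))
```

Fixed-size pieces break down as soon as a byte is inserted or deleted, because every later piece shifts; handling that case is what the rolling-hash boundaries in the real tool are for.
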
So if you think about all of the samples that you have in your library as points in a three-dimensional space, you've got one point which is the file itself. The file itself is always going to match itself, so that's the point in the center. Then you've got other files that are a certain distance away from the file that you are looking at. So you basically have a sphere, which is your tolerance for how similar something might be: you decide that everything that's a certain distance or less away from that point, which is the file, you're
going to say is related to that file. So that sphere and everything within it are a cluster of files that are matched to the ssdeep hash that you're looking for. It's actually a fairly powerful method for comparing files, but the huge problem is that almost all malware comes in a packed form. What a packer does is take the executable that gets loaded into memory and run, and either compress it or encrypt it or do something to obfuscate it, and then put that in another executable file. So there's a part of an executable file which has the
payload that you're looking for inside of it. So the problem with ssdeep and many of the corpuses of malware files is that they're all packed, and what you're doing is finding the ssdeep similarity of the packer; you're not finding the ssdeep similarity of the payload inside of it that's delivered and run. So that's one of the failings of ssdeep. But if you've got a system where you've got all of the unpacked malware and you're running ssdeep on unpacked malware, you can have better results.

Import hash: imphash is also a very powerful tool for comparing malware files. In PE, and PE is the standard for executables
in the Windows world, you have an import table, which is the functions that are imported by that particular program. If you take the import table itself and make a hash of it, you can then find other PE files that have an identical import table. If you have an identical import table, well, that other file might just randomly have the same import table, but probably not; they could be related to each other. So this just gives you a way of dipping into a pool of files and pulling back some that are potentially related to the one that you're looking at, and then you
need to dig deeper to verify: are these actually related, or is it something that just happens to have the same imphash? But, like I said about one of the problems with ssdeep, import hash of course also has its own kind of problem. If you're looking at malware which is written in .NET, all of the import hashes are going to be exactly the same. So when you find a piece of .NET malware and you take its import hash and plug it into VirusTotal, for example, you will just get all of the other .NET files out there that have that same import hash.
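
The import-hash idea can be sketched as below, assuming the import table has already been extracted into (library, function) pairs. The normalization follows the commonly used imphash convention (lowercase names, library extension stripped, comma-joined in table order, then MD5), but this sketch ignores imports by ordinal, and the three import lists are hypothetical.

```python
import hashlib


def import_hash(imports):
    """MD5 over the ordered, normalized 'library.function' import list."""
    parts = []
    for library, function in imports:
        lib = library.lower()
        # Drop common module extensions so 'KERNEL32.dll' becomes 'kernel32'.
        for ext in (".dll", ".ocx", ".sys"):
            if lib.endswith(ext):
                lib = lib[: -len(ext)]
                break
        parts.append(f"{lib}.{function.lower()}")
    return hashlib.md5(",".join(parts).encode("ascii")).hexdigest()


# Hypothetical import tables, already extracted from three files.
sample_a = [("KERNEL32.dll", "CreateFileA"), ("KERNEL32.dll", "WriteFile"),
            ("WS2_32.dll", "connect")]
sample_b = [("kernel32.DLL", "CreateFileA"), ("kernel32.DLL", "WriteFile"),
            ("ws2_32.dll", "connect")]
sample_c = [("KERNEL32.dll", "CreateFileA"), ("ADVAPI32.dll", "RegSetValueExA")]

hash_a = import_hash(sample_a)
hash_b = import_hash(sample_b)
hash_c = import_hash(sample_c)

print(hash_a == hash_b, hash_a == hash_c)
```

The same function also shows why .NET is a dead end: a managed executable typically imports little more than `_CorExeMain` from `mscoree.dll`, so the normalized list, and therefore the hash, comes out the same for unrelated files.
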
So you won't be able to find anything related using import hash if you're in the land of .NET.

So, file names. File names are actually very interesting. They're not necessarily interesting on their own, but they're interesting when there's a pattern that's used to create the file names. Has anyone heard of a domain generation algorithm, a DGA? A DGA is basically an algorithm that someone writes that generates a new domain name, and that domain name is generated based on a particular algorithm. Maybe it's a cryptographic algorithm, maybe it's some other algorithm, so you're going to get a pattern in the
domain names it generates. If you think about that from the algorithm perspective, a similar type of algorithm could be used to generate file names, so you might have a file name being generated on the fly, or it just repeats a certain set of file names that it uses. But basically, if you can develop a regex based on the algorithm or the possibilities of the file names, then file name does begin to be a powerful way, or at least one additional method, of relating two files, two samples, together. If you've got two samples that use the same algorithm to generate file names, that's one indicator that these two things might be related.
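
As a sketch of relating samples by a file-naming pattern: the naming scheme, the regex, and the dropped-file lists below are all hypothetical, standing in for whatever pattern you have reverse engineered and whatever your sandbox reports contain.

```python
import re

# Hypothetical naming scheme: 8 lowercase hex characters, then '_upd.tmp'.
FILENAME_PATTERN = re.compile(r"^[0-9a-f]{8}_upd\.tmp$")

# Made-up dropped-file names per sample, e.g. pulled from sandbox reports.
dropped_files = {
    "sample_1": ["3fa91c0e_upd.tmp", "readme.txt"],
    "sample_2": ["00b7d2aa_upd.tmp"],
    "sample_3": ["setup.exe", "config.ini"],
}

# Samples that dropped at least one file matching the naming pattern.
related = sorted(
    sample
    for sample, names in dropped_files.items()
    if any(FILENAME_PATTERN.match(name) for name in names)
)

print(related)
```

A match is one indicator among several; a short or loose pattern will also hit unrelated software, so it is worth anchoring the regex and keeping it as specific as the algorithm allows.
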
You're going to see a couple of things here where I lean on regex, developing regex patterns to match what the adversary, what the malware, is actually doing, and this is another example of that. I know that my talk is about comparing malware files, and this doesn't get you anywhere on comparing malware files; I threw it in here just because it's related to the next slide. When you've got a malware payload, often that payload will sit on a compromised website, and then the malware that's on your machine will reach out to that URL, or reach out to that FTP site
or whatever, download that payload, and then run it. So the URL structure of that download URL can't really be used to relate malware families, but what it is related to is the vulnerability in the CMS that was used to exploit that website and then use it to host your own, or someone's, malicious files. What I'm talking about here is, for example, if you've seen some of the WordPress themes like Twenty Sixteen: if you see any of these, they're all vulnerable and they've all had massive problems. You'll see those in phishing URLs and you'll see them in malware hosting URLs all the time. And I know, by
the way, if anyone here is a fan of regex, I understand this is a very, very simple regex. There's no magic here at all; it's more just to demonstrate, for people who are not familiar with regex, how this might actually work. This is not a very powerful regex; it's more for demonstration purposes. In a moment, a couple of slides on, I'll show you an actually powerful regex for identification of APT malware.

And that leads me to this next topic: the C2. This is the command and control infrastructure where the malware is calling back to get commands, or to deliver data that it's stolen from the victim. The structure of
the C2 URL is directly related to the malware family. This is something that was set up by the malware author, or by the malware itself, or indirectly by the controller. So this is an example of a regex for identifying a particular APT group's malware family. This is one of the things that happened earlier this year, back in February. This particular adversary had to change their tactics because most of their targets were on either Google for Business or Gmail email infrastructure, and they were having a big problem because Google had started getting very good at
finding their attacks and blocking them before they even get to your spam folder. So the adversary needed to change their tactics, and what they did was put the first stage of their attack on Blogspot. What this did was circumvent Google's own malware protections and detection, because Blogspot is their own property, so sending a URL inside of Gmail that refers to a Blogspot URL circumvented all of their protection. This was one that was sent out to a number of different journalists, and I think you can figure out who this one might be.
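
A URL-structure regex of the kind described here might look like the sketch below. The path layout, the `uid` parameter name, and the example URLs are hypothetical; this is not the pattern from the slide or from the real campaign.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Hypothetical first-stage URL shape: a Blogspot host, a dated post path,
# and a 'uid' query parameter carrying an opaque token.
C2_PATTERN = re.compile(
    r"^https?://[a-z0-9-]+\.blogspot\.com/\d{4}/\d{2}/[a-z0-9-]+\.html"
    r"\?uid=[A-Za-z0-9+/=_-]+$"
)

urls = [
    "http://example-news-update.blogspot.com/2018/02/post.html?uid=QUJDREVG",
    "http://example-news-update.blogspot.com/2018/02/post.html",
    "https://unrelated.example.org/index.html?uid=QUJDREVG",
]

# Keep only URLs with the expected structure, then pull out the token.
matches = [u for u in urls if C2_PATTERN.match(u)]
uids = [parse_qs(urlsplit(u).query)["uid"][0] for u in matches]

print(matches)
print(uids)
```

Matching on structure rather than on a fixed hostname is the point: the adversary can rotate blogs cheaply, but the shape of the path and query string tends to stay put.
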
And for any of you that are looking very carefully at this regex, you'll notice there's a UID here, and you'll notice I didn't include the UID value, and that's for a very specific reason: the target's email address was URL-encoded, sorry, Base64-encoded in that particular query string, and I'd rather not put email addresses up on my slide deck. If you want to read a bit about this, this is a blog post that I wrote back in January or February. It's a bit of a jab: it was like a one-line forwarder as their blog post, which was a terrible blog post, and they really need to learn how to blog
better.

And then you've got EXIF metadata. I don't come from the photography world, so I might not get this exactly right, but I think EXIF came from photography, and it's a way of storing metadata about photographs. It's actually grown way past photography, and there's a program called ExifTool, amazingly still written in Perl, but to each their own. It's a very effective tool, and it doesn't just work with photographs anymore; it works with any file that you can possibly think of. It does a lot of very cool things
and it basically does a lot of the static analysis for you. There's a bazillion things you can get from it. Timestamps are important, like the compilation timestamp. One of the things that I use the compilation timestamp for is looking at the file's compile time and comparing that to the time that it was submitted to VirusTotal, or submitted to Hybrid Analysis, or submitted somewhere online. If those two times are too close together, then you know that it's the adversary that's submitting and testing their own malware.
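
The compile-time check can be sketched in a few lines. The 24-hour threshold and the timestamps are assumptions for illustration; what counts as "too close together" is a judgment call that the talk leaves open.

```python
from datetime import datetime, timedelta, timezone

# Assumption: an upload within a day of compilation deserves a closer look.
SUSPICIOUS_GAP = timedelta(hours=24)


def submitted_soon_after_build(compiled, first_submitted, max_gap=SUSPICIOUS_GAP):
    """True if the upload came after the build but within max_gap of it."""
    gap = first_submitted - compiled
    return timedelta(0) <= gap <= max_gap


# Made-up timestamps for two samples (UTC).
compiled_a = datetime(2018, 2, 14, 9, 30, tzinfo=timezone.utc)
submitted_a = datetime(2018, 2, 14, 10, 5, tzinfo=timezone.utc)

compiled_b = datetime(2016, 6, 1, 12, 0, tzinfo=timezone.utc)
submitted_b = datetime(2018, 2, 14, 10, 5, tzinfo=timezone.utc)

flag_a = submitted_soon_after_build(compiled_a, submitted_a)
flag_b = submitted_soon_after_build(compiled_b, submitted_b)

print(flag_a, flag_b)
```

Compile timestamps can be forged or zeroed, so a negative or very large gap does not clear a sample; the check is only useful as a positive signal.
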
That's a good thing to know and to note. I could spend years on EXIF metadata, but I encourage you, on your own, to go look up ExifTool and start messing around with it. By the way, ExifTool does now have JSON output: there's a flag so you can output the data in JSON, so it's super easy to write a Python script wrapper around that, suck out the data in JSON, and use it as you need to.

This is another example of ways of relating malware: the code signing certificate. Code signing certificates can be abused by the adversary in many different ways. So
you can actually create a fake cert, of course, a self-signed cert or something of that type. Or, if you're really bold, you can go steal someone's: compromise some company or a developer, steal their credentials, and then start signing software as them. And then also you can have something that looks like it almost is a signature, but there's something broken about it, like it doesn't work. But all of these methods can be used to identify related malware. So this is one that the folks on the research team at ThreatConnect found. This is a sample which was found to be Derusbi APT, and there's
a valid signature from DTOPTOOLS Co. And then, by looking for this same signature, they were able to find other samples that were related to it and used in previous attacks. One of the reasons why this is important is that it tied a number of samples that were unknown to a fairly famous breach that you might be familiar with: the Anthem (WellPoint) breach. This is definitely not WellPoint's hostname, "we11point", but this is one of the samples that was related to that particular breach. And this is another blog post that was written about that by my team.
So please go check that blog post out if you want more information on that particular case.

There's also a number of other metadata points that you can use on the static analysis side of the world: sections, imports and exports, and resources. This is probably a pretty busy slide, but it shows the PE executable format and where you find things in the PE format. It shows you imports and sections; all these are in very specific parts of the file. There's a standard, so you can go look and see that particular piece of the file. And so what
you're basically doing, for example with sections or resources, is you take a hash of that section, or you take a hash of that resource, and if you've gone through and done that for all of the files that you have, you can then find files that share the same section or share the same resource. Again, just like import hash and other hashes, this is just a way of dipping into the pool and pulling back some that you think are related to the file that you're looking at. Then you need to look deeper from those, through dynamic analysis or through other ways of tying them together.
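
Hashing every section or resource and then looking for overlaps amounts to building an inverted index from hash to samples. A minimal sketch, assuming the section bytes have already been carved out of each file (the byte strings below are stand-ins, and SHA-256 is just one reasonable choice of hash):

```python
import hashlib
from collections import defaultdict


def build_section_index(samples):
    """Map each section hash to the set of samples containing that section."""
    index = defaultdict(set)
    for sample_name, sections in samples.items():
        for section_bytes in sections.values():
            index[hashlib.sha256(section_bytes).hexdigest()].add(sample_name)
    return index


def samples_sharing_sections(index, sample_name):
    """Other samples that share at least one section hash with sample_name."""
    related = set()
    for members in index.values():
        if sample_name in members:
            related |= members
    related.discard(sample_name)
    return related


# Stand-in section contents; real ones would be carved out of PE files.
samples = {
    "sample_1": {".text": b"\x90" * 32, ".rsrc": b"ICONDATA-1"},
    "sample_2": {".text": b"\xcc" * 32, ".rsrc": b"ICONDATA-1"},
    "sample_3": {".text": b"\x41" * 32, ".rsrc": b"ICONDATA-2"},
}

index = build_section_index(samples)
related = samples_sharing_sections(index, "sample_1")

print(related)
```

The index only produces candidates. Sections that are empty or that come from a common packer or compiler stub will match across unrelated files, which is why the deeper checks are still needed.
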
Just because something has a shared section doesn't mean that it's related; it just has a shared section. But it is a better indication of being related than not sharing a section. So you always need to have other methods of relating it; this is just a place to start. So this is an example of a section in this sample: that's the hash of the sample, and this is the hash of this particular section. Same thing here: this is a resource, and this is the hash of that resource.

There's a couple of other related methods for finding things that are related to a file you're
looking at. VirusTotal has many, many different indexes that you can search through, and one of the things they've got indexed is called "similar-to". It's some proprietary black magic that they use; it's fairly effective. Also, last month I went to visit my friends at ReversingLabs in Zagreb, Croatia, had a nice happy hour with them at their little pub, talked about malware analysis, and actually gave this talk to them in their office. They noted to me that they actually have their own version of that "similar-to" algorithm, but theirs is not black magic: they actually
published exactly what the algorithm is and how it works, and it's all right there, so I encourage you to take a look at that. It's called the ReversingLabs Hash Algorithm, RHA. This is another way of basically clustering or finding related malware files.

Then, when you leave the land of executables and you're talking about documents: documents have their own special set of metadata that you can extract using static analysis. Your mileage may vary on different data points there, but some of my favorites are author; if you're talking about Office documents, they're potentially going to have an
author, if it's set. And if you're in the land of PDF, the equivalent of the author is actually just called the PDF producer, but these two data points are basically exactly the same: the name of the person, or the name of the company or whatever, that wrote the document. Timestamps are great again; I talked about why timestamps are interesting. Timestamps are also good if you've got a body of different files and a number of them all share the same timestamp; that's another indication that things might be related. Language: your mileage will vary, unless you're talking about a very unique language that is fairly obscure. I'm not talking
about Russian or Chinese; those are not obscure at all.

So, the next association method I want to talk about. Everything I've talked about so far is static analysis, and I'm not going to go too in depth into dynamic analysis; I'm just going to gloss over some of this stuff. But one main way of finding whether a malware file might be related to another one is through a mutual exclusion, a mutex. For those of you that are not programmers, totally fine. Processors today aren't like the 386 or 486 back in the day, where there's one processor. You now have a processor that has multiple cores; on
top of multiple cores, you might have multiple processors with multiple cores, and so what you get there is a problem of concurrency. If you've got one program and that program wants to run some code, it needs to know that, if it's running that code on a particular core, some other copy of the exact same thing isn't running somewhere else, with the two bumping into each other. The problem of concurrency is solved by creating a mutual exclusion. The first thread or the first program to run will create a mutual exclusion, which is meant to be kind of a unique string, and then the other one is going to look for that mutex, and then
it finds that mutual exclusion and it won't bump heads with the one that is already running. So, you know, this is sort of the more technical version of what I just said, but essentially what that means is that the mutual exclusion is a unique string; uniqueness is one of the inherent qualities that let it work as a mutual exclusion. That also means that the unique string might be reused across different malware families, or across different samples of one malware family. So if two different files use the same mutual exclusion, they could either have a shared code base or they could be the same malware family.
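As a rough sketch of how that idea could be put to work, the Python below groups samples by the mutex names they were seen creating during dynamic analysis. Everything in it is an assumption for illustration: the sample identifiers, the mutex names, the `observed` structure, and the `group_by_mutex` helper are hypothetical and not taken from the talk or from any particular sandbox.

```python
from collections import defaultdict

# Hypothetical sandbox results: sample identifier -> mutex names it created.
# All identifiers and mutex names here are made up for illustration.
observed = {
    "sample_a": {"Global\\qx7_unique_lock"},
    "sample_b": {"Global\\qx7_unique_lock", "other_mutex"},
    "sample_c": {"unrelated_mutex"},
}


def group_by_mutex(observed, ignore=frozenset()):
    """Return mutex names shared by two or more samples.

    `ignore` is an optional set of mutex names to skip, for example
    names that are known to be common and therefore not distinctive.
    """
    by_mutex = defaultdict(set)
    for sample, names in observed.items():
        for name in names:
            if name not in ignore:
                by_mutex[name].add(sample)
    # Keep only mutexes seen in more than one sample; a shared mutex is a
    # hint of a shared code base or family, not proof of it.
    return {
        name: sorted(samples)
        for name, samples in by_mutex.items()
        if len(samples) > 1
    }


shared = group_by_mutex(observed)
for name, samples in shared.items():
    print(name, "->", samples)
```

A shared mutex is only one signal; it is most useful when combined with the other association methods discussed in the talk.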
Registry keys. I'm not going to go too in depth into registry keys, but please go check out the Windows registry; I don't want to give a lesson on Windows programming. Basically, for all intents and purposes, the registry is a place where programs can save some data and then access that data. There are basically a couple of different functions: you can delete a registry key, you can change a registry key, or you can write a new registry key. The set of changes that software makes to the registry can also be unique, and can be used to identify, you know, a shared code base or malware
family. Now I'm going to go advanced, into some of the more fun stuff, and this ties in with the talk earlier; their company is one of the companies that does this sort of work. One of the things I wanted to start with, and by the way, I checked with the machine learning folks at ReversingLabs and I changed the slide that's coming up, so I do know that these are not machine learning algorithms. But this is actually a talk that was given at Baca last year. There was a workshop; I took the workshop, an awesome workshop. All of the material is on GitHub, and so I encourage you to go
through and, you know, work through this workshop on your own. It's an eye-opening workshop. I'm not going to go too deep into what it does; basically it uses these two algorithms, and you learn how to use these two algorithms to find clusters. They actually use a fairly small data set, just so that you can work with it in the classroom. It was theZoo, which is a reference data set for malware. I think it's on GitHub, but don't quote me on that; just Google the zoo malware dataset. And one of the things that I learned from this is that a
lot of people think that to do malware clustering you need a lot of GPUs and a lot of horsepower and stuff like that, but you don't necessarily need that. There are a few crafty ways of using sparse matrices and other methodologies. And, you know, I'm a biologist, I'm not a mathematician, so I'm going to talk to you as the mere mortals that we are. So, a sparse matrix, the way it was explained to me: if you look at a matrix, there are a lot of numbers in the matrix that are all zero, and so you're basically wasting compute time by, you know, spending time
having the computer process a bunch of zeroes. There's a way to basically collapse the matrix down to only the data that is nonzero, and that actually saves your computer a lot of processing time. And so, you know, one of these fixes everything, the other one's roll tape. This is a bit of an aside, and it's a little bit of an advertisement for one of my friend's projects: this is the intersect finer things, and I encourage you to follow her on Twitter. She had an idea a while ago to have wine pairings with malware analysis, so, you know, basically
carry the finer things blank, you know, which malware pairs best with a Malbec. So please follow this; it's more fun, not really serious, but this is a group that I encourage you to follow. Okay, and then I'm going to go really fast through this. The diamond model of intrusion analysis is a way of taking the different data that you have about an intrusion and making sense of it using graph theory. And this, I know, is not malware related, but this is actually from Star Wars: we applied the diamond model to different parts of everything
in Star Wars and, to be honest, the argument that happened in our office as to where things belong and how they should be related to each other was literally a shouting match; people were getting angry about their opinion on where things go in the diamond model. So this is a lot of fun, but it was quite a tough week for us trying to figure it out. If you want the serious version of this, this is the paper itself, so please go download the paper. I read it; it's a very powerful tool for making sense of an intrusion. Another
association method: this is Icewater, and this is actually the link to Icewater. I'm going to use this. This is a publicly accessible system; you can take [Music]
All right, so you've got the clusters. You've got a cluster of malware that's related; this one has 25,514 members. Then you can see another cluster that is a two-bit distance from this cluster, and then these are clusters that are one bit different. And this is basically what it's comparing: all of these are members of one particular cluster, and you can actually see visually why they're all related. Even I can see these are related, and you can also see where
they're changing. You can see that one little spot down here, this area, kind of wiggling around as you switch through all of these; you can see where something is changing. See that? So that's another way of looking at what's related. All right, and then, as you had mentioned earlier, control flow graph analysis. I want to talk about this talk; it's actually quite old now, it's from DerbyCon 2014. This is a really good talk if you want to get introduced to control flow graph analysis: what it's used for, why it works, and it kind of breaks down exactly what it does. So there's a YouTube video
right here; Douglas Goddard was the speaker. Actually, if you open up the video, if the camera were maybe a couple of inches this way, you'd see my head in the front row. But basically, what it does: you have a control flow graph, and you're looking at just the control flow graph of the non-library code. If you're looking at the control flow graph of library code, you're basically developing a signature for a Microsoft programmer, and that's not something you need to do. So if you remove all of the library code, all the code that was written by Microsoft in the library or written by someone else in an
imported library, and just look at the control flow graph of the code that was written by the author of the malware, you can use control flow graph analysis to find other malware that has that same control flow graph signature. There are many things that you can do with that; it is yet another way of finding potentially related files. I'll say this again: I always like to say that you don't want to use one particular method, you want to use as many methods as possible. Once you have multiple ways of demonstrating that something is related to another sample, the more variety of pathways that you can relate, the
stronger your confidence is that those two things are related. So that's it. This is my Twitter handle; please follow me on Twitter. And questions? I think we have three minutes. Yeah. Going back to the start of the presentation: you were going through the commercial antivirus programs and showed that six or seven of them were even using the same engine, and a couple of other ones were using a similar engine. Do you find that it's the same ones all the time, or would those groups be different depending on what Trojan you feed it? Like, maybe you'll get six for that version, but you'll sleep together
Yes, this problem would be a whole conversation in itself. But the value of attribution of these, is it just for classification purposes, to find other related malware? So, let me put the question in your mouth: you used the word attribution, so you're basically asking how valuable it is towards the goal of attribution? Yes. Okay. So, as the leader of a team that does attribution often, one of the things I want to stress is that attribution can never be your goal. That is complete nonsense; never have attribution as your goal. Go through your malware analysis process, and attribution may or may not
emerge from the process that you go through. You're never looking at a piece of malware like that. If you're an incident responder, your goal shouldn't be "I need to attribute this malware to a malware family" or "I need to attribute this to a threat actor group or an adversary." That should never be your goal. Your goal is to analyze the software, figure out what its capabilities are, and figure out what it's related to. During that process you may chip away at something and go, "Huh, that's interesting," and then attribution may emerge from that process, but it's nonsensical for that to be the goal of your process. You don't
start out saying, "I need to attribute this." All right, thank you.