← All talks

Advanced Techniques for Real-Time Detection of Polymorphic Malware

BSidesSF · 201624:4091 viewsPublished 2016-04Watch on YouTube ↗
Speakers
Tags
CategoryTechnical
StyleTalk
About this talk
In this Session, we will introduce the audience to various techniques that are used in the identification and classification of polymorphic malware. By definition, polymorphic malware easily evades traditional signature based detection methods. Approximation Matching algorithms such as ssdeep have had much greater success in detecting polymorphic files. The ssdeep hash is one of the more popular attributes that is computed for a file by a number of sites such as VirusTotal, Malwr and Anubis. Newer algorithms using bloom filters have also shown great promise in detecting polymorphic malware. This session gives an overview of these various algorithms and compares their efficiency and performance.While ssdeep is a good tool for comparing two known files, it becomes computationally expensive when a new file (and its ssdeep hash) is to be compared with a large database of existing ssdeep hashes to determine the closest match. In this session, we enumerate a class of techniques which reduce the lookup time significantly and allow for fast detection of similar files. These techniques are then extended to the classification of polymorphic malware and we show the efficacy of these techniques with real data collected from the field. We then analyze the performance of these algorithms both from a speed as well as their success rate.
Show transcript [en]

We have a Did I say that? A jeep. Hi, I'm Doug. I'm one of the organizers. What are you going to talk about? Polymorphic malware. Yes. Polymorphic malware. That's easy for you to say. I'm trying to think of something. Yeah. I dare you to be folksy about polymorphic malware.

That's just his sheet. So, oh, so we're ready to start. Sorry about this. Am I going over? Uh oh. It sounds like we need to start. Okay. A jet, right? Yeah. A Jeep. A Jeep. Yes. Okay. So, once again, my name is Doug St. I'm one of the organizers for Bides this year. And I want to thank everybody for attending. If you got a problem, talk to the other people. If you like it, talk to me. That's the way we run things here. So, Ajit, right? A G. Yes. I'll get it right by the answer. I'm not going to try your last name. Advanced techniques for real time detection of polymorphic malware. And I gotta tell

you, I got some in my pants right now that I'd like for you to analyze. Okay. Take it away. I'll let you go for it. All right. Um, well, first of all, I really want to apologize, you know, for this talk. It's kind of technical. I didn't realize that I needed to add a lot of like cats pictures and, you know, dog pictures and make it really, really funny. I will do that the next time. Okay? I promise. Um, and this is a little bit on the, as I said, on the technical side. I will try to make it as humorous as possible. And feel free to stop me at any time if you have any any

questions. I know that, you know, you guys want to head out somewhere probably really soon and uh I'll try to do my best. Okay. All right. There we go. Friends at home want to hear you. Totally. Thank you. All right. So, my name is Ajit Kiagar Rajan. I am currently an independent security researcher. Um, I used to be at Fidelis before this. Actually, if there are any Fidelis people over here, um, uh, for the rest of you guys, I'm actually in stealth mode right now. So, um, yeah, it's just it's just me. Um, oh, sorry. Okay. Don't move. Okay. All right. Um and prior to this until about 2 months ago I was I was at Fidelis um

you know in various director roles over there. Um a shout out a lot of this research was done while I was at Fidelis. Uh this was in the summer of last year. I had an intern from uh University of New Haven his name is Vic and uh he helped me on a lot of these projects and trying to run in in terms of running the experiments and so on. So um uh a big shout out to him for actually getting a lot of this work done and of course for Fidelis to uh for providing a lot of the data. So what exactly am I trying to do over here? This talk is actually a little bit about just looking at malware

uh which is something that we did a lot of trying to make some sense of what does what are these some of these patterns that we're seeing and um you know try and improve detection. So the Fidelis box for example uh is a network security device. Uh we scan you know malware as it goes by uh in and out of an organization and then we try to do it really really fast. So um one of the one one of the requirements was that how can we improve our fast detection techniques uh for malware um and and that's what this sort of research was focused on. So malware generally is distributed through multiple ways. Just a couple of them

listed over here. You got malicious websites, email, you get malware in email, fishing, spam, and then if you happen to be infected by some kind of a bot or something like that. Um then there's malware that can come in over command and control channels as well. So uh those are like just the various mechanisms for distributing malware. Now on to detection techniques. So um there are various types of detection techniques and usually you try each one. So if for example you get a file and you see that um it matches a list of uh matches one amongst a list of signatures that you potentially have then you can say okay you know what it's an exact match it's probably there's

malware you probably have a some some uh attributes that go along with that signature and you can say that okay I'm done it's malware I can I can I can rest if not then you move on to some kind of static analysis and then heristics and then and and ultimately to emulation. Emulation where you have like a sandbox type environment and so on. And the key here is that you can see how the time scale sort of progresses. So signaturebased detection literally is in the microscond you know uh range all the way up to you know several minutes for doing emulation. So you want if you can somehow get your detection as close to you know microcond

millisecond as possible then your speed of detecting malware obviously is significantly higher. So uh so you want to you want to sort of focus on those techniques that you can do really fast detection with rather than you know um the heristics and emulation uh techniques because obviously you can't send every single file for emulation analysis you know it's just not going to work. So what we were trying to do was uh we get this report every day of malware that we that our various sensors have seen uh in the field and what we were noticing is that uh we saw a particular file you know that was determined to be malware and we would see like a boatload of MD5s associated

with that one particular file. So obviously signature matching was not working for that particular file and also why were we seeing so many different MD5s for the same file. So we analyzed that a little bit more and found that um these were uh malicious websites that were continuously modifying the files on the web server itself. Okay. So uh um and and this this type of modification is called polymorphism. uh in case you're not familiar with that term. And what they do is the code or the or the file uh is altered by changing a few bits or you can do encryption or you can offiscate but ultimately the characteristics of what the file does is exactly the same.

Okay. And as far as polymorphism itself is concerned, the serverside polymorphism is what we were observing with these multiple MD5 files that we were observing. Um, and there are two different techniques that we saw. We saw sites that did volume malware. So for example, they would send out a mass email to millions of people with like some URL in there so that the file name stayed the same for everyone. And then as people clicked on those links and went to that website um it would then serve out new variations of that file by just altering the file just slightly. Uh compare that with ondemand malware where it's a much more targeted attack where they can encrypt it and offiscate it and

so on. But then it obviously takes a lot more time to do these things and therefore volume malware you know just has a few uh bits that are that are modified. And so what our goal was to try and see if we could figure out or or um detect this volume malware which was escaping our signaturebased detection using some other technique that was really fast rather than doing the static heristic or emulation based analysis. And again we couldn't tell at that point which one of those were actually being hit. We definitely knew it was not signature based. And what our goal was to try and see if you can detect that using something close to signature based

analysis. Okay. Um and is everybody with me so far? Okay. Cool. Awesome. So um enter SSD. So how many of you guys know SSD? Okay. I see a couple of hands over here. Okay. So SSD is essentially a uh an algorithm for figuring out what is called a fuzzy hash of a file. uh as compared to like the MD5 or uh the SHA Shiaan 25 SH SH SH SH SH SH SH SH SH SHA 256 and so on. So the way it works is that uh the algorithm sort of cycles through the file and then it breaks it up into chunks and then hashes each one of those chunks into a single bite and then concatenates all of that. So it's

sort of like a compressed representation of that entire file. I mean I don't want to go into the details of how SSD works. You can just Google SSD and you'll find a whole bunch of hits on that. And uh and and essentially you get a a what is called an SSD hash or a fuzzy hash. Okay. And this diagram sort of gives you an approximate idea of how that how that works. And a lot of malware sites today um compute this SSD pash. So for example this little snippet here is from virus total. Um you can see that um uh amongst the malware properties which include you know MD5s and and shiaans and all that

the SSD hash is also provided and pretty much all of the the popular malware analysis sites provide that as an attribute for the malware file. Okay. And and so how do you use this? So primarily for a comparison. Okay. So the SSD algorithm uh that you can download from uh you know from source forge I think it's source forge or or wherever wherever it's hosted right now um has a mechanism to compare two files. So what it what it'll tell you is that if you give it two files and run SSD on that, it will tell you that hey, you know what, file one matches file two and then it gives you a little score at the very end. And that score

will tell you how close those two files are together. Okay, so in this case over here, file one matches file two. 99% of you know they're they're identical. Of course, they're completely identical give you 100, but in this case, maybe there was a very small difference between the two files. And then again if you do file three matches file four with 44 that means that there's a significant difference between that but there is some commonality elements in between the two files. And another use case for something like SSD uh is in um uh grading exam papers or reports you know so what they do is they look at two uh answers and see if there's like a like a

similarity between these two in terms of the words that they use and so on. just to a side note. Um anyway, so back to what we were trying to do. You know, we I I showed you this uh um uh report that we were analyzing and we saw all these files that had different MD5s and uh but the same file name. So what we did was we looked started looking more closely into those files. So in this case we looked at this thing called data ticket.exe. This was a campaign run by some malware author out there and uh if you just went to virus total typed in data ticket.exe it gave us literally thousands of entries. So obviously a lot

of the malware vendors had submitted these files for analysis to virus total and they had copies of all of those files. So in addition to what we had also downloaded a whole bunch of samples from uh virus total and uh in here you can see that the last field like in this first line over here that says 0074 that's actually the MD5 of the file and every one of these has a different MD5 and the very first um the first field which has a bunch of numbers in there uh is the beginning of the SSD hash. Okay. So all of these are the SSD hashes. What it what it uh uh what it shows is the block size. There

are two hashes and then the file name. And uh if you start looking at the way these uh hashes are constructed like if you look at the second and third entries, you'll start seeing that there's the same characters that start occurring until probably like somewhere close to the very end. So that just sort of shows you what how similar they potentially are. So then we started digging in further into these files. So just did a quick hex dump of the of the file itself and uh notice that uh there's just one line that is being continuously modified between a lot of these files. And uh on the right over here, we did the uh did

the results and found that over a thousand files had a 99% difference uh a 99 99% similarity between each other. Okay, so clearly there's polymorphism out there very very rampant. Now the question was so how can we use SSD potentially to improve our lookup because uh that's the uh that that was the key goal of trying to use something like SSD to do lookups. I mean you can do comparisons between two files. If you got a malware file or two malware files and you compare them, sure you can find out whether they're similar or not. But then now I receive a malware file for analysis. I have a whole bunch of SSD hashes. Am I going to sit and compare

every single one of them? So that's a that's an order end problem in some sense. Just going to take way too long. So that's the SSD lookup problem in some sense. So you got an input file coming in. We have a huge list of uh a huge database of hashes and then we literally have to compare every one of these hashes to figure out which one is the closest and if there's a a closed match in there at all. So uh SSD didn't look so good there in in its base uh uh form in the way they had uh they had structured the code and so obviously we needed to make some changes and we

thought thought hard long and hard about this particular problem. So what we did was we decided that okay we would take that hash and then we would break it up into chunks and then create an index of each one of those chunks and then try and look up each one of those chunks individually and make a little link list if there were additional entries in there and then link it back into the main hash database. And that immediately gave us a significantly um uh high rate of lookup almost like order one um in in all of our uh um uh uh lookups. So uh that was that was a huge win for us in terms of how we can do

lookups using SSD and not having to do a file comparison with every single hash. So the next question was then okay so we've got the SSD lookup now working pretty well but how do we create this hash database? Do we we take all our malware samples and if we just stick them all into the hash database then you know we're going to have a very very large database and there's going to be a lot of similarities in that database itself. So our next problem was to sort of figure out okay how can we optimize the database itself. So in this case the problem became okay um I've got a input file I'm going to stick that into the

database and the question is whether I need to or not or if there's something else that's already similar to that particular file. And we played around with a couple of algorithms. What we did was while we were inserting the file into the database, we found that there was almost always a natural hit. If there was one already in there, we can set various thresholds in there. And if there was, then we could apply one or one of many algorithms in terms of either replacing the new one with the old old one, keeping the old one, or if the threshold was low enough, just add it in there as well. So those were those are some of the database optimizations

um that that we had made. So um any questions so far? Okay, cool. Good. Okay, so that's essentially the u um main techniques that we developed for using SSD from a lookup perspective rather than just a compare comparison of two files. Now the question was um is SSD really the gold standard for doing fuzzy hashing? And the answer is probably not. Um, and and I'm really surprised that it hasn't been replaced by all these malware vendors by something that's a little more powerful, a little more current. So, first of all, this technology is circa 20 uh 2006. That's when it was first uh developed. And since then, there have been multiple algorithms, you know, such as MV hash, SD hash, SIM

hash, you know, some by Google. And then there's this whole area of approximation matching that people have that some researchers have been working on and there are some some uh experiments being done right now in comparing the effectiveness of these algorithms with SSD and in almost all cases they find that these newer algorithms way way outperform what SSD does. One more thing is that SSD actually has a limitation in that you can't compare two files that are very different in terms of their size. So the initial block size does matter a lot. So if you do uh if you create another file that is actually identical in terms of its content but then you just pad it with a

lot of zeros or or no ops then the SSD algorithm sort of uh doesn't doesn't work so well. So um it does have those limitations. And the next thing is that as we started looking into more of these files uh malware files that um um we were see detecting in the field, we were we began to see a small trend in malware authors going from polymorphic malware to like metamorphic malware. And what this means is that polymorphism is just simple changes in the files. like as I said you know a few bits change uh encoding encryption whereas metamorphism is more about just changing the overall structure itself but the behavior tends to be the same so we're seeing we're

seeing that happening as well so we're getting less and less hits now in terms of the same of the different MD files on the same file so the thing is I think now what we are thinking is maybe there's something like machine learning that might be a better fit for detecting these sort of approximations and so on and that's something that we're exploring right now. Okay, I think I did it pretty well here. So, um, you did great. Awesome. Thank you. Any questions? Yeah, go ahead. So, I'm assuming you train this on a Windows. Is that correct? On a on a what? I'm assuming you train this on a corpus of Windows malware. Is that

correct? That is correct. We only use .exe files uh currently. Yes.

uh we have not done it but the same same concept can be applied to any of those as well. Sure. Yep.

That is correct. So encryption wouldn't work over here. Uh what you have to do is you have to unencrypt it and then work on the unencrypted file. So with with encryption that's the that's the issue. And again you can also do things like you can take a memory dump off every single file and then analyze that take a computer an SSD patch of that and then go there. So you can actually do it at multiple stages depending upon where your comfort level of operation is. Yep. Uh so I think traditionally with malware the malware developers like to use kind of standard to uh unique samples, but uh usually like stuff isn't but now I've been noticing that

you can use something like um take predicates that uh succeed um like dead codes when developers insert dead codes it's uh specific to one function but if you can insert predicates that are like a few functions It might completely uh remove the ability of like standard detection. You seen anything? Oh, absolutely. So, so the thing is SSD in its raw with with its raw binary format analysis will completely fail in those cases. So what you have to do is you have to take every one of these files and sort of normalize it which is you have to as I said go the ultimate the ultimate u mechanism for using SSD is to get a memory dump so that you

eliminate all those you know uh no ops and all that kind of stuff and then see if you can apply it on that. Um so that's that's sort of the only way right now and I and as I as I alluded in my last few slides even that can potentially be uh subverted by by malware authors today there's a huge trend towards going to towards metamorphic malware where they change the entire set of instructions like for example instead of register A in the memory they will start using register B and then you know change the an order of instructions such that it does the same thing but then in different sequences ultimate behavior is the same but you

know you have uh you have different different ways of sequencing things and all of those things will completely defeat something like SC like if you use assembly language like the normal assembler or even something like Java line code or like CLR where you kind of control the structure uh more so than You would if you just do like GCC then you would be able to evade most of these kind of yeah you're able to decide like what register you're going to use and then you just kind of have another pass over the compiler say I'm going to change you kind of have like semantically equivalent code generation it's not necessarily metic like what you had in the 90s. You kind

of make it. It's somewhere in between basically. Yeah. Yeah. I I'll give you my business card. Please get in touch with me. I'd love to hire you, you know.

Any other questions? Yeah. One more. Okay. Sure.