
CG - Who Maed Dis; A Beginners Guide to Malware Provenance Based on Compiler Type - Lucien Brule

BSides Las Vegas · 29:19 · 61 views · Published 2018-09
About this talk
Who Maed Dis; A Beginners Guide to Malware Provenance Based on Compiler Type - Lucien Brule Common Ground BSidesLV 2018 - Tuscany Hotel - Aug 08, 2018
Transcript [en]

All right, hey guys, thanks for coming to my talk. Prefacing this: this is my first talk ever at a security conference and my first time at BSides Las Vegas. Cool. All right, so welcome to "Who Made This: A Beginner's Guide to Malware Provenance Based on Compiler Type." So, a personal anecdote: when I first started as an intern at Endgame, we had a lot of false positives and false negatives for, essentially, malware that we had to look at. We had to look at a lot of them, and I got a lot of experience looking at these things. This is some of the knowledge that I gained while doing that, and I really wanted to do it faster. Let's have this

thing click. Boom, OK, cool. So first off, who am I and why am I talking to you? I'm Lucien Brule. I really like computers and stuff, and I obviously have no idea how to talk to people, right? But more importantly, I'm a senior at RPI in upstate New York. Click. There we go, I look like that. I'm a malware research intern at Endgame. I play CTFs and go to hackathons. I'm interested in vulnerability research and malware analysis, especially reverse engineering, and I like entrepreneurship; that's kind of my side thing. Now, also "who," but not me: this is essentially my affiliations. So we have Endgame, which is hiring, and they're outside, like literally right outside

this room. So if you want a job and you think this stuff is cool, I'd go talk to them outside. Also, RPISEC is Rensselaer's capture the flag team. They're actually playing at DEF CON, which is really cool, so if you're at DEF CON, go cheer them on on the floor. It's awesome. All right, so motivation: why are we gonna learn about malware provenance, right? Why is that important? Why are we here? Click. There we go. All right, I'm gonna read this through. We typically classify malware by behavior, but we need static analysis and instrumentation to understand samples. If we know a malware's composure and its origins, we can reverse it faster, and if we can

reverse it faster, we can process more samples, and if we can process more samples, we learn more, we get better, and then we go home earlier. So if we have a lot of samples, we can get through all of them really quick if we're really good, right? All right, I'm gonna preface this with some lingo. When I say sample, that essentially means an executable, like a Windows executable, right? That's analogous to a program; that's the thing under test that we're looking at. An FP or an FN, if I talk about that during the talk, is a false positive or a false negative. So essentially we're gonna test something and see if it comes out to be our concluded thing. So if it's malware

and we say it's not malware, that's bad; that's a false negative, right? So it's not a false positive. Yeah, there we go. OK, nerves. TLS, thread-local storage: that is this thing that Windows and Linux do, and actually it's a cross-platform thing implemented in different ways, that essentially lets you run some kind of code at a given point in a program's lifetime. Compilers are what translate source code into the actual binary, or the sample. Shellcode is an independent program that can live inside of another sample, so it lives off the land, so to speak. And then bootstrap, or initialization, is essentially the startup routine of any program. It's not like the web CSS framework; it's

like the thing that starts up your program. Cool. All right, so our goal: we're gonna cover key parts of a Windows binary, and we're going to learn how compiler provenance can give us expectations that we can use to reverse it, and how to leverage those expectations to actually reverse it faster and better, hopefully. So what a compiler is, like I just said: a program that converts instructions into machine code or a lower-level form so they can be read and executed by a computer. And then provenance is something's origin: essentially where it came from and what made it. All right, so the problem and existing research. This is

a pretty big problem space, so we're gonna talk about classification of malware and, traditionally, its origins. So, a quote by this professor, Upchurch, at the Air Force Academy in Colorado Springs. He said: "An anteater and many birds, insects, reptiles, and amphibians are related because they all consume ants." Now, that's essentially how we would classify malware today: they behave in a certain way, ergo they are all the same thing. But we know they have some kind of deeper composure, just like animals, right? I'm actually sad my GIF isn't playing, but it would be Donald Duck with some other ducks, and they all look the same, therefore

they're all ducks, right? Because they look the same. False. So that's what I just said, and if you go over the duck test, that's essentially: if something looks like a duck, walks like a duck, and talks like a duck, it must be a duck. And in biology, we know that things have bones, right? We know that wolves are related to whales because they have some kind of common ancestor way back when. Whales went into the water, but they still have flippers with little toes like wolves have right now. With malware, if we were to look at its composure, its internal composure, we would look at things like the libraries it used or

maybe even the compiler that generated the malware, and that's essentially its actual composure, not just how it behaves. In terms of industry and how we classify malware, we use terms like spyware, a virus, a worm, or a trojan, and those are all behavior-based things. For example, if you open up VirusTotal, which is a gigantic database of a bunch of viruses and malware and benignware, like normal software, you can see how all of the antivirus companies actually classify the malware, and they actually kind of differ. Some of them actually do look at composure and some of them look at behavior. So for example, you'll see on

the right-hand column you've got heuristic, agent, malware, but then some of them are actually saying, "oh hey, it's a trojan, it's ransomware, and it's Petya," right? And if you want to go look at this file yourself, you can check out the hash below after the talk; the slides will be online. All right, so existing methods to determine malware's behavior and its composure, especially composure. We can dig into the actual binary, look at the hex, and manually reverse it, and that's manual verification. We can use somewhat more advanced methods like hashing, which we'll get into, and then we can also use

machine learning, AI, and statistics to kind of immediately tell us what malware is composed of and what it does, which are two different things. So hashing: we have a couple of types. We have cryptographic hashing, which is what you see with md5sum, or SHA-512, or SHA-1. That's verifying that one file is exactly the same as another file if their hashes match. But then, what if I want to test similarity between two files, right? I don't want a true or false, "is this the exact same file." Then you get into something called fuzzy hashing, for which the current

implementation is ssdeep, and that's essentially: a hash that's within a given distance from another hash means the files will be similar, and you can infer some degree of similarity from the hash. And then you have import hashing, which is pretty much just a hash of everything a program or a piece of malware uses in its import table, like what libraries it uses. That tells you what it does, because you know if it's importing things for cryptography, or importing things for network requests or file access or anything like that. Malware with... oh geez, we spoiled it, we spoiled the whole talk. Whoo, guys, you can go home now, it's cool, it's done. I

was like, OK, import hashing: things with similar imports, things that use similar libraries, will all essentially have the same imphash. And that was actually a GIF. It was really great: it was a hash brown with the math lady meme over it, because it's hashing, right? So it took a lot of math and it's crazy. Oh, this thing is not doing too hot, or maybe I'm just nervous. All right, so current research in AI, ML, and statistics to actually get you the behavior of malware: it's a lot of what the current AI companies do. There's some research into natural language processing techniques to infer the composure of malware, and there are also other things like looking

at byte entropy and just using general statistical methods to determine what a given piece of malware does. That's a whole lot of math that I haven't gotten into yet, but I've read a lot of papers on it. I think it's interesting, but it's not necessarily something for a beginner to look at; I don't know if it would necessarily be useful to look at that. So we're gonna start in a happy middle ground, I think. And then we've got the brain meme, because this is high-level stuff, right? All right, fun facts and figures, a.k.a. the threat landscape, and which compilers we're going to be looking at to determine where samples are coming from. The majority of Windows compiler market

share, a vast majority, is Visual C++. Any new Windows binary compiled in the last couple of years is going to be Visual Studio. You see a lot of Delphi and Borland after that, and then the niche but still significant market is MinGW, which is the Windows port of GCC, the GNU compiler suite. Also of interest, compilers that we see samples from would be the Intel C compiler, which is more of a scientific compiler for optimization, and the Tiny C Compiler, which is actually really cool if you want to go and compile some samples and really dig into how a compiler works. It's open source, and you can build your own samples and

tweak the compiler really easily that way. We also see things like all the assemblers: you've got YASM, MASM, NASM, and GAS, the GNU assembler. And then some very niche things: VB6 from way back in the day generates different kinds of binaries; the .NET Framework generates bytecode that then gets run; then you have the Rust compiler and the Go compiler, which will not be covered, but those are also worth a look if you're interested in them. Now, we have sub-segments of all these things. So if you look at Visual Studio, Delphi, and MinGW, there could be different versions, because the compiler version will actually change how the output program is generated and

what it actually contains. We have the target platform: if it's Windows 32-bit versus 64-bit, it might do something different. We have additional options: it could be inserting stack checks, like cookies in the stack, static compilation, or including other libraries inside of the file. And it could be packed. There are so many packers, and a packer is essentially just a program that wraps your program and makes it somewhat harder to reverse engineer. So these are all sub-segments. There are a lot of tools to undo all these things and arrive at something that's called normalization, which essentially, if you have all these different sub-segments, makes

them back into the bigger segments. So let's say you were compiling something and it was optimized; maybe the compiler finds some kind of generic optimization in your code, like it unrolls a loop or something. Then normalization will put all of these things back into the original, unoptimized version, so you can actually look at it and check differences amongst different variants. All right, so one thing we have to know to actually infer the composure, the differences, of malware is, like it says right there, guarantees and expectations. So what guarantees and expectations are made at program runtime, how are they satisfied by given compilers, and how does their

satisfaction actually matter? So what we're guaranteed at runtime: we know that when a Windows executable is loaded, some operating-system-level things happen, like it has to schedule your process, right? And then we know that at some point Windows, or Linux, or whatever you're looking at, but specifically Windows here, is gonna hand over control to your program, right? So in this case, I'll run through it. Windows does things like map the executable and all of its imports into memory, it schedules the process, and then Windows will jump into your program. Your program actually starts at the very beginning, when a TLS callback is registered, and

you guys are probably asking, what's a TLS callback? We'll get to that. The C runtime has to be set up if you're using C, right? Because there are a lot of library calls in C that you might want to use, but that's an internal thing to C. We know that we have to register exception handlers. For C++, we have to construct all of your classes at the beginning of the program. And then after that, the C runtime will actually call your main, after it gets things like argv and argc and gives them to main. Now, this GIF was pretty cool: this was Bill Gates with the tablet, and it was like recursion, because the image is

always loading. That was cool; you're missing out on that one. All right, so we have four things to look at for each of these compilers, and as a new person, this is what I found most interesting in terms of provenance and why the compiler matters for different binaries. We have process initialization, module loading, TLS callbacks, and the calling convention. Those are all really important, and we'll get into those right now with Visual Studio. Cool. So on the right we have Clippy. He should be blinking at you; he's not. Visual Studio has a very interesting thing in that it adds an actual XML assembly manifest at the end of a program's resources, so it's

like a little label to state things that it's done in the binary, which other compilers do not have. So if you happen upon a program and you see, "oh, what is this XML manifest thing?", that's probably Visual Studio. You can go look at that, and it's interesting to look at. It'll tell you some key things, which you can read about on MSDN, and it's not always the same things, so sometimes you get a really interesting tidbit. So it's actually useful to look at that. And then it also has some typical imports. It imports a lot of things, but by default it includes one thing which is important; we'll get

to that. So, calling convention: it uses cdecl by default, which essentially, when you're looking at the disassembly, means you have push, push, call, and then you'll see some stack cleanup. I'm also covering Windows right here: whenever you see a Windows system call, it uses stdcall, where the caller doesn't actually have to do any of the cleanup; the called Windows function does that for you. All right, so if we're looking at a Visual Studio program and we see something that isn't using cdecl, we're like, "oh, that's weird, that's another option they specified," or maybe this is something that isn't Visual Studio, even

though I already inferred that maybe it was, right, based on our initial guess. That's kind of how we work: we build a bunch of guesses and try to justify our guesses, right? So, MSVC runtime initialization and what it does, which is actually very different from the other two. It uses Windows built-in things; this is from reversing some Windows stuff. At the very beginning of our program's lifetime, it goes to ntdll, which does something that just starts your program; think about the operating system level of things, right? And then it goes to kernel32.dll, which calls BaseThreadInitThunk, and this is from the Visual Studio samples, like 2016-2017 samples,

that I was looking at. And then your program is called. An MSVC program will have an entry point which then jumps into a stack check routine, and then a couple more stack frames down it'll actually call your main. That's somewhat interesting compared to the others. Note, I don't have a laser on this thing, but note the additional kernel32.dll, which is different from the other couple of compilers we're looking at. All right, so MSVC TLS callbacks. We're gonna get to what a TLS callback actually is: it essentially lets you register your code at the very beginning of a process. So before main is even called, before literally anything in your program runs, you can actually run code before

your program, which is odd, right? You're like, "well, I wrote code, that should be in my program," right? But it actually runs before main. So you can run code before main, and as a beginning reverser, I thought that was of interest, so these are things that we want to look at. It essentially just makes a bunch of stack frames, and then this happens kind of asynchronously over the lifetime of your program. So it'll just interrupt your program and run that function that you've registered, or do whatever you, the programmer, told the computer to do, without actually messing with any of the things you've set up. So if you're in the middle of a function and another process

is created, your TLS callback is like, "BAM, I'm right there, I'm gonna do this thing," even though you were already doing this other thing. So then module loading is different, right? Every MSVC program will have vcruntime and then, like, a 160, dot DLL, and you'll also see kernel32.dll loaded statically. The static imports are important because MSVC will statically import things. If you open up a program in CFF Explorer, which essentially tells you the different sections of the program and all of its attributes, it'll say, "these are the things that this program does." And we know that MSVC by default doesn't dynamically load things, so it

won't go look for a library, for example. So if we saw a dynamic module load in MSVC, we would know that that's something fishy, right? And we'd want to go look at that thing. All right, so the next compiler is MinGW. MinGW depends on the GNU library DLLs. Let me describe the GIF, right? So that's a bison running very majestically, and then the GNU logo in the foreground. Cool. So the GNU library DLLs, usually, if you see a packaged program, they're shipped with the program that you get. So you'll have an executable and then some DLLs that it's gonna come with, which are hopefully in a folder that it knows

about. There we go. So it uses cdecl just like MSVC, which is push, push, call, and then some cleanup, right? That's pretty standard. You can also change that, just as with the other compiler. And its version of TLS callbacks is actually dynamic, so you can register dynamic TLS callbacks. Let's say you're used to reversing a lot of MSVC binaries and you see this and go, "oh, I thought that my TLS callback should be statically defined in the file," and then you're like, "what is this dynamic thing in MinGW?", and you might think it's fishy, right?
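The "push, push, call, and then some cleanup" shape of cdecl described above can be turned into a rough triage heuristic. The following is a minimal sketch, assuming a 32-bit x86 listing that has already been reduced to plain text lines; the function name `classify_calls` and the hand-written `SAMPLE` listing are hypothetical and not taken from the talk or from any real disassembler output.

```python
import re

# Matches caller-side stack cleanup such as "add esp, 8" or "add esp, 0x10".
_CLEANUP = re.compile(r"^add\s+esp,\s*(0x[0-9a-f]+|\d+)$", re.IGNORECASE)


def classify_calls(listing):
    """Label each call in a 32-bit x86 text listing by who appears to clean the stack.

    Returns a list of (call target, pushes seen before the call, label) tuples.
    This is a heuristic sketch: it only looks at the instruction right after
    each call and counts the pushes since the previous call.
    """
    results = []
    pushes = 0
    for index, line in enumerate(listing):
        raw = line.strip()
        text = raw.lower()
        if text.startswith("push "):
            pushes += 1
        elif text.startswith("call "):
            following = listing[index + 1].strip() if index + 1 < len(listing) else ""
            if pushes and _CLEANUP.match(following):
                label = "caller cleanup (cdecl-like)"
            elif pushes:
                label = "callee cleanup (stdcall-like)"
            else:
                label = "no stack arguments seen"
            results.append((raw.split(None, 1)[1], pushes, label))
            pushes = 0
    return results


# Hand-written illustrative listing, not output from a real tool.
SAMPLE = [
    "push 2",
    "push 1",
    "call _add",
    "add esp, 8",
    "push 0",
    "call ExitProcess",
]

for target, count, label in classify_calls(SAMPLE):
    print(target, count, label)
```

A call site that breaks the pattern expected for the suspected compiler is not proof of anything on its own; it is just a place worth looking at first.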

MinGW will actually dynamically load all of its TLS callbacks when the program starts. So the process initialization and the TLS callbacks are all lumped into this gigantic stack frame, which is WinMainCRTStartup, and it just does literally everything. A cool thing about MinGW is it's all open source, so you can go look at how it does this. If you wanted to modify your own compiler, or modify its compilation, you could do that. So this is a really big false positive slash false negative opportunity. Let's say you're an AV company and you're like, "oh, dynamic TLS callbacks? No, that's bad, we should not allow that." There was a thread on Twitter three months ago that was like: compile

the smallest hello world binary, upload it to AV or VirusTotal, and see which AV will actually hit on it, like get a detection on it. And if you just compile a binary with MinGW and upload it, you can get a program that does absolutely nothing to be a false positive, because a lot of AV will look at the compiler bootstrap and go, "oh, I know, a TLS callback, definitely malware," even though it's probably just literally nothing. That's bad. This is bad. We shouldn't do this. And as an analyst, maybe you can start to infer that, "oh, it's a MinGW

compiled application, from the things I just learned, and hey, there's a dynamic TLS callback at the very beginning of the program and I'm alerting on that, so maybe I can just skip it; it's probably fine," right? So we just learned that; that's cool. So its runtime initialization does exactly what we talked about in the Windows slide, the program loading part, the process bootstrap part, just in order, but your program does that instead, all in one gigantic subroutine. So it registers TLS callbacks, sets up exception handlers, grabs argv and argc, and then calls main, all in this gigantic

run-on function, and then your main is actually called and passed argv and argc. That's pretty straightforward. And then you'll also see a dynamic load of the runtime DLLs. At the very beginning of WinMainCRTStartup you'll see, "oh, it needs to depend on all these C runtimes," so it'll just go look for those DLLs and dynamically load them, and they're not statically included in the actual binary. So if you open it up in CFF Explorer, you're not guaranteed to see those in the DLL listing, right? And then these are the things it imports: kernel32 and msvcrt, so that's the C runtime DLL, and then it also dynamically

loads libgcc, and then there's a version number, so the X-dash-X dot DLL, and then libgcj dash whatever version it's on dot DLL. If you don't have those, then your program won't work. So if you're ever wondering why an executable you have isn't running on your system, or you can't get it to actually run in a VM or something like that, you actually need to go find the appropriate DLL and drop that in for it. Those you can just grab from SourceForge, which I think is where MinGW hosts its releases. So some of the used imports are placed in the import directory, but you're also guaranteed to

see dynamic loads. Cool. And our last one is Delphi. Its logo is the DX logo, and this is the Oracle of Delphi as a painting, so that's the GIF with smoke on the right-hand side. Its method of module loading is somewhat interesting. Its compilation time is very interesting: you'll see a tag like "compiled at X time" in older Delphi binaries. Its calling convention is very weird, and it leaves a lot of artifacts in the files, like all the days of the week and months and things like that. Even if you tell it not to include those, it sometimes will

include these things, and you might be reversing a program and see these things and just be like, "why?" And then you know, "oh, it's Delphi, and I should expect that these things are in there even if it doesn't use them." Cool. So its calling convention is a variant of the fastcall calling convention, this thing unique to Delphi called Borland register, where it passes the first three arguments through EAX, EDX, and ECX, and then it pushes any of the remaining arguments onto the stack and then calls the function. So you're guaranteed to have your first couple of arguments in EAX, EDX, and ECX, and then everything else on the stack, and that's kind of interesting.
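The Borland register convention described above can be illustrated with a small model of where each argument lands. This is a minimal sketch, assuming every argument is a register-sized integer or pointer and ignoring special cases such as floating-point values or the implicit Self pointer of methods; the names `REGISTER_ORDER` and `place_arguments` are hypothetical. The register order used here (EAX, then EDX, then ECX) is the one Delphi's documentation gives for the `register` convention.

```python
# Register order for Delphi's "register" convention, first argument first.
REGISTER_ORDER = ("EAX", "EDX", "ECX")


def place_arguments(args):
    """Split a left-to-right argument list into register slots and stack pushes.

    Returns (registers, stack): a dict mapping register name to argument for
    the first three arguments, and a list of the remaining arguments in the
    order they would be pushed.
    """
    registers = dict(zip(REGISTER_ORDER, args))
    stack = list(args[len(REGISTER_ORDER):])
    return registers, stack


registers, stack = place_arguments(["a", "b", "c", "d", "e"])
print(registers)
print(stack)
```

When a function in a suspected Delphi binary reads its inputs from these three registers at entry, that matches the expectation; a function that suddenly takes everything from the stack is the kind of break in the pattern the talk suggests looking at.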

We know that ECX can be used for system string operations and things like that, so as a new guy I kind of got thrown off. I was like, "why is it doing this?" But then I had to read up on the calling convention and learn that this is a standard thing for Delphi binaries. That's the tidbit. The runtime initialization is pretty much the same as GCC's was: it's just two calls and then it jumps straight into your main, but there's a "before main" before it jumps, so that's setting up the runtime and things like this. Its TLS callbacks are both static and dynamic. It's actually interesting

because some Delphi forms applications register dynamic TLS callbacks that are used for things that you don't even write or define. I saw one where button clicks were actually associated with a TLS callback, which is very interesting, to spawn off different windows and things like that. Its module loading is really cool. If you get a Delphi binary, a little cheat to go really fast is to just scroll down, in IDA or whatever reversing program you use, to the very bottom of the code or the text section, they're both the same thing, and you can actually just see a listing of everything it uses. And if you have

all the names resolved, you can get the functions, and if you're looking at a windowed application, you can even start to get all of the button click callbacks and things like that. You can immediately start to see what is happening in the file without really having to look at any of the functions or delve down into any of the basic block stuff. All right, so, takeaway. We have all these expectations for different compilers, and I recognize that you guys probably don't have a complete view after this talk, because it's so concise, but we know that we can expect things from a given binary, and we should build up our expectations

until our expectations are broken or we see something really fishy, and we want to go look at the fishiest thing first. That's always my goal when I reverse things. We know that for faster triage, we can leverage our assumptions about the program under test to reverse it faster, because if it breaks our expectations and jumps, say, to another calling convention or something like that, we know that maybe that's shellcode, maybe that's something that I should really be digging into further, forgetting everything else. We know that if we're using dynamic analysis on a program and we see something like MSVC doing a lot of dynamic loading, that's probably fishy, so we'll look at that.
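The idea of building up per-compiler expectations and looking first at whatever breaks them can be sketched as a small lookup table. This is a minimal illustration, not the speaker's tool: the trait names, the `EXPECTED` table, and the `deviations` function are hypothetical, and the table only encodes expectations stated earlier in the talk (cdecl and static imports for MSVC, cdecl with dynamic runtime loading and dynamic TLS callbacks for MinGW, Borland register with static or dynamic TLS callbacks for Delphi).

```python
# Allowed values per trait for each compiler, as stated in the talk.
# A trait that is absent for a compiler is simply not checked.
EXPECTED = {
    "msvc": {
        "calling_convention": {"cdecl"},
        "dynamic_loading": {False},
        "dynamic_tls": {False},
    },
    "mingw": {
        "calling_convention": {"cdecl"},
        "dynamic_loading": {True},
        "dynamic_tls": {True},
    },
    "delphi": {
        "calling_convention": {"register"},
        "dynamic_tls": {True, False},
    },
}


def deviations(compiler, observed):
    """Return the sorted trait names whose observed value breaks the expectation.

    `observed` maps trait name to the value an analyst (or a tool) saw in the
    sample; traits that were not observed are skipped rather than flagged.
    """
    expected = EXPECTED[compiler]
    return sorted(
        trait
        for trait, allowed in expected.items()
        if trait in observed and observed[trait] not in allowed
    )


# An MSVC-looking sample that loads modules dynamically stands out...
print(deviations("msvc", {"calling_convention": "cdecl", "dynamic_loading": True}))
# ...while the same behavior in a MinGW-looking sample does not.
print(deviations("mingw", {"dynamic_loading": True, "dynamic_tls": True}))
```

An empty result does not mean a sample is benign; it only means nothing contradicts the compiler guess, so triage time is better spent elsewhere in the binary.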

Whereas we know that if we see some events from, say, MinGW, and that's dynamic loading and they're just loading the DLLs, that's probably OK; we don't have to look at that. And other things like that: TLS callbacks could be dynamic or static based on the compiler. As a data scientist, there's, like I mentioned, the FP or FN opportunity in MinGW from forgetting about the dynamic loading at the beginning and the dynamic TLS callback registration. You can actually start to forget about some things in certain sections, and the placement of certain calls matters. So this is an output from the tool that I'm releasing after this talk. Essentially, on the right, this is a hello world program compared to a

program that does a lot more: it writes a file, it makes a net request. You'll see that all the code related to the compiler is placed at the very top, like the top twenty by sixteen bytes of the program. So that's the very top, and they're essentially the same. We know that if we're looking at two programs that are compiled with the same compiler, the top part of the program should essentially be the same unless they're doing something interesting, and if there's something interesting, we should go look at it. So we see that it's generally the same at the top and then very different at the bottom. If we saw, for instance, a large black segment

in there, larger than these smaller black segments in that image, if we saw a very large chunk of black, that means those two programs are very different, and we should go look at that, because maybe somebody is hiding something in the compiler or the compiler bootstrap and it's a crafted binary, or maybe they have some kind of modified compiler. So I'd alert on that. So you can increase accuracy by accounting for the fundamental properties of binaries given their compiler. So, many references to function offsets in a Delphi binary is more benign, whereas the presence of TLS callback orchestration code in MinGW is fine. Again, that's just different

compilers and why they matter. So my tool is open source; not now, but as soon as I get to my laptop and click the "make public" option on GitHub, it'll be released. It generates these different charts, and it also gives you the similarities, so essentially how similar a given program is throughout the file, as a visual readout. I also specify some additional reading; if you think this stuff is cool, there are a lot of research papers to read on it, and you can get into the higher-level stuff that's being researched currently. That is the end of my talk. Thank you guys for coming out. First time speaking, so I appreciate it. [Applause] And I'll take questions on anything, 'cause

I'm sure that I missed things. Any questions? All right, oh, we got one. Sure. "Is there a way to make any of the compilers appear to behave like one of the others, so you could sort of spoof the Visual Studio one versus the MinGW?" Yeah, definitely. So I mentioned making a crafted binary. If you wanted to just get around some kind of automated analysis, you could probably start to put certain strings from a Visual Studio compiler into, say, a Delphi binary. It's analogous to spam evasion, so you could start to put a

lot of good words and things like that in a binary, in this case good bytes and good sections. Or you could statically define imports in things that don't statically define imports, or you could dynamically load things. Let's say someone wrote a detection to call something a MinGW binary based on it loading the GCC runtime or all those DLLs. You could then dynamically load those in Visual Studio, and maybe you'd get some kind of evasion like that. That'd be something to look at. You definitely can, and I think that's doable. There are actually some rules online, like YARA rules, that will look for a compilation date of like 1970,

like epoch time, to actually detect Delphi binaries. So maybe if you put that actual tag in your header, you could get something like that. Definitely possible. If you think it's cool, maybe research it and give a talk, for sure. Any other questions? No? Going once, going twice. All right, cool. Thank you guys, I really appreciate it. [Applause]
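The compilation-date check mentioned in the Q&A can be sketched by reading the timestamp field from a PE header. This is a minimal illustration under stated assumptions, not a reproduction of any published YARA rule: the function names `pe_timestamp`, `has_epoch_year_timestamp`, and `fake_pe` are hypothetical, the "year is 1970" test is just one way to read "a compilation date of like 1970", and `fake_pe` builds a tiny synthetic header for demonstration rather than a loadable executable.

```python
import struct
from datetime import datetime, timezone


def pe_timestamp(data):
    """Return the COFF TimeDateStamp (seconds since the Unix epoch) of a PE image."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)  # e_lfanew field
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # After the 4-byte signature: Machine (2 bytes), NumberOfSections (2 bytes),
    # then TimeDateStamp (4 bytes), so the stamp sits at pe_offset + 8.
    (stamp,) = struct.unpack_from("<I", data, pe_offset + 8)
    return stamp


def has_epoch_year_timestamp(data):
    """True if the header's compile time falls in 1970 (UTC)."""
    stamp = pe_timestamp(data)
    return datetime.fromtimestamp(stamp, tz=timezone.utc).year == 1970


def fake_pe(stamp):
    """Build a minimal synthetic header with the given timestamp (demo only)."""
    image = bytearray(0x100)
    image[0:2] = b"MZ"
    struct.pack_into("<I", image, 0x3C, 0x80)        # e_lfanew points at 0x80
    image[0x80:0x84] = b"PE\x00\x00"
    struct.pack_into("<HHI", image, 0x84, 0x14C, 1, stamp)  # i386, 1 section
    return bytes(image)


print(has_epoch_year_timestamp(fake_pe(0)))
print(has_epoch_year_timestamp(fake_pe(1533686400)))
```

As the answer above points out, a field like this is easy for an author to set deliberately, so it is better treated as one weak signal among several than as proof of which compiler produced a sample.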