
Thank you. First, I apologize for my Frenglish, because I come from Paris. My name is Sebastian and my handle on Twitter is sebdraven. Whether you like the presentation or not, you can write to me by email or directly on Twitter. I am a malware analyst and reverse engineer, I work in forensics, and I participate in many projects, like the Honeynet Project, and in many open source things like Yeti. Yeti is a threat intelligence platform to collect and share curated information on IOCs, entities, campaigns and so on. Other work is directly on my GitHub, so you can share it, and you can make pull requests on my projects or on
the Yeti project; you are welcome to participate in my open source projects. This morning I will talk about malware, and specifically Windows malware, because I am not a specialist in Linux malware; it is a bit magical for me. On Windows, though, I have reversed a great deal of malware. Here we will talk about the PE format, the binary format on Windows. This format has different characteristics that are very useful when you reverse and when you try to do machine learning; I say "try" on purpose. We will speak about clustering, and I will define the difference: clustering and classification are not the same approach and not the
same thing. So I will define the difference between clustering and classification and the algorithms behind each of them. At the end of the talk I will show you how, after machine learning, to create YARA rules for hunting, to discover other malware from the families you find with machine learning. First, a short definition of the PE format, the executable format on Windows, to understand how to build feature vectors for machine learning afterwards. The PE format is described in this image and by Microsoft, and it has different layers. The first layer is the best known: the MZ format. Each file starts with the magic number MZ, which you see if you make
a hex dump of the file. After that you have the DOS segment, which is there for compatibility with old versions of Windows, then the PE header, and after that the section table. The section table describes the different sections of your executable, and in the sections you have the code, the data, the import tables: all the information needed to execute on the computer when you click on the executable, or when it is mapped directly in memory, because on Windows you can execute an executable file directly if it is mapped in memory, without clicking on it. In the headers you have different interesting information. You have the size of the headers, and you have principally two headers, the
optional header and the file header. You have the number of sections. Typically you have three, four or five sections on Windows; if you have fewer or more it is a bit strange: it can be a packer, or an obfuscation method to hide code. You have the address of the entry point, which is the first instruction executed when you launch the exe. In the headers you also have the information on whether your file is a dynamic library, a driver or an executable file, because the PE format is used for all three kinds on Windows. The difference is the entry: in drivers and executable files you have a main method that starts the code; in a driver it is like
that, and in a DLL it is DllMain: the first piece of code that runs when it starts. So it is the same format, but not the same characteristics. For a library you have the DLL flag, and a DLL has exported functions to be used by another program, like a shared library on Linux or macOS. You have the compilation timestamp and information about the architecture, whether your exe can be executed on x86 or x64, the linker information, and a lot of information about execution: the size of the page, the section alignment in memory. So the headers hold this kind of information.
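As a rough illustration of the header fields described so far, the following Python sketch checks the MZ magic number, follows the pointer stored at offset 0x3C to the PE signature, and unpacks the 20-byte file header to get the machine type, the number of sections, the timestamp and the DLL flag. The names `parse_pe_headers` and `build_sample` and the synthetic bytes are assumptions made for this example and do not come from the talk; a real tool would read an actual file from disk.

```python
import struct

IMAGE_FILE_DLL = 0x2000  # Characteristics bit that marks a DLL


def parse_pe_headers(data):
    """Read a few fields from the DOS header and the PE file header."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ magic number")
    # e_lfanew (offset 0x3C in the DOS header) points to the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # The 20-byte file header follows the 4-byte signature.
    (machine, n_sections, timestamp, _symtab, _nsyms,
     _opt_size, characteristics) = struct.unpack_from(
        "<HHIIIHH", data, pe_offset + 4)
    return {
        "machine": machine,                 # 0x014C = x86, 0x8664 = x64
        "number_of_sections": n_sections,
        "timestamp": timestamp,             # compilation time, Unix epoch
        "is_dll": bool(characteristics & IMAGE_FILE_DLL),
    }


def build_sample(n_sections=4, characteristics=0x2102):
    """Build a minimal synthetic header (hypothetical data, not a valid exe)."""
    dos = bytearray(0x40)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 0x40)  # PE signature right after DOS header
    file_header = struct.pack(
        "<HHIIIHH", 0x014C, n_sections, 0, 0, 0, 0xE0, characteristics)
    return bytes(dos) + b"PE\x00\x00" + file_header


info = parse_pe_headers(build_sample())
print(info)
```

The entry point address and the alignment values live in the optional header, which this sketch does not parse.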
After these headers you have the data directories. For each data directory you have two pieces of information: the size of the table and the virtual address where the table is stored in the exe. The order of the data directories is always the same: it starts at 0 and finishes at 15. So you have the export table and the import table. The export table matters if you develop a DLL: it lists your functions that another program can use. The import table lists the functions the exe uses to execute correctly on your computer, for example CreateFile or OpenProcess; all of these functions are stored in the import table. You have the resource table. The resource table holds the strings of your software, the icons, the images, whatever is
different from the code itself. Then there is the exception table and the certificate table: if you are a vendor and you want to sign your software, the certificate is stored in this table. You have the relocation table, with the addresses used by the exe that must be fixed up, because the final addresses depend on where your operating system loads it. You have the debug information: the path of the PDB is in this table. Then the copyright entry, the TLS table, the load configuration table and the IAT, the address table of the imports, which matters if your system has ASLR. ASLR is a mechanism to randomize the addresses
when your operating system starts, to mitigate exploits and return-oriented programming used to execute shellcode. Now the sections. A section is a part of the exe that is mapped in memory, so it starts on disk and ends up in memory. It carries a lot of interesting information, like its size and its characteristics: whether the section is readable, executable or writable. For example, when malware uses a RunPE mechanism, it creates a memory page with all three permissions: read, execute and write. Why? Because when the mechanism starts, it pushes shellcode or directly an exe into the memory page, so it must have the write right to write the shellcode or the exe correctly in memory, and
the read right to jump to it directly, and the execute right because it is a piece of code, shellcode or exe. Another interesting thing is the name of the section, because many commercial packers put their own names on sections, like UPX, ASPack or VMProtect. So it is interesting to have the names of the different sections and the addresses where they are mapped in memory. Another piece of information tied to the section is its entropy. The entropy of a section is the quantity of information in the section, and different tools calculate it on PE files. The entropy is between 0 and 8. Near 0, you have many repeated bytes, like NOP NOP NOP or 00 00 00.
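The 0 to 8 scale is what you get from Shannon entropy computed over byte values, measured in bits per byte. The sketch below is one minimal way to compute it in Python; the function name `shannon_entropy` and the two test buffers are assumptions for illustration, and in practice you would pass in the raw bytes of a section or an overlay.

```python
import math
from collections import Counter


def shannon_entropy(data):
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # how many times each byte value occurs
    # Sum of p * log2(1/p) over the byte values that actually occur.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


padding = b"\x00" * 1024          # one repeated byte: no information
uniform = bytes(range(256)) * 4   # every byte value equally frequent

print(shannon_entropy(padding))
print(shannon_entropy(uniform))
```

A buffer made of a single repeated byte scores 0, and a buffer in which all 256 byte values are equally frequent reaches the maximum of 8.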
Near 8, your section is most likely encrypted, because the information in the section is totally random. So with entropy we can build a histogram over the whole file to detect the variations of entropy, or compute it directly on a section: the code section, the data section or the overlay. If you have an overlay with an entropy near 8, your overlay is encrypted with a symmetric algorithm like AES or RC4. When you reverse, this is very interesting information to start with. If you have an entropy of 6 or 7, it is probably an obfuscation method like XOR, because XOR is not really an encryption method; it is just a bit flip between a key and the
data. But if the entropy is above 7, the data is probably encrypted with RC4 or AES. Ange Albertini, known as corkami, made a good poster that details all the information in the PE format, so you can find the DOS header and the section headers there. It is very interesting if you want to develop a parser or to understand the PE format. You can find his different posters on corkami.com; they are very useful for reversers and for anyone interested in file formats. Now a few words about machine learning algorithms. Machine learning in general is a hot topic: many people speak about machine learning to automatically detect
zero days or new malware. It is used by many companies, like CrowdStrike and Endgame. But for many vendors, machine learning is used for detection; it is not used to cluster or classify malware, because it is very difficult to have a good algorithm that classifies a data set without knowing the data in it. If you take the EMBER data set from Endgame, for example, it is built for detection, not for classification by family. Likewise, the result you see for such engines on VirusTotal is a confidence result: it is not the name of a malware family, it says the file is probably malware, at 90 percent. So it is a score of whether
the software can be trusted or not. In machine learning we distinguish between clustering and classification. With clustering, you have a data set without knowing the data, and you try to build different clusters, where each cluster is one family. It is very useful in biology to classify flowers; in the beginning these algorithms were used to classify flowers, not malware, so the features used are not the same, and we do not have the same problems with flowers as with malware. With classification, you have a data set where all the data is labelled: you have a label on each item. Take the EMBER data set; I speak about the EMBER data set because it is an open source data
set, so you can clone it, use it and play with it. I prefer to speak about open source data sets because it is simpler to test your own algorithm on them. There you have three labels: label 0 for goodware, label 1 for malware, and label -1 when it is neither known malware nor known goodware. The algorithm learns from these labels and these characteristics. So if you submit a new element, the result of the classification is that the new element gets label 0, label 1, or -1 if the algorithm cannot classify the file correctly. So that is classification. The problem with
classification, when you play with malware, starts with the packer, because statically the packer changes the properties of the binary. So the first thing to know when you do classification or clustering is that you cluster packers; you have to live with the idea that you are classifying packers. The different APT samples did not use packers, so there you correctly classify the real binary file and not the packer. If you want to classify banking trojans or ransomware and the like, it is a bit more complex, because this malware uses commercial packers such as UPX, sometimes patched. In the exe you have different
methods to download, but that is not the real payload. The payload is elsewhere in the exe or on another server; the packer is not the real payload, it only executes the real payload. Another problem is that different people develop exes with the same particularities as packers, or the same particularities as malware, but it is not malware, it is not a banker, it is just a joke. We have a famous example on VirusTotal named VT flooder. The only function of VT flooder, when it is executed by the operating system, is to upload itself to VirusTotal with a new hash. So if you monitor this VT flooder with rules, you get 5,000 unique files in a day. If you
take the EMBER data set, 10 percent of the data set is VT flooder, but it is not malware, just a piece of code to annoy analysts. So when you start classification or clustering, you must know what you have in the data set. Another interesting tool is AVClass. AVClass does not really do clustering; it does classification based on VirusTotal reports and creates families from those reports. The only problem is that if the family is unknown, it classifies the file as a singleton, so you must create your own family for the new
malware. Classic classification is very good for detection: if you run the EMBER model on the EMBER data set, you get a detection rate of 98 or 99 percent, which is good. But if you try to do clustering with this data set, it is very painful, because the data set was not made for clustering; it was made for classification. And clustering is better for hunting, because when you hunt you do not know which malware you are looking for. You create different rules, and you can find new families with an old YARA rule, for example, as Christine explained before me. So clustering is
better for hunting. The algorithms used for classification are supervised algorithms, because you specify the different labels and the different families of the data set as input. For clustering we speak about unsupervised algorithms, because we do not know the labels in the data set. The input in machine learning is a feature vector. A feature vector is an array that describes properties of the file: for an exe you have the size, the import table, the number of sections. So you create an array with this information, and you create a list with the arrays for all files. You then have a matrix whose shape is the number of files by the number of features.
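To make the idea of a feature matrix concrete, here is a small Python sketch that turns per-file properties into vectors and stacks them, one row per file. The three feature names mirror the properties listed above, but the dictionary layout, the name `to_vector` and every sample value are invented for the example.

```python
# Feature order is fixed so that every vector has the same layout.
# The names mirror the properties mentioned in the talk; values are invented.
FEATURES = ("file_size", "n_imports", "n_sections")


def to_vector(sample):
    """Turn one file's properties into a list of floats."""
    return [float(sample[name]) for name in FEATURES]


samples = [
    {"file_size": 73728, "n_imports": 112, "n_sections": 4},
    {"file_size": 81920, "n_imports": 120, "n_sections": 4},
    {"file_size": 1048576, "n_imports": 3, "n_sections": 9},
]

# One row per file, one column per feature.
matrix = [to_vector(s) for s in samples]
shape = (len(matrix), len(matrix[0]))
print(shape)
```

With three files and three features the shape is 3 by 3; scikit-learn expects the same layout, with samples as rows and features as columns.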
So if you have 100 files and 5 features, you have a matrix of shape 100 by 5. And we use this concept: if you have a piece of malware, you transform it into a feature vector and you compute a distance between the vectors. If the vectors are very near, you have a similarity, and if you have a similarity, the malware samples are close. It is very simple. We use the Euclidean distance, which is the distance we use in everyday life, when you travel to Belfast for example: in France it is in kilometres, here it is in miles.
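The Euclidean distance between two feature vectors is the square root of the sum of squared differences, coordinate by coordinate. A minimal Python sketch follows; the function name `euclidean` and the three toy vectors are assumptions for illustration only.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of features")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy vectors: (number of sections, number of imports), invented values.
a = [4.0, 112.0]
b = [4.0, 120.0]
c = [9.0, 3.0]

print(euclidean(a, b))  # small distance: similar samples
print(euclidean(a, c))  # large distance: dissimilar samples
```

Here `a` and `b` differ only by eight imports, so they are near each other, while `c` is far from both.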
To understand the concept, we mixed a labelled data set with unsupervised algorithms. The goal is to verify whether the featuring is good or not, because the key to machine learning is good featuring that describes the properties of your files. Now malware clustering. The goal of clustering is to create rules for hunting and to minimize false positives; I use it for hunting different malware campaigns. The first approach to clustering is ssdeep. Everybody knows ssdeep hashes? Okay. The big problem with ssdeep is the chunks: if the chunk sizes of two hashes are not compatible, the comparison returns zero, and as a result
your malware samples do not match. The problem is this: if you have a section with a lot of random data, and this section is not used during execution, you get two different hashes with different chunks, although it is the same piece of code. So many researchers compute ssdeep hashes on the code section, because the code section normally does not change. Another concept is peHash, which uses Kolmogorov complexity. The problem with peHash is that you do not compute a real similarity or a distance between files: you have a hash, one hash per family, but you cannot compute a distance between families. Another concept is imphash and
impfuzzy. They are very useful for hunting, but they are not enough if, for example, there is one more function in the import table. Other approaches are Polichombr and its Machoc hash. The problem with all these algorithms is the computational complexity, because you compare the malware samples two by two. So if you have one hundred samples, you have 10,000 comparisons: the complexity is quadratic. For scaling, if you have a big data set, that is a real problem. So here we use DBSCAN and KMeans, two algorithms you can use with scikit-learn, which is Python.
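The quadratic growth is easy to check. Comparing every sample with every sample gives n × n comparisons, which is 10,000 for one hundred samples; counting each unordered pair only once gives n(n-1)/2, which is 4,950, and still grows with the square of n. The short Python sketch below counts both; the name `count_pairs` is an assumption for the example.

```python
from itertools import combinations


def count_pairs(n):
    """Number of unordered pairs among n samples: n * (n - 1) / 2."""
    return n * (n - 1) // 2


n = 100
all_against_all = n * n                                       # every sample vs every sample
unique_pairs = sum(1 for _ in combinations(range(n), 2))      # each pair counted once

print(all_against_all, unique_pairs)

# Growing the data set by 10x grows the work by roughly 100x.
print(count_pairs(1000) / count_pairs(100))
```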
It is very simple to use. For the data set we use theZoo, an open source data set of malware binaries from different families. The goal is to construct the best feature vector and, after the clustering, to create YARA rules on each new cluster. The rest is mathematics, so it is not very interesting. With KMeans, you choose the number of clusters at the beginning. So you test: you try to find a first grouping, and if you do not get good results, you try again. The algorithm chooses the centroids; a centroid is the centre of a cluster. KMeans creates groups, and in each group you have a centre
named the centroid. Once the algorithm has chosen the centroids using the inertia criterion, it computes all the distances to these centroids, and if a vector is nearer to one centroid than to the others, it says: OK, I put this file in this family. So you can create different families. The problem is the border, because on the border it is not simple to get a good classification. DBSCAN is another algorithm, which uses density to create balls that form the clusters. In black you have the noise: the noise is the files that the algorithm cannot classify correctly.
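The assign-to-nearest-centroid loop can be sketched in a few lines of plain Python. This is a hand-written illustration of the k-means idea, not scikit-learn's `KMeans` implementation that the talk relies on: the initial centroids are passed in by hand instead of being chosen by the library, the function names `nearest` and `kmeans` are assumptions, and the points are invented. DBSCAN, which works on density rather than centroids, is not covered by this sketch.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, initial_centroids, iterations=10):
    """Alternate between assigning points and moving centroids."""
    centroids = [list(c) for c in initial_centroids]
    for _ in range(iterations):
        # Assignment step: each vector joins the family of its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for i in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    labels = [nearest(p, centroids) for p in points]
    return labels, centroids


# Two obvious groups of invented 2-D feature vectors.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = kmeans(points, [[0, 0], [10, 10]])
print(labels)
print(centroids)
```

Points that lie halfway between two centroids are the border cases mentioned above: a small change in the centroids can flip their family.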
It is very interesting to know whether you have noise or not. In Python with scikit-learn it is very simple to use: you just choose a metric, the number of jobs and the parameters. You try parameters like a chef who is cooking: you choose your parameters to get the best classification, and the same goes for the featuring. Here we have a very simple featuring: the sections, imports, exports and the size of the file. So we create an array with the different features: median entropy, number of sections, size of the file, number of imports, number of exports. That is all. For KMeans, we get many families. The first family is family
zero, and there you have one big cluster with Equation Group droppers, but you also have different ransomware in it, so it is mixed. Still, for a first approach it is a good clustering. For the other families, with fewer than ten samples, you have many families with one, two or three malware samples, because the data set is very heterogeneous. Here is one family, and here you have two big families with Equation Group. So the first featuring gives a good classification, but it is not really a good representation of
the data set. The problem is that we did not normalize the vectors. The norm is the length of your vector. To normalize, you can divide each value by the maximum of that feature, for example the file size divided by the maximum file size, and then all values are between 0 and 1. If instead you divide by the norm, the values are between -1 and 1: it is a ball, and all the values are inside the ball. After normalizing we have a better classification. We find a new family, and this cluster contains just one malware family. Here you still have one big family, but it mixes different malware.
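Both normalizations are short to write down. The Python sketch below is a minimal version under the assumption that all features are non-negative counts or sizes; the function names `scale_by_max` and `scale_by_norm` and the toy matrix are invented for the example.

```python
import math


def scale_by_max(matrix):
    """Divide each column by its maximum so values fall between 0 and 1.

    Assumes non-negative features; an all-zero column is left unchanged.
    """
    maxima = [max(col) or 1.0 for col in zip(*matrix)]
    return [[v / m for v, m in zip(row, maxima)] for row in matrix]


def scale_by_norm(vector):
    """Divide a vector by its Euclidean norm so it has length 1."""
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]


# Invented rows: (file size in bytes, number of sections).
matrix = [[100, 2], [400, 4]]

print(scale_by_max(matrix))
print(scale_by_norm([3, 4]))
```

Without this step the file size, which is in the tens of thousands or more, dominates the Euclidean distance, and small features such as the number of sections barely count.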
That is because these samples use the same packer. With DBSCAN the result is better: you get different clusters that each contain only one malware family. You have Equation Group, a variety of others, and Equation Group again, and the different Equation Group clusters are different versions of the droppers. So now we have a good classification of the different families, just by normalizing simple vectors so that the values are between 0 and 1. On this clustering we ran a YARA rule generator on family 0 to create a rule, and with VirusTotal hunting we found new Equation Group
malware. This week we found five new files on VirusTotal from the Equation Group malware with the YARA rule generated from my clustering. And the best part is that we have no false positives: when I run a retrohunt on VirusTotal, I have no false positives. So my YARA rule is very good at detecting this family of droppers. Conclusion: machine learning is not magic. The best way is to do good featuring, you must know and normalize your data before you do clustering, and you make many, many tests before you get good results with machine learning. So thanks for your attention, if you have a question. [Applause] Are there any questions now in the room?
[Applause] No questions for now. Maybe you can take them outside: if somebody has a question, just find him. He has an awesome hat on, so he is easy to spot. Thank you again, thanks a lot. [Applause]