BSidesSF 2026 - Demystifying File Similarity for Malware Detection (Udbhav Prasad)

Name: BSidesSF 2026 - Demystifying File Similarity for Malware Detection (Udbhav Prasad)
Uploaded: 2026-05-12
Duration: 40 min 20 s

Mentioned in this talk

Tools used

ssdeep TLSH

Service

Stairwell VirusTotal

Frameworks

Cobalt Strike XGBoost

About this talk

Demystifying File Similarity for Malware Detection, by Udbhav Prasad. File similarity techniques help identify polymorphic threats that traditional exact-matching approaches struggle with. This talk demonstrates the performance of different similarity algorithms (TLSH, ssdeep, XGBoost) against practical examples with real modification techniques and malware families. https://bsidessf2026.sched.com/event/90f66b83d96745424671f91d5c1cc672

Transcript

We've got Udbhav with "Demystifying File Similarity for Malware Detection." Drop your questions into Slido. If you don't know the Slido, go to bsidessf.org/q&a; there's a code there. Over to you. >> Thank you. Am I audible? Awesome. Thanks for being here. This is one of the last sessions of the day, and it gets into the weeds of file similarity algorithms, so I hope I won't put you to sleep. Feel free to raise your hand at any point, even though there's a Slido; we'll answer those questions then. But if you have any questions, raise your hand and I'll try to answer them as soon as I can. All right. So we're

going to talk about file similarity today: the history of it, how it got started, the more naive algorithms for doing it, and how it's done today in modern enterprises. So who am I? I'm Udbhav. My day job is actually building distributed systems. I love building software systems, especially those that scale past a single node. I got interested in security purely because of the companies I worked for. In the past I've worked for Rubrik, a data resilience and security company. I've also worked for a security startup called Stairwell. You can think of Stairwell as sort of similar to VirusTotal, but it

also stores all the files. So essentially my introduction to security was through large-scale systems, and I've found that's what I enjoy now: the intersection of security and large-scale distributed data systems. I'm currently at Databricks. I'm building distributed systems there, but Databricks is also building a lot of security products now, so hopefully I can get into that too. All right. This is going to be the broad agenda. We'll talk about why file similarity matters, and then we'll break it down into two broad categories: feature-based similarity, which will cover things like edit distance and fuzzy hashing, and learning-based similarity. We're not going to get

into modern language-model AI sorts of things; this is more classic machine learning with XGBoost and deep learning, something that you can apply. I think these days you can just run a notebook on Colab or whatever and create any of these models, so it's very practical, and hopefully you can take something away from this talk. Then we'll talk about experimental results. A lot of these experiments I've run as a hobby, and I'll walk through the experiments that we've done. We have some very interesting results, and we're trying to publish some work based on them too. And then we'll go over conclusions. So why does file similarity matter? So

everybody's familiar with cryptographic hashes: there's MD5, SHA-256, etc. They're great as an identity for a file. If you run a cryptographic hash over a file, compare it with another file's hash, and the hashes match, you know the files are exactly equal. But similarity is a much harder problem. What we're trying to ask is: if I make a small modification to a file and I compare those two files, can I say whether they're similar or not? And more importantly, can I get a measure of the degree of similarity? Even four or five years back, when I started getting into security, it was very easy to generate

small variations of software. Tools like Cobalt Strike let you do that very easily; even reverse-engineering software lets you do that very easily. But with language models this is becoming even easier, especially if you have the source code. It's so easy to make tiny changes to software that it becomes really hard to detect variations. And that's not even getting started on supply chains and build systems, where different versions of software are generated multiple times an hour and hundreds of times a day. So hopefully that motivates the need for file similarity. If you're trying to detect malware in a huge enterprise network, you're going to

have to do it at scale. And malware analysis these days isn't just binary classification, right? We're looking at family attribution (what family does it belong to?) so that you can look at behaviors, and also at threat hunting and campaign tracking. We're not going to talk about threat hunting and campaign tracking in this talk, but I will cover family attribution and how this helps there.

So again, to motivate the scale of the challenge: at least at Stairwell, we were looking at tens of thousands of endpoints with over a billion unique files. When you get a new file and you're trying to see if there's a similar file in a corpus of over a billion files, it becomes a needle-in-a-haystack problem. So we'll start with signature-based detection before I even go into feature-based and other things. This is the naive approach, where you have a list of hashes and you create a database using that list. Or let's say you do static analysis to get the list of IPs that the binary is

talking to, or you run it through CAPE or some other dynamic analysis tool to see what IP addresses it's talking to, what registry keys it's changing, what actions it's taking on the file system, and so on. You put all of that into a database, and when a new file comes in, you run the same analysis over the new file and then do a bunch of database lookups, maybe put it in a search index, and so on. So this is classic signature-based detection, and if you take similarity hashes away, this is kind of what VirusTotal does, what [unclear] does, and so on. YARA rules give you a

more sophisticated way of doing this. You can do byte-pattern matching with YARA rules, but it comes with a high computational cost at scale, and YARA rules are hard to author. As threats evolve, it gets very hard to keep them maintained. There are some very popular open rule sets, like Florian's rules and so on, but if you take them and just apply them to your enterprise, they tend to be noisy. So this again motivates the need for extracting features out of files and building similarity algorithms based on those features. We'll start with the absolute simplest thing, which is: you take two bit streams and you take the Hamming distance

between them, which is just the number of positions where two byte or bit streams differ. In this example there are three positions where the bits differ, so it's three. You can also do it at, let's say, the byte boundary or something else if that makes sense. Now, if this sounds too simple to be applicable in real life, you're probably right. You can't just compute the Hamming distance on two different files and expect it to give you great results. But we'll see that Hamming distance is useful in more sophisticated systems. Once you do a bunch of feature processing, you can use the Hamming distance to get a pretty good similarity

metric, and we'll talk about that later. The next, more sophisticated approach is the edit distance, or Levenshtein distance. Here we're trying to see how many edits it takes to turn file one into file two, so it's the minimum number of edits. I think that picture is very illustrative of where Levenshtein distance is better than Hamming distance. You have two DNA sequences that are just offset by one. Because the Hamming distance does a position-by-position comparison, it says the distance is 10. But the edit distance is a bit smarter about it: it says you just delete C and add A, and the distance becomes two.
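
The two distances just described can be sketched in a few lines of Python. This is only an illustration, not code from the talk: the function names are made up here, and the two sequences are hypothetical stand-ins chosen to reproduce the numbers quoted from the slide (offset by one, Hamming distance 10, edit distance 2).

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-symbol insertions, deletions and
    substitutions needed to turn `a` into `b` (two-row dynamic programming)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep on a match
            ))
        prev = curr
    return prev[-1]


# Hypothetical sequences: SEQ_B is SEQ_A with the leading "C" deleted
# and an "A" appended, so every aligned position differs.
SEQ_A = "CGTACGTACG"
SEQ_B = "GTACGTACGA"

hamming_result = hamming(SEQ_A, SEQ_B)      # 10
edit_result = levenshtein(SEQ_A, SEQ_B)     # 2
```

The position-by-position comparison is thrown off completely by a one-symbol shift, while the edit distance only counts the one deletion and the one insertion.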

So how does this scale? Unfortunately, for edit distance (going back to Levenshtein distance), I think there's a dynamic programming algorithm which does it in linear time, but it's still linear time in the input. So if you think about scale, if you have a billion files and you get a new file, you're going to have to read through every byte of all the billion files. So it's going to scale pretty poorly. That's where the idea of taking features out of files comes in. Instead of looking at every byte, you pre-process all the files into bags of tokens. Tokens are usually created by some sort of feature engineering, right? You can take n-grams, you can

just take imports. Jaccard similarity is probably the simplest way to compare two bags of tokens: you take the size of the intersection of the two sets of tokens and divide it by the size of the union. The big advantage here is that the search space is smaller. If you have a billion files, you're not comparing all the bytes; you're just comparing a billion times the number of features per file. This is what is used in MinHash. MinHash goes a step further: it optimizes the intersection and union by computing an approximate Jaccard similarity. Here's a visual example, which helps me at least. If you want to take the similarity of

files C1 and C2, and file C1 is broken into these three tokens and file C2 into the other three tokens, you can see that the Jaccard similarity is 0.5, because there are two common elements and four total unique elements. Okay, so what are the limitations of edit distance? It's already kind of clear: it scales pretty poorly. Even with Jaccard similarity, if the number of features is proportional to the size of the file, it's still going to scale pretty poorly. And there isn't an obvious clustering algorithm you can apply to any of these. And it's sensitive to alignment. If you move things around,

like I said, it's very easy to change files now. If you have a big code base and you just move a few functions around, it can break your similarity, because all of these depend on order (except Jaccard similarity, which I think is set-based, but the edit-distance similarities do). It also lacks locality awareness, which means you can't pinpoint which bits of the file are similar. It just gives you a number that says how similar it is. That brings us to fuzzy similarity. The motivation here is that we want small changes in the file to result in small changes in the

hash of the file. And we want block- or feature-based similarity. Here we're extending the idea of Jaccard similarity to be smarter about how we extract features. We want similarity comparisons to be faster, and we want to be able to cluster these files based on the fuzzy hashes as well. The broad approach that we'll discuss here is chunk-based hashing. You take a file and break it up into chunks, or segments as in the diagram. You take each of the segments, you hash it, and then you concatenate the hashes in such a way that when you compare the hashes of two files, if the files are

similar, the distance between the hashes says that they're similar. We're going to discuss three different fuzzy hashing approaches here: ssdeep, sdhash, and TLSH. All of them follow the same workflow, but they differ in each step: how the segmentation is done, how the hashing is done, and how the hashes are concatenated together. And again, I don't want to make this too boring, so I'm not going to go into the gory details. But if you take nothing away from this except that these algorithms exist and you can Google them or ask ChatGPT, that's good enough.
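
As a rough illustration of the last two ideas, Jaccard similarity over token sets and the chunk, hash, concatenate workflow, here is a minimal Python sketch. Everything in it is an assumption made for the example: the token names are hypothetical, and `toy_chunk_digest` uses fixed-size chunks and SHA-256, which is not what ssdeep, sdhash, or TLSH do. ssdeep, for instance, places chunk boundaries with a rolling hash precisely because fixed-size chunks all shift when a single byte is inserted.

```python
import hashlib

# 64 symbols, so each chunk contributes one character (6 bits) to the digest.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def jaccard(tokens_a: set, tokens_b: set) -> float:
    """Size of the intersection divided by size of the union.
    Two empty sets are treated as identical."""
    union = tokens_a | tokens_b
    if not union:
        return 1.0
    return len(tokens_a & tokens_b) / len(union)


def toy_chunk_digest(data: bytes, chunk_size: int = 64) -> str:
    """Hash each fixed-size chunk and keep 6 bits of each hash."""
    chars = []
    for start in range(0, len(data), chunk_size):
        chunk_hash = hashlib.sha256(data[start:start + chunk_size]).digest()
        chars.append(ALPHABET[chunk_hash[0] & 0x3F])
    return "".join(chars)


# Hypothetical token bags: two shared tokens, four unique tokens overall.
c1 = {"CreateFileA", "WriteFile", "connect"}
c2 = {"WriteFile", "connect", "RegSetValueA"}
similarity = jaccard(c1, c2)  # 2 / 4 = 0.5

original = bytes(range(256)) * 4   # 1024 bytes, i.e. 16 chunks of 64 bytes
modified = bytearray(original)
modified[100] ^= 0xFF              # change one byte inside chunk 1
digest_a = toy_chunk_digest(original)
digest_b = toy_chunk_digest(bytes(modified))

# Only the character for chunk 1 can differ between the two digests
# (it may even stay the same if the kept 6 bits happen to collide).
changed = [i for i, (x, y) in enumerate(zip(digest_a, digest_b)) if x != y]
```

An in-place edit touches one chunk and therefore at most one character of the digest, which is the "small change in the file, small change in the hash" property the fuzzy hashes aim for.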

Okay. So let's start with ssdeep, and I would say keep this workflow in mind, because it's going to be repeated over and over. ssdeep takes a sliding window over the byte sequence, applies a rolling hash on the window, and uses a sophisticated approach to decide where to break the rolling hash. At the end of steps one and two, you get a sequence of hashes. It then takes, I think, the six lowest bits of each hash and concatenates them, and that's how it gets the fuzzy hash. Similarity is computed by taking the Levenshtein distance between the two fuzzy hashes, which gives you a score of

zero, which is no match, up to 100, which is very similar. And here again, as I mentioned earlier, Hamming distance and Levenshtein distance seem too simple to be useful, but once you take features out of a file, you can actually use them to do faster comparisons. So what are the limitations? ssdeep is still doing an ordered concatenation of the hashes of the chunks, so if there is reordering, it's going to cause problems. If there's compression, or an attacker adds padding or changes the encoding, it's going to cause a problem. That brings us to sdhash. Again, the workflow is: chunk the file, do some

hashing, and concatenate the hashes, with "concatenate" in air quotes. sdhash applies a sliding fixed-size window of 64 bytes over the file. Unlike ssdeep, it doesn't take all the hashes. Instead, it computes the Shannon entropy of the hashes, picks the most interesting ones, that is, the ones with the lowest entropy, and puts them into a Bloom filter. Similarity is computed by taking the Hamming distance between Bloom filters. Again, Hamming distance proving to be useful here. Now, what's the advantage? It's better than ssdeep because it's a set-based hash, because we're putting

it into a Bloom filter; we're not actually just concatenating them. But it still has some issues. If a Bloom filter gets full, you add another Bloom filter, and so on, so the final digest becomes proportional to the input size. That means higher storage cost and slower clustering, and it's still sensitive to large-scale structural changes. But we've done better: we've lowered the impact of reordering, and we're only sensitive to large-scale structural changes. That brings us to what's probably the most popular fuzzy hashing technique today, which is TLSH. The motivation here is that we want to try to avoid all of these

problems. We don't want reordering to cause issues. We want a predictable digest size. We want to combat evasiveness. And most importantly, you can see here that ssdeep and sdhash calculate distance based on Levenshtein distance or Hamming distance, which makes it hard to cluster them. So TLSH tries to emit a metric-based distance, and I'll talk about why that's important and why it makes clustering easier. Again, the same workflow: a sliding window over the file. This time TLSH extracts n-grams from each window and generates a histogram of n-grams. I think it creates a histogram of 256 buckets, so it's a fixed size. It

hashes each n-gram and puts it into the histogram. So at the end of it you have a histogram of 256 buckets, with a count in each bucket of the number of n-grams that landed there, and it uses that to generate a compact, fixed-size hash. In practice, it turns out to be more robust to changes than ssdeep or sdhash. But honestly, at this point the comparisons get fuzzy, pun intended. It turns out sdhash is better at containment detection than TLSH, but TLSH is generally a more robust hashing algorithm. So it's very important that we talk about TLSH being a true metric.
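
The histogram idea can be sketched as a toy in Python. This is not the real TLSH algorithm or the API of any TLSH library: the bucket hash (CRC32), the n-gram length, the quartile coding, and the distance below are simplifying assumptions, chosen only to show how a file of any length maps to a fixed number of bucket codes that can then be compared.

```python
import zlib
from statistics import quantiles

BUCKETS = 256  # bucket count as quoted in the talk; an assumption for this toy


def bucket_histogram(data: bytes, n: int = 3) -> list:
    """Count how many n-grams of `data` hash into each bucket."""
    counts = [0] * BUCKETS
    for i in range(len(data) - n + 1):
        counts[zlib.crc32(data[i:i + n]) % BUCKETS] += 1
    return counts


def quartile_codes(counts: list) -> list:
    """Replace each count by a code 0..3 according to its quartile."""
    q1, q2, q3 = quantiles(counts, n=4)
    return [0 if c <= q1 else 1 if c <= q2 else 2 if c <= q3 else 3
            for c in counts]


def toy_digest(data: bytes) -> list:
    """Fixed-size digest: one quartile code per bucket."""
    return quartile_codes(bucket_histogram(data))


def toy_distance(a: bytes, b: bytes) -> int:
    """Sum of per-bucket code differences; 0 means identical toy digests."""
    return sum(abs(x - y) for x, y in zip(toy_digest(a), toy_digest(b)))


# Hypothetical inputs: the same two blocks in swapped order.
block_1 = bytes(range(256)) * 8
block_2 = bytes(reversed(range(256))) * 8
sample = block_1 + block_2
reordered = block_2 + block_1

d_same = toy_distance(sample, sample)          # 0
d_swapped = toy_distance(sample, reordered)
```

Because the histogram ignores where an n-gram occurs, swapping the two blocks only changes the counts for the few n-grams that straddle a block boundary, and the digest length stays at 256 codes regardless of input size.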

What is a true metric? It means that the TLSH distance between two files is non-negative. It's zero if and only if the objects are equal. It's symmetric: the distance from one object to the other is the same in either direction. And it follows the triangle inequality. Smart people have done a lot of work on this, and the vector-search comparison algorithms, the data structures they build, all the modern infrastructure that's built around vectors, use this exact property. That means we can put the hash generated by TLSH into a vector database and use it to do comparisons, clustering, and so on. That's a small hint of where we're going with all of this. We are obviously

going to talk about machine learning and AI, and how that also follows all of these properties and does better than algorithms like ssdeep and sdhash. Right, so that brings us to the end of fuzzy hashing as a section. But there are still limitations to all this. TLSH is really good, but it's still sensitive to structural changes, because we're still doing chunking. What if there are huge changes outside the chunking, across the code base or something? It still requires manual feature selection, in the sense that these are algorithms handwritten by humans, and when an attacker knows how an algorithm works, it becomes a little bit easier to evade. And it doesn't really

capture semantic similarity. TLSH, ssdeep, and sdhash break files into chunks, not arbitrarily, but without thinking about what these chunks actually mean. So that brings us to learning-based similarity. With learning-based similarity, we're hoping we can learn complex patterns automatically from raw features, and we're hoping that makes it adaptable to malware detection, clustering, and classification. We're especially hoping it makes it adaptable to temporal changes, as a threat actor changes their behaviors over time. We're hoping that as long as we keep fine-tuning the models, we're still able

to make detections on new files. We're going to talk about just two models, but I think these are the most sophisticated classic ML models: XGBoost and deep neural networks. But before we go into that, we need to talk about how similarity works with learning-based approaches, with machine learning models. The workflow is going to be: you train a classic classification model. You give it an input and you either label it as malware or benign, or you give it an input and you assign a malware family to it. You can then use that model to derive an embedding model, which gives

you a vector representation of the input, which is a latent or semantic representation of the features, and then you can use these embeddings for similarity. Again, you give it a document; it can be an image which might contain malware, documents, files, whatever. You put it into a model, it gives you a vector, you map it into an n-dimensional space, and if two vectors are close together in that space, that means the inputs are similar. In the context of malware, the binary classification problem is: is the file malware or not? For multiclass problems, it's: which malware family does the file belong to? And we're going to look at how XGBoost and deep

learning can be used to do similarity for both. XGBoost is an ensemble learning method, which just means you take a bunch of different models, put a file through those models, and aggregate the decision. It builds a bunch of trees sequentially, it's very fast to train, it does very quick inference, and it's excellent for table-like features. So if you've extracted a lot of features from malware with either static or dynamic analysis and you have them in a table, even if it's a billion rows, XGBoost is great at training on that data. So I'm hoping this example is useful:

you can think of this as an analogy to the binary malware classification problem. If the question is "is a child a good reader?", XGBoost trains n trees, where each tree is a differently structured decision tree. In the first one, the root asks: does your child think reading is easy, yes or no? And then: is your child a girl? And so on. You have n trees; each of them just emits "good" or "poor", but they have multiple leaves. This is just classification, right? So how do you generate an embedding from this? Again, if you take this example, the first tree takes in the file and routes

it to leaf node 2. The second tree routes it to leaf node 3. The third tree routes it to leaf node 2. If you take each of these leaves as a bit, you can say it's 0 1 0 for the first tree, 0 0 1 for the second, 0 1 0 for the third, and so on. And then you can do a Hamming distance comparison between two input vectors. Why does this work? It works because if two files are classified very similarly by the trees, it's more likely that they'll end up in the same leaf nodes. So that's XGBoost. I don't think anybody needs a deep introduction to neural networks these days, but they're effectively just multi-layer sets of

neurons. The way people think it works is that it captures nonlinear, complex, hierarchical representations of features given the input, and for malware detection at least, it's pretty effective on raw binary input or on extracted features. Transfer learning is a completely separate thing, where you take an existing model like ResNet and train it for something else. So how does embedding work with neural networks? Again, hopefully an instructive diagram: if you have a deep learning network with an input layer, three hidden layers, and an output classification layer, you chop off the head, take the last hidden layer, and you get a vector representation of the input,

essentially, and then you can compute the Euclidean distance between two vector representations to see if they're similar or not. All right. So that brings us to the experiment section of the talk. It's worth noting that a friend and I have done this work. This is not part of our day jobs or anything; it's just a passion project. We just love working on the intersection of security and data. As I mentioned earlier, it's very hard to get your hands on a million files to play around with, right? So we're very thankful that the EMBER dataset exists. It's an open collection of

features from a million portable executable files. It contains static-analysis metadata for 400,000 malware files, 400,000 benign files, and some unlabeled files. We just took the 800,000 labeled files here. And we're also thankful to CrowdStrike for augmenting this data with similarity information. What CrowdStrike did is they took this data and put it through VirusTotal, because they have all the money to do that. They tagged malware with what's called AVClass, which is essentially just: what malware family does it belong to? So that was the dataset we played with. To give you an idea of what features are in EMBER, the right side of the slide gives you a better idea, but it's usually

byte statistics, some header information, strings, imports, and more. Most importantly, it contains ssdeep hashes, which are useful for our similarity comparisons. Unfortunately, it doesn't have TLSH, and you can't compute TLSH unless you have the actual file, so we couldn't run the experiment comparing the machine learning approaches with TLSH. I wish we had done that, because TLSH is the best fuzzy hashing algorithm out there, but that's some scope for future work. So this is a summary of what I've spoken about in terms of classic ML so far. Classification: XGBoost and deep neural networks do that; ssdeep of course doesn't. For similarity, XGBoost uses Hamming distance, deep neural

networks use Euclidean distance, and ssdeep uses Levenshtein distance. You should already be asking the question: if all three models use different distance metrics, how are we going to do a similarity comparison across the three? I'll talk about that soon. Okay, so, the most interesting part, I guess. A quick introduction to how classification models are evaluated: there are true positives, false positives, true negatives, and false negatives. If a piece of malware is actually labeled malware, it's a true positive. If it's labeled benign, it's a false negative, and so on. So hopefully this gives you a good idea. You don't even have to look at this

slide. All you need to know is that there's accuracy, precision, and recall. Forget about F1 and AUC as well. Accuracy is just: if you have a bunch of samples, how many of them have you actually given the right label to? For precision and recall, I like the metal detector analogy that somebody gave me. If you take a metal detector to the beach and you want to find gold, precision is: of the times that the metal detector beeped, how often was it gold? And recall is: of all the pieces of gold on the beach, how many did it find? So it turns

out there's a bit of tension between these two metrics; it's very hard to get both of them to be very high. And so F1 score and AUC essentially tell you how well you've done at getting high precision with high recall. In the context of malware, instead of metal detection, let's take mine detection, right? If you're in a minefield, you want high recall because you want to find all the mines, but you can sacrifice precision. It's okay if your mine detector finds a bit of junk, because you just want to be safe. All right, so, the binary classification problem, that is, is it malware or not? It turns out XGBoost is better

at it than deep learning. You can ignore all the numbers; all you need to see is that accuracy, precision, and recall are just better across the board. Maybe a 1% difference doesn't seem like a lot, but when you're working with a billion files, it starts making a difference. And for the multiclass classification problem, it does way better. 90% accuracy in classifying a file into its malware family might not seem great, but we were working with 20 malware family classes, and a random guess gets about 5% accuracy, right? So 90% is pretty good. And you can see that deep learning does

way worse here. Top-five accuracy is: if the model says, okay, these are the top five families it thinks the file could belong to, is the actual family one of them? I'll skip over precision, recall, and F1, but XGBoost does way better than deep learning on these as well. So for classification, XGBoost is way better. What about similarity? As I mentioned before, Hamming distance gives you a number, Euclidean distance gives you a value between 0 and 1, and ssdeep gives you a

number as well. How can you compare different models doing different things? How can you tell which model is better at similarity than the other? That's where something called label homogeneity comes in. For each file, you get its k nearest neighbors and count the neighbors with the same label. So if you get 100 neighbors and 80 of them have the same label, your score is 0.8. The higher your label homogeneity, the better your similarity model is. That makes sense. And the great thing is that this is metric-agnostic, so it works for TLSH, it works for ssdeep, it works for machine learning models, and so on. So we can just look at the last

one here, because the results are the same throughout. For ssdeep, of the 100 neighbors, only about 20 have the same label. When you look at 10 neighbors, it's 3.5, so the performance is not great. Compare that with XGBoost: when you look at 100 neighbors of a file, 75 of them have the same label. So it does much better than ssdeep already; a machine learning model has shown that it can do better than a fuzzy hashing approach. And deep neural networks really, really shine here: of 100 neighbors, 94 to 95 have the same label. So deep neural networks do better than XGBoost, and both machine learning models do way better than

the fuzzy hashing model. So what are the conclusions here? Model-task alignment is super important. You can't just take a deep neural network model and say it's going to work for both classification and similarity. You need all the tools in the toolbox. And these days XGBoost and neural networks are not that hard to train on your own custom enterprise data. So you need XGBoost for classification and neural networks for similarity. Does that mean fuzzy hashes don't have a place here? I would argue that's where explainability comes in. I mentioned locality sensitivity earlier: algorithms like TLSH tell you which bits of the file are similar. And so if

you're looking for explainability, XGBoost is good at it, but these fuzzy hashes are good at it as well. So the conclusion that I want everyone to take from this is that hybrid systems are essential. And when you're evaluating, let's say, VirusTotal or something else, you want to make sure it covers all these different tools. Future work: I want to do this again with TLSH. It requires a lot of compute and a little bit of money, so if someone's open to collaborating on this sort of thing, I'm more than happy to work with people on it. And I want to evaluate the explainability of different classification and similarity models, as

I mentioned earlier. Yeah, that's it. A quick shout-out to Sam Stewart, who helped me with my presentation, and I'll open it up to any questions. Thank you. All right, thank you very much. We do have a question on the Slido. >> Yeah. >> On the Slido, from anonymous: how do the models handle samples with VM-based execution, i.e. VMProtect 3, Firebeam binaries, and encrypted VM instructions with unique [unclear] obfuscation? The only experiments that we did were with the dataset that we already have. With obfuscation, I'm not really sure. I don't think I have a great answer to this question. I'm not familiar

with the obfuscation methods or how to handle obfuscated files, so I guess it depends. I'm only familiar with things like XOR obfuscation and how to detect and reverse that. So I don't think I have a great answer to this question, unfortunately. But if... yeah. >> Yeah. Let me give you a mic.

>> So when packers really came into vogue, part of what happened was that the feature sets of packers on files would start to be identified as part of the hashes, regardless of which hashing algorithm you're talking about. So that could be used as part of the analysis. Uh, then there are questions about what happens once you unpack, and other things like this. But when we're talking about VMs and any of these other transformations and obfuscations, similar things happen. It's just trying to get to what is the root file that actually gets to the point of execution at some point. That's kind of like... >> I see. >> ...and what is the family at that point.

Okay. But when packed, obfuscated, etc., there are ways of seeing clusters still, despite that. >> Yeah. >> Okay. Awesome. Do we have any other live questions in the room? Yes. >> Yes. Hi. Have you considered, uh, using astove embeddings? >> Sorry, what's that? >> So as to is, um, maybe something similar to parking, is a method to, um, extract embeddings from actual binary files, executables. Um, apparently it is prone to, um, obfuscation, um, and anti-tampering. So for example, you have compiler toolchains that can obfuscate the executable code and make it more difficult, right, to, um, analyze, and apparently it's prone to that. That's what I heard. >> I see. How do you spell that?

>> As to. >> Oh, okay. Uh, I think that falls under something like the autoencoder model, where it's like an unsupervised algorithm that just gives you an embedding from a file. Yeah. But again, I didn't have the raw files. We didn't have the raw files in the experiments. We're hoping to work with corpuses like MalwareBazaar, which have a million files, and then rerun these experiments on those. Maybe we can try asto on that. >> One more comment. One other thing to be aware of when using ssdeep is that, due to the block sizes that are arbitrarily chosen, you can have

overlaps, uh, in your data sets with clustering where, when using TLSH, you'll see differentiation. So sometimes things look similar that actually aren't with ssdeep, just if somebody tries to do this, or even with the data set you're working from. >> Yeah, I think ssdeep has other issues, like block size alignment also needs to be... Yeah. >> One more. >> Which feature, uh, generally was the most important feature for, uh, XGBoost? >> That's a good question. Uh, the string-based features, like exports and imports, flagged up the most in XGBoost. I should have added an explainability slide there, but yeah. Um, it was the string-based features: headers, imports, exports.
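On the ssdeep block-size point from the exchange above: an ssdeep digest has the form `blocksize:chunk:double_chunk`, and two digests can only be scored against each other when their block sizes are equal or differ by a factor of two. The sketch below checks only that precondition and does not compute a similarity score. The helper names and the digest strings are made up for illustration.

```python
def ssdeep_block_size(digest: str) -> int:
    """Extract the leading block size from a 'blocksize:chunk:double_chunk' digest."""
    block_size, _chunk, _double_chunk = digest.split(":", 2)
    return int(block_size)


def comparable(digest_a: str, digest_b: str) -> bool:
    """True if the two block sizes are equal or one is exactly twice the other."""
    a = ssdeep_block_size(digest_a)
    b = ssdeep_block_size(digest_b)
    return a == b or a == 2 * b or b == 2 * a


if __name__ == "__main__":
    # Placeholder digests: only the block-size prefix matters for this check.
    small = "96:abcDEF123:xyzUVW"
    medium = "192:ghiJKL456:rstOPQ"
    large = "768:mnoPQR789:lmnIJK"

    print(comparable(small, medium))  # block sizes 96 and 192
    print(comparable(small, large))   # block sizes 96 and 768
```

Pairs that fail this check get a score of 0 from ssdeep regardless of how much content they share, which is one reason a cluster built on ssdeep scores can look different from one built on TLSH distances.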

>> I'll ask a question. How do you address false positive risk? >> Um, yeah, that comes down to the balance between precision and recall. If you are training a machine learning model, you can explicitly make it have more precision and trade off recall. So more precision means you'll have fewer false positives, but the risk is that you'll miss malware. So you can train your model to be sensitive to one thing or the other, but it is unfortunately a trade-off. >> All right, thanks everyone for your questions. >> Uh, I have one more question. >> Yeah, go for it. >> So, uh, does the, uh, size of the data set also, uh, in any way correlate with the

accuracy? So, let's say we did it for like 800 million and then we increase it. >> Yeah, I think machine learning algorithms tend to generalize better the more data you have, up to a certain point. So in general, yes, but I do think a million files is probably more than enough. I will say a model trained on a million files from 2014 to 2018 would probably not do well with malware that we see today. So temporality is also pretty important. You need to add new malware that's found in the wild into your data set and train it on that

too. >> So we do have break-even points. >> Yeah. >> Cool. Uh, any more questions, guys? No? We've still got time, five minutes. No? >> Okay. Okay. Thank you. Uh, in case you guys are concerned, no malware was harmed in the making of this presentation, I think. >> But yeah, uh, let's give him a big round of applause. >> Thank you. >> And we have a special thank-you gift for you. >> Oh, thank you so much. >> All right. Thank you so much. >> Thank you. >> Thank you everyone for joining the session.
