
Hello everyone, thank you for coming. It's actually really interesting to see so many people in the room who have got a passion for TLS; normally we don't see this many people interested. So today I'm going to talk about TLS fingerprints and how we can compare them to be a little bit more effective in the threat hunting domain.

A little bit about who I am first. I'm actually a recent MSc cyber security graduate; I came to cyber a little bit late in my career. I'm currently employed full-time as a generic application developer: full stack, mobile, that kind of stuff. I've got a broad interest in threat hunting, as you
can probably guess, malware analysis, and basically any kind of future web trends and technology.

I'm going to launch straight into this, and I'm going to introduce the topic of similarity to you, but with a little bit of a twist. We know already that in the domain of cyber security, similarity is used pretty extensively, but we mainly think about it when we're thinking about malware and malware analysis. I don't want to talk about its application to malware specifically. I actually want to talk about its application in another domain entirely, and that's chemistry. I know it's a bit of a weird
thing to pull together, but I'm going to give you an example of how we can use that influence to push back our perimeter a little bit, and how we can use it to help us identify tools and domains that serve malware before we even have to worry about the malware itself.

Firstly, I just want to give a brief overview of what I mean by similarity in this context. We've got two different kinds of hashing that we do on files: cryptographic hashing at the top, and fuzzy hashing at the bottom, which is probably what most of us are more familiar with. All I mean by similarity is
when you're trying to compare two files and they're not quite exact, they're slightly different. But how we establish that level of difference is a really hard problem to solve. The most commonly used approach, which is used by antivirus and that kind of thing, is called fuzzy hashing: they break down a file into multiple smaller pieces, hash them, and compare them to one another. They do that to get around obfuscation and small changes in malware files, and it's actually really effective.

So we know we can make use of similarity when we're looking at malware, and we know what TLS is. I guess: how do we bring them together, and why is
it useful to do that? But firstly, I'm going to explain why I'm focusing on TLS at all. When we look at C2 servers, phishing sites, and malware domains, we can fingerprint what we can see really easily. We can fingerprint the content, the headers, the frameworks, those kinds of things, favicons: all really easy to do. But TLS gives us something a little bit more subtle, and it allows us to fingerprint something that we can't see. Some of the extensions, for example, can give us a really good insight into that underlying operating system, and a good example of that is the extended master secret. If you're looking at the TLS of a server and it
supports extended master secret, you know for definite that it's going to be using OpenSSL 1.0.1 or later. So it's actually a really subtle approach for profiling servers that's genuinely quite effective.

Here I've got a very, very simplified version of the TLS handshake, so forgive me, but I'm going to talk to you a little bit about the fingerprinting process, and we're going to introduce a system called JARM. I don't know if anyone's heard of it, but it's used quite commonly in Shodan, VirusTotal, those kinds of tools. All it is is a script provided by Salesforce that you can run against a server, and it will send 10 curated
client hellos, and from that you will get 10 specific server responses back. It amalgamates those, hashes them, and gives you a little fingerprint that you can use to find similar configurations out in the wild.

So why is similarity useful for TLS fingerprinting? It's a really interesting question, because exactly the same problems from malware exist with fingerprinting. Threat actors can make tiny configuration changes to their servers to get around known fingerprints, so having a fixed hash approach is fallible. But it's also incredibly useful, because more and more commonly threat actors are using automation; they're using common technology stacks. So if we look at things through the lens of similarity rather than those
exact fixed hashes, it allows us to broaden our horizons a little bit and find similar servers.

And this is the big question: if both of these things are so useful, why aren't we already doing it? Actually, it's because it's genuinely a really hard problem to solve, because unlike files, which are quite static, we have traffic. There's a whole bunch of things that any TLS similarity mapping would need to do well to be effective. Mainly, the negotiation parameters are complex, so it would have to handle all of those. It would also have to handle huge volumes of nodes, domains, and servers. But the most important bit for me was that it has to provide meaningful
gradations of similarity. We all know about clustering, and there's a whole bunch of AI work that you can do around clustering, but generally they're too broad; there's too much collateral involved in clustering attempts. Also, we can't exclude any element of the TLS connection, because we've only got a finite number of things we can look at.

So we spent a lot of time looking at this problem. But what if another domain had already solved it? I'm going to introduce to you something from the chemistry domain. It doesn't look great on there, there's some detail that's been lost, but TMAP is a Python library for mapping chemical molecules. So it really hones in on fine-grained
similarity of these molecules, but more interestingly, it can handle data sets in the billions, and it retains those nearest-neighbor relationships really well. This image is the Natural Products Atlas, and all that is is a database of natural products derived from microbes, colored by their origin. The little inset you have here is similar compounds grouped by family, genus, fungi or plants, so you can see these similarity patterns there.

This one is the same approach, but with 6,000 servers scanned and plotted by just their TLS configurations. In this one, the red nodes are obviously the known malicious domains, the amber nodes are newly registered domains
within the last two weeks, and the green ones are the known good domains. You can already see just from that that, even though these cryptographic fingerprints are very different, there are some trends in the way these malware and phishing domains are actually using their server stacks.

I'm going to give a really high-level view of how this works. All I've done is take the JARM tool and run it against a whole bunch of servers on the internet, but instead of hashing those fingerprints, we create binary vectors from that output. That means there's a post-processing step where we take every TLS extension that we've seen across the entire set of domains that
we've scanned, we create these column headers, and then we plot against them what we have seen and what we haven't seen.

The bottom here is the basis of how the visualization you've just seen works. This is a locality-sensitive hashing forest. If we walk through the Google domain here, you'll see we go right at the red node, right again at the green node, and then right at the final blue node, so it puts our domain right in that far bucket. We do the same with the malware one, and we go right at the first node, left at the green node, and then left at the blue node, so we can see it's in the
fifth bucket. So instinctively, just by where it is in that forest, you can tell that they have something in common; they're not entirely different. Basically this just demonstrates how the more they have in common, the closer they are in those buckets, and if they're identical, they're in the same bucket.

So what's the outcome of this? It gives us a really fast and efficient method to compare TLS fingerprints, which we don't normally attempt to do. The image here is actually a mixture of technologies, and this scan was entirely incidental. We've got a mixture of technology stacks: we've got Gophish in blue, we've got Metasploit in pink,
we've got Burp Collaborator in blue, and we've got some Tactical RMM in there as well. We can see from this demonstration that the tech stacks do align: even though the cryptographic hashes are different, the technologies themselves have limited variation in how their configurations are rolled out. So it's actually really useful for classifying servers as well.

So where next with this? TLS itself is becoming a little bit less useful as tech stack hosting is changing. One way we can move this forward is to enrich the vectors, and that's all I've done here: I've added in just a couple on the end to demonstrate how we can
incorporate some HTTP header data. This is really effective when TLS similarity is super high, and we see that in CDNs. CDNs are interesting because you naturally get more overlap between good TLS fingerprints and bad ones, because there are limited TLS configurations you can do on a lot of CDNs; they restrict the kind of cipher suites and the version of TLS you can use. This is a data set where everything's hosted on Cloudflare, and you can see it's genuinely pretty pointless. This is just TLS on its own: you've got red domains scattered in with good domains, so there's not a lot of trend analysis we can do on this one. This is the same data set with the
incorporation of that HTTP header data. You can already see that those malicious nodes are grouped tighter together. They clearly do have a lot in common when we introduce the HTTP header data, and the technique becomes a little bit more effective. But it's not limited to just HTTP header data; you can take this a really long way. In some tests and some analysis I introduced certificate data, whether it's self-signed, CAs, those kinds of things, but also the HTML tags: doing analysis on, like, the first 50 tags, the format of the website itself. You can really introduce a lot to this, because TMAP can handle up to billion-sized vectors
really well and really fast, so there's a lot of possibilities here.

So I guess, wrapping it up, my main point for bringing this here today is to try and encourage people to look a little bit more outside their own domain. A lot of the problems we have have already been solved; we could just look a little bit more outside our own boundaries. But also, this is genuinely a really useful technique that you can employ for threat hunting or monitoring your own domains.

Thank you very much. I actually just want to say a big thanks to my mentor, Matt, who's been excellent. And yeah, thanks for listening. I hope you
learned something, and if you've got any questions, I'm here for the rest of the
day.

Thanks, Amanda. We do have some time for some questions, if anyone would like to ask some. Down there.

Hi, congratulations on the presentation. I have a question: is this only working for fingerprinting, or can it be used, for example, in iris or any other biological...

Anything. If you can put it in a vector, you can use this technique.

OK, thank you.

I saw another hand up, I think.
Yes.

My question was about building that kind of vector store, or the graph that you've got there. That's kind of a point-in-time build. Have you looked at any ways to keep that going, like a streaming engine of new domains as they come in and new enrichments?

Not yet, but I think it would be doable.

OK. And then the other question was on one of the last shots you had, where the malicious domains were clustered in the top. That required initial identification of at least one or two of those domains as malicious before you would know that cluster was...
Absolutely. So yeah, if you've got a seed domain, you instantly know that if you've got five or six clustering domains around it, you should be looking at those.

OK, but you do need something classified?

Yeah, absolutely, because otherwise they're just unknown.

We've got time for one more. There's a chap there.

Yep, genuinely, this is just packet data, that's all it is. It's pulled out of the packets.

Fabulous. Thank you so much, Amanda, that was awesome.
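
Editor's illustration: the binary-vector step described in the talk (collect every TLS feature seen across the scanned set as column headers, then mark each domain as seen or not seen against them) can be sketched roughly as below. This is a minimal sketch and not the speaker's code. The domain names, the feature labels, and the `observations` structure are all made up for illustration, and a plain pairwise Jaccard comparison stands in for the locality-sensitive hashing forest, which is what makes the approach scale to very large sets.

```python
# Hypothetical scan results: for each domain, the set of TLS features observed.
# All names and feature labels here are invented for illustration.
observations = {
    "good.example": {
        "ext:extended_master_secret", "ext:alpn",
        "cipher:TLS_AES_128_GCM_SHA256", "ver:1.3",
    },
    "bad-one.example": {
        "ext:renegotiation_info",
        "cipher:ECDHE-RSA-AES256-GCM-SHA384", "ver:1.2",
    },
    "bad-two.example": {
        "ext:renegotiation_info", "ext:alpn",
        "cipher:ECDHE-RSA-AES256-GCM-SHA384", "ver:1.2",
    },
}


def build_columns(obs):
    """Column headers: every feature seen anywhere in the scanned set."""
    return sorted(set().union(*obs.values()))


def to_binary_vector(features, columns):
    """1 where the domain showed the feature, 0 where it did not."""
    return [1 if column in features else 0 for column in columns]


def jaccard(a, b):
    """Shared 1-bits divided by the positions where either vector has a 1."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0


def nearest(domain, vectors):
    """Most similar other domain by brute-force comparison (no LSH here)."""
    others = (d for d in vectors if d != domain)
    return max(others, key=lambda d: jaccard(vectors[domain], vectors[d]))


columns = build_columns(observations)
vectors = {d: to_binary_vector(f, columns) for d, f in observations.items()}

for name in vectors:
    print(name, vectors[name], "nearest:", nearest(name, vectors))
```

Enrichment of the kind mentioned in the talk (HTTP headers, certificate data, HTML tags) would, under the same assumptions, just mean adding more labels to each domain's feature set, which adds more columns to every vector.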