
Hunt At Scale: Fingerprinting Threat Actors Across The Web - Stephen Doyle

BSides London 2025 · 28:08 · Published 2026-03
About this talk
Stephen Doyle demonstrates a large-scale threat-hunting pipeline that fingerprints threat actors by scanning domains across WHOIS, DNS, and HTTP layers, then correlates them to identify campaigns rather than isolated incidents. Using open-source tools and commodity hardware, the approach extracts behavioral fingerprints—server infrastructure, third-party contacts, page links, and technologies—to group malicious domains targeting financial services and detect previously invisible phishing campaigns.
Transcript [en]

Thank you. Just on the mic, can everyone hear me? Okay. Cheers. Thank you. Hunt at scale: fingerprinting threat actors across the web. Everyone does a whoami, so I feel it's time for a quick introduction. Incident response, insider threat, and Webamon, where I've been working for the last two years. I came to this conclusion: I have no experience in intel, so I thought I'd join this startup as an intern and hopefully get a good grasp of it. That's my introduction, nothing special. The talk intro: we're going to build an on-prem, open-source pipeline. What this pipeline is going to do is take all the domains in the world. We're

going to scan them at the web layer, the WHOIS, the DNS, and then we're going to do some magic to correlate them all together. A theme through this talk: remember, I'm an incident responder. I don't like responding to incidents. I prefer responding to campaigns. That's what this talk's about. A big part of the fingerprinting is being able to see somebody's targeting over here, somebody's targeting over there. Rather than 30 incidents coming into our queue each week, we can go: that's one managed campaign, keep it updated. Why? Again, I'm an incident responder. I'm lazy. I'm great at automation; that comes from my laziness. So let me tell you about a SOP that

I used at a previous company. We would pay Recorded Future for a list of domains, such as CrowdStrike with two E's. Then we'd pay urlscan to scan the 40 domains we got each day. And if there was a risk, we'd go to a takedown service; in this case it was CSC or MarkMonitor. That fed the incident queue, especially when there were campaigns or threat actors targeting us. They were all single incidents; nothing was tracked at the campaign level. From an automation point of view, it could all be automated. That's what this talk is. I like Amazon. I don't believe you need to spend 10 grand on an enterprise

server if you can build one off Amazon for, well, we'll get to that. So, everything here to scan the web is from Amazon. If you look at these, we've got five stacks: a Ryzen 5, 64 gig of RAM, the motherboard, and one terabyte of SSD. If you're asking where the fans are, one came with the CPU; that was enough. This is the building process. There are a few Easter eggs on the screen along the top, but you're too late, too slow. Put them into a 2U server rack. I tried to fit it all into a 1U, but I realised I needed a fan, so I had

to go to 2U. Throw them all in here, repeat that process five times. Jumping around a little bit here, but it's all right. On the left, these are just compute nodes: the Ryzen 5, the RAM, everything. We're going to use them for sandboxes. On the bottom is an 8-bay hard drive storage unit that we're going to use for OpenSearch. Why the left and the right? The left was running on my home residential internet, and BT didn't like me doing that on my home residential internet. So Steven decided to rent a wee small space above a suit shop in Ireland that costs £500 in rent. But for this black cable, full fibre

to the premises coming in, I'm not going to tell you how much, but it's three times my rent. I'm not going to go through this, but these are the specs of our compute node, the five 2U servers you've seen. It's literally just a mid-spec gaming computer without a GPU. I am going to be nice, and when you tell me you've taken the photo, I'll go next. He didn't take it. Okay, I was being nice. OpenSearch: solid state versus hard drive. Everyone assumes solid state's amazing, and it is. But for whatever reason, OpenSearch works great off hard drives, so I was like, save the money. Very simple spec: we have 16 terabytes for our Open

Search node. Server rack config: we wanted to be professional, so we bought some Chinese UPSes, like, "Yep, enterprise grade." Now, switches. We started off with a TP-Link. A good TP-Link, just saying. I had to throw that out the next day, so we had to upgrade to an enterprise switch. A secondhand one, by the way; that's why it was so cheap. So, let's look at the right here: we have six compute nodes. For the one at the very top, we had to build our own router, which runs OPNsense. The fibre coming in is full fibre to the premises at 10 gig; that's why we needed to build our

own router to support that. For any management people here: the capex was 5K and the opex is 1K, which was just the internet. If anyone here uses urlscan or any sandbox or anything, this is the incentive you need to go on-prem and start building stuff. So, we spent 6K, we've got all these amazing servers, and now we're looking at it like, hmm, let's do something with that. This is urlscan, I'm going to be bold and say it. All we use is Python, Selenium, and ChromeDriver. That's its function at its bare minimum. We utilise our whole server rack by saying: let's run nothing but that and scan

all the domains in the world. This gets us our current capacity: as of today, it's just over 5 million. Don't laugh. My draw.io isn't the best, but everything's on-prem. The reason being: compute, storage, network egress. You're paying for all of that in the cloud. Good luck. So we have all the heavy lifting on-prem. If anyone's familiar with AWS, Lambda functions are just disposable code that you can trigger and run once; they don't have to spin up a runtime environment. So we just have an API gateway with a couple of Lambda functions behind it, and then we have a VPN onto our on-prem. Over to the right we have ICANN. There is a slide on that, so I'll get

into that a bit more, but all ICANN CZDS is is the Centralized Zone Data Service. That's where we get all of our domains every day. These API functions are monitor, search, and scan. Scan is self-explanatory. Search searches our whole OpenSearch. And for the monitor, we just use an AWS EventBridge scheduler. We use this for our continuous monitoring: we'll just set up a schedule to scan a domain every x hours or whatever. These are the newly registered domains for a random day in September. At the bottom left you can see there were 235,000 in total; source: ICANN. And when we scanned them all with our sandbox, we could see over half were hosting. Now, not to

spoil it, the majority of them are parked. And for the ones that failed: timeouts, a lot of which is firewall filtering; the name not resolving, which is self-explanatory: there's no A record, it's not hosting. This is our sandbox. We get 150 new data points, but the fingerprints are all you need to remember from this slide. So, we've scanned all the newly registered domains. We want to threat hunt them. We want to find intelligence. But what's intelligence to us? That matters. Where are you working? What are you doing? What's your industry, etc.? These are previous places I worked. When I worked at CrowdStrike, it was after SolarWinds, and I kept seeing the same attack vector. I was like, what if

somebody supply-chained the source code for CrowdStrike, and that's on hundreds of millions of endpoints? Good luck. So I'd scope for that. Synopsys is the application layer for the semiconductor industry. They get pwned, every phone in the world gets pwned. CME Group: just a Fortune 500 clearing house for financial markets. Today, we're Binance. You have to be somebody.

So, depending on the country, we're either heavily regulated or unregulated. The big one over the past couple of years has been the crypto apps: the SIM swapping, fraud, illicit activities. And it's also a consumer-facing app, so we have to accept the risks that come with that. So Steven Doyle, CEO of Binance, says we must protect our customers. Steven Doyle, security intern at Binance, says, "Sweet." Three things complete this pipeline we were talking about. We went through the hardware; we're now going to focus more on the software. We need something to get the newly registered domains; that's from ICANN, as we already said. Something to scan the new domains; that's above our suit shop. And

somewhere to store the data; that's also above the suit shop. If anyone pays for newly registered domains, it's your choice to keep doing it. This has all the domains for the generic and new top-level domains. Generic are like .com, .biz, .shop. They don't include country codes, so that's your gap. Country codes are .uk, etc. If you're paying, all I'd say is make sure they include country codes. If they don't, they're just scraping it from here and putting branding on it. These are the limits, as I said. As of today, there were 257 million registered domains in ICANN's zone files. It's hard to get an exact number of how many are registered, because of the country-code registries. There's about 1,300. I know there's not that many

countries. There's no standardization; they don't publish their zone files like they do for the generic and new top-level domains. But I'm going to make an educated guesstimate: 750 million registered domains at any one time. There were only 257 million across the new and generic TLDs in ICANN, so that's your gap. This is a GitHub repo by ICANN; it's just a Python command-line tool to download it all. Very simple. You go to CZDS at icann.org, create an account, say you're a security researcher, and they give you creds. Download this, and there are only two variables you need to fill in at the top: username and password. ICANN updates at 7:29 each morning, so if you want to be on time, create a cron

job at 7:30. I didn't know that ICANN GitHub repo existed before I went and made my own, so I'm throwing mine in there, too. So, we've got something to get the new domains. The URL scanner is Selenium, ChromeDriver, and Python. Selenium is just a framework to interact with the web. ChromeDriver we need for anything that's dynamic: for example, if you went to a React website and viewed the source code, you wouldn't see anything on it, because it's not the rendered DOM, the Document Object Model. That's why we need ChromeDriver, and Python is the glue that brings it all together. I've already shown you this. If we wanted to

make it better: if you're ever in Chrome and you right-click, or is it F12, one of the two, and you view the dev console, you see the network, you see the third parties, you see the resources, you see everything. That's what this blue box is. If you want all of that at a programmatic level, everything that comes from the dev tools, we can use it. Now, the scan output I'm not going to go through, it's massive, but there are over 180 data points. Everything urlscan has, if you're familiar, is there, but the main differentiator we use is the fingerprints. Now, getting into the fingerprints: we've got the new domains, scanned the new

domains, and stored the new domains. These indexes are just to give you an idea of what we're storing. When we do a scan for bsides.london, it's in the scans index. All their images, all their screenshots, everything is in resources. All their third parties, so they contact Google Firebase, etc. We log their servers. Domains, same; screenshots, self-explanatory. Okay, easy. Got the domains, scanned them, stored the data. Now, if you remember, we're Binance: we're consumer-facing, there's SIM swapping, we're heavily regulated depending on the country, and we know Web3 is just a massive mess, to put it politely. So, let's look for Binance threat intelligence, not

data. The traditional start: I pay my vendor, here's "Binance", make some permutations, put it across domains, and let them deal with the rest. That's not intelligence; it's scoped data. Because we've scanned every single domain, we don't care about the domain name. We can go straight to the page title and look for "Binance". We can look in the Document Object Model. Or, since we have the third parties, we can go: okay, show me all the domains that contact Binance, because they're most likely just scraping it. Very simply, we just did: show me websites with the page title "Binance". Now, if you notice, on the left, none of

these domains are similar to Binance, so they wouldn't have been caught the traditional way. And threat actors notice. They know: domain name, easy way to get caught. Let's not do that. When we searched Binance for this date, these were the only results that came up. Again, I want campaigns, not incidents, so I'm looking for similarities. "Giriş yap" is Turkish for "log in". So that's six, if I can count. Okay, we want to see how many scripts are on all of those websites, along with the requests and the domains. The domains are the third parties that they contact. So we enrich this a bit more. Okay, a bit more consistency. There's one

outlier: that request count of 25. But at this point, I'm somewhat confident this is the same threat actor. The fingerprints. Does anyone not know what a SHA-256 is? Brilliant. So, the report's just JSON, and for each parent field in the JSON you've got countries, you've got servers, you've got resources. For this example, we have the ASNs. The servers would be all the third-party and first-party servers they contacted, including location, IP address, etc. All we do is SHA-256 that part of the report after we sort it. Boom. So next time we scan it, let's say the IP changed in one of these five servers, the whole hash is

going to change, so we know something's changed in the stack. For all of these, they share the exact same SHA-256 for that part of the report. So they contact the same IPs in the same four countries, the same ASN, and they're the same server types. Getting more confident it's the same threat actor. These are the fingerprints for the links. Again, exactly the same across the whole set. So now I've most likely got one campaign versus six incidents that would have come in. These are the links. As I said, this is just a part of the JSON report, the page links. We sort it, SHA-256 it, and we get those values. So we know for a

fact all of these use the same links on their website. Here we're checking scripts: not a single one's the same. And when we check the tech, every single one of them is the same. Now, if we look at the tech, all they're using is this version of jQuery, and they're loading their favicon from this URL. So we can track that: for example, they always put their favicon in assets. Let's use a template? Great, we'll do that. They're always using jQuery 3.7. Great. Summary so far: we're Binance, both the CEO and the intern. We built the pipeline to pull all the newly registered domains from ICANN. We built a sandbox to scan them all. We've applied our business risk and

logic: being consumer-facing, we're an app, etc. And we found six domains with no naming similarity to Binance that all share the same fingerprints. Okay, but that doesn't answer it; it's just intelligence. We're yet to look at the website. We're yet to confirm anything. All we know is it says Binance in the page title; we haven't visually looked at it. Okay, we visually looked at it. It's a threat, if you didn't know. Sorry to disappoint, but no, there's no threat. Goodbye. So, this is our scope. This is our ticket so far for our campaign. We've got this page title. We've got these fingerprints

and some behavioral analysis: they always have 12 scripts. And if you remember, the fingerprints for the scripts were different, so I assume there's something dynamic in those 12 scripts. And they contact five domains: one first party and four third party. This was originally fixed to one date for this campaign, which was the sixth. So we were like, you know what, let's take our indicators, fire them across our whole data set with no date, and see what we have. We've moved from six up to 31. And there's one outlier here, the ASN fingerprint. Immediately that tells me half of this campaign is hosted in a different area, a different stack, whatever. We'll get into that.
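The fingerprint scheme the speaker describes, sort one section of the JSON scan report, then SHA-256 it, can be sketched in a few lines of Python. The "servers" field names below are illustrative, not the actual report schema:

```python
import hashlib
import json

def fingerprint(section):
    """SHA-256 of a canonically serialised report section.

    Sorting dict keys and list entries first means the hash only changes
    when the underlying data changes, not when the ordering does.
    """
    def canonical(value):
        if isinstance(value, dict):
            return {k: canonical(value[k]) for k in sorted(value)}
        if isinstance(value, list):
            # Sort list entries by their JSON form so order never matters.
            return sorted((canonical(v) for v in value), key=json.dumps)
        return value

    blob = json.dumps(canonical(section), separators=(",", ":"))
    return hashlib.sha256(blob.encode()).hexdigest()

# Hypothetical "servers" section of a scan report.
servers = [
    {"ip": "203.0.113.7", "country": "TR", "asn": 64496},
    {"ip": "203.0.113.9", "country": "DE", "asn": 64511},
]
fp = fingerprint(servers)

# Same servers in a different order: same fingerprint.
assert fingerprint(list(reversed(servers))) == fp
# One IP changes: the whole fingerprint changes.
assert fingerprint([dict(servers[0], ip="203.0.113.8"), servers[1]]) != fp
```

Re-scanning a domain and comparing the stored hash against the fresh one is then a single string comparison, which is what makes this cheap to run across millions of scans.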

As I said, I can't let this go. This one, I want to see everyone's faces. I hope you can read it all, especially the small writing. You don't need to read this; you just need to see the patterns. On the right are the fingerprints for the tech for all 31. In the middle are the fingerprints for the links, and on the left is the fingerprint for the ASN. So, for all 31: same links, same tech. For the ASN, we can see switching behavior here. These are all sorted by time, and there's no consistency; something's happening on their network that they keep switching between. There are only these two different values. And when we look at that, after we add

that new fingerprint to our scope, the difference is extremely subtle. Before I show you that: this was hashed, this was hashed, different hash. That's the difference. The only difference was an extra 40 bytes sent in those requests, for whatever reason. Okay, again, I'm at the campaign level, not the incident level. We've already got 31 as part of the campaign. We've got the IP address; we know the server they're using. I didn't fall off the stage, don't worry. For the last three months, going by ICANN, there were 30-odd million domains, and we've scanned them all. So, let's search that IP across our data set. I personally didn't know what a lot of these were until I Googled them, but

they're all financial. So, this threat actor seems to be financially motivated. The top one is a news website, the bottom one's a bank login, and the one on the right is another bank login. There was nothing else on that IP that wasn't financial. So, I'm going to make the strong assumption: financially motivated threat actor, especially targeting Binance. If we want to do our continuous monitoring, we just take these and build the query. So we could check our data set every hour. That's a lie, because the domains only come in at half seven. So we can check them every day and see if there are new hits as the days go on. This is a lot of text. I'm going to run

through it quickly with you: we pulled the root zone files from ICANN's Centralized Zone Data Service. We scanned them all with our custom URL scanner. We stored them in OpenSearch. And then, most importantly, we cared about what our business is, what's important to us, etc. From there we found an initial six domains sharing, excuse me, the same fingerprints, and then we expanded our scope. That one date expanded, in this case, from August until the start of September; one day turned into a three-week campaign with 31 domains. When we drilled into the server a bit more, this IP, we found 137 domains in total. Now, I'm not going to make the assumption that the

threat actor has 100% control over that server, because out of the 137, I don't have them here, but there was some stuff completely irrelevant to this campaign. So whether it was a home server or a shared stack, whatever, it wasn't a cloud server.
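The IP pivot and daily conmon check the speaker describes amount to a filtered search over the scan index. A minimal OpenSearch-style query builder, with hypothetical field names ("servers.ip", "scan_date") rather than the talk's actual index mapping:

```python
def pivot_query(ip, since_days=90):
    """Build an OpenSearch-style bool query matching every scan whose
    server stack contained a given IP over the last N days.

    Field names here are illustrative; adapt them to your own mapping.
    """
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"servers.ip": ip}},
                    {"range": {"scan_date": {"gte": f"now-{since_days}d/d"}}},
                ]
            }
        }
    }

q = pivot_query("203.0.113.7")
```

Run daily (after the 7:29 zone-file drop), something like `client.search(index="scans", body=q)` with an OpenSearch client would surface any new domains landing on the pivoted IP; any non-empty hit list is a candidate addition to the campaign ticket.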

I prefer finding patterns in the data and then confirming whether it's a threat, rather than looking for a threat and then trying to see what matches it. If you have a big, massive data set and you just do pattern analysis across it, then when you go, okay, these are all grouped together, look at the screenshot, like we did with Binance. Okay, it's a threat. Find correlations, then confirm it's a threat. And by design and definition, that will give you campaigns: correlated incidents. I didn't go through 76 slides already, did I? Wow. I might have got the timing wrong. These are the resources. And that's me.
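The "correlate first, confirm later" approach above boils down to grouping scans by their fingerprint tuple and reviewing anything that clusters. A toy sketch, with hypothetical record fields:

```python
from collections import defaultdict

def group_campaigns(scans):
    """Group scan records by their (ASN, links, tech) fingerprint tuple.

    Each cluster of two or more domains sharing all three fingerprints is
    a candidate campaign to review, rather than N separate incidents.
    """
    clusters = defaultdict(list)
    for scan in scans:
        key = (scan["fp_asn"], scan["fp_links"], scan["fp_tech"])
        clusters[key].append(scan["domain"])
    # Drop singletons: one lone match is an incident, not a campaign.
    return {key: domains for key, domains in clusters.items() if len(domains) > 1}

# Hypothetical scan records; fingerprint values shortened for readability.
scans = [
    {"domain": "girisyap-a.example", "fp_asn": "x1", "fp_links": "y1", "fp_tech": "z1"},
    {"domain": "girisyap-b.example", "fp_asn": "x1", "fp_links": "y1", "fp_tech": "z1"},
    {"domain": "unrelated.example",  "fp_asn": "x9", "fp_links": "y9", "fp_tech": "z9"},
]
campaigns = group_campaigns(scans)
# One cluster of two domains survives; the singleton is dropped.
```

Only after a cluster emerges would an analyst look at the screenshots to confirm it is actually malicious, which is the ordering the talk argues for.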

Thank you. >> Fantastic. I think we've got time for some questions, if that's okay. So, who would like to go first? >> All right, I'll go around the wrong way. I'll come back to you second. >> Did you see Cloudflare being a proxy in front of a lot of those domains, skewing the data set, or was that not a variable at all in the findings? >> My hearing is terrible. >> Sorry. Did you see Cloudflare being in front of a lot of these servers, kind of skewing the data, proxying? >> Oh yeah. I hate Cloudflare and I love them. It's a, excuse me, love

hate relationship. Yeah, Cloudflare. We found a way to get around it, because Cloudflare... well, that's not it, we haven't got around it. The SHA of their response when we request it is always the same: it's just a blank page. So we can filter out the Cloudflare blocking quite easily based on their response and the SHA being the same. If you look at it from the screenshot level, it's just a blank page. If we look at it from the report level, we don't get the third parties; we don't get anything. Yes, it causes us problems. In fairness, not as much as I thought. I thought everyone was going to move to Cloudflare when

all these LLM scrapers started. It's not the case. There is a lot, but yes, Cloudflare is a pain in the ass.
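Filtering Cloudflare-blocked scans by their constant response hash, as described, might look like this. The block-page hash below is a placeholder computed from a stand-in body, not a real Cloudflare value; in practice you would record the hash your own scanner actually receives when blocked:

```python
import hashlib

# Placeholder: in practice, hash the blank block page Cloudflare
# actually returns to your scanner, then filter on that value.
BLOCK_PAGE_SHA256 = hashlib.sha256(b"<html></html>").hexdigest()

def is_cloudflare_blocked(response_body: bytes) -> bool:
    """True when the response body hashes to the known block page."""
    return hashlib.sha256(response_body).hexdigest() == BLOCK_PAGE_SHA256

scans = [b"<html></html>", b"<html><title>Binance</title>...</html>"]
kept = [body for body in scans if not is_cloudflare_blocked(body)]
# Only the real page survives the filter.
```

Because the block page is byte-identical every time, a single hash comparison is enough; no content parsing or screenshot diffing is needed to discard those scans.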

Thank you for your talk. Why did you choose to get your list of domains to scan from ICANN and not from the certificate transparency logs? >> Because when something comes into the certificate transparency logs, it's just telling us that a certificate has been made. It's a strong assumption that the domain was registered the same day the certificate was made. So we'd still need a source of information to go: okay, we got randomdomain.com from the CT logs. I personally wouldn't take the bet and go, this is a new domain, unless I know. It's just certificate transparency logs. I understand that, yeah, probably 75 to 80% of them were registered today, but

we don't know that. We still need to check; it could have been registered yesterday and the certificate made today, so we're already a day off. Yeah, it's just not a true source. I've explored it a lot, but it's like: this isn't telling me it's a new domain, it's just telling me a certificate's been made for a domain. Now, I understand they correlate closely. My mind's just too autistic to accept that, so I have to know. Does that answer your question? Thank you. >> Any more questions? Oh, okay.

How do you deal with subdomains if you're only doing newly registered domains? I assume a subdomain isn't in the list. >> You said subdomains, how do I deal with them? All subdomains are found through discovery, not brute-forced or anything. So, for example, we scan google.com and we see that it contacts, I don't know, pictures.google.com. Okay, well, we've got the root domain and this is a subdomain; we'll put them together in our backend record. We're not going to be an authoritative source for "these are all the subdomains", but we can go: these are all the subdomains we found through scanning that domain. And it doesn't even have to be related. Let's say we

scan bsides.london, and that contacted a Google subdomain. Okay, we'll still take that subdomain and put it onto Google's data. So, does that make sense? It's all through discovery, all through the scanning: if any request is made to a subdomain, we'll take the domain, we'll take the sub, and we'll store it. But if you came looking for all the subdomains, no, we wouldn't have that. >> Thanks for the question. Any more? >> No? Okay. Well, let's thank you one more time. Thank you, Ste.