Stop Paying for Domain Feeds: Build Your Own Threat Intel

Name: Stop Paying for Domain Feeds: Build Your Own Threat Intel
Uploaded: 2025-12-12
Duration: 33 min 44 s

BSides Belfast · 2025 · 33:44 · 67 views · Published 2025-12

Speakers

Stephen Doyle

Tags

Category: Technical

Topic: OSINT Threat Intel Tooling

Style: Talk

Mentioned in this talk

Tools used

dnstwist, Elasticsearch

Platforms

OpenSearch

Service

URLScan

About this talk

Stephen Doyle demonstrates how to build a lightweight, open-source pipeline to monitor the web for domain-based threats without relying on expensive third-party feeds. The talk covers scanning, resolving, enriching, and analyzing domains at scale using DNS, TLS, HTTP, and sandbox tools, then shows how to derive actionable threat intelligence by identifying patterns, correlations, and campaigns in the data rather than isolated indicators.

Original YouTube description

Security teams often rely on third-party domain feeds (passive DNS, CT logs, and threat intelligence subscriptions) to monitor malicious infrastructure. But these feeds are usually lagging, noisy, and offer the same surface-level data to everyone. The assumption is that building something better is too hard. This talk challenges that assumption. It’s a practical walkthrough of how to build your own lightweight, open-source pipeline to monitor the web for domain-based threats on your terms, and without handing money or trust to vendors.

The talk covers how to scan, resolve, enrich, and analyze domains at scale using open tools and on-prem infrastructure. You’ll learn how to collect the kinds of intelligence that threat feeds miss: misconfigured web stacks, malicious redirect chains, third-party script abuse, phishing infrastructure, and newly deployed shady domains.

Topics include:
Why paid feeds are limited (and often outdated)
How to extract high-signal data directly from the web: DNS, TLS, HTTP, headers, and third-party resources
Tools and techniques to crawl and map domains yourself
Building an on-prem enrichment and storage pipeline
Real-world examples of strange, fragile, or dangerous architectures discovered in the wild

You’ll also hear lessons learnt from monitoring large parts of the web continuously, and why many domains break in surprising, insecure ways, often depending on dozens of external services just to render a login page.

It’s a talk for engineers, defenders, and curious hackers who like building things and want to see more of the web for themselves. If you've ever thought, “Why am I paying for this feed when I could just scan it myself?” then this talk is for you. The goal is simple: empower others to stop outsourcing visibility, start building smarter, and gain more meaningful threat intelligence, on your own terms.

#bsidesbelfast25 #securitybsides #bsidesbelfast #bsides

Transcript [en]

Okay, we'll get started, everyone. Um, so I'd like to introduce Stephen Doyle with Build Your Own Threat Intelligence. Thank you, Stephen. >> Thank you. >> I just gave away the first few slides. Can you hear me okay? >> Perfect. So, this is me and all my personalities. I ride motorbikes. This is me after a 10-day motorbike tour. That face is priceless. I love it. This is me and my new duck lips. This is me all the other time. I've worked in this place called CrowdStrike, or StrikeCrowd, I believe, was the last talk. Synopsys and CME Group, doing incident response, security engineering, and insider threat. I kind of got bored of working for other

people. This man, who's unfortunately not here, worked with me (don't laugh, please) in all three of these places. And today, in all honesty, would have been the first time we ever met. So, I'm kind of disappointed he stood me up again. So, I decided I'd create a logo, create a website, call myself a founder, because that's the cool thing to do these days. And if you can't tell, I've got some undiagnosed whatever. So, I'm not very commercially focused. So, I'm not a big fan of selling. So, I said, you know what? We'll call it open source. Make life easy. This is kind of the premise of our talk. I have a stick somewhere and a

chip on my shoulder when it comes to threat intelligence vendors, the procurement, the life cycle, and ultimately me being the end user having to deal with the output. What annoys me? People. Don't laugh. They're smiling. Okay. So depressing. [laughter] Um, about threat intelligence in particular. No surprise: sales, marketing, commercialization. We work in a security industry. I personally don't believe there's any room for ambiguity or confusing terms. That's not making our job easier. Or somebody coming in your ear going, "Do you do indicators of future attack?" We do IOCs. Zero-day threat intelligence. I'm going to look at Wilco here, because I kind of have my tail between my legs. Next-gen so sucks. The problem with next-gen is what happens

when the next gen comes out? Like, "Oh, last gen so sucks." Thanks, Wilco. AI-powered whatever, I don't need to do that. This one is honestly where I ruffle the most feathers. I have logic behind it. I'm actually doing a talk on it.

Why do we need to say actionable threat intelligence? And the most common argument I get here is, "Oh, well, we're just not going to action it today. We'll do it next week. We'll do it tomorrow." Okay. Strategic threat intelligence, operational, tactical. If it's not actionable, in my view, it's data, not intelligence. I don't know if you can see the part at the bottom. If you can't, don't worry. Thanks, Mom. So, what's threat intelligence? My definition: threat intelligence is actionable data. So actionable threat intelligence must be actionable actionable data. If I were to get into data versus intelligence... I never started my timer. Minus five. On the left, which a lot of you have probably seen if you do brand abuse or monitoring,

is a list of newly registered domains pulled from ICANN: 236,000. Cool. That's data, not intelligence. Here we're looking through the same newly registered domains, but now we're looking for domains that include Binance, because let's pretend we're Binance. Still not intelligence. We're scoping our data. This is actionable data, or some people call it intelligence. What is it? You have no idea, because we've done no analysis. Now, this is where the talk begins. It gets technical. I'm going to try to make sense, be consistent. It's going to pick up the pace here. We take all of the 230,000 newly registered domains, and then we very simply sandbox them. So, URLScan, I assume everyone's heard of that. Cool. I know you all have. So we'll URL

scan them all. We get new data points. We'll call this enriching our data. And then once we get all of them data points, the next question is: okay, more data, scoped data, enriched data. How do we make threat intelligence? There's questions that need to be answered. What's a threat? What's a risk to your business? What do you do? Why do you do it? How do you do it? Who do you do it to? So for example, CrowdStrike, Synopsys, CME Group. Everywhere I worked was internal. I never got a customer-facing job. Make up your own reason why. So CrowdStrike: endpoint, EDR, system level, root, more than root, sorry, system-level privileges. SolarWinds, if anyone was unfortunate

enough to deal with that. Okay. Pwn CrowdStrike, we pwn the White House. It's not a secret that it sits in the Oval Office. Synopsys is the application layer. I know there's some people probably here from Black Duck. Application layer for the semiconductor industry. I apologize if I get this wrong, but if Nvidia or Apple want to make their new chip, we're going to use Synopsys and we're going to go, that's the best way to make that chip. Put transistors on the silicon. Full stop. Again, I apologize if I got that wrong. So, okay, imagine SolarWinds, but for the semiconductor supply chain. Let's pwn Synopsys: whenever we compile that file for the foundry, do that threat actor magic and insert a

hardware backdoor. That's the risk there. CME Group: cybersecurity vendors now win fintech awards. We've threat- and risk-modelled financial companies so much, so I'm not going to get into that. But today we're Binance, so we might get into that. And here's some of the risks. Simply put, we're a consumer-facing financial app. Depending on the market and who we are, we're either heavily regulated or heavily unregulated. So, we'll run through this. So, Binance says, "We may have an interest in protecting our users." No problem. We say, "Happy days. Let's build the pipeline, find these malicious campaigns." Then we're all in agreement. Then comes the budget. No issues there. Here's your budget, mate. Get to work.

Okay, mate. Can we maybe have a bit extra? And I actually am dying for a drink. This isn't for effect. It was so for effect. Great. We got our budget. Let's not ask again. So, this is what we need for our new domains pipeline: something to get the domains, something to scan all the domains, and somewhere to store our output. CZDS, ICANN's Centralized Zone Data Service. What's a top-level domain? .org, .ai: they are your top-level domains. What comes before that, contrary to popular belief, is actually the domain name. When you put them together, you have a fully qualified domain name. So ICANN makes it really easy for all the generic and new top-level domains. They basically say

7:30 every morning, take your root zone file, throw it in here, and let the world do whatever they want with it. Happy days. Easy. We have over 1100 top-level domains here: .aaa, .aarp, .com, .net, whatever. This doesn't cover country-code top-level domains in here. As of yesterday there was 260 million domains across the 1100 top-level domains. There's about 700 million domains registered currently. The delta of that is country-code top-level domains. So big gaps, but a good place to get started. How do we get the data? It just so happens ICANN made a script. All we do is put in username and password, run the ICANN script with Python, enter. Now we have all of the domains. If we were to do this every day, we'd want to

check yesterday. What's the difference? What's new? What's removed? So now we can create: okay, these are the new domains, these are the domains that have been removed. Make life easier. Before I realized ICANN had their own Python wrapper to do this... yeah, I'd made a better one. Um, so if we go here, this is updated about 8:00 every single day, by TLD and by monitoring output. So for the monitoring output, it's broken into two directories: Fortune 500 orgs and cybersecurity vendors. Why cybersecurity vendors? So, in other words, on the left we have the domain that has been registered; on the right is the brand. So these are just similarity matching. So Citizen Bank, Street, Tesla, etc. Cool. Got the newly

registered domains. URLScan: it's no secret what it is. It's Selenium, it's ChromeDriver, it's Python, the superglue to put it together. If you want to take a picture, if it's of any value, this is URLScan in a function. So we use Selenium, which is a framework to interact with the web. And in the third line here, we can see we're importing our Chrome driver. And then very simply, in our try block: driver get URL, get the title, URL, page source. URLScan. Hi, this is BSides Belfast, ran here at the bottom, and this is the output, unedited: page title as is, BSides, resolved URL, page source. Now, one key difference I want to make here with the page source. You

go to a website, you right-click, view page source. That's the page source. This page source with Selenium is the page source with the rendered DOM. So, if you went to a React website, for example, and did view page source, you're only going to see three lines and then it points to the manifest file. This, for the React website, will be the full rendered DOM. That's why we'll do it instead of just doing a GET request. Making it better. Okay, the driver already has built-in functions: get the cookies, get the screenshot. This highlighted part at the bottom: if you were to go to developer tools, view the network, view the resources, this is just that as code.
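A minimal sketch of the scan function described above, not the speaker's actual code. It assumes Selenium 4 with a local Chrome install. The names `make_driver` and `scan` and the result keys are hypothetical, and reading resource URLs through the browser's Performance API is one assumed way to mirror the developer-tools network view.

```python
def make_driver():
    """Build a headless Chrome driver (assumes Selenium 4 and Chrome are installed)."""
    # Imported lazily so scan() can be exercised with any driver-like object.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    return webdriver.Chrome(options=options)


def scan(driver, url):
    """Load a URL in a real browser and collect what the rendered page exposes."""
    result = {"requested_url": url, "error": None}
    try:
        driver.get(url)
        result["page_title"] = driver.title
        result["resolved_url"] = driver.current_url   # URL after any redirects
        result["page_source"] = driver.page_source    # rendered DOM, per the talk
        result["cookies"] = driver.get_cookies()
        # Assumption: list loaded resources via the Performance API.
        result["resources"] = driver.execute_script(
            "return performance.getEntriesByType('resource').map(e => e.name);"
        )
    except Exception as exc:
        # A dead or hostile domain should not stop the rest of the pipeline.
        result["error"] = str(exc)
    return result
```

A typical call would be `driver = make_driver()`, then `report = scan(driver, "https://example.com")`, then `driver.quit()`. Passing the driver in keeps one browser alive across many domains and makes the function easy to test with a stub.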

Blue Peter style. Five, four, three. Too late. Scan output. Am I going to go into this? I might do.

Uh-oh. Here's one I prepared earlier. So, if you've used URLScan, it's going to be very similar. So, this is our simple function we made earlier. Used all of the built-in functions that the Chrome driver lets us do. Pulled them out, done some analysis. This is our sandbox report. So we can see we have the certificates. So this will be for BSides Belfast, which uses a lot of Google tracking. So the three and four others are Google tracking cookies, domains, servers. The one thing we do, and this is where we're going to pivot into, is we fingerprint everything. So what a fingerprint is: these are SHA-256 hashes. So if we took the cookie fingerprint for example,

it's this data object in completeness. So the first and the second, hashed. If a new cookie was to be added, the hash would change. The most valuable one when it comes to correlating campaigns, which we'll do now, is this ASN fingerprint. What the ASN fingerprint is, is this server object hashed. So four servers in here that are contacted when we contact BSides Belfast. This one here is gstatic, sits on this server, that IP, etc. I assume we know how hashes work. If any one of them servers was to all of a sudden go from United States to Russia, our hash is going to change. So we monitor that. We see stuff's changing. So we get to do that for the

servers, the cookies and a lot more.
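The fingerprinting idea can be sketched as follows. This is an illustration under assumptions, not the project's code: the talk only says the fingerprints are SHA-256 hashes of the data object, so the canonical JSON encoding, the `fingerprint` name, and the sample server records (documentation IPs and private-use ASNs) are all hypothetical.

```python
import hashlib
import json


def fingerprint(obj):
    """SHA-256 over a canonical JSON encoding of a data object."""
    # sort_keys and fixed separators make the encoding independent of key order.
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical "server object" for one scanned site.
servers = [
    {"host": "static.example.test", "ip": "203.0.113.10", "asn": "AS64500", "country": "US"},
    {"host": "www.example.test", "ip": "203.0.113.20", "asn": "AS64501", "country": "US"},
]
before = fingerprint(servers)

servers[0]["country"] = "RU"  # one contacted server changes country
after = fingerprint(servers)

print("changed:", before != after)
```

List order still affects the hash here, so a real pipeline would probably sort the server records (for example by host) before hashing if request order is not stable between scans.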

So that's our open-source URLScan built. This is the one slide for where to store data, because Elasticsearch, OpenSearch: literally the same thing, same API. I assume people have used this. If you haven't, it's very easy to get started. You can literally download it, make your indexes in your code, and just fire it at the index. Sorry, fire it at OpenSearch. It will reply going, "Oh no, the index isn't made." But it will still make it in the background. So then you do your next request: it made the index. That's why there's one slide. It's very simple. Download it, make them indexes, and the sandbox output from the previous slide, post it right into it. So here's our pipeline. Getting all the

new domains from ICANN, finding the difference. Then we're going to scan them all with our custom URL scanner. And then we're going to index them all into OpenSearch. This is what we started off with: our actionable data. How did we get there? So if we look at the Lucene query, we can see the page title Binance, tag newly registered domains. But how do we get there? So we're Binance. Let's look for Binance threat intelligence. We already know we're a financial, consumer-facing application, heavily regulated or unregulated, where a lot of illicit activity happens. That should be enough to start a premise or a working theory. This... can I divulge this? Probably not. People at CME Group will know about this. So

there's some vendors out there that will go, "Give us 10 grand a year, 15 grand a year. What do you look like? 20 grand a year." Okay. So, here's the 20 grand. You will get an email like this, although it will look like an email, not JSON. And it will go, here's the output for CME, um, a great company called ACME, a bunch of others. You've probably seen all the false positives. They get boring by now. This isn't threat intelligence. It's scoped data, because now the people in CME Group have to URLScan all of these and go: hosting, not hosting, takedown-worthy, actually a real Binance but not us. We've scanned everything. We've enriched our data.
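Because every scan is already indexed, the scoping step becomes a search rather than manual triage. A hedged sketch of such a search body is below: `query_string` is OpenSearch's Lucene-style query type, but the field names (`page_title`, `tags`, `scan_date`), the tag value, and the date are placeholders, since the talk does not show the index mapping.

```python
import json


def scoped_query(brand, tag, day, size=100):
    """Build an OpenSearch search body from a Lucene-style query string."""
    # Field names are assumptions; adjust them to the real index mapping.
    lucene = f'page_title:"{brand}" AND tags:"{tag}" AND scan_date:"{day}"'
    return {"query": {"query_string": {"query": lucene}}, "size": size}


body = scoped_query("binance", "newly_registered", "2025-09-06")
print(json.dumps(body, indent=2))
```

The resulting body would be sent to the index's `_search` endpoint. Searching on the page title rather than the domain name is what lets this catch sites whose domains never mention the brand.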

So we get benefits. So for here we're looking at page title Binance, newly registered domains, 6th of September. Source: not ICANN. It kind of is, but it's our pipeline. Still not intelligence. So on the right, this is the page title. I know. Does anyone speak Turkish, from Turkey? No. Okay, neither did I. Google Translate, for the ones on the right that are similar: this says "Login Binance TR", which I assume is Turkey. If we look at the left, traditional methods at CME Group wouldn't have caught this, because there's no reference to Binance. Now, they follow similarities: it's the .click TLD, the page title is the same. And I'm convinced, I just can't work out the algorithm yet, there's just something that the

domain name is still, I don't know, somewhat too similar. Let's explore a bit more. So in our sandbox we get script count, request count, domain count. So let's look at some of the network behavior for these. First one, second one, same. Third one, the request count changes from 24 to 25. Same again on the bottom. Okay, looks consistent enough on the behavior side. Fingerprints, which we touched on: these are all the exact same fingerprint for these six domains that reference Binance login in the page title. Now, we can see they're the same. I won't jump into what the actual JSON object is, but the network stack to load each one of these contacts the same five IPs, same

four countries, same four ASNs, and three different servers. If any one of them was to change, we would see it on this screen very simply with a different hash. Fingerprints for the links: these are the links that we've extracted from the actual website. All the same. Taking a look at the links, they're just references to CSS stylesheets. When we look at the scripts, these are different. Now, as you do loads of these, this is normal. And the reason being is because of cookies and advertisers. So by design the script's going to refresh every time, because your cookies are based on your clock time. Therefore our scripts are going to be completely different. Tech: exact same across all six, same hash.

Looking at the tech, it's not overly complicated, but we can see it uses this jQuery, this version, and this favicon. Now I personally, year to date, have scanned about 70 million websites. I looked at this and assumed, okay, well, that's definitely going to return thousands and thousands of websites. It's just a favicon and jQuery. But no, it didn't. It still only returned the ones that were included in our scope. So, summary so far: we've got all the domains from ICANN. We scanned all the domains from ICANN with the sandbox we built. We applied our logic from being a consumer-facing finance app and found six domains with no similarity in naming to Binance in the actual domain name. However, the page

title says "Login Binance", with the same fingerprints for network stack, tech stack, and the page links. I actually do want a question or an answer here. Would you call this threat intelligence? Because this is usually where the debate happens. Do I ask you why, or do I just go with no? >> Why, Jason?

Not going to repeat what he said. I'd call it intelligence. The reason being, Jason said, because it's not actionable. Good point. Where's the threat? So, we've correlated a lot of stuff. We know it says "Login Binance". We can assume stuff. Is there a threat? Can we quantify it? Okay, this is the screenshot for all six domains. Binance login themed. I may call it threat intelligence now. So, this is what we're working with: our scope, our indicators of future attack that we can use downstream. If we remember from our behavioral analysis, the script count and the domain count were the same. There's no way we can use them just on their own and expect to find the same threat actor's

behavior. So, we'll use all of these. We'll break down the scope. We'll go through each one individually and see what our results are. So, as I said, year to date, we've been scanning a lot. Using these indicators, we found 31 of the same page titles, 31 of the same tech stack, and 31 of the same links. Only 19 used the same network stack. First seen: our scope was originally just for the 6th of September for the newly registered domains. Now our campaign, as we can maybe call it, is going from the 16th of August to the 6th of September. Now, me being an analyst, I need to know why this is 19. So, if you can memorize everything on

this screen, I'm joking. On the far right, all you need to look at is the colors. These are the 31 domains. That's the fingerprint for the tech. All the same. 31 domains links, all the same. Fingerprints for the ASN network stack. If we remember it was four servers, five ASN, three countries probably here it's switching and there's no consistency. I s thought maybe okay setting up the campaign first week we'll use this network stack. Okay, no we'll change it. As you can see from the 16th of August to 6th of September I can't really see any consistency. So update our scope what we're working with comparing the two fingerprints. So these are the difference between our network

stack. So yam ya.click and refund bin.click, both phishing websites, all look the same except for the response size. That's why the hash is different. Some request returned something different. Okay. Don't have time to look into it. However, we want to expand our scope. So far we've went from six to 31. What else is on that server? Does the threat actor control it? Is there a larger campaign? Is this exclusively for Binance? So in the last three months, for newly registered domains, I've scanned 30 million. So, searching across the 30 million domains for that IP: 137 results in total. 31 we know were Binance. Nine was this ENF. I'm going to assume that translates to economy and finance.
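The pivot described here (take one server IP, pull every scan that contacted it, and see what themes show up) can be sketched with a simple tally. The record layout, the `pivot_on_ip` name, and the sample data are hypothetical; in the pipeline from the talk this would be a search against the stored scans rather than an in-memory loop.

```python
from collections import Counter


def pivot_on_ip(scans, ip, themes):
    """Count page-title themes among the scans that contacted a given IP."""
    counts = Counter()
    for record in scans:
        if ip not in record.get("contacted_ips", []):
            continue
        title = record.get("page_title", "").lower()
        # First matching theme wins; anything unmatched is bucketed as "other".
        label = next((t for t in themes if t in title), "other")
        counts[label] += 1
    return counts


# Hypothetical scan records using documentation IP ranges.
scans = [
    {"domain": "a.example", "page_title": "Login | Binance TR", "contacted_ips": ["203.0.113.7"]},
    {"domain": "b.example", "page_title": "QNB login", "contacted_ips": ["203.0.113.7", "198.51.100.2"]},
    {"domain": "c.example", "page_title": "Daily news", "contacted_ips": ["203.0.113.7"]},
    {"domain": "d.example", "page_title": "Login | Binance TR", "contacted_ips": ["198.51.100.9"]},
]
summary = pivot_on_ip(scans, "203.0.113.7", themes=["binance", "qnb"])
print(summary.most_common())
```

The "other" bucket matters as much as the matches: it is where unexpected neighbours on the same server, such as the fake news sites mentioned in the talk, would surface.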

QNB, financial bank login. Vakif Bank financial login. So about 40% of what I could discover of what was on this server. This isn't necessarily malicious, but there was nine of these fake news websites registered. I don't know if that's to promote their campaign or feed into the overall campaign, possibly, but I would say it's a financial sector threat actor. Continuous monitoring fingerprints. That's why we put it into the sandbox. It's the easiest way to see if anything changes. Has this changed? Great. Let's look what's changed. Bear with me. This is a big slide. I'm done in next five minutes. So the next slide for this is easy. It's turning this into methodology, but I'm going to go through it. So we

pulled the 1100 generic and new top-level domains from the ICANN centralized zone repository, compared them against the same thing we done yesterday, found the new, found the removed, stored them all in OpenSearch, because I love OpenSearch and it's open source, and for downstream capabilities, detection, pulling into other tools, it just makes things a lot easier. So: got our data, enriched our data, stored it in OpenSearch. We're Binance: apply our threat logic, risk logic to it. Now we're looking for correlations across the scoped data. We started with a low-hanging fruit. Hello, sorry, hit the button. Um, we started with low-hanging fruit and just started with the page title. We could have done page

references, page backlinks, Binance. So, for example, if our phishing website just referenced all the Binance resources, because they're lazy when they want to do their phishing and just steal them, we could have found it that way, but we stuck exclusively to starting our hunt with the page title. Other similarities we discovered were tech, network stack, page links. At this point, all we knew was there was a correlation in the data. Is it a threat? Okay, we looked at the screenshot, very easily confirmed it was a threat. Built up our IOCs. Our scope was stuck to the one date. Now we expanded our scope, giving us 31 more domains, set up continuous monitoring, and then we looked at the

threat actor's server and discovered he's most likely running a financial-themed campaign, not just for Binance.
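The first step of that recap, comparing today's zone data with yesterday's to find new and removed domains, can be sketched with plain set arithmetic. This is an illustration, not the ICANN script or the speaker's tool: it assumes zone-file lines whose first whitespace-separated field is the owner name, and single-label TLDs such as `click`. The function names and sample records are made up.

```python
def load_domains(zone_lines, tld):
    """Extract registered domains from zone-file lines (owner name is the first field)."""
    suffix = "." + tld
    domains = set()
    for line in zone_lines:
        line = line.strip()
        if not line or line.startswith(";"):  # skip blanks and comments
            continue
        owner = line.split()[0].rstrip(".").lower()
        if owner.endswith(suffix):
            # Keep only label + TLD, e.g. ns1.foo.click -> foo.click
            domains.add(".".join(owner.split(".")[-2:]))
    return domains


def diff_zones(yesterday, today):
    """Return the domains that appeared and disappeared between two snapshots."""
    return {"new": sorted(today - yesterday), "removed": sorted(yesterday - today)}


yesterday = load_domains(
    ["old.click.\t3600\tin\tns\tns1.host.test.", "keep.click.\t3600\tin\tns\tns1.host.test."],
    "click",
)
today = load_domains(
    ["keep.click.\t3600\tin\tns\tns1.host.test.", "ns1.fresh.click.\t3600\tin\ta\t203.0.113.5"],
    "click",
)
print(diff_zones(yesterday, today))
```

Only the "new" list needs to be fed to the scanner each day, which is what keeps the daily workload far below the full zone size.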

This is the talk in one screen. Data, scoped data, threat intelligence. I've been guilty of it. I know there's some frameworks out there to do it. It's: let's look for the threat, and then let's look for the correlation. I've personally had success with just dealing with the data, looking for the correlation in the data. Okay, we got correlated events. Then confirm if it's a threat. If we do it this way, by definition, we're going to end up with a campaign. If we were to look for singularity threat events and then take a step back and go, now let's look for the correlation... I don't like it. So the takeaway: if we can find the patterns in the data

first, and the correlations in the data, then if we confirm it's a threat, it's a campaign, not an incident, not a singularity event. And campaign thinking is definitely how you want to think about this. I think that's the last slide. Yeah, so, resources. I'll put it back. Questions, anybody? Okay. >> Yep.
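The correlation-first approach from the talk (find groups in the data, then check whether a group is a threat) can be illustrated by clustering scans on shared fingerprints. This is a sketch under assumptions: the field names `tech_fp` and `links_fp`, the `cluster` function, and the sample records are hypothetical.

```python
from collections import defaultdict


def cluster(scans, keys=("tech_fp", "links_fp"), min_size=2):
    """Group domains whose scans share identical values for every fingerprint in keys."""
    groups = defaultdict(list)
    for record in scans:
        signature = tuple(record[k] for k in keys)
        groups[signature].append(record["domain"])
    # Singletons are isolated events; only repeated signatures hint at a campaign.
    return [sorted(domains) for domains in groups.values() if len(domains) >= min_size]


scans = [
    {"domain": "one.example", "tech_fp": "aa11", "links_fp": "bb22"},
    {"domain": "two.example", "tech_fp": "aa11", "links_fp": "bb22"},
    {"domain": "three.example", "tech_fp": "aa11", "links_fp": "bb22"},
    {"domain": "lonely.example", "tech_fp": "cc33", "links_fp": "dd44"},
]
print(cluster(scans))
```

Each returned group is only a candidate. As in the talk, a human or a later check (for example the screenshot) still has to confirm whether the correlated domains are actually a threat.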

In this code? No. Do I as a researcher? Yes. Um, like homographs and them type of attacks? >> Yeah. Um, what I would tell anyone, just on a security level: do you ever use dnstwist? dnstwist. So if you Google dnstwist, put in whatever brand or domain you want, it will spin up basically the highest-fidelity list of what you should register to attack a company, hypothetically. Um, and they include all them different types of homograph attacks. And in fairness, whenever... I can't speak about this. When we done an internal team, sorry, internal tabletop in CrowdStrike, I registered one for firefox.com, but whenever it hits your email gateway or hits any logs it doesn't translate

like that. But yeah, dnstwist would be the best thing I'd recommend. >> Yep.
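To give a feel for what lookalike-domain generation involves, here is a deliberately small sketch. It is not dnstwist and does not reproduce its algorithms or its API: the `variants` function and the tiny ASCII `HOMOGLYPHS` table are illustrative assumptions, and real homograph attacks also use Unicode confusables, which this ignores.

```python
# Illustrative ASCII lookalikes only; real tools use far larger tables.
HOMOGLYPHS = {"o": ["0"], "l": ["1"], "i": ["1", "l"], "e": ["3"]}


def variants(label):
    """Generate simple lookalike labels: omissions, transpositions, homoglyph swaps."""
    out = set()
    for i, ch in enumerate(label):
        out.add(label[:i] + label[i + 1:])  # drop one character
        if i < len(label) - 1:
            # swap two adjacent characters
            out.add(label[:i] + label[i + 1] + ch + label[i + 2:])
        for sub in HOMOGLYPHS.get(ch, []):
            out.add(label[:i] + sub + label[i + 1:])  # visually similar character
    out.discard(label)
    out.discard("")
    return sorted(out)


candidates = variants("binance")
print(len(candidates), candidates[:5])
```

Checking newly registered domains against a list like this is the traditional name-similarity approach. The talk's point is that it misses sites such as the .click domains above, whose names never resemble the brand.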

Um [sighs], that means you have to sort of obey the rules. What we're doing is honestly no different than what Recorded Future do, or DomainTools. They have one stipulation, which is they'll have a lot better relationships with the country-code registries. But ICANN, yeah, I mean, their stipulation is just... I'd say the ICANN centralized zone repository generates at least 100 million in revenue from people feeding it, like Recorded Future threat intelligence. That's where they get it. So we're open source, we don't go down that line. So we just kind of take the approach: if they can do it, surely we can. I know that's not the best answer, but I've had no problems. I scrape it

every day. Their framework, or sorry, their compliance, is: make sure you've got a static IP and it always comes from the same IP address. I haven't even complied with that and I'm still okay. Good luck. Anyone else? Oh, welcome. Yeah, Wilco. How are you? Welcome. >> I'll sing >> Domain takedowns. Sorry. >> A lot of people are talking

>> I don't like AI and everything. AI, and it's a new [ __ ] verb, I'm going for it. Um, you don't need AI to do domain takedowns. That's the wrong approach. It's too static, I think. Like, you can look at the WHOIS abuse contact. You can look at the registrar abuse contact. You can send a form. I wouldn't do AI for that. The risk scoring: we delayed building the risk score into the GitHub repo for so long because I didn't feel confident making an algorithm to tell you that this is a risk, because again, what is the risk? You know, do you care about new APKs? Okay, so I've thought about that, and how I

plan to at least solve it in my way is build a mutable risk score. So whether it's open source, weighting, scripts, countries contacted, embargoed countries, you set the weighting between one and five. This is your business risk. So even if you were to scan something and it's, okay, this is a 20% risk, for our business it's an 80% risk. Downstream, how do I improve it? I wouldn't go any further than domain takedowns, honestly. So with this you get the full life cycle, from the detection of the newly registered, tracking the fingerprints or whatever else, through the continuous monitoring. Yeah, once it gets to that point of brand abuse, just automate the takedown. >> Love you. >> That us

sign of life.
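The mutable risk score mentioned in the answer above is only described in outline, so the following is one possible reading rather than the project's design: each signal gets a business-chosen weight from one to five, and the score is the weighted share of signals that fired. The signal names, the `risk_score` function, and the 0 to 100 scale are all assumptions.

```python
def risk_score(signals, weights):
    """Weighted share of triggered signals, scaled to 0-100. Weights must be 1 to 5."""
    for name, weight in weights.items():
        if not 1 <= weight <= 5:
            raise ValueError(f"weight for {name!r} must be between 1 and 5")
    total = sum(weights.values())
    hit = sum(weight for name, weight in weights.items() if signals.get(name))
    return round(100 * hit / total) if total else 0


# Hypothetical signals from one scan.
signals = {"embargoed_country": True, "new_apk": False, "brand_in_title": True}

# Same scan, two different businesses' weightings.
generic = risk_score(signals, {"embargoed_country": 1, "new_apk": 1, "brand_in_title": 1})
business = risk_score(signals, {"embargoed_country": 5, "new_apk": 1, "brand_in_title": 4})
print(generic, business)
```

The same scan scores differently depending on who is asking, which matches the point that "what is the risk?" has no single answer across organisations.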

Silent or sign? >> Okay, thank you. >> Was there another question? No. To the Pope. Yep. Sorry.

The 19 was where the network stack changed. So for the 31 that had the exact same links, same tech stack... probably use this one. So the top three are the same. The bottom one was our network stack. And our network stack was: it contacted four servers, five ASNs, whatever. So we can see for the difference, 12, I think, who knows, were different. And then when we investigated them on this screen, the furthest to the left with the rectangles around it is our network stack. So we can just see that out of the 31 domains that are part of our campaign, there were two different network stacks being used. Does that answer your question? I apologize if it doesn't.

Thank you very much, Stephen. All right.
