
Making OpenINTEL open up - Szymon | BSides Cape Town 2025

BSides Cape Town · 14:18 · 119 views · Published 2026-01 · Watch on YouTube ↗
About this talk
In this lightning talk I explore OpenINTEL from an offensive perspective. Digging into the terabytes of data made available, I examine whether it can be useful for OSINT purposes. I will also share ideas, tools and scripts that could help with handling this data set.

It's been a while since Rapid7's Open Data was closed off to anyone for the taking. The data it contained was rich for OSINT and vulnerability research purposes. I found OpenINTEL and wondered how it compared, and whether it could be as useful. In my case, it was.

First we will look at the structure of the data, focusing on forward DNS. Then we will unpack approaches to downloading and processing the data to store it in a meaningful way, and provide tips for those who are limited on storage space. I'll also share off-the-shelf tools, scripts and commands so others can replicate the process.

Next we will explore how the data could help pentesters and bug bounty hunters. The data as it stands spans a range of 10 years when you combine the various top lists together. This is a pretty long timeline for some domains, which could be helpful. For instance, consider a target fronted by a WAF: you could look at the data to try to find the origin IP. I used bug bounty targets as a test and was able to find data for about 250 targets. From that list of 250, I found that I could access the application directly for around 60 targets rather than through the WAF.

The data could also be used to perform reverse lookups. While there are many such services online, reverse lookups on certain record types are harder to come by, for instance the TXT record type. TXT records often contain ownership verification values, and administrators sometimes use the same tenant or subscription across multiple domains. With the ability to perform reverse TXT lookups, this can reveal related assets and broaden the potential attack surface.
We will also unpack CrUX and how it can be helpful for regionally significant targets. The data in top lists is somewhat skewed, since it focuses primarily on the global top one million sites as ranked by Cisco Umbrella, Tranco, etc. As a result, smaller or regionally significant targets may be underrepresented. CrUX, a recently added source, could be useful here, since it groups data by region and highlights locally popular sites, possibly making the data within OpenINTEL more relevant to your specific context.

I will conclude the talk with key takeaways and my opinion of OpenINTEL and whether it's useful. Then I will provide the relevant links and move to Q&A. The takeaways for my talk are:
* Historical data sets continue to be useful.
* OSINT shouldn't be limited to services built for it; we should keep exploring new sources.
Transcript [en]

Welcome everybody. Uh, my talk is making OpenINTEL open up. For the visually impaired: that's an open door. Oh my goodness, the darkness is bad on it. Okay, that's an open door. We're going to unpack what OpenINTEL is about and how it can help pentesters, bug bounty hunters, etc. First, who am I? You've possibly noticed a misspelling of my name. It's not a misspelling. I'm Polish and that's how you write my name. You can call me Shimon, but I don't expect you to get it right. One of the people at SensePost only found out seven years later that my name is Shimon; he thought the others were joking.

It's not. You can also say Simon. I'm a pentester on the SensePost team at Orange Cyberdefense, and I enjoy recon, apps and secops. What inspired this talk? Rapid7. I'm not endorsing them. They had a thing called Open Data, which was based off of Project Sonar, and you could easily download that data like five, six years ago. I should have listened to Willem. I didn't listen to Willem. Now you can see that they've got about 50 terabytes of data, but you can't access it; you need to submit a request. They've also changed their terms: if you're an individual researcher, don't bother submitting the request, they're not going to listen to you. If you are

a company, or you come from a company, there's going to be a commercial thing. That's what I found out. So every year I sulk and look around, and this year I found OpenINTEL. I think I improved at Googling. They've got millions, billions and trillions of data points. Sounds like the president right now. But we're not going to focus on all of that; we're only going to focus on the forward DNS data that's available. I took a screenshot; at that time it was 2.6 terabytes of data that I downloaded. Um, like skinning a cat, there are many ways to download this data. I first started off naively with a

Python script. Then Michael Roger back there did a silly bash script. And then Leon one day asked me why I didn't just do wget --mirror, and I felt really bad at that point. Like, legit, it took me an hour to recover from that. Then a month ago I looked at it again and I saw: oh hey, it's actually an S3-compatible bucket. We could just get a client and sync. So that's what I did, with rclone. We didn't have to code up any logic to fetch it; we just pointed it at the bucket and it would sync the entire thing, right? Cool. So the data is about 3.3 terabytes, 67,000 files, one for each day of each

month of each year from the active sources, which we'll touch on just now. They go and do these DNS measurements, then store the data, and it's publicly accessible. Maybe after this talk they'll take it down, I don't know. And it's stored as Parquet. Did I get it right? No. Damn. Okay, "park-whatever", guys. It's column-oriented and has about 90 columns, and there's an asterisk caveat: they started in 2016 and probably saw that they needed to expand it and make changes, etc. So now in some of the files you'll see like 95 columns. It differs depending on which year you got it, and so on.

But there's a way to get around that; it's not going to be an issue for you. The current sources that they have: they base it off the top 1 million lists from the likes of Cisco Umbrella, Tranco and Cloudflare. Alexa no longer, because Amazon stopped maintaining and publishing that data. And then Google CrUX was added this year, together with Majestic Million. Majestic Million we don't care about much in the grand scheme of things, but CrUX will get a better kind of recognition later on. Cool. How do we work with Parquet, or whatever you call it? You've got various libraries for various popular

languages, and you've also got clients that can support it. The one we're going to look at is DuckDB. It's a pretty cool tool and we're going to shine the spotlight on it. I hope you use it too. So, reading a Parquet file with DuckDB: you can just do a SQL query, give it the file name, and it will return everything to you. If you do that, don't worry, DuckDB will save you. It's going to try and spit a lot of data at you, but DuckDB truncates it. So that's pretty nice. Unless you run it with -list; then it's not going to truncate it.

Now, if you've got more than one file, which is what you're going to end up with because there are 67,000 files over there, you can do globbing, and it will support that and go and try to read everything for you. Now, I did tell you that the files differ in columns. There are certain files with 90 columns, certain files with 91, 92. When you do globbing, it's probably going to get confused: oh hey, there are some more columns here, there are fewer columns there. If you pass union_by_name=true, then it's happy; it's going to stop complaining. Oh, sorry, I double-clicked. You can also do remote files. So you can go and read

a remote file. And this is pretty nice, because again, there's a lot of data. Maybe you just want to go after TXT records, right? So you can grab the remote file and then use some SQL to say: select where the resource record is a TXT type, etc. You can store that, and the nice thing with DuckDB is you can just add -json, so you're using a SQL query to read the file but it comes back in JSON format. Then you can pipe that into any kind of JSON-compliant tools. Or, you know, if you're pretty bad with SQL but you're good at jq, then you could just pipe that

into jq and do your filtering over there. Cool. So, you know what the data is, you know how to get that data. Now, what are we going to do with this? Right. We're going to take a quick pause. Forward lookups: you use something like host, dig or nslookup to try and resolve a hostname, and in the end you should get an IP address. I say "should"; it depends on how they configured it. Like here, they did a CNAME and then an A record, and so on. So we're just focusing on that: you get an IP address at the end. I just want to establish this concept of

a forward lookup. Cool. Then in DNS you've got PTR records, or pointer records, also colloquially reverse lookups, right? So you can do host on the IP address, and there will maybe be a record configured for it. You'll see here it's not www.facebook.com but some edge-star-mini-shv-whatever, right? So it doesn't correlate, but the point is there's a PTR record that allows you to do a reverse lookup. I'm going to be talking a little bit about reverse lookups, somewhat similar to this, but not exactly this. We're going to pause again. Name servers, right? Can we do a name server reverse lookup? Can we find domains that

are configured with the same name servers? You can do host -t ns facebook.com and you'll get all the name servers for facebook.com. Okay, can we do the reverse of that? Can we look up a.ns.facebook.com? Yes, with an asterisk. There are services out there that can help you do this: ViewDNS, WhoisXML API, SecurityTrails, Host.io. Those are the four I've listed that I like to use quite often, but this isn't an exhaustive list; you can find more on the internet. Why, how do they do this? They go and collect all this data into a database, and so they will go and look it up in their database. Cool.

Which other domains have these name server records? Okay, and this is what it looks like. This is WhoisXML API. You'll get a JSON thing, and you can see: cool, accountkit.com is part of facebook.com, accountkit.net is also part of it. I don't know what these are, I didn't look further. But there's a problem with it: eventually you'll run out of API credits. The demo is pretty limited in how much you can get, right? And we don't like that. So this is where OpenINTEL comes back. You can do it against OpenINTEL and maybe get around that. You

have your own data; you can query as much as you want. So over here, when we did that lookup, we now got 259 rows back. It only shows 20 because DuckDB truncates it, but now we have 259, whereas WhoisXML API, I think, gave me the first 50 or first 100 without logging in. Cool. So it's a nice way to use that. Taking it further, and this is one of the better aspects: TXT records. Most internet-based services do not do reverse lookups on TXT records, so it's really hard to find, or maybe I'm just

bad at Googling, right? I couldn't find it yet. Why is that important? One of the good examples is verification, showing that you own a domain. You set it up with a TXT record, like when you sign up for an online account, to show them: hey, you own this. And what people like to do is they have more than one domain, so they'll put that same TXT record on all the domains. Okay, so let's do a reverse lookup on the TXT record for amazon.com. In this case we did the TS17 etc., etc. And again, we now got 54 rows back, right? So it's pretty cool.

We're able to find and expand our scope if we're enumerating a recon target, right? Another benefit: if you're a researcher looking for new targets, you can monitor the TXT records and see what other new services are emerging and what's becoming more popular, so you can focus and drill in on that, which is also pretty nice. Okay, change of direction. This data spans many years. If you have a good memory: Alexa started in 2016 and was retired in 2023; Cisco started in 2019 and is still ongoing. So historical data, right? We've got access to that. Here's an example. I

went and looked for all of GitLab's www.gitlab.com IP addresses, and here I got a whole list of them. I did truncate this, but I also highlighted the day, month and year that these were captured. Why is this important? Nowadays it's getting very popular for people to use Cloudflare or some kind of WAF. They want you to go through that, and you're pretty much stopped when you're doing scanning, etc. So you want to find a way to maybe find the origin IP address and see if they've misconfigured it, so you can hit it directly, right? And so that's what we're doing over here. We're going to see if this is

helpful, right? If this OpenINTEL data can allow us to do this. So it's time to FAFO; let's find out. I found a repository on GitHub which queries all of the bug bounty platforms for targets. I ended up with 34,282 targets from these various sources, and I found that 1,765 of them were behind Cloudflare. I only checked Cloudflare; I didn't look at things like Akamai and so on. I just wanted to see: cool, is this good or not? I found that 68 of them were directly accessible. So I found the origin IP, went and connected to it, and did a similarity check. I saw that 68 of them

were 100% matches. In this count I do not include things like 99.5%; I didn't know where to draw the line, but I wanted to show you the 100% ones. Okay. The image is supposed to show a bypass, but either my prompting is bad or LLMs are bad and they don't know where to put the security guard. It should have been in that curve; we're going around it. Cool. If OpenINTEL doesn't work out, you can also use VirusTotal. I find that to be quite a nice site for looking up historical IP addresses as well. Just going to throw it out here because it's pretty useful. But you do have the

same limitations as I mentioned earlier, like a set amount of API credits, etc. The one problem with OpenINTEL is that it's based off the top 1 million of certain DNS service providers. Why is that a problem? Let's say South Africa: its internet user base is maybe much smaller in comparison to the US or Asia, right? So our Takealot might not reach the top 1 million. So what happened this year is they released CrUX, the Chrome User Experience Report. That's a little snitch combo with the Chrome logo, because Chrome, if you set it up, can snitch to Google and tell them: oh,

hey, this user visited this site, this is the response time, etc. It's more of a performance thing, but they collect that data and it's freely accessible to anybody. There are caveats here: sites need to meet a certain threshold to be listed, and your Chrome needs to be configured to snitch on you; you need to consent. The other caveat is that it only started this year. So if you're looking for historical data, it's not going to happen now, but maybe visit that site in five years' time and you'll have five years of data collected. So you can use this, and it allows us to, you know, attack

targets that are more regional, right? So if you're here in South Africa, you could be more successful. And that's pretty much my quick lightning talk on OpenINTEL. Any questions?