
GF - WHOIS the boss? Building Your Own WHOIS Dataset for Reconnaissance

BSides Las Vegas · 23:10 · 332 views · Published 2024-09 · Watch on YouTube ↗
About this talk
Ground Floor, Tue, Aug 6, 17:30 - 17:50 CDT

When it comes to OSINT and penetration testing, WHOIS data is among the prime resources for uncovering and examining apex domains. Unfortunately, that data is typically locked up behind rate-limited systems, third-party APIs, and expensive bulk purchases. In this 20-minute technical presentation we give our experience building a 15MM+ WHOIS dataset for recon, setting up notifications on newly acquired domains by companies, the intricacies of WHOIS and RDAP, and hunting for archival WHOIS data. Finally, we will cover open source tools that currently fill in the gaps of this process.

People: Will Vandevanter
Transcript [en]

Thank you, everyone. So the title of the talk is "WHOIS the Boss?" — building your own WHOIS dataset for reconnaissance. Real quickly about me: I'm a senior staff security researcher at Sprocket Security. I always want to thank Sprocket, not just for the opportunity to come out, but to get to talk about this research project. It was a fairly small research project over the past year, but I think it has value, and I'm interested in how other people are approaching this problem. I've worked in offensive security since about 2008. It's my second time speaking at BSides Las Vegas — the last time was 2013, though, so it's been a minute.

On to the content, because I know we have 20 minutes. I think most people in this room have probably registered a domain before. When you're registering a domain, you're required to provide ownership contact information: first name, last name, address, phone number, fax number, email address — I have a Namecheap screenshot there. As part of that process, most modern registrars give you the option of applying redaction, or privacy, to what you put in there. If you decline that and somebody does a WHOIS lookup on the domain, they'll get back the ownership information, and that's by design.

If you do apply the redaction-for-privacy feature, then when somebody does a WHOIS lookup on the domain they'll get back "REDACTED FOR PRIVACY" for those fields, and usually a URL you can fill out to request access to the information, depending on the registrar. So it's very common practice to use reverse WHOIS as part of the recon process. For example, if you go to your terminal right now and type `whois bankofamerica.com`, you'll get back the information about Bank of America — as we said: name, address, phone number, email, that sort of thing. At the bottom I've boxed out domain.administrator@bankofamerica.com.

For the red teamers and pentesters in the room, I'm sure you've done it: it's very common to use a data broker service — an API call or the UI — to ask what other domains have been registered by domain.administrator@bankofamerica.com, or whoever we're doing reconnaissance on. If you do that call through, say, Whoxy, you'll get back about 400,000 domains. Typically the process from here is to filter the domains and see what's valid. In the case of this domain, Whoxy will also do historical lookups, so these could be domains that are long gone but were at one point registered by this email address. But really the idea is that we can pull out other apex domains owned by the company in scope.
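To make the lookup mechanics concrete, here's a minimal sketch of the raw port-43 WHOIS query described above, plus email extraction. One caveat the talk doesn't dwell on: for .com the registry server (whois.verisign-grs.com) is "thin" and often only refers you to the registrar's WHOIS server, where registrant contacts actually live. The regex and server choice are illustrative, not from the talk.

```python
import re
import socket

def whois_query(domain: str, server: str = "whois.verisign-grs.com") -> str:
    """Send a raw WHOIS query over TCP port 43 (the protocol is plain text)."""
    with socket.create_connection((server, 43), timeout=10) as sock:
        sock.sendall(f"{domain}\r\n".encode())
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode(errors="replace")

def extract_emails(whois_text: str) -> set[str]:
    """Pull candidate registrant emails out of a WHOIS response."""
    return set(re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", whois_text.lower()))

if __name__ == "__main__":
    # Network call -- run interactively, not at import time.
    print(extract_emails(whois_query("bankofamerica.com")))
```

The extracted emails become the seeds for the reverse-WHOIS step that follows.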

So at least some subset of those domains are also owned by that company. Again, for the red teamers, I'm sure you have stories. I'll give an example from the past few months. We do what's called continuous pentesting, and on an assessment there was a single apex domain provided by the customer. We did reverse WHOIS, found additional apex domains, and within one of those — after running the full reconnaissance pipeline: subdomain brute forcing, port scanning, identifying services — we found a JSON environment file with Azure creds hardcoded by DevOps on this adjacent apex domain. That gave us the foothold into the environment. So it's a pretty common process; I think many of you have done this before.

This is one of my favorite quotes on it, by Jason Haddix — I'm sure many of you know him; I saw him walking around the con earlier: for every new apex domain we find, we 4x our chance of hacking the target. So we have a multiplicative effect from finding a new apex domain. In my mind it's almost like the branch of a tree: we found one apex domain, now we find another one, and that whole pipeline kicks off — subdomain brute forcing, port scanning, all the things that go along with potentially gaining access — just through this additional apex domain that's found. So although there's some filtering we need to do, some process we need to go through, it's so worthwhile that we end up doing it, because it increases our chances of potentially getting in.

As I mentioned, there are data brokers — there are many of them; these are three examples. Whoxy is super popular, with really reasonable API costs for reverse lookups; you can also do regular lookups and bulk lookups, and there are lots of tools on GitHub to automate that process if that's what you want to do.
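The broker call pattern itself is simple: one paginated reverse-WHOIS query per pivot value. The sketch below uses a placeholder API — the endpoint, parameter names, and response shape are assumptions for illustration, not any broker's documented interface; check your broker's docs for the real shape.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# NOTE: placeholder endpoint -- not Whoxy's (or any broker's) real API.
API_BASE = "https://api.example-broker.com/reverse-whois"

def build_reverse_whois_url(api_key: str, email: str, page: int = 1) -> str:
    """Build a reverse-WHOIS query: 'all domains registered by this email'."""
    return f"{API_BASE}?{urlencode({'key': api_key, 'email': email, 'page': page})}"

def reverse_whois(api_key: str, email: str, page: int = 1) -> list[str]:
    """Fetch one page of results (response shape assumed for illustration)."""
    with urlopen(build_reverse_whois_url(api_key, email, page)) as resp:
        payload = json.load(resp)
    return [rec["domain_name"] for rec in payload.get("search_result", [])]
```

In practice you'd loop pages until the broker reports no more results, then feed the domains into the filtering step described above.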

SecurityTrails is another really popular one. They're more broadly focused on ASM, but within their documentation they absolutely have reverse WHOIS lookups through the API — more expensive, but excellent data. And then WhoisXMLAPI — I don't use them as much; from what I understand they're more focused on enterprise and potentially malware domains, but again, very WHOIS-focused, with lots of data.

So that lays the foundation: we've talked about the importance of reverse WHOIS, and now we're moving into the meat of the conversation, which started out with the research project of: what if we managed our own WHOIS dataset? If we're not necessarily going to use these data brokers for a bit, what does it look like to manage our own WHOIS dataset — to aggregate and collect that data? The rest of the talk is really lessons learned building out that WHOIS recon dataset, which hopefully you can take with you if you want to try it out; alerting on newly registered domains, which I'd argue is the most valuable thing about managing your own WHOIS dataset; and then some advantages of the brokers that we probably can't overcome but are worth discussing. Let me check the clock for a second — okay.

So we're building out the recon dataset for WHOIS, and the first thing we're going to do is source a lot of domains. I start with the Cisco Umbrella top 1 million. A million domains is not a lot of domains, but what I really like about this dataset is that it's valid and well organized, and it's really good for unit tests: as you build things out, you can be pretty confident those domains are solid along the way.

I tested a lot of different free domain sets, and I have to say this is probably by far the best: the tb0hdan domains project. His goal is to have the largest set of free domains on the internet, and he has 1.3 billion domains available through GitHub right now. You can bash-script it, as I did, and pull out millions of apex domains. What's also really nice about this domain set is that it's organized by TLD — so if you want to focus on .br for Brazil, or on .bank because you're into the banking TLD, it's all organized and you can go from there.

The other source is really important. I'd say the first two are historic domains, in that they depend on the last time they were updated, but we have this whole problem with our dataset: there are something like 50,000 to 400,000 domains registered every 24 hours. So there are a lot of domains consistently being added, potentially on a continuous basis, by one of our customers or whoever you're working with. WhoisDS — whoisds.com, I should say — provides a newly-registered-domain set: every 24 hours they put out a zip file, and inside the zip file is a text file with all the domains registered in the previous 24 hours. It's really helpful if you're charting domains on a daily basis. As part of this project I wrote an open source tool called whoiswatcher — we're going to come back to it a couple of times — and in there, there's a flag, --nrd.
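As a sketch of the daily ingestion that such a flag automates: the talk only says WhoisDS publishes a dated zip containing a text file of domains, so the filename pattern here is an assumption, and the extraction works on any zip of one-domain-per-line text files.

```python
import io
import zipfile
from datetime import date, timedelta

def yesterday_filename() -> str:
    """Dated archive name for the previous day (pattern assumed for illustration)."""
    return (date.today() - timedelta(days=1)).strftime("%Y-%m-%d") + ".zip"

def domains_from_zip(zip_bytes: bytes) -> list[str]:
    """Extract newly registered domains from a daily zip (one domain per line)."""
    domains = set()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            for line in zf.read(name).decode(errors="replace").splitlines():
                line = line.strip().lower()
                if line:
                    domains.add(line)
    return sorted(domains)
```

Deduping and lowercasing up front keeps the daily feed consistent with the historic sources before it hits the lookup pipeline.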

If you put that on a crontab, once a day it'll download all the domains from the previous day — you don't even need to go to the website; it does it automatically.

Okay, so we're trying to build our dataset: we have a large set of source domains, and now we need to start doing WHOIS lookups on them. Unfortunately, we can't just do that from a residential or home IP — there's IP rate limiting by certain registrars, and certain reasons you just wouldn't want to run it from home. I'll talk about three different ways to do lookups.

The first is IPv6 proxying. I didn't know about this technique until about a year ago, but when you get a VPS and enable IPv6 on it, they don't just give you one single IPv6 address — it's not like IPv4 — they give you a range of IPv6 addresses, a netblock. DigitalOcean, I think, gives you maybe 15, but if you go with a $5 plan elsewhere they'll give you a /64, which works out to 2^64 — about 18 quintillion — source IPv6 addresses that your single system can take on. Black Lantern Security has a tool called TREVORproxy which sets up a local SOCKS proxy to rotate the source IPv6: any request sent through that local SOCKS proxy gets its source IPv6 rotated as part of the process. whoiswatcher supports SOCKS proxying, so you can use it with this. It's an excellent tool and works very well. The one downside is that not all registrars support IPv6, so a certain subset of domains just can't be looked up over IPv6 — you have to fall back to IPv4 or another technique. I clocked IPv6 proxying at about 100,000 to 200,000 domains per 24 hours: pretty good, but not the best.
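The rotation idea can be sketched with the stdlib ipaddress module — picking a random address inside a routed /64 per request. TREVORproxy does the real work of binding each outbound connection to the chosen address; 2001:db8::/64 below is the documentation prefix, standing in for your VPS's block.

```python
import ipaddress
import random

def random_source_ipv6(prefix: str = "2001:db8::/64") -> ipaddress.IPv6Address:
    """Pick a random source address inside a routed /64.

    A /64 holds 2**64 (about 1.8e19) addresses, which is what makes
    per-request source rotation viable at WHOIS-lookup scale.
    """
    net = ipaddress.IPv6Network(prefix)
    offset = random.randrange(net.num_addresses)
    return net[offset]
```

Binding a socket to the chosen address only works if the whole /64 is actually routed to your host (local route configuration, or freebind-style socket options) — which is exactly the plumbing the SOCKS proxy approach hides from you.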

Next would be RDAP. There's some history here — there have been many protocols. WHOIS goes back to the 1970s and ARPANET. Elisabeth Feinler and her team were pivotal: she also helped develop DNS, and they helped develop WHOIS as part of ARPANET. Originally it was sort of a White Pages, and then in 2004 it was updated to what we look at as the current WHOIS protocol: port 43, unencrypted, human-readable rather than machine-readable. Around 2017, registries started introducing RDAP, which is a REST-based API for WHOIS lookups that returns JSON. That sounds amazing, right? Now we can just do HTTP, REST-based lookups on domains and get JSON back. Unfortunately, not all registrars support RDAP — here's the list of those that do and don't — and, like IPv6, you end up having to fall back to a different technique for mass lookups.

Probably the most effective way I've seen to do it is through serverless cloud. Again, whoiswatcher is a small Go tool, so you can easily deploy it into AWS Lambda — you can use the AWS CDK, which makes it really easy to create a function and a function URL. That was my preferred method for a long time. Then last month — I don't know if anybody saw the tool lemma; has anybody heard of it? Really interesting: it allows you to deploy offensive tools into Lambda function URLs very easily and helps you use them at scale. It's an excellent tool — defparam is the author, if you want to look it up. In there, there's an install-tools script; you can add whoiswatcher to it and it'll automatically deploy into a function URL.

Here's really the meat of this point — I probably could have started with this bullet. If you're doing it through serverless cloud with 400 concurrent invocations — which isn't a lot; most people's accounts allow a thousand concurrent invocations per region — you'll complete 1 million to 1.5 million queries per 24 hours. That's more than enough to do newly registered domains. whoiswatcher will also respect IP limitations: it has a built-in backoff, so if it detects the registrar saying "you've made too many requests," it'll back off and stop the requests. I'd also highly recommend — someone on the team pointed this out — using AWS EventBridge. If you set EventBridge to run every minute and do a modify of your Lambda function — almost doing nothing to it, just modifying it — the AWS infrastructure will redeploy it, and you receive a new source IP. So every minute you have a new source IP, from a simple modification through EventBridge. This is the command at the bottom here: cat the past 24 hours of domains, pipe into lemma with 400 invocations calling whoiswatcher, and save the results. A pretty painless way to do these mass lookups.
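What the Lambda side of this might look like, as a sketch: a function-URL handler that takes a domain and runs the same port-43 query the talk describes. This is not whoiswatcher's actual code, and the query-string field name is an assumption; function URLs deliver parameters under `queryStringParameters` in the event payload.

```python
import json
import socket

def whois_port43(domain: str, server: str) -> str:
    """Minimal port-43 WHOIS query (the same wire protocol described above)."""
    with socket.create_connection((server, 43), timeout=10) as sock:
        sock.sendall(f"{domain}\r\n".encode())
        data = b""
        while chunk := sock.recv(4096):
            data += chunk
    return data.decode(errors="replace")

def handler(event, context):
    """Lambda function-URL handler. Each fresh execution environment typically
    gets a fresh egress IP -- which is what the EventBridge redeploy trick
    exploits to rotate source addresses every minute."""
    params = (event or {}).get("queryStringParameters") or {}
    domain = params.get("domain")
    if not domain:
        return {"statusCode": 400, "body": json.dumps({"error": "domain required"})}
    raw = whois_port43(domain, "whois.verisign-grs.com")
    return {"statusCode": 200, "body": json.dumps({"domain": domain, "raw": raw})}
```

Fan-out across 400 concurrent invocations then just means firing 400 function-URL requests at a time from the driver side.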

All right: we have a set of source domains and a way to look them up; now we need an application on top of that, unless you just want to take the domains and analyze them. One of the best applications I found is alerting on newly registered domains. If you're doing red teaming, continuous pentesting, or bug bounty hunting, then as these domains come out every 24 hours you may want to be notified that a domain was created by a customer you're following. Going back to that Bank of America example, with a registrant email of domain.administrator@bankofamerica.com, we can create a YAML file — I call it a watchlist here — with a key of email: if the email contains "bankofamerica," I want to be alerted. So we take that lemma function, or however you've saved all the domains, set a watchlist, and get notified if any of those domains are of interest. Immediately we know something of interest got registered within the past 24 hours, and we can kick off the reconnaissance process from there.
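The watchlist matching itself is simple substring logic. In a real setup the rules would be loaded from the YAML file described above; they're shown inline here to keep the sketch dependency-free, and the field names are illustrative.

```python
# Inline stand-in for the YAML watchlist; field names are illustrative.
WATCHLIST = [
    {"name": "boa-registrant", "field": "email", "contains": "bankofamerica"},
]

def matches(record: dict, rule: dict) -> bool:
    """True if the rule's substring appears in the record's field (case-insensitive)."""
    value = str(record.get(rule["field"], "")).lower()
    return rule["contains"].lower() in value

def alert_on(records: list[dict]) -> list[tuple[str, str]]:
    """Return (rule name, domain) pairs for every watchlist hit in a day's records."""
    hits = []
    for rec in records:
        for rule in WATCHLIST:
            if matches(rec, rule):
                hits.append((rule["name"], rec.get("domain", "")))
    return hits
```

Run daily over the newly-registered-domain records and each hit becomes the trigger for kicking off the recon pipeline.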

Another big advantage: with the data brokers, your reverse lookups are usually scoped down to email, organization, and domain name. But when we own the pipeline, we can do reverse lookups on multiple points at once. On the left here we have tesla.com: they've redacted everything, but they have a registrant organization of DNStination. One thing you've probably seen if you're into reconnaissance is that companies tend to use the same registrar for all of their domains — and it kind of makes sense, right? You wouldn't want to register half your domains with Route 53 on AWS and half with Namecheap or something. Since they tend to use the same registrar, we can use a combination watchlist: our combo is domain contains "tesla" and registrar contains their registrar. Then we'll get alerted on other domains that are potentials — they'll require a little bit of work, but they fit into that 1% grouping that takes us over the line to maybe find a really impactful finding. These were two example domains from a run through a large dataset — I'm not sure if Tesla actually owns the "Tesla union" one, but it does match on the name along with the registrar. And obviously you can combo on other points too: zip code, address, and so on.

I can think of an assessment from the past six months where I identified a domain with a very specific name tied to one of our customers and a registrar of AWS, and that led to — it wasn't a big finding, but a WordPress site that had been stood up and a bit forgotten about. So the technique works. It has a fairly high noise level, as you might guess, but it's absolutely effective, and especially when you have the pipeline in place, it's easy.
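A combination rule is just an AND over several record fields, which is what cuts the noise compared to a single-field match. A sketch (record field names are assumptions):

```python
def matches_combo(record: dict, conditions: dict) -> bool:
    """All conditions must hold: e.g. domain contains 'tesla' AND the
    registrant org contains 'dnstination'. Case-insensitive substring match;
    a missing field simply fails its condition."""
    return all(
        needle.lower() in str(record.get(field, "")).lower()
        for field, needle in conditions.items()
    )

# Example combo mirroring the tesla.com discussion above.
TESLA_COMBO = {"domain": "tesla", "registrant_org": "dnstination"}
```

Swapping in zip code or address conditions is just another key in the dict.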

Big disadvantages — I'd definitely call a few out. Number one: the data brokers have historic WHOIS records going back well over 10 years; I think I saw 2007 for one of them, though I wasn't sure which. That's a really big advantage, because if we think of a company — say we're doing Tesla, and they've redacted — there might be a point in the past where they didn't redact that information, where that email was available, and then you can cross-reference on other historic emails and find other domains. The data brokers definitely have the advantage there. On the upside, it may be a limited number of API calls you're burning to query historic data, but overall they do have excellent data.

The second disadvantage I'd call out is the bulk purchase, which comes into play with all of this. The data brokers will sell most of this data wholesale: if you want every historic record for 50 million domains, you can buy that from some of the brokers I talked about. It's quite expensive, but you can buy it, and you can also get it as a service, obviously, through the APIs or the UIs. WhoisDS in particular, if you wanted to do newly registered domains, will sell a service where every single day, instead of just the domains, you get all of the WHOIS results. You'd have to find the pricing on the website — it's not necessarily cheap.

Next — this one is maybe a wash — you do need to manage and curate your own data. As you're building this out, you have to ask: have I been IP-rate-limited on certain domains? Do I need to recheck them? So it's really important to manage your data, but you also get a really good glimpse into how the data looks. I have seen cases where the data brokers are IP-restricted and we have better data in our own dataset, so there's definitely high value there. In terms of the size of the data, it can easily fit into ClickHouse or PostgreSQL — it's not a ton. One thing I'd also call out about whoiswatcher: rather than being human-readable, it'll give you JSON, so you can get all of your domains in JSON and stick them in, say, ClickHouse; the columns are immediately made out for you, and you can analyze the data that way if you want.
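As a sketch of how cleanly per-domain JSON records map onto columns — SQLite stands in here for the ClickHouse/PostgreSQL setup actually used, just to keep the example dependency-free, and the column names are illustrative:

```python
import json
import sqlite3

def load_records(json_lines: list[str]) -> sqlite3.Connection:
    """Load one-JSON-object-per-line WHOIS records into a queryable table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE whois (domain TEXT, registrar TEXT, email TEXT, created TEXT)"
    )
    for line in json_lines:
        rec = json.loads(line)
        conn.execute(
            "INSERT INTO whois VALUES (?, ?, ?, ?)",
            (rec.get("domain"), rec.get("registrar"), rec.get("email"), rec.get("created")),
        )
    return conn
```

Once it's in a table, the watchlist and combination queries become plain SQL over registrar, email, and creation-date columns.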

So I guess in closing, a couple of takeaway resources. This isn't online yet — I was a little sketched out by the Wi-Fi, so I didn't post it — it's a private repo, but it'll be up tonight. It's basically notes I've taken as this research has gone on: blog posts and interesting things I've seen. There's a lot of really interesting work that's been done on WHOIS over time, so it's cool to see, and I put that in there. And certainly, if you're into internet-scale scanning, I would love to talk to you — I find this stuff super fascinating: coordinating it, scaling it, thinking about it. So please come talk to me; that's part of the BSides mantra. That's all I had, so I don't know if there are any questions.

Q: Are you picking on Bank of America because they're on the other side of the wall — are they a sponsor?

A: Oh, my bad — no, I did not do that. I have noticed the banking domains a little bit; I don't know if they have a legal requirement to expose WHOIS data, but that could be part of it. But no, I absolutely wasn't picking on them.

Q: For the collection of lists of domains: ICANN has a Centralized Zone Data Service which allows you to sign up and just get a zone file for every TLD, updated daily. Did you try using that as a source for domains?

A: Here's where it got a little off-camera for me — doesn't that require being a researcher?

Q: It doesn't need to be research — they say it's available for research, education, security, law enforcement; anybody can sign up. I have an account for it. They just provide you with an FTP and you can download the zone files for every TLD daily.

A: And is it the newly registered domains?

Q: No, it's every domain registered under the TLD, along with the name servers, I believe.

A: That's what I thought, yeah — I looked into it a little bit and I wasn't sure about the terms of service; that was my only hesitation. Do you want to repeat where to sign up really quick, because it's valuable?

Q: You can go to ICANN — it's called the Centralized Zone Data Service — and sign up there; you have to provide some information. And you're right, there is a terms of service. I think they just say you can't use the information for commercial purposes, which is kind of fuzzy — I guess as long as you're not selling it, it's probably fine, but I'm not a lawyer.

A: No lawyer — appreciate it, thank you.

Q: In a similar vein, there's also OpenINTEL, which uses the certificate transparency logs to see what certificates were issued, and they provide datasets of domains in ccTLDs — domains that don't have the ICANN contract which forces publication of the data. I think they also have a Kafka stream of recently registered domains, so for everybody who's into this topic it may be a very interesting resource to explore.

A: Cool — what was the name of the resource again? OpenINTEL, okay. Yeah, absolutely — the transparency log stuff can get super interesting as well.

Q: What percentage of your lookups do you think you're missing because a vendor uses a third-party hosting provider that gets everything registered underneath them, even though it's actually managed by the main party — so the project was stood up by a third-party vendor and all the DNS records point at that vendor instead of the company that contracted them?

A: So we're talking about a Squarespace, for example? That's a really good point. What I like about the data is that we can slice and dice it a little more — the first thing that comes to mind is using two data points, so Squarespace plus some other piece of information, and maybe we can derive a little more. But percentage-wise, I'm not sure how much we're missing there. Thanks.

We're on break for about an hour, and we'll see you back here after that. Thanks — thank you.