
BG - Modern Internet-Scale Network Reconnaissance - underflow

BSides Las Vegas · 51:47 · 332 views · Published 2017-08 · Watch on YouTube ↗
About this talk
BG - Modern Internet-Scale Network Reconnaissance - underflow. Breaking Ground, BSidesLV 2017 - Tuscany Hotel - July 26, 2017
Transcript [en]

Please welcome underflow. — Thanks. First off, I'm really sorry I couldn't be there in person this week; a medical emergency came up over the weekend and I had to scrap all my plans, so I wish I was at hacker summer camp with you guys and I'm sorry I'm not, but thanks for coming all the same. This talk is kind of a roll-up of the last two years of work I've been doing in the background. I kind of quit speaking a few years ago — so this one doesn't count — but this is a chance for me to present some of the things I've been working on for a while and hopefully share some tools you guys can use as well.

I tend to have a habit of talking quickly and mumbling a lot, so I'll try to go a little slower, but if for some reason the audio gets garbled, throw something at the laptop and I'll get the point. With that, I'm going to get started. My background is doing pen testing work and vulnerability research work — kind of hacking all the things for a bunch of years, probably twenty at this point: lots of time writing exploits, writing blogs and white papers, doing research, all that kind of fun stuff. I've spent about six or seven years, off and on, doing internet scanning projects of various sorts, both for employers and as personal projects — and getting arrested, or threatened with arrest, quite often as a result.

I started a little pet project lately called Fathom 6, which is why that name is on these slides. This is a little different from your standard "here's a really cool tool you can go out and use," because it's not just a tool you can download and install — it's more of a whole platform you can build. Instead of being something like recon-ng, or one of the newer tools out there like Aquatone, this is more like a giant bucket of Legos you can pick up, play with, and build your own stuff with.

I'll walk through some of the ways I use it and some of the cool things you can do with it, but it's really going to be up to you guys whether you get any value out of the tools and the data. It does require some setup time, and it does hurt a little bit to get all the servers and everything running, but overall it's not super expensive to get rolling, and it is really valuable once you start making use of the system.

One of the things I really don't like about the way most recon works these days is that it depends on having access to third-party services. I don't like telling every third-party service on the planet — especially ones run by random hackers — that I'm testing a particular bank as part of my work, so I'm really gun-shy about using those third-party services for lookups. As a result I've been pulling all my data locally for a number of years, and I've been finding better ways to pull that data — cheaper, faster, better — that cover everything I want out of my own reconnaissance without having to either brute-force DNS for ten hours or send queries off to a thousand random sketchy websites.

So the goal of this whole thing isn't just red team work. It also helps with blue team work, monitoring, pen testing, even legal objectives. I've been using this stuff for everything from M&A all the way through to monitoring my own domains, and the company I work for, for typo squats — and everything in between. So we'll talk about some of that. There are almost shells in this talk; it doesn't quite drop shells, but it gets pretty close. Really, that's it: it's how to make your security work better by using lots and lots of local data that you can get mostly for free or cheap.

When I first submitted this talk to BSides, I did it under a random handle and didn't really expect it to get accepted — so shame on me — and I've had to bust my ass to actually get the talk and all the tools up and running in time. There are three new tools released in the last few weeks that are pretty good and that I wasn't aware of when I submitted the talk, and they're definitely worth checking out; if it looks like I'm ignoring them as I go through the rest of this, it's because they didn't exist when I started on the deck. XRay from evilsocket, Aquatone from Michael Henriksen, and Web Sight from Christopher Grayson are all pretty neat and do a lot of stuff in this space.

One of the nice things about recon tools in general is that they don't really replace each other — they're all complementary. Some tools are going to be better at some things than others: DNS discovery versus third-party port scans, for example. Getting more data is always better here; Shodan doesn't replace everything else you do, it's a great thing to bring in alongside everything else you do.

So the problem space around this, and where I really got started, is that most companies I do pen test work for don't really know what their footprint is. Raise your hands: how many of y'all have done a pen test where the customer didn't even know what they had?

The video may be frozen on my end — I'm going to assume some of you raised your hands, since I'm not getting updates. All right. One thing I often run into is that customers will say, "well, we've got these five websites and these two applications, and that's what we want you to test," and we're like, that's great, but what do you want to do about these other two /16s you have over here, and all these little tiny site-to-site VPNs and external subnets and broadband networks? And they go, "hey, where'd you get that from?" A lot of the data we used to find those subnets and networks comes from the sources I'm going to talk about today, and oftentimes that's actually what gets us in.

One of the corollaries to this is that companies who haven't been testing all their assets often do a pretty good job on the ones they know about and an absolutely terrible job on the ones they don't know about. Any time we do a pen test where they've been getting quarterly pen tests from some large consulting practice, everything's pretty much squeaky clean: there's no low-hanging fruit, there's no unpatched stuff, it's basically a few custom web apps with maybe some web app bugs, but nothing really easy to break into. We generally have to use zero-days in commercial products that we find as a result of doing the test, or we have to go find assets they didn't know about — that they didn't know they should be testing.

I think one of the more recent ones we worked on was a company with a couple of /16s, so tons of IP space to start with. There wasn't really much on those, though — they were pretty much completely firewalled off. What we did find was a whole bunch of small broadband networks, and those were a little bit difficult to identify in the first place because the ARIN data was out of date and, in some cases, the RIPE data was out of date over in Europe. We ended up finding a subnet full of Cisco routers where there was a root shell listening on a high port, and Cisco to this day has no idea why that router had a root shell on a high port — it was related to a particular FPGA piece. So that was one test where this really good company, which had been tested every quarter forever and thought they had everything locked down, still got their data center rooted through an obscure Cisco bug that we just happened to find by doing a full port scan. So oftentimes the scope isn't really clear.

There's a little bit of echo on the mic — any chance you can mute the local microphone on the laptop? Cool.

Thank you — I can't tell from here whether that fixed it, but we'll keep going.

Cool, OK. I'll wave if I can't hear you and you're saying something. So, generally, most customers don't know what their pen test scope is — you kind of have to tell them, and then you have to fight to convince them that it's actually their scope. In a couple of cases we went back and forth 20 or 30 times with the same customer: we really think these IPs belong to you; no, no, they definitely don't belong to us; well, they have your servers on them, they've got your host names on them, we can get a shell on them — they probably belong to you. Anyway, long story short, people generally don't know what they actually have, especially companies that have bought other companies, that have themselves bought other companies, and so on. Once you get to three or four levels of M&A inception, nobody who's left on the IT team has any clue what's still around.

This complicates future M&A, IT management, security testing, monitoring, blue team, red team — everything in between. There are a lot of pretty good solutions out there for discovering assets for a given company: DNSDB, Robtex, PassiveTotal, Farsight's passive DNS of course, and OpenDNS's Umbrella product, which is mostly used for incident response but does a lot of asset discovery as well.

And there are tons of really great open-source OSINT tools out there — not just new ones like Aquatone, but things like recon-ng; even Metasploit does some of this as well. You can always do manual lookups too: querying third-party web services, checking Whois, and so on. So that's really the problem space I'm trying to solve with the tools I'm talking about today: how do you find everything that belongs to a company, and how do you find the stuff that's really obscure and hard to find through any other mechanism?

The biggest challenge with how reconnaissance is done today is that it's really dependent on third-party APIs and services. You have to sign up for a thousand things, and every time you use any of these tools you're sending off the identity of your client, or information about your own company, to whoever is running those services. I don't like that, and they don't really work very well anyway. If you're trying to find a needle in a haystack — the one server that happens to belong to a company but isn't part of their primary domain — that gets really difficult; you really just can't get there. So a lot of what I like to do is sift through every hostname on the whole internet looking for a reference to a company name or a product name, even if it's not tied to that particular domain name. I get a warm fuzzy that I actually did a better job of discovering all their stuff that way, and you can't really do that with third-party APIs and tools where you're doing individual lookups.

Another challenge with these tools is that their ability to provide comprehensive results varies drastically based on where they get their data from and what technique they use to get it.

If you need to correlate data from Shodan and Farsight DNS and this source and that source — tools like recon-ng do a pretty good job of gathering data from lots of third-party sources, but when you have to merge all of it together and figure out what's authoritative and what isn't, it gets pretty complicated. These tools, whether they're doing third-party scans or regular lookups, also don't really give you a real-time picture. I don't want to pick on a particular vendor, but there's one vendor that gets used a lot for this stuff where they've got three months of data for any given company sitting in a database — which is great, except half the data is no longer useful when you go do the pen test: those web apps aren't even on the same IPs, half the servers aren't up anymore. Stuff is changing way too fast these days to depend on tools with a three-month time window on their results.

That leads to the next piece: a lot of the rapid changes to infrastructure and applications are happening because of cloud deployments and automated deployment tools like Kubernetes and Docker. The more dynamic infrastructure gets, the harder it is to actually find this stuff at a given time, at least without credentials to all the different tools people use. If you're on the DevOps team you probably have a good idea of what your DevOps team handles, but you probably have no idea what your other teams — marketing, cloud — are up to, and a lot of IT teams have no idea what their DevOps teams are doing. So that's where we are today: you've got the old-school, solid infrastructure that doesn't change very often but has squirrelly bits that are hard to find, like the broadband IPs, and then you've got the whole other pile, which is the dynamic, rapidly changing cloud infrastructure and dynamic deployment tooling.

This goes into my philosophy for this stuff: I don't like hitting lots of third-party services, I don't want to have to fight to get access to data — I want to have it all locally, all the time. Computers got fast and hard drives got big — I can get a 10-terabyte hard drive for about 300 bucks right now, and I actually have lots of them — so it's pretty cheap these days to keep a sizable amount of data locally, and if you process the data the right way it's pretty fast to look up as well. You don't need a server with hundreds of gigs of RAM just to be able to query it, and don't stick it in Postgres — seriously, don't use Postgres; it's a terrible tool for this stuff. That's a long way of saying that if you pull all these third-party data sets onto a local server and crunch the data in a way that makes it really fast to look up for your particular use case, you don't need a massive amount of resources. You can do this for a couple of bucks a month at most, you avoid leaking target information to third parties, and it actually complements all your active discovery efforts.

A good example of that: I like using things like the Sonar DNS data — the forward ANY dataset — and some of the SSL cert data that comes out of Censys.io to get a list of all the host names for a company, and the cool thing is that you end up seeing patterns in the naming. You say, OK, this is mail-01.company.com or whatever, or you start seeing different patterns based on location or naming scheme, and so I end up focusing my brute-forcing accordingly: OK, there's an -01, let's go try -02, -03, -04. Just those passive external data sets give me a feel for where I want to focus my active discovery efforts (there's a rough sketch of that below). And when you start finding weird stuff — hey, there's a strange pattern in these names and it applies to more companies than just the one I'm looking at — you have an opportunity there, because you've got all the local data to stare at. You can dig through the whole internet's trove of data and figure out: is this a common thing, is it uncommon, is it because of a particular third-party vendor and how they manage these types of assets? You can really get to the root cause of some of these weird configurations you run into, and it's fun — there's lots of weird, interesting stuff you can find digging through these giant troves of data that you just can't do if you're playing with third-party services.
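As a rough illustration of that pattern-driven approach — the file name and the example.com domain below are placeholders, not anything from my toolchain — you can take the names pulled out of the passive data sets and walk the numeric pattern with plain shell tools:

    # Hypothetical input: hostnames observed passively (Sonar FDNS, CT logs, etc.),
    # one per line, e.g. mail-01.example.com
    grep -E '^[a-z]+-[0-9]{2}\.example\.com$' observed-hostnames.txt |
    while read -r name; do
      prefix="${name%%-*}"                      # e.g. "mail" from mail-01.example.com
      for n in $(seq -w 1 20); do               # try the neighbors: -01 through -20
        candidate="${prefix}-${n}.example.com"
        ip=$(dig +short "$candidate" A | head -n1)
        [ -n "$ip" ] && echo "$candidate $ip"   # keep only the ones that resolve
      done
    done | sort -u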

So the whole goal of this thing is to build a platform you can use to do pretty much anything with. We want domain names, we want Whois, we want DNS data, TLS certs — anything we can possibly get to help us better understand a particular organization, or our own exposure, or a target's exposure. Let's get as much of it as we can in one place, instead of having it scattered all over, and start correlating between the different pieces. We want to collect data on regular intervals — whatever frequency that particular source updates at is probably best; if it's too frequent and it's a lot of bandwidth, maybe we only do it once a week. But generally the goal is: grab lots of data as it's being generated, pull it all down, correlate it, discard old stuff as needed, and don't spend a lot of money doing it. We don't want to spend ten thousand dollars a month on a passive DNS provider; we'd rather go get all the host names from the various third-party sources that are already doing the lookups. Storage is pretty cheap, computers have gotten pretty fast — even the new desktop Skylake-X parts have like 10 cores now — so a multi-core machine with 16 gigs or more of RAM is not a big deal anymore.

So on that note: if we're going to start looking for data to build up this platform, I've gone through and done the legwork, and this is where I've ended up so far. Sonar is the Rapid7 project that includes things like the forward DNS lookups — a big massive pipeline to generate those data sets — plus UDP scans, TLS, HTTP, all of that. They've got tons of data and it's all published for free on scans.io, so you can download it and play with it today. Censys.io is the output of the University of Michigan team; they're partnering with Google to store their data in a BigQuery instance. The cool thing about Censys is they have an API, a search engine, and a fancy website — but you can also download the one-terabyte raw LZ4 files that are an entire snapshot of the internet, and that's awesome, because you can download the whole thing, have your own mini Shodan you can query with zero overhead on your local box, and nobody knows what host you're querying or what domain name you're searching for. It's all right there, local. So I really suggest getting the Censys data, specifically the full IPv4 raw dumps. Those are generated daily and they're about a terabyte a pop, so it's tough to do them more than once a month or so, but they're really useful for getting a quick snapshot of a big chunk of the network at once.

One of the data sources we'll spend a lot more time on a little later is certificate transparency, or CT. CT is really cool for a bunch of reasons we'll get into, but essentially it's a giant database of all the SSL certs that have been observed on the net, as well as SSL certs that have been issued. Moving on past that, you've got CZDS, which covers the new global TLDs.

CZDS has all the zone files for those new TLDs, and it's free: you sign up with ICANN and you can download the zone files at whatever frequency makes sense. There are something like eleven hundred zone files, and each has all the newly registered domains for all the new crazy TLDs — a lot of fun, free, and easy to play with. ARIN is a little trickier: if you want the bulk ARIN data, you have to sign and snail-mail a form. It's one of the only things in this whole set that took more than a month to get signed up for, because you mail them the form, write down what you're doing with the data, they may call you back and grill you about it, and you answer all of that. I get that they're trying to keep spammers from scraping all the email addresses out of it, but in reality it's kind of a pain for something that mostly just saves them bandwidth on their Whois servers. So the bulk ARIN data is great, but it's a pain to get — a month or two of lead time.

One of the things that drives me crazy when I'm doing incident-response-type work is that when you look at an incident and try to figure out what an IP did, or what was happening at a given time and date, who actually owns that IP changes depending on the month of the year — and sometimes the day — depending on how quickly that IP space is being reassigned.

The only way I've found to get basic historical assignments for an IP block, cheap and easy, is the CAIDA prefix-to-AS databases. They do a daily dump of BGP prefixes and which providers are advertising those prefixes, so if you see a crazy IP in a log that looks like it's in Iran, or Egypt, or some other country that doesn't make sense for where you're seeing it, you can go through the historical data for the time you saw the event and see whether it was actually that country, or that particular assignment, on that date. There was an attacker I ran into a few times who was going after a customer of mine, and they were using a satellite range that normally pointed to Jordan — but for six to seven hours a day it got reassigned to Iraq, and they'd do whatever they were doing while it was in Iraq, and then the range would get swapped right back over to Jordan. It was really strange: all the standard geo-IP lists out there thought the attacker was in Jordan, but if you queried at the right time you'd see that range get reassigned, briefly, in the middle of the day, to this other country. So one of the great things about the CAIDA databases is that you can get pretty granular dumps of which org owns which AS and which prefix and so on, and those are pretty difficult to get otherwise unless you're monitoring BGP full-time. I'll show a quick sketch of pulling one of those snapshots in a second.

Just as a fun add-on, there are also the US government and UK government domain lists — different repositories where you can get a full list of all the domains that are supposed to exist. Those aren't zone dumps, but they are text files that get updated pretty frequently.
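Going back to the CAIDA prefix-to-AS data for a second, here's roughly what pulling and filtering one of those daily snapshots looks like. The URL layout and the example address are assumptions on my part — check the CAIDA site for the real path for the date you care about — and this is just a coarse string filter, not a proper longest-prefix match:

    # Hypothetical snapshot URL: CAIDA publishes dated routeviews pfx2as files;
    # the exact path below is illustrative, not guaranteed.
    URL="http://data.caida.org/datasets/routing/routeviews-prefix2as/2017/07/routeviews-rv2-20170701-1200.pfx2as.gz"
    # Each line is: <prefix> <prefix-length> <origin-AS(es)>.
    # Coarse filter: show every announced prefix starting with 203.0. and its origin AS.
    curl -s "$URL" | gunzip | awk '$1 ~ /^203\.0\./ { print $1 "/" $2, "AS" $3 }'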

And just like ARIN, you've got the regional IP registries — AFRINIC, LACNIC, RIPE, all that. Those are generally pretty useful, but the downside is that the bulk data you can grab without signing up doesn't include contact information — no Whois email addresses or full Whois records — though it's still useful for figuring out which blocks are allocated where.

Moving on a little bit, there's PremiumDrops, which is a weird kind of SEO site where they sell TLD zone dumps for com, net, info, org, biz, xxx, sk, and us. For about 25 bucks a month you get those TLD zones, downloadable once a day. There's another site that popped up recently, WWWS.io, that for nine dollars a month will give you about two hundred million registered domain names once a day. If you're just trying to do typo-squat matching, the WWWS.io one is probably the best bang for your buck; if you need the actual zone files with the NS records — the name servers linked to each domain, which is useful for a bunch of other work — you definitely want the PremiumDrops one and the CZDS data set.

Finally, if you're trying to get a cheap source of bulk domain Whois data: there's no real way to get it cheap, but what you can do is build up a new data set from scratch. The WhoisXMLAPI site will sell you a daily dump of the Whois for all newly registered domains, and as you build that up over a couple of years you end up with a pretty good picture. They also have — I think it's 250 bucks a month — a twenty-month backlog of historical Whois, so you can sign up for a month, get the backlog, and then keep pulling the recurring daily dumps. I've been using this for a year and a half, two years now, and I'm up to about 50% of all registered domain names.

For those, I've got the initial Whois registration information without having to run my own massive Whois proxies and brute-forcing and all that. Anyway, long story short: if you hit the URL at the bottom — github.com/fathom6/inetdata — you'll see a breakdown of all these sources, where to get them, and so on. Those are the primary data sources I'm going to talk about today and how to use them.

This stuff is a little bit dry before it gets fun. If you actually want to grab all of it, the quick and dirty way is to hit the fathom6 inetdata repository, clone it, modify the sample configuration file, tweak it to whatever you want, and enter API credentials for anything you signed up for from that first list. It does a lot without needing credentials — the Sonar data, for example, downloads by itself. It also needs a set of processing tools, the inetdata-parsers, which you can grab from the fathom6 inetdata-parsers repository; those are really parallel-processing friendly, so the more CPU cores you can throw at the project, the faster this stuff runs. I run it on a 16-core box with a pile of RAM and that works great; I also run it on much smaller cloud instances.

Generally the process works like this: you run the downloader first, and after the download is complete you run the normalizer. The downloader grabs whatever the raw formats are from the different providers, and the normalizer goes through and cooks those formats into something you can quickly query for the common use cases. If you just run the daily script it'll do all of that for you, and again, more RAM and more cores help. When you first start this process, you run the daily script once and wait — maybe a day or two, depending on how much you're pulling down.

Then, once that first chunk has been pulled down, you add it to your cron job to run daily. It takes a little while, but then you're bootstrapped. A lot of the raw data is great as-is — most of these data files are just big text files and compressed lists, so you can use all the standard UNIX tools: sort and grep and all that fun stuff, zgrep and so on. The challenge is when you want to say: give me all the data for this particular subnet range — this particular CIDR, like 8.8.8.0/24 — or give me every reference to a given domain, every record within this particular domain name. The cool thing about the way this is structured is that it doesn't really care whether you're asking for some subdomain or an entire TLD: it's generally fast enough that you can get all those results quickly, whether it's my own domain or bsideslv.org. So if you want to do research into a particular country's TLD, or an entire block like a /8 or bigger, the tools support that just fine.
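For the raw files, that kind of ad-hoc querying really is just the standard tools. A minimal sketch — the file name here is a placeholder, since the exact paths and formats depend on which data sets you pulled and how the normalizer wrote them out:

    # Every record that mentions bsideslv.org anywhere:
    zgrep -i 'bsideslv\.org' sonar-fdns-latest.json.gz | head

    # Every record with a value in 8.8.8.x -- a crude text match, good enough for eyeballing:
    zgrep -E '[^0-9]8\.8\.8\.[0-9]+' sonar-fdns-latest.json.gz | head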

In terms of where crunching the raw data into much faster formats really pays off, Sonar DNS is one of the big ones. It lets you do things like "give me every subdomain within a particular domain" instantly, and also "show me every domain name that points into this particular subnet." The way the inetdata-parsers tools and the inetdata package work is that they create both a forward lookup table and an inverse lookup table, so you end up with what are essentially duplicate databases keyed backwards, and that lets you find those inverse relationships between a domain name and an IP, or an IP range and a set of domains, and so on.
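The flip itself is nothing exotic — conceptually it's just swapping the key and value columns and re-sorting. Here's a toy version with a made-up file name (the real data sets obviously go through the inetdata-parsers tools):

    # forward.csv: hypothetical "hostname,ip" pairs, one per line.
    # Build the inverse table -- "ip,hostname" sorted by IP -- so a prefix search answers
    # "which names have ever pointed into this address space?"
    awk -F',' '{ print $2 "," $1 }' forward.csv | sort > inverse.csv
    grep '^8\.8\.8\.' inverse.csv    # every name that pointed at an 8.8.8.x address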

It also takes a lot of the raw data files and sorts them, and lets you correlate, for example, all the name servers for a given domain name — they get matched up in one place as part of the process. It does that for Sonar, and also for CZDS, PremiumDrops, the CT logs, and Censys.io. If you want to make a mini Shodan, it turns all of the IPv4 Censys data into one big queryable file: one massive MTBL database. If you actually want to set this thing up, the best way is to get a box with at least four cores and 16 gigs of RAM — bigger if you can afford it.

You can also take a large instance — something with 16 or 24 cores and half a terabyte of RAM — bootstrap the whole thing on that, and then downgrade it, or snapshot the image and bring it back up with much smaller resources later on. If you're using a non-local server, that's one way to get there. Me, I'm in the middle of digging a four-foot trench around my entire house to run ten-gig fiber to everything, and there's a new server being built, so that's my solution — it's a little different; I know everyone else doesn't go quite as crazy with their home IT.

I keep running out of power, so I have to run a second service line to the house too — it's been fun. I know not everyone likes doing that stuff, so if you need to use Google Cloud for this, for example, the n1-highmem instances are pretty good, and it's around 122 bucks a month to get off the ground. Once you've got the initial bootstrap data, you can always downsize to something smaller. For storage, about one terabyte is the minimum for hard drives, and you do want a fast working directory — SSD or NVMe, or a RAM disk if you've got tons of RAM — because that speeds up a lot of this significantly. As far as tool compatibility, Ubuntu 64-bit will save you a lot of trouble. You could probably port this stuff to everything from FreeBSD to whatever, but if you're lazy and just want it to work, that's the easy path. If you enable every single possible data source, including the terabyte Censys IPv4 data sets — every possible thing out there — you're probably looking at a couple of weeks to bootstrap: about three or four days on a good connection to download it all, and around two and a half weeks with four to eight cores to crunch it all into something you can actually use.

But you only do it once, at least. It is kind of a pain to get off the ground, so I'm looking at — for the data sets I'm allowed to share — putting up a BitTorrent of the processed data sets, just to help bootstrap these things. Certificate transparency in particular takes a long time to bootstrap: it took about four or five days last time and about eight hundred gigs of storage space, and the output is pretty small — I think it only ends up being about 100 gigs total of crunched data when it's done — but it needs a whole lot of temporary space to pull it all down. Anyway, if you use the lighter options, it's pretty much a couple of hours of waiting and you should be good to go. The Sonar DNS set is probably the biggest one to pull down — let's see, I've got a little instance that's been chewing away on Sonar DNS with four cores for a few hours now and it won't be done for another couple of hours. It does a full-blown dump conversion: sort, split, flip the CSV from value-key to key-value, roll up those values, and generate the MTBL tables back out. We'll get to some of those details a little later.

One of the challenges with doing any kind of large-scale data processing these days is that IOPS — I/O speed — is way more expensive to provision than CPU cores. If you want to replace ten terabytes of spinning drives in a server with ten terabytes of SSDs, that's pretty freaking expensive compared to just putting in a CPU with more cores or ganging up a lot more low-end machines. The way I've been looking at it: because IOPS are so expensive to provision, and because I don't want to spend 30 grand on flash drives — I just want my stuff to work for a couple of bucks on a cheap VPS — I end up using inline compression utilities for every step of every process in this thing. If you dig into the inetdata batch scripts that process the source files, even the sort command lines are using pigz, a parallel gzip utility, to compress the temporary files, because it's still faster to spend all your cores compressing those temp files than to write them out raw to disk — hard drives are just that slow by comparison. It's one of those weird things where the further I dug into this project, the more it made sense to compress all the data between pipeline stages rather than depend on fast storage.
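To show what that looks like — this is a generic example with made-up file names, not a line lifted from the inetdata scripts — GNU sort can be told to compress its own spill files, and you can keep every stage of a pipeline compressed with pigz:

    # Let sort compress its temporary spill files with pigz instead of writing raw temp data,
    # and keep the final output compressed as well:
    zcat huge-input.gz | sort -S 4G --parallel=8 --compress-program=pigz | pigz > sorted.gz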

As you start looking into the compression formats for this, LZ4 is one of the best ones out there today for parallel-processing support and compression ratio; bzip2 is great, same with xz, and they both have parallel utilities as well. The challenge is that those formats are not widely supported by other third-party tools: if you take a file that's been parallel-compressed with bzip2, you can't hand it to a lot of other tools or the Java parsers, because they don't support the parallel, multi-block format.

So at the end of the day you're kind of stuck using gzip as your internal format for a lot of this if you have any compatibility concerns whatsoever — unless you're willing to rewrite your whole stack, or somebody else's stack, to do big data mining or data lake stuff with it. The magic bullet I've been using for a lot of the fast lookup tables is a sorted string table implementation — MTBL — which is basically just a big key-value store library created by Farsight Security. They use MTBL for their passive DNS databases, storing massive amounts of DNS records per day. I use it to create giant key-value data stores for each of the data sets we download.

Basically, I take any data set I get and turn it into key-value pairs by structuring the keys very specifically. IP addresses, for example, I just store in their normal format, and they're easy to query that way. But for things like domain names you have to reverse them — you store them as com.something.something — so if you want to find everything within a given domain's space you look up com.domainname.* to find everything below that key, because you can only do forward prefix lookups with these key-value databases. As long as you structure your data right, you can find really creative ways to build a key that makes for really fast lookups.
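Here's a tiny, self-contained illustration of that key trick — nothing to do with MTBL itself, just plain text and grep with made-up names — showing why reversing the labels turns "everything under example.com" into a simple prefix match:

    printf 'www.example.com 93.184.216.34\nmail.example.com 93.184.216.35\nwww.example.org 93.184.216.36\n' |
    awk '{ n = split($1, l, "."); key = "";
           for (i = n; i >= 1; i--) key = key l[i] (i > 1 ? "." : "");
           print key, $2 }' | sort > keys.txt      # keys come out like com.example.www
    grep '^com\.example\.' keys.txt                # everything under example.com, one prefix scan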

But sometimes it's cheaper to create more than one database — a forward lookup and an inverse lookup — than to use something like Postgres or a standard relational database. I can tell you that when I started really diving into this a couple of years ago, nothing really scaled past a billion records on a single machine without falling over; everything I looked at was some level of crap, and while Postgres can technically get there, you're probably not going to do it on a single box you can put on your desk. One thing I like about MTBL is that after you've finally built these damn things, you can run them just fine on a laptop with 8 gigs of RAM. You can create a one-terabyte database and there's no overhead — no running daemon, nothing; the database just seeks into the index, looks things up real quick, and you're done. No running process, no maintenance, no backups — the database is a file, you copy it around, and you're good to go. One of the great things about MTBL, too, is that it supports multiple compression formats: the default is Snappy, but you can use LZ4 as well.

That ends up compressing a lot of the raw strings and data very cleanly and very small, so oftentimes the MTBL output is half the size of the source format, even when you create multiple versions of it. Now, the challenge with MTBL is that there aren't really any quick ways to query it out of the box, so I built a query tool in Go called mq — it's in the inetdata-parsers repository — and it handles all the weird differences between the key formats. It can do things like a domain lookup: if you pass it -domain something.com, it flips that key into the reverse form and looks it up in reverse order, because the domain names and host names are stored reversed, and it does that automatically for you. Same thing with CIDRs: even though internally an address is stored as text like 8.8.8.8, it breaks a CIDR up into the right prefix patterns and looks at the right keys below that. Give it a single IP and it just looks up that individual key; give it a /24 and it looks up 8.8.8.*; give it anything smaller than that — like a /27 — and it does additional filtering on the CIDR after the prefix match. That's probably too much detail, but long story short: you can use this for CIDR lookups and domain lookups against any of the data sets we're talking about, pull out just the keys, just the values, raw format or JSON format — it's a Swiss Army knife for searching all the output we'll talk about here. If you take something like the Censys.io IPv4 data set, which is probably 1.1 terabytes right now compressed with LZ4, it turns into about a 2-terabyte MTBL database, but you still only need about 8 gigs of memory to query it on a laptop.

Even on a spinning hard drive it's still pretty fast, because the index itself fits into RAM. That's a long way of saying you spend a whole lot of time getting the databases into the right format and optimizing the build process, but once you actually use them they're really fast and they work well on commodity hardware. I was worried I was going to burn through my talk too fast, and I think I'm actually a little behind, so I'll try to catch up here.

We talked about this already: if you want to grab Censys.io and make your own local version of Shodan, go grab the raw data, install the LZ4 tool, run the download shell script for the Censys IPv4 data, and convert it — it may take some time. When it's all said and done, there's an example down at the bottom of the slide showing a lookup for the 8.8.8.0/24 subnet, all at once, as a single query, so you can pull the Censys.io results for a whole subnet instantly on a local laptop without much RAM once you've baked out the MTBL file.

The things that aren't in MTBL format are generally JSONL, which is line-delimited JSON. These are kind of ugly and take up a lot of space, but they end up being a lot easier to parse than something like XML: you can grep them line by line, and you can pipe them into things like jq. So you can automatically convert all of your ARIN XML data — if you signed up for that subscription — into JSON files you can grep really quickly. If you want a list of all the email addresses for any point of contact within ARIN, you can dump them out really quickly with jq; in this case I found all the entities that had a microsoft.com point-of-contact address in the ARIN data itself. It makes it super easy to grab this stuff.
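For example — the field names below are hypothetical, since the exact layout depends on how the XML-to-JSON conversion was done, but the shape of the query is the point — pulling every org with a given point-of-contact address out of line-delimited JSON is a one-liner:

    # arin-pocs.json: one record per line (JSONL); "emails" and "org" are hypothetical field names
    jq -r 'select([.emails[]? | endswith("@microsoft.com")] | any) | .org' arin-pocs.json | sort -u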

JSON is one of the other formats used for a lot of the inetdata work, and everything else is a text file at the end of the day. If you want to import something into Excel, send it to a friend, load it into MySQL or Postgres, or stick it into Elasticsearch — almost every data file, at least the raw ones, ends up stored in plain text, which makes it really easy to go from there.

On that note, a little bit on storage. Looking at these particular data sources: ARIN, for example, is about 8 gigabytes per day, which is a lot because you're turning XML into JSONL — it could be better compressed, of course, but that's where it's at. Sonar is a bit bigger, because you've got the reverse DNS and forward DNS, the inverse and forward lookup tables, the raw data, and the CSV files the process creates. You get a ton of value out of it, but it ends up being about 200 gigs a week. You don't have to keep old ones around if you don't want to — the most recent one is usually the best one; I keep about a month or two on my hot systems. CZDS is about 1.5 gigs, PremiumDrops about 4.3, WWWS.io about 2.5. Censys is where it really balloons: about three terabytes per snapshot once you count the raw files, the MTBL files, and the intermediate artifacts, and they do daily snapshots. If you want to pick and choose which data source to download or crunch, you can pass a particular source name to the scripts, and there's an option to list all the available sources. After doing this for about two years, I'm about 30 terabytes in, with a whole bunch of NAS arrays, backups of my backups, and cloud backups of those.

But you don't have to go quite so overkill. I've set this system up for three clients now, and they were able to do it on a box with 10 terabytes for the first one, two terabytes for another, and one terabyte for the third — basically a cheap old laptop in the corner. As long as you trim your data, you don't have to go overboard: keep the most recent data sets, process them once you download them, and delete what you don't need when you're done.

So the whole point of this long diatribe about a bunch of scripts that download and process stuff is that now we actually get to do cool things with it. The goal is really fast lookups by domain name and by IP range across lots and lots of common data sets. I use this thing probably five times a day for work — whether I'm doing a pen test, helping a customer scope something out, or just doing market research — I use it all day long, every day. That was the driver for submitting a talk about it: I feel like it's useful enough that other people should get to play with it.

The easy case: if I'm trying to find all the host names in a domain, I can do an mq query with -domain against the CT files, the Sonar DNS files, PremiumDrops, WWWS.io — lots of different sources — and get the list of all the names for a given thing (rough example below). Same with IP addresses. One of the nice things about taking those raw data sources and flipping them on their head: most of these data sets start off as a host name followed by the IP address, or some other chunk of information that points to it.
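The -domain lookups mentioned above look roughly like this. I can't show the exact command lines from the slides here, so treat the flag spelling and the .mtbl file names as approximations and check the inetdata-parsers README for the real syntax:

    # Every host name seen under a target domain in the Sonar forward-DNS table (file names assumed):
    mq -domain example.com sonar-fdns.mtbl

    # Ask the same question of the CT-derived table, then merge and de-duplicate:
    { mq -domain example.com sonar-fdns.mtbl; mq -domain example.com ct-certs.mtbl; } | sort -u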

What I do with most of the data sets is keep that forward lookup, then split it, take all the values, and sort the values against the keys they're associated with. Now you can ask: what domain names have ever pointed to this IP? That reverse-lookup side ends up being really useful — you can find every domain that uses the same DNS server, every domain hosted on a particular IP address. If you want a list of all the domains hosted in a given subnet, it's super fast and works like a charm, and you can find really obscure stuff that way: third-party domains that point to the same server, old marketing domains you weren't aware of — fun stuff. The same goes for the TLS sites.

If you're doing typo and keyword matching, you can run your scripts against the PremiumDrops and WWWS.io data sets every morning, so if someone sets up a phishing site against your organization, that's a quick way to automate catching it — and those feeds tend to be really cheap to pull. If you splurged on the WhoisXMLAPI data, you can also scan all of those downloaded files for anything registered with the same email address — so if there's a sketchy domain and you want its registrant, you can find all the other domains registered with that address without handing DomainTools a thousand dollars.
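That morning typo-squat check can be as dumb as a grep over the daily domain list. A minimal sketch — the input file name is whatever your download script produces, and the pattern list is just an example:

    # newly-registered-domains.txt: one domain per line from the daily feed (name is a placeholder).
    # Brand name plus a few obvious look-alike spellings -- extend the pattern to taste:
    grep -Ei '(examplecorp|examp1ecorp|exarnplecorp|example-corp)' newly-registered-domains.txt \
      > possible-squats-$(date +%F).txt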

One of the things I really like, and the reason I keep so much of this data around, is the historical access. I want to know who owned a domain at a given time on a given day, who actually owned a given IP address, what ISP it was associated with, what services were open on it, what its reverse DNS was. That information changes so quickly that a lot of the threat intelligence tools out there do a terrible job with it — they assume that once you've seen an IP, that IP is always that IP. They don't tell you that, hey, this IP no longer belongs to the same ISP three months later. So the historical ownership data has been a really great way to tear apart some of the terrible threat intelligence reports out there, and to figure out which of the reports we're seeing are legit and which aren't — but you can't really do that unless you know the time window in which they identified the threat and you have the correlating ownership information for that same window.

And that's basically it: once you bootstrap this thing, it's just running the daily script from a cron job and adding whatever new scripts you want on top.

Monitoring, notifications, mashing data together — we could dive into all the individual scenarios, but the short version is that this stuff just ends up being super useful. So what I'd recommend is: grab an old clunky server, get the tools onto it, run it for a little while, and ping me if you have any questions about what you can do with it or how to make things. I'm going to keep working on this for the indefinite future, so I'm happy to collaborate as well.

With that, I can move on to the next thing, which is certificate transparency. I don't know how many people here have spent a lot of time on certificate transparency before; in previous small local talks I've given, it wasn't something a lot of people really knew about or had dug into much. CT is a Google project to track all the visible SSL certs — all the SSL certs on the internet. Some CAs are required to submit all of their newly issued certificates to the CT logs. CT is a whole bunch of independent log servers on the internet that people can submit new certs to, and they're a bit like a blockchain — really more of a Merkle tree, an append-only list — that you can then query and say: tell me everything you've ever seen.

It's kind of a public, write-only archive of the whole internet's CA infrastructure, and it's really useful for a lot of things. It's good for making sure CAs aren't issuing certs they shouldn't be, or selling off intermediates they shouldn't be — there have been a couple of certificate authorities kicked out of Chrome and Firefox now because of things they did that were caught by CT. CT is pretty much the enforcement mechanism for Google Chrome, and to some extent Firefox, in terms of how they decide which CAs to keep supporting. Comodo — even though they didn't have the best track record going into this — spent a whole lot of money and time building crt.sh, which is a looking-glass search engine into the certificate transparency back end. So if you just want to query this stuff without setting it all up yourself and pulling it all down, crt.sh is a great way to do it, and you can do wildcard queries there as well: query %.domainname and it'll tell you every SSL cert it has ever seen for that domain. It's a fun third-party service if you want to get a feel for what's there before jumping in with both feet.
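If you just want to poke at it, crt.sh also answers those wildcard queries over HTTP; roughly, that looks like this (the JSON field name is from memory, so double-check it against an actual response):

    # %25 is a URL-encoded "%", i.e. a wildcard: every cert ever logged for *.example.com
    curl -s 'https://crt.sh/?q=%25.example.com&output=json' | jq -r '.[].name_value' | sort -u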

The cool thing about CT is that anyone can operate a log. You can build your own log server — take the standard source code, run it, and advertise it: hey, send me your certs. There's a list of known logs out there, and a lot of them have pretty clear purposes: the Pilot log, for example, as far as I can tell is mostly fed by Let's Encrypt — they submit all of their newly issued certs to the Pilot log — so if you want to track what's being issued by Let's Encrypt, you tail the Pilot log. The known-log list gives you an idea of the purpose of each one. Each of these log servers supports a handful of API endpoints, and there's no authentication for any of it — they may reject your submission if it's not a valid cert chain or doesn't match the purpose of that particular log, but in general anyone can query any server.

One of the reasons certificate transparency matters so much is that every extended validation certificate generated by a CA must be submitted to two different CT logs, or Chrome will not honor it from that CA. So by definition, every one of those high-value sites out there with an EV cert has to be in CT, otherwise Chrome stops trusting the CA.
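Those log API endpoints are the standard RFC 6962 ones; for example, against the Pilot log (the URL is from the known-log list around the time of the talk, so substitute whatever log you care about):

    # How big is the log right now? (get-sth = "signed tree head")
    curl -s https://ct.googleapis.com/pilot/ct/v1/get-sth | jq .tree_size

    # Pull a small batch of raw entries (base64-encoded cert structures):
    curl -s 'https://ct.googleapis.com/pilot/ct/v1/get-entries?start=0&end=15' | jq '.entries | length'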

Between that and Let's Encrypt, those are the main drivers for why these CT logs are so important for discovery — and for lots of other stuff we'll get into in a second. The cool thing is you can see all the staging sites, pre-production sites, and development environments — all the stuff most people don't realize is exposed — because they're getting the cert before they turn the site on, while the site's not really ready yet. That's a recurring theme as we go through this. Just like the extended validation certs, Let's Encrypt sends all of their new certificates to the Pilot CT server, and as the market share of Let's Encrypt keeps increasing, that's going to happen more and more. As people add Let's Encrypt integrations to things like Docker, Kubernetes, Puppet — all these tools that depend on HTTPS communication or deploy servers that require it — they use Let's Encrypt because it's free, cheap, and easy to manage as long as you handle renewals properly. The great thing is that as dynamic infrastructure tools start using Let's Encrypt, they become discoverable in real time: we can identify new assets almost in real time as they're being deployed, oftentimes before they've been configured the way they should have been.

There's a quick little graph on the slide of the growth of Let's Encrypt — the little drop partway through was a change in methodology, but it's basically a hockey stick; Let's Encrypt is positioned to take over pretty much all of TLS. One axis is the number of domains and the other is the number of certs per day. If you want to play with this, one of the fun tools in the inetdata repository is the CT tail utility: you can literally just run it and it'll spew out all the certs as they're being registered, directly to your terminal, and anything you pipe it into opens up lots of other fun things.

I'll put the exact syntax in the slide notes, but essentially: grab the tool, start running it, and start graphing or grepping for whatever weird thing you want to look for. You'll want to put a bloom filter in front of it to deduplicate (or be prepared to eat a lot of RAM), but that's pretty easy — how to do that is in the slide notes too — and then feed the output directly to whatever scanning tool you want, whether you're looking for your own domain, a customer's domain, or everything on the internet.

The cool thing here is that because Let's Encrypt certs are often issued before the application itself has been fully set up, there's a fun little race condition in the timing. A lot of web apps give admin access to the first person who visits the site while it's being set up, but the site doesn't have its SSL cert yet — so what do people do? They wait on Let's Encrypt to issue the cert before they bring the app up, and when the app comes up it hasn't been configured yet. So if you can snipe these websites before they've been configured — as the cert gets submitted to the CT logs by Let's Encrypt — you can race the admin, steal their server, backdoor it, reset it, and then restore access to the server. And this works really, really well.

I'm actually running it in real time right now. Here's an example of all the sites that are deploying things — see, there's a WordPress site that was just launched and is still unconfigured, another one up here which is a Drupal site, a bunch of others in there. There are all these third-party web apps out there that are known for having lots of vulnerabilities, they all generally do a first-time setup process, and those things are easy as hell to scan for and exploit: run the CT tail tool, follow it with some quick fingerprinting — it's a big ugly nasty command line that grabs the HTTP title of every site coming out of the CT tail by running it through an HTTP title grabber — and then you just tail that output, look for the setup pages, log in, and steal the server. It works really well. There's the example WordPress box I ran into earlier — all of these demos from earlier today are just running in real time in the background as well — yep, you could just go steal it, and there are a bunch of other web apps I found an hour ago just by tailing directly from the CT logs.
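A rough approximation of that pipeline follows. The ct-tail command here is a placeholder for whatever CT tailing utility you run (the inetdata one, or anything else that emits one newly logged host name per line), and the title check is a deliberately dumb curl-plus-grep — a sketch of the idea, not the exact command line from the demo:

    # ct-tail: placeholder for a tool that streams newly logged host names, one per line
    ct-tail |
    awk '!seen[$0]++' |                        # naive in-memory de-dup (a bloom filter does this cheaper)
    while read -r host; do
      # Grab the page title: -k tolerates the brand-new cert's chain quirks, -m 5 caps the wait
      title=$(curl -skm 5 "https://$host/" | grep -oiE '<title[^>]*>[^<]*' | head -n1)
      case "$title" in
        *nstall*|*etup*|*elcome*) echo "possibly unconfigured: $host  $title" ;;
      esac
    done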

There's something definitely broken with how people roll out their web apps today, especially some of these hosts that will enable Let's Encrypt to get an SSL cert but then don't limit access to the application until it's been set up — so there's a fun little race condition you can mess with. With that I'll move on; I know we're a little short on time, so I'll go into the discovery stuff. There's a fun trick you can use with Azure: if you look under the cloudapp.azure.com domain, you can find the host names for Azure assets, so if you're trying to figure out where a particular asset is being hosted, or what kind of service it runs on, you can pull it directly out of the same certificate transparency data. There's also the Sonar RDNS data set, which has about eighty-five hundred or so of these, while the CT data set has about twenty-eight sixty-five, and they don't actually overlap that much — the way Sonar gathers its data and the way CT certs get issued are different enough that you really have to combine both to get a full view of what's going on. You can also sometimes find leaked host names under internal.cloudapp.net, which is how internal DNS in Azure works, and you can use that to attribute a given address back to the organization it belongs to.

Moving on to domain fronting: I can't see the audience, but I'm assuming some of you know what domain fronting is. It's a fun way to use respectable domain names to run your C2 traffic out and bypass most egress filtering and exfiltration-blocking tools. You can use all the MTBL tables built from the inetdata sources to find fronting candidates. Here's how to get all your Azure fronting names — it's a couple of command lines and it gives you a big old dump of all the host names you can use for it, so if you want the more obscure, less overused fronting names, you can front through Pepsi or whoever — all kinds of stuff. Same thing for CloudFront: easy to grab those, just scan the MTBL tables for cloudfront.net names and pull them out. And again, the same thing for Fastly, and you can do this for Google and AWS — I didn't finish all of those examples before I had to stop, but if you do domain fronting a lot, this is a quick way to get all your fronting names (there's a rough sketch of the idea below).
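Roughly, that harvesting step is just "find every name whose record points at the CDN domain." Here's the idea as a grep over a normalized forward-DNS dump — the file name and column layout are assumptions about how your normalized data looks, and in practice you'd run it against the MTBL tables instead:

    # fdns.csv.gz: hypothetical "hostname,record" lines from the normalized forward-DNS data.
    # Every name whose record points into CloudFront is a potential fronting domain:
    zgrep -E ',[a-z0-9.-]+\.cloudfront\.net$' fdns.csv.gz | cut -d, -f1 | sort -u > cloudfront-fronts.txt
    # Swap the pattern for fastly.net, cloudapp.azure.com, etc. to cover the other providers.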

There are other use cases too. For discovery in an M&A context, looking at the individual sources for one example domain, I got 17 names out of FDNS, 4 out of RDNS, and 10 out of certificate transparency, but combined together that's 20-some unique host names. So it usually makes sense, when you're doing discovery for M&A or as part of scoping tasks, to use multiple data sources and combine them all when you generate your initial scope. If you're looking at a domain with a lot more names — McAfee, for example — it demonstrates the same thing: there's not a whole lot of overlap right now between Sonar, certificate transparency, and the other third-party sources. I didn't pull the stats on the totals, but in general these different data sources can have totally different results, with different things present and missing, so it makes sense to combine them as much as you can.

Quick summary: having a giant local database of all these internet data sets can really improve your security work — everything from discovery to monitoring to exploitation to exfiltration. The costs are really low compared to the value you get out of it, especially compared to paying someone else to run it, and it's easy to keep your client information safe because you're not submitting it to third parties. As for the roadmap: lots more data sources, more normalizers, more real-time streaming sources in addition to CT tail, passive DNS, performance improvements, and so on. I would love help on this — if anyone's interested in contributing, drop me a line after the talk and I'm happy to collaborate on who builds what and where.

On licensing: it's MIT at the moment. Half of it is in Ruby and half in Golang right now; it'll probably all eventually be Golang at some point. You can use this to build out your own API and your own thing — monitor your company's stuff, or do some research and give a talk on it. With that, I think I've got a couple of minutes left, so I'll get some quick demos up to show you what it looks like.

This is a live CT tail: we just run the CT utility — let's see, it's streaming, yeah, there it goes; there's the full command line for CT tail — we run CT tail itself and then run that through a bloom filter, and here's basically a live feed of all the host names being registered in CT in real time. These are all the things being pushed out to the SSL infrastructure as they happen, which is awesome — it's kind of cool to watch.

A couple of other examples: say you want to find every domain on the internet where the name server itself ends in .ru — you can get basically all of Russia's name servers in one shot, and all the domains that point to them. Lots of fun little things like that. All the tools here that work on individual domains also work on entire TLDs at a time, so if you want to do internet-scale research, this is all very well set up for it; you can do small domains, big domains, lots of things with it. I think that's probably it for demos as far as time goes, so it might make sense to switch to questions. All right — HD, can you hear me? Can you hear me, HD?

So, you're covering all the footprints of a company — have you found any case where you missed a domain or an IP address because it was not captured in Sonar, or in certificate transparency, or any other place? Most of the things you're covering come through 30 terabytes of data — was there any case like that? — Yeah, sorry, there are definitely cases of that.

Thank you. — Yeah, there are definitely cases out there where things only show up in third-party databases, where there's passive DNS data that Sonar or whatever else isn't seeing — that's why you combine lots of sources, so you get a better perspective. And you still generally want to use some active discovery: if you're doing a brand-new pen test and you haven't looked at the network before, definitely grab your CT data and your Sonar DNS, and I'd also get PassiveTotal, but still do your active enumeration — your gobusters, your Fierce, your Sublist3r, all of that — because you can definitely miss things otherwise. All it takes is a couple of dropped packets for those large data sets to totally miss a domain or a netblock.

[Applause]