← All talks

Turning Domain Data into Domain Intelligence

BSidesROC · 2018 · 53:42 · 222 views · Published 2018-04 · Watch on YouTube ↗
Category: Technical
Style: Talk
About this talk
Talk Description: DNS is a locked system - you can't model the domain space at scale unless you get an AXFR from every authoritative nameserver there is, but you might be able to get a good model going if you attempted to discover and resolve all FQDNs. So, we'll do the latter! dnstrace is a volunteer-supported, free suite of software that harvests, analyzes, and visualizes the relationships between domains so we can finally turn "domain information" into "domain intelligence" for everyone. Ever wanted to generate better domain reputation so grandma doesn't get sent to the 200th .ru domain registered today that serves Flash malware? Or wanted to evaluate patterns in cybercrime at a global scale using domain data? Through big data and careful analysis, we can push the security envelope until we're ahead of the curve for the first time since Creeper.

Bio: Chris "tweedge" Partridge is a 3rd-year student working on his Bachelor's in Computing Security at RIT, a Black Hat 2017 alum, and a BSidesROC regular. He's extremely passionate about making sure he doesn't have to take any more 11pm phone calls from his family about their computers being infected, and has been putting a disproportionate amount of time into making that happen. He believes that writing and enhancing security technologies coupled with better security education can change people from "easy targets" to "not worth it." As the core author of dnstrace, he's starting to bring those dreams to life, one caffeine-fueled, music-blasting coding session at a time.
Transcript [en]

Hello everybody! Hello. Yeah, that's the only time I'm gonna use that, but I guess we'll get started, it's about 1 p.m. This talk is probably gonna go long because there's a lot to talk about, and I really hope you all have interesting questions for me at the end. Without further ado, let's get this started. So first of all, who am I? My name is Chris, I'm doing a talk on turning domain data into domain intelligence, and I'm the founder of dnstrace, which is the project that does all of this, so I'm pretty much just talking about my life for like two hours. I'm a third-year RIT student in the computing security department, which I'm

very happy to be participating in. It's an awesome place, awesome people. Shout out to four professors in particular: Robert Olson, Bill Stackpole, Garret Arcoraci, and Jonathan Weissman. I don't think any of y'all are in the building, but, you know, thank you, acknowledgments. I would applaud but there are literally none of them here, so, okay. And I'm a big geek: I run Snort on my own network, I love guacamole, and I'm a Dungeons & Dragons GM in what little spare time I have. I'm not really here presenting as an expert, and I want that acknowledged from the outset; at the end we won't really have a question-and-answer session so much as a

"questions, comments, concerns, and corrections" section. I'm coming to you as somebody who's spent a lot of time working with DNS and domain data; you're free to tell me I'm wrong about something and we'll figure it out. I have a notepad and a pen up here with me, so let's figure it out. The contents of this talk break down into seven little sections. First of all, a quick refresher (I know, I'm sorry). Second of all, the reactive threat intelligence problem: really evaluating what kind of security issues there are with turning DNS into a secure service for consumers. And then we're going to start trying to fix that problem, of course, so we're going to look into

scraping and ingesting DNS data; we're going to look at anomalies and analysis we've found so far; we're going to try to figure out if we can achieve proactive intelligence instead of just reactive intelligence (we'll talk about that in section one, technically); and then we'll talk a little bit about limitations and future improvements, and open the floor up to discussion. So this is going to be a very domain-centric talk, and yeah, everybody hates DNS. That's okay, I do too, I work with it all the time, it is not fun. So, just for the people that might be new in the room (please work, okay, thank you): if you have a link, you

have http://www.facebook.com, and www.facebook.com is a fully qualified domain name that's broken down into certain parts. facebook.com breaks down into three distinct parts: the top-level domain or suffix, in this case com; the registrable domain, which is facebook, the one Facebook actually owns; and then subdomains, the free domain space that Facebook gets for owning facebook.com. Now, DNS houses a lot of information: it tells your computer how to get to places using IPv4 and IPv6 addresses, it tells your computer canonical names (where you should actually be going instead of going here), it tells you MX records for mail, NS records for name servers, TXT records if

you really want to. You can even configure geographic data to be served by the domain name system; why, I don't know, but you can. And of course you can use a request called ALL, sorry, ANY, to try to get all of this for a given fully qualified domain name. We'll talk about that later. And how does it work? You make a request to your DNS; this is assuming a recursive query to your DNS, which is then solved by iterative queries. You send a request out to your DNS, your DNS goes to the root DNS and says, hey, you know, this guy wants to go to facebook.com, how do we find com?

You get a reply saying this is how you get to com; your DNS asks com how to get to Facebook's name server; your DNS then goes to Facebook's name server and asks how to actually get this person to www.facebook.com. It's a hierarchy, this is a hierarchical system, and at no point are any of these DNS servers obligated to expose any more information than the information you specifically requested. They're not going to tell you everything about Facebook or everything about com. The root probably will tell you all the top-level domains, but we'll talk about that later too. Anyway, this makes the system kind of closed.
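
For illustration, here's a minimal sketch of that iterative walk using the dnspython library (an assumption; this isn't the speaker's tooling, and the root-server IP and facebook.com are just examples):

```python
# A minimal sketch of iterative resolution with dnspython (pip install dnspython).
import dns.message
import dns.query
import dns.rdatatype

def iterate(qname: str, server: str = "198.41.0.4"):  # a.root-servers.net
    """Follow referrals for qname from the root down, printing each step."""
    while True:
        response = dns.query.udp(
            dns.message.make_query(qname, dns.rdatatype.A), server, timeout=5)
        if response.answer:                       # we reached an answer
            for rrset in response.answer:
                print(rrset)
            return
        # Otherwise this is a referral: the authority section names the next
        # zone's nameservers, and the additional section may carry glue IPs.
        ns = response.authority[0][0].target.to_text()
        glue = [rr.address for rrset in response.additional
                for rr in rrset if rr.rdtype == dns.rdatatype.A]
        print(f"{server} referred us to {ns} for {qname}")
        if not glue:                              # a real resolver would now
            raise RuntimeError(f"no glue; resolve {ns} first")
        server = glue[0]

iterate("www.facebook.com.")
```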

Since there's this hierarchy and nobody's obligated to tell you anything you haven't already figured something out about (a fully qualified domain name, for example), it's hard to get data out. Now let's talk about how DNS security works for the consumer. If a domain is seen doing something bad, like distributing malware, what kind of security intelligence do we have there to protect the end user? We have some real problems. First of all, let's recognize that reactive security is not great security. Consider a file-signature-based antivirus, one that's just 100% based on signatures. A malware author generates some malware, the antivirus

authors catch a copy of that malware, they analyze it, they get a signature, and they publish that signature to all other users of their product. In the meantime the malware author has generated 5, 10, 50 more files with different signatures by making small changes to the code or using polymorphic code. So it's really, really easy to circumvent, and it's purely reactive, which makes it difficult and costly to maintain: for every new file that comes in you have to develop a new signature and publish it, and in that time your users are getting infected all over the place. This is the kind of status that anti-malware was at in 2000, 2005. It's not

great. It's better than nothing, don't get me wrong, but it's not great. And things did get better for anti-malware programs, a lot better: we now have heuristic analysis. But do we have that for the domain space? Here's a quote from Symantec: "An IP address earns a negative reputation when Symantec detects suspicious activity, such as spam or viruses, originating from that address." That sounds like purely reactive security to me. Now, when do you think this was published? If you guessed last night, the answer wouldn't have surprised you. This is the kind of status a lot of security companies are at when looking at the domain space and trying to proactively

block threats at a network level. Here's a little bit better: "If the IP address is changed frequently, and if the site has an IP address that was hosting malicious content in the past, it can result in a poor web reputation." This is from Cisco Talos, and it's actually a really great step forward, because they're starting to use heuristics: they've noticed that malicious domains tend to change IP addresses a lot more frequently than non-malicious domains do. That is a good start. Domain safety heuristics are available and they do help: you can look at how recent the registration was, because some malware authors tend to burn through a lot of

domains very, very quickly; you can look for frequent address changes, as Talos does; you can get estimates on popularity if you have that kind of footprint on the internet, which, me being a college student, we of course do not. So what can we do to improve this? Consider the following: there exist relational learning systems, systems that can predict unforeseen or future relationships between entities based on past observations, and you can see really good examples of them down here. They tend to be directed or undirected graphs with a very set structure, and they're typically very interconnected. Now, we can't really do this for malware. Malware has too many dimensions, it can be

too many things, you can't classify it as easily as other data. Malware X does 50 different things and malware Y might do 200 different things, and just classifying that much is going to be a huge investment. And then what are you gonna get out of it? Not a huge amount, unfortunately. But domain names? Well, domains have a very fixed structure, very strongly ingrained into the domain name system. There are only a certain number of things a domain can be: it resolves to certain IPs, resolves to certain other domains, it has certain records with certain contents, and that's pretty much it. So what if we try

to apply relational learning to the domain space? Well, we're going to need a lot more than just, say, a list of some domains. You need a huge quantity of parsed domain data: you need fully qualified domain names, you need subdomains, you need pretty much all TLDs, and you need everything those resolve to as well, because you have to build the interconnected graphs. You're not going to be able to connect facebook.com to yahoo.com without an IP in between. Some collect this passively; there are big repositories of domain data, for example DNSDB, which is run by Farsight. We don't

have, again, the resources to do that. We can't establish a passive tap where we get, you know, 2.7 billion data points per day just by observing the DNS traffic that goes through the world. We have to acquire this kind of data actively: we have to perform the queries, we have to figure out what goes where on our own. And we need as much threat intelligence as possible to go with that. It doesn't do you any good to notice that yahoo.com and malware.ru go to the same place if you don't realize that malware.ru is malicious; you need to be able to figure that out so you can feed this information

into the relational intelligence systems we create. And then of course you need to make it responsive, because it's not very useful to have a graph database that takes, you know, 30 minutes to render a small graph of upwards of 500 but less than 1,000 nodes. And then you need real methods of analysis. You can't just say, well, this touches one other malicious thing, it's probably bad, or you'll see a lot of stuff get flagged. You're going to have a lot of things going to certain places; for example, if you're looking at a hosting provider, there could be a ton of domains pointed at one IP, and if one of those domains is malicious and you

automatically flag everything else, you're going to have a huge false positive problem. So you need to figure out, in a quantifiable manner, how you're going to evaluate these graphs to generate real intelligence that's not just "okay, well, it's somewhere near something malicious, so I'll flag it." So let's get into it. The first thing we need to do is acquire a list of TLDs. Well, you can go to data.iana.org and get a list of all TLDs. Congratulations, that's the easiest part of this entire talk. Currently there are 1,543 top-level domains active, and the list is available over HTTP, and by FTP if you so desire. Great, okay, this wasn't really a challenge.
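
A quick sketch of grabbing that list; the URL is IANA's published TLD list on data.iana.org, and this uses only the standard library:

```python
# Fetch the current TLD list from IANA's well-known URL.
import urllib.request

URL = "https://data.iana.org/TLD/tlds-alpha-by-domain.txt"
with urllib.request.urlopen(URL) as resp:
    lines = resp.read().decode().splitlines()
tlds = [l.strip().lower() for l in lines if l and not l.startswith("#")]
print(len(tlds), "TLDs")
```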

Now, acquiring domains is going to be a little bit harder. Each top-level domain has a zone file, and that zone file contains a list of all the domains and all of the name servers for those domains. You can buy access to zone files through certain providers for about 300 dollars per year, which seems like a reasonable price for anybody who isn't a student and stone broke; that's my ramen money, so we're not going to do that. You could alternatively request access to zone files from the registries directly, pretty much circumventing the middleman that aggregates the zone data, going straight to the registry and saying, hey, I'd like to see your dot-com zone

file, please. And a lot of them will say yes, that's fine, as long as you're using it for real problem-solving: are you doing valid research, are you an academic? A lot of them will say great. Some of them won't, such as Yahoo, Zappos, etc., which are all on their own top-level domains and will not let me touch them. And if you want to do this kind of thing in bulk, you can go to ICANN CZDS, which is the Centralized Zone Data Service (my apologies), and you can actually bulk-request, I think it's 500 or more zone files, with like a single form. Now we get into the real problems. So we've

gone down the hierarchy from the root, to top-level domains, to domains. Now we have to figure out, with all of those domains, how do we get the subdomains? And I'll give you a hint: people requesting access to their zone files are not so kindly regarded. Unfortunately that's just how it is for a lot of companies. Facebook doesn't see the real value in exposing the facebook.com zone file to me, a researcher, even if it's for a good purpose; they like to keep stuff private, and that's fine. We're just going to discover things anyway. Now, there are certain things that are gonna be out of scope. I'm just gonna talk

about how you could do this if you wanted to, why we're not going to do that, and give you some examples of tools you could use. There are going to be other things that are in scope that we'll actually do demos of; I have the demos pre-recorded because I don't trust the Wi-Fi, so they're really nice GIFs, I promise. So first of all, you could try brute-force lookups: you go to facebook.com, you point dig at it, and you write a quick little script to generate every possible subdomain from a to zzzzzzz. Absolutely not. That is a great way to have somebody knocking on your door asking who the

legal resident is, because you're being served papers. You know, a good lawyer can make this look like a denial-of-service attack, and if you tried to do it at scale, which we are going to try to do, a good lawyer, actually a bad lawyer, could make it look like a distributed denial-of-service attack. You don't want to do this, and if this is the first thing on your list to try: shut down, y'know, take a breather, get a Monster, you can figure out a better method. A good way of doing this in normal engagements would be to crawl a website. If the domain you're investigating has a website online anywhere that you can see, you

might as well just go on there and follow as many links as possible. It's fairly complex and resource-intensive at scale if you're rendering the entire document, but it's definitely a possibility, and you can do this with skipfish, WebScarab, etc. There are a lot of great options, and I do recommend it for your own engagements, but not here. We could also look things up in search engines and passive DNS, and again, this is great for real-life engagements. I actually strongly recommend it as a starting point to identify possible areas of scope for any engagement you might be going into, and you can use Sublist3r,

which is built into Kali, for this. It's phenomenal: just give it a domain and it will find everything from Google, Yahoo, Bing, passive DNS, etc. But this depends on external services, and those external services are going to blacklist us if all of a sudden we have a thousand queries per second hitting them. So that's out. Now let's start to evaluate what might be in scope. Brute force wasn't a great idea, but what if you did a probabilistic lookup? There are certain domains and subdomains that tend to show up more frequently than others, for example www or mail, things like that. So if you perform lookups on things that you suspect are more likely

to be present than not, that can really cut down the search space, and you can combine that with anything you've generated, say a wordlist off the website you just scraped, to cut it down even further. You can actually enumerate my site, using some software that I wrote for Robert Olson's class, with 596 queries total, just based off Rapid7's forward DNS dataset. You parse Rapid7's data down (I actually provide this in the GitHub repo) to the most common subdomains: you count them, you sort them by frequency, and then you run through until you run out.
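
A hedged sketch of that counting-and-sorting approach, assuming dnspython and a one-FQDN-per-line input file; the speaker's actual tool lives in his GitHub repo and this is deliberately naive:

```python
# Probabilistic subdomain enumeration sketch: rank leftmost labels seen in a
# forward-DNS dump by frequency, then try the likeliest ones first.
import collections
import dns.exception
import dns.resolver

def top_labels(fqdn_file: str, n: int = 596) -> list[str]:
    """Count leftmost labels across known FQDNs, most common first."""
    counts = collections.Counter()
    with open(fqdn_file) as f:
        for line in f:
            labels = line.strip().split(".")
            if len(labels) > 2:  # has at least one subdomain label (naive)
                counts[labels[0]] += 1
    return [label for label, _ in counts.most_common(n)]

def enumerate_domain(domain: str, wordlist: list[str]) -> list[str]:
    """Try likely subdomains; most guesses simply won't exist."""
    found = []
    for label in wordlist:
        name = f"{label}.{domain}"
        try:
            dns.resolver.resolve(name, "A")
            found.append(name)
        except dns.resolver.NXDOMAIN:
            pass  # expected for most guesses
        except dns.exception.DNSException:
            pass  # timeouts, SERVFAIL, etc.
    return found
```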

So here's that software running in default mode with the default list. I'm just saying, hey, we're going to enumerate this domain using A records, then you give it a couple seconds... there we go. That's all of the subdomains that were present for my domain at the time, enumerated in 596 queries exactly. It takes about 10 seconds; it's a single-threaded program I wrote in like an hour, and you could write something much better to get that time much, much farther down. Now, you can also do reverse DNS, which is a simple little dig query, just the -x flag, and it's useful for IPv4 addresses: you take an IPv4

address and say, hey, domain name service, what domain goes here? But unfortunately that usually ends up in ISP-assigned names, not really anything of use. You know, `dig +short -x 121.73.62.2` ends up with something.cable.telstraclear.net; it doesn't tell us anything. So it's something we're not going to not use, but it's not going to give us a huge amount of data. Now let's talk for a moment about DNSSEC. DNSSEC is a set of security extensions for DNS, and it provides three main things: origin authentication, so you know the reply is coming from a name server of authority; it

provides data integrity, so you know the reply you're getting is actually the correct reply, there have been no errors in transit, etc.; and you also get authenticated denial of existence. Denial of existence... well, how do they make that work? What they actually ended up doing for denial of existence is they implemented a record called NSEC, and NSEC is pretty much the "next secure" record, which in a lot of cases tends to be the next record that exists. So if you request test.example.com, which clearly does not exist, the name server will reply that the next secure record is www.example.com. Oh boy, that's phenomenal. And you can test things out right

here, and if you look very closely you'll see `IN NSEC www.example.com`; you see exactly what the query is going to look like. Then you can just push on that: query for the next thing, which you know probably doesn't exist, and either you get a response (cool, it exists) or you don't get a response and you get the next secure record, and then you have the next target to enumerate. It's awesome. And you can do this with tools like nsec3map, which for a proof-of-concept tool has a lot of really nice options. It's available on GitHub; it's not mine, but it's available on GitHub and it's

awesome. We're going to use, per se, NSEC to enumerate, and would you look at that, there's all of the subdomains for PayPal. That'll take, I don't know, two more seconds... one Mississippi, two Mississippi... okay, it's closer to three. But anyway, with a mere 689 queries we've now enumerated the entire subdomain space of PayPal.
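
A rough sketch of what a walk like the one nsec3map performs might look like with dnspython; this assumes a signed zone, an authoritative server IP, and straightforward NSEC responses, where real tools handle far more edge cases:

```python
# NSEC walk sketch: ask for a name's NSEC record, read the "next secure"
# name, hop to it, and repeat until the chain wraps back to the zone apex.
import dns.message
import dns.query
import dns.rdatatype

def nsec_walk(zone: str, server_ip: str) -> list[str]:
    names, current = [], zone
    while True:
        q = dns.message.make_query(current, dns.rdatatype.NSEC,
                                   want_dnssec=True)
        r = dns.query.udp(q, server_ip, timeout=5)
        nsec_sets = [rrset for rrset in r.answer
                     if rrset.rdtype == dns.rdatatype.NSEC]
        if not nsec_sets:
            break  # no NSEC returned; zone may use NSEC3 or white lies
        nxt = nsec_sets[0][0].next.to_text()
        if nxt == zone or nxt in names:  # chain wrapped around
            break
        names.append(nxt)
        current = nxt
    return names
```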

Now, to circle back a little and reiterate how bad brute force is: plain old naive brute force would take (I believe the word was septendecillion; are there any mathematicians in here? it's like 10 to the 55th) queries, and that's the query throughput of Google for several millennia out, probably until the end of time. It's a lot of stuff. So techniques like this can really cut down on the resources required to enumerate a domain in its entirety. Now, some people very quickly noticed that this was a bad idea, and so a revised DNSSEC standard came out that introduced the NSEC3 record, and that improves privacy but doesn't solve the problem completely. Pretty much, when you request test.example.com, instead of replying that www.example.com is the next secure record, it says that there is a secure record before this and past

this, and it presents the hashed values of those subdomains. Now, that is frustrating: doing an NSEC3 lookup on a name that clearly does not exist, under a zone secured with NSEC3, produces nothing of use. I mean, if you can figure out just by looking at that screenshot what the next secure record is in plain text, you're a god and you should be presenting, not me. But it's hashed, and these names tend to be fairly short (www, for example), and guessing every possible combination of three characters on your local machine takes very little time. So what we do is we enumerate the NSEC3-secured domain,

figuring out all of the "there is nothing between X hash and Y hash" ranges. You can still enumerate all of those hashes, not the domains themselves but the hashes, and then what you can do is throw that into hashcat, if you really wanted to. And I did really want to, so I compiled OpenCL for my Xeon CPU and I threw hashcat at the .de top-level domain zone that I'd just enumerated, and after a mere couple of seconds we're already up to 11% discovered. This is using a very, very naive mask, it's not very useful, it's not very fast, but with a little bit of work this could actually be a viable way of enumerating domains, especially if you had, per se, several GPUs, which this system did not.
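
For reference, the hash being cracked here is just salted, iterated SHA-1 over the wire-format name (RFC 5155), base32hex-encoded; a sketch of computing it yourself is below, and newer dnspython releases also ship a dns.dnssec.nsec3_hash helper:

```python
# NSEC3 hash sketch: salt and iteration count come from the zone's
# NSEC3PARAM record; candidate guesses hashed locally can be matched
# against hashes collected from the walk (or fed to hashcat, as in the demo).
import base64
import hashlib
import dns.name

B32_STD = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"
B32_HEX = "0123456789ABCDEFGHIJKLMNOPQRSTUV"  # RFC 4648 base32hex alphabet

def nsec3_hash(name: str, salt: bytes, iterations: int) -> str:
    wire = dns.name.from_text(name).canonicalize().to_wire()
    digest = hashlib.sha1(wire + salt).digest()
    for _ in range(iterations):
        digest = hashlib.sha1(digest + salt).digest()
    return base64.b32encode(digest).decode().translate(
        str.maketrans(B32_STD, B32_HEX))
```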

But that still wasn't enough for some people. If you are Cloudflare (and by the way, if there are any Cloudflare folks in the audience, you're the bane of my existence), they've implemented something where, instead of the next secure record being a record that actually exists, it generates garbage records on the fly. You'll see, running through this again, we get nothing: there's an existing domain, then abcdef-whatever. What are these? They're not mine, but Cloudflare says this stuff exists, and all of a sudden enumeration is no better than brute

force. Great job, Cloudflare; whoops for me. But going through a recap here: if you're on an engagement with somebody who has a DNSSEC-enabled site, you lose very little by attempting an NSEC or NSEC3 walk. Worst-case scenario, you find out they've secured it with Cloudflare or similar improved DNSSEC methods that return garbage NSEC responses, and, oh well, you've lost five minutes of time. But if it does work, you've suddenly enumerated their entire domain space, and that's phenomenal. NSEC5 is on the way, and NSEC5 introduces improvements that make it so people cannot do walks whatsoever by default, and, you know,

for my project's sake, hopefully it doesn't reach mass adoption anytime soon, because I like being able to do NSEC walks. Another thing you can do to enumerate domains is, of course, a zone transfer: you politely ask the zone owner, Facebook's name server, like, hey, can you please give me the zone file? There are very few reasons to have this enabled, but some people still do: between 1 in 7 and 1 in 10 domains has a name server that will actually allow you to perform a zone transfer. So all of a sudden you don't just get a list of subdomains, though in this case it was very funny to

have a list of all the North Korean domains as of September 2016, which is phenomenal by the way: one entire country, 28 domains, right? But you can perform an AXFR with very, very little resource investment and all of a sudden you get everything in the zone. So here we're going to do a little test zone transfer against zonetransfer.me, which is a great place to test your zone transfer skills, and we got everything. We not only get the domains and subdomains, we get A records, MX, NS records: everything, everything we could possibly want.
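
The same AXFR attempt is a few lines with dnspython; zonetransfer.me and its nameserver nsztm1.digi.ninja are the deliberately open test zone mentioned above:

```python
# Minimal AXFR attempt with dnspython against an intentionally open test zone.
import dns.query
import dns.resolver
import dns.zone

def try_axfr(zone_name: str, ns_host: str) -> None:
    ns_ip = dns.resolver.resolve(ns_host, "A")[0].address
    try:
        zone = dns.zone.from_xfr(dns.query.xfr(ns_ip, zone_name))
    except Exception as exc:  # the vast majority of servers refuse AXFR
        print(f"AXFR refused or failed: {exc}")
        return
    for name, node in zone.nodes.items():
        for rdataset in node.rdatasets:
            print(name, rdataset)

try_axfr("zonetransfer.me", "nsztm1.digi.ninja")
```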

And that's great, but we're really only halfway there: we're starting to get into the next stage of the project, which is not just fully qualified domain names but actually getting domain data. Where does this go? Now we have to resolve the domain space. So we have a list of, let's say, 1 billion fully qualified domain names; we've parsed them down into subdomains, into domains, into suffixes. But we need to figure out if there's anything more going on, as well as figure out: where does this go, what A records, what MX records, what NS records? What are all the characteristics of this domain, so we can link it to other domains? And the way we're going to go about this is

we're going to DIY it. We're going to try a zone transfer; we're going to try ANY queries, which, again, Cloudflare people in the audience, you're the bane of my existence, because Cloudflare has deprecated the ANY query, also known as they've stopped responding to it, which apparently, if you're an internet company that's large enough, means you're officially deprecating it. They're not responding to ANY queries anymore, so that's no longer a viable way of enumerating any sort of data; they just reply with another garbage response. To be fair (I might be complaining about Cloudflare a lot), I understand the decisions they've made here, and I understand that ANY

queries are really, really useful for, for example, denial-of-service amplification. But it's frustrating for me as somebody who would actually like to use the ANY query instead of having to fall back to option number three: iterating through every single desired query type for every single fully qualified domain name we have in the database. But if that's what it comes down to, that's what it comes down to. And we're going to thread this, because each domain is independent, and we're going to geographically distribute it, such that we might pick up on interesting little things like misconfigurations, or places that should reply but aren't replying, things like that.
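
A sketch of that "option number three" fallback, iterating the record types we care about per FQDN (dnspython again; the type list is illustrative, not exhaustive):

```python
# Since ANY is unreliable, iterate over every desired record type per FQDN.
import dns.exception
import dns.resolver

TYPES = ["A", "AAAA", "CNAME", "MX", "NS", "TXT"]

def resolve_all(fqdn: str) -> dict[str, list[str]]:
    out: dict[str, list[str]] = {}
    for rtype in TYPES:
        try:
            answers = dns.resolver.resolve(fqdn, rtype)
            out[rtype] = [rr.to_text() for rr in answers]
        except dns.exception.DNSException:
            pass  # NXDOMAIN, no records of this type, timeout, etc.
    return out
```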

And of course there's really no excuse not to use the data that others provide, for example Rapid7's Sonar project. They publish several billion data points per week from scans they perform off Amazon AWS: resolving all of the forward DNS names they know of, a continuation of the SSL Observatory, and also reverse DNS, which is something we talked about previously. They do all of those queries on a weekly basis; why would we? We can just harvest it from them, and they permit non-malicious, non-commercial use. Now, I'm using this for non-malicious purposes; what you do with the service is entirely up to you, but we'll try to keep it clean.
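
Ingesting a Sonar forward-DNS dump is straightforward; the files are gzip-compressed JSON lines, and the key names below reflect the format as published at the time (treat them as assumptions):

```python
# Stream a Rapid7 forward-DNS dump without loading it all into memory.
import gzip
import json

def iter_fdns(path: str):
    with gzip.open(path, "rt") as f:
        for line in f:
            record = json.loads(line)
            yield record["name"], record["type"], record["value"]
```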

And then of course we have to store everything. We're coming into possession of several hundred gigabytes of data every week; how are you going to store and organize that in a way you can actually query against? The answer for us was scalable database solutions: after many trials and errors we ended up with Percona Server 5.6, a MySQL drop-in variant, with TokuDB running as the storage engine, which has actually scaled fairly well for us. Unlike InnoDB, which pretty much folded after a certain period of time, TokuDB has resisted all the challenges we've thrown at it, so I do recommend it for large data projects you might be looking into a storage method for. And compression works wonders:

this is all textual data, you have no reason not to use compression. I know, you're like, "I compressed a movie once, it didn't do anything." This is text data; it works awesome. We use low compression settings and get about two-thirds space savings, with no impact on performance either. Great. And if you're querying against it: index, index, index. Anything you're querying against should have an index, because otherwise you're doing a full table scan, and you do not want to do a full table scan on 1.3 billion rows of data.
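
As a purely hypothetical illustration of that advice (TokuDB engine, zlib row compression, indexes on the columns you query), a table layout might look like this; it is not dnstrace's actual schema, and it would be run via any MySQL/Percona client:

```python
# Hypothetical DDL for storing parsed DNS answers; ENGINE and ROW_FORMAT
# values are TokuDB's, the column layout is invented for illustration.
SCHEMA = """
CREATE TABLE dns_records (
    fqdn       VARCHAR(255) NOT NULL,
    rtype      VARCHAR(16)  NOT NULL,
    rdata      VARCHAR(512) NOT NULL,
    first_seen DATETIME     NOT NULL,
    last_seen  DATETIME     NOT NULL,
    INDEX idx_fqdn  (fqdn),
    INDEX idx_rdata (rdata(64))
) ENGINE=TokuDB ROW_FORMAT=TOKUDB_ZLIB;
"""
```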

So let's get into the really fun parts: we have a lot of data, let's analyze it. (Whoops, I should probably have deleted that. Oh well.) First of all, we have some interesting domain errors that we've come across, things we couldn't parse using our internal tools. I'm not really sure, because I am not an authority, but I'm relatively confident that there is no top-level domain called devil1521945933. Any corrections? No? Doesn't seem like a registrable top-level domain. You come across interesting issues like that; again, .aeo152, no idea how this ends up happening. Then we got into an interesting one, dc-wp-prod-f5. If anybody has any ideas as to what that is, I'd love to hear them. The idea that came

closest to my mind when I was looking at it was F5 BIG-IP application accelerators. Okay, I'm getting some nods, but who knows. And then we came up against some very odd results that really baffled me for a while: nic.llc and mac.sport. That seems like it should be valid, and IANA says yes: IANA says yes for llc, it also says yes for sport, these are TLDs that exist, and nic is a fine domain name that meets the criteria (a very low bar, but it meets it). But Mozilla's Public Suffix List says no, and that's actually a problem, because we use Mozilla's Public Suffix List internally to figure

out what a registrable domain is. If you take the naive approach (say, for www.facebook.com: www is the subdomain, facebook is the registrable domain, com is the top-level domain), we'll just explode on the periods, take the last value as the top-level domain, the suffix, and then whatever comes directly before it is the registrable domain. Sounds great, and then you hit a site like darknet.co.uk. If you apply the same logic, you're going to end up with co as your registrable domain and darknet as your subdomain. So you actually can't split them naively; you have to use a suffix list that will

allow you to separate things into what is and is not most likely a registrable domain, based on the suffixes that are currently available. The largest such list is the Mozilla Public Suffix List; it's available for free, you can find it on GitHub or publicsuffix.org, things like that. And it's actually the list that a lot of browsers use to determine where user cookies can be set: you can set a cookie for your subdomain or for your domain, and the registrable domain should be the maximum level at which somebody can set a cookie.
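
A sketch of suffix-aware splitting using the tldextract library, which consults the Public Suffix List (an assumption on my part; the talk's internal parser isn't shown). Note the first call fetches the list:

```python
# Public-Suffix-List-aware splitting vs. the naive "last label" approach.
import tldextract

for fqdn in ["www.facebook.com", "darknet.co.uk"]:
    parts = tldextract.extract(fqdn)
    print(fqdn, "->", parts.subdomain, "|", parts.domain, "|", parts.suffix)

# www.facebook.com -> www | facebook | com
# darknet.co.uk    ->     | darknet  | co.uk
```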

So if Mozilla's Public Suffix List has trouble figuring out what the registrable domain is, all of a sudden you can set supercookies that expand way beyond what you should reasonably be able to set data for, and that's very bad. I've actually poked the dragon and submitted a GitHub issue, the first GitHub issue I've ever submitted to anybody that is not me, because the Public Suffix List is missing a couple of IANA-recognized top-level domains and I want to figure out what's going on with that. They haven't responded yet; we'll find out. We've also seen some interesting invalid types. (Whoa, I notice I'm running long; we're gonna burn through this.) We've noticed some interesting invalid record types: for example, a query type was returned to us as 104, or 100, or 169, and I

can keep going, which doesn't really seem like a thing that exists, when normally the type for a DNS query is A or AAAA or CNAME. Who knows. It returned some interesting values too: one actually returned a link to an EC2 instance and then said it was an A record. Interesting. And then dc-wp-prod-f5 appeared again, and it appeared talking about a .ws domain. If any of you want to investigate that and tell them they might have an F5 BIG-IP application accelerator that's acting a little haywire, you definitely could. Let's look into stats. With the IPv4 address space, the DNS A replies, we

actually found fairly significant use of private IP space. These are private IP replies in public DNS space, which is not necessarily good. Honestly, it was a very small percentage of all the replies we got, hovering around 0.1%, but that's pretty huge when you consider that 0.1% can be several million replies exposing private IP mappings to the public DNS space. There's actually a huge amount of use of loopback; the follower was 10.0.0.0/8, and a couple of other notables had a good amount of data being exposed. Definitely something to look into.
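
Spotting those leaky answers takes nothing more than the standard library; a minimal sketch:

```python
# Flag A-record answers that land in private or loopback space.
import ipaddress

def is_leaky(answer_ip: str) -> bool:
    ip = ipaddress.ip_address(answer_ip)
    return ip.is_private or ip.is_loopback

print(is_leaky("10.1.2.3"))        # True  (RFC 1918)
print(is_leaky("127.0.0.1"))       # True  (loopback)
print(is_leaky("93.184.216.34"))   # False (public)
```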

And you hear a lot about IPv6: Google's touted that twenty percent of people visiting their services are using IPv6. That's great, but only about 1.1 percent of the domains we knew about actually resolved anywhere in the IPv6 space; everything else was IPv4. That's a holdover from how slow the DNS system is with updates, how slowly people are moving their services over to IPv6-enabled servers, data centers, etc. It is what it is, and it's an interesting looking glass into the IPv4-to-IPv6 transition. We also got some interesting CNAME mistakes. Just for reference, if you're putting "http" into your canonical name DNS record, that's not how it works; unfortunately, you have to keep things

within the DNS space only, so you get some interesting responses here. We actually saw a huge number that had a CNAME that seemed to be some sort of hash, like the 6ca0fe-style responses; again, oddities. If any of you want access to this data, access to the database, you're free to do your own research with this kind of stuff. We also get the occasional little keyboard smash (it's at the bottom, the dfdfdfdf one; great, good job, an administrator got very bored) and the occasional little mishap where somebody forgets the dot in .com. It happens. And we actually see that a lot of people point

their MX or NS records at localhost, which doesn't work out on the public internet. That's the kind of stuff we see; it accounted for nearly 1% of the MX and NS queries we actually got a reply to, so, very interesting. It's probably a default on certain systems, which is why we see the level so high, but again, if you want to receive mail you probably need to set something other than localhost. So now we have a huge amount of data. What do we do with it? How are we going to work with it? We have to map this together into these great spanning graphs, and how are they even going to tell us stuff? Well,

first of all, we need to add threat intelligence. I talked about it a little bit at the beginning and then you probably forgot about it, but by adding more and more threat intelligence we can get a greater view into the security of the domain space, and better information about which clusters of DNS information we should be labeling as bad versus good. We ingest as many feeds as possible: we ingest phishing feeds from OpenPhish and other sources like that, domain reputation feeds, IP reputation feeds; we ingest the bogon reference from Team Cymru (if that's even how you say it, I don't know; it's awesome though). And we're considering heuristics like

the approach Cisco Talos had; there's really no reason not to evaluate whatever heuristic data we can get for a domain. We're not going to be able to get our hands on popularity data for a domain, but we might be able to figure out things like: when was it registered, who is it registered to, what kind of activity has it had, like moving around. This kind of stuff can be used to more accurately establish what is and is not malicious, and filter out both false positives and false negatives. And we want to be able to tune this. So we check the intel source's reputation itself; for example, we're more likely to

trust, you know, a FireEye threat source than some random github.io feed. We also look at the threat type and severity, and we rate both of those on a scale of one to five. Then, for users who are eventually generating lists with this, we'll add a user bias, which is a multiplier, so you can say: well, I don't actually care about people going to The Pirate Bay, but I definitely do care about grandma getting hit with an exploit kit.
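
A toy sketch of that weighting; every name and scale here is illustrative, not dnstrace's actual scoring API:

```python
# Toy threat weighting: source trust and severity on 1-5 scales, times a
# per-user bias multiplier, as described in the talk.
def threat_score(source_trust: int, severity: int, user_bias: float = 1.0) -> float:
    assert 1 <= source_trust <= 5 and 1 <= severity <= 5
    return source_trust * severity * user_bias

# A "protect grandma" profile: piracy barely matters, exploit kits very much do.
print(threat_score(source_trust=5, severity=2, user_bias=0.1))  # Pirate Bay-ish
print(threat_score(source_trust=4, severity=5, user_bias=2.0))  # exploit kit
```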

And with all of that said, we finally get to the graphs, with like five minutes remaining. Here's the graphs. This is a map of a hosting provider: the redder nodes are more malicious, the yellower nodes less malicious. We're going to zoom in a little here, so you can see there are a lot of malicious nodes pointing to given IPs with a given CNAME, and then you notice one yellow node, adobeupdate.top, which is not flagged. That's the kind of domain we're looking for; this is the kind of proactive intelligence we want to generate. We want to be able to figure out that adobeupdate.top is malicious before grandma goes to it. And sometimes that's going to be really hard: you can generate more graphs that are very intricate, they're frustrating to work with, and there are times when you look at the graph and you

get just this absolute cluster of information that's very, very difficult to interpret. We need to build a machine that can interpret that effectively, and honestly we're still working on it, because there are times when the threat intelligence we get is not so great: j.maxmind.com probably shouldn't be red. It's used by some malware to figure out the geolocation of a given target; it's also used by legitimate services to do the same exact thing. Why is it blocked? Who knows. And we need to be able to do this at scale and reduce false positives. This is AWS, by the way. Well, this is a subset of AWS; I could only generate a

subset of AWS, this is not all of AWS, because the graphing engine I have for this crashed. So this is a subset of AWS, and this is a subset of github.io, and we have to figure it out: there are some red dots in here, some things we definitely do need to block, but there are going to be others that get caught in the crossfire if we apply naive approaches. So what we're considering very strongly is frequency: what percentage of the nodes resolving to this IP are malicious? That's a really great starting point for us, though it won't solve all the problems.
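
A minimal sketch of that frequency heuristic (the threshold is an arbitrary placeholder):

```python
# Only treat an IP as suspect when a meaningful share of the domains
# resolving to it are known-bad, so one malicious tenant doesn't condemn
# an entire hosting provider.
def ip_suspicion(domains_on_ip: list[str], known_bad: set[str],
                 threshold: float = 0.25) -> bool:
    if not domains_on_ip:
        return False
    bad = sum(1 for d in domains_on_ip if d in known_bad)
    return bad / len(domains_on_ip) >= threshold
```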

There's a lot to improve here. Unfortunately, the data is far from complete. We have about 2 million, or sorry, 2 billion data points in the system right now, but that's not enough: we're not covering enough of the domain space, we haven't enumerated all the possible subdomains, and we're not looking at all the data we could possibly get. It's the same problem with our threat intelligence: our sources are good, not great. We have to work off free intelligence, we have to work off free sources; this is an academic project, it hasn't got the insight that FireEye has, and our response time, due to that, is actually fairly slow. Of course the graph

engine needs work, like I've just said, and right now it's really only useful as a post-mortem investigation tool; that's not what we want for it. And here's the distribution of where we do our scanning from: Sweden, Canada, the United States. No, that's not coverage; we're not covering the entire world, and we really need to. So we have a lot of stuff to work on. We need to talk to a lawyer. I need to talk to a lawyer, I really need to talk to a lawyer. Because, and Robert Olson, I now see, has shown up in the back (hi Rob Olson, there was a round of applause waiting for you at the end, it's half

yours, it's half mine), we need to talk to a lawyer to figure out whether we can actually use this data in a way that's useful, because I am a lot easier a target for somebody to go after than, per se, Rapid7. Rapid7 has lawyers; I do not. So if somebody wanted to say, hey, you're not allowed to do this, give me your money, what am I gonna say? Hey, no? We also need to scale out: cover more geographic areas, increase our query throughput. We've capped out at about 1,000 queries per second, which is pretty solid until you consider the scale of the domain space; we need to ramp up those

numbers. And you can help: we're actually releasing software that allows people to use their systems as resolvers for our system. It'll take some of that extra internet throughput you have and use it to hit Cloudflare's DNS servers (I've talked to them about this, they say it's okay, you can hit Cloudflare), get lists from us of things to resolve, and send the resolved lists back. Cool: we ingest more data, and the more data we ingest, the more of an intelligent system we can create. And we need to implement certain things we talked about today, such as NSEC walking and zone transfers. I want to implement NSEC walking in a way that doesn't

appear like a denial-of-service attack; I want to implement AXFRs in a way that you can actually parse them, because parsing domain data is really frustrating. And the endgame of this entire project, as I have two minutes left: we want to make dnstrace a proactive tool for geeks like me, like you, like all of us here. We want to generate firewall configurations; we want to generate DNS blocklists, things you can toss at the Pi-hole you have sitting at home, so you can block threats before they actually reach your network, which could be awesome. And most of all we want to make this tool

a proactive tool for everyone. We want this protecting grandma, for free if we can, hopefully. We want to potentially create browser extensions down the line to identify malicious sites from an independent source (we're not AVG, you know, it's just me), and we also want to possibly create a dnstrace-powered DNS server, like Quad9 or such things, that provides a secure browsing experience and can block threats at the network level before they get to grandma. Because I don't know about your grandmas, but my grandma is really bad at figuring out what a good Adobe Flash update is versus a bad Adobe Flash update, and if we can protect my grandma, we can protect yours too. Thank

you so much. I wanted this to be an open discussion. We have 60 seconds left? Really? Yeah? All right, awesome. Wow, I thought you guys were gonna cut me off. Cool, we have ten minutes: please discuss with me. I've not slept; I'm very receptive to questions, comments, concerns, criticisms, issues in general. All right, yeah? We have not... what's RPZ? Hmm, oh sorry: have we interacted with DNS RPZ. Okay, so that's the kind of stuff that Pi-hole would do, but at a better scale? Oh, that's rad, I am noting that down. That's why, like I said at the beginning, I have a pen and a

notepad up here. Anybody else? Please; now I'm just standing up here, you can ask me how my day's been (stressful, but it's okay). Yes? (A question about newly created domains: DNS recency, looking at how recently something was mapped to a given IP.) Yeah, yep. So we actually condense all of our data into the time we first saw it and the time we last saw it, so it would actually be very easy to implement that kind of thing: looking at the recency of, well, what time is it right now and when did we first see this result. So yeah, we definitely could do that; we currently don't.

It's a great thing to do though, so we will definitely be getting on that. Yes?

(A question about SSL certificate transparency.) So yeah, SSL cert transparency: Rapid7's datasets actually do that. Rapid7 has a continuation of the SSL Observatory; they provide all the SSL certificates they can find (I think they do an IPv4 scan of all sites exposing port 443) and then they pull from that. So yes, we do ingest that kind of data, and it's awesome. Other things or stuff? My League of Legends rank has plummeted because of this. Nothing else? All right, well, this is the last slide. Thank you so much for coming to my talk, it means a lot to me. Keep in touch: I'm here all day, I'm just gonna be in the atrium, I'm probably

buying things, because I haven't acquired stickers yet and it's 2 p.m., so I'm getting uncomfortable. You can contact me by my email, chris at partridge dot tech; you can add me on LinkedIn (please, that's a great resume booster for me); you can chat with me on GitHub, I guess: we'll just, like, create repositories and then type into the repositories and then submit pull requests, it works for me; or on Twitter, but I never use Twitter, so good luck. Thank you so much, and enjoy the rest of BSides. [Applause] Thanks.