
BSidesSF 2015 - DNS Spikes, Strikes, and The Like (Thomas Mathew)

BSidesSF · 2015 · 39:25 · Published 2023-12 · Watch on YouTube ↗
Category: Technical
Style: Talk
About this talk
DNS Spikes, Strikes, and The Like (Thomas Mathew)
Analyzing traffic patterns for trends can be a rich source of information for investigating potential malicious domains. This talk will be an examination of spikes in DNS queries and how they can be used to find potentially new threats. Malicious domains that appear as spiked domains usually belong to Domain Generation Algorithm (DGA) or exploit kit families. However, not all domains that spike are necessarily malicious. One challenge is sifting through the large data set and extracting the potentially harmful spikes. To accomplish this goal we rely on unsupervised learning methods such as clustering to help us explore and eventually classify the data.
https://bsidessf2015.sched.com/event/2xaV/dns-spikes-strikes-and-the-like
Transcript [en]

DNS traffic today, but first, who am I? I'm a researcher at OpenDNS, and the majority of my work is on pattern recognition for signals intelligence. The main focus of today's talk is going to be how we can model DNS traffic and what type of information we can mine from it, and the last part of the talk will focus on a variety of techniques that allow us to accomplish these goals. You can think of the talk as being broken up into three sections: the first is the motivation behind what made the team decide to pursue these problems; the second is a set of methods, where I will discuss a spike algorithm as well as a clustering algorithm; and finally we'll have a brief discussion of some of the results that we've obtained.

So first of all, what exactly is DNS? If we want to do traffic analysis, we need to have a pretty clear understanding of what DNS is. Many people liken it to the internet's phone book because it helps us create a mapping between domain names, which are easy for humans to understand, and IP addresses, which are really easy for machines to understand.

Of course, you can imagine that remembering a 32-bit number is hard for a human but easy for a machine; DNS helps fix this issue. We can think of DNS as being set up into a variety of zones, as this diagram neatly shows. When you have a question, say, to go to google.com, you first send out your request, and your request goes to the .com zone; the .com zone has information about the underlying subtrees and then sends another request to the Google zone. So in order to get an answer, your question must traverse the entire DNS zone hierarchy along your path, and once you receive the answer it's cached in your browser.

There are two types of DNS services, and this is actually quite important: an authoritative service and a recursive service. Authoritative services you can think of as the actual phone book itself; there would be, for example, a server that's in charge of maintaining all the information for the .com zone. However, when you're a normal user, you actually interact with the recursive server; the recursive server is the server that's doing all the question-and-answering for you. So where does OpenDNS fit in this picture? Well, we are a recursive server. We currently have two IPs that you can point your browser's DNS settings to, and we will then send you back a set of answers for any domain question that you have.
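
To make the recursive-resolver role concrete, here is a minimal sketch of a client asking a recursive resolver a question on its behalf. It assumes the dnspython library; the addresses used are OpenDNS's public recursive resolver IPs, and the domain is just an example.

```python
# Minimal sketch of a recursive lookup from the client's point of view
# (illustrative; assumes the dnspython package: pip install dnspython).
import dns.resolver

resolver = dns.resolver.Resolver(configure=False)
# OpenDNS's public recursive resolver addresses.
resolver.nameservers = ["208.67.222.222", "208.67.220.220"]

answer = resolver.resolve("example.com", "A")   # the recursive server does the zone traversal for us
for record in answer:
    print(record.address)                        # the IP(s) the domain maps to
print("cacheable for", answer.rrset.ttl, "seconds")
```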

We have 24 resolvers worldwide, we use anycast, which is actually important as we'll get to later on, and we see about 50 billion queries a day, which comes out to about 12 terabytes uncompressed, so it's a pretty large data set. At OpenDNS there are two types of data that we collect: what's called passive DNS, and what we call recursive DNS. This distinction is important when it comes to modeling some of the traffic patterns that we're going to be talking about later. Passive DNS just maintains a historical mapping between a domain and an IP. That means that with a passive DNS system I can see what google.com was mapping to ten years ago; it gives us a very good understanding of how domains and IPs have been changing over a period of time.

However, we're not going to be talking about passive DNS today. A lot of work has already been done on passive DNS by professors at Georgia Tech and Damballa, and at OpenDNS. Passive DNS is really good for certain types of data analytics, but it's not the greatest when you're trying to catch specific types of threats. And that brings us to the fundamental question: what are we trying to model, and more importantly, what can your data actually model?

Whenever you're attempting to use some sort of data-driven algorithmic method, the most important thing is determining the quality of your data and figuring out whether your data has enough information to actually make the type of inferences you wish to make, in order for you to create some sort of classification or pattern recognition system. So, crimeware: I like to think of crimeware as an ecosystem. You have your botnets, your exploit kits, your spammers, and it's constantly evolving. Passive DNS is great for catching fast-flux servers and DGAs because they make a lot of noise in the authoritative zones: a DGA creates a lot of NXDOMAIN responses, which are easily noticeable, and fast-flux domains are constantly changing their name servers or their domain-to-IP mappings.

So the community has created a set of techniques to really easily use passive DNS to capture these types of threats, and as a result, if you're a network security operator at any large corporation, you have a blacklist which is many times fed by mining data from passive DNS. The best method that's currently used is the idea of reputation; that is, a certain domain or an IP might have some sort of historical reputation, i.e., I might know that domain X has a history of hosting Russian spammers, and as a result, when I see outbound or inbound traffic coming from such a domain or IP, I want to block it.

So people have come up with a set of techniques using passive DNS and some magic to give us the answers to these blacklist questions. The problem is, this was great in the early 2000s. Getting IP space is cheap, domains are cheaper, and subdomains are the cheapest. What this means is that if you're an attacker, you can easily buy a little bit of IP space from a VPS provider, host whatever malicious content you want, and then quickly tear down the service in a matter of hours. Passive DNS is too slow, and many times does not record these types of transactions correctly, because you don't get to see the actual flow of new IPs being created unless you have entire global reach, which no individual or company has.

Furthermore, compromised domains are a new threat. A compromised domain is when you have a totally benign website that has some sort of flaw that can be exploited. This is a great way to get around IP reputation or domain reputation, because you would expect that a benign website with a very clean history will not be hosting a Zbot dropper. However, many malware authors have found a variety of techniques to find compromised domains, so passive DNS isn't good at catching these types of threats.

So what's coming? Exploit kits. Exploit kits are the primary source of botnet and spam infections in the crime ecosystem, so if we could figure out a way to model their behavior, we would have a very good method of identifying how attackers are moving throughout the IP and domain landscape. Furthermore, exploit kits use a large distributed network, not unlike a content delivery network, as you can see in this diagram; the majority of droppers and botnets that you're familiar with are using exploit kits as the initial vector. This is a classic example of how the exploit kit flow goes: you have a bunch of poor guys who visit a set of websites, and these websites redirect them to a set of malware distribution servers.

These servers then send them through further redirections until finally they reach what's called the dropper page, and this can include a variety of domains along the way: rotators, routers, and then finally dropper pages. So what I want to focus on is exploit kits, because I think understanding how exploit kits work gives us a very good way of understanding how botnets will evolve and emerge in the future, because in order for a botnet to even infect users, it needs to use an exploit kit to exploit the user first and put the botnet software on that user's machine.

So let's start with a very simple idea: exploit kits have a set of domains that are all connected in a chain. You might expect a user to click on perhaps one of those domains on purpose, but not the entire set. This gives us an interesting idea. How about the following hypothesis: DNS query patterns can help us identify domains hosting exploit kits. Not passive DNS patterns, but recursive DNS patterns; that is, we can look at how clients are interacting and sending out requests to a set of domains, and then from that information be able to mine significant patterns out of that signal that can be associated with exploit kits. Then we can use passive DNS to help us catch even more exploit kits, because we know that exploit kits like to reuse large infrastructure.

But in order to gather the seeds, we have no method of using passive DNS to find the initial exploit kit domains themselves, and that's why we want to use recursive DNS data to help us catch these new domains. Right here is a traffic pattern of two exploit kits that came up in, I guess, mid-February. As you can see, for the previous two weeks there's absolutely no traffic, and then suddenly, in a matter of one to two hours, there's a large spike. This makes some sense, because if you're an exploit kit operator, you don't want a long-lasting service at a domain.

You want your domains to be easily put up and then easily taken down, so you don't get the authorities on your back. Furthermore, you don't want to register an exploit kit domain for a very long period of time, because that's going to cost you money. So what many exploit kit authors like to do is some form of domain tasting for just a matter of days, and then they dispose of their domain. This means that exploit kits are disposable, cheap, and very short-lived. So really, we're looking for a spike-like pattern within the DNS query pattern, because as you can see, on February 17th at around 12:00 a set of users suddenly put out a host of DNS queries to an exploit kit domain.

With this idea, let's form a very basic algorithm to help us extract spikes from domains. Let's take two consecutive hours of recursive DNS traffic. Recursive DNS traffic contains information regarding a timestamp, the client IP that is requesting the domain, the domain name itself, and a set of rcodes and query types. You first want to count the number of queries received per domain per hour slice. We then perform a join where the key is the domain, and that allows us to see the domains that have been spiking in hour one and in hour two. We then finally filter out any domains that have been receiving an abnormally large amount of traffic, so that would be your Googles, your Facebooks, and your various content delivery networks. This then leaves us with a set of domains that have had some sort of rapid increase in DNS query patterns in the last two hours. At this point you might think to yourself, well, we're done, we've created the algorithm, there's nothing else to do; all we need to do is run it on a couple of hours' worth of data, we'll probably receive a set of spiking domains, and those spiking domains we can easily push to the blacklist.
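
A rough sketch of that hour-over-hour spike join, as described above; the log field layout, function names, and thresholds here are illustrative assumptions, and the real job runs as a distributed computation over roughly half a terabyte of logs per hour rather than in memory.

```python
# Illustrative sketch of the spike-extraction step, not OpenDNS's actual pipeline.
from collections import Counter

def count_queries(log_lines):
    """Count queries per domain for one hour of recursive DNS logs.

    Each line is assumed to look like: "<timestamp> <client_ip> <domain> <rcode> <qtype>".
    """
    counts = Counter()
    for line in log_lines:
        _ts, _client_ip, domain, _rcode, _qtype = line.split()
        counts[domain] += 1
    return counts

def find_spikes(hour1_lines, hour2_lines, min_ratio=10, min_queries=50, max_queries=100_000):
    """Join the two hourly counts on domain and keep sudden risers.

    Domains above max_queries (the Googles, Facebooks, and big CDNs) are dropped
    as permanent heavy hitters rather than spikes.
    """
    h1 = count_queries(hour1_lines)
    h2 = count_queries(hour2_lines)
    spikes = {}
    for domain, c2 in h2.items():
        c1 = h1.get(domain, 0)
        if c2 < min_queries or c2 > max_queries:
            continue
        if c2 >= min_ratio * max(c1, 1):          # rapid hour-over-hour increase
            spikes[domain] = (c1, c2)
    return spikes
```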

Unfortunately, all we see is a ton of noise. There are a ton of domains that spike in consecutive hours. To put this in perspective, one hour of traffic is about half a terabyte of data, and once we perform the spike algorithm on two consecutive hours of traffic, we see about 20,000 domains that have been spiking in that period. That's too many domains for a human analyst to go through, and probably far too many for a very naive heuristic method to filter out, so we need some new techniques to help us out here. For example, this is a spiking domain which you might think is actually malicious, but in fact it's just an example of a DNS amplification attack, so we can't actually block this website, because it's a totally benign website.

It's some poor person who, I guess, misconfigured their DNS server, and somebody is exploiting it, but we can't put that on the blacklist because it's not a truly malicious website. So we have to be very careful when it comes to deciding what we can consider malicious and what we cannot, and that brings us to a much larger problem. In recent years there's been a lot of talk about machine learning and bringing big data and data science to the field of security, and one of the hardest things to establish is this idea of ground truth.

Ground truth is the idea that when you're given a set of domains, or a set of objects, you have a very clear understanding of what those objects are, because for any type of classification system to work, you need to be able to say with certainty that an object is an apple or an orange. The challenge here is that we have very little idea of what these domains are initially; we have about 20,000 domains and we can't manually inspect and classify them. So we want to use a technique from the machine learning community called unsupervised learning. It's not going to be classifying anything for us; instead, unsupervised learning is a set of techniques that allows us to explore the data by analyzing essentially the shape of the data itself.

We can use unsupervised learning as a pre-step before we create our actual classifier. The goal for the last couple of months has been to create a set of algorithms that allows an analyst to analyze a set of spikes and then determine what kinds of groups of spikes emerge in the data. The method that we decided to use was clustering; clustering, as you can imagine, just helps us find similar groups, and I decided to use the most basic of clustering algorithms, k-means.

You can think of k-means as boiling down to just this little function right here; you don't actually need to focus on the math. The idea behind k-means is the following: we have a set of points, and these points we believe belong to a set of groups. What we want to do is create the ideal sets, where if we go to every single point in a set, we can say that its average distance to the other points in that set is minimized. Basically, if I'm at point A in set A and I go to point B, which is also in set A, I notice that the difference between these two points is very small. If you do this for a set of points, eventually what you find are these things called centroids, which are the centers of n-dimensional objects.

So with k-means we have to solve this summation over a set of means. The technicalities aren't that important, but this function becomes quite important later on when it comes to choosing the number of clusters we imagine to be in the data. As you can see, at step A we have a set of data points, and we iteratively go through each step and create naive clusters: we make guesses about how big each cluster is and how many points are within it, and at each step we keep adding points to a cluster or removing them.
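
The function on the slide is not reproduced in the transcript; the standard k-means distortion objective being described is, in all likelihood, the within-cluster sum of squared distances:

```latex
J \;=\; \sum_{j=1}^{k} \; \sum_{x_i \in S_j} \lVert x_i - \mu_j \rVert^{2},
\qquad
\mu_j \;=\; \frac{1}{\lvert S_j \rvert} \sum_{x_i \in S_j} x_i
```

Here the S_j are the clusters and the mu_j their centroids; the algorithm alternates between assigning each point to its nearest centroid and recomputing the centroids, which can only decrease J.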

This continues for a set number of iterations, so it doesn't run once; we have to run it multiple times, and we also have to choose where we start at random each time, because you don't want to start at the same place every time you run the algorithm. And because we're trying to minimize the mean, or the average variance within a set of points, we're essentially acting as a least squares estimator, for people who are familiar with any form of regression model. This causes some problems, however, with the type of data that we're working with. Oh, okay, so she just asked how we choose the number of clusters; we'll come to that problem in, I think, two or three slides, so don't worry.

Okay, so now we have an algorithm that we want to use. Then it just becomes: choose a set of features, scale them, run the algorithm, and we're done. Unfortunately, it's not that easy. One of the most complex parts of this entire pipeline is actually choosing the features, because the features are what determine what should be examined, what should be considered important to look at when creating your clusters. So we chose the following set of features: the number of query counts, the number of unique IPs, a resolver distribution, and the rcode distribution.

What we then did was filter out all domains that had fewer than three unique IPs. At this point you might be thinking, well, this isn't that many features, why would you just stop at four or five? The reason is that there's a problem in machine learning called the curse of dimensionality: as you increase the number of features to help you create clusters, you also make everything look the same. Essentially, the problem is, let's say I added a hundred features to determine whether something is a spike; the more features I add, the more similar any two points start to look in a high-dimensional space.

So one of the biggest goals is trying to choose as few features as possible, in order to minimize the amount of unnecessary calculation you're doing and avoid creating incorrect clusters within your data set. We now build an N-by-M matrix to store our samples and our features. N is the number of samples, so every domain is considered a sample, and M is the number of features; in this case we have seven features, so we have a 6,000-by-7 matrix that we're working with.
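
As a sketch of how those per-domain aggregates might become the sample-by-feature matrix (not the talk's code), including the fewer-than-three-unique-IPs filter. The field names, and the particular way the resolver and rcode "distributions" are summarized, are assumptions, and only a subset of the seven features is shown.

```python
# Illustrative construction of the N-by-M feature matrix.
import numpy as np

def build_feature_matrix(per_domain_stats, min_unique_ips=3):
    """per_domain_stats: dict mapping domain -> dict of raw per-domain aggregates."""
    domains, rows = [], []
    for domain, s in per_domain_stats.items():
        if len(s["client_ips"]) < min_unique_ips:          # drop domains with < 3 unique IPs
            continue
        rows.append([
            s["queries_hour1"],                             # query count in the first hour
            s["queries_hour2"],                             # query count in the spike hour
            len(s["client_ips"]),                           # unique client IPs
            len(s["resolvers"]),                            # unique resolvers hit (one summary of the resolver distribution)
            s["nxdomain_count"] / max(s["queries_hour2"], 1),  # one summary of the rcode distribution
        ])
        domains.append(domain)
    return domains, np.array(rows, dtype=float)              # shape: (N domains, M features)
```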

If any of you have worked with k-means in the past and have been listening to what I've been saying, you'll notice that I'm doing some hand-waving right here, and that's because I'm dealing with query counts, and query counts are discrete data points. For example, I can expect to see five DNS queries or ten DNS queries in an hour, but I will never see, say, 10.5 or 6.5. This doesn't seem like that big of an issue; however, the problem is that a lot of the techniques and tools we've been using make some very strong statistical assumptions about the actual underlying distribution. K-means operates on a normal distribution, and the normal distribution is the familiar Gaussian that we all saw in school or have seen in images.

A Gaussian distribution expects continuous data; that means I would expect to see 10.5 queries an hour, 11.5 queries an hour, but the data that we're working with does not follow this normal distribution. If anything, it follows what's called a Poisson or a negative binomial distribution. So I had two options at this point: I could either rework the algorithm using the Poisson distribution or some other distribution, or I could just charge ahead. I decided to charge ahead, because reworking everything for these two distributions becomes a lot of work, and I wanted to first see whether there was enough error in using a normal distribution before I decided to completely toss it away.

So that's the caveat, and something that's really important whenever you're working with data is to know what type of data you're working with. Many times you might think that you have a fantastic algorithm, but if your data is, say, categorical data, where you're just measuring the number of boys, girls, or doctors, or count data, your algorithm is not meant to handle that type of input. So one of the key challenges is figuring out how to either massage the data to make it more acceptable for your algorithm, or just choose a completely different algorithm.

So finally, what we do is, for each feature, we calculate its mean and remove it, which means we essentially center it at zero, and then we divide by the standard deviation; that's the normal whitening approach. This works really well with k-means, because k-means is an algorithm that is aimed at identifying various signal properties quite well, so it's almost expecting some sort of wave or spike when it's clustering. So we made these assumptions and we went ahead.
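
In scikit-learn terms, the whitening-then-cluster step looks roughly like the sketch below; X is assumed to be the feature matrix built earlier, and k = 12 matches the cluster count eventually chosen later in the talk.

```python
# Sketch of per-feature standardization followed by k-means (illustrative).
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# X: the N-by-M feature matrix of spiking domains built earlier.
X_scaled = StandardScaler().fit_transform(X)        # subtract the mean, divide by the standard deviation

kmeans = KMeans(n_clusters=12, n_init=10, random_state=0)   # several random restarts
labels = kmeans.fit_predict(X_scaled)               # cluster index for each spiking domain
```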

Now comes the question of how we choose the number of clusters. Many times software libraries will give you an initial cluster count; I believe scikit-learn gives you eight as the default. But in this case we have a very crude technique to help us estimate the number of clusters that exist in the data. If you remember, we had this equation over there, and we call it the distortion function. The distortion function basically tells us how much movement is going on, how much variance is going on, every time you run this algorithm, so it's a good way of figuring out how much change overall is happening every time we add or remove clusters.

With that understanding, we can come up with the following idea: let's assume that we can have between five and 20 clusters, and now let's start mapping out what the distortion function value is whenever I run a k-means algorithm with that number of clusters, and then let's just see what it looks like. This is what we get: as you can see, when we start off at five we have a pretty high level of distortion, and then it keeps going down, and it looks something like an elbow. So this method is called the elbow method.
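
A sketch of that sweep with scikit-learn, which exposes the distortion as inertia_; it reuses X_scaled from the earlier sketch, and printing the cluster sizes for each k also surfaces the suspiciously tiny clusters discussed a little later.

```python
# Illustrative elbow-method sweep over k = 5..20.
from collections import Counter

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(5, 21)
distortions = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    distortions.append(km.inertia_)                    # within-cluster sum of squared distances
    print(k, sorted(Counter(km.labels_).values()))     # cluster sizes; very small ones are a warning sign

plt.plot(list(ks), distortions, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("distortion")
plt.show()
```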

Basically, you're looking for the crook in the elbow, and then you want to move a little bit past it. "A little bit past it" is subjective, but the best way to determine how to choose it is to actually look at the cluster sizes eventually. So we have this algorithm; what's the underlying theory behind it? Well, the idea is that if I look between five and, say, nine, there's a really big drop-off in the amount of change that's going on when I increase the number of clusters. That means each new cluster that I add is giving me a lot more information about the underlying structure of the data. However, as I move further on, say between 15 and 20, the rate of change in distortion is a lot slower. This means that adding additional clusters isn't giving me that much new information, and if anything, I'm basically wasting computational time as well as creating false clusters.

So what I really want to do is find the sweet spot within the curve; it's kind of like finding the inflection point. This is what happened when we ran the various sets of clusters for 10, 11, and 12: as you can see, the cluster sizes begin to drop off quite dramatically, and as a result I ended up choosing 12, going back and forth between 11 and 12. What's interesting to note is that there are a couple of clusters that are extremely small: cluster 11 at k = 12, and cluster 10 at k = 11. This is actually a worrisome point, because the fact that you're generating such small clusters is a sign that there's some sort of very wide spread in your data and k-means is not capturing that spread correctly. So this is a point to remember and something that should be refined in further improvements; however, this is still pretty good, and we can actually derive some pretty interesting observations from it.

The next question that you want to ask is: how much does each feature help me create my clusters? This is important because, if I have a ton of features, we talked about this problem of dimensionality, so we really want to know that we're getting the most bang for our buck whenever we choose a feature. We can think about this as the following problem: each sample, or domain, lives in what we call an n-dimensional space, and this n-dimensional space is created by the number of features. So, as this diagram shows, if we're dealing with only three features we end up with a three-dimensional space, and we end up with some sort of cloud of data points.

Now, if I'm dealing with only two or three dimensions, it becomes quite easy to determine what features contribute the most to the clustering, because visually it's quite easy. However, we're dealing with six or seven, and that's kind of hard for your eyes to understand. So the question is, is there an analytical method? Thankfully there is, from linear algebra: a technique called principal component analysis. We don't need to go over the actual details, but the general idea is that we assume our data is some sort of cloud.

We know that the most important features are going to create the most stretch in the data, because they're what's drawing the data toward them. So with that in mind, we want a set of techniques that can find the most important directions within the data. This can be solved as a linear algebra problem, where we think of the data as a giant matrix, and what we're looking for is the covariance matrix of the data; covariance refers essentially to how the spread changes along each dimension. Once we have this covariance matrix of all these data points, we want to find which directions it is changing the most in.

So then we just calculate the eigenvalues and eigenvectors, because eigenvectors give you the directions of your data cloud, and the eigenvalues tell you how important each direction is to the overall change. I create a covariance matrix, I calculate its eigenvalues and its eigenvectors, and then I find the largest eigenvalues and their corresponding eigenvectors, and I look at them.
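
A rough NumPy sketch of that eigen-analysis (not the talk's code): take the covariance of the scaled feature matrix, eigendecompose it, and see how much of the total variance each direction explains. The per-feature percentages quoted next would come from inspecting the loadings of the top eigenvectors.

```python
# Illustrative principal component analysis via the covariance matrix.
import numpy as np

cov = np.cov(X_scaled, rowvar=False)               # M x M covariance matrix of the features
eigenvalues, eigenvectors = np.linalg.eigh(cov)    # eigh: the covariance matrix is symmetric
order = np.argsort(eigenvalues)[::-1]              # sort directions from most to least important
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained = eigenvalues / eigenvalues.sum()
for i, frac in enumerate(explained):
    print(f"component {i}: {frac:.0%} of total variance")
# The loadings in eigenvectors[:, 0] show which original features dominate
# the most important direction.
```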

What I found is that query counts account for around 33% of the information within my data; by information I mean it's probably one of the most important variables used to create my clusters. I'm only showing you three instead of the entire seven, just for space; afterwards you can come and talk to me about the rest. Unique users account for around 20%, and unique resolvers around 13%. The problem is, we want to do better, so there's one last feature that I decided to add in. I talked about how we were only looking at a two-hour window, and the problem is there's all this stuff that's going on before. Let's say that between hours 12 and 1 a domain is spiking, and five hours previously it had a little bit of traffic, but it wasn't really spiking; that information would not be captured in the first feature set that I talked about.

We don't really know the previous history of the domain, but that's really important, because we know that exploit kits in general go from zero to 100 really quickly. As a result, we want some measure of the volatility of the past query traffic to the domain, and the method that we decided to use is the Fano factor. In statistics, the Fano factor is a measure of dispersion: essentially, we take the variance, which is the spread of queries over some time window, and we divide it by the mean, and that gives us a dispersion index.

Dispersion very close to zero means you're dealing with a distribution that has very little variation, while dispersions that are greater than one usually mean there's a high degree of variation within that query pattern. So this is very useful as a method to help identify domains that have had a long history and have some sort of potentially spike-esque pattern to them in the past.
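
A minimal sketch of that dispersion feature: the variance of a domain's hourly query counts over some look-back window, divided by their mean. The window length and input format are assumptions, not details given in the talk.

```python
# Illustrative Fano factor of a domain's past hourly query counts.
import numpy as np

def fano_factor(hourly_counts):
    """hourly_counts: queries per hour for one domain over, say, the past week."""
    counts = np.asarray(hourly_counts, dtype=float)
    mean = counts.mean()
    if mean == 0:
        return 0.0                      # no history at all
    return counts.var() / mean          # ~1 for Poisson-like traffic, >1 for bursty traffic

print(fano_factor([0, 0, 0, 0, 400, 0, 0]))    # a sudden spike gives a large dispersion
print(fano_factor([10, 12, 9, 11, 10, 10]))    # steady traffic gives a small one
```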

So now we can talk about some of the results that we had. We had about 12 clusters, and some of the interesting clusters that emerged were mail servers. Mail servers love to spike at predetermined hours of the day; we caught about 1,000 mail servers which go from zero to around 200 queries about every hour. These are completely benign mail servers, though; it's just their normal query pattern. What's more interesting, however, is some of the DoS and spam that we see. What happens is some attacker notices that somebody has misconfigured their DNS information, and what they like to do is send a ton of DNS type-255 requests, which are ANY requests.

As a result you get this huge packet back, but they're spoofing their IP, so essentially some poor unsuspecting person gets giant DNS responses from a set of domains and gets overwhelmed with traffic. So DoS and spam is a very common technique and cluster pattern that we saw. One interesting thing about DoS is that for most exploit kits you see a fairly low-amplitude query spike, going from, say, 0 to maybe 200, while DoS goes from, say, 0 to 3,000; they're much higher in height, and that makes sense, because you're essentially trying to get as many DNS responses as possible within an hour.
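
Illustrative only, not something from the talk: one way to separate these amplification spikes from exploit-kit spikes is to look at how much of a domain's spike hour consists of ANY (qtype 255) queries. The log layout matches the earlier sketch and the threshold is an assumption.

```python
# Sketch of flagging spiking domains whose traffic is dominated by ANY queries.
def any_query_ratio(log_lines, domain):
    """Fraction of an hour's queries for `domain` that asked for qtype ANY (255)."""
    total = any_count = 0
    for line in log_lines:
        _ts, _client_ip, qname, _rcode, qtype = line.split()
        if qname == domain:
            total += 1
            any_count += qtype in ("255", "ANY")
    return any_count / total if total else 0.0

def looks_like_amplification(log_lines, spiking_domains, threshold=0.8):
    """Return the spiking domains that look like reflection/amplification targets."""
    return [d for d in spiking_domains if any_query_ratio(log_lines, d) >= threshold]
```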

We also noticed something similar to DoS in attacks on WordPress sites. People would try to brute-force WordPress sites, we believe by simply doing password brute-forcing, and this was verified by looking at some of their IPs and some previous history. Then we also saw a lot of bizarre clusters, like Tumblr: a lot of Tumblr websites would spike and then go down to nothing. The bottom two spikes right there refer to Tumblr domains, and as you can see they're almost identical, so there's some group of people who, every couple of hours, decide to visit Tumblr en masse.

And finally there's the most interesting part: the exploit kits. We were able to successfully find quite a few interesting exploit kits within all the junk, in one of the largest clusters; exploit kits were grouped quite well together. One of the most interesting cases that we found was actually a lot of compromised domains. We found this website, belonging to, I guess, some Turkish user, called escortlarbursa, and it had the following subdomains spike up, as you can see. Our guess is that escortlarbursa was compromised and these subdomains were actually injected.

Is anybody here familiar with domain shadowing? I know Cisco put out a report, and Dhia Mahjoub has talked about it in the past. It's this idea where people are able to hack registrar accounts based on some sort of vulnerability and then inject subdomains into various websites. For the longest time people had been noticing that GoDaddy was the biggest contributor to this whole domain shadowing problem, and what we started to realize is that there are actually quite a few other registrars that are also falling victim to this. All this work had to be done manually, but it would not have been possible if we didn't have easy methods to separate the data.

With this new domain, we then used escortlarbursa as a seed and leveraged all of our passive DNS data to help us find new malicious domains, and as a result we actually ended up finding an entire IP space that was used by these guys to host some pretty bad stuff, and we were able to block it all. And that's about it. I'd like to thank Dhia Mahjoub, a senior security researcher at OpenDNS, for a lot of the advice he gave me, and of course the rest of the team at OpenDNS, and John Lamping.

[Applause] Okay, are there any questions? Yeah, sure. Sorry, two questions. The first is: why did you assume that a Gaussian distribution was good enough? I didn't understand why you didn't use a Poisson distribution. And the second: did you guys try any kind of supervised learning, to maybe classify first and then analyze, before doing unsupervised? So I'll answer your second question first. With supervised learning, you really need to have a pretty good understanding of everything that you might potentially see, and when we first started this project we didn't have that good of an understanding, essentially the ground truth.

So we wanted to use unsupervised learning as a method for us to understand what's going on, and once we have that baseline, we can use more supervised learning techniques; we're starting to come up with some supervised learning techniques for this type of spike classification. As for your other question, if I understand correctly, it's why did we use the Gaussian. Well, we didn't want to use a Gaussian, because for count data, where everything is greater than zero, Gaussian distributions just don't work as well, because of scaling issues.

Using a binomial distribution would be ideal, some sort of negative binomial; I was just kind of lazy, I didn't want to code up an entire... it's a bit of a hassle, so we'll see what's quicker. Yes? Are you seeing any evidence of long TTLs to minimize DNS spikes, or pre-warming of resolver caches? Are you seeing long-lived TTLs to reduce the spikes, and any pre-warming of resolver caches? So, actually, we're not looking at any of the TTL data when we're doing the spike detection, so that would probably be one of the techniques that we'd want to start using once we already have nice clusters, as an additional layer of clustering.

As you noticed, there was no lexical analysis done, and no analysis of TTLs; I just wanted to look purely at the actual signal pattern itself, because with every feature I was adding in, I was concerned that it would distort how the algorithm would process the data, especially if I'm dealing with TTL data, which follows a completely different distribution than, say, the query pattern distribution. So it was a trade-off, but yeah, absolutely, using TTLs is something that will be done as a filter. Yeah.
