← All talks

Building a Predictive Pipeline to Rapidly Detect Phishing Domains

BSidesSF · 2018 · 30:02 · 659 views · Published 2018-04 · Watch on YouTube ↗
Speakers: Wes Connell
Category: Technical
Style: Talk
About this talk
Wes Connell - Building a Predictive Pipeline to Rapidly Detect Phishing Domains. Registering a new domain, obtaining a legitimate SSL certificate, and deploying it on a web server got much cheaper for threat actors thanks to free SSL services like Let's Encrypt. Detecting new phishing domains has always been a reactive process for security teams; just like malware, one cannot provide threat intelligence on phishing domains before they're registered and operationalized. The development of the Certificate Transparency log network adds an interesting dimension for how this process can be improved. SSL certificates, and the domains for which they are issued, can now be monitored in real time... and security analysts already have intuition on what phishing domains look like when they see them. Building a predictive pipeline to detect SSL certificates issued to new phishing domains can be reasonably accomplished using supervised machine learning. In this talk, I'll introduce a Python-based framework for building this predictive pipeline from scratch.
Transcript [en]

[Music]

Great. Can everyone hear me okay? All right, great. Again, the topic here is building a predictive pipeline to rapidly detect phishing domains, and I wanted to thank you for being here, thank you for having me, thanks to the volunteers and especially the audio-video team; they don't get any love. Again, my name is Wes Connell, and the research I'm presenting today is a little side project that I did in December. I've already published my slides to the BSides scheduling website, I've built a command-line utility, and I've also got an IPython/Jupyter notebook that's published to GitHub, and I'm gonna walk through some of that in this talk, and

with that, we're gonna do a little live demo, we're gonna kick this off. This has all been Dockerized, so I have a Docker container that's listening to this feed of domains. I'm gonna go ahead and kick that off, and it's not a real software development project if you don't have awesome ASCII art, so we're gonna let that run and then we'll come back to it in ten or fifteen minutes. A little bit about myself: I run the security analytics team at PatternEx, and again, the predictive model I'm presenting today is very simple, it's very straightforward. At PatternEx we've got predictive models that span multiple data sources and are looking at

both spatial and temporal features, looking at time. So thanks to them for letting me be here and release this; they didn't have any problems with that. And I think, you know, with this upcoming week there's going to be a lot of conversation about machine learning and AI and advanced statistics, among other things, and it would be nice to see vendors do a little bit more to explain the independent variables that make this up, versus having a lot of the discussion be "it's magic, it's a unicorn, it's perfect," all of that. So, my background: I've always been a security guy, and fortunately I got started at Northrop Grumman, and I

immediately joined a research team where I was exposed to machine learning and intrusion analysis. For me, a lot of the things that are second nature to security folks, others have never had exposure to, because it's everything I've learned on the job, and so I hope to share some of that knowledge and insight with you all. Here's what the agenda looks like: I'm gonna go over the ingredients for what I'm gonna talk about and the tool that I built, go over the implementation, the training data, the features, kind of what the machine learning lifecycle looks like, some accuracy metrics from the data that I have, and then we'll circle back to that demonstration, and then

hopefully we have some time for Q&A. For my awareness of the audience, if you could just raise your hand: are you familiar with the Certificate Transparency log network? Okay, great. And how many people have ever trained a predictive model? All right, great. So the Certificate Transparency log network is about open auditing and monitoring of the issuance of SSL certificates, and you might think, well, what on earth does this have to do with Twitter? I think it's a good analogy: if you were to look for Mark Wahlberg on Twitter, you would search and you might find 30 different accounts, and your visual cue for which one is actually his account is that verification check mark. And at Twitter

this isn't much of a problem, because they presumably have a small team, and they are the only authority that is issuing those check marks. Conversely, when you look at SSL and, you know, securing HTTP connections, you've got hundreds of authorities that are assuming that kind of role. So imagine if Twitter, instead of having one small team, said, "hey, you know what, we're gonna outsource this to 170 different groups," and when they publish those check marks, we don't really have a way to look and monitor and audit. That is precisely the problem that we have with SSL certificates; they're what we use to validate identities for web servers. So what Google did was, they have this

project they started in 2013 for monitoring and auditing, and occasionally you have certificate authorities that get popped or mistakenly issue subordinate certificates, so that's what this is all about. As an extension of that, there's a group called CaliDog Security that published a Python-based library that monitors this feed. I haven't looked at the nitty-gritty details of how it works, but I really like doing prototyping in Python, and you can just load it; it's on PyPI. So this is the data input to the predictive model, which I built using scikit-learn, so it gets along very nicely. And then the third component here is phishing attacks. The trend is very clear: this
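As an aside, a minimal consumer of that feed, assuming CaliDog's `certstream` package from PyPI, might look like the sketch below; the message fields follow the library's documented format, and the `print` is where a classifier would plug in:

```python
def extract_domains(message):
    """Pull the domain names out of one CertStream message; the feed
    also emits heartbeats, which we skip."""
    if message.get('message_type') != 'certificate_update':
        return []
    return message['data']['leaf_cert']['all_domains']

def callback(message, context):
    for domain in extract_domains(message):
        print(domain)  # a real pipeline would score the domain here

# To attach to the live feed (requires `pip install certstream`):
#   import certstream
#   certstream.listen_for_events(callback, url='wss://certstream.calidog.io/')
```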

graphic is from a group called PhishLabs, and if that's not exponential, I don't know what is. This is the rate of phishing attacks that are conducted via HTTPS, and part of the problem is, we used to have a lot of exploit kits, and we still have some today, but the sophistication bar for those is much higher than for phishing. You know, you can get a domain name for free from certain top-level domain registrars, you can get an SSL certificate for free, and you can probably get hosting for free, so this is a very cheap attack, and in most security programs people will agree that humans are the weakest link. So this is a trend that, again, for me is really important,

because if more phishing attacks are being conducted via HTTPS, they need an SSL certificate, and if I'm looking at that log network, I can kind of see when they're getting their certificates issued or renewed. A couple months ago, and this is what inspired me to build this project, this was in November, there's a security researcher, he's anonymous, who goes by the name x0rz, and you can see I took a couple things here. There are some keywords that assign arbitrary weights to different things, so if in this domain name you see PayPal or MoneyGram or Morgan Stanley, give it this number, and then if it's over a certain

threshold, then it's in bold and red; otherwise you can log it somewhere else. And immediately when I saw this, I said, hey, if I had been handed this project and he didn't give me any of these scores, if they were all blank, how would I know what the optimal scores are? And is there a way that I can prove it? Because from my end, I've got two things: I know how to characterize what a phishing domain looks like, and we've got training data; we've got historical examples of phishing attacks. So I immediately thought, hey, maybe the score of 70 should actually be 64, maybe these keywords are unnecessary, or maybe these ones are really good and

should have a higher weight. Another thing that he looked at is Levenshtein distances, which is a measure of similarity between one string and another. And I looked at this: is it suited for machine learning? Do we have the training data? Yes. Can I characterize it? You know, not all phishing domains have blatantly obvious characteristics in the domain name, but for a lot of what I'm looking for, especially from that feed, I do. So I thought, let's try to build a predictive model. A little bit of background on supervised learning: we're gonna start at the top left. The training data, that is libraries of domain names; I've got a

few. These are actually fully qualified domain names, so api.support.microsoft.com; you see the third one looks really suspicious. So that's the training data. The labels, for me, are binary: it's either, you know, uncharacterized, it's normal, or it's phishing. So I've got training data, and beneath that I've got labels, and the feature vector is just a fancy way to describe the characteristics that make up that domain. How do you describe home.paypal.country-somerandomdomain.com? You'd say there's one dash, there's three periods, one of the words is probably trying to look like PayPal, and that's what your feature vector is. So each one of these records

is a domain from my training set, and if the label is zero, that means it's benign, and those are the characteristics that make it up. So when you train a predictive model, it's not like magic, where you throw a list of domains at it and it returns phishing or not phishing; it actually doesn't even see that. It is taking your characteristics, how you describe each domain, with a label, and all it does is use its algorithm to answer the question of what the optimal score is; that's what it computes for you. But it requires those characteristics and the training data. So what I've done with this project is, I built my predictive model,
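As a concrete sketch of that idea, turning a domain into a feature vector is just a handful of counts and keyword checks. This is an illustrative feature set, not the talk's exact one:

```python
import re

def feature_vector(fqdn):
    """Describe a domain by simple characteristics: special-character
    counts, number of word tokens, and whether a brand keyword appears."""
    tokens = [t for t in re.split(r'\W+', fqdn.lower()) if t]
    return {
        'num_dashes': fqdn.count('-'),
        'num_dots': fqdn.count('.'),
        'num_tokens': len(tokens),
        'has_paypal': int('paypal' in tokens),
    }
```

A supervised learner never sees the raw string, only rows like this plus a 0/1 label.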

and in this next slide, in the kind of workflow we're seeing, the feature vector input would be the CertStream feed. So it's taking my knowledge, all the training data I have, what I think phishing looks like; it's building a classifier; and then I'm hooking that onto that feed. And ideally, for it to be effective, my training data must be representative of what that feed is spitting out the whole time; otherwise you have a lot of noise. So I'm gonna go into the implementation here. There are a couple different definitions of what the machine learning lifecycle looks like; of course it depends on whether you're doing supervised modeling or unsupervised modeling, where you don't

have labels and you're just fitting normal distributions and trying to find anomalous or extreme deviations from normal. For me, I'm gonna start in the top right: I have this idea, I think I can build a predictive model, I know where to get the data, and I've got some ideas on how to normalize it, which is really important too. If it takes URLs, if it takes fully qualified domain names, if it takes just domain names, I have to normalize that, or it's gonna throw off my classifier. Then I'm gonna explore different features, and I'll go through all of that; I'm gonna extract them, I might engineer a few things, I'm gonna select an algorithm and make sure
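That normalization step can be sketched in a few lines; the talk's tool may do this differently, but the idea is reducing URLs, FQDNs, and wildcard entries to one common shape:

```python
from urllib.parse import urlparse

def normalize(record):
    """Reduce a URL or fully qualified domain name to a bare lowercase
    hostname, so every record entering the classifier looks the same."""
    record = record.strip().lower()
    if '://' not in record:
        record = 'http://' + record  # let urlparse find the hostname
    host = urlparse(record).hostname or ''
    # CertStream wildcard entries look like "*.example.com"
    return host.lstrip('*.')
```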

that I select the right parameters for said algorithm, and then I'm gonna evaluate, and this is very much iterative. There were lots of oversights I had; for example, I was looking for the keywords "account" and "services" or "service," and then when I ran this on CertStream, of course, it's tax season, I would flag everything that was accounting-services-dot-whatever. So, data acquisition: I took a number of different things. That security researcher who had his project, I took a few of those keywords. I looked at another researcher, SwiftOnSecurity, who has published some phishing regexes for Exchange; there were some good words there, I thought. And then I looked at PhishTank, which is a

resource for phishing domains, and I looked at the top Alexa domains; those are the most popular domains. Just because a domain is in Alexa doesn't mean it's not malicious, but specifically I'm looking for brand-new phishing domains, and it's highly unlikely that those are gonna end up in the top of Alexa, and even if a handful do, I might be able to generalize enough with my classifier to detect them. And then again, it's very iterative; the data that I used ultimately came from doing model corrections from my false positives. As far as features: I took, you know, 800,000 different domain names, and I got it down to 170,000 I thought were

definitely phishing. And I split them by special characters; if you can see, the screen is probably large enough: I just did a regex and split on \W, so every special character, whether it be a period or a dash or whatever, acts as a delimiter, and I appended the pieces to this list. I had this massive list, and then I used Counter from the collections library and I just said, show me the most common. Now, what I should have done is filtered out the top-level domains, but that's a very long list, and I saw probably three hundred tokens that were really suitable, where when I see one, that's a good indication that it might be phishing,
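That token-counting trick is only a few lines of Python; a sketch of it:

```python
import re
from collections import Counter

def common_tokens(domains, n=10):
    """Split every domain on non-word characters (\\W) and count the
    resulting tokens, surfacing candidate phishing keywords."""
    counts = Counter()
    for d in domains:
        counts.update(t for t in re.split(r'\W+', d.lower()) if t)
    return counts.most_common(n)
```

In practice you would also filter out TLD labels like `com` before reading off keywords.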

so that's kind of where I got about a hundred or so. And again, the platform is extensible; you can add whichever ones you want, maybe I missed some. Next was looking at popular brands, and this was really eye-opening. Fortunately, PhishTank already has this search where you can look by targeted brands; I didn't think to look for things like GitHub or Dropbox, I just thought of the defaults like Apple and PayPal. So there were about 30 brands there, some banks that I hadn't thought of either, and I cherry-picked them from this list and added a few of my own; Gmail is a pretty popular one. And then I searched through, again, the 700,000 or so domains that I had

from the regex and keyword search. That's completely uncharacterized data, but it gives me some corpus to look through to try to find data for my predictive model. And this one's really important, and it's really neat; it's also very simple: looking at Levenshtein distances from phishing words. Again, this is a measure of similarity between two strings. I don't have signatures for these two examples on the top right, but I know when I see them that one's definitely trying to look like "Apple ID" and the other's trying to look like "PayPal." But if I search for only "paypal" or "apple" or "app," I'm gonna miss those every single time. So what I'm looking at is saying, for

this list of keywords, and I'm seeding it with things like PayPal, Chase Bank, HSBC: take all of the domains, split them by those special characters, compute the Levenshtein distance, and see what it is. And again, on the right-hand side, these all have a distance of one from "paypal" and "appleid." This is also a really helpful utility for typosquatting: if you fat-finger espn.com and drop a letter, that domain is registered and owned by some threat actors that redirect you to exploit kit landing pages. So it's something that I leverage for this utility. Another one is looking at targeted brands, and
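Levenshtein distance itself is a short dynamic-programming routine; a textbook implementation (the talk's tool presumably uses a library for this):

```python
def levenshtein(a, b):
    """Edit distance between two strings: the minimum number of
    single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]
```

A token with distance 1 from "paypal" gets flagged even though no exact keyword matches.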

specifically the placement of them: are they the second-level domain, or are they a subdomain on a different domain? This is something that, again, encapsulates my intuition, and I can write it in a few lines of Python. So on the left-hand side we've got some benign data where the brand is the second-level domain, for example microsoft.com; those are ones that I actually flagged initially, and this is before I started looking at the second-level domain being one of, you know, thirty or forty really popular brands. Conversely, when you look at phishing, there's a huge distinction: having apple.com as a host on a different domain. And again, that's a
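Those few lines of Python might look like this; a sketch with naive two-label TLD handling, where production code would consult the Public Suffix List:

```python
def brand_placement(fqdn, brand):
    """Report where a brand name sits in a domain: in the second-level
    domain (its legitimate spot) or in a subdomain of some other
    domain (a classic phishing tell)."""
    labels = fqdn.lower().split('.')
    if len(labels) < 2 or brand not in labels:
        return 'absent'
    return 'sld' if labels[-2] == brand else 'subdomain'
```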

feature that I can look for in my predictive model. A few other things: I looked at the top-level domains, these are things like .com and .org and .net and .tk and .ml and lots more, and I extracted those and used them as categorical features. I also looked at the number of dashes and the number of periods, and the domain entropy, which kind of calculates how random it is: a random-looking string will generally have a higher entropy, and repeating characters a lower one. And again, I have an IPython notebook that walks through every single one of these. As far as the algorithm, I used logistic regression, and some would argue that this is hardly even machine learning, it's more advanced statistics.
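The entropy feature mentioned above is standard Shannon entropy over the characters of the domain string; a minimal sketch:

```python
import math
from collections import Counter

def entropy(s):
    """Shannon entropy in bits per character: random-looking strings
    score higher, repeated characters pull the score down."""
    if not s:
        return 0.0
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```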

For me, if I wanted to use random forests, it's a one-line code change, and I just like this because it's very fast to predict, it's very fast to train, and it gives me exactly what I'm looking for. What the algorithm does is, based on my feature set, my descriptions, and the training data I have, it computes the feature weights and the offset; I just give it the features. So for each feature it has a weight, and one of the things that's nice about this is, after you train the model, each feature gets a coefficient, which is like the strength of whether it's more likely to be in the positive class or in the

negative class. And what you'll see is, the blue ticks in this histogram, those are really strong identifiers of phishing. So you'll see the Bank of America brand in the subdomain; I think we would all agree, if you saw that on some random domain, that's probably phishing. This one had a PayPal Levenshtein distance of one; that's also a good indicator, and there are a few others. On the other side, if the top-level domain is .com, if the accounting-service keyword is there, that's also a really important characteristic, and if the PayPal brand is in the second-level domain, that's also important. The other thing is that, especially with neural nets, it's really difficult to visually interpret why the
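Reading a trained model's coefficients this way is straightforward in scikit-learn. A toy sketch with made-up feature names and data, not the talk's actual training set:

```python
from sklearn.linear_model import LogisticRegression

# Each row is a tiny made-up feature vector:
# [num_dashes, brand_in_subdomain, levenshtein_hit]; label 1 = phishing.
X = [[0, 0, 0], [1, 0, 0], [0, 0, 0],
     [3, 1, 1], [2, 1, 0], [4, 1, 1]]
y = [0, 0, 0, 1, 1, 1]

clf = LogisticRegression(solver='liblinear').fit(X, y)

# Positive coefficients push a domain toward the phishing class,
# negative ones toward benign -- the histogram from the slides.
weights = dict(zip(['num_dashes', 'brand_in_subdomain', 'levenshtein_hit'],
                   clf.coef_[0]))
```

Swapping in a different algorithm really is a one-line change, e.g. `RandomForestClassifier()` in place of `LogisticRegression(...)`, though you lose the directly interpretable coefficients.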

classifier is doing what it's doing, and this is a simple enough problem where I can use logistic regression; the features I had were really good, and the data I had was great as well. So when I was swapping in different algorithms, they didn't really make a difference, and I liked using this to visually understand what it was doing. Now, another point here is that just because these have really high weights doesn't mean that that is definitively, like, the benchmark globally; that is merely a representation of my training data. So if I have all of the Bank of America brand subdomains in my training set, that's going to be

the highest, but maybe it's actually something else and I don't have that in my training set, or at least I don't have all those examples in my training set. And that's what 80 to 90 percent of what data scientists spend their time doing is: getting training data that reflects the real world. Especially when you're doing deployments with customers that have very dynamic data streams, they have different traffic profiles, and getting data that represents their environment is incredibly difficult. And again, this is what's so critically important, going back to that keyword tool where it had 70 and 60 and 45: I don't have to do that. I just have to say, here's how I look at phishing domains,

and here are some examples, and that's something I think is really powerful as a security analyst: the ability to generalize at scale. That's what supervised machine learning can offer, and even though, again, this is a very simple model, it doesn't mean it's not impactful. There are lots of metrics for measuring performance, and I'll go over this in the notebook again, but the confusion matrix, despite the name, is not confusing, it's pretty simple. This kind of shows me my misclassifications: I'm taking all my training data and splitting it into a training set, and then I leave a little bit out to do testing, and this is showing the

performance of that classifier against the test set. And what's good is, I have one false positive and I've got six false negatives, and largely I'm getting the classifications perfect: when it's actually phishing, it says it's phishing; when it's actually benign, it predicts it as benign. And again, I'm not using any rules or signatures or, you know, things like that. So with that, let's jump back into the demo. We'll pivot back to the command line, and this is great; this was running, again, the whole time. If you take a look at some of those names, I think we can all agree I don't even have to do some kind of dynamic analysis on every single one of
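For reference, precision and recall fall straight out of those confusion-matrix counts. A sketch with a hypothetical true-positive count, since the talk only gives the 1 false positive and 6 false negatives:

```python
def precision_recall(tp, fp, fn):
    """Precision: of everything flagged, the fraction that really was
    phishing. Recall: of all real phishing, the fraction flagged."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical numbers mirroring the talk's shape: 1 FP, 6 FN.
p, r = precision_recall(tp=194, fp=1, fn=6)
```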

these here. I mean, "unlock update account information" dot-cf: I didn't have a signature for that, I just had keywords and some ways to characterize it. And you can look at a bunch more: account, Apple ID, locked, verify, and so on. So this is kind of what the command-line tool does; it writes them to a text file on disk. I didn't want to boil the ocean; I mean, I get lots of questions, and honestly I'm not even looking at the issuing CA. I think there's a trend; I see a lot of the suspicious ones are from Let's Encrypt, and you could extend the model to also take that into account, but I'm not. I think the idea here was just

to give some, you know, introduction, some background on how this works, and the different independent variables that make up the classifier, and to give you some idea of what the CertStream feed looks like; I've seen anywhere from a thousand to five thousand domains per second. So on the opposite side, it takes a few seconds to fire up, but these are all the domains that my classifier is analyzing, and on the left-hand side, those are the only ones it's flagging. If it were my job to go look through this fire hose of a list and find the phishing domains, I'd walk out the door, or I would just

train a predictive model that did this, tell them I'm working eight hours a day, and that's that. So, let me stop that. Actually, I'll pivot back a little bit more into the command-line tool. So this is it; everything is self-contained. I packaged all of my training data; you can train your predictive models from the command line, you can view them, you can manage them. They're binary objects, so when you train multiple classifiers, it's important that if you swap them, you have to be able to extract the same features that were used at training. So if I start with one classifier that looks at ten features, then I add more, then 20 and then 30 and 40, and then I want to

revert back to the first classifier, I have to have the same feature vector from when I trained as when I'm deploying. So again, everything's Dockerized: I have a database that stores those objects in its own container, I have this command-line tool that runs in its own, and the Jupyter notebook runs in its own container as well. And I've trained a couple classifiers here; I didn't deviate any of the features, I didn't change any of the training, but you can see some of what this looks like: I've got the true positive rate, false positive rate, area under the curve, things like that, test-set accuracy, the size of the feature vector. Again, this is
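That "same features at training time and deploy time" constraint can be enforced by persisting the feature list alongside the binary object. A sketch using pickle to a file; the talk's tool stores its objects in a database container instead:

```python
import pickle

def save_model(clf, feature_names, path):
    """Persist a classifier together with the exact ordered feature
    list it was trained on."""
    with open(path, 'wb') as f:
        pickle.dump({'clf': clf, 'features': list(feature_names)}, f)

def load_model(path, expected_features):
    """Reload a classifier, refusing to run if the deployed feature
    vector no longer matches the one used at training time."""
    with open(path, 'rb') as f:
        bundle = pickle.load(f)
    if bundle['features'] != list(expected_features):
        raise ValueError('feature vector changed since training')
    return bundle['clf']
```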

identical, but as you extend this, or explore, maybe you've got some ideas for better keywords or better features or better training data, you can continuously iterate, and all of that, the data, the features, is being persisted, so you can swap them and use them as you see fit. And that's, again, a lot of what data scientists do in Zeppelin notebooks or in Jupyter notebooks; here you can drive it from the command line. And if you haven't used Docker containers, this was my first real exposure to using them for development; it's fantastic, I really like it. And it's built in Python 3, and I understand everyone has different environments; maybe you have a MacBook,

or Windows, or different flavors of Linux, so if you have Docker and Docker Compose, you can run it anywhere. I use Linux; this is a System76 laptop. I usually have virtual private servers on DigitalOcean, and I have an install script that runs perfectly, and when I pushed this code on Saturday, I had my friend check it on his MacBook, and he just ran docker-compose up and then it was running. So that's pretty neat. A couple examples of what we saw from the feed; you could pick pretty much any of those. Here are a few examples of what I flagged: there was a PayPal support one, and

when I hit it from a honey client, I got a 403 Forbidden error, and then I spoofed the user agent and referrer and things like that, and then I got it to download. But these are things that people are using, so dynamic analysis would not necessarily have worked for me. And then I put it in VirusTotal, for whatever it's worth; a lot of the engines... it's not the complete product, and especially with phishing, it'll probably catch it, but maybe there's a delay. For me as a threat researcher, it's one of the lower-hanging tools; I've got lots of other utilities I prefer to use when I'm really doing a threat analysis. And

then on the bottom right-hand side, here's the sandbox where I got it to work. I had actually found this on the same day, this was December 28th, 2017, when I submitted this talk, and this is the certificate that was issued to that domain that day, and you can see behind the screen it's a PayPal login. Another thing I found a lot of is phishing kits; I found hundreds. I haven't built a system that automatically goes and hits those websites and does some directory crawling and fetches the zip archives, but I found, again, a few hundred, and here's one that I pulled. When I went to the website, you can see the index-of

directory listing, and I found a zip archive, and I took that, and one of the first things I did was send it to VirusTotal, and one of the vendors said it's got a web shell. And when I looked at it, again, you can see it takes your credentials and sends them to whoever the attacker is at their email address. Again, this is not a unicorn, it's not magic, it's not a silver bullet; there are plenty of ways to bypass this. One example is international characters: they look different, you know, from the UTF-8 characters that I have, and that's something that I need to take into account. Another one is, if you're
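A first line of defense against that international-character bypass is simply flagging any candidate domain that contains non-ASCII characters before (or alongside) classification; a minimal check:

```python
def has_non_ascii(domain):
    """True if the domain contains characters outside ASCII, e.g. a
    Cyrillic '\u0430' standing in for the Latin 'a' in a look-alike
    domain. ASCII keyword matching alone would sail right past it."""
    return any(ord(ch) > 127 for ch in domain)
```

A fuller treatment would decode punycode (`xn--` labels) and check for mixed scripts, which this sketch does not attempt.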

looking at URLs: my classifier's data input is only fully qualified domain names, and I can very easily extend that to look at URLs, but at the moment, if you were to take this and run it in production or something, there are attributes in the URI, which a lot of times I see; there's a researcher on Twitter who posts phishing URLs every day, and there are lots of characteristics I see in the URI that my classifier is not looking at, so there it wouldn't help. Another one, for the deployment that I have here, is that if you're not getting an SSL certificate, I'm not going

to see that domain, and then I'm not even going to analyze it, so that's something to take into account. I think a really good use of a tool like this would be in, like, a DNS service: I've got a tool that's pretty proactive, and rather than reactively chasing these alerts, it would be nice if, again, because I'm scoring so quickly, on the DNS lookup you would have the fully qualified domain name, and that would be a pretty good implementation. Another one that's really important is that a lot of the fully qualified domain names come from Let's Encrypt, and they enabled wildcard certificate support last month; it was due in January, and it was

pushed to March, and now it's live. I thought I was going to see a pretty good drop-off in the number of phishing domains, but as we saw, I mean, we can pivot back, there were several that we saw up here, so I haven't seen that. You can see a lot of these are like paypal-com-admin-refund-mandatory.gq; the wildcard support would kind of change that, where it'd be star-dot-admin-refund-mandatory.gq, and I would still probably catch that, but it's just something to be mindful of. And then the biggest caveat is that my recall rate is unknown, and I'll go through the Jupyter notebook in just a moment: out of all

of the domains that are in CertStream, I don't know; I'm not doing ground truth on every single one, so I don't know how many I'm missing. But I know my precision is very high, because the output is manageable. With regards to... let's see, here's what the Jupyter notebook looks like. It runs as its own Docker container; when you bring it up, you just look at port 9000 on that server, and I walk through everything in as much detail as I can. What's really nice about this is I can interweave text with code, and then you can literally just run it and go all the way down; it's very detailed, and it explains the

features I'm using and the code that supports them, and then at the end, again, some of the features, how categorical features work, the training of the classifier; you can view the performance metrics, and these are some of the metrics that you saw in my slide deck as well, and more detail on the precision and recall and true positive and false positive rates. I'd encourage you to check it out; if you have any questions, if things don't make sense, just reach out and let me know. It's available on GitHub at this link; I'll leave this up for just a moment. I pushed it on Saturday, and again, everything's available there: I've got the training

data, I've got three different folders for my containers, and there's a lot of things that you could do with this. I'm not even looking at the IP addresses that these resolve to; I could look at hosting providers, I could look at passive DNS, I could actually hit the websites, pull down HTML, do bag-of-words, create a second classifier; you can go crazy. What I was trying to do here was to say: here's a very simple example of how predictive modeling works, if you have a problem where you're drowning in data. And, you know, there's a lot of talk, especially again this week at RSA, about machine learning and AI solving everyone's problems, and this technology

from the future that marks the beginning of the end for the human race; maybe you can hit the marketing guy with: what do the precision and recall rates look like, where did you get your training data, what kind of features are you extracting? And then he'll probably forward you to the data science team, and that's who you really want to talk to. And I think that's everything I had. I'm light on time, but I hope you enjoyed it. If you think I'm stupid, if you think I'm a genius, if you want to buy me a beer, I'll be here this week. Thank you very much. [Applause]