
GT - Is This Magikarp a Gyarados?: Machine Learning for Phishing Detection - Veronica Weiss

BSides Las Vegas | 31:41 | 101 views | Published 2019-10
About this talk
GT - Is This Magikarp a Gyarados?: Machine Learning for Phishing Detection - Veronica Weiss Ground Truth BSidesLV 2019 - Tuscany Hotel - Aug 07, 2019

Hi everyone, my name is V Weiss, I also go by Veronica Weiss, and welcome to my talk, Is This Magikarp a Gyarados?, on using machine learning for phishing detection and the interplay between the two. So let's just start off the talk with who am I. I'm just a college student who really likes gradient boosting and Pokemon. This past summer I've been working on cloud application security, really dealing with credential-compromise type of tooling, which is really funny given that Capital One happened, because hey, what better way to show your value than an actual attack happening. I've worked on data science teams in the past and security operations teams in the past, and this work kind of was inspired

because I noticed a lot of workflows not being fully automated, which then got me interested in machine learning and how to automate more of your workflows as a security operations team. I also was working for a computer vision lab for Skidmore College and MIT this year, which really introduced me to the more visual part of data analysis. So let's get the elephant in the room out of the way: sometimes you click on a harmless link and it's like a Magikarp using Splash, and then sometimes you click on a link and it's a Hyper Beam: an exe is downloaded, and your IP address is somewhere that's not just your personal ARP table, and some attacker has your network

information. So that's just a quick way to describe phishing to some millennial kid if they don't know what that is. The agenda today is a brief history of phishing detection that goes into the current state of research, the data collection process if you're an independent researcher who wants to get into this, some data processing and feature extraction that you could do once you have that data, and modeling and evaluation, as well as a conclusion. So the motivation is that I was just in college and wanted to do a machine learning project. I was thinking about maybe taking ELF files and looking at header data, and then my professor said that's too ambitious for one semester, and honestly

matching file signatures in the year 2019 is kind of outdated. So phishing is pretty cool, it's a nice legitimate business case. But the issue is, as a college student you don't have an institutional college phishing repository accessible to you. If I was some security analyst at a big financial institution, then I would be able to just ask someone for some phishing emails and would be able to get some. As a college student you are kind of just thrown into the abyss, with the internet and your brain as the only tools you have. So just one thing about phishing: I have some friends who are really smart at security, but they were asking me, why phishing? Like, can't

someone look at an email and say, hey, I can tell whether this is fake or not? But let's think about the vast audience of people who aren't accustomed to always thinking about what they're clicking on or how they're using the Internet. If you're just an employee, low level, you're thinking about just your day-to-day operations, and you get an email from your CISO saying, hey, update your credentials, it's mandatory, it'll look bad on you with your manager if you don't update your information. Of course you're going to click on a website such as the one you see, that looks pretty legitimate, and then boom,

all your iCloud credentials are stolen, and now people can go look at your weird dog pictures that you've collected for many months. So that's really important to think about: as security professionals we might think something's basic, but what about just a normal person? Common phishing attacks that happen right now are DNS cache poisoning attacks, where an attacker can compromise a DNS cache server and then redirect to other IP addresses, or malicious open redirects, which actually use the JavaScript in the HTML itself to redirect to a different website or a malicious attack. And then a botnet, of course everyone knows what this is: it's just having your hosts be compromised as

a larger attack surface. So how has phishing detection been done previously, and how has it changed at all? Before, there were blacklists and whitelists, and that's just a lot of manual effort. It's not scalable, a security analyst has to maintain this, and it also is based entirely on previous incidents. When you have larger, full phishing campaigns targeting your security team, especially if you're a smaller institution, you cannot just rely on static retrospective work, and that's why I think automation and machine learning can be really helpful to solve this issue. A newer way of phishing detection has been human- and machine-learning-driven, and it's

really great if you could build out your own machine learning detection, but as security operations teams you usually have to buy a product instead. So this is the small part of the talk where we talk about products, sorry about that. There's Cofense's human-driven phishing defense, which has human analysts observing phishing emails in real time, looking through your email repositories and kind of asking, hey, is this phishing or not? BehavioSec, Agari, and Inky are all companies right now that are pretty prevalent. All are in Series A funding or better, and they rely on machine learning in various ways. BehavioSec uses user behavior analytics, and we'll talk about how it's kind of a little

suspicious to just depend on weird user behaviors that you may or may not actually have to be looking at to detect fraud. Agari uses ML as well, but they don't really like to share their solutions. Inky is based on textual analysis using NLP and visual analysis using convolutional neural networks, as well as other probabilistic modeling that they use to extract features from the email directly. So if you're a security operations team, these are some good products that you could use, more so Inky. So we talked about behavioral metrics, but in machine learning a lot of products right now are looking at the way that users type on a keyboard, the

ways that they swipe on an iPhone, and that just kind of seems to me a little intrusive on people's personal information and confidentiality. Hopefully at Ground Truth people have been talking about collecting the right data to prevent cybersecurity attacks as well. If you're a company that depends upon combating fraud but you're also looking at people's scrolling techniques, like, that's exactly what Tinder uses to boost their metrics. You probably don't need that. That's kind of uncomfortable information, and slightly dystopian to have on your employees. So what if we had a better approach that was scalable, that wasn't all based on a retrospective approach, that was more automated, and had

relevant information, where you didn't have to just store extraneous information about people and their daily habits? Detection the right way depends on two data sources: emails and websites. We'll go into how to dynamically evaluate an email, but we also have to be able to dynamically evaluate a website as well. This talk will go into what's been done before and the ways that you as an independent researcher can go into scraping phishing emails off the web and build your own phishing repository. So here's the email part. This is just a phishing email, and as you can see, you see the text, you see the paragraph, you see the colors, the space, the

way that white space plays into this email. So the right way to look at this would be to have an NLP-based textual analysis as well as a visual analysis that looks at the coloration or spots anything that's odd, because as humans we can look at an email and see, oh, something's off, and we need to be able to teach machines to do the same. Standardization for your email classification should go upon eight different categories, which are actually seven categories. One is a Bitcoin or cryptocurrency extortion scam. A lot of phishing emails right now are talking about ICOs and saying, hey, if you reply to this with a certain

amount of information, we'll send back some Bitcoin wallets or Coinbase account information, and that's just not real, so that should really be a prevalent category in an email classification scheme. Then there are emails disguised as coming from an educational or a banking institution, and emails disguised as coming from an application or a help desk type of person who is asking, hey, click on this link to update your account. Those are also really prevalent and should be accounted for as well. 419 scams, sometimes called Nigerian prince emails, are when someone sends, hey, I'm in dire need of help, can you send information back,

and those are obviously bad and should be accounted for as well. Malicious PDFs are becoming more and more important, as DocuSign and electronic mailing can really target executives and people who aren't really trained on how to use the internet safely but are very powerful. And then of course we want to have a non-phishing email group as a control group for our classifier. So here's the website part, which can be broken down into four different parts: the URL, the deployment of the website, the content of the website, which includes the copyrighted content, the structured text, the HTML, and the JavaScript itself, as well as the reputation, which we could use

certain algorithms like PageRank to determine. So now let's talk a little bit about data collection, the feature extraction process, and modeling. There's such a lack of phishing email datasets, and that's probably because researchers in academia do not have access to corporate and institutional emails. It's also really hard to anonymize that information once you even have it, because you want to preserve the integrity of the email for your classifiers. Thankfully, certain services help: Google Cloud has a Data Loss Prevention API that can be used to help deal with anonymization if you're interested in this work, and that's one recommendation for that. Part of a phishing dataset too is

what are the prominent features of phishing datasets, what are we going to look at. There needs to be more of a discussion in the machine learning community about, hey, are we going to look more at the URL, or are we going to look more at the JavaScript? We really need people who are going to do a deep dive and say, hey, we have these features, they're important, and let's talk about the correlations between them, instead of asking what exactly the prominent features are. The Enron email dataset is a really interesting dataset that is from an energy company in 2002, and it can be used as a control group of just regular, non-phishing emails. What happened was that

an academic person from Amherst College purchased this email server for about ten thousand dollars, and now it's being used for academic research. The original copy was about sixteen megabytes; now it's much shorter, thankfully. If you're interested in having some in-house phishing research, this is a good dataset to start drawing from. I've used it before for my classifier building, and it's a really good place to just have a basic control group. Public phishing repositories are also really good to have and know about. Cornell, UNC, and Texas all have their IT departments give out different phishing repositories, and it's really easy to scrape the data off of them,

and I'll also be putting on GitHub the tools that I've written using Selenium and Docker to scrape phishing repositories. As long as you have VMware, a file system that you want to write out to, and it could even be just a file system or a PDF, as long as you can draw from it and make an analytics base table off of it, then you can use it. So this will be going up on my personal GitHub soon, if you don't want to have to write your own Python scripting, or maybe you don't know how to write Python, so that'll be up in a bit too. So the phishing website datasets that do exist are one

from Mohammad, McCluskey, and Thabtah. It's probably the most comprehensive phishing website dataset, and it has about eleven thousand instances. All the values are also binary categorical data, so it's all one-hot encoded values, which is really good if you have some rule-based decision sets or a decision tree such as C4.5: you'll be able to have some nice splits because of the data's binary categorical nature. The dataset is split about 43 to 57, so you won't have to jitter too much or change the data. So now let's talk about what Mohammad, McCluskey, and Thabtah did, to talk a little bit about

what it means to do feature extraction from a URL. Important things to think about for phishing detection in a URL: is an IP address in the URL? If so, it's probably harmful. The length of the URL: if it's 75 characters or more, then that's probably going to be more harmful. If there's an HTTPS token added manually to the URL, that's also going to be harmful. Of course, if there are two forward slashes in places they shouldn't be, or if there's an at sign in the URL, that's probably suspicious; an at sign can be used to get around browser mitigations. So it's really important to look for these miscellaneous

tokens that could be suspicious, and this dataset also has them included in its data, so you could put this into a classifier and see the correlations and relations as well as the frequency with which attacks like this happen in their datasets. Now we can move on to deployment, which is just asking: is the issuer of this HTTPS certificate a certified, trusted issuer? Are any ports open, and if so, which ones? How old is the website, and does the website have a DNS record or not? Also, really small things on the website, such as the image itself in the address bar, could tell you if it's from a

different domain and if it's harmful or not. Then we can move on to the content itself, which is looking at whether an iframe was used. An iframe, if you're not familiar with HTML, is a way to embed a web page inside of another website, so you would be able to, in a subtle way, visually get around someone by having them see a website inside a website, so that would be harmful. Then seeing if the website disables the user from right-clicking: you can just use basic JavaScript to disable right-clicking on a website, and if that is the case, that's probably harmful as well. Also, if the website asks the user

to submit information to an email address, using mailto or any client-side mailing systems, that's probably going to be harmful as well, as is any javascript:void(0), skip, or hashtag in the anchor tags on the webpage itself. And if a pop-up window occurs, that's obviously harmful as well. Now we can move on to the reputation, which is just talking about how the website interplays with other websites on the web and how Google itself sees it: if you were to google it, is it more of a safe website, or would it be considered suspicious?

and the features to look at for that would be to see if the website has web traffic going to it, seeing if it appears in Google, seeing if the website is pointed to by another website, and seeing the PageRank, which is an algorithm written by Google that ranks websites on a scale of zero to one. If a website is at 0.2 or less, so it's closer to the zero end of the scale, then that is really harmful, just because there's not that much web traffic going to it. And if the website is blacklisted or reported by other domains as being harmful, then

it'll be automatically suspicious in the classifier as well. So if you do recursive feature elimination, you'll end up with the five most statistically significant coefficients. If you do that, you'll find that the most statistically significant coefficients are: HTML that deals with a redirect, which is javascript:void or anything that deals with skip or a hashtag; if the URL has two subdomains or more, which means that there are dots in the URL in places that shouldn't be and the URL consists of more than three of those sections; if the amount of web traffic to the website is negligible, so if there is nothing really going to it, then that

would be considered suspicious, which makes sense, that's one of the most important statistically significant variables or coefficients; and if meta or script tags on a website contain a URL, then that's also suspicious. So that makes sense. But a website having a trusted certificate was also seen as a statistically significant coefficient; however, it's pretty easy to get a cert if you're ever setting up a honeypot or some fake websites. So we'll go into how context really matters when you're evaluating your classification models, and how you can't just rely upon recursive feature elimination or go based on certain metrics alone. So if we go on to modeling, if you just use the

table dataset, you'll end up with something around 95 percent accuracy with a 0.91 Kappa statistic, which means that over ninety-one percent of the data is actually reliable and is better than just guessing at random. Then if you use a C4.5 decision tree, you end up with 97 percent accuracy, which means that 94 percent of the data is better than guessing at random as well. These are really high accuracies and Kappa statistics, due to the representation of the data itself, because of its categorical binary nature. And if you just use naive Bayes, you'll end up with

about seventy percent if you don't do any feature extraction, but then you can obviously increase your metrics with that, and so you can end up with about 91 percent accuracy, with about eighty-five percent of the data being reliable and better than guessing as well. However, the dataset does not really talk about whether all these phishing attacks and campaigns are independent of each other, and a really important assumption in naive Bayes is that each category or row is independent. So if the people who released this research had really talked about the independence assumption, and how maybe it's not just one attacker making a phishing kit, then that would make a much

more reliable assumption for a classifier. A neural network, just a basic RNN, shows about 93 percent accuracy, with almost 90 percent being better than guessing on the data. And then gradient boosting does really well on binary and categorical data, because a classifier like LightGBM depends upon gradient-based one-side sampling, which gets rid of any features that have a small information gain, as well as exclusive feature bundling, so if any features are mutually exclusive then they are also taken out of the dataset. So it makes sense that the classifier would have such a high accuracy, with 90 percent of the data being better than guessing.

So just takeaways from modeling: if you have binary categorical data, a really complex scenario reduced to zeros and ones, then you are going to have really nice decision trees, you'll have really nice accuracies and metrics, but you also really need to think about how they'll play into the context of real life. Also, context really does matter with recursive feature selection. You can have really important statistically significant variables, but if it's still pretty easy for an attacker to get a cert, then that just means you need to really think about how your classifier is going to deal with the real world. And the future work that I would want to do is apply

more of a visual analysis and textual analysis to the email repository that I'm building out. VirusTotal is also really important for malware analysis, and I think it would be really interesting if people were to make a VirusTotal for phishing emails and have hashes associated with them, because these campaigns reuse some of the same techniques and the exact same email itself. So I think that would be a really important way for security analysts to get together as a community and collaborate. And so the conclusions, or major takeaways, are that phishing repositories are incredibly under-worked-on as a resource, college phishing repositories are also underutilized, and phishing detection

should consist of both emails and the website itself that the attackers are using to target people, and that feature extraction for harmful websites should really be talked about more, including what exactly we should be using. Thank you so much to David Reed for being a great mentor throughout this whole project, and to friends and mentors for support. I'll be putting the scraper on GitHub in the next two weeks if anyone's interested in getting a scraper to do their own phishing repository research. And that's basically it. If we have any time for questions, just feel free. [Applause]
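Editor's sketch: the URL features described in the talk (an IP address in the URL, a length of 75 characters or more, an at sign, a stray double slash, an "https" token in the host, many subdomains) can be approximated with a few lines of standard-library Python. This is only an illustration under stated assumptions, not the speaker's code or the dataset authors' exact rules: the function name `url_features`, the feature names, the IPv4-only pattern, and the dot-count cutoff for subdomains are all hypothetical choices.

```python
import re
from urllib.parse import urlsplit

# Assumption: only dotted-quad IPv4 hosts are treated as "IP address in the URL".
IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def url_features(url: str) -> dict:
    """Return boolean red flags for a URL (illustrative thresholds only)."""
    host = urlsplit(url).hostname or ""
    # Everything after the scheme separator, so the leading "//" is not counted.
    rest = url.split("://", 1)[-1]
    return {
        "ip_in_host": bool(IPV4.match(host)),
        "long_url": len(url) >= 75,            # threshold quoted in the talk
        "at_sign": "@" in url,
        "extra_double_slash": "//" in rest,
        "https_token_in_host": "https" in host.lower(),
        # Assumption: three or more dots in the host stands in for
        # "two subdomains or more"; real rules would strip www and the TLD.
        "many_subdomains": host.count(".") >= 3,
    }


if __name__ == "__main__":
    print(url_features("http://paypal.com@evil.example/login"))
```

Each flag could then be fed to a classifier as one binary column, which matches the binary categorical shape of the dataset discussed in the talk.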

Hi, can you tell me a little bit more about why you chose the models you chose?

Yeah, so the models that I chose were, can we go back so I could see? I don't remember off the top, I'm sorry. Yeah, so for the models I chose, I wanted to get an array of different types of models. I wanted to really make sure that you start off with a decision tree or just rule-based learning, because I think you want to start simple. PCA was also involved in one part, to really just start off simple. But then I figured it'd be more interesting to put more of the more

complex models as well in the research before the presentation. Obviously naive Bayes is really different than gradient boosting, which is really different than a decision tree, so I think it's just important to show that even with an array of different models you can get somewhat similar metrics with the dataset itself.

Did you consider other neural nets besides an RNN? Was that just kind of what you went to, or what were the other options that you could have done?

For the other options, you definitely could have used a recurrent convolutional neural network, and that's part of the future work that is being done currently.
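Editor's sketch: the talk quotes an accuracy and a Kappa statistic for each model. Cohen's kappa compares the observed agreement between predictions and labels with the agreement expected by chance, so it is better read as "how far above chance the classifier is" than as a share of the data. The snippet below shows the arithmetic for a binary confusion matrix; the function name and the counts are made up for illustration and are not results from the talk.

```python
def accuracy_and_kappa(tp: int, fp: int, fn: int, tn: int) -> tuple:
    """Return (accuracy, Cohen's kappa) for a 2x2 confusion matrix."""
    n = tp + fp + fn + tn
    p_observed = (tp + tn) / n
    # Chance agreement: both say "phishing" plus both say "not phishing",
    # computed from the marginal rates of predictions and true labels.
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)
    p_no = ((fn + tn) / n) * ((fp + tn) / n)
    p_expected = p_yes + p_no
    # Note: undefined (division by zero) when p_expected is exactly 1.
    kappa = (p_observed - p_expected) / (1 - p_expected)
    return p_observed, kappa


if __name__ == "__main__":
    # Hypothetical counts, chosen only to make the arithmetic easy to follow.
    acc, kappa = accuracy_and_kappa(tp=45, fp=5, fn=5, tn=45)
    print(f"accuracy={acc:.2f} kappa={kappa:.2f}")
```

With balanced classes, as in this made-up example, chance agreement is 0.5, which is why kappa sits noticeably below accuracy.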

Okay, so looking at this, one thing that I've always wondered is how do you avoid the unsubscribe links, avoid hitting those, when you're processing through these datasets?

The unsubscribe links?

Yeah, so if somebody received a legitimate email that has an unsubscribe link at the bottom, and you're assessing all of these web pages that get processed from there, how do you make sure you don't unsubscribe people from mailing lists?

Well, most mailing lists would look really different, and so it would probably be kind of negligible to the classifier itself. So if you're being unsubscribed, then a classifier would be able to see that, hey, this is different than the rest of the links on

the mailing lists.

Okay. Have you considered adding information from third-party sources as a feature, like the Google Safe Browsing API or site transparency report?

Yeah, that's currently being added to some current work right now. That's a great point, that Google has its own analytics too that could be involved.

I was going to say, for the unsubscribe links, I'm pretty sure when you click unsubscribe you usually have to input your email or something like that, so I don't think just clicking it actually unsubscribes you, so that might not be a big concern. I was also going to ask: we recently saw some new ML-based malware detection software being exploited because you could just append strings on

the end and it would read it as whitelisted software. So in what ways do you think an attacker could craft a malicious email that might be able to get around machine-learning-based phishing detection?

That's a great point, and especially timely. However, I think that when you look at news reports like that, you have to be aware that the data scientists are already aware of certain things like vanilla mimic, or the bytes appending that you're talking about. That was something that people were already aware of and should have incorporated into the model. I think that all you can do is kind of have bug bounties for

the detectors or classifiers that you're working on, and just work on improving your models as you get more feedback from people doing more adversarial work.

So where do you see tech trends moving? I mean, you're using this sort of technique here as an undergrad, and you have a number of companies coming up with ML-based detection. Where do you see attackers moving to get around these defenses and so forth?

Yeah, I think a lot of that would go into the work that's being done with convolutional neural networks. If you look into any recent adversarial machine learning type of work, convolutional neural networks are a huge part, because you can just change

certain metrics, certain distances, certain hues in the image itself, and kind of add more static to break the classifier. Also, if you put in anything that is similar to an input to a classifier but change it slightly, then you could break the classifier in that way. So I think as more companies start putting ML into their phishing detection, they're going to have to also think about how attackers are going to break it visually, and also in a textual way, which would be really simple, like, hey, we can capitalize something that we don't have to
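Editor's sketch: the textual evasion mentioned in this answer, changing capitalization so an input no longer matches what a model or rule has seen, is commonly blunted by normalizing text before feature extraction. The snippet below is a minimal illustration of that idea, assuming a simple phrase-matching feature; the helper names `normalize` and `contains_phrase` are hypothetical and not part of the speaker's tooling.

```python
import re


def normalize(text: str) -> str:
    """Case-fold and collapse whitespace so trivial edits map to one form."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def contains_phrase(text: str, phrase: str) -> bool:
    """Phrase match that ignores capitalization and spacing differences."""
    return normalize(phrase) in normalize(text)


if __name__ == "__main__":
    email_body = "Please UPDATE your   Account now"
    # A naive, case-sensitive check misses the perturbed phrase.
    print("update your account" in email_body)
    # The normalized check still finds it.
    print(contains_phrase(email_body, "update your account"))
```

Normalization only covers this one cheap perturbation; it does nothing for the image-based attacks on convolutional networks described above.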

Could you go back to the categories you had?

Yeah, of course.

Possibly, like, you had the seven categories.

Okay, it's just going slow.

Yeah. How did you come up with these categories? Did you cluster them, or did you kind of choose them based off of what you'd seen in the data?

I did it based off of what I've seen in the data. I think if you go first based on clustering and then put that into a context, then you can end up having some biases from reading clusters in the wrong way. So I really wanted to make sure I was looking into the research first and seeing what are the frequencies of the

most prevalent types of attacks, and then coming up with these categories that way.

Did you go through and label the corpora from the educational sources this way and then run your classifiers, or did you have a different way of labeling?

So the data was labeled through a classifier, but first I created these categories and then had a tag, and then I could assign those tags to the frequencies in the data.

Okay, thank you.

My question was if you tried to actually do a multiclass classification, a multiclass classifier given the different categories, so a one-versus-many kind of thing. I mean, maybe after you've detected, oh, it is

a phishing email, maybe you could run it through some of these models. Just wondering if this is something that...

Yeah, that's currently being done.

Yeah, that's super cool.

Yeah, and more of the results will be released eventually for that.

So whatever is left over after the classifier is by default there for the user to catch, and there's research showing that as these become more and more uncommon in people's email streams, people actually get worse at finding them. So in the face of very good technology that makes an attack email a very rare event, how do you protect a user who might see this so infrequently that they don't realize it's a threat, and how do you protect against the possibility that the actual number of compromised individuals doesn't change or even goes up?

Well, I think for your attack model, as this is becoming ever more important, you also need to have more education for your employees at an institution. You

said that people are going to start looking at more emails and not be able to tell the difference, but if you have constant education as a security operations team, then you should have people who are mindful, because part of being a worker is also thinking about it. But if you're at a corporation that doesn't have a security team and is much smaller, then you just have to stay vigilant and actually think, hey, is this actually something that I should be clicking on or not?

Interesting, so you train your way out of people not seeing it in the wild.

That'd be the best way to do it, probably,

if you have the resources.

And then a corollary question to that: when the classifier has identified phishing emails, what do you do? Do you tell the user, do we mark it, do we disable the URL?

That would be great if I were to put this into a different application and say, hey, this is phishing. This is just my own research, just working on the classification part of it, not really the deployment.

Hi, awesome talk by the way. I've heard that some scammers actually craft their emails poorly because someone who reads a poor email and actually clicks through will be more likely to be a victim at the end, or something like that. Have you actually looked at your emails to see if there's a difference in the distribution of really well-crafted phishing emails and really poorly crafted emails?

Yeah, there's an index that kind of encapsulates that, and most of them are pretty bad but try to be written pretty nicely. So no one really tries to just write garbage and then see if that works, but that would be great if

that happened in a higher distribution. That would make them way more fun to look at.

[Applause]
