Using Machine Learning To Detect Phishing Emails by Anthony Saich

BSides Cheltenham · 15:45 · 2.7K views · Published 2023-06

Wow, thank you all for coming. When I told my parents I was going to talk today they were, of course, proud of me, but they laughed, asking who would want to listen to me talk about machine learning for half an hour. People showed up, so I don't care, I'm happy. This is the first talk I've ever given, so please be nice: don't heckle me, don't shout at me. So, a bit about me. My name is Anthony. I'm a recent graduate; I finished university last year with a degree in cyber and computer security. After graduating I started as a tier-one SOC analyst, where I stop

hackers from attacking our systems, or in this case our clients' systems. We work in the banking sector, healthcare and manufacturing. After that I pivoted into a shift lead / senior role, so now I'm in charge of a team of about five lads on a shift. Whenever they have any issues they can talk to me and I try to fix it; if not, I give it to somebody else, and no thanks for me. So, back onto this talk. At university I needed to complete a dissertation. My friends were doing dissertations on cyber security in healthcare and cyber security in education. I thought machine learning would be a good area to do because it was interesting. So

machine learning is quite an interesting topic for me, and so is phishing, because phishing is such a big deal. However, I had no experience in machine learning, so I was reading the dummies' guide to machine learning in my lectures, which my lecturer thought was quite funny. When I was talking to him about it he said, "Anthony, are you sure it's a good idea?" It worked out, so I can't complain. A quick catch-up: what is phishing? Phishing attacks are not a new cyber attack vector. The earliest known phishing attack was in the 1990s, a good 12 years before I was born, just to age you here. It was first used on AOL, where attackers would pose as AOL employees

and trick victims into handing over their login credentials, which they would then use to log into the account and steal credit card information. However, with the increase in instant communication such as email, text messages and, you know, Amazon, Instagram or Facebook, phishing has moved on from that more personalized attack, and that is what I'm trying to stop with this machine learning. Why is phishing such a big deal? An average of 3.4 billion phishing emails are sent every day. This is a conservative estimate, so it could be four or five billion. Phishing campaigns are incredibly easy to create: at work we can make a good, convincing phishing campaign in about 20 to 30 minutes,

and we send it out to our clients to test their phishing responses. Probably about half of users will click on the link, and we pop up a message saying "you've been successfully phished", and this is, you know, in a good way. As I said, with 3.4 billion phishing emails sent a day, if one percent gets through that's about 34 million emails successfully phishing someone. Lists of email addresses can easily be found online, so hackers just crawl through datasets, find emails and send the campaign out. So what are the current methods used for phishing detection? There are about four main types: machine learning based, which is what we're covering today; content based, which looks at

heuristics; word based, which looks at keywords within phishing emails; and list based, the more traditional blacklists, real-time blacklists and whitelists, as well as DNS lookups, which look at where the domain names come from. But, as I said earlier, this talk is looking at the machine-learning-based ones. So, traditional methods. IP blacklisting is the process of collecting IP addresses in a list and prohibiting any email communication attempts that come from those IPs. This is a very old method, used for 20 years. It's only effective to a degree, because to realize that an IP address is bad, someone has to be hacked first and it has to be detected, so someone has to actually get phished to first

understand it's a bad IP address. On average it takes about 12 hours for a phishing IP to be identified, and this leaves users and organizations vulnerable to cyber attacks. NLP. NLP is natural language processing, a popular way to stop phishing emails. This is when you actually look at keywords within the phishing text itself. For example, we're looking at action verbs that influence users into clicking, such as "important", "click here", "click now". This is a really good method, however its false positive rate is quite high. For example, when a bank emails you saying to change your password, it might be treated as phishing and sent to the spam box, so it doesn't reach the user

when it's important to change a password if a bank says so. OCR. OCR, optical character recognition, is typically a business solution, used for scanning documents and files onto an online system. However, it can also be used for phishing detection, to actually read the words out and put them in a list. It's not the best method of detecting phishing, though, due to the fact there are a lot of false positives. So in my dissertation I thought: well, these are all old methods, what are the current, up-to-date methods? So I looked at three machine learning algorithms to try and detect these

phishing emails. I used the Naive Bayes algorithm, the decision tree algorithm and the random forest algorithm. This is where it gets a bit more technical, so please bear with me. Naive Bayes classification is a machine learning algorithm based on Bayesian probability. Naive Bayes works off assumptions. For example, at a basic level, an apple is recognised as an apple, a fruit, because it fits the criteria: being red or green, being round or an oval shape. This can be carried over to machine learning: if you have a machine learning dataset of phishing emails, the model knows an email is phishing due to its internal data, due to its

heuristics inside it. Naive Bayes is a simple machine learning algorithm, easy to implement compared to others, and the amount of training time needed is quite small. With machine learning, training time is so important: if your model takes five days to train compared to two days, that's three extra days, which is, you know, critical. Decision tree. The decision tree algorithm is a tree-based technique in which the path begins at the root. It starts at the top with a root node, which splits down to decision nodes, which go down to terminal nodes. It all splits down to a

decision tree node, and once it gets down to the terminal nodes it can't split any more. The decision tree splits the data down, so the data sort of flows down like a tree. Random forest. Random forest, also known as RF, shares many likenesses with decision tree classification. However, a random forest is made of many individual decision trees, so it gives more variability and there is more data to run through. Each decision tree makes its own prediction, and the most common prediction becomes the random forest's prediction. The strengths and weaknesses of each of these algorithms: Naive Bayes performs well with small datasets and is quite fast. However, it relies on the assumption of equally important and independent features, based off biased

probabilities, so if you have quite a biased dataset, that can really throw off the machine learning algorithm. It has an accuracy of about 95%. The decision tree needs less data preparation during pre-processing, and the algorithm can scale, so it's easy to add more data. However, it takes a long time to train this model; that's its weakness. It has an accuracy of about 85%. Random forest also needs less data preparation during pre-processing, however it uses a lot of computing power to actually run the algorithm, with an accuracy of about 94%. So it's time for more technical stuff now: the programming. I used the Python programming language. Don't judge me, because it was the most useful language

I knew. I chose it because it had a large amount of resources online, and GitHub was always there, which is always a good thing to have. Machine learning libraries were also available pre-made, which was good for me. Mainly, though, it's because I have the most experience with it, so I chose the one I was most comfortable with. Machine learning is based on data models and datasets, and datasets are really important to have. They're really hard to find, because a lot of researchers like to hide their datasets and don't like to share them. So unfortunately I had to use quite an old dataset, from about 10 years

ago. For the phishing emails I used the Nazario phishing corpus, which had, I think, about four and a half thousand phishing emails in it. For real emails to compare against, I used the Enron dataset, which you probably know from the Enron financial leak back in the early 2000s. However, before any machine learning took place, I first had to clean the data. The chosen datasets were not suitable for machine learning as they were. This is because the Nazario dataset was actually in an mbox file. Mbox files are archived email files, so it wasn't ready to be fed into my model,

and also all the emails came through as HTML, so I had to strip out that HTML afterwards. I also had to normalize and tokenize the datasets because, you know, having apostrophes, commas and that sort of thing could mess up my data, so I had to tokenize it and make everything standard. I could have done that manually, but I wrote a small Python script to go through the whole dataset and clean it. If this video will work... So, as you can tell, we have the raw HTML of the email here, and this is the

outcome. It stripped out all the HTML and made it all nice and pretty, ready for my algorithm to work better. This took about 20 minutes to run for 5,000 emails, so my laptop was screaming at the time; the fans were going mental.
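The cleaning step described above (read the mbox archive, strip the HTML, normalize and tokenize) could be sketched roughly as follows. This is not the script shown in the talk, which is not reproduced in the transcript; it is a minimal standard-library sketch, and the function names `iter_mbox_bodies`, `strip_html`, `tokenize` and `clean_email` are hypothetical.

```python
import mailbox
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        return " ".join(self._chunks)


def iter_mbox_bodies(path):
    """Yield the decoded text parts of every message in an mbox archive."""
    for msg in mailbox.mbox(path, create=False):
        for part in msg.walk():
            if part.get_content_maintype() != "text":
                continue  # skip attachments such as zip files
            payload = part.get_payload(decode=True)
            if not payload:
                continue
            charset = part.get_content_charset() or "utf-8"
            try:
                yield payload.decode(charset, errors="replace")
            except LookupError:  # unknown charset name in the header
                yield payload.decode("utf-8", errors="replace")


def strip_html(raw):
    """Remove tags and return only the text a reader would see."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return parser.text()


def tokenize(text):
    """Lower-case, drop apostrophes, and split on anything non-alphanumeric."""
    text = text.lower().replace("'", "").replace("\u2019", "")
    return re.findall(r"[a-z0-9]+", text)


def clean_email(raw_html):
    """Full pipeline for one email body: HTML out, normalized tokens back."""
    return " ".join(tokenize(strip_html(raw_html)))


sample = "<p>Don't wait, <b>CLICK HERE</b> now!</p><style>p{color:red}</style>"
print(clean_email(sample))  # dont wait click here now
```

Dropping punctuation entirely is one assumed normalization choice; a real pipeline might instead keep URLs or other features that are useful signals for phishing.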

However, some of the phishing emails had to be removed because they contained zip files or very weird formatting, so I had to take those out. To classify which emails were phishing and which were non-phishing, I used the ham and spam classification method. The ham and spam classification method is a standard process for labelling phishing and non-phishing emails, with ham being non-phishing and spam being phishing. In my model I put ham as zero and spam as one. To actually measure the success of my algorithms, I used four metrics: a time metric, accuracy, recall and precision. The time metric is pretty straightforward: it's just how long it takes for

the machine learning model to run. I ran each algorithm 10 times and took the average time as my end result. Accuracy measures how accurate the model is, by showing the number of correct predictions based off the ham and spam classification. Recall measures how many of the positive samples are correctly classified as positive. And lastly there is precision. Precision measures how many of the samples classed as positive really are positive and not negative. In my research, precision is about classing ham as ham and not spam as ham, to make sure that no real emails go through as fake and no fake emails come through as real. So, the accuracy scores. Naive

Bayes achieved an 83% accuracy score, the decision tree scored 89.8% accuracy and the random forest achieved 88.8%, over 10 runs, and I took an average of those scores. As you can see from the time measurements, Naive Bayes took a whole two and a half seconds to complete. When I first read the timings I thought it was an error, so I ran it again and again and again, but it was right. The decision tree took a bit longer, with two and a half seconds, and the random forest took the longest, at 14.5 seconds to run. As we know, speed is key in some cases, so Naive Bayes obviously

won here. With recall, Naive Bayes achieved an 83.3% rating, the decision tree achieved 98.8% and the random forest achieved 88.7%. Precision scores: Naive Bayes achieved an 83% score, the decision tree scored 89.6% and the random forest achieved 87.9%. So which machine learning model is best for detecting phishing emails? As you can tell, Naive Bayes got the best time. However, time isn't everything: it scored some of the lowest on accuracy, recall and precision. The decision tree scored the best on accuracy, recall and precision, and it scored okay on time. Oh, and the random forest came last, and

that's it, I think. But we've got to think of the larger picture here. You know, timing is key, it is important, but actually having better recall, precision and accuracy is probably more important. So overall, in my research, the decision tree was the best at detecting phishing emails with my data. There are a few things that could have gone better. For example, more data was needed. Like I said earlier, researchers aren't the best at sharing data, unfortunately, and these datasets need to be kept up to date, so 5,000 emails wasn't the best. Maybe if I'd had ten, twenty or thirty thousand emails I would have got better results.
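The evaluation described in the talk (ham as 0, spam as 1; accuracy, recall and precision; run time averaged over 10 runs) could be sketched along these lines. The tiny Naive Bayes class and the four-email training set are illustrative stand-ins only, not the dissertation's models or data, and the names `TinyNaiveBayes`, `scores` and `average_runtime` are hypothetical.

```python
import math
import time
from collections import Counter

HAM, SPAM = 0, 1  # label encoding used in the talk


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.split())
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, docs):
        return [self._predict_one(doc.split()) for doc in docs]

    def _predict_one(self, tokens):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for c in self.classes:
            # log prior + log likelihood of each word, treated as independent
            score = math.log(self.doc_counts[c] / total_docs)
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            for tok in tokens:
                if tok in self.vocab:  # ignore words never seen in training
                    score += math.log((self.word_counts[c][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = c, score
        return best_label


def scores(y_true, y_pred, positive=SPAM):
    """Accuracy, recall and precision with spam as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


def average_runtime(fn, runs=10):
    """Mean wall-clock seconds per call over `runs` calls."""
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs


train_docs = [
    "click here now verify your account",
    "urgent click now to confirm password",
    "meeting notes attached for monday",
    "lunch on friday with the team",
]
train_labels = [SPAM, SPAM, HAM, HAM]
test_docs = ["click here to verify password", "notes from the team meeting"]
test_labels = [SPAM, HAM]

model = TinyNaiveBayes().fit(train_docs, train_labels)
print(scores(test_labels, model.predict(test_docs)))
# {'accuracy': 1.0, 'recall': 1.0, 'precision': 1.0}
print(average_runtime(lambda: TinyNaiveBayes().fit(train_docs, train_labels)))
```

The perfect scores here only reflect how small and clean the toy data is. Running the same `scores` and `average_runtime` helpers against a decision tree and a random forest trained on the same split would give the kind of side-by-side comparison reported in the talk.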

Also, the real emails could have been a bit better, because they were a bit old and I couldn't find an up-to-date legitimate email dataset. But what does the future hold? Phishing is going to remain a major issue, and there's no single fix. As I said earlier, the current solutions can detect a lot of phishing emails, but based off the numbers from earlier that still leaves at least 34 million successful emails being sent to real accounts every day. Even if we incorporate spam filtering and machine-learning-based phishing detection systems, there are still millions of emails coming through every day. So, as I'm working in the blue team sector, we should be

focused on user training and user awareness to stop phishing attacks because, as we all know, humans are the weakest link in an organization. It only needs one person to click a link and that can spread through a whole network. And that brings me to the end of my talk. I'm not great at the ending bit of talking, so give me a shout afterwards and ask any questions you like. Thank you again for coming, and I hope you have a good rest of your day. Thank you. [Applause]