DGA Domain Detection Using Machine Learning

Name: DGA Domain Detection Using Machine Learning
Uploaded: 2023-11-05
Duration: 5 min 46 s
Description: A Sheridan College capstone project applying supervised and unsupervised machine learning to detect domain-generating algorithm (DGA) malware. The team extracted features from domain names and trained algorithms to distinguish malicious DGA domains from benign ones, achieving ~75% accuracy on the su

BSides Toronto · 2023 · 5:46 · 61 views · Published 2023-11

Speakers

Jackson Gorny

Tags

Category: Technical

Topic: Malware Analysis

Style: Talk

About this talk

A Sheridan College capstone project applying supervised and unsupervised machine learning to detect domain-generating algorithm (DGA) malware. The team extracted features from domain names and trained algorithms to distinguish malicious DGA domains from benign ones, achieving ~75% accuracy on the supervised model and discovering interesting clustering patterns in unsupervised analysis.

Original YouTube description

This lightning talk was delivered on October 21, 2023 at the BSides Toronto 2023 conference, held at Toronto Metropolitan University's main lecture hall in the George Vari Engineering and Computing Centre. All lightning talk speakers were volunteers who stepped up on the day with minimal preparation time.

Transcript [en]

Hello everybody, my name is Jackson. Like Pablo earlier, I am a recent Sheridan graduate. As part of our final year of classes we have to do a capstone project, basically to take everything we've learned and do something that we think is cool and that can look good for the school, I guess. Most of my capstone group is actually here, so thanks guys, I'm taking the credit.

When we were deciding what we wanted to do for our capstone, which is an eight-month project, we wanted to find something that we would like or that would be valuable for us in our future. Originally we were having some trouble. At the time we were doing a malware analysis class, so we thought maybe we could take some of these concepts and turn them into our own tool. We thought, let's make a malware analysis multi-tool: let's take a bunch of the tools that we're using and put them all into one, so it'll do it all automatically. But that was too much.

So we did some more research, looked into some more concepts, and we found out about DGA malware. If anyone doesn't know what DGA malware is, DGA stands for domain generating algorithm. The idea is that both the attacker and the malware on the person's computer have the same randomization algorithm and the same seeds, so it'll create the same string on both the attacker's side and the infected computer. This is used for command and control domains, because if these domains are generated right away they won't be on any blacklists. They may look a little suspicious because there's no information about them, but they won't be blocked right away. And the thing is, these algorithms create thousands of domains in one go, so if you block one they just move on to the next and keep changing what they're hosting, so it can keep sending them information, or they can send information back.

We thought that this was pretty scary: it's hard to blacklist, and if you can't detect it right away it could be in there for a little while. You don't necessarily know what kind of randomization algorithm they're using. Generally there's some sort of pattern in them, but it's hard to tell.

So we decided to create a machine learning algorithm that would hopefully help you figure out what is a DGA domain and what is a regular domain, partly because we thought it was a good idea and partly because we wanted to learn more about machine learning. We ended up creating two algorithms, a supervised algorithm and an unsupervised algorithm. We found a bunch of DGA domains and we took certain features from them. Basically we turned the alphanumeric name into a bunch of stats: length, number of vowels, number of consonants, some n-grams, and some other things. We found some data sets from past studies that looked at DGA domains, and we used some publicly available benign domain lists.

For the supervised model, we figured out what algorithm we were going to use and we trained it on our data set. For the unsupervised model, which was more the part I worked on, we basically used graphing algorithms, I'd say, to graph all of the data based on the features and put it all into this big plot. Then the algorithm would put them into groups, so you'd see a splotch of red, a splotch of blue, and those would be groups. Our thinking was that if we use the right features and the right algorithms, perhaps we'll find specific patterns, so some groups might be all DGA and some might be all benign. But it didn't quite work like that.

When we finished the project we found it had an all right accuracy rate. I think we ended up around 75% for the supervised one. We found this interesting issue with the unsupervised one where, if we had four groups, two of them would be almost fully accurate, where one was mostly DGA and one was mostly benign, but then the next two would be pretty much 50/50. So if you have a domain and you put it in this algorithm, you don't want it to be one of the 50/50 ones; that just doesn't really help. But what we figured was, if you can put this into a set with a lot of other domains and you know that a certain group is 99% DGA, odds are you might want to look into that domain if it gets put there. And you might be able to use that algorithm to help you figure out certain patterns in the domains: if all of them are put in the same group, you might see a certain pattern and might want to look into those.

It was a very interesting thing to study. I didn't really know anything about DGA or even much about machine learning going into it. I just think it's important to go outside your comfort zone and learn new things, and that's partly what I'm doing here, I guess. Thanks everybody, and thanks to my group.
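
As an illustration of the shared-seed idea described in the talk, here is a minimal Python sketch of how two parties holding the same seed can derive the same domain list independently. It is not any real malware family's algorithm and not the team's code: the function name `generate_domains`, the use of SHA-256, the date input, and the reserved `.example` suffix are all assumptions made for the example.

```python
import hashlib
from datetime import date


def generate_domains(seed, day, count=5, tld=".example"):
    """Derive a deterministic list of domain names from a shared seed and a date.

    Hypothetical scheme for illustration only: anyone who knows `seed`
    and `day` computes exactly the same list.
    """
    domains = []
    for i in range(count):
        material = f"{seed}|{day.isoformat()}|{i}".encode()
        digest = hashlib.sha256(material).hexdigest()
        # Map each of the first 12 hex digits (0..15) to a letter a..p.
        label = "".join(chr(ord("a") + int(ch, 16)) for ch in digest[:12])
        domains.append(label + tld)
    return domains


# Both sides run the same function with the same inputs and agree on the
# list without ever exchanging the domain names themselves.
attacker_side = generate_domains("shared-secret", date(2023, 10, 21))
infected_side = generate_domains("shared-secret", date(2023, 10, 21))
print(attacker_side == infected_side)
```

Because the list can be regenerated at will with a larger `count`, blocking individual names does little, which is the blacklisting problem the talk points to.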
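
The talk mentions turning each domain name into stats such as length, vowel count, consonant count, and n-grams. A rough Python sketch of that kind of feature extraction might look like the following. The exact feature set, the choice of bigrams, and the names `extract_features` and `char_ngrams` are assumptions, since the talk does not give the team's actual definitions.

```python
import string

VOWELS = set("aeiou")
LETTERS = set(string.ascii_lowercase)


def char_ngrams(label, n=2):
    """Return the overlapping character n-grams of a label."""
    return [label[i:i + n] for i in range(len(label) - n + 1)]


def extract_features(domain):
    """Turn a domain name into simple lexical statistics.

    Simplification (assumption): only the leftmost label is analysed,
    so "google.com" is treated as "google".
    """
    label = domain.lower().split(".")[0]
    letters = [c for c in label if c in LETTERS]
    vowels = sum(1 for c in letters if c in VOWELS)
    bigrams = char_ngrams(label, 2)
    return {
        "length": len(label),
        "vowels": vowels,
        "consonants": len(letters) - vowels,
        "distinct_bigrams": len(set(bigrams)),
    }


print(extract_features("google.com"))
print(extract_features("xkqzjw7.net"))
```

Feature dictionaries like these can be converted to numeric rows and fed to either a supervised classifier or a clustering algorithm, which are the two approaches the talk describes.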
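
The observation about some groups being nearly pure and others being 50/50 can be checked by measuring the share of known DGA domains in each cluster. The Python sketch below assumes cluster assignments are already available from some clustering step; the helper names, the 0.9 threshold, and the toy data are all made up for illustration and are not the team's results.

```python
from collections import defaultdict


def cluster_dga_share(cluster_ids, is_dga):
    """Return the fraction of known-DGA domains in each cluster."""
    totals = defaultdict(int)
    dga_counts = defaultdict(int)
    for cid, flag in zip(cluster_ids, is_dga):
        totals[cid] += 1
        dga_counts[cid] += int(flag)
    return {cid: dga_counts[cid] / totals[cid] for cid in totals}


def triage(shares, threshold=0.9):
    """Label each cluster by how lopsided its DGA share is (threshold is an assumption)."""
    labels = {}
    for cid, share in shares.items():
        if share >= threshold:
            labels[cid] = "mostly DGA"
        elif share <= 1 - threshold:
            labels[cid] = "mostly benign"
        else:
            labels[cid] = "mixed"
    return labels


# Toy data: four clusters, two pure and two split evenly.
clusters = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3]
known_dga = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1]

shares = cluster_dga_share(clusters, known_dga)
print(shares)          # {0: 1.0, 1: 0.0, 2: 0.5, 3: 0.5}
print(triage(shares))  # clusters 0 and 1 are lopsided, 2 and 3 are mixed
```

A new domain that lands in a "mostly DGA" cluster is worth a closer look, while one that lands in a "mixed" cluster tells you little, which matches the limitation described in the talk.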
