
Acceptable Use of Internet; Categorizing The Web at Scale

BSides Budapest · 2022 · 25:01 · Published 2023-06 · Watch on YouTube ↗
Speakers: Tamás Vörös
Category: Technical
Style: Talk
About this talk
This presentation was held at the #BSidesBUD2022 IT security conference on 26th May 2022. Tamás Vörös - Acceptable Use of Internet; Categorizing The Web at Scale. Because of the prevalence of watering-hole attacks, drive-by downloads, and browser exploits, the security of an organization is partly a function of the kinds of content its employees browse. The content of those websites ranges widely, from pages that enable social networking to sites that engage in sharing protected intellectual property. To help organizations profile the risk of their employees' internet usage, we have developed a neural network approach to web content classification in support of security goals. Introducing web control might prevent employees from accessing pages that have inappropriate content, carry a risk of legal liability, or simply have a negative impact on productivity. Here we demonstrate that we can effectively expand the coverage of a blocklist for 80 distinct categories by building a machine learning model, using only the URL as input, that can accurately predict the category of previously unseen websites. https://bsidesbud.com All rights reserved. #BSidesBUD2022 #ITSecurity #TamasVoros
Transcript [en]

Okay, ladies and gentlemen, we continue in the afternoon with Vörös Tamás on the acceptable use of the internet and categorizing the web at scale. Thank you. Hey everyone, thank you very much for attending our talk, Acceptable Use of the Internet: Categorizing the Web at Scale. My name is Tamás Vörös and I'm a data scientist with Sophos AI. First let's go through the agenda of the talk. It has four main components. First we will go through the motivation, why we are doing this project within Sophos AI. In a nutshell, we think that the security profile of an organization is heavily dependent on the content its employees browse, so we find it really important to understand that

content to the best of our capabilities. To do this, we propose a machine learning model for URL classification for downstream systems, so this will be a defensive application of ML. We propose to use ML because the internet is huge and highly volatile, so it would be an infeasible task for a human to do. Also, even though the URL is not tightly related to the content the page is serving, it still holds valuable signal, and it's a very lightweight artifact that can be used more easily than an HTML file or a portable executable file. Then we will go through the model that we are proposing to solve this problem, and we will go through the architecture

of the model and the intuition for why we propose to use it. Finally we will go through the results in terms of accuracy and coverage, and we will hand-inspect some of the samples. Let's look at the problem more closely. Let's say I'm a sysadmin at my organization, and I go in on Monday and see three URLs in the telemetry of my organization, so I essentially realize what my colleagues looked up. On the x-axis we have the URLs that my colleagues looked up, and on the y-axis we have my happiness level about those URLs. When I see the first URL, it's a security playbook PDF from microsoft.com, so that's fantastic, kudos to my

colleague, I'm over the top happy about that. The second one is a less known domain, a tricks-for-winning-the-lottery PDF. I'm not that happy; I guess I'm not judging, but whatever. And finally I can see in the logs that someone looked at Pirate Bay and is downloading, or trying to download, that PDF. Now Pirate Bay is a torrent site, so that's intellectual property theft at best, or maybe that's not even a PDF and that's how I get ransomware. So what will I do? I will simply block that URL in my firewall or my UTM. Now the next day I go in again and I see the winning-the-lottery PDF, so

I guess those tricks might not work. But more importantly, I can see that the same PDF is still being downloaded from Pirate Bay, except from a slightly different URL. So this is an issue. That's it, I've had it, I block the whole domain, thepiratebay. The next day I go in again and I can see that it is being downloaded from a different torrent site. So it looks like we have a bit of a cat-and-mouse game on our hands, and the question that we are trying to answer is: is this a game that we can win, or is this a game that we should be playing at all? So let's do one more exercise.

Here we have a plot with domains binned by lookup count on the x-axis, and on the y-axis the lookups covered in telemetry. Let's unfold what this means. This is a bar plot. The very first bar is produced by taking a sample from our customer telemetry, taking all the lookups from that telemetry, grouping the lookups by domain, and taking the first 100 domains by this kind of popularity. So if we sum up all the lookups from the top 100 most popular domains, and if I have some kind of information about those domains and I block them in the

block list in the UTM, then that's really good news: I've covered a lot, really a lot, of the telemetry of my customer base. But if I do this exercise for the next bin, the next bar, for the same amount of work, hand-labeling 100 domains or URLs, I get a significantly lower return on investment, and we can see there is an exponential decay. So at some point I will simply run out of time or patience and give this process up, because it's simply not worth it to hand-label stuff. So let's say I give up on labeling or blocking stuff in my firewall after the first, I believe these are,

40 bars, so that's the 4,000 most popular domains. And the question is, will I be surprised the next day I go into the office, will I see any other funky URLs passing this kind of block list, this approach? The answer is probably yes, because if I take the rest of the telemetry and sum it up as the last column there, it turns out that the long tail of the URL lookup distribution is really long. So it's simply infeasible to take the most popular domains and just assign some kind of label or knowledge to them. How does this exercise translate to our actual coverage?
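A minimal pandas sketch of the binning exercise just described, assuming a telemetry sample with one row per lookup and a "domain" column; the names and toy data are illustrative, not the actual Sophos pipeline:

```python
import numpy as np
import pandas as pd

# Hypothetical telemetry sample: one row per URL lookup, domain already extracted.
lookups = pd.DataFrame({"domain": ["microsoft.com", "google.com", "thepiratebay.org"] * 5})

# Count lookups per domain, most popular first.
counts = lookups["domain"].value_counts()

# Bin domains into groups of 100 by popularity rank and compute
# the fraction of all lookups each bin covers.
bin_size = 100
bin_ids = np.arange(len(counts)) // bin_size
coverage = counts.groupby(bin_ids).sum() / counts.sum()

print(coverage.head(10))  # the first bins cover most lookups; returns decay quickly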

This is a plot that you will be seeing more of after this, and on the x-axis it has time. For every day, this plot shows how many unique URLs were looked up by our customers in this 100k sample. The solid orange color means that we had some kind of information about those URLs, and the dashed one means that we have no clue what the customers are browsing, which is clearly not good, so that's a blind spot. This is not necessarily a one-to-one mapping to the top-N domains, but it's not a bad strategy after all. So what do we mean by labels? Internally

at Sophos we track 80 labels; I show nine examples of those 80 labels here. These labels, or URLs belonging to these classes, could have an impact on the organization along multiple dimensions or aspects. One important one is the security aspect. There are trusted update sites like microsoft.com, or search engines like Google, or Stack Overflow, which are sites you are probably not going to get infected from, or are less likely to. A bit to the right of that there are the social networking sites; you can argue how risky they are, but it's more likely that you will get infected from a Facebook link or a Dropbox link than

from a trusted update site. Then on the far right we have the extreme categories, which are pornography or drugs; those are simply infested with malicious content, so you might want to pay more attention to those URLs. And then there is the other aspect of all this, which we call the negative productivity impact, or negative legal or ethical impact. The sites on the left have no negative productivity impact, because we need those to do our everyday job, but then it's up to an organization whether it wants to allow its employees to browse social networking sites or dating sites during work hours, and for

sure no company wants to allow employees to browse pornography or buy illegal drugs during work hours. So this is sort of the motivation for why we are tracking these classes. Just to recap the motivation part: what we are trying to do is use machine learning, in contrast to the standard approach, which is having analysts hand-label a few sites that cover a good chunk of the telemetry but not all of it, because that leaves a lot of unknown data.

We want to replace this manual labeling approach because it's really slow, it's expensive manual work, and it doesn't scale at all. We want to use ML instead, specifically to provide extra labels for the long tail that manual labelers couldn't cover. This project has additional side effects: it reduces the time to a label for new sites, I mean whenever you deploy the model it will be instantly there, so you don't have to wait for the whole pipeline to reach the analysts, and it could also highlight conflicting labels, but that's for another day. Generally, if you use ML, you would want to split your data into two or three groups, the first group being the training data

that will be used to train your machine learning model, and the other one or two being the test and validation data where you will evaluate how well your model was trained. We use a so-called time split, with which we want to emulate the deployment scenario, meaning that we pick a point in time, roughly the end of January for this scenario, and we take everything before that time as training data and everything after it as test data. We use 200 million URLs as our training data, and one detail to highlight is that there are no duplicates between the training and test data, which is still a realistic scenario, because people can still look up URLs from the past, but we

are not really interested in how well the model can memorize URLs; we are interested in how well it can generalize to new URLs. So domain revisits are allowed, but not URL revisits.
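A minimal sketch of this kind of time split with URL-level deduplication, assuming a DataFrame with "url" and "timestamp" columns; the column names, cutoff date, and toy rows are assumptions for illustration, not the actual pipeline:

```python
import pandas as pd

# Hypothetical lookup telemetry: a URL and the time it was seen.
df = pd.DataFrame({
    "url": ["https://microsoft.com/playbook.pdf", "https://example.org/a", "https://example.org/a"],
    "timestamp": pd.to_datetime(["2022-01-10", "2022-01-15", "2022-02-03"]),
})

cutoff = pd.Timestamp("2022-01-31")  # roughly "end of January", as in the talk
train = df[df["timestamp"] <= cutoff]
test = df[df["timestamp"] > cutoff]

# Drop exact URL duplicates from the test set so the model is scored on
# generalization to unseen URLs rather than memorization; revisited domains stay.
test = test[~test["url"].isin(set(train["url"]))]
```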

So now that we have our data and our labels, the only thing we need is our model. We propose to use a 1D convolutional neural network for this task. On the left we have the architecture of the model; it has fundamentally three main blocks: the character embedding, then the feature detection part, and finally the classification. Just a few quick words about it. The input to the model is a URL as a string, but ML operates over numbers, not strings, so we need to convert that string into a numeric representation. We choose a character-level embedding to do this. It's an existing concept, there

are many ways to do this, and we choose to go with that one. What it does is take every character of the input URL and map it to a numeric vector, in such a way that characters that occur in a replaceable fashion, like the digits in a URL, where changing a number does not necessarily change the meaning of the URL, so characters with similar roles, are mapped to similar numeric representations. That's probably enough about the input part, but it makes the model more robust to obfuscation attempts, if you just want to type more

and more numbers at the end of the URL. Then the most interesting, key part of the model is the feature detection part, which consists of 1D convolutional layers. What a convolutional layer does, on a high level, is operate a sliding window over the input; strings are fundamentally one-dimensional. There is an example on the right for that: we have casinobethlehem.net as the input URL, it is converted into its numeric representation by the embedding layer, and then we pick a convolutional window that we slide over this string. So essentially if we set k equals three, a three-length convolutional window, it starts from

the beginning and slides over the URL with a step size of one, so it will capture the numeric representation of 'cas', 'asi', and so on; essentially it will capture all the three-length substrings of the URL. Now we do this with window lengths two, three, four, and five, so for example with window length four it will capture every four-length substring of the URL. With length three it will capture subwords like 'bet', and the longer windows capture longer subwords. These are important because these are the very words or subwords that drive the human eye to say, hey, this is a gambling site; so among all the other substrings these will be captured too.
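To make the architecture concrete, here is a minimal PyTorch sketch of a character-level 1D CNN with parallel window lengths two to five; the layer sizes, vocabulary, pooling, and padding choices are assumptions for illustration, not the exact Sophos model:

```python
import torch
import torch.nn as nn

class UrlCnn(nn.Module):
    """Character embedding -> parallel 1D conv windows (k=2..5) -> classifier."""
    def __init__(self, vocab_size=128, embed_dim=32, n_filters=64, n_classes=80, max_len=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # One conv layer per window length, each capturing all substrings of that length.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, n_filters, kernel_size=k) for k in (2, 3, 4, 5)]
        )
        self.classifier = nn.Linear(n_filters * 4, n_classes)

    def forward(self, char_ids):                       # char_ids: (batch, max_len) int codes
        x = self.embed(char_ids).transpose(1, 2)       # (batch, embed_dim, max_len)
        # Max-pool each feature map over the URL length, then concatenate.
        feats = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.classifier(torch.cat(feats, dim=1))  # (batch, n_classes) logits

# Toy usage: encode a URL as character codes, pad to max_len, run a forward pass.
url = "casinobethlehem.net"
ids = torch.tensor([[min(ord(c), 127) for c in url] + [0] * (256 - len(url))])
logits = UrlCnn()(ids)
```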

All these features are then fed into the classification layer, which assigns weights to these specific features in a way that gives the best possible probability or accuracy at the end. So how do we evaluate such a model? There are many ways to do that; I'm showing one, which is called a ROC curve. A ROC curve shows the trade-off between the false positive rate and the true positive rate. This plot has three curves on it. For example, if we look at the orange line, there is a dot at a false positive rate of 10^-3. What that means is that for every one in a thousand URLs that are not in the

healthcare category, we will make a mistake and say it's a healthcare site, which is not good, but it is a price that we have to pay to get a decent true positive rate, roughly at the 90 percent mark, which is a pretty good result. We are doing well with the pornography category as well; it turns out these URLs often have something really explicit, even in the URL itself, about what they will be hosting. And of course there are more generic categories, like personal cloud apps, where it's hard to decide what they are about, but it's still an okay result.
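The kind of per-category ROC reading described above can be reproduced with scikit-learn; a minimal sketch with made-up one-vs-rest scores (not the actual evaluation data or code):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical one-vs-rest evaluation for a single category, e.g. "healthcare":
# y_true is 1 when the URL truly belongs to the category, y_score is the model's score.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=10_000)
y_score = np.clip(y_true * 0.7 + rng.normal(0.3, 0.2, size=10_000), 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Read off the true positive rate at a 1-in-1000 false positive rate, as in the talk.
idx = np.searchsorted(fpr, 1e-3)
print(f"TPR at FPR=1e-3: {tpr[idx]:.2f}")
```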

But how does this translate into the actual deployment? We had this plot before, with the covered and uncovered data. First let's see how this translates to the labeled part. It turns out that if we apply our model to the labeled part, we get the corresponding model predictions and we can evaluate our model: the solid blue is the URLs that we got correct, and the lighter ones are the ones we got incorrect, which is our error rate. But this is not where we would expect this model to gain value from, because this is something that we already knew. So let's see how we do on the unknown part, and it turns out we can

roughly halve the amount of unknown data that we had in our telemetry before, just by deploying this model and having it pick up specific subwords. Looking at the ratio with which the model got the labeled URLs right, it gives some hope that the pink part, the additional gain from this model, will also have good accuracy, even though it's a harder problem.

These deep learning models are fundamentally black boxes, so as of now we don't really know why they make the decisions that they make. There are multiple post-hoc tools, used after the model training, to explain these decisions; one is called LIME, and we are showing results from it here. On this plot we have two sets of URLs: one is the high-scoring previously unknown examples, so that's the net gain from this model, and the other is the high-scoring missed labels, which is the failure mode of this model. First let's go with the high-scoring unknown examples. The first URL contains 'movie' and

'series', and it was predicted by the model to be an intellectual piracy site. Now if we feed this URL into LIME with our model and its predictions, what LIME does at a high level is split the input into tokens, permute the tokens, and assign a probability to each token, like how much that specific token contributes to a specific class. The deeper the color on that right-hand picture, the higher the token's contribution to the class intellectual piracy. The highest contribution is the 'movie' token, which I would say is a fair token to pick up as a marker of an intellectual piracy site. The 'series' token is also highlighted, so it seems like this convolutional concept is working out.
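The workflow described here roughly corresponds to the lime package's text explainer; a hedged sketch in which the wrapper function, category names, token-splitting pattern, and example URL are all illustrative assumptions rather than the talk's actual setup:

```python
import numpy as np
from lime.lime_text import LimeTextExplainer

# Toy stand-in for the URL model: takes a list of URL strings and returns
# class probabilities of shape (n_samples, n_classes), as LIME expects.
def predict_proba(urls):
    scores = np.full((len(urls), 80), 1.0 / 80)
    for i, u in enumerate(urls):
        if "movie" in u or "series" in u:      # bump the piracy-flavored class
            scores[i, 0] += 1.0
    return scores / scores.sum(axis=1, keepdims=True)

class_names = ["intellectual_piracy"] + [f"category_{i}" for i in range(1, 80)]
explainer = LimeTextExplainer(class_names=class_names, split_expression=r"[./\-_]")

url = "movies7-series.example"  # illustrative URL, not one from the slides
exp = explainer.explain_instance(url, predict_proba, num_features=5, top_labels=1)
print(exp.as_list(label=exp.available_labels()[0]))  # token -> contribution weight
```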

Then we have democraticstrategy.org, which is likely to be politics, and finally pokerstars.tm, which is obviously a gambling site. And then we have the other set, where we made mistakes with the model. The first URL is solarknives.net; on the right we can see that the model looked at the word 'knives' and immediately said it was a weapons site, whereas in reality it was food and dining, so I would say this is a fair mistake to make. Then we have World Cancer Day, which was predicted to be an NGO; in truth it was a healthcare site, so

this should have been caught; sometimes the model just doesn't capture the proper subwords. And finally, the case where you would expect a URL-based model not to perform so well is study.com, where the URL has no signal whatsoever about what it will host. Unless you knew that study.com was a hosting site, you stood no chance of predicting with this model that it was a hosting site. That probably explains why we still have uncovered data in our telemetry even after utilizing the model, but even with that we get a nice extra coverage of URLs. And with that, thank you for your attention, and please,

if you have questions, ask. [Applause] Do we have any questions?

Yep, so I was thinking, what is the next step to fine-tune this? How would you be able to fine-tune it? Because you are focusing now only on the URLs; do you want to add more content based on the words and pictures on the web pages? So the main next step is that this system acts as a pre-filter. Obviously there are methods that we want to try, like different models, but I would say if we are looking at it as a system as a whole, the next step would be implementing a model that takes in the HTML from the URL; that would provide more signal. So that's the trade-off: downloading and scanning the HTML and

predicting on the HTML is way more pricey, but using this model as a pre-filter and then feeding that into an HTML model, that's the next step for this project. Thank you.

Thank you for the presentation. I would like to ask a quite practical question, because I'm working with content filtering software, testing it and so on, and what I see so far is that even with this kind of AI introduction it is still not near how humans try to interpret a page. So I would like to ask your opinion: when will we get to the point that, instead of classifying a whole domain and saying this is one kind of site, this is a gambling site, the classification software downloads the page, interprets the text with machine learning, interprets the pictures, audio files, and video files on the page, and can actually say that, for example,

this download, this part of the Pirate Bay site, is a free textbook, therefore you can download it and it's completely fine, but for other content it says it's highly inappropriate, therefore it is classified as a torrent site, so you are not allowed to open this URL in your browser when you are working in a corporate environment? Thank you. So I would say machine learning is there, but there are more aspects to machine learning than just the model. There are two other key aspects. One is the data that you use to classify, so my answer would be, first, that if you think of the model as a compiler and the

data as the code the compiler executes, then we need better code, we need better data. That's one part, and the other part is that the hardware is not there yet. There are models that are pretty good at classifying all the things that you asked about; it's just pricey to get the data for it and to run it for everything. But I would say the algorithms are there, and I think maybe 10 years, but it's not really for me to say, that's just my personal guess. I have a simple practical question: why is three the magic number? Because I realized that in another product they also took every three consecutive characters

and built their model based on that. Why not four characters? So we actually have four of those layers: we take every two-, three-, four-, and five-length substring, and it's empirical. Someone tried six and someone tried one, and we arrived at the solution where it made sense, but there's no theoretical reasoning, we just tried it all. Okay, that seems to be it. Tamás, again, thank you very much. [Applause]