
Detecting Malicious Websites using Machine Learning

BSides DC · 2016 · 46:48 · 1.4K views · Published 2016-10 · Watch on YouTube ↗
About this talk
Researchers present machine learning algorithms that detect malicious websites by analyzing SSL certificate attributes extracted from network traffic. The work demonstrates pattern-based detection using logistic regression and other classifiers, showing how identifying rare certificate field combinations can expand threat indicators and reveal infrastructure correlations.
Original YouTube description
We present a set of newly tuned algorithms that can distinguish between malicious and non-malicious websites with a high degree of accuracy using Machine Learning (ML). We use the Bro IDS/IPS tool for extracting the SSL certificates from network traffic and training the ML algorithms. The extracted SSL attributes are then loaded into multiple ML frameworks, such as Splunk and AWS ML, and we run a series of classification algorithms to identify the attributes that correlate with malicious sites. Our analysis shows a number of emerging patterns that even allow for identification of hijacked devices and self-signed certificates. We present the results of our analysis, showing which attributes are the most relevant for detecting malicious SSL certificates, as well as the performance of the ML algorithms.

Ajit Thyagarajan (CTO at Atomic Mole): Ajit Thyagarajan is an innovative and passionate technologist who explores challenging technology opportunities. He is currently CTO at Atomic Mole, a cybersecurity company developing a simple and effective security solution for the enterprise. Until recently, he held multiple Director positions at Fidelis Cybersecurity. His area of research is new techniques for the detection of malware using network tools. Prior to Fidelis, he was heavily involved with Internet protocols and building fast routers. Ajit also mentors several cybersecurity start-ups as part of Mach37, a Virginia-based cybersecurity incubator.

Andrew Beard (Lead Software Architect at Atomic Mole): Andrew Beard is the Lead Software Architect at Atomic Mole. His background is in software development, threat research, and abuse of enterprise-grade security products in his home network.

Thanks to our video sponsors:
Antietam Technologies http://antietamtechnologies.com
ClearedJobs.Net http://www.clearedjobs.net
CyberSecJobs.Com http://www.cybersecjobs.com
Transcript [en]

The BSides DC 2016 videos are brought to you by ClearedJobs.Net and CyberSecJobs.Com, tools for your next career move, and Antietam Technologies, focusing on advanced cyber detection, analysis, and mitigation.

Hi, my name's Andrew Beard. My colleague Ajit Thyagarajan is sitting over there in the audience. We work for a little company called Atomic Mole, and I wanted to talk today about detecting malicious websites using machine learning. I'm really talking about a very specific subset of malicious websites: I'm focusing entirely on SSL-encrypted websites, ones available over HTTPS. So the obvious question is, why focus on those?

Encrypted websites have seen a huge rise in the past couple of years. It's easier than ever to get an SSL certificate, free or low cost, or to make your own via a self-signed certificate. The big thing I want to focus on, if you can actually read it, is that part up in the corner. This was a slide taken from a Dell conference in 2015 that was looking at 2014 data, and they were projecting at that point that 50% of all attacks would use SSL by 2017. They were saying this in 2015; 2017 is a whole lot closer now than it was then, and everything we're looking at says they're pretty much on track.

So SSL: there are definitely some benefits, but there are some pretty big downsides to its use too, especially in the post-Snowden-leak world. Users and websites all want encryption; they want to encrypt all the things. I don't want my email being read by anybody, be it a government, a malicious individual, or the company that may be hosting it. We want everything encrypted in transit. That's great for users. It's not so great for the folks sitting in the NOC who are used to having content visibility, who are used to being able to see the traffic going back and forth and tell whether there's an executable in a data stream or whether it's just somebody looking at Twitter. It's really good news for people using the Wi-Fi at Starbucks, and really bad news if you're on the defense side.

There are SSL decryption appliances, proxies, man-in-the-middle devices, all of that, which can be deployed. Has anybody ever actually deployed one of those devices in an enterprise environment, like an SSL decryption pass-through? Okay, so we got maybe three hands. Was that a fun experience for you? Yeah, and that pretty much mirrors the experience of everyone I've ever talked to who's had to deal with one: trying to manage keys, trying to manage cut-throughs, suddenly realizing that you now have the liability whenever one of your employees visits their bank. It's a bit of a pain in the butt. And especially when we deal with that kind of medium-enterprise market, those devices are entirely out of the realm of possibility. So the question is: what can we do? What can we look at in an SSL-encrypted stream that can give us some indication of whether this is something okay or something that may be a problem?

A big part of that is the SSL certificates themselves. This is a DER-encoded SSL certificate, represented here in its text (Base64) form. If you buy an SSL certificate from a CA and they give you a key file back, this is probably what it looks like; this is what it looks like to, say, your web server on disk. It's not what it looks like in transit, though. In transit it looks a little more like this, and I'm sorry, this is a bit of an eye chart. This is a dump taken from decoding that same certificate, just running it through Base64 to get it back to raw binary and doing a hex dump of it. You may or may not be able to see that there are a couple of strings in here that look like ASCII, but they look kind of weird: there's not a lot of actual human-readable text, and there are also some things that look like ASCII-encoded numerical values. So if you're just looking at this on the wire and you don't understand the format of the certificate, you can see there's stuff here, but you don't necessarily know what it means.

However, if you run it through the openssl tool that's included as part of the OpenSSL distribution, which I believe is available on pretty much every Linux system ever and packaged as part of macOS (chances are you already have it if you're not running Windows), you take that same certificate, do the Base64 decode, pipe it through, and all of a sudden some of this stuff makes sense, if you can read the screen at all. You can see there are different fields: version numbers, different algorithms, validity dates, and some of those strings that may look a bit familiar from the previous slide.

All of this you can now start associating: there are key-value pairs attached to the certificate beyond just the binary blob we saw earlier. I'm using the openssl binary here, but there are a number of things that can read this. Wireshark, if you look at an SSL exchange where the server sends its certificate, will dump pretty much all of this info. Bro, a really excellent tool that Liam was doing training on the other day and I believe is doing again on Sunday, can decode this live.

But I want to talk about some of these fields in particular, specifically two sets of fields: the certificate issuer and the certificate subject. Within this field there are a number of sub-fields, and you can see things like C=AU, which means the issuer country code is AU, Australia. The ST, which according to the spec is supposed to be the state, is "F2te4", which is no state I've ever heard of. Same thing with the location, organization, organizational unit, and common name; the common name is more or less the host name of the server, in most cases.
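As a rough illustration of pulling those issuer and subject sub-fields out of a certificate (this is a sketch, not the speakers' actual tooling; it assumes the third-party Python `cryptography` package, and the file name is a placeholder):

```python
# Sketch: extract the familiar C/ST/L/O/OU/CN sub-fields from a PEM certificate.
# Assumes the "cryptography" package; "cert.pem" is a placeholder file name.
from cryptography import x509
from cryptography.x509.oid import NameOID

FIELDS = {
    NameOID.COUNTRY_NAME: "C",
    NameOID.STATE_OR_PROVINCE_NAME: "ST",
    NameOID.LOCALITY_NAME: "L",
    NameOID.ORGANIZATION_NAME: "O",
    NameOID.ORGANIZATIONAL_UNIT_NAME: "OU",
    NameOID.COMMON_NAME: "CN",
}

def name_to_dict(name: x509.Name) -> dict:
    """Flatten an X.509 Name into key-value pairs like {"C": "AU", "ST": "F2te4", ...}."""
    out = {}
    for oid, label in FIELDS.items():
        attrs = name.get_attributes_for_oid(oid)
        if attrs:
            out[label] = attrs[0].value
    return out

with open("cert.pem", "rb") as f:
    cert = x509.load_pem_x509_certificate(f.read())

print("issuer :", name_to_dict(cert.issuer))
print("subject:", name_to_dict(cert.subject))
print("valid  :", cert.not_valid_before, "->", cert.not_valid_after)
```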

These are the fields we have specifically been concentrating on as part of our research. There's been other research on some of the other fields, the validity dates in particular; there was a presentation at BSides Charm earlier this year by Will Gluck where he looked at those values. One thing I want to mention: this particular certificate is actually common to pretty much all of the TrickBot C2 nodes. This was brought up in a paper that Fidelis Cybersecurity published, I think last week or the week before. Aside from the country code, which is AU, these values look almost entirely random; they're definitely not valid countries or locations or anything like that. If you look at them long enough, though, and you have a US keyboard in front of you as I do, you can see that all of them sit very close together on the keyboard. So what looks like a random string of characters, where you'd figure somebody was running a script, is actually somebody who went through the certificate generation process and just mashed nearby keys at each prompt. That gives you some information about who set these nodes up: this isn't somebody with a lot of scripting who's going to be generating one of these keys or certificates every 30 seconds; this is somebody with a very manual process.

It's just an example of the kind of thing you can get when you look a little deeper into these attributes. So the issuer and subject fields, the ones I talked about and that we saw in this particular certificate, are structured by convention: it's supposed to be that C is a country, it's supposed to be that L is a location, that's what the RFC says is going to happen, but nothing actually enforces that in any technical process when you do a self-signed certificate. If a certificate authority generated the certificate, or at least signed it, they may check that those values are sane, and if you try to submit one with a location of "Fe-whatever" they may say no, we're not going to do it, go away. Not all certificate authorities do, but some of the larger ones do, especially when you go beyond domain-validation certificates. Self-signed certificates, though, are a total Wild West: you can put whatever data you want in the thing, and it's up to the client to try to process it.

So I want to talk about a couple of resources that are available and that we use as part of this, and how the pieces fit together; it'll make a little more sense as we go on.

One of them is Project Sonar. Project Sonar is a project run by Rapid7. It goes out and scans the complete IPv4 space every seven days, looking for servers running SSL web servers on port 443, and specifically on port 443: if you're running something on 8443 they don't look at it at all. They publish data every week, and they've been doing it for a little over three years now. So far they've captured about 82 million certificates. They make all that data available to researchers; you can download it week by week, and they give you complete X.509 certificates in that Base64 DER-encoded form I showed earlier. So there's a lot of data available there, but you've got to figure out what you want to do with it. Obviously, if you're interested in SSL research, this is the kind of thing that makes people salivate a little when they think of all the potential, but there's a bit of work required to actually get something meaningful out of it.

Another project is the SSL Blacklist, SSLBL. This is run by abuse.ch, also known as the Swiss Security Blog.

They put out a lot of really good feeds of information, but this one in particular is focused almost entirely on certificates used by malware C2 servers, and they're almost all commodity C2 servers: we're talking things like Dridex and Gootkit, and Dyre was a big one when Dyre was still a thing. They distribute lists of certificate fingerprints, SHA-1 hashes. So that's a resource where you can go through and find some known-bad certificates.
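As a rough sketch of how a fingerprint feed like that gets used for labeling (the feed file name and its column layout here are assumptions, not the real SSLBL format):

```python
# Sketch: label a certificate as known-bad when its SHA-1 fingerprint appears in an
# SSLBL-style feed. The feed path and its "sha1" column are illustrative only.
import csv
import hashlib

def load_blacklist(path: str) -> set:
    """Read a CSV of fingerprints into a set of lowercase SHA-1 hex strings."""
    with open(path, newline="") as f:
        return {row["sha1"].strip().lower() for row in csv.DictReader(f)}

def is_blacklisted(der_bytes: bytes, blacklist: set) -> bool:
    """True if the DER-encoded certificate's SHA-1 fingerprint is on the list."""
    return hashlib.sha1(der_bytes).hexdigest() in blacklist
```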

So I'm going to make a couple of assumptions here, plus one hypothesis, and hopefully these prove out, or at least you think they prove out, as we talk through things and you see how they link up. One is that the internet is really, really big. There are a ton of sites out there, but malicious sites, ones set up with absolutely bad intent, are relatively rare. There's a lot of weird stuff out there that isn't necessarily malicious; I still don't know why I see so many certificates for BlackBerry PlayBooks in Project Sonar, but those aren't necessarily malicious, unless you've actually used a BlackBerry PlayBook. The next assumption is that certificate blacklists don't provide anywhere near full coverage, even of a particular malware family. Even though the internet is really, really big, there's still a lot of bad stuff out there, and the total number of certificates flagged by blacklists is really small compared to the problem. The SSLBL, I think, has somewhere around 130,000 certificate hashes total, and the vast majority of those are for randomly generated Dyre servers, which weren't really an issue as of February anyway. So if you went out and blocked every access to every site on an SSL blacklist, you could pretty much guarantee that you're still exposed. It might help you out a little bit, but if that's your sole line of defense, sorry, you're in bad shape.

The other assumption is that we can pretty much trust that the certificates identified as bad really are bad, and are going to stay bad. The kind of certificates on the blacklist are ones created with malicious intent: they were specifically generated to use on a C2 or malware-distribution server. These generally aren't hijacked certificates. If you're a small electronics manufacturer in South Korea and somebody hacks your website and starts using it to distribute malware, that's not the kind of thing that shows up on this particular blacklist, at least, so it isn't a case of "okay, we cleaned up your site and you're clean now." In a lot of ways it's very similar to what we assume with malware MD5s: if it was a bad file, a piece of malware, it doesn't generally become good at some point later.

And the last one, which is really the hypothesis: some actors generate certificates specifically for malware sites, for C2 servers, for distribution sites, and a lot of them do it in batches that look similar. What do I mean by similar? Similar means they share some identical metadata fields: some of the fields we were looking at earlier are the same as fields used in other certificates that were bad.

With these assumptions, I think we can build a database of certificate metadata that can help us find and track our adversaries better than just using the blacklist. We can take what's available in things like Project Sonar, plus the targeting information we have from the SSLBL feed, and find something that's a little better than the sum of its parts.

This is the general flow. Number one, Project Sonar spiders the internet. We take the data from them, run it through a bunch of Python scripts and the openssl binary to generate a bunch of CSV files, where each row represents a certificate observed on the internet, and then we dump it into an Amazon Redshift database. We also apply the information from the blacklist to label which certificates we know are bad and which ones we don't necessarily know much about. My CSV files look a whole lot like Bro data fields, if you've ever used Bro; that's mainly because when we started this we were using Bro exclusively. I like Bro, I use it a lot, and the first paper we did on this was entirely about Bro.
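To make that pipeline concrete, here's a minimal sketch of the certificates-to-CSV step, assuming an input file of Base64-encoded DER certificates, one per line; the file names, the input layout, and the chosen columns are placeholders rather than the actual Sonar format or the speakers' scripts:

```python
# Sketch: turn a file of Base64(DER) certificates, one per line, into CSV rows of
# metadata. Input layout, file names, and columns are placeholders.
import base64
import csv
from cryptography import x509
from cryptography.hazmat.primitives import hashes

COLUMNS = ["sha1", "subject", "issuer", "not_before", "not_after"]

def cert_row(b64_der: str) -> dict:
    cert = x509.load_der_x509_certificate(base64.b64decode(b64_der))
    return {
        "sha1": cert.fingerprint(hashes.SHA1()).hex(),
        "subject": cert.subject.rfc4514_string(),
        "issuer": cert.issuer.rfc4514_string(),
        "not_before": cert.not_valid_before.isoformat(),
        "not_after": cert.not_valid_after.isoformat(),
    }

with open("sonar_certs.b64") as src, open("certs.csv", "w", newline="") as dst:
    writer = csv.DictWriter(dst, fieldnames=COLUMNS)
    writer.writeheader()
    for line in src:
        line = line.strip()
        if line:
            writer.writerow(cert_row(line))
```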

I mention Amazon Redshift here only because it's cheap, and it's a Postgres cluster for people who have absolutely no business trying to set up Postgres clusters, like me. I think the instance we used cost about 50 cents an hour, and once you load the data in you can shut it down, snapshot it, and bring it back up on demand. The entire amount of research here, if I weren't lazy and hadn't forgotten to terminate it at the end of most days, would have cost us about as much as two or three burritos.

So this is what it would look like if you were looking at our database and you couldn't see at all. Each one of these columns represents a separate data field, and you'll notice most of them are pretty sparse. Most of the certificates coming up through Project Sonar don't have most of the fields; many of them may only have, say, the common name field, which is this one here, and which you can see is filled in pretty consistently. This one is the location, this one is the country code, and there are definitely certificates that have all of that information.

But a lot of them don't, and a lot of that is because there are just a lot of devices out there with an SSL certificate that says, hi, I'm 10.15.11.278, and that's it; we can't necessarily tell a lot from that.

I do want to talk about what we can and can't find, because there are a lot of caveats here. What we can't find: the example I mentioned earlier, hijacked or compromised certificates. If the certificate was a good certificate when it was generated, if no malicious intent went into it, there are probably no patterns we could recognize after the fact, after it was hijacked. If you generated a good certificate for your website, and you did all the things that people who generate good certificates do, and then somebody hijacks the site later, that doesn't change the certificate; I can't tell anything from it. It also doesn't really help against targeted attacks. If somebody is standing up infrastructure specifically to target your organization, they're probably going to do it a little differently than they did the previous time, so there probably aren't going to be a lot of patterns there. It's possible, but it's unlikely. This is generally for the kind of crimeware where somebody is targeting a whole group of people unilaterally.

The other thing it's not going to help with at all is the initial wave of a campaign. The way we're looking at things right now, we're using SSLBL as the pointy end of the stick: they have to recognize something as bad first, before we can start recognizing whether there are patterns there. On the plus side, it does find a fair amount of stuff. It finds a lot of certificates that were specifically issued so that someone could distribute malware; it finds a lot of commodity malware, TorrentLocker, like I said Dridex, Dyre, Gootkit; and it finds a whole lot of dumb. When we were looking through these, especially manually, we noticed a whole lot of patterns that just seem to display a striking lack of intelligence on behalf of the people generating these certificates. I have a couple of examples of these later, but I couldn't put in too many; if you're really interested, talk to me afterwards, because there's some funny stuff in there.

So the first thing we started with: we said, all right, we really want to determine whether a certificate is good or bad. We want to give it a binary classification, just say yes or no. Effectively, we're building a binary classifier model.

A very simple implementation of that is to use a logistic regression algorithm. It's a very common method with a whole lot of implementations available, and it's a supervised learning model: basically, we use the SSLBL data we have to label our data, to say this is bad stuff, we don't know whether this is bad or not, or, incorporating some other data, we're sure this is not bad. So for a given data set we have an idea of whether it's good or bad, and we can use that to train our model.

Our first try was: let's just label some good data using sources like the Alexa Top 1 Million and the SSL Notary, which kind of rates certificates; let's label 5 million certificates as good, take all the bad certificates we have, throw them into a logistic regression model, and see what happens. Not a lot of good things happened. We labeled 5 million good certificates, and we had 1,335 certificates recognized as bad that showed up in the Project Sonar data. This is a graph of the good certificates versus the bad; see that little line right there? No, of course you don't, Keynote doesn't even render the line, so it doesn't matter that the projector is bad. We're talking about 5 million certificates that we know are good out of a set of 5,001,335, which means 99.973% of our certificates were good. We threw it into a model and it reported absolutely amazing accuracy, a great fit, and within a couple of minutes we realized the problem. Anybody see the problem straight off the bat? We basically taught our model how to lie. It just said, you know what, every single certificate in the world is good, there are no bad things in the world, awesome, I'm right 99.97% of the time. We taught a computer how to lie, we taught it how to cheat; we may be one step closer to Skynet, and if this leads to the enslavement of humanity, oops, my bad.

There are thresholds you can adjust with a logistic regression model to prefer false positives versus false negatives, but in a case like this it doesn't matter: you're so far biased toward good results, because you've given it so much more good stuff, that it's always going to fail. So we learned pretty quickly that you really need to balance your data. You need to split things up in a way that makes more sense to the model, and start with data sets where the amount of good and the amount of bad are a little closer, so you don't just train the eternal-optimist model that says the whole world is good and there's no badness here.
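As a rough sketch of that classifier and the imbalance trap (the column names and CSV file are placeholders, and the talk's actual runs used AWS ML and R rather than this scikit-learn code):

```python
# Sketch: logistic regression over certificate metadata, illustrating the class-
# imbalance problem from the talk. Column names and files are placeholders.
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv("certs_labeled.csv")          # hypothetical: one row per certificate
X = DictVectorizer(sparse=True).fit_transform(
    df[["issuer_C", "issuer_ST", "issuer_L", "issuer_O", "issuer_CN"]]
      .fillna("").to_dict("records"))
y = df["is_bad"]                               # 1 if the SHA-1 was on the blacklist

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# With ~5,000,000 good rows and ~1,335 bad ones, a plain fit learns to answer
# "good" every time and still scores 99.97% accuracy. class_weight="balanced"
# (or downsampling the good set) is one way to keep it honest.
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te), digits=3))
```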

So we thought, okay, let's try something a little bit different. We went on and tried some other things, and eventually we came to a use case where we said, all right, let's talk about random-looking values. (My slide for this is missing; it doesn't matter, you wouldn't be able to see it anyway.) There were a lot of values in the bad certificates that looked random, specifically in the location, common name, and organization fields. When I say random, I mean values of varying length with characters A through Z, lowercase and uppercase, and numerals; it was very easy for a person to look at one and say, yeah, this is complete junk. It's possible to recognize that kind of thing with a transformation: compute the Shannon entropy of every field and look for certain cutoffs, or count the number of unique characters that appear. But that's kind of expensive to do on every piece of data, not just in this set but eventually, when we want to apply this to live traffic, on every certificate that comes across your box, and we really wanted to do this without creating more problem-specific attributes. So we took our 1,335 bad certificates and labeled them all as random or not, and it came out that 38% of them were random, which is what it says on that slide you can't see. Oh, I do have a slide for it, and yeah, you totally can't see it. But 38% is a pretty surprising percentage when you're talking about the total of all the bad things you found.
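For reference, the entropy transformation mentioned above is only a few lines; a minimal sketch, with made-up example strings:

```python
# Sketch: per-field Shannon entropy, the kind of transformation mentioned as one way
# to flag random-looking subject values (at the cost of computing it on every
# certificate that crosses the wire).
import math
from collections import Counter

def shannon_entropy(value: str) -> float:
    """Bits per character of the string's empirical character distribution."""
    if not value:
        return 0.0
    counts = Counter(value)
    n = len(value)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(shannon_entropy("London"))        # relatively low
print(shannon_entropy("f2Te4qX0pZ9"))   # noticeably higher: random-looking
```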

So we said, all right, let's take this. The column we added here is just called "looks random"; a person went through all 1,335 certificates and said, this one looks random, this one doesn't. We ran a logistic regression model on that too, and this time we found something pretty good: we were in the 95% range for both positive and negative results, so it could differentiate, the majority of the time, whether or not a value looked random. The bad news is that we started playing around with it and realized it was basing the decision entirely on the state and country codes. The state and country codes are, most of the time, two-character uppercase values, so with 26 letters you've got 26 squared combinations for each field, only 26 to the fourth possible combinations in total. The model basically ignored all the other values, the very long random text strings, and just said, all right, based on these two fields I'm going to figure out which values are common and good and which are really rare and bad.

There's a much easier way to do that. There's a method called Bloom filters, which are basically a probabilistic way of taking sparse data and building a filter out of it that says yes, this matches the pattern, or no, it doesn't. By using a Bloom filter to record which country codes and which state codes are really, really common, if I see a state of "VK", chances are that's not something good, that's probably something random, and if I combine that with a country code no one's ever heard of, then there's a pretty good likelihood this was something entirely randomly generated. This method is a lot less computationally intensive than running an entire logistic regression model and evaluating everything against it. Like I said, it's not 100% accurate, because we're talking about random values: it's always possible somebody randomly generates a US country code and combines it with a valid state code, but the probability is quite low.
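A toy version of that filter, just to show the idea; in practice you would probably reach for an existing Bloom filter library, and the "common" training values here are invented:

```python
# Sketch: a tiny Bloom filter answering "is this country/state code common?".
import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 1 << 16, hashes: int = 4):
        self.size = size_bits
        self.hashes = hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive several bit positions from salted SHA-1 digests of the item.
        for i in range(self.hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

# Train on codes that appear frequently in the overall corpus (illustrative values)...
common = BloomFilter()
for code in ("US", "GB", "AU", "CA", "VA", "NY"):
    common.add(code)

# ...then treat rare combinations as probably random.
print("US" in common)   # True
print("VK" in common)   # almost certainly False -> looks random
```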

So we did some more work. Overall, we were trying to come up with different patterns, different groups of bad certificates that we could pull out and turn into something usable, and we ended up getting our best results out of simple clustering. Clustering here is all about counting occurrences of specific values. We decided to start with the most common values for a given attribute: for the location, what are the common locations? There are ones like, say, London, Springfield, York. Then you look for other common attributes that occur in that same set of results: what country code generally corresponds with London? Most of the time it's Great Britain; that one's kind of obvious. But then there are other questions, like which organizations in our set of bad certificates normally correspond with London, and that's where you start finding more interesting things. This is a little harder to do with categorical data, and by categorical I mean non-numerical: the word "London", just as text, isn't particularly closer to "Taipei" than it is to "Springfield", whereas if these were all numbers I could say four and five are closer together than five and forty-seven. So we start with a kind of one-dimensional clustering, which is basically just finding counts of values, then we look at points of intersection with other values and try to find things that are common in our bad set.
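A sketch of that counting step, using a placeholder labeled CSV and placeholder column names:

```python
# Sketch: "one-dimensional clustering" -- count how often each value of one attribute
# appears in the bad set, then see which values of a second attribute co-occur with
# it. File and column names are placeholders.
from collections import Counter
import csv

with open("bad_certs.csv", newline="") as f:     # hypothetical: rows labeled bad via SSLBL
    bad = list(csv.DictReader(f))

loc_counts = Counter(row["subject_L"] for row in bad)
print(loc_counts.most_common(5))                 # e.g. "", "Springfield", "York", "London", ...

# For each common location, which organizations show up alongside it?
for loc, _ in loc_counts.most_common(5):
    orgs = Counter(row["subject_O"] for row in bad if row["subject_L"] == loc)
    print(loc or "<empty>", "->", orgs.most_common(3))
```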

This is generally an unsupervised activity, just finding clusters of values, but we're specifically clustering within the set of values we already know are bad, so it's still an unsupervised method with some level of supervision when we go to apply the results. Imagine for a second that you take one attribute, in this case the location, the L field, and pretend you're back in grade school with one of those big pads of construction paper and those stupid little scissors you can't possibly cut yourself with, and you cut out a piece of paper for every value you find; the bigger the piece of paper, the more common the value. If I cut these out and laid them all on the page, the most common location I've got is no location whatsoever, and that kind of makes sense given the sparse table we showed early on: most people don't include a location, and most of the people making malicious certificates don't either. But then there are some other values that are very common, Springfield, York, London, Taipei, and then a bunch of little ones.

Then we do the exact same exercise with another attribute and overlay the two on top of each other. You end up with a whole bunch of blobs, and this helps show some of the different things that can happen when you do these intersections. For example, location Springfield and this one organization almost always occur together; those two sets are almost identical, and almost nobody uses location Springfield without that organization. An empty location is pretty common, but then there's a big section where you've got an empty location and "Internet Widgits Pty Ltd". By the way, does anybody know the significance of that string, Internet Widgits Pty Ltd? It's the default, right, the value you get if you just click through, so somebody may totally omit the location but still have this in there; it's common all over the place, and I'll show it again a little later. "My Company Ltd", that should sound familiar to a couple of people too. "Company" is a common organization no matter what the location is, so there's no real correlation between those two values, but there is some correlation between My Company Ltd and York. And Taipei: although the location itself is pretty common, maybe nothing clusters with it specifically from an organization standpoint. So we keep doing this over and over, trying to find different layers that intersect with each other.

(Oops, sorry, wrong one.) That makes big groups like this, which gives us an idea of what a pattern is. But this only tells us it's common in the bad stuff; we still need a way to rate it, so we try to do some math around that. Overall, the patterns we find should be more common in the bad certificates than overall: if we're looking for bad patterns and we have a set of bad things, we're hoping we know more about the bad stuff than we'd learn by just searching for the same thing across the internet at large. We ended up with a ratio: the number of matches in the overall set, the 82 million, versus the number of matches in our bad set of 1,335, normalized to the number of matches in the bad set. Effectively we're saying: for every time this showed up in the bad set, how many times would I expect it to show up overall? If the ratio is exactly one, that means we only ever found it in the bad set, because the bad set is a subset of the 82 million, so it's not really useful: we've already identified every case we could possibly know about where this thing occurs. That doesn't mean it isn't a good indicator; it just means we've learned everything we can about it by looking at the blacklist values alone. If the ratio is greater than about a thousand, the value occurs really often in the overall set, in fact it may be proportionally more common overall than in our small set; it's a pattern, but not necessarily one we want to use, and I'll show an example of that in a bit. The sweet spot we found is a ratio between about 1 and 250; that's where the potentially useful stuff lives.
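In code, the scoring is just that ratio plus a window check; the match counts would come from queries against the certificate database, and the numbers in the example are the ones quoted in the talk:

```python
# Sketch: the expansion ratio described in the talk -- overall-set matches per
# bad-set match, with the roughly 1-250 "sweet spot" applied as a filter.
def expansion_ratio(overall_matches: int, bad_matches: int) -> float:
    return overall_matches / bad_matches

def interesting(overall_matches: int, bad_matches: int,
                low: float = 1.0, high: float = 250.0) -> bool:
    r = expansion_ratio(overall_matches, bad_matches)
    # r == 1   -> we only ever saw it where the blacklist already told us to look
    # r >> high -> it's just a common value everywhere (e.g. an empty O field)
    return low < r <= high

print(interesting(13, 2))        # ratio 6.5 -> worth expanding (the talk's example)
print(interesting(186_000, 75))  # the Internet Widgits default -> too common
```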

I have some scripts in our GitHub that calculate these values and spit them out from the raw database, and this is an example of that script's output. What it's basically saying is, all right, take this one, say US and Monsanto: I found "Monsanto" as a subject organization twice in our bad set and 26 times in the overall set; okay, there may be something there. "Democracy": found it twice in our set and 11 times in the overall set, so I'm looking at expansion. This one here is the empty value: a blank subject organization appeared 189 times in the bad set and then, that's a lot of digits, I can't tell if it's 52 million or 5 million times, in the overall set, which means it's incredibly common overall. A blank organization appears in the bad set fairly often, okay, but it appears everywhere fairly often, so I can't say it's an indicator of bad behavior. Once something gets that big, again, Internet Widgits Pty appeared 75 times in our set and 186,000 times overall; it's just a common value. When those high values come up you can tell there's normally something going on, some reason it appears often, but that reason is not necessarily maliciousness.

Let me take an example of one that's in the sweet spot we narrowed down on. This value here appeared twice in our bad set and 13 times in the overall set, a ratio of 6.5, so the algorithm goes to the next step and tries to correlate it with other values. I don't know how much of this you can see, but it says, okay, based on this particular organization value I'm going to look for other locations, other organizational units, other serial numbers, other countries.

It tries to find every other value that seems to occur in common with this one, and it came up with one, two, three, four, five, six, seven values. You can't really see it, but it's kind of funny, and this is where I start talking about the dumb stuff. This value shows up with a subject locality of "Quebeck", but Quebec doesn't have a K on the end of it, and with "Monreal", but there's a T in Montreal. When you start seeing these patterns you tend to see a lot of misspellings, a lot of things that on a very casual glance look like they might be legitimate, but on even a cursory examination you can see this is just somebody going really fast, trying to do a lot of stuff, who totally flubbed it. Down here you can see all the results for the pattern we have: "Quebeck", "gay team" as the organizational unit, Canada as the country, "Monreal" and "Quebeck". You can't quite see it, but this column is whether or not the row appeared in our feed; this particular row did, and you can see the first two appeared in the feed. So there were two things that had a lot in common, they were almost identical; we searched for those specific values in our overall set and found 13 different cases where they popped up, and after a little more research it turned out that every single one of them was actually hosting a Dridex C2 server. So once you start seeing these weird patterns of things that are fairly uncommon but do expand out, you can generally say there's commonality here, and whoever generated the first two certificates is probably the same guy who generated all the others: they're things that are rare in isolation but for some reason appear more often than they should in our bad list.
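The expansion step itself is simple once the pattern is in hand; a sketch with invented rows and placeholder field names, echoing the misspellings from the example:

```python
# Sketch: expand a pattern -- take the attribute combination shared by a couple of
# known-bad certificates and pull every certificate in the corpus that matches it.
# Rows and field names here are invented placeholders.
def matches(row: dict, pattern: dict) -> bool:
    """True when the row carries every field/value pair in the pattern."""
    return all(row.get(field) == value for field, value in pattern.items())

pattern = {
    "subject_C": "CA",
    "subject_L": "Monreal",    # note the misspellings observed in the talk's example
    "subject_ST": "Quebeck",
}

corpus = [  # stand-in for the 82-million-row table
    {"subject_C": "CA", "subject_L": "Monreal", "subject_ST": "Quebeck", "sha1": "aaa..."},
    {"subject_C": "US", "subject_L": "Reston",  "subject_ST": "VA",      "sha1": "bbb..."},
]

expanded = [row["sha1"] for row in corpus if matches(row, pattern)]
print(len(expanded), "certificates share this pattern:", expanded)
```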

Do that, and all of a sudden you can find a lot more stuff than you did from the blacklist alone. So, big question: what can I do with this? This was going to be my good/better/best, except the first one isn't really that good, so it's my okay/better/best list. The first thing you can do, as we saw here, is take your blacklist and expand it. We found 6.5 times as many indicators as were on the blacklist for that one particular example, and it turns out that on the page I showed earlier, everything in the 1-to-250 range, save two indicators, came up with a similar expansion rate. So you can go from 1,335 indicators to tens of thousands, hundreds of thousands, pretty easily. You can make a better blacklist, one based largely on inference, and push it through the same way you do your normal feeds. The better thing to do is to start using those patterns and looking for the strings in your SSL traffic. You can do that even with a packet-based IPS; we saw early on that those strings are recognizable even in the binary data. There are plenty of Snort signatures, SID 19551 in particular is an example, that look specifically for binary content in the SSL exchange, and those text strings generally map back to an attribute somewhere in there. The best thing you can do, though, is to start looking at actual patterns in your network traffic using a tool that can read SSL data. Bro is a good one, and there are plenty of commercial tools that can do similar things. It really doesn't take a big investment to get that kind of contextual awareness and start reading what those values can tell you.
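For the Bro route, matching the expanded patterns can be as simple as sweeping the certificate log; a sketch, where the log field names are assumed from a stock x509.log and the pattern strings are the illustrative ones from above:

```python
# Sketch: sweep a Bro/Zeek x509.log for subject strings matching an expanded pattern.
# Log field names are assumptions about a stock x509.log; adjust to your deployment.
PATTERNS = ["ST=Quebeck", "L=Monreal"]

def bro_rows(path: str):
    """Yield dict rows from a Bro/Zeek tab-separated log, using its #fields header."""
    fields = None
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("#fields"):
                fields = line.split("\t")[1:]
            elif line and not line.startswith("#") and fields:
                yield dict(zip(fields, line.split("\t")))

for row in bro_rows("x509.log"):
    subject = row.get("certificate.subject", "")
    if any(p in subject for p in PATTERNS):
        print("hit:", row.get("ts", ""), subject)
```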

You can match against that and find a lot more going on. And I say that looking for this live is so much better than just creating a blacklist because there's a lot of delay in a blacklist. If you're working off SSLBL, you're limited to what they specifically can see; if somebody generates another certificate with the exact same pattern and it hasn't shown up in SSLBL yet, you don't know about it until that whole process catches up. If you look for these things live, you can hit them immediately.

So this is the triangle, really the pyramid, of pain, except it's rendered as a two-dimensional figure. I think David Bianco was running around here at some point; there you go, thank you for the triangle. When we started, we were talking about feed values: certificate SHA-1 values are just hash values, and that's the bottom of the pyramid. The pyramid is all about how much pain an individual kind of indicator can cause your adversaries, and personally I want to cause as much pain as possible; I want to see my enemies driven before me and hear the lamentations of their women. So I want to do better than hashes, and by working through this process and coming up with certificate attributes we're really talking about network artifacts: things with much longer shelf lives, things where we can predict, or at least cope with, some amount of change on the adversary's side. They can keep generating certificates all day long with similar attributes, and if we know what those attributes are, we win. This gives us somewhat higher-quality indicators to work with.

So, big takeaways: it's possible to leverage really large data sets, together with specific indicators, to find more general patterns that are more useful than those specific indicators in and of themselves. You've got to choose your data sets carefully, though.

If you just try to dump everything in and see what happens, you're probably not going to like the results. Machine learning algorithms overall give you insight into your data; sometimes the insight they give you is that you really shouldn't be running a machine learning algorithm in production, that something much simpler is going on and you should do that instead. I want to say a couple of thanks: first to BSides DC and the staff here, to the Swiss Security Blog for creating the SSL Blacklist, and to Rapid7 Labs for providing the Project Sonar data. Any questions? What's up?

Audience: Were you able to do some statistical analysis of the data broadly, like whether certain certificate authorities show up more commonly than others?

Yes, I did something like that, except you can't really read it very well here. The main thing to realize is that a lot of our data is kind of old; we're dealing with data from 2013 on, so you've got to think in time slices. When I looked at this originally, Let's Encrypt barely registered, but mainly because Let's Encrypt hasn't been around that long. The one that shows up the most, by far, out of the total number of certificates is Comodo; they've had really cheap or free certificate programs available for a long time, which very much biases it. But 87% of the certificates we found were still self-signed. What's up?

Audience: Have you looked at correlating external information, for instance the age of the domain? [partly inaudible]

So the question was whether I've looked at external information correlated with the certificate information, like the age of the domain. I haven't; so far this was pretty much just taking the raw data in, but that's definitely a next step.

I think a lot of things, like the age of the domain versus the validity period, could potentially be interesting, but we were trying to keep it simple with those textual fields that are all exact-match, so we haven't dealt with any scalar values at all yet. What's up in the front? Oh yeah, sure, if I can find it. There we go.

Audience: [inaudible]

Yep. So actually, I'm mostly talking about using scripts. As part of the BroCon presentation we did last month, we had some example scripts, and those are available on our GitHub page as well. One of the things we're looking into is making that a little more flexible, because we ended up hardcoding a lot of the values into the scripts themselves. I'd really like to get to where I can load in a JSON file and have it available through the Bro package manager, so it's very easy to import, but there's really no universal format for these kinds of patterns as an indicator right now. I think we need to figure something out there, and my thought right now is JSON, because you have to be able to say this field exists, this field doesn't, and go from there. One in the back over there.

Audience: What machine learning framework did you use?

All of the clustering stuff was entirely custom, built on Python. For the logistic regression, we started off with some people working in, I believe, R, coming in through some Python packages, NumPy and others like that, to do data frames. I ended up just doing it the simple way and using AWS ML as a service a great deal, mainly because the processing power needed when I dumped in all the data was very intensive. Any others? No? Cool. All right, well, thanks for having us. [Applause] Oh, and if anybody wants to talk about this later, because I realized I've now chewed