
Internet dataset combinations for #ThreatHunting

BSidesSF · 2017 · 30:54 · 232 views · Published 2017-03
Category: Technical
Style: Talk
About this talk
Advanced Internet dataset combinations for #ThreatHunting & Attack Prediction. Have you ever had to look up an IP address, domain name, or URL to decide if it is a threat, and if it is targeting you? Do you ever need to analyze what malicious action it just took on your potentially compromised users? If yes, this session is for you! It's time to move beyond simple Whois & PDNS lookups and noisy threat feeds. Learn how to combine SSL cert facet data with tracking IDs like Google Analytics, ad trackers, and performance management trackers; host-pair relationships; and technology stack fingerprints to detect, verify, and stop your adversaries' next attacks.
Transcript [en]

sponsors very much: [unclear], the people at Fitbit, and HackerOne. And without further ado, here are Arian and Steve. Thank you, everybody, for coming today. We have a very dry presentation; I think you're all going to like it. I'm afraid to touch this. If it seems like we're being extra gentle, we found that one press disconnects it. So who are we? Steve? Hi, I'm Steve Ginty. I'm the product manager for PassiveTotal, one of the co-founders of that platform, and a security researcher, and this is great because I'm a beer aficionado and I get to do security and drink a lot of beer all at the same time. My name is Arian. I got started in [unclear] doing financial

applications, so I got to meet the Russians and the Chinese and the Secret Service, and that got me into forensics, which led me to here today. And most of what I found in this presentation I just found by accident; he's the smart guy here. So this presentation is really two sides of what's essentially the same coin. We've been studying how bad guys target and attack people: what are their tools, techniques, and procedures, and what data sets do bad people use when they go after people? And we've been looking at ways that we can leverage the same TTPs and data sets to help defend ourselves. We're going to

talk about that, and we're going to talk about turning that back around at the end on the bad guys. So the goals of this talk: we want you to walk away with an idea of how both red and blue teams can leverage some new data sets we've been working with, and your existing TTPs, to give you better visibility into who's attacking you, what they're setting up, and where they're going to attack you next, trying to turn the table back around on the bad guys. We learned some interesting lessons doing this. Starting about two years ago, I wasted a year talking about the data sets specifically, and after about a year of presenting on data sets, the same people who had the same

problems that I talked to weren't using the data sets that could help them. They weren't doing anything new, and they had the same problems, which led me to realize that the story about data sets is not only boring, but people can't act on it. So today we're actually going to talk about some newer data sets combined with traditional data sets most of you are probably already using in one form or another, but we're going to specifically show you how to combine them and show you some free tooling you can use in investigations. Because what we've found is, essentially, if we don't hand it to people prepackaged in a way they can use it easily or fit it into

their existing TTPs, people just don't have time to sit down and research and play around with a lot of these new data sets. So that's the goal: show you how to do it and give you an easy way to do it. You should be able to walk out of here and do everything we did today when we're done. So we're going to dive in here. Arian just showed a couple of the data sets that we like to use to track adversaries, and we want to impart the best way to go about using that information. So the whole point of this talk is to discuss how we can use, you know, web crawling and

infrastructure chaining to allow us to better hunt our adversaries, those people that are targeting us. And so how do we do this? Traditionally, and we've been doing this for a while, we're using interconnected data sets to find relationships between things. Domains connect to IPs; there's malware that associates to those; you know, you have C&C, command and control domains and IP addresses. All of these things interconnect with each other, and if we present that data in a way that's easy to interact with, you can start to surface new connections about your adversary, you can possibly substantiate your thoughts about why the adversary is targeting you or how they're targeting you, and you can hopefully start to group activity in a

way that makes it easier to defend your network. So what do we know about traditional data sets? You're going to see these little polygons, octagons, all over the place, but what we have are those traditional data sets in passive DNS and WHOIS. Organizations have been using these for a while. What we are saying is: let me look at this command and control domain or IP address, or let me look at this domain that was used to download malware to my network, and let me expand from that piece of data to understand the adversary at large. And so these are tried-and-true methods. We start with an IP address, we find additional domains that associate to that IP address over

time, and from there those domains associate to other IP addresses, and we're slowly chaining out information to bring context to our investigation. Additionally, WHOIS data allows us to take facets of a WHOIS record and quickly understand, you know, if this email address was used to register a domain, maybe it was used to register other domains. Maybe this phone number or address or specific piece of information in that record was used to register additional domains. We're basically building off of, you know, poor operational security and human error. When people start to register domains, they're possibly going to do the same thing over and over again, because humans are creatures of habit, and so we can use that to identify additional

parts of the actor's infrastructure and, again, use that to defend our network. So like I said, this is traditional data. We know this; we expand beyond this. How many people here are incident response or blue team that are actively doing investigations today? How many folks here are wanting to learn how to do that, or more of it? How many folks here are, like, red team, pentester, reverse engineer? Okay, we've got an interesting mix, fantastic. So building off that traditional methodology of chaining infrastructure, what do we want to do? We want to bring that tried-and-true method to other data sets. The reason we want to do this is we found that actors and attackers are

evolving. They know how we as defenders are tracking them these days, and so they're using that knowledge to better protect themselves from our investigations. They're responding to our methodologies: they're privacy-protecting WHOIS data, and they're using fully segmented infrastructure so that there's no overlap in passive DNS data. They're increasing their operational security in ways where, you know, five years ago they wouldn't care about it as much. Now people are doing things to stop us from identifying them, and new data sets and tactics are needed in order to really expand our visibility. And on the right you start to see we're bringing in other data sets. We're bringing in SSL certificates, facets of those certificates, trackers, host pairs, all

these things that I'm going to talk about in the coming slides, and all of these data sets are based off of web crawling. So: crawling to increase context. Now, the idea here is that technology today allows us to crawl the internet at scale and database that information to gain understanding of our adversaries and also of those that are targeted by those adversaries. So we have the ability to do this, database that information, and not only understand what actors are doing today, but go back, once we understand new tactics, techniques, and procedures, and look at our data to say: how does this new bit of information layer over the data we currently have? It

allows us more insight. What we're really trying to do, again, is to use actors' mistakes, exploit poor OPSEC, and look at new infrastructure connections to really understand our adversary. So this is going to be really rapid fire, but Arian mentioned earlier in the talk that he talked about data sets a lot and people don't always understand the data sets we're talking about, and so I wanted to walk through what's in a modern web page very quickly, and what we get from a crawl, so you understand how we view this data. There are a lot of different things in the modern web page. Using a crawler, we can understand, you know, messages. This may be

developer comments, this may be error messages; these things come out as we crawl a web page. We're going to understand links. Most pages are made up of other resources that they're pulling from, and so those links are part of the makeup of that web page. This allows us to know where content is being called from. We're going to understand dependent requests: what other parts of the web are organizations pulling from to make the page you're viewing today? It's not like websites are static anymore, so they're actively pulling things on demand to make that page. Those types of calls can give us insight into whether something malicious is taking place here. Cookies: for both a ton of legitimate

and nefarious purposes, organizations drop cookies as you view a web page. It's normally to track and understand who's visiting their page. Actors do the same things these days. They're dropping cookies to understand who's already visited the page, to profile their user base, to say: okay, have I fingerprinted this person yet? Have they seen this URL? Is this a unique visitor? And we can use this understanding of cookies that are being dropped to do the same thing we do with WHOIS data: say, show me everywhere else we've seen this cookie, and what sites it's associated with. So it's another connection point. Additionally, you see the name. We've seen a lot of scenarios

where the cookie name is unique or fat-fingered, and so when an actor is coding they'll enter a cookie name that isn't common. Instead of PHPSESSID, it's PNP, and so we can use that to identify unique attributes of an attack campaign. Headers: this is just, you know, the sequence of what's going on. How do I build the page? What do I need to know to make that page appear? And then finally, the response and document object model. As we crawl a page, we pull back that page's content, and this can be indexed for unique attributes of the page. There's a lot going on with building a web page, including that everything defaults to

SSL these days, so we now have certificates associated with pages and IP addresses. So it's a lot of different information built from a web page. How do we operationalize this data in order to use these same methodologies of infrastructure chaining? I wanted to walk through that, because there's a bunch of different ways we can do this. We don't operationalize all these data sets, but very quickly you can see we can increase our visibility, because we can take that link, sequence, and dependent request data and do something very similar to what we do with WHOIS data. We can take each of those and create what we call a host pair. That host pair now

tells us site X redirected to site Y. What was the cause of that redirect? Maybe it was a top-level redirect. And we can also see that we've seen multiple domains associated with that redirection chain, so site X redirects to site Y, but it also redirects to site Z as well. So now we have an expanded understanding of that attack campaign, and we can use our traditional methods like passive DNS data to find IP addresses that associate with that. Additionally, we can use the response and the document object model to understand what we call trackers. These are things like Google Analytics codes, Clicky IDs, New Relic, Facebook, Twitter, you name it. Everybody has some type of

unique ID that they're pushing to a web page, and we can use that to understand multiple domains that are using that same ID. Usually those are legitimate, but what we found is actors are also inserting JavaScript that looks like an analytics code, which we can use to track compromised websites. SSL certificates: we can take a certificate that lives on an IP address and we can quickly say, what's the SHA-1 of that? Has that SHA-1 been seen on any other IP address? Is there a unique facet inside that certificate that allows us to understand additional context or find additional certificates? And we just keep building out our understanding of an adversary's network. So that was a lot of "we can do this."
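The host-pair and shared-certificate pivots described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the `.example` hosts, the documentation-range IP addresses, and the shortened SHA-1 strings are all invented here, and the talk does not say how the speakers' tooling stores crawl data.

```python
# Hypothetical crawl observations: the hosts, IPs, and SHA-1 values below are
# invented placeholders, not data from the talk.
CRAWL = [
    {"host": "shop.example", "ip": "192.0.2.10", "cert_sha1": "aa11",
     "requests": ["cdn.example", "actor-js.example"]},
    {"host": "blog.example", "ip": "192.0.2.20", "cert_sha1": "bb22",
     "requests": ["actor-js.example"]},
    {"host": "actor-js.example", "ip": "198.51.100.5", "cert_sha1": "cc33",
     "requests": []},
    {"host": "actor-c2.example", "ip": "198.51.100.9", "cert_sha1": "cc33",
     "requests": []},
]


def host_pairs(records):
    """Return the set of (parent, child) pairs seen during the crawl."""
    return {(rec["host"], child) for rec in records for child in rec["requests"]}


def parents_of(records, child):
    """List every crawled host that pulled content from, or redirected to, child."""
    return sorted(parent for parent, c in host_pairs(records) if c == child)


def ips_sharing_cert(records, sha1):
    """List every IP address observed presenting the certificate with this SHA-1."""
    return sorted({rec["ip"] for rec in records if rec["cert_sha1"] == sha1})


# Pivot 1: which sites reference the suspicious host?
print(parents_of(CRAWL, "actor-js.example"))
# Pivot 2: which other IPs present the same certificate?
print(ips_sharing_cert(CRAWL, "cc33"))
```

Each pivot is just an inverted lookup: start from one known-bad value (a child host, a certificate hash) and ask which other observations share it.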

Let's prove that it actually works. We're going to start with the Magecart example. This is a 2016 cybercrime campaign. It was key-logging malware that was targeting e-commerce sites. It was injecting JavaScript code into those sites, and it was basically stealing the payment card information as a user was typing it in to check out of the website. You can see that this is a crawl sequence from a crawler that shows Faber, which is a book company in the UK, redirecting to mage online net and jquery-cdn.top. It shouldn't be doing that. Both of those are actor-owned domains, and that is how they were

redirecting to capture the payment card information being entered. So what do we know with traditional insight into data? We start with the Faber URL, and we see mage online net. It has some WHOIS information and some IP addresses associated with it. Those IP addresses associate with thousands of domains, so it's really hard for an analyst to dig in and understand if those are malicious or not. But we have some WHOIS information that gives us some additional domains: angular.club and, let me look at the screen, magento-cdn.top. Additionally, we have some other overlaps in hosting providers, but that's it. Our investigation stops right there. WHOIS provided some other domains we

could block against, but we don't have full insight into the attack campaign. If we enhance our visibility with crawl data, what we find is not only was Faber redirecting to mage online net, but also 45 other compromised sites were redirecting to that actor-owned domain. Additionally, as we start to pivot off of our other known domains, like that angular.club, we see 334 other websites that were also infected and redirecting to the actor-owned domain. And within a handful of pivots, we've understood that this isn't just a single incident; this is a significant amount of infrastructure that's compromised and redirecting to those four actor domains that we knew of. So we really enhanced our visibility into

what was going on with this attack campaign by using that crawl data. The second example I want to walk through is the DNC attack. [unclear] Everybody's been talking about this; there have been a lot of good publications from CrowdStrike and from ThreatConnect. But I think the interesting example here is, you know, we have an attack targeting the DNC that's been attributed to Russia, Fancy Bear, or whatever name you want to call the group at this point. If we look at the original CrowdStrike report, we have a couple of C&C IP addresses, a couple of external implants for command and control, and there wasn't too much you could find out if you looked at that

infrastructure to begin with. We find the IPs that they listed associate with a handful of domains; only this one was interesting, and that's it. We really have no other context. If we start to look at crawl data and use our certificate repository to expand our knowledge of this, what we find is there are SSL certificates associated with these IP addresses. Most of them are only associated with a single IP address, but if we look to the far right, we see that large chunk of additional IPs. That's a huge group of possible additional external IP addresses that you identify based off a shared certificate. So in a handful of pivots, using these new data sets and, you know,

tried-and-true methodologies of infrastructure chaining, we're able to really dig in and understand a much larger aspect of the attack campaign. Now, since Arian's a very daring individual, he's going to do a live demo of how we would walk through some of this. I don't even have any booze yet. Not cool. All right, so we're going to start with a simple example, and I'm basically going to walk you through this step by step. If you were going to do this yourself, these are exactly the steps and the connections you'd make, and along the way I'm going to show you some mistakes I made. Was anybody at DerbyCon last year? Okay, so I gave an example I'm going to use

here at DerbyCon, and I totally missed a huge indicator that explained a whole bunch of things I didn't understand. I'm going to show you today. But this one, this is a basic phish, and I didn't screenshot it. It's gone now; this is just passive DNS data we're looking at, and so it's no longer live. But this was a Google login form, and this guy, if you actually look at the links in the DOM, he just sources in Google's legitimate assets and puts up login forms. And so we got this Google login form. You know, what does that mean to me? Do I just block it and

move on? You know, I got an alert, my user went there, and some endpoint or IDS thinks something bad might have happened, or it matched up with a blacklist. Those things can be pretty noisy, blacklists and threat intel. So I want to ask some questions about how dangerous this thing is. Do I really need to care? Do I block it and move on? Is somebody targeting me? Do I need to spin up a whole team and really investigate this? Do I need to put monitors on this? So we're going to look through this data really quick. First thing is: is this Google? Is this legitimate?

Well, I'm going to come in here and I'm going to look at the WHOIS data, and this guy conveniently leaves his WHOIS data, or some WHOIS data, in here that we can pivot off of. So looking at the WHOIS records, I'm just going to jump over here into the email, and this gives us a pretty clear picture. Let me make this bigger. Can you read that? So basically, when you jump into this guy's email, he's got like 25, 35 domains registered, and they all follow this pattern of Facebook and Google fake profile and fake login pages for the various products. This guy seems to have a niche, and his niche is fake Google and Facebook profiles, and that's

all he does. So I'm not really worried. It doesn't look very targeted, it doesn't look very legitimate. I'm just going to drop those into something where I can block my users from going to it, and move on. You look through the rest of his WHOIS; it's not that interesting. When we look at the link data, which is what the host pairing is, we can see they're sourcing in Google assets. Now, where this gets interesting, though: when I started saying, like, could this be Google, I jumped over, and if you look at the actual infrastructure underneath it, it's pretty amusing, because all his Google and Facebook stuff runs on ASP.NET and an older

version of IIS 7, which I'm fairly confident Google and Facebook aren't actively using today. So there we basically walked in and said: hey, he's a phisher. I don't really think he's targeting me. Is it legitimate? That's definitely not legitimate; I mean, he's copying Unix vendors on .NET stuff. So I'm going to block it and move on. That's the first example; I promise you the rest are more interesting. All right, this example is a pretty interesting one too. Unfortunately, something just changed on it in the last week or so, so there's a point where I usually get a good laugh that I've lost. I'll tell you about it when we get to

where it should have been funny. But essentially, these folks were active for a while dropping links on social media. And a lot of social media threats are blended threats, so they'll link back to additional phishing attacks, or they'll try to link you to a fake mobile app to download, or a live human will try to social engineer you. So you'll see these pop up. They're usually sub-24-hour attacks, like fake customer service: you know, you use the hashtag or at-mention Capital One, and then they'll drop a link, you know, "here's our customer service guarantee, come talk to us today." These folks we're going to look at here are just phishers, but let's see

what we can learn about them. So we're going to come down here, I'm going to look at the WHOIS, and unfortunately this one's a little difficult to extract much from, because they've anonymized the WHOIS. This actor actually just started anonymizing their WHOIS at the start of the year. Prior to this, this actor actively advertised a set of WHOIS that we could use to verify it was a bad actor. So let's go a little farther. So here I went back and pulled historical WHOIS. It's important when you're doing investigations to have historical data, especially as adversaries are getting smarter about hiding themselves and covering their tracks. So you want a good source of

historical data. And the actor I have here, his persona is Hildegard Gruner, and so we can pivot off that. And if we look at Hildegard, he owns a lot of stuff. Actually, if we pivot off all the facets of his WHOIS data, we'll find roughly 10,000 domains, somewhere between five and ten thousand. This guy owns a lot of domains, or has historically. When you first look at this, though, this McAfee and Bitcoin stuff is all brand new; he's into something new. But you look at it, and he's got a lot of junk domains and weird typosquatting stuff. It looks like he's been into everything.
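The historical WHOIS pivot just described can be sketched as follows. Everything in the sample data is a made-up placeholder (the `.example` domains, the persona name, the phone numbers), and the list of privacy-proxy values to ignore is an assumption added for illustration; the talk only says to pivot off every facet of the historical records.

```python
from collections import defaultdict

# Hypothetical historical WHOIS snapshots; all values are invented.
WHOIS_HISTORY = [
    {"domain": "a.example", "seen": "2015-03-01", "email": "reg@mail.example",
     "name": "Example Persona", "phone": "+1.5550100"},
    {"domain": "a.example", "seen": "2017-01-10", "email": "privacy@proxy.example",
     "name": "REDACTED", "phone": ""},
    {"domain": "b.example", "seen": "2016-06-20", "email": "reg@mail.example",
     "name": "Other Name", "phone": "+1.5550199"},
    {"domain": "c.example", "seen": "2014-11-05", "email": "other@mail.example",
     "name": "Example Persona", "phone": ""},
    {"domain": "d.example", "seen": "2017-02-02", "email": "privacy@proxy.example",
     "name": "REDACTED", "phone": ""},
]

FACETS = ("email", "name", "phone")
# Assumed proxy values: shared by unrelated registrants, so useless as pivots.
IGNORED_VALUES = {"privacy@proxy.example", "redacted"}


def facet_values(record):
    """Yield (facet, normalized value) pairs that are usable as pivot keys."""
    for facet in FACETS:
        value = record.get(facet, "").strip().lower()
        if value and value not in IGNORED_VALUES:
            yield facet, value


def related_domains(records, domain):
    """Find other domains sharing any WHOIS facet with any snapshot of domain."""
    index = defaultdict(set)
    for rec in records:
        for key in facet_values(rec):
            index[key].add(rec["domain"])
    related = set()
    for rec in records:
        if rec["domain"] == domain:
            for key in facet_values(rec):
                related |= index[key]
    related.discard(domain)
    return sorted(related)


print(related_domains(WHOIS_HISTORY, "a.example"))
```

Because the 2015 snapshot of `a.example` is kept alongside the anonymized 2017 one, the pivot still reaches `b.example` and `c.example`, while `d.example`, which only shares the proxy values, is left out.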

If you start sorting the data historically, though, you can see that as an adversary he matured over time. It looks like he got into some basic typosquatting and very basic stuff, and then as the years go on, you can see his attacks start to advance and target specific named businesses, and then they start to get into more campaigns. So I already know who Hildegard is. You know, I know this isn't legit, but are there other data sets I can use to verify that my employees or my customers aren't going to what is a legitimate site? I don't want to block this thing if it's legitimate. So I thought, well, let's actually look at the

technology stack running on this alleged Capital One bill pay site. So we look at the technology stack: it's running WordPress, it's running IIS 7.5. It's actually not running a lot of stuff here. It's only got 106 unique technology components. So I thought, let's compare this to the real one. So this is capitalone.com, and if we look at this guy here, I'll just pull it up to give you an idea of the numbers. The legitimate default capitalone.com site has over 3,600 different web components on it, where the fake one only had between one and three hundred. So they're not doing a very good job of impersonating them, as well, as the

legitimate Capital One links to everyone on the planet. But what was more amusing about this when I dug in: when you come back here to the adversary's site and look through their components, they actually keep all their components patched and up to date. So if you look through the WordPress, the jQuery, even the PHP versions, the phishing site is relatively up to date. It's usually within a week or two of new patches and releases. When you go through the Capital One components, what's interesting is the legitimate Capital One site does not. When you start looking through their stuff, you'll find old versions. There's ASP.NET 2.0 still running in production

here. There's Heartbleed running in production on their Alexa, which is their Amazon Alexa integration for Capital One. If you're using that, it's vulnerable to being sniffed all day long. They used to have some more components, but I talked about this online once, and then I noticed that the next week the components I talked about got patched. So maybe if this shows up in video online, they'll fix their OpenSSL and other issues too. So again, a quick way we can tell the adversary: we've got historical WHOIS, and we just look at the fingerprints of the technology stack, and it tells us a lot about what's going on. The adversary's more worried about their infrastructure being compromised. So we

got that one. We're going to move on, and this is PNC. I've got PNC Bank being targeted for phishing in a similar way, and this is the example that I used at DerbyCon, where I totally missed a couple of really important things. So we come in here and we've got a similar kind of problem. This actor, instead of anonymizing, is just using generic bogus information for their WHOIS today. But if we go back and look historically... The question I'm asking again is: hey, is this a real attack? Is it targeting me? How targeted? How worried do I need to be? Do I just block it and move

on and not have to sweat it? We're going to come and look here again, and if we look at the historical WHOIS, we see this actor, Ethan Roads. So let's go look and see what Ethan Roads has and what that can tell us. We're just going to query some WHOIS data. Ethan Roads does own a lot of stuff, a little over a hundred domains, and it looks pretty random. I've got some Samsung Kies driver download site, I've got a traffic quota, Mr. Rooter. It doesn't look very targeted, doesn't look very threatening. You know, I'm not getting anything here that makes my spidey senses tingle. Let's dig around a little more and see what else

we find. So the WHOIS wasn't that interesting, but he's actually using Google Analytics, which uses Urchin cookies, trackers. Let me make this a little bigger so folks can see. He's got a Google Analytics ID and a tracker here, and let's see what happens if we dig into those. And if we dig into those, we actually get a whole new story. If we dig into his Google Analytics account and look at its Urchin tracking cookies, all we find are PNC banking domains, and a whole bunch of them. So this is definitely a targeted actor that's coming after PNC Bank customers and employees, running very focused attacks. Let's see if we can learn a little bit

more, though. I'll really dig in here. So what I did here, I guess I should tell you as I do this: you've got two different things here. The one with all of the dash integers is the actual Urchin cookie, and then that's the account ID. So I want to drop into the actual cookie here. So this is a campaign; the 51 is the campaign. So now I've got two sites in here, and I've got a new site that I haven't seen before, that never showed up in my alerts. Let's go take a look. But before I do that: I got kind of curious and started brute forcing all his campaigns, and when I got

to campaign 20 (so I went back from 51; this guy, as of the end of last year, runs 51 campaigns against PNC Bank), in campaign 20 I find a new domain that led me to some new info. So as we keep digging down this rabbit hole, wow, look what we found. Interestingly enough, that domain here, bizarrely enough, is owned by Hildegard Gruner, the guy we found in the Capital One banking attack that looked like a random attacker. What is he doing in the middle of this guy's Google Analytics campaign? Well, if we jump in a little deeper here, if we go back, one of these domains has two, and this is very rare, you won't see this very often, two

accounts, two sets of tracking cookies. So basically both actors, Ethan Roads and Hildegard Gruner (I'm going to have to skip the rest of this because we're almost out of time), occasionally run joint campaigns. The Ethan Roads persona tends to run the more focused ones on banking, and Hildegard runs the broader campaigns, but we can start unraveling the nest of everything they're doing and how they're targeting us by tracking all these accounts on Facebook. So for the final one, I'll give you this; this was another one I found recently. This is a fake Amazon login page. This actor creates domains that look and seem like Amazon and puts them all over the web. They host

them in Amazon infrastructure. You know, what can I learn about this really quickly, digging in? Well, I can't really learn much, because this actor actually uses legitimate information for WHOIS registration, for other people. In some cases, what we think is these are actually compromised sites, so they're legitimate sites that are compromised that they put their stuff on and are piggybacking on. So when you go to the WHOIS, this is all going to lead you to legitimate businesses. You're going to go, okay, well, this doesn't look like a bad actor; this is all legitimate-looking stuff. If we dig in and we look at the WHOIS, we're going to find all these

Alpine audio related businesses. I'm going to blow through all this. What was this one? A phone number. All right, final one I'm going to show you, because our time's up: a Facebook ID was left embedded in the page. Now that's interesting. What happens when we query the web for that Facebook ID? We find all this Amazon phishing and account stuff, stuff that's already been flagged as malicious. So it's an Amazon phisher and a malware distributor. We don't have time to go deeper, but we captured their Facebook page, and they were actually running a Facebook account advertising their services on Facebook. Now, Facebook's now nuked their ID, but these folks were

running a profile and running ads on Facebook advertising the infrastructure for other people to rent or purchase and use. So that's how you can combine these new data sets to quickly go from confusion or misunderstanding, or thinking something's benign, to connecting all the dots. Thank you very much. Thanks very much, Arian. Thanks very much, Steve. Here, we've got a little gift for you guys from our friends at Fitbit. Thank you, guys, and thank you very much from BSidesSF. We really appreciate it.
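For readers who want to try the tracker pivot from the PNC example, here is a minimal Python sketch. It assumes classic Google Analytics IDs of the form `UA-<account>-<number>`, where the trailing number is what the speaker calls the campaign. The page snippets, host names, and account numbers are invented, and this is not the speakers' tooling.

```python
import re
from collections import defaultdict

# Classic Google Analytics ID: "UA-", an account number, then a trailing index.
UA_PATTERN = re.compile(r"\bUA-(\d{4,10})-(\d{1,4})\b")

# Hypothetical crawled page bodies keyed by host; all values are invented.
PAGES = {
    "login-bank.example": "<script>ga('create', 'UA-1234567-51', 'auto');</script>",
    "secure-bank.example": "<script>ga('create', 'UA-1234567-51', 'auto');</script>",
    "bank-alerts.example": (
        "<script>ga('create', 'UA-1234567-20', 'auto');"
        "ga('create', 'UA-7654321-3', 'auto', 'second');</script>"
    ),
    "unrelated.example": "<p>no tracker here</p>",
}


def trackers_by_account(pages):
    """Map account number -> trailing index -> set of hosts embedding that ID."""
    accounts = defaultdict(lambda: defaultdict(set))
    for host, html in pages.items():
        for account, index in UA_PATTERN.findall(html):
            accounts[account][int(index)].add(host)
    return accounts


def hosts_with_multiple_accounts(pages):
    """List hosts whose page embeds IDs from more than one account."""
    return sorted(
        host for host, html in pages.items()
        if len({account for account, _ in UA_PATTERN.findall(html)}) > 1
    )


accounts = trackers_by_account(PAGES)
for index in sorted(accounts["1234567"]):
    print(index, sorted(accounts["1234567"][index]))
print(hosts_with_multiple_accounts(PAGES))
```

Grouping by account first, then by the trailing number, mirrors the walk from campaign 51 back to campaign 20 in the demo, and the second helper surfaces the rare page that carries two accounts at once.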