
Connecting the Dots: Building a Data-Dump Search Engine

BSides London · 35:42 · 301 views · Published 2017-06 · Watch on YouTube ↗
Category: Technical
Style: Talk
About this talk
Arron Finnon explores the scale of third-party data breaches and describes building a searchable index of leaked datasets using Elasticsearch. The talk covers technical challenges in indexing hundreds of millions of records, the security implications of making breach data queryable, and what organizations can learn from their exposure in public dumps.
Original YouTube description
We've seen in 2016 the datapocalypse of third-party data breaches, with conservative estimates of around 1.5-3 billion people's information being leaked or dumped on the Internet. Yet these numbers somehow mask the very real impact of these breaches. Many companies and organisations have been exposed without ever really noticing. In 2016, 100% of the FTSE 100 had their email domain appear in third-party data breaches. This talk looks at what has happened, but more importantly it looks at the journey I took to build a data-dump search engine. Like many things in life, it's easier said than done. Why should you be concerned? Because this is passive OSINT that can reveal so much about a company or organisation without an attacker ever touching Google or their site. What is your current exposure to dumps and leaks online?
Transcript [en]

Should we just start? So, I'm not sure if anyone's compering this track, but we don't need them. My name's Finux, and the talk's entitled "Connecting the Dots". So, what is the talk about? Well, it's a little bit about big data. Yeah, I know. As we know, big data is defined as anything that crashes Excel. It's a little bit about third-party data breaches, and it's a lot about finding a way. What this talk should have been called is "Search Is Hard". I'm not sure if you read the abstract in the schedule, but this talk is about building a search engine for third-party data breaches.

Not a breach-notification platform, but an actual kind of Shodan for data dumps. So, when I started, I had this basic idea that I would take the Ashley Madison and Adult FriendFinder dumps and see if we could get a search-like experience out of them. Now, why wouldn't you index these dumps? And, like everything, it was a lot easier said than done. There are a few different ways you could go about doing something like this. The traditional way I've seen so far of building a search engine of data dumps would be to import the data dumps into whatever the native DB was and then put something in front of it.

So if it was a Postgres dump, you'd spin up Postgres, or MySQL, and so on and so forth, right? I took a different approach: what I did was modify the ELK stack. If you don't know what ELK is, it's Elasticsearch, Logstash, and Kibana. Interesting story: doing this project, I absolutely smashed Logstash into little pieces. When I first indexed Ashley Madison, trying to use Logstash to push it into Elasticsearch, it would fall over after about three days. It wouldn't have inserted all the records; well, maybe a third of the records, and then it would completely die on its arse.

So I managed to replace it with, of all things... you know it's bad when your tool is replaced by curl, right? I ended up using curl and a parser called jq, and boom: we went from taking three days and going nowhere to taking 30 minutes to insert 128 million records. So we had this basic methodology: take third-party data dumps, analyse them, edit them, turn them into JSON documents, insert them, realise that we'd screwed up and that there were some problems in the key/value fields, and then start again. A rinse-and-repeat sort of mentality.
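For the curious, here's a minimal sketch of that kind of bulk load, assuming a local, unauthenticated Elasticsearch node of the era; the speaker used curl and jq, but the same job in stdlib Python looks roughly like this (index and type names are illustrative):

    import json
    import urllib.request

    ES_BULK = "http://localhost:9200/_bulk"  # assumed local, no auth (see later!)

    def flush(lines):
        # The bulk API takes newline-delimited JSON, terminated by a newline.
        body = ("\n".join(lines) + "\n").encode("utf-8")
        req = urllib.request.Request(
            ES_BULK, data=body,
            headers={"Content-Type": "application/x-ndjson"})
        urllib.request.urlopen(req).read()

    def bulk_insert(records, index="dumps", doc_type="record", chunk=10000):
        buf = []
        for rec in records:
            # One action line, then the document itself ("_type" was still
            # required on Elasticsearch versions of that era).
            buf.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
            buf.append(json.dumps(rec))
            if len(buf) >= 2 * chunk:
                flush(buf)
                buf = []
        if buf:
            flush(buf)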

But what we were left with was actually pretty cool: a REST API doing key/value-pair search over multiple databases simultaneously, multiple data breaches. At the time we were able to look at four different leaked databases (Ashley Madison, Adult FriendFinder, that leaked porn list, and a few other things) and search them really, really fast. Now, you could say we were reinventing the wheel a little, because you could have done that with grep. But we were

getting results back in two or three seconds. Good luck doing that with grep, right? It also enabled us to do conditional types of searches: if this condition is met, then search on this, and so on. Really, the difference between search and filter is what's going on here. As I said earlier on, it speaks JSON. What I mean by this is that everything we stick into the search engine is JSON, because when it boils down to it, it's basically a document store with full-text search. So everything we stick in is JSON, and everything we get back is JSON too.
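As a sketch of what that search-versus-filter distinction looks like in the Elasticsearch query DSL (index and field names here are illustrative, not the tool's actual schema):

    query = {
        "query": {
            "bool": {
                # "must" clauses are scored, full-text search...
                "must": [{"match": {"email": "target@example.tld"}}],
                # ...while "filter" clauses are cheap yes/no conditions.
                "filter": [{"term": {"source": "mate1"}}],
            }
        }
    }
    # POST this body to /dumps/_search; with several breach indices you can
    # also hit /ashleymadison,adultfriendfinder,mate1/_search in one go.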

And if you've ever worked with JSON a little, you probably hate it, right? Pro tip: if you're not nesting JSON, you're not doing JSON. But don't undervalue or underestimate the benefit of having such a well-defined standard to work with when data sets come back. It enables you to script very, very easily and very, very quickly, because you know what to expect; it's not an unknown. So it's actually hugely cool, especially when you think that all of these data dumps came out in different formats. Adult FriendFinder was CSV and some Excel, Ashley Madison, I think, was MySQL, and a few other dumps came in

these strange formats. So, like I say, it's kind of cool to get JSON out of it, right? Although whenever I deal with hackers working with the tool we built for the first time, I spend just as much time explaining to them that their JSON is crap and that they have to get better at it. They go through this life cycle of about three weeks, and then they get how JSON works. It's quite strange. So, in the beginning we only had a couple of dumps, and it became quite clear that we needed more data. So that's what we did. It all sounds quite interesting so far. So when we started looking for

data, I'd be like: that looks like an interesting data dump, let's index it. So much stress, regardless of what you use, because any problem that you get in the web world, you're going to get in JSON too. So, when you've taken data out of a database and converted it into JSON (comment fields and stuff like this), those smileys in your dating-profile title may look cool, but as far as JSON's concerned you've got a screwy key/value pair, and whatever you're using to index it will remind you about this constantly. So yeah, this happened a lot. I'd be like, "Wow, that one's really cool, and this one's really cool," and then I would spend weeks trying to work out how we'd actually do it.
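A small sketch of the kind of breakage he means, assuming you build documents in Python: hand-rolled JSON falls over on exactly that profile text, while a real serialiser escapes it:

    import json

    title = 'I \u2764 "fun" profiles \U0001F61C'  # emoji and quotes, as found in dumps

    bad = '{"title": "' + title + '"}'    # naive concatenation: the unescaped
                                          # inner quotes make this invalid JSON
    good = json.dumps({"title": title})   # properly escaped, always valid

    json.loads(good)   # parses fine
    # json.loads(bad) raises json.JSONDecodeError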

cool." And then I would spend weeks trying to get through how we would do it. So after we had the first three initial uh databases, we I I spoke at Berlin sites about what we got up to and what we found. It was quite interesting and um people started pointing me to locations of other public dumps. Um and we added mate one. Hands up if you've heard of mate one. Come on, don't be shy. No one in this room has heard of mates one. Mate one. One person, right? Okay. So mate one was a dumping it was a dump uh a Dayton website. The difference between like Ashley Madison and adult friendfinder is they were quite those dumps were Ashley

Madison was for people looking for someone to cheat with, though I think most people were just speaking to bots, to be honest with you, if you've ever looked at the data dump. Mate1 was a little bit different: it was a traditional dating website. What was interesting about it, though, is that it gave us millions and millions of plain-text passwords. Lol, Mate1, why did you store your passwords in plain text? Well, I'll discuss later on whether that's the worst sin in the world. I mean, it is, but they're not alone. So it gave us millions and millions of plain-text passwords, which was quite

interesting, because we had Ashley Madison beforehand, and Ashley Madison used bcrypt for its password hashes; in some cases they totally screwed that up, which was quite interesting. Maybe someone can ask me in the questions later on what they did. Lol. And then we added LinkedIn afterwards, which is strange, because I probably should have started with LinkedIn, since that one was quite old at the time. But there was this mental change for me when that happened, because beforehand we had this dating-website dump search engine, so you were never guaranteed to find your clients in there, right? And it was always lol when you did.

But when we added LinkedIn, what ultimately happened is that it all of a sudden went from lol to finding people all the time, and then being able to tell whether they were also Ashley Madison or Adult FriendFinder or Mate1 users on top of that, if they'd used the same email addresses. And trust me, an unbelievable number of people did that. So yeah, this is kind of interesting for a few reasons. Do you remember what the problem with LinkedIn's data dump in 2012 was? Yay: they didn't salt their hashes. And that led to a more interesting turn later on.
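The connecting-the-dots pivot is a single query once everything sits behind one API; a sketch, with illustrative index and field names:

    # One term query fanned out across several breach indices at once;
    # each hit's _index tells you which other breach the address sits in.
    query = {"query": {"term": {"email": "someone@example.tld"}}}
    # GET /linkedin,ashleymadison,adultfriendfinder,mate1/_search
    # with the body above.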

As I also said earlier on, it was a non-dating-related data set. Beforehand the thing was quite specific and niche, and it grew from there. You'd have no idea how many people were in multiple data sets, if not all of them. I'm sure you all heard the headlines, when Ashley Madison happened, about the people inside the Ashley Madison data dump, but there were literally a huge number of corporations that lost their corporate identities inside dating websites. Huge. And when you're able to connect the dots later on, it's quite interesting. Funny story: Boeing got a huge amount of flak over

the Ashley Madison stuff. There were a lot of blog posts about it. They had 153 accounts in there. Yeah, great move, right? Let's sign up to have an affair with a Python bot, but let's use our Boeing email address to do it. If you think that's bad, they had one user who signed up 13 separate times with his Boeing account. Needless to say, he's probably got one of the most locked-down social media profiles you've ever seen in your life. I think they learned the lesson quite well. But this wasn't a one-off, right? This has carried on. Also with the Ashley Madison data set, if you

think that's unbelievable, you would be surprised how many BBC people are in the Ashley Madison data dump. Scott Mills denies that he was in there, saying that someone else put his email address in there. The thing about Ashley Madison is that it had this key field, is_valid, set to one or zero, where is_valid: 1 meant "I clicked the link to validate my email address". You can take a guess what Scott Mills's is_valid status was. But yeah, it's surprising that you'd have so many people do this, right? Because the thinking goes: I'll use my work email address so the people at home

don't know. Like, yeah, okay, but the people at work know. So, as I said, you would be hugely surprised. And it's quite an interesting problem, because Ashley Madison was a bit different as a data dump in how widespread its publicity was; the C-suite was interested and looked into it. But there are loads of dumps where users from corporations have used their email addresses in places you really shouldn't be doing it. So yeah: non-dating, very, very interesting. LinkedIn also made me really interested, because we'd started to accumulate a lot of plain-text passwords through data dumps, but we'd also started to accumulate

unsalted hashes. And that's kind of an interesting story, because when I picked on Mate1 earlier for the plain-text passwords, well, really, an unsalted hash isn't much better. Really not much better. And you'd be surprised how many dumps are unsalted SHA-1s or MD5s. It's against the norm to find something that's dynamically salted in a data dump; statically salted hashes and unsalted hashes are probably neck and neck, so dynamic salting is really unusual. So in the end, what we did was build an independent index of SHA-1 and MD5 hashes. If you've ever heard of the website Zoosk, it's

another dating website, right? They used MD5s, and across the board there were quite a few. So we built this independent index where a record would be the SHA-1, the MD5, and the plain text. So you can push a SHA-1 to the REST API, and the REST API comes back to you in about two seconds and says: oh, we know that password, and it's this one. It now has 1.3 billion different hashes that you can search. It's kind of cool, right? You can crack these hashes on any rig you've got, but I can guarantee you, if we've seen it before, we'll get it pretty quick. Like, seconds.
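A sketch of that record shape and how it's generated (field names illustrative; the hashing is standard-library):

    import hashlib

    def hash_record(plaintext):
        # One document per known password, keyed by its unsalted hashes.
        b = plaintext.encode("utf-8")
        return {
            "plain": plaintext,
            "md5": hashlib.md5(b).hexdigest(),
            "sha1": hashlib.sha1(b).hexdigest(),
        }

    # Index these documents, and a term query on "sha1" (or "md5") hands the
    # plain text back in seconds: rainbow tables as an Elasticsearch index.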

It's kind of like polishing off old ideas here, though, right? Think rainbow tables, but elastic tables. And it's hugely effective. What was also interesting was MySpace: (a) that MySpace is still a thing, and (b) that MySpace gave us, I can't remember what the total leak was, but I think it was in the 200 millions, maybe even 300 millions, and that was all SHA-1 as well. So all of a sudden that elastic-table lookup thing became really useful, because you had LinkedIn and MySpace and a few other

things, and it became incredibly effective. And, as I said, the index was created with every word list that we could possibly find, and for any plain text that we came across in public data dumps we also generated MD5s and SHA-1s. So now that independent index is, I would say, half generated from actual real-life passwords that have come out of data dumps, and the other half from the traditional word lists that we get from GitHub or wherever. There was a post recently about someone releasing a billion words of word lists, something like that. We took that word

list and ran it against our system. I think there were 256 million different words they'd released, about two months ago, and there were only 4 million of them we didn't already have. So, you know, we're doing quite well. And, as I said before, it's continually being updated. I think this is a massively interesting problem, because what it means is that you could have quite a good password, but if someone else has that password and they put it into a website, you don't know anything about it, right? Once that plain text leaks, we've got that password. So you can be screwed by other people's password reuse

and by other people's sucky password storage too. Or, as we like to say: another flawless victory. So, I learned quite a lot working on this project, to be honest with you. Firstly, I learned that Vim users are basically the vegans of the internet; anyone who knows a Vim user will know exactly what I'm talking about. I also learned that the internet has a lot of creative users. There are some users that will use their point-of-sale device's password as their own password. And to put this into context (I had someone speak to me about this recently), it wasn't the person at the cashier's desk that was using the POS

device's password. The last time we discovered this, it was the number two in the organisation. It's all fun and games until you have to password-reset your credit-card processing facilities. And this is just Adobe, right? You can see we've just done a search on the password-hint key field. But I suppose it could be a lot worse. I mean, who would use their VPN password, right? I stopped at this point, because we got VPN password, then bank password, then corporate network password, then work email password, and you're like: yo, stop using this stuff. And what's great about Adobe is that they really screwed up their hashes: they're static, so you can search for a hash, find other users in the Adobe database that have that same hash, and then you have their password-hint fields for the same password too. It's like a crossword puzzle, but instead of getting one clue, you get 20 or 30.
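A sketch of that pivot, with illustrative field names: because equal passwords produce equal stored values in the Adobe dump, one term query harvests everyone's hints for the same password:

    query = {
        "query": {"term": {"password_hash": "<some stored value>"}},
        "_source": ["email", "password_hint"],
    }
    # Every hit's password_hint is one more crossword clue
    # for the same underlying password.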

I ran into a few problems when I first started this, because you could sum up the amount of knowledge I had about search engines on the head of a pin. I absolutely knew nothing. And luckily enough, Elasticsearch is dangerous enough to let an idiot like me do something like this, but it's not smart enough to keep me out of danger later on.

So, when I first started, it went quite well: 128 million records, fast, nice, sweet, very, very cool. Then we got past a billion documents, and yo, that's when it got really freaky. Like, Elasticsearch was not my friend. It turned out I'd made some fundamental flaws in how I was doing things, not understanding the technology as well as I should have. Needless to say, when I redid the indexes, made sure we used analysis quite sparingly, worked on the sharding a little better, and so on and so forth, we managed to iron the problems out of using something like Elasticsearch.
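A sketch of the "use analysis sparingly" fix, in the mapping syntax of the Elasticsearch versions of the era (field names illustrative): exact-match fields get not_analyzed so ES doesn't build full-text structures for billions of values you'll only ever term-query:

    mapping = {
        "mappings": {
            "record": {
                "properties": {
                    # exact-match only: skip the analyzer entirely
                    "email":  {"type": "string", "index": "not_analyzed"},
                    "domain": {"type": "string", "index": "not_analyzed"},
                    # free text that genuinely needs full-text search
                    "password_hint": {"type": "string"},
                }
            }
        }
    }
    # PUT this body to /dumps when creating the index.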

Elasticsearch is very, very cool. Hands up if you've heard of Elasticsearch before, or used it. Right: if you are using Elasticsearch and you've not got a reverse proxy in front of it, you're an idiot. There is absolutely no security in this thing, period. They fixed all their security issues by not implementing any whatsoever. If you find an Elasticsearch deployment, you're going to find it on port 9200, and it's highly unlikely that it has any access control. Something like curl -XDELETE http://whatever-the-url-is:9200/_all is the equivalent of DROP TABLES, except there's no auth; it's just a request off the internet. And

who remembers when all those MongoDBs got busted last year? Someone wrote a worm, it pulled all the data down, encrypted it, and so on, and Elasticsearch got owned right on the back of it. I'd been saying for the six months beforehand: yo, just get nginx in front of it and it'll be fine. And then that happened. All of a sudden there's a lot of nginx out there. But you'd be surprised as well how many people are using Elasticsearch as a non-traditional database. Preliminary searches have shown that you're going to find corporations with,

you know, their internal directories for the company and stuff like that. You're going to find a whole host of data out there in Elasticsearch land, where some Node.js developer has written a front end that stores data in Elasticsearch. So you will find sensitive information out there stored in Elasticsearch. There is a security plugin, but they make you pay for it, and of course I have a bit of an issue with charging someone 1,200 bucks to make sure they don't get pwned. Write it secure, right? So once we got to this point: it was a big problem, we fixed it, it was cool. So, going back to

Ashley Madison. Things were never simple with Ashley Madison; it's actually quite a complex dump to process, because if anyone's ever viewed the Ashley Madison dump, it's spread over five databases. Yeah, basically five databases, not tables. Which is quite complex when you're trying to convert from this relational-database kind of schema into this document-store way of doing things. You get some liberty, but with a lot of freedom comes a lot of responsibility, and twice as many ways to shoot yourself in the foot. So yeah, when we did this it was quite interesting; we got

quite a few in, but we ran into problems, because what happens with this stuff is that you inherit the complexity of whatever the source database is. So if the source is five different databases hooked together, you've got that complexity in your standard data store, right? It's quite a pain in the arse. Also, if you've got DBAs in your remit, you should totally go and work with these people, because there's a lot of stuff happening in database land that's just whack. Also, and it took me a while to work this out, absolutely everybody stores their data differently.
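As a sketch of that relational-to-document flattening (all table and field names here are invented for illustration):

    def to_document(member, emails, payments):
        # Denormalise: one JSON document absorbs what several
        # joined tables expressed relationally.
        return {
            "member_id": member["id"],
            "handle": member["nickname"],
            "emails": [e["address"] for e in emails],
            "payments": [{"last4": p["card_last4"]} for p in payments],
        }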

And what I don't mean is that you're going to see Postgres and MySQL and so on and so forth; what you're going to find is that the field names are almost as unique as the individual who built the database. One man's handle is another man's nickname is another man's email, right? And of course, when you want to do searches on this stuff, you want your searches to be uniform. I want to search for email; I don't want to search for email and username and nickname and so on. So there's a period of normalisation that you need to do as well.
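A sketch of that normalisation pass; the mapping itself is illustrative and grows dump by dump:

    FIELD_MAP = {
        "handle": "username", "nickname": "username", "screen_name": "username",
        "mail": "email", "email_address": "email",
    }

    def normalize(record):
        # Rewrite per-dump field names to one canonical schema so a single
        # "email" or "username" search works across every index.
        return {FIELD_MAP.get(k, k): v for k, v in record.items()}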

And this is one of the really interesting lessons that we learned early on. When we first built the search engine, we did it the traditional way you'd think about searching: email, wildcard, example.tld. Of course that's the way you'd think about searching it. The problem is that wildcard searching is incredibly expensive on a server, and even more so on Elasticsearch. So we ran into huge problems once we got past a billion records; this was kicking our arse. Eventually I realised that what we really needed was a domain key. So we went through all the data sets, took the email addresses, and basically just used Python: split on the @ sign, take everything after the @, and we

built that as a key into the new data sets. Now you can just search domain.com, and because there's no wildcard searching in there, what you get is incredibly fast results. Really fast. Because the truth of the matter is, it's about understanding the type of searches that you want to do. It's not uncommon to think "I'd like to search for my corporate domain, but I don't want it to take 240 seconds to do it". So analysing and understanding how you're going to search the database later on, when you index it, is actually quite important.
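A sketch of that index-time enrichment (field names illustrative):

    def add_domain_key(record):
        # Precompute the domain once at index time, so a corporate-domain
        # search is a cheap term query instead of a leading-wildcard scan.
        email = record.get("email", "")
        if "@" in email:
            record["domain"] = email.rsplit("@", 1)[1].lower()
        return record

    # {"term": {"domain": "example.tld"}} then replaces
    # {"wildcard": {"email": "*@example.tld"}}.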

Pretty soon you end up with data about data, and a huge amount of it. When people tell you "oh, we just collect metadata, it's nothing to worry about": yo, it's all fun and games until you end up in someone's index, right? So yeah, we generated quite a lot of data just about the data we've collected; I'll talk about that in the conclusion section. That's the greatest face for security there's ever been, by the way. He actually looks like this in real life. So much more fun. So, you can't really talk about this project in some ways without addressing

the elephant in the room: yo, you just built a search engine of data dumps, don't you think that might be a little bit dangerous? Yeah, I do, in a lot of ways. The problem is, it's really a case of stable doors and bolted horses. So yes, we built a search engine that currently has about 56 different databases from third-party data breaches in it, which we can search simultaneously, and we do it pretty quickly. You can speak to it through a REST API, you can speak to it in Python, blah blah blah. A lot of people freak out about the privacy side of it, and I can totally understand that.

But the truth of the matter is, you could delete this tool today and nothing would change. The data is still out there. And I think what justifies this happened recently: have you heard of the mother of all dumps? Over the past couple of days, or maybe a month ago, Troy Hunt sent out a lot of emails to people about the mother of all dumps. This is a dump that was found in the wild that's an accumulation of lots of other dumps: it's got cracked Ashley Madison passwords in there, and Adult FriendFinder, and Zoosk, and LinkedIn, and so on. It's about 500

million records long. Someone took the time to accumulate half a billion records and crack the hashes along the way. And then what you do is grep for the hits that you're looking for. We found this in the wild. So we're a couple of steps ahead on this one, but it's exactly the same idea: those 500 million email addresses are their index, and grep is their curl. So it is happening; it is out there. I would also say that when it comes to third-party data breaches, a lot of organisations don't have a clear idea of what they've

actually lost. They'll know that they had some records in Ashley Madison, but they'll very rarely know that they were caught up in a phishing campaign, or that credentials were leaked on Pastebin, and so on. So it's kind of important: the data is already out there, and it's important that we get some visibility into what's actually been leaked, because 2016 was a particularly vicious year for data dumps. I think Troy had said that there were three billion accounts affected by data breaches in 2016. I wouldn't disagree with that; I was able

to verify it, too. My only issue is that those are the data dumps that Troy knows about; these are only the data dumps we're aware of. We have no idea what's in private hands, which is quite interesting, because it turned out that 2012 was a vintage year for pwning databases; we just didn't know about it until 2015. And as I said earlier on, ignoring the problem doesn't help: the data is out there, in the billions, and affecting people quite a lot. The other thing to worry about here is that searching leaked databases for credentials is the most passive type of OSINT that you can get. You download the database and grep it locally. Ain't

nobody anywhere gonna get an IOC to pick you up doing that, anytime, ever. So yeah, it's mostly passive. Mostly; there are times when you have to reach out. And this is quite worrying, because it is out there and it is being traded. Anyone that's looked at data breaches will tell you that there's a thriving market going on at the moment in data trading, and it's quite in-depth. I used to think that we'd failed at teaching web developers input validation. You know: we've got OWASP, and we still get pwned by SQLi. But I'm not going to blame the web developer; we've got organisations that have been

teaching people, and we're not winning. And now I realise, after working with their databases for a while, that we really suck at teaching DBAs basic security too. Go and look in any data dump; if you don't believe me, just go and check any SQL dump and look at the emails table. What you'll find is that there's absolutely no validation of those email addresses in there whatsoever. In nearly all the cases I've come across, you'll find aaa@aaa.xyz; you'll find all of this stuff going on in there. They're not sanitising. What's also interesting when you analyse data dumps is that you can see the enumeration before the SQLi happens. You watch accounts getting registered with usernames like ../../etc/passwd, all of this stuff going on, and then all of a sudden: end of file. You're like, "Ah, I know what happened there."
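A sketch of flagging those pre-attack probes in a dumped users table; the marker strings are illustrative:

    SUSPICIOUS = ("../", "etc/passwd", "' or ", "<script", "union select")

    def probes(usernames):
        # Flag "usernames" that are really traversal or injection payloads.
        return [u for u in usernames
                if any(marker in u.lower() for marker in SUSPICIOUS)]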

But yeah, this is a big problem. And it's interesting, because I did some looking at the Freedom Hosting dump. Have you guys heard of Freedom Hosting? Freedom Hosting was a dark-web service provider, quite shady if truth be told. And it's interesting, because all the problems that you have in the clear-web world all live inside Freedom Hosting too. So you have this huge mess of databases, hugely poorly deployed applications, and so on.

And what's even worse about it is that it's impossible to tell, just from the SQL, which of the applications was deployed in an enterprise and which was deployed on the dark web. It's indistinguishable which DBA did which; the quality is exactly the same, which is to say none. So do go and have a look and verify this; it's quite interesting. And then we wonder why people get pwned. Now, I've gone through my slides pretty quickly, but luckily enough I've got some time for just a couple of quick stats. So when I first started this, I had about

10 gigs of documents. The interesting thing when you convert data dumps to JSON is that you increase the raw size of the data. If you've got a CSV file with maybe three fields, three columns, when you put it into JSON you increase it by a factor of roughly three, because the field names, the keys, get replicated for each and every single record; you've increased the byte count by the key names times the number of rows. So, like I say, if you've got a three-column file that's 100 meg, the JSON is very likely to be 350-ish, and Elasticsearch will add in some metadata on top of that too.
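A toy illustration of that inflation: the same record as a CSV row and as a JSON document, with the key names repeated in every record:

    import json

    csv_row = "alice,alice@example.tld,hunter2\n"
    json_doc = json.dumps(
        {"username": "alice", "email": "alice@example.tld", "password": "hunter2"})

    print(len(csv_row), len(json_doc))
    # Roughly double the bytes for this row; across hundreds of millions of
    # rows the repeated keys dominate, hence 100 MB of CSV -> ~350 MB of JSON.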

So when we started, by the time we had Ashley Madison and Adult FriendFinder and a few other things, we had 10 gig. That was about a year and two months ago. Today it's a massive change: we now sit at 728 gigs of documents, and that's just a year of finding publicly available dumps. If you missed that, that's a factor of about 72 in 14 months of just actually looking. When I started this, I also had around 127 million unique documents. A document is just a record, and

inside that record is the associated data. So inside a record could be the email and the password and the username and so on and so forth. This isn't unique pieces of information; it's unique documents with unique pieces of information inside them. So when we started we had about 127 million, and now we have about 20 times more documents in the index: we're at 2.7 billion now. Now, to be fair, that number is also made up of the elastic-tables thing I was talking about earlier on. So I'll give you the useful number instead: the number of unique email addresses we've discovered dumped in third-party data breaches from 2016 to today, which is just under a billion. We don't validate them in the sense that we send emails, but we do ensure that each email is regex-compliant; if it's spurious, we just drop it. It's not worth the hassle.
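A sketch of that regex-compliant-but-not-verified filter; the pattern is a deliberately loose illustration, not an RFC-grade validator:

    import re

    EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

    def keep(record):
        # Drop records whose email isn't even shaped like an address;
        # no mail is ever sent to verify it.
        return bool(EMAIL_RE.match(record.get("email", "")))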

I only had 45 minutes, so I've left 10 minutes for questions. That's everything from me today, so if you've got any questions, now would be the time. Okay, thank you. [Applause]