
There we go — good morning, everyone. Okay, y'all are even livelier than the first bunch. So next up we have Jordan, and before his presentation I just want to give a shout-out to some more of our sponsors: the NSA, Exabeam, Accenture, Open Security at the titanium level, CyberSecJobs, Denim Group, Alamo ISSA, and Landmark Solutions. So with that, we'll kick it over to Jordan. One thing I forgot to add: if you're going to ask questions at the end, please raise your hand and one of us will pass you a mic so it can get captured on the recording. Thank you, guys.
Okay, all right, we'll go ahead and get started. Welcome, everybody — it's great to see everyone here. I think this is my third BSides to come to here in San Antonio, and it seems like it gets so much bigger every single year. Whenever I start a talk, especially here at BSides, I always want to call out the volunteers, because this conference is really made and run by them and they do such a great job — let's give them a hand, y'all. So today we're going to be talking about unmasking data leaks, and there's another kind of sub-theme we're going to go into that I'll talk about in just a moment. My name is Jordan, and I'm a tech lead on
Duo Labs, which is the research arm of Duo Security — and, I have to say now, part of Cisco. I also spend a lot of time developing open source software, so some of y'all may have seen stuff like Gophish; I'm the maintainer for that. I really enjoy spending my time developing software and getting it out to the community to use. Now, I mentioned data leaks, but there are actually different goals that I want everyone to take away from today's talk. Today we're going to talk more broadly about security research. Whenever I got started in security research, it was really intimidating — it always felt like something for hackers who were better than me, who were more advanced than me.
They're the ones who are on the bleeding edge solving these problems. But the more I got plugged into it, the more I realized that's not the case. Security research is very accessible; it has a common pattern, and by following that pattern we can make really big advances and come up with really great solutions. That's what we're going to show today, using data leaks as an example. But first I want to give you a quick flashback. At LASCON in 2016 I gave a talk that was almost what you would consider the results of the process we're going to talk about today. It gave an overview of data leaks in general,
looking across different types of databases: how many are exposed, what kinds of data are exposed, and what are we going to do about it. But today we're going to look at more of the how, not just the what, and that's really beneficial because we can see that process end to end. So here's our agenda for today: we'll really briefly touch on high-level goals of security research, and then we'll take a deep dive into data leaks specifically — the problem, how we can measure it, and then how we can try to solve it. And I'm a really brave person today, because we're going to do some live
coding, which is like the scariest thing you can do as a speaker. But my goal is to show you how accessible these types of research projects can be by doing it here on stage, with everyone staring at me, and praying to God it works. That's what we'll do towards the end of the talk, and then we'll wrap up with prevention at the very end. So this is the motto for my team — disrupt, de-risk, and democratize — the core principles we hold as a research team, and they apply to research more broadly. Our goal is to try to disrupt the current state of the art when it comes to security: measuring problems,
solving problems, and identifying new problems. On our applied research side we try to de-risk a lot of things for our own business — identify trends, identify new directions the industry is going, and explore those, saying, hey, there are dragons over here, or this looks like it could be a good solution. And finally, one of the things I hold most important is the idea of democratizing security. Our goal whenever we do research is not to stand up and say, look at how smart we are, look at what we figured out. It's the opposite: it's not the end of a conversation, it's the start of one. We want to say, here's how we did this,
here's where we got, and we're giving you all the tools, all the research, all the findings, because we want you to take it to the next step. It's about sharing it out with the community and solving problems together. And these are the three milestones when it comes to security research. You first want to identify a problem — figure out what you want to study. Then you explore it further: is there something here that we can explore, measure, solve? This could be vulnerabilities that we find. And then at the very end — one of the most important — is recommending solutions. We're not there just to point at a problem and say, look, that thing is
broken. It's very much to say, that's broken, and here's how we can fix it together. So let's take a look at data leaks using this methodology — let's identify a problem first. It wasn't too long ago that we started seeing headlines like this pop up. I'm sure some of y'all may have seen these; I'm not going to read through each one. There are three, and you'll notice the numbers are not in the thousands, they're in the millions, which shows how significant it is when these things happen — there are large amounts of sensitive data being exposed. Those weren't the only instances of that, far from it. And here's the last one; the important thing to notice is that date. That was yesterday. So this hasn't been
solved yet. Those aren't all the headlines I could find, and those are just the ones that made headlines. So data leaks in general: it's not a solved problem, and it's very much a big problem. So we can say we found a problem — now let's take it to the next step. Let's hone in a little bit and classify what it is we're looking at. When we talk about open databases and data leaks, what are we talking about? Some people may have heard of MongoDB, some people may have heard of Elasticsearch — these are different new and up-and-coming data store technologies. I say data store because I have to be careful:
there are likely some purists in the crowd who would say Elasticsearch is not a database, it's a document store. It's my talk; I can call it a database. So we can say Elasticsearch, we can say CouchDB — sorry, not CockroachDB, though that's probably one coming out soon — CouchDB, rsync, S3 buckets (not a traditional database, much more of a blob or file store, but I'm calling it a database), Redis, Memcached, all of your key-value stores. There are so many opportunities for this data to be left exposed online, and it all stems from a fundamental problem, right? It's misconfigurations, where you don't expect the data to be found, or you don't
know that it's even out there, and then people are able to stumble upon it, as we'll see in just a moment. So now let's explore the problem a little bit — let's talk about how we can actually find these. We see these headlines; how are researchers going out and doing this? But first we have to remember, whenever we explore this, we have to be careful, because our goal is remediation. My goal with this talk is not to go make a new headline; my goal is to measure the problem and then reach out to the people affected by it to try to solve it. The headlines that I like to make are 'a researcher
measured the problem and offered solutions,' not 'here's just another instance of data leaks being found.' Okay, so we're going to keep that in mind as we do the research, and I'll talk about where that comes into play. So here's kind of the 101 on finding data leaks. We have a scanner that we're going to build over on the left, and we have all these services on the internet — all the IP addresses available out there — and right now we don't really know what's out there. We could start with a simple port scan. For this talk, let's just use Elasticsearch as our example. We know that Elasticsearch listens on port
9200 — we can look that up — so that's the port we're going to scan for, and we're just going to scan the entire internet. We start with the first service; it says there's nothing there. That's fine, we'll move on to the next one. And all of a sudden we have a hit: there's a valid Elasticsearch instance listening on this host. The next step is to ask ourselves, okay, what types of data are there? Is this something that is meant to be left exposed, or is this a data leak where we need to contact the owner and let them know it should be remediated? So with that scan, this is what
we came out with — it looks like there are around seven Elasticsearch instances in our example — and we can move on to the next step, which is to ask what kind of data is here. I say 'kind of data' very specifically, because my goal is not to just go grab all the sensitive information that we can. It's very much the opposite: what's the least amount I can do to say there's probably a problem, so we can call it out and let the owner know? That's the area of ethics I like to live in — trying to be as privacy-centric as possible while solving the problem. Okay,
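The scanning model just described — walk the address space, probe port 9200, and see who answers — reduces at its core to a TCP connect per host. Here's a minimal sketch; a real internet-wide scan would use a purpose-built scanner such as masscan or ZMap rather than per-host sockets, and you should only probe hosts you're authorized to test:

```python
import socket

ELASTICSEARCH_PORT = 9200  # Elasticsearch's default HTTP port

def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: probe one candidate host (only scan hosts you may test).
if is_port_open("127.0.0.1", ELASTICSEARCH_PORT):
    print("possible Elasticsearch instance")
```

A TCP connect only tells you something is listening; the follow-up step in the talk — asking the service what it is and what it holds — is what separates noise from a real finding.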
and we're going to do that for each and every instance. Let's say on that first instance we found nothing but data that was obviously meant to be public. Let's say on the second instance we have a hit: we see there is sensitive information — maybe PII, maybe sensitive system information being stored — and we need to remediate that. Okay, so we have a valid hit. That's kind of the 101. There's a different model we can take these days, because there are service providers out there that do the port scanning for us. Their job is to scan the entire internet looking for common ports and services, store that in a database, and then give us an API
that we can use to say, give me all the hosts you found listening on port 9200. There are a few different scanning providers to choose from. Some of y'all may have heard of Shodan, some of y'all may have heard of Censys, there's Rapid7's Open Data initiative, where they publish that for free, and there's BinaryEdge, which is another scanning provider. These all have pros and cons: some are strictly commercial, some have free tiers, and the Open Data initiative is purely open data. So which one fits your use case may differ, and the data — while you would expect it to be somewhat consistent — will differ too. But we'll just take one
for this talk: we'll use Censys as an example, and we'll see that the techniques we're using are very agnostic to the scanning provider — it's all about getting the data from them and using it. Okay, so here's the game plan, here's what we're going to do today. We're going to start by getting a list of hosts with open databases; we talked about Elasticsearch, and that's the example we'll use today. Then, for every host we find, we want to determine if there's potentially sensitive information exposed, and if that's the case, just for today, we're going to keep a record of that host — we'll dump them as a CSV, for example — and manually we would go offline and try to find the
owner and contact them for remediation. I wish there were more of an automatic process; it's very difficult. You can even see in some of these stories that a researcher found a database but wasn't quite sure who it belonged to — that's a trickier problem. But if we were doing this as a full research project, we would try to find the owners, reach out, and see if we can get it remediated. Okay, so that's our game plan. Let's do some coding, and I really hope this works. So in my editor here I have an open VS Code instance, and I have our game plan at the top. A little more
specifically, we want to start by getting open Elasticsearch instances from Censys. Okay — nothing initially. Let me see if I can expand this out a little bit. There we go. Is everyone able to see this okay? By the way, you look good. Great. For every host that we get — for every host that we get, we want to... let me give you some quick background on Elasticsearch. Elasticsearch keeps documents, and these documents are held in an object called an index, and this index has a set of what they call — yes? Which one would you prefer, dark mode or light mode? Let's switch it, and then we'll get a show of hands and see
which people prefer, because I've seen both work. How do I change this thing? Yeah — where are we looking, settings, you think?
we may have to do that
Let's see if we can't find it — we may just have to move forward with this. Okay, if we just move forward, is this all right? Okay. And the good news is, all of the code we're writing today I'm going to release on GitHub after the fact, and in the slides as well. So if you see links on the slides that are hyperlinked, the goal is not to have y'all click them from the audience — the goal is to give out the slides later, and you can go look at those after the fact. So Elasticsearch stores indexes, and each of these indexes holds a mapping. A mapping is a set of fields
that are stored — let's say, for a sensitive data store: username, password, first name, last name, those kinds of things — and it stores the data types: these are strings with this max length, and so on. And that's what we're interested in today. We want to see the IP address, which indexes are stored there (for example, 'users'), and then which properties are in each one, because that's a pretty good indication of whether or not there's likely sensitive information. And at the very end we could optionally get the number of records in each of those indexes. You'll notice we're not pulling the data itself; we're starting with just the metadata about it, because that gives us a good idea of whether there's a
problem there or not, while still letting them keep that data. Even though it is all publicly accessible, we're choosing to draw the line where we can protect that privacy a little bit more. And we're going to store everything in a CSV. So let's get to it. We're just going to read through this game plan: let's start by getting the Elasticsearch instances from Censys. Okay, let's make a new function; we'll call it get_instances. And what I've done — I've cheated just a little bit, and I'm going to be honest about this with you all, because I feel like this is a good place where we can be honest together. I'm not so brave as to assume the
Wi-Fi is going to work for me up here, okay? So I've downloaded this data from Censys and stored it offline, and I've made a little helper that wraps the normal Censys API, giving me the offline data instead. The good news is, here's what it looks like in production: you'd do `from censys.ipv4 import CensysIPv4` — where's it at — there it is. This is what it would look like if you were doing this online with the real data; right here's the difference — there it is, that's the difference. So I'll try to keep it as close as possible so that things still feel the same while we're writing the code. Okay, so we have
our Censys class available; let's go ahead and make our API client. I'm going to make this, and it says we need two things: our API ID and our API secret. These are given to us by Censys when we sign up and create an account, and for our example, if you want to use the offline version, you just tell it you want to use offline. So we'll do that: the API ID is 'offline' and the API secret is 'offline'. I'd be able to type better if my fingers didn't get so cold from nerves. So we have our API instance; now let's query for hosts. Okay, when it comes to Censys, there's a certain query language.
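Mirroring the offline wrapper used on stage, the query step can be sketched like this. The `OfflineCensys` class and the sample results are stand-ins I've invented for illustration; in production the same query string would go through Censys's real API client:

```python
# Invented stand-in for the offline Censys wrapper used on stage; in
# production you'd call the real Censys API with the same query string.
SAMPLE_RESULTS = [
    {"ip": "203.0.113.10", "location": "Ireland",
     "protocols": ["9200/elasticsearch"]},
    {"ip": "203.0.113.25", "location": "US",
     "protocols": ["22/ssh"]},
]

class OfflineCensys:
    def __init__(self, api_id, api_secret, results=SAMPLE_RESULTS):
        self.results = results  # canned data instead of live queries

    def search(self, query):
        # Naive matcher: yield hosts whose protocol list contains the
        # "port/service" term of a query like 'protocols: "9200/elasticsearch"'.
        term = query.split(":", 1)[1].strip().strip('"')
        for record in self.results:
            if term in record["protocols"]:
                yield record

api = OfflineCensys("offline", "offline")
for instance in api.search('protocols: "9200/elasticsearch"'):
    print(instance["ip"])  # only the host actually running Elasticsearch
```

The point of the wrapper is exactly what the talk describes: the calling code stays identical whether the results come from canned files or the live API.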
Each of these scanning providers has its own language. I think I have Censys here with their query syntax, and the thing we're interested in is something like this right here: it says you can specify protocols, tags — you can give it a port and the expected protocol. I'll just let you know that '9200/elasticsearch' is what we're looking for; that's a valid query. So let's say api.search, and it asks for our query, so that's what we're going to give it: protocols: "9200/elasticsearch". Okay, just like that — hey, that looks good. And that's going to give us a set of results. So for now, what we're going to do is say: for every
result — let's just call it an instance — for every instance in our results, let's just print it out as a sanity check, to make sure I'm not going to get all the way through this talk and have it break at the first step. We'll run get_instances like that, and let's try to run it. Let's go to our terminal here — I know y'all can't see the prompt, that's okay — python scanner.py. So this data that we're given back by Censys, for this particular endpoint, is really the metadata about an IP. We don't have any information about the Elasticsearch instance itself — just that there is this IP address, it is in Ireland, and it
has an Elasticsearch instance available. That already gets us halfway there: we saved ourselves from having to port-scan the entire internet, which saves us some time. Okay, so now let's take it to the next level. We've done the first step; now, for every result, we want to use an Elasticsearch client to figure out what indexes are out there, what properties they're storing, and how many records are out there. That saves us, when we're doing our analysis, from saying, oh, here's this 'users' index — but there are no records in it, so we're not going to waste our time following up on that. Okay, so let's go ahead and make
a new function, and for now what I'm going to do is use a special Python keyword, yield — it's kind of a hidden gem. That means we're not going to collect all of them and return a big list; we're going to return them one by one, which saves a little bit of memory. So let's say we're going to get our indexes, and I'm going to say it takes an IP address, because that's what we care about, and this function is going to connect to that Elasticsearch instance and give us the data we care about. Okay. In production you would do `from elasticsearch import Elasticsearch`; for
offline, I'm just doing this instead — otherwise the API, the arguments, all of that stays exactly the same. Okay, so let's make a little client — we'll just call it client, an OfflineElasticsearch — and it says, I need a set of hosts: who do you want to connect to? That's because typically, if I'm doing this as a sysadmin, I may have a cluster of Elasticsearch hosts that I want to manage all at once, right? In this case we just care about the one, so we're going to connect to that IP address. Now we can start getting data from this host. To do that, we start by getting the indexes: client
.indices.get, and it asks for a pattern — in our case a wildcard pattern that says, give me a list of all the indexes currently stored in Elasticsearch. Okay, now, for every single one of these indexes — going back to the game plan — we want to get the fields that are available. So let's see what one of these looks like. Let me show you what we have here: under my offline data, I'll open up one of our Elasticsearch instances. Let's try this one here — this is fine. So in this case, this is the data that I got
back from Elasticsearch offline. This is the name of the index, okay? And then inside of here we have a setting called mappings, and under mappings we have a list of the properties. Each of these properties is one of those fields: you have the field, and here's the data type. That's what we want to collect today, and we want to log all of that for every single host we come across. Let me go back to my scanner — you'll notice I have a cheat sheet, just in case things go really wrong here. So again, I'm not as brave as I may come off. There you go. So let's say: for every index name and the index itself in our indexes — because
it's a dictionary, we already have the index name, which is great — let's explore that index. For sanity checking, let's do a try/except block — trust me, I know where it breaks — and we'll just continue. Then we want to get our mapping, so I'll say the mapping is index.get('mappings'), defaulting to a blank dictionary if it doesn't exist, because otherwise that will break. And once we have our mapping, we need every single property inside of those properties. So we would say, for — I'm trying not to use the cheat sheet, but I really want to go to the cheat sheet — let me do that real quickly. I'm going to cheat, only because the first thing is going to give us the number of records for the index.
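To make the shape concrete for readers following along: one entry from that indexes dictionary looks roughly like this (a hand-made sample mirroring the Elasticsearch 6.x mapping format; the field names are invented), and the properties we want sit two levels down:

```python
# Hand-made sample shaped like one entry of client.indices.get("*")
# on an Elasticsearch 6.x host; the field names are invented.
index = {
    "mappings": {
        "_doc": {                      # the mapping (document type) name
            "properties": {
                "username": {"type": "keyword"},
                "password": {"type": "keyword"},
            }
        }
    }
}

mapping = index.get("mappings", {})    # default to {} so a bare index won't break
for mapping_name, body in mapping.items():
    fields = list(body.get("properties", {}).keys())
    print(mapping_name, fields)       # e.g. _doc ['username', 'password']
```

The `.get(..., {})` defaults are the same defensive trick used on stage: an index with no mappings shouldn't crash the scan.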
I was checking to see if y'all would catch me on that. I'll do client.count, and it asks for the index we care about — we know the index we care about, it's the index name — and now we have the number of records for that index. That's a command that's sent off to hit Elasticsearch and give us back the number of records stored in this particular data set. Now, we know that there's not just one mapping but multiple mappings, so we're going to get the mappings, and then for every mapping name and the mapping itself in our mappings, we want to get the fields. Okay, I know there are multiple
levels here — that's not my fault, we can all blame Elasticsearch together — but we're going to dive in, and now we're going to get the actual properties themselves. Because here's where we're at: we have the indexes, and it says, I have an index named 'users'. I know the number of records in there — let's say there are 20,000 users. What I don't know is what fields are in there, right? If it's just a random UUID, I wouldn't really care about that, but if I see field names like username and password, I probably care a little bit more. Okay, that's what we're getting now: the properties for the mapping — get('properties'), defaulting to blank —
and then for every field in our properties — we've done it, we can actually log this. I'll just print the field for now. Okay, disregard all those red underlines: they're not errors, they're just opinions from the linter. So now we have the data that we want — assuming it works — but we want to log it somehow. So what we'll do for now is catch all this in a little dictionary, return it back, print it out, and see what comes out. I'll just call it a record, because this is a valid record: we have our IP address, we have
our index name, we have our mapping name, and we have our field name. Actually, let's make this a little bit easier: I'm not going to return every single field on a different line; I'm just going to take all the fields and smush them together as a list, and that'll be easier to search through. So instead, I'll say the fields are a list of the properties, and we want to convert that to a string. We're doing a little bit of Python here — I want to grab those keys, okay, and that actually saves me from having to do this. I grab my record, and now we have a dictionary with our fields, just like
that. And we have a record that represents an index: the data types that are in that index, the IP address where the Elasticsearch index is stored — data that we can log. The idea is we want a big list of every IP address, every index stored there, and all of the fields they have, so we can just do some grepping, right? Search through, look for things like 'password', look for things like 'token', and see what comes out of there. Okay, are there any questions so far? I want to make sure we're all on the same page. Okay, well, I'm assuming we're all on the same page, so we're good. So let's
just return that for now — actually, let's just yield it, that's fine. So we'll say yield that record, and I'll save. Now we've got to actually call it, so I'll say: for every instance we're given, we want to go get that metadata. So: for every bit of metadata in our get_indexes for that instance — and if you look down here, this is the data we're working with; the key is 'ip' — we want to print that out: print the metadata. Okay, so — big assumption — let's see if it works. Let's see what happens. Just kidding. It's grumpy at me — why? My code is
perfect; what could possibly be wrong? Let's see what happens if we just print it. Worst case, I have a cheat sheet — where's my cheat sheet? Oh, I don't think that's it. It's claiming that it doesn't like the indentation on 36, right? Let's see why it doesn't like that one — it doesn't think my function is closed, but that's not true. Let's go to our cheat sheet; let's see what we've got here. You'll notice it looks very, very similar. Okay, I think I see the problem: we're not actually catching our except anymore — that's what it's upset about. Okay, there we go. So if things go wrong, ignore it — you only live once. So again, it's upset about something else:
indexes, line 35. Let's see what we've got here. This is what happens when I fly too close to the sun. Yeah — we can do it like the cheat sheet, which is: we're building a list of the records and returning those back to our function, and then writing those out as CSV. So we can do that for now. I was going to return them one by one, but let's just do that — let's step through the cheat sheet here, and if it looks good, then we'll just copy it, because there's obviously something in here that is correct and beautiful. So we're just going to grab this, we're going to drop this in. Am I missing a space? Where'd you
see that? I think it's the linter — yeah, just grumpy about the space, that's what it's upset about. There we go. So let's just do this — let's go ahead and cheat for now, because I want to make sure we have enough time to get through. My confidence will take a hit, but there we go: I dropped it in there. This is actually indices, so I'll grab that — indices. I'll tell you what: I'm going to keep this, and after the talk we can look through it together. If you find the error, I will buy you a drink or something, and then I'll ask you to do the talk next time. Okay, let's try to run this
instead and see if this works, because now we're getting a list of them — we're saying here are all of our records — and let's just print the records. There we go. Okay, so it's just grumpy about my string. Okay, so whatever bug that is, it's following us. Why is it following us?
Let's see here — for the items — try — I mean — okay, so it's probably about the except again, wasn't it? That would do it. All right, let's just put that up there — then it needs it there — and it's happy. Save and continue. Okay, now let's try running it again. This is awesome. That was easy — I know, I know. There we go.
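For anyone reconstructing the demo afterwards, here's a consolidated sketch of what this loop builds up. It runs against canned dictionaries (invented data standing in for the offline client; in production you'd swap in `elasticsearch.Elasticsearch` plus a real `client.count`) and dumps the records as CSV:

```python
import csv
import io

# Invented offline data standing in for client.indices.get("*") and
# client.count(index=...) responses from a real Elasticsearch host.
SAMPLE_INDEXES = {
    "users": {"mappings": {"_doc": {"properties": {
        "username": {"type": "keyword"},
        "password": {"type": "keyword"},
        "suspended": {"type": "boolean"},
    }}}},
}
SAMPLE_COUNTS = {"users": 5100}

def get_records(ip, indexes, counts):
    """Yield one record per (index, mapping): ip, names, fields, count."""
    for index_name, index in indexes.items():
        try:
            count = counts.get(index_name, 0)
            for mapping_name, mapping in index.get("mappings", {}).items():
                fields = list(mapping.get("properties", {}).keys())
                yield {
                    "ip": ip,
                    "index": index_name,
                    "mapping": mapping_name,
                    "fields": str(fields),  # smushed onto one line for grepping
                    "records": count,
                }
        except Exception:
            continue  # a malformed index shouldn't kill the whole scan

# Dump the records as CSV, as in the game plan.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["ip", "index", "mapping", "fields", "records"])
writer.writeheader()
for record in get_records("203.0.113.10", SAMPLE_INDEXES, SAMPLE_COUNTS):
    writer.writerow(record)
print(buf.getvalue())
```

One record per index keeps the output greppable: a single line carries the host, the index and mapping names, the field list, and the record count.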
That's okay. You know what? No — another problem, because we've got to return it. Never mind; I'm going to kill this thing, because it's fun. So, when in doubt — look, the real purpose of the talk was friendship. You know, it helps to show you people doing this stuff, and it's not perfect. Usually we release the polished version, showing that everything works, but in reality things go wrong — a lot, and in front of people. So we have this data back; we don't exactly know what it is yet, and the good news is I also have that offline, too, so
we're just going to look through that. We have the results we were looking for; after a little bit of trial and error we got where we wanted to go, and I'm confident we're going to cut all the things that went wrong out of the video. But now let's look at the data that we have. Because we have every single host that has Elasticsearch accessible, we know what indexes they have, we know what mappings they have, and we know what the properties are in each of those indexes. That's a really powerful data set. And you'll notice that while we looked at Elasticsearch, we had a big list of databases that have very similar
fundamental problems — misconfigurations being left exposed. These same techniques apply to each one of those databases, and they apply to each one of those scanning providers. You could say, hey, I want to look across Censys, Shodan, Rapid7, you name it — take all of that information together and look at it much more holistically, using the same exact techniques we used here today. And when you see the headlines out there saying this researcher discovered this big finding, now you kind of know how they did it. If you follow the pieces and put them together, you can come up with pretty significant results. Okay, so
let's look at our data set. Like I said, I made a backup version, to plan for the coding going so well. Let's take a look and see what it looks like: elasticsearch-censys-backup. That's kind of a boring view of it, so let me expand it a little bit — there we go. So here's what each one of these lines looks like. I know it may be a hair tougher to see, because I'm trying to keep it relatively concise: we have the IP address, we have the index name, we have the mapping name, and then each of the properties in that mapping. Okay, so since each record is just on one
line, this gives us the ability to just grep through for whatever we're looking for. Let's look at some examples — let's just try 'password', I guess, and see what comes out. Oh, and at the very end, by the way, we have the number of records, so that can help us sort if we need to. So up here, let's see what we have. Let's take this one: here's an IP address with a 'users' index that has a name, password, an ID, and whether or not they're suspended, and it has about 5,100 of those records. It's not a huge data set, but it's sensitive data that might be exposed, and we may want to follow up with
its owner. Here you go — ten records here, and they have passwords. I'm just scrolling through: this one is some kind of password field with 3,100 records; 10,000 records here for something that looks like it has light and dark settings; something that has tokens and passwords. All of this may be sensitive information that we care about, and this is data that's still out there today. Now let's do something else — let's grep out a bunch of these records again. We talked about finding sensitive data, but there are other patterns that may come out when we have a data set like this, because basically what we have is a data set of what types of data are being
exposed on the internet. That's pretty powerful — we don't always have to just look for sensitive data. Okay, let's dump this out and see if anything pops out — hopefully it does. Scrolling through, it all looks kind of standard. Well, let me just grep for one, then, because I know what it is and how that works. So this is something that, when I was scanning through the data set, kept popping up over and over and over: for every single one of these Elasticsearch instances I came across, it has a curious new index called 'readme', with a mapping that says 'how to get my data back', and it has a Bitcoin address.
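Both kinds of pattern — sensitive-looking field names and ransom-note indexes — can be pulled out of the one-line-per-record CSV with a simple classifier. A sketch, where both keyword lists are just starting points of my own choosing, not anything canonical:

```python
# Flag CSV rows whose index or field names look sensitive, or that look
# like ransom notes. Both keyword lists are illustrative starting points.
SENSITIVE_FIELDS = ("password", "token", "secret", "ssn")
RANSOM_MARKERS = ("readme", "how_to_get_my_data_back", "btc", "bitcoin")

def classify(row):
    """Return 'ransom', 'sensitive', or None for one CSV row (a dict)."""
    haystack = (row["index"] + " " + row["fields"]).lower()
    if any(marker in haystack for marker in RANSOM_MARKERS):
        return "ransom"
    if any(word in haystack for word in SENSITIVE_FIELDS):
        return "sensitive"
    return None

rows = [
    {"ip": "203.0.113.10", "index": "users",
     "fields": "['username', 'password', 'suspended']"},
    {"ip": "203.0.113.11", "index": "readme",
     "fields": "['how_to_get_my_data_back']"},
    {"ip": "203.0.113.12", "index": "logs",
     "fields": "['timestamp', 'message']"},
]
for row in rows:
    print(row["ip"], classify(row))
```

Checking the ransom markers first matters: a wiped instance often has nothing left but the note, so its field names will never trip the sensitive-field list.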
Okay — pretty clear what's going on here. And this is part of the problem: not only are we leaving data exposed, we're leaving services exposed, and attackers take advantage of this. I was studying Redis in a very similar fashion, and I noticed there was a key that kept popping up called 'crackit'. I said, that's really weird — I don't know where this is coming from. So I stood up a honeypot instance and recorded everything that was happening to it. It turns out what was happening was attackers would hit my Redis instance and use a known technique to pretty much save an SSH key to the box, and then they had
access to login pretty much as routing at that instance okay so after we published that we noticed that the same story started happening to to elasticsearch and started working its way down these databases database technologies where they would wipe the data and then leave a Bitcoin address saying if you want your data back you have to pay me this ransom the trick with the Redis instance was kind of interesting this isn't part of the talk this is kind of this is free he didn't pay for this this they weren't taking the data they were just deleting it the thing is as an administrator how do I know that's that's the that's the reality of an
incident like this was that that was the the laziest ransomware there is like I'm not even more worried about like encrypting it's to ring it offline and giving it back I'm just gonna delete it if you give me Bitcoin great if you don't that's fine you know I don't have to keep your data right and I wouldn't be surprised if something like this was was similar I haven't investigated this particular one but we see patterns that come out and as researchers we can measure this report on this and then help people understand the problem and fix the problem okay it's all about having that visibility being able to put those pieces together and come out with those actionable
solutions that people can act on. OK, so that's the code. I want to post all of the code up on GitHub, as well as the slides, as I mentioned, so that you can take this, explore with it, modify it as you need, and hopefully take the research a little bit further. But for now, here are the next steps. Let's say we come across a data set where it looks like there are usernames and passwords. That may be a case where we look at exactly one record, just to get an idea of what the data looks like. That's optional, obviously; it depends on where you want to draw the line, and some people may be uncomfortable with that, even though it's public data. Personally, I probably wouldn't do it: there's a password field there, and you can let the owner make the determination of whether it's sensitive or not. We can add new data sources. We could set up daily alerts: cache all this information, have our scanner run once a day, and have it say, here are the new databases I found out there. Obviously, we can alert the owners, and we could create regular reports on the state of the problem. You've seen some of this: BinaryEdge has done some of it, Shodan has done some of it, where they say, here's the state of open databases on the internet.
Seeing those trends over time, across 2016, 2017, and 2018, and recommending solutions. So we talked about identifying a problem and exploring it, but this is the part I take most seriously, which is: how do we fix it? There should always be a "how do we fix it"; if there's not, you're just pointing out that something is broken, and that doesn't really help people, right? For preventing data leaks specifically, it's all about best practices. There's really no magic sauce here; the problem stemmed from insecure defaults at the beginning. Fortunately, a lot of these are being fixed. For example, many of these databases would listen on every interface and wouldn't have any authentication by default, so if you didn't go out of your way to change that, your data was exposed. Now this is getting changed; people are fixing these problems as they come up, which is great to see. It's all about making that progress. Many databases also now support authentication, and in some cases role-based authentication, which is really useful if you want to not only lock it down, but lock it down per user, per data set. And in some cases you can even go so far as disabling unused features entirely. This is one of the recommendations given by the creator of Redis: look, if there are certain Redis commands you don't need, you can disable them. Least privilege, lowest attack surface, all that good stuff. I've also included on the slides a link to the security guide I found for each of these different databases, so whenever I do publish the slides, if you're using one of these, I recommend taking a glance through it. Again, the goal is not to have y'all click on it right now; it's to give you the slides so you can find those later. And it's interesting: I mentioned Elasticsearch, but one thing we're seeing recently involves Kibana, which is a front end to Elasticsearch, the dashboard component of it.
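To make that Redis advice concrete, here's what those recommendations might look like as redis.conf directives. This is an illustrative sketch, not a complete hardening guide; check the security guide for the version you run:

```conf
# Listen only on loopback (or an internal interface), never on every interface
bind 127.0.0.1

# Require a password for every client connection
requirepass use-a-long-random-secret-here

# Disable dangerous commands you don't need by renaming them to nothing
rename-command CONFIG ""
rename-command FLUSHALL ""
```

Newer Redis releases also ship protected mode, which refuses outside connections when no password or bind address is configured; that's exactly the kind of secure-by-default progress described above.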
People will set up authentication on Elasticsearch, but no authentication on Kibana itself, so you can reach it, and not only can you see the data set, you actually have a pretty nice interface for it. So that's something else: you have to lock down the whole stack, not just the data store itself. OK. And I've also included links to more tools that do a lot of what we talked about today. For each of these different types of data, there are tools being written and published on GitHub that aim to do much of this: hit Shodan, get a list of results, parse them in some way, shape, or form, and present them back to you. LeakLooker is one; I haven't used it personally, but it claims to support multiple types of databases all in one tool, which is kind of nice, so it's something you may consider. And to wrap up: again, the point today was to show that everyone in this room can do really impactful security research. You read these headlines and it may seem like magic at first, like, how are they finding this stuff? But if you put the pieces together, it's really straightforward, just with a few syntax errors, you know. The code we wrote, in terms of logic, in terms of the things we built, I hope was accessible. I hope it showed that the process from end to end doesn't have to be so incredibly complex, and that we end up with really impactful data sets, really impactful findings, that we can go report on and try to get some of this stuff fixed. And the problem itself, open databases, does still exist today, as shown by that headline from yesterday. Measuring the problem gives us insight into the issue, but the fundamental goal is always remediation, always fixing the problems we identify. And that's what we can do together as a community here at BSides: by networking, by going to talks, by hearing what people are looking into. Like I said, the goal with any of this research isn't to end the conversation; it's always to start it, because these kinds of things are tackled as a group, as a community, and we can do that together. Right, I'm open for questions, and they have the mic, so if you wouldn't mind, wait for the mic to come by.
Going once, going twice. All right, thank you, everybody!