
BSidesNCL 2020 - Life's a breach: Modern data breach reporting with Sencode breaches - @SencodeCS

BSides Newcastle · 12:51 · 10 views · Published 2020-11
About this talk
@SencodeCS: Life's a breach. Modern data breach reporting with Sencode Breaches. We have recently developed a new, free data breach database tool titled "Sencode Breaches". The talk mainly goes through this web application and my reasons for creating it.

Are we definitely live? Yeah, we're live: the sound, we can see the screen. Right, from the top: introducing Gareth Kerr, who's going to talk about "Life's a breach". Let's go. Thank you very much. So hello guys, my name is Gareth Kerr. I work as a web app pen tester for Sencode Cyber Security. I'm also OSCP certified, and I'm one of the unlucky people who did a cyber security degree, and that's quite controversial. A little bit about my history: I spent too long working in tech support and developed passive-aggressive sarcasm syndrome as a result of that, and I moved into security around about four years ago. I enjoy climbing and bouldering as well.

If I'm not on my computer, I'm usually at a bouldering wall. Hacking all the things, of course, web development, and blue light tanning beds: I find that four monitors works best for that. So why am I talking to you today? I'll discuss a research project on the exploit.in data breach collection: the why and how, the results and future work. And also Sencode Breaches, the application that I've made: why I made it, Have I Been Pwned version two, the unwanted version, and future development for that as well. So, research. How was the research done? Originally this idea came from Morphus Labs, so they've

previously parsed password data breaches into the ELK Stack and analyzed the data. So essentially I've done exactly the same thing, and the reason I did that is I wanted to see how effective password masks created using this method were. So I used Python to parse 110 text files, which contained 789 million passwords, using the ELK Stack, which is comprised of Elasticsearch, Logstash and Kibana. I only actually used Elasticsearch and Kibana, because I did the parsing and everything else with Python. So that was to perform the data analysis, and the purpose of the research was to see if it was effective,

as I said, to produce password masks from the data, and whether they're any good. So the ELK Stack, if you're not familiar with what that is: it's Elasticsearch, which allows us to index items, essentially, and in this case we're indexing large data breach files; and Kibana, which works as a visualization platform. It also has machine learning baked into it now, and it uses the Apache Lucene query syntax to create complex aggregations. Logstash wasn't actually used in this research; as I said, I used Python for that. So how was it done, and what did the data look like? Well, each entry in the text file looked something like this, so it would be

some email at some domain, and some password. Every line was formatted like this, so we used some Python code to parse it. The code parses that line and chunks it into pieces of data that are useful for complex querying. The table shows the data configuration for it: we've got the user, which would be the email address; the password, the very secure Password123; the domain; the length of the password; and, the important bit here, the password mask. This is the hashcat mask formatting, so one uppercase, seven lowers, three digits. And that was for every line in that breach collection, so every single line would have a password mask

and then we would end up with 789 million password masks. Then whether it contains digits, contains lowercase, uppercase, specials: they're all boolean values. So all this data was parsed and converted into JSON, uploaded to Elasticsearch and then queried with Kibana, if that makes sense. As you've probably seen a million iterations of these, this is the top 10 passwords based on this piece of research, and as you can see, very, very familiar. Whenever you look at a top 10 password list you will find many of these passwords contained within them, and most years have most of these in there, because it doesn't tend to change that much year on year.
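
The parsing step described above (each line split into user, password, domain, length, a hashcat-style mask and a few boolean flags, then emitted as JSON) might look roughly like the sketch below. It is not the speaker's code: the `email:password` separator, the field names, the `parse_line` and `top_masks` helpers, and the use of `?b` for anything outside hashcat's `?l`/`?u`/`?d`/`?s` sets are all assumptions made for illustration, and the `Counter` tally merely stands in for the aggregation the talk ran in Elasticsearch and Kibana.

```python
import json
import string
from collections import Counter

# hashcat's ?s set: the 32 ASCII punctuation characters plus space.
SPECIALS = set(string.punctuation + " ")


def char_class(ch):
    """Map one character to a hashcat mask token."""
    if ch in string.ascii_lowercase:
        return "?l"
    if ch in string.ascii_uppercase:
        return "?u"
    if ch in string.digits:
        return "?d"
    if ch in SPECIALS:
        return "?s"
    # Assumption: anything else becomes one ?b (any byte) per UTF-8 byte.
    return "?b" * len(ch.encode("utf-8"))


def parse_line(line, sep=":"):
    """Turn one 'email<sep>password' line into a dict, or None if malformed.

    The separator is an assumption; the talk only shows email plus password.
    """
    email, found, password = line.rstrip("\r\n").partition(sep)
    if not found or "@" not in email:
        return None
    return {
        "user": email,
        "password": password,
        "domain": email.rsplit("@", 1)[1].lower(),
        "length": len(password),
        "mask": "".join(char_class(c) for c in password),
        "contains_digits": any(c in string.digits for c in password),
        "contains_lowercase": any(c in string.ascii_lowercase for c in password),
        "contains_uppercase": any(c in string.ascii_uppercase for c in password),
        "contains_specials": any(c in SPECIALS for c in password),
    }


def top_masks(lines, n=10):
    """Count masks across lines and return the n most common ones."""
    counts = Counter()
    for line in lines:
        doc = parse_line(line)
        if doc is not None:
            counts[doc["mask"]] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    # One JSON document per line is a convenient shape for bulk indexing.
    print(json.dumps(parse_line("someone@example.com:Password123")))
```

For the example line, the mask comes out as one `?u`, seven `?l` and three `?d`, matching the "one uppercase, seven lowers, three digits" description. Splitting on the first separator only means a password that itself contains the separator is kept whole.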

We've still got the number one password as literally just the digits one to six, then one to nine, qwerty, password: you know, the standard terrible passwords that people set. We've also got the top 10 passwords of varied lengths, so this is where things get a little bit more interesting, if you're a person like me. What we can see here is quite standard keyboard patterns: most of the lengths have forms of keyboard patterns in there. And then you have anomalies like this one here, 3rjs1l and so on. That's an anomaly. I had to research why that would be part of this, and I was quite confused as to why that would be there and why there

would be so many people with that password set in these files. What you find when you look at these things is people love keyboard patterns, and across the whole data set you find that most of them are just standard patterns across your keyboard. Whether that's just a way for people to remember passwords, I don't know; it's a really bad way to set them. As for the anomalies, so the password that was laid out like this: again, if you look at your keyboard there's no pattern there, it's quite a random password. And what I've found, and Graham Cluley, if I haven't destroyed his surname there,

he believes they're created by bots, and I would agree with him there, simply because if you look at how the emails are laid out as well as the passwords, they seem like bot, spammy-type email accounts. They've all been set with the same password for whatever reason; I'll let you guess on that one. So the top five domains: these are the top five domains based on the 789 million. Yahoo's at the top, then Hotmail, Gmail, Yandex, and then the other domains make up 42%. So, password masks. This was the bread and butter of the research, essentially, and what this was for is I wanted to see

if these masks were any good. Obviously Morphus Labs had said they are, and essentially I wanted to have my own data set which I could create the masks from, upload more passwords to, and see if these masks are any good. You can imagine they were good. In order to evaluate the effectiveness, we compared the masks to other password masks which are used by professionals. The cracking was done using exactly the same machine configuration, a GTX 1080 Ti. Again, if I had more hardware available that would have been fantastic, but unfortunately not. We tested the effectiveness using the LinkedIn breach, so six and a half million unsalted SHA-1 password hashes, and essentially

we just ran it against that. That's a breakdown of the command. We took 60 masks from each data set, so from RockYou and exploit.in: again, a smaller data set and a larger data set. The reason I chose RockYou is because people tend to go to that naturally anyway; especially people who haven't really done much of this in the past tend to navigate towards the RockYou data set, and the RockYou password list is baked into most penetration testing operating systems now. So we tested the effectiveness, and it was around about 10% better. Now this would change if you incorporated rule sets into it as well, so obviously you can

change things: the purpose of rules is that you change maybe a letter to a capital letter, change numbers, do leetspeak, things like that. If you apply that alongside the password masks, you would likely get a much, much higher percentage of the passwords being cracked. So, the conclusion to this and future work. The ELK Stack is great for analyzing massive password dumps. It works really well, and because of the way Elasticsearch works it makes the querying much simpler. Kibana is a great visualization platform, and again it comes with the machine learning aspects if you wanted to go into that side and use that as well. So password masks created using

this method are powerful. The masks have not been tried, as I said, with the rule sets, because it's very expensive computationally: it takes a long time just to run the masks themselves, so if you add the rule sets to them it's many, many trillions more attempts. You're always limited by the hardware available, unfortunately. Now moving on to Sencode Breaches, which was the title of this talk. This platform is something I've made, a little bit of a pet project of mine. It was launched around three months ago, and I've just released version two last night, actually. It helps feed my obsession with

passwords and data breaches. It currently contains all of the Have I Been Pwned breach data, so that's all the data that's in there at the minute, but I am planning on filling it out with many, many more data breaches once I've found reliable data sources. Again, Have I Been Pwned is a really good data source, so that's why I've loaded that into it already. I'm going to show you that in a minute. So why I made it: it was a personal mission to have my own data breach database. The media as well, they do a terrible job when it comes to reporting data breaches; they often focus on sensationalism

instead of the facts, and that's not good. Data breach information should be easily accessible and, most importantly, easily digestible. So one of my goals for this project was to display the data in a nice way that's easily readable for people, so you just get the point of it when you view any breach. And then there's the timelines: this is what happened, when it happened, and how the company responded as a result. That's another important part of why I made this; I wanted to have timelines for every breach. Now that's no easy feat. It takes quite a lot of time to

take hundreds and hundreds of data breaches and have an accurate timeline for each, so currently we haven't started that aspect of it. Although it's on the website, we haven't actually started filling in those timelines yet; we just need the time to go through it, but that will be coming in the next couple of months. So, what it does not do. Unlike Troy Hunt's Have I Been Pwned, it only stores the details of the breach. It's completely different from the way that system functions: it doesn't store the hashed passwords or anything like that, it doesn't store any personally identifiable information, and it doesn't increase government competence in handling COVID-19. That's a really critical

point of it as well. So, a couple of pieces of software that we're producing; again, none of these are paid for. We've got our data breach database, which is live, and I'm going to show you that in a second; our Argus reporting tool, which is a pen testing reporting tool; and our e-learning platform, which is also being developed. Those are three pieces of software that will be interesting in the future. So I'm just going to bring up the platform. This is Sencode Breaches. Again, you can search: you can type your email address in there. I'm not going to type mine, but you can type your email address in

there and it'll tell you whether you've been part of data breaches. It uses the Have I Been Pwned API for that, because I'm not storing that information or storing the email address information. If we go over to all breaches, there are some statistics over time: back in 2007, one breach, and then it goes all the way up over the years, and as you can see, many hundreds of smaller data breaches. This will change dramatically over the next month or so as I find more data sources and start loading them into the system. Have I Been Pwned's data set is a really reliable and solid data set, which I've used to load in, essentially to test the

platform and everything else. We have our search function at the top, with keywords, you know, standard stuff for a search form. You can navigate through them just through the pagination if you want to. But if we take a look at one of the breaches, just choose one at random, click one here: it'll give you the description of it, the nice logo as well, what domain was breached so you can go to that, and whether it's verified or fabricated. The events are actually timelines, so as we fill those out that number will go up. Then when it occurred, when it was added, the breach count, so how many accounts were breached, and the data types, which is

really important, just to be able to easily view those. That's a bad example, because it's only got two pieces on there. If you go on this one, for instance, there's a lot there. Again, there's issues that I'm still ironing out with it and everything else, but it'll show you a list of all the data types involved: email addresses, IP addresses, names, passwords, usernames, et cetera. So yeah, that's pretty much it in terms of everything I wanted to cover. There's also a statistics page if you wanted to take a look at that, but that's it for the platform. I'm happy to hear your feedback. I am looking for feedback,

early adopters, people who think that they can give me some advice, or any feedback they want to give me, that's absolutely fine. You can pop me an email or go on LinkedIn, add me and send me a message. So that's it from me. Any questions, just please let me know.