
BG - Big Data's Fourth V: Or Why We'll Never Find the Loch Ness Monster - Davi Ottenheimer

BSides Las Vegas · 51:41 · Published 2017-03
About this talk
BG - Big Data's Fourth V: Or Why We'll Never Find the Loch Ness Monster - Davi Ottenheimer Breaking Ground BSidesLV 2012 - The Artisan Hotel - July 25, 2012

all right everybody I'm going to get started got a lot of material to cover because it's big data all right so the presentation's called Big Data's Fourth V uh why we'll never find the Loch Ness Monster but before I start let me just ask how many people believe there is a Loch Ness Monster small crowd anybody believe it does not exist firmly proof all right a little uncertainty when I started to write this presentation I thought it'd be really cool because it's a good example of big data but as I was writing the presentation I realized it's really hard actually to get any sort of straight answer on the thing can't more

volume okay oh that's loud feedback all right so yeah when I wrote the presentation I thought Loch Ness Monster would be really cool because there's a lot of data on it but actually as I looked into it more and more I realized there's no conclusive answers and so actually this presentation is weird in the sense that I'm trying to focus on the vulnerabilities that are brought forward in Big Data situations defining big data but also focus on the vulnerabilities and the issues around it and at the end of looking at some of these problems you realize there may not be any hope so that may be the problem here but let me just get started um I'll introduce

myself first because I know at BSides I've been asked to introduce myself before I talk I like to do it the other way because I don't want to color your impression of me but I'd like just to talk and then have you figure it out for yourself who I am but basically this is my 18th year I've done a bunch of BSides presentations uh two notable ones I did Las Vegas where I did a cloud Odyssey and I did Stuxlove in 2010 uh those were pretty fun to write and they're on the web if you want to look them up I'm a co-author of a new book called Securing the Virtual Environment

so if you want to check out anything about virtualization that's sort of my specialty or Cloud I have a few copies here too if you want to check them out the book actually has a DVD in it with some attack scripts and it shows you how to build an attack environment how to attack virtualization how to secure it so it just came out last month and it's from Wiley it'll be at Defcon if you want to buy copies um Flying Penguin is the name of my company started three years ago full-time I've been doing it for quite a while but the idea is penguins actually do fly underwater and so what I try to do in a lot of my presentations

is take a contrarian view show people things they might not have thought about before because a lot of people say they're Birds but they don't fly and in fact when they once they get underwater they put their wings out and they fly just like birds in the air so if you think about it from a scientific angle you actually realize that they're one of the fastest birds on earth when they fly they don't swim they actually fly all right so let me get right into it uh I'm going to do an introduction a little back on big data and why I'm interested in it I'll do a part one where we talk about what it is how many people here

already have a definition of Big Data familiar know what it is work with it okay so I'll spend a little bit of time on that because people are unfamiliar then I'll talk about what is really vulnerable and why vulnerability is so important in big data and then I'll try to do some prescriptive stuff talk about Solutions or the future and like I said it may not be a pretty picture if you're looking at the Loch Ness Monster but there are other areas that do have good Solutions also I think I want to mention that you know one of the reasons I'm talking about this is because my virtualization presentation got rejected which was my book I'd really like to

talk about that but apparently this is more interesting to people it got rejected by Black Hat too I'm actually happier to be speaking here I like BSides I'm a sponsor this year of BSides so I actually contributed to the conference but one of the things I have to mention is when they rejected it they said back to me like virtualization man like it's old it's like you know Xen's been zero day so many times you got to like really come up with something new and interesting which I think is complete BS I research virtualization exploits all the time there is no like list of zero days on Xen so I don't know what they smoke in

Black Hat but there's definitely like something else going on over there and I think BSides really keeps it real so this presentation is really about you know hard-learned experiences and some of the stuff that I've seen in the field stuff going on today I don't try to like sugarcoat it I'm not trying to sell you on anything so let's get right into it so introduction why am I interested in big data and maybe you are too there are guys like this Hal Varian at Google who said you know data is the future like maybe you didn't think about it in the past as that important but in the future the ability to be able to understand process and make meaningful uh results

out of data to visualize it that's going to be a hugely important skill and more and more you see ramp up of people who know Hadoop or people who know uh data science that's their specialty there are like tons of jobs and tons of companies that are looking at it and as a security professional whenever you see huge amounts of interest in a field you got to figure there's lots of opportunity there for flaws exploits resources are exposed because whenever people jump into something they usually don't know that much about security so big opportunities here for security and that's sort of the beginning of it but I'm also a big fan of Edward Tufte how many people have heard of Edward

yeah he's a pretty popular guy so he's got some cool stuff and I had the chance to meet him recently actually because he has a museum he just opened in New York that was across the street from a conference and I just walked into it and I was like oh my God I didn't know so he had just opened the museum and he's standing there and I started talking to him and he's done some amazing stuff like this is his chart of a Carl Sagan chart he took basically a scatter plot and you'd look at it and be like okay it's a bunch of dots and he put in these animals and you can see just by looking

at it what's what so you can see that this is brain mass on the left here and then body mass on the bottom so as body mass and brain mass go up you sort of see this like evolution chain there's dinosaurs there's birds there's bats stuff so it has meaning and you can deduce right from it like all sorts of stuff I don't get some of it though like he's got Babar I don't really know how Babar fits in I don't know if that's Tufte humor or what's going on but I thought it was relevant because we're talking Hadoop which is all about uh all about big data elephant and so does anybody know what this is I don't know

if you can see it in the back it's pretty small I'll just anybody know what a coelacanth is anybody heard of a coelacanth okay cool so I'll talk about that later but these are some examples actually within here of you know there's Big Data out there there's a lot of analysis of that stuff going on so you know what have we learned from Big Data in the past and people who are geniuses at charting it out and making meaningful charts of it like this this is one of Tufte's most famous examples of the Napoleonic campaign like he did this Russian campaign he started off with a massive Army he got to Moscow and that turned him back and by

the time he got back here this little teeny tiny black line is the size of his army so the devastation of his army and by temperature you can see on the bottom so visualization is pretty awesome but they've been doing it since I forget this is the 1800s they so you know when Google's saying like the future is visualization of Big Data it's like well they had lots of big data so what's different about it so I walk up to Tufte I'm in the museum and I say I'm going to write a book about this I'm really excited about big data security you know what do you think Mr Tufte and he says uh Big Data doesn't

exist and I was like what do you mean like he's like well there's just data is data there's tons of it everywhere we work with it all the time there's no big data it's all crap you know marketing blah blah blah he's super hard-nosed about what's what and he also always says like look at the white space that's his big thing you know don't look at what's on the space look at what's around it the more white space you have the more meaning all this crazy stuff he's a pretty cool guy so with that in mind I went back to the drawing board I was like uh-oh what's Big Data what is really different about Big Data

whoops so has everyone heard of the 3 Vs everybody familiar with 3 Vs all right so Gartner said a long time ago I think it was 2001 Gartner had a white paper or some publication that said there are 3 Vs the volume the velocity and the variety and that's pretty early so 2001 they just republished it like 2012 I have a link here in the bottom but I tried to do a haiku I just couldn't get it to work but basically it's many data points that grow very quickly and they are fairly unique if you want to put into a haiku and the bottom line is it's different now because the magnitude and the speed when you think about the

guy who's drawing that Napoleonic campaign that was was like in a library or he collected it but it was static information that he could gather over time and what's really different now is the size of data is much larger and the amount that's coming in is much faster so I'll talk a lot about that and it's coming in in Fairly unique ways so we're having a harder time of making sense of it or making meaning out of it so that's why it becomes so important for people to be good at it right so I also looked at why we're so fascinated at it or by it and you know last year I spoke about HAL in 2001 and 2011 I put a spin on it

but everyone's familiar with Hal and how he's a genius but he also went off the rails right here's a computer who's supposed to know everything but he ends up killing the crew because he knows so much he considers them a threat so obviously we have examples in our big data science fiction novels of Big Data gone wrong then we have 42 everybody familiar with Hitchhiker's Guide to the Galaxy right put everything in the world into this computer you ask it for the meaning of life and it comes up with an answer so we have a lot of examples of Big Data gone wrong anybody know who this is all right yeah cool Red Dwarf Holly

one of my favorite shows of all time and then of course we have Data himself right so we have a lot of examples of people who are really fascinated by Big Data in science fiction we have a lot of examples that we have to differentiate from maybe one of the most famous is RoboCop because it's all about data right everything about him has like data points and Analysis it's a classic Asimov style you know like you can learn so much and I don't know if you saw recently but there's a parody on this where this little screen pops up right the heads up display everyone seen the Google Glasses but this is the old

heads up display it pops up and it says you know 68% rating and out of style jeans offbrand backpack split ends right so it's the future of Big Data analysis so we have a lot of good science fiction but in reality you know seriously speaking a lot of disciplines are based on Big Data so we have astronomy Carl Sagan's favorite saying we have billions and billions of stars you know all that sort of stuff astronomy is looking for alien life or meaning in life in plasma and so forth dust in the space meteorology they're always looking at storm patterns I blog a lot about stuff like this you know how they're reducing risk to ships by figuring out

storms faster and figuring out wave patterns it's like the science of waves is this huge mystery and so much data on waves and we still don't understand how they work or why they're so big and when rogue waves are going to hit you and kill you so geology has earthquakes tornadoes hurricanes all that sort of stuff Anatomy has disease you know things we try to understand or health longevity economics has fraud and then Espionage I wrote about recently you know people have been telling me a lot about how the Chinese are spying on us but I'm a big fan of studying probably because my background I studied history and political science for a long time

but United States was famous for its spying and especially through uh the 80s on France and it got caught a few times and I posted something recently on my blog about how in the early 90s the ex director of the CIA just came out and said yes we spied on you and you deserved it because you're a bunch of French bastards and you you know you you you bribe everybody and you don't play fair so if we didn't spy on you then it just wouldn't be right you know we have to spy on you so there's a lot of use of Big Data Echelon was this huge like fields and Fields and Fields of supercomputers digging through

everyone's data and trying to look for keywords to to win in economics basically prove that the economic system of America was better because it was free market didn't need government interference but of course Echelon and spying is government interference so you can see where that argument goes all right so what does it look like today like to the average person what this future is really coming to mean is you know you can just sort of look down a Suburban Street and maybe put on your glasses like the Robocop and be like oh yeah that guy's a felon which you can do right you can look for child Predators or convicted felons in your neighborhood you can see someone who's 24 years old

probably by looking at some dating site or something you can see someone's lost weight or gained weight because they might be on a dating site there's a lot of there's a lot of examples now of people recovering equipment from dating sites by setting up fake profiles and trying to find out who might be a profile that fit the person they saw when they stole something uh literally yeah in fact there's a good case of where a laptop got stolen and they had some tracking information on it and they were able to see that that person signed onto a dating site and put the profile of what they look like in the computer so they saw this person was signing on

as you know a white male 165 lb with this and then they told the police this is the person that stole my computer or at least has it right now you can see someone might have bribed a politician if you look at the voter registry this has been actually a documented case you can look at information like that and find out a lot about people you can tell that they're on a trip looking at Twitter of course you can tell that they're stealing MP3s you can tell that they have guns you can see that they're having a heart attack even sometimes there's information you can get or you can see that they're downloading or learning about rat right so remote

Administration I'll show you an example of that but here's a real world example that's kind of meaningful I think a lot of people thought this was cool in 2009 you know about the same time that guy was talking about Big Data at Google they're also showing that Google's mapping queries on the flu Trends and showing that you can see the swine flu epidemic before the CDC even talks about it so you're looking at these like basically Google sitting there looking at what people are looking at the flu Trends and telling CDC hey there might be a flu going on you guys should get on that so there's some really good examples of how that's being used this

is a little tongue in cheek but I don't know if you saw this on Reddit this was going around for a while but this blew my mind like literally so humans have trillions of cells so if you really added it up each human has 144 trillion gigabytes of data in them right so that's a lot of data so if you think about health and big data and what the meaning of the new Revolution in data collection means there's a lot of data that can be collected and deciphered and then the example on Reddit was that they did an analysis of sperm because they figured out that by using a simple I didn't do the math I just took their

math and copied it so it could be wrong but this is basically what they said at 31 petabytes a second that's the average speed of an ejaculation which is they said 5 seconds so Google only processes 20 petabytes a day and we're talking 31 petabytes a second so if you really want to study data there's a lot if you get into a lot of fields of study there's a lot of big data so again it's volume velocity and variety and if you really dig into it that's what we're going to focus on a little bit here today but I'm going to try to take it down to a practical level not the theoretical or the scientific because

for me that's really the difference it's not so much about systems that are out there like Echelon or supercomputers that are doing research that are funded by governments it's about the average person what you're going to do with this data can you do big data analysis at your company and do you have risk that you need to address right so There's 7 billion people in the worlds right this is again is Google they've already they've sucked up all of the World Bank information that's coming in and they've now presented it on the the web so you can go and you can as a person mine World Bank data and you can chart it very easily so 7 billion people in the

world it looks like the United States has about a 90% penetration if you look at mobile subscriptions so that's pretty good right we're ahead of the world which is averaging 80 so I'm like woohoo America number one but then I started looking at the data and realizing South Africa has 100% penetration it's only 50 million people but 100% of people in South Africa are using mobiles right that's pretty awesome so in other words if you're a security person you want to figure out how to get at somebody Mobile's a really good way to do it in South Africa because everyone's got one and everyone's collecting probably data about where they're moving what they're doing you know if they're doing Netflix

reviews you can just Mobile's a great way to get into the space but then I started really getting crazy and looking at data and I figured that you know the United States is way down on the bottom and anybody know where Vanuatu is anybody heard of Vanuatu yeah an island but where South Pacific yeah so Vanuatu is about 200,000 people in the South Pacific look at that line it is like literally like 2009 bam everybody in the island has a phone and so you can see some crazy stuff coming uh if you wanted to figure out how to take over Vanuatu this is what I used to study right coups uh military engagements I don't know if

you know about the Seychelles but the reason the Seychelles coup failed was because these guys showed up with golf bags and they walked through the items to declare line by accident it was all planned out everything was going to plan everything was great they walked through the items to declare they took the heads off of the you know the golf bags have those little fuzzy things that sit on top of the golf they pulled that off and there's a 50 caliber machine gun in the bag and then the big firefight broke out and that was the end of the coup they all got shot in the airport and so anyway these islands if you

wanted to take them over today you probably wouldn't go through with your machine guns and the items to declare you would probably attack mobile you'd probably send people stuff through their mobile devices you'd figure out the area code and you'd be able to get at every single person potentially anyway you can see Israel that's slowly creeping up so there might be a lot of old handsets compared to Libya surprisingly boom they're right at I mean that's not the impression you get in the news that everybody in Libya has a new phone that they bought in the past two years and then Saudi Arabia is up there at 27 million at almost 200 right 200

subscriptions so for every hundred people there's two hundred phones which is kind of interesting why would they have a second phone so example of big data but mobile itself isn't as interesting as what they're doing on it so let's look at Twitter so tweets per day you can see this ramp up of the tweets going unfortunately with the Twitter data a lot of that is people who either talk a lot or have automated the bot and it's not really interesting data so it's not like you know stuff that's for sure going to be useful to you it might be some program that's running that's just tweeting a lot but if you look at it

from this example you can see that there are growth areas in Africa this is one of my favorite charts because it shows the the tweets by area the size the amount of tweets so Egypt for example has 1.2 million tweets potentially related to political action but you can also see you know you've got crazy amounts down here in Rwanda and so forth some Somalia is there so if you really want to figure out where stuff's going on or how to access people you can look at the data that's available to you from mobile uh one of the things I talked about at RSA I mentioned rat before but I started looking at where people were

training on tools so if you want to look at BlackHole RAT which is an attack on Mac operating system Cameroon and Uganda have a high incidence of attacks on BlackHole RAT they're training on that tool also Australia and the United States so you can see who's really accessing these videos what they're doing so this data about you is easily accessible and tells a lot about what you're doing another interesting example is we talk a lot in the United States about our volumes of data but if you look at China Telecom and China Mobile they have some crazy amounts of data they're putting into their databases so 1 billion texts in a day in February of 2011 anybody know

what 3rd of February 2011 was yeah New Year so on New Year's Day what they did was actually everybody in China was saying send to all of my friends happy New Year and the system is like boom million tweets you know just and it's up 133% so you can see the year of the rabbit was 2011 but you can see there's a lot of stuff going on that you can collect and those people are collecting them telecom companies are collecting everything you say right there's even more uh requests for information than ever right so the FBI is looking at your data that's going through those mobile systems so in summary just to get past the what is Big

Data we see there's a lot of volume and we have more sources creating more output than ever before the mobile phone is really the future and everything you're doing on the mobile phone you know I just somebody this morning was showing me how their their mood is being recorded by their mobile phone and I was like you want to put that in your phone oh yeah I'm like telling it I'm happy and it's like telling me I'm taking a picture of myself and then it's you know recording my mood with my photo to have like my I was like that's like a psychological experiment nightmare like why do you want to do that to yourself

so people are putting all kinds of crazy information on these small devices it's expected right we like to do stuff we're curious people and we're doing it incredibly fast speeds right talking like you look at China for example you see the types of speeds that just seem inconceivable so and then we have lots of variety so you've get audio and video blobs binary large objects you get lots of identification of people and the relationships so once you have friends of course and you connect to people and the people you talk to the most often you have opinions ratings and reviews I'll talk about one of the examples of how that actually caught people out uh

because the time that they would review something and you know whether you could identify them as an actual name is important so yeah like if you say to somebody hey did you review this really obscure movie like me and Tito and they're be like yeah yeah that was a great movie yeah you just go and look at who reviewed it right you can see now I know that's that person that's what their opinion of it can be really dangerous I do a lot of Yelp reviews and I work really hard to make sure no one ever finds out who I am right the anonymous thing is huge and it gets harder and harder every day so

configurations of course there's a huge amount of configuration data that's coming in um settings if you're a hacker you of course want to get into a system that's been misconfigured so I have dummy systems up I have Drupal sites up for example that people are constantly trying to log into and I can constantly see people trying to get in because of how it's configured so it keeps me apprised of when the next attack will come and then logs of course but at the end of the day it's about accessibility to that information and it's about being commodified anybody know who came up with the word commodified no Karl Marx so this is one of the old Marxist sayings but he was

basically talking about Commodities but his point was that stuff that doesn't have value gets value and I think the future is really in the fact that more and more of us are going to have access to this data and we're going to do stuff like you saw with the Google charts you could probably start doing that with mobile information especially if you can figure out how to hack and steal that mobile information that might seem implausible until you start looking at some of the APIs that providers give you and then stuff you can do like that's the ChoicePoint breach right some Nigerian dude in Los Angeles called up and said I'm a business and boom he downloaded

everybody's identity information what he did with that he could have been doing like marketing he could have done uh China just shut down this company because they said that it was a legit company collecting uh investment information 150 million people they collected investment information on in China and once they figured out that it was actually being used for illegal purposes the Chinese government shut it down you'd think that that would be like Yay you know awesome you're protecting identity information because that's the story We Tell in America when people do that kind of shutdown but instead the American Press CNN in particular said it's getting harder and harder to invest in China because they're shutting down

our source of information about individuals right so it's not a black and white picture right A lot of times that information is useful to us especially if you're an investor to figure out whether someone's a legit business person you look at their loan loans you look at their history of income you look at their tax returns and that information is what you use as a bank but if you're an individual investor trying to invest in a small company in China you're now representing a microloan or something you're like a bank so you're going to start trying to hack in and get people's information so that's the problem it's more accessible than ever and it's become commodified it

has real value to us that's why it's big data now whereas Tufte is saying it's just data this is what big data is about so I put it in terms like this data management tools used to be like this right expensive and slow but you could get the work done if you had a horse and a plow it was good enough but there's so much data now we need tools that are going to go out there and harvest this stuff like a combine just take it up really fast so that's the next section why is stuff vulnerable right when you think about what the combine's going to be mowing down and what it's going to be

accessing and how secure it is a lot of scary stuff comes up because essentially you're talking data harvesting who do you want to have a data harvest tool I could go on and on about pesticides but I'll skip that so the fourth V vulnerability is really the 3 Vs being done without CIA right confidentiality integrity availability so it's a little contrived but I'll try to fit it in security terms don't always work but I'm just going to try to use it for convenience 'cause we all know those but first a quick quiz let's say that you're in charge of a database that has usernames real names date of birth location sex that sort of information

and Pepsi comes along and says to you we need all the birth dates right you've got the date of birth information we want to give you $10 million to give us that date of birth information so that we can run a campaign on 13-year-olds who drank Pepsi what would you give them where's my clock take take contest HR department HR department there is an answer I mean you can give them something but what would you give them so they can get what they need done you can't give them the birthdays because you've given them the data right now they have everybody's birthday and what can they do with log into their accounts and steal all sorts of stuff

cuz that's their birthday they're used for registration and password

resets yes summaries so you do ranges so you start to say to them well we can tell you how many people are between the ages of 13 and 18 or between 13 and 1 and stuff like that and so they can get this many people have or this address is associated in this age range like you're within the range that we're targeting that sort of thing so you begin this is classic statistical analysis I'll talk about that in detail later but basically what you're trying to do is say we don't want to give you this data what do you want to get done it's a classic role for the security person what are you trying

to do just tell me the thing you're doing and I'll tell you what you can do don't just tell me to give you the date of birth so big data is even more problematic because there's so much data coming in so fast everybody's trying to get their hands in it because they can make so much money and you have to sit there as the gatekeeper and say uh how about I give you a range of dates would that work all right so let's move on to confidentiality here's a quick pop quiz anybody know why New York Illinois and Florida even though they're in the top five most populated states are way below on breach data like I threw this graph

together for go ahead yeah exactly so the regulations don't exist that really force people to say that they've had problems so confidentiality loss you know breach data is all about confidentiality laws and California is like huge trending way up there and they do that because they're super strict about the laws we don't really have really good breach data laws across the board in all places so when you think about confidentiality and you're looking at like where it's done best it's pretty much a reflection of where people regulated it and so you can do some it's my attempt to be like a cool chart person but that was a quick five minute job all right so

confidentiality some of the things that come up when we talk about confidentiality of big data is like deletion like you don't really know where the data is going or where it is can you ever get back and delete it especially on large farms and clouds because these storage arrays without getting too technical right away I'll get to the technical stuff a little bit later but the super technical problem is that once you set it out there it's like classic problem of disc storage you write it somewhere and then you decide to write it somewhere else you just move your pointer but you don't actually get rid of the stuff that you've written before it's sitting out there somewhere

in some disc array and so it may actually still be there and somebody could potentially go get it so this worries people a lot of people stay up at night and then they go could we write a deletion algorithm that could go well really only the storage vendor could do that and that's not happening right now the next thing is self-destruction okay so if they're not going to write the tool can we actually have the data kill itself right so there are actually tools now being discussed like vanish out of Washington they're talking about you know this data sits out there for a little while and then if you don't access ACC it it's gone and you can't

use it you don't have the key anymore so it's useless which brings up all sorts of other issues about you know strength of encryption and can somebody reverse it because wiping supposedly is irreversible encryption theoretically is reversible all right but aside from the technical stuff there are all these statistical control methods that pop quiz I gave you about Pepsi is really about these types of methods right you can do aggregation so you can do age groups you can do salts we're familiar with those for passwords but you can actually add a lot of noise into your stuff so when people get it it's got this secret salt in there that tells you that it's your data and it actually

makes other data look hard so they can't figure out how to get the other stuff out it basically munges it but you know what it is so you can go back and do a one way or something uh you can do imitations and synthetic so you only give them some representation of the data you can replace the data doing swapping with like data so they don't actually know the true data they only know what you've given them uh you can do wiping and suppress certain aspects of data if you're familiar with PCI you're familiar with some of this stuff because every time you go into a PCI environment they say like what are you doing to get rid of your card holder

data well this on big data is a massively larger problem because you have so much more data to figure out the value of and to delete destroy get rid of so with that in mind there are all these studies you should look at that are really relevant to this like in 1999 I think she might be the most famous but Latanya Sweeney basically said that she could figure out almost 90% of the United States by just looking at zip sex and date of birth so if she could get your zip code your sex and your date of birth she'd be good you know if you could figure out a way to obscure that I think sex is the easiest to obscure but

you know if you could figure out a way to make that obscure then she might actually drop down a little bit so city for example if you can't get zip you can get city she drops down to about 50% but still a huge number in terms of anonymity people can find you really easily and what she did basically was she took a cross reference of a healthcare survey I think some big data repository of healthcare information and she mapped it to the Massachusetts voter registration database and by doing that she was able to prove this and she was able to prove to the governor of Massachusetts that she knew exactly what his symptoms or

what his cause was that was kept by the insurance company so it was pretty shocking 2006 also very famous AOL research came out and said here's 20 million queries of 650,000 users and they said it's totally anonymized so we can release this amount of information because no one will ever track anyone back and the New York Times went out and said oh here's Thelma Arnold we figured her out right away because she wrote like the stuff about her like town we know how searches work right you type in something that's very easily identifiable as you so you know you type in your name Thelma Arnold and they go well that must be Thelma Arnold for

herself and then you type in you know why do dogs pee on rugs and well that's Thelma she has a dog that pees on a rug all the time true story she actually typed that in so 2006 the Netflix Prize right this was where these two guys uh there's a prize set by Netflix uh I forget exactly what the prize was for but they released 100 million records of 500,000 users pretty large and what they figured out was that if you could get a title of a movie and a rating of the movie and the date that it was rated you could get 99% identifiability on the people that use the database in Netflix right so confidentiality kind of

sucks in these systems and the latest one the most recent one that I've noticed is like this Apple versus Bitdefender Clueful argument where Apple's basically trying to pull the Clueful tool off of their site you can't get it anymore I guess but they did a quick 60,000 app study and they figured out that about 42% of the apps aren't encrypting network traffic 41% of apps are accessing your actual location information without your knowledge they don't tell you they do this so the mobile thing I was talking about before with the penetration if you want to release an app and have people download it you can get a lot of information out of people and so the point of that is

obviously that there should be something done about this and what's being done is Apple is removing the tool that tells you that you're vulnerable so at the same time there's 25 billion apps that are downloaded from Apple and 20 billion apps downloaded from Google so the magnitude of this problem they're studying 60,000 apps you know there's probably a lot more data analysis that could be done on the confidentiality issue but let me go ahead into Integrity much more interesting in my opinion confidentiality is the classic security issue privacy and stuff but Integrity of big data is even bigger and even more interesting one of the biggest problems is of course that the data is just

simply wrong you get in your GPS you drive through the desert you end up on a dead end road you get stuck and you die true story if you talk to the uh park rangers now they say do not use your GPS if you go through the desert if you drive through the Mojave they will warn you don't use your GPS because it'll tell you there are roads where there are no roads anymore it'll send you down paths that don't have you know pavement and you'll get stuck so there are lots of examples in England of people driving into the river driving off bridges cuz it says to in fact when I drive over to the Bay

Bridge in San Francisco a lot of times it says turn left right in the middle of the bridge right so there's serious integrity issues but some other examples are uh the inner city blue milk issue where they tested a lot of inner city kids and they asked what color is milk and the kids said blue and they were like oh you know inner city kids are dumb but then they realized that it's like this issue with context where they actually get cartons of blue milk like the milk cartons themselves are blue colored and then they realized when you look inside it looks blue because the cartons are blue and also because it's like thin milk skim milk

and stuff like that it has a sort of a blue tinge so a lot of the data may actually be polluted by the way that you're setting it up the way you're querying people and then there's the old diaper issue where Arthur D Little did a study and they actually polluted the study by going in well this is during the controversy of disposables versus cloth and they were hired by a company that made disposables and they went and proved that disposables are much safer for the environment because you use less water to make them and all that sort of stuff you know the big picture argument so they created a context that gave the argument a whole new meaning and from that really

that's why disposable diapers I think became so popular one of the reasons was this Landmark study that changed the context to change the meaning of all the data but it really questions like whether the the data was valid another example of diapers in terms of poisoning I talked about this in New York and people thought this was funny but it basically was you know people want to sell diapers so they try to figure out a way to Market diapers to a community and then they try to figure out who's going to buy them so they set out these Market study areas and then somebody went and just bought a ton of diapers like they went in that area and they just bought

all the diapers to make sure that the study looked really good and then like American shipment of diapers was totally based on this little market study this extrapolation of course nobody bought them and so it was this huge mess to clean up all right it didn't go over big here but I thought that was funny all right so frozen Coke homework does anybody remember frozen Coke yeah no nobody frozen Coke was huge it was like the biggest it was like the future of Coke right well that's the problem with the study there too so they set up this thing where they would actually pay it wasn't going so well when they were studying this region so

they paid kids to go to the Burger Kings Big Boys whatever those like uh places they were like if you come in and do your homework we'll give you a voucher for frozen Coke and so all these kids would come in and do homework so they get free Whopper free uh frozen Coke it was basically like a Frosty so all these kids started going and buying frozen Coke tons of it so they were going to launch this huge campaign and then an accountant an auditor figured out that all the data had integrity issues it was all fraud and Coke wrote off like $9 million and that's probably why you haven't heard of frozen Coke cuz

otherwise they would have shipped it to the entire country and you would have been inundated with like frozen Coke's the best look at this market data right big data and then there's also just flaws in the data that comes from like the Intel Pentium floating point unit error um you can get bad sampling this is actually someone else's story so I don't know if I want to tell too much of it but in 2011 there was a huge rabbit flu scare in San Francisco that no one heard about it was Fleet Week all the Navy was there all the brass and one of the sensors tripped that said there was a rabbit flu anybody know what rabbit

flu is yeah I didn't know what it was either but apparently it's like this obscure hunting thing where rabbits die and get they get the flu and they die and then if you kill an animal that ate the rabbit you can get the flu or if you work with a rabbit that's got it you know you can get it doesn't seem like that big a deal you get the flu you get over it but in the 50s the United States did a lot of chemical warfare analysis of rabbit flu they were going to drop it into countries and bomb people with rabbit flu so when the government found out that there was a rabbit flu sensor

trigger they didn't think oh yeah somebody with a hunter the situation might be tried or a dead rabbit they thought you know chemical warfare weaponized rabbit flu so they were like ready literally moments away from shutting down the entire Bay Area and then they discovered that someone's dog had like dragged a rabbit carcass out of the backyard into a closet and that person wore a coat and they got on the roof and the sensor was next to their coat and then bam they were about to like go into DEFCON 5 all right availability so that's big issues with the Integrity this one of the problems with big data is there's so much to talk about and there's so many

issues with data and securing it so I'm trying to get through this quickly so we can get to the really interesting technical stuff but basically the purpose of a lot of the Big Data Systems is to have massively available amounts of data you want huge amounts of storage and huge amounts of processing so you can work with it and so you want simple inexpensive and fast and also shared data uh and examples of that are like your fraud department wants to look at what people have been doing your service reliability Engineers need to look at what people are doing your QA department needs to look at what people are doing so in a Google like environment

everybody needs access that's the shared part which is usually left out of simple inexpensive and fast discussions which is why security gets screwed in this equation because people build stuff that's these three things and they don't really talk about how shared it is and they don't really design for shared access so what happens is in the simple inexpensive and fast world if you remember raid of course everybody's familiar with raid I just saw they do raid 160 anybody know what 160 is it's crazy yeah they do raid one then they do raid six then they do raid zero on top of it it's awesome it's like they just keep stacking them up they call it the

three layer cake or something but you get to these massive levels of raid but people hate raid right it's expensive now and there's like I worked with storage works way way way back in the day uh not storage works it was um I forget it was digital I used to work with these original digital systems that had tons of discs and and so it was cool but we would charge people a ton of money to run these systems so they got into the distributed file systems right the Unix folks are always like yeah we don't need like tons of and it's important to note that it was inexpensive discs when raid first came out it wasn't about you know security it

was really about getting massive amounts of storage for science that's why they did a lot of raid um but anyway they moved to distributed file systems then there was this resource scavenging trend remember that SETI@home where it was like oh man there's all these like free cycles out there I'm going to get some of my own so I can do cool stuff and find aliens so you put all this together and you see a trend in availability right people are trying to use resources that are available to get better availability of information to them and so there in about late 80s I mean late 90s you ended up with a Beowulf cluster how many people have heard of or run a

Beowulf cluster all right cool yeah cuz that's kind of falling off the radar it doesn't really even exist anymore as a but at the time it was really awesome cuz you're building a supercomputer out of old parts and for schools and stuff that have tons of computers sitting around in labs that's really awesome uh you can basically scavenge hardware and you can use it to create this massive NFS share with some rsh well NFS and rsh not terribly secure but even worse in terms of availability it had this master node issue where if you shut off the Beowulf master node the whole thing was down so you're at zero availability doesn't matter if you're doing scientific research

because you batch process so you just start the batch again and if you have a supercomputer it's really fast so it's like oh it took 10 minutes to process instead of 10 days it died I'll do it again it'll take another 10 minutes I'm still saving 9 days so availability from a big data perspective has this bias implicitly built in all right so Google built something that was very similar to Beowulf in theory and they had a single master with these large chunks of data doing batch operations and here too you see the same problem their master node would go down dead index was dead they'd have to go back and build it again but again didn't really matter because

they'd get to it eventually and their index would be built they'd use an old index remember how you used to say like please index my site and it would take a while well maybe they had a master node death and it was taking longer than usual but usually it was pretty fast they replaced it with Colossus because they can do multiple masters and the reason for getting speed up was because they started to do stuff like YouTube and they started to do stuff where you had they changed their chunk size because they couldn't do big chunks anymore so there are all these like usability drivers so Google's improving their system so that they could have better usability for the users on

the back end again still no security except for you know this multiple master in terms of availability they also index in real time and stuff so it's pretty cool but I thought this was really interesting if you go back to 2003 and you look at the original paper that Google wrote they don't really this is proprietary if you don't have access to this you can't run your own Google version of this right the GFS or whatever it's called but they said in this paper they exposed themselves a little bit and said we treat component failures as the norm rather than the exception and yet they have this master failure problem and then if you look

later in the paper it says having a single Master vastly simplifies our design so think about yourself as a security professional arguing this case well yeah it simplifies it but it goes down so what if we spent this much and had more availability doesn't it weigh out in the end they actually did that so let's look at Hadoop then Hadoop is something you can actually use and everybody talks about Hadoop all the time because it's like you can do cool Google stuff but you can do it right that's my whole point really is now this big data is moving into your realm because you can run home right now and you can set up a Hadoop cluster and you

can start going you can start Hadooping so the basic architecture here it's very similar you have this MapReduce which is a fancy way of saying that you're going to give lots of little things to do out to lots of resources and then those are going to be run on these little nodes so stuff that you want done and things that are going to do it but they had a name node issue since they based it off of Google that was very similar so it could die you lose your name node you lose your whole thing that actually has bitten a lot of companies and I've seen more and more companies say we didn't understand

there's this big availability issue our Hadoop cluster is going down we're trying to do real-time processing and like lots of data analysis the data is coming in super fast why can't we do it super fast so name node failure so let's look at it a little closer basically you have the name node that's trying to set up these tasks on data nodes and it does this map which I talked about a little bit and a reduce which I talked about a little bit so here you send your data it gets mapped to all of your resources and then once it does the processing in record time like a supercomputer it hands it off and then

you reduce it back so that you have a final answer so that's how a lot of computers end up looking like one giant supercomputer and you have a job tracker well obviously you need availability here and you need redundancy but what actually came to mind when I looked at this was the communication between these devices remember the shared part you have lots of stuff working together the shared part is where your vulnerabilities really get introduced so you move through confidentiality integrity and availability and you start to realize that if you're going to look at availability designs which is what the systems are based on you start to find risk around confidentiality and integrity because if that RPC read is

not secure well then anybody can read or modify or change and so that's exactly what's been happening is that you're seeing authentication by assertion only in the Hadoop clusters you're seeing authorization that's single user and the single user happens to be named Hadoop so it's pretty hard to guess and get in there and change the data remember the diapers you know remember the Coke the frozen Coke if you want your marketing company to you know be successful and really invest in stuff you go in and you modify a Hadoop query you're golden right you don't have to go hire kids anymore with coupons you can just go and modify some data and there's obviously money in that so

and then there's accounting right with no authorization you can't really figure out who did what because you have one user doing everything and really no authentication so it's like your accounting system's toast well Hadoop knew this and so they started to get in trouble with some people and they started to fix it so it's cool we're seeing a trend here but basically the ease of use led to a design that was for arbitrary code execution you tell that to a security person they're like what you're doing arbitrary code on every system in your entire Hadoop cluster like yeah yeah it's really fast it's inexpensive it's easy you know okay but what are you doing on it oh yeah we're

doing like you know health checks for people's you know I don't know heart making sure they're safe you know doing some big data analysis on heart risk so the authentication issues there's no verification this is what I just talked about oh and tasks are not isolated so people talk a lot about virtualization clusters and how you have co-tenancy and task issues this is a major problem in these big data clusters that a lot of times the tasks can stomp all over each other you can create a malicious task and it can blow away the other tasks and then you can create your own tasks that do different calculations so there's a lot of really cool opportunity for

hacking through the Hadoop clusters unfortunately though or fortunately they're getting you know some idea of what to do and it follows the classic trend of security what you're seeing is they've built in Kerberos RPC strong authentication so Kerberos is not my favorite form of key management but it's at least a strategy where it does ticketing so before you can actually go in and do something you're going to get a ticket from the Kerberos server it does some group resolution on the master nodes so if the master node doesn't know who you are it might not even do it which is cool cuz now you're getting access controls and tasks are run as the user not as some single user
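The difference between authentication by assertion and ticketing can be shown with a toy sketch. This is not Kerberos and not Hadoop code: the shared `SERVICE_KEY`, the `user|expiry|signature` ticket layout, and every function name below are assumptions made up for illustration. It only shows the idea that a node should check a signed, expiring ticket and run the task as the verified user, rather than believe whatever name the client sends.

```python
import hashlib
import hmac
from typing import Optional

# Hypothetical secret shared by the ticket issuer and the cluster nodes.
# Real Kerberos uses per-principal keys managed by a KDC instead.
SERVICE_KEY = b"demo-only-secret"


def _sign(payload: str) -> str:
    """HMAC-SHA256 signature of the ticket payload, hex encoded."""
    return hmac.new(SERVICE_KEY, payload.encode(), hashlib.sha256).hexdigest()


def issue_ticket(user: str, now: float, lifetime_s: int = 300) -> str:
    """Issue 'user|expiry|signature' for a user who has already logged in."""
    payload = f"{user}|{int(now) + lifetime_s}"
    return f"{payload}|{_sign(payload)}"


def verify_ticket(ticket: str, now: float) -> Optional[str]:
    """Return the user name if the ticket is authentic and not expired."""
    try:
        user, expiry, sig = ticket.split("|")
        expiry_ts = int(expiry)
    except ValueError:
        return None  # malformed ticket
    expected = _sign(f"{user}|{expiry}")
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return None  # forged or tampered
    if now > expiry_ts:
        return None  # expired
    return user


def run_task_by_assertion(claimed_user: str) -> str:
    # Old model: the cluster trusts the name the client asserts.
    return f"task runs as {claimed_user}"


def run_task_with_ticket(ticket: str, now: float) -> str:
    # Ticket model: the task runs as the verified user, or not at all.
    user = verify_ticket(ticket, now)
    return "rejected" if user is None else f"task runs as {user}"
```

Under the assertion model anyone can call `run_task_by_assertion("hadoop")`. Under the ticket model, editing the user name inside a ticket invalidates the signature, and an old ticket stops working after its expiry, which is also what makes per-user accounting possible.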

so you can see the basic controls being brought in and this came in in February 2012 so the chance of somebody doing this upgrading and using this is marginal to slim right now but it's a good future it's definitely going in the right direction they're also putting resource controls which on Linux would be your ulimit controls and they're giving ACLs to jobs but all that being said I noticed that there's a threat assumption right in the document if you go and look at the Hadoop documentation the threat assumptions are like nobody's going to be on your network in between the communication so there's nobody who's able to intercept communication no bastion hosts even so if you get into a

Hadoop environment and you think okay my Hadoop environment is really vulnerable so I'm going to put a bastion host I don't know if that's the right term anymore but I'm going to put something that people call a jump box or who knows what people call it anymore but going to put something there that you get on to get in so it's secure well that box can actually do man in the middle so it breaks the entire security model if you don't have really good control over how people are getting in that environment it also assumes that nobody has root like users never have root on those systems because then you can do bad things just kind of goes

without saying but still they call it out and then there's no control over packets so if users have any control over packets in the environment there's really no network level security in these big data environments that's basically what they're saying all right so putting it all together basically this is an example that I came up with by looking at a guy's blog who does a lot of Hadoop work and super smart guy and basically what he says is I took this study that's about narcotics officers and narcotics officers are looking for people who are dealing in drugs so they've got a few users and they go and look at the call logs
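The call log example comes down to a counting job, which is the map and reduce pattern described earlier. The sketch below is plain Python, not Hadoop, and it is not the blog author's query: the records, the names, and the `rate_ratio` helper are all invented for illustration. It assumes the question is simply whether one caller dials a given city at a higher rate than everyone else does.

```python
from collections import defaultdict

# Invented call records: (caller, city called). Not real data.
CALLS = [
    ("john", "bogota"), ("john", "bogota"), ("john", "san diego"),
    ("mary", "san diego"), ("mary", "san diego"), ("mary", "bogota"),
    ("raj", "san diego"), ("raj", "san diego"), ("raj", "san diego"),
]


def map_call(record):
    """Map step: emit a (caller, city) key with a count of 1."""
    caller, city = record
    return (caller, city), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts for each (caller, city) key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def rate_ratio(counts, caller, city):
    """Caller's share of calls to `city` divided by everyone else's share."""
    own_city = counts.get((caller, city), 0)
    own_all = sum(n for (c, _), n in counts.items() if c == caller)
    others_city = sum(n for (c, t), n in counts.items()
                      if c != caller and t == city)
    others_all = sum(n for (c, _), n in counts.items() if c != caller)
    if own_all == 0 or others_city == 0 or others_all == 0:
        return None  # not enough data to compare
    return (own_city / own_all) / (others_city / others_all)


counts = reduce_counts(map_call(r) for r in CALLS)
print(rate_ratio(counts, "john", "bogota"))
```

On a real cluster the map step would run in parallel across the data nodes and the reduce step would merge their partial counts. A ratio well above 1 only flags a caller for a closer look, and if the input logs can be modified by anyone on the cluster, the ratio is only as trustworthy as the data behind it.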

and the call logs tell them they're calling Bogota and they're calling you know San Diego they live in Los Angeles so the idea is can you tell who's calling Bogota more often than the average person cuz that's going to tell me that they're a cocaine dealer in theory so it's the study that was written in 2005 so he takes their calculations which is basically here you know John calls Bogota John calls other cities John or other callers call Bogota and then he runs a simple Hadoop query and he pumps out 7.88 to see if there's a higher than average rate of calls so going back to the beginning big data is about all this rapid accumulation of

data on mobiles Hadoop is about taking that information and doing queries on it and you can imagine right now who's doing this type of query right and you can do it too so you can build a Hadoop cluster and you can download all kinds of crazy data from people's mobiles especially if you write an app that allows you access that no one's going to stop you from getting so you get all their location data build a Hadoop cluster you can see the vulnerability angle on this one thing that jumped out here was that the guy who wrote this really cool blog post about how to do it he actually has the queries

and tells you how to do it he said you know well we couldn't do a lot of this because this is something only the NSA would know but I would question that I think I've shown you it's pretty easy to get the kind of data you need about people's calling patterns or whereabouts or whatever you're looking for you don't have to be the NSA anymore to get this kind of information out of companies as the ChoicePoint breach showed and other breaches showed right you just have to impersonate the NSA so what do we do right we're heading this trend it looks kind of good but it's also super scary so let's go back to the kind of trite but

useful CIA I say encryption is going to change fundamentally I'm already seeing it change I'm getting calls all the time about key management but I think this is definitely the time to get into key management strategies because it's going to be super useful in these massively shared environments of huge amounts of data and so key management standards are lacking we need people to work on key management standards KMIP sucks it's good but it's not enough it works on storage level it doesn't work on the level of application like mobile application data that we need we're at self-destruction strategies you need more evidence and like that that works we're looking at decryption levels that are impossible

because we can't do wipes so we have to get to levels of encryption that really can't be reversed and then we're looking at poison pills so if you got the data it would like do something to you like we could tell it was yours it's like marked bills ink blots stuff like that so can you put an attribute on data that would be only your unique attribute so you'd know when somebody stole it for integrity we can do configuration access control so better and better configuration tools like SCAP gives you a lot of information from NIST it's a standards based system so you can go and do queries of systems to see if they're in a

configuration standard that's secure in order to tell if somebody's able to get in and corrupt or pollute that data uh containers and segmentation is a big part of it too where data that's in this segment is protected but data that's over here can be messed with so examples of that are new hypervisors that are coming out to allow you to run an instance and then you destroy it at the end so the more hypervisors you can run and destroy the easier it is to prove that you have integrity of data because in that instance it was Secure and when you destroy it and then start it again it's back to that original state it has

Integrity all right so the last one's availability I think more and more we're going to see peer-to-peer communication I I'm sort of disappointed that there's a cool new tool that came out that does encryption of mobile communication but it calls to the server every time it goes to them and does communication and I'm like I don't want to talk to you if I want to talk to my friend securely I don't want any relationship with you whatsoever but anyway for availability I think there's going to be more and more of that peering structure because when you think about that guy goes down I can't talk securely anymore to my friends I would much rather have five

friends and I can peer traffic through them which is like the Skype model and a lot of people use that already but I would look at more and more of that uh a lot of information is going through peers as opposed to through central which makes it harder for people to get central databases of what's going on but that also begs a trust framework and people impersonating so from that I go to the Loch Ness Monster I knew it was going to be hard to bring into this equation but basically here we go again confidentiality integrity availability you've probably seen this picture this is the original picture that showed up in 1934 that said there's a

monster since that time they've tried to prove over and over again that the monster exists it's like they've taken in so much data using sonar most recently but they use binoculars they use radio signals they use all kinds of crazy stuff pictures every time they do it in 2003 they said they really proved it didn't exist but every time they do it there's an integrity issue and I'm like really like they couldn't protect the data every time they do a test it was really that hard you would think there'd be so much on the line they really want to prove it but sure enough it's like the Loch Ness Hadoop monster this is literally it's literally a theory

it's one of the most compelling theories there's a circus in town and it like had an elephant and it was swimming around with its trunk up and they took a photo of it but the one thing that makes this not compelling is that somebody said if the circus really was doing that why didn't they like reveal that it was them and just call the paper and another theory is that it was staged by someone who hated the paper and wanted to make the paper look bad but no one ever called in and said you're idiots so a lot of the theories are hard to prove but anyway back to the point in '87 they

did this deep scan by this guy Lowrance pretty famous guy I guess he invented like sonar technology so he goes out there and he uses his sonar and he says there's something here we don't know and there's something here that's larger than a fish and I'm like are you trying to sell Lowrance sonar systems or like you don't know what's down there like that makes your system sound bad right but it's like the Loch Ness Monster makes people want to buy the technology to find the answer so it's this weird thing that we're really bad at and then this study in the '70s this guy said no no one's sure how the original came to be

altered and no one's really sure how like he took all these photographs that supposedly proved that this head was in the water but it strangely looked like someone had altered the photographs and no one could explain why so when you really look at it like we're really really bad at integrity in this issue but we're really good at this which is being creative and coming up with crazy answers about data and that I think is a huge warning a lot of the big data analysis we've done historically has been way off the scale in terms of creativity in this case there were guys who were doing gold mining in the Gobi desert there were nomads and when they

would take the gold and sell it to other civilizations this just came out this book that talks about this but they would go and sell this stuff and to the Greeks for example and talk about these amazing Griffins and the Greeks have Griffins now if you look at like Greek like literature and history and you go and look at NS and stuff there's Griffins and there are these wings and beaks well this Anthropologist went back and found that the protoceratops Left Behind These bone structures that had these beaks and stuff and so she was thinking well she they just looked at these bones while they're digging for gold cuz they'd show up as they're digging and just go and talk it up

and that's why they have this relation so this whole idea of these like people out in the field doing data analysis with all this stuff and then telling stories and coming with these crazy explanations and then believing it and then handing it off that's the key other people believing it you see there's a trust relationship there you see there's an integrity of data issue there so we're really good at making things bad like making the data even worse than it should be so the integrity of discovery is sort of like where I'm headed with this right the coelacanth which I mentioned at the very beginning is this fish that people said they discovered in 1938 but

in fact I'm sort of personally familiar with it this is a picture I took of a coelacanth in a museum but this is a fish that was caught all the time by fishers like they would catch it they would pick it up and be like oh not another coelacanth this thing's oily it's disgusting it's spiny it's gross so they'd throw it back and they're doing this for a long long time well then some Western scientist shows up and he's like wait a minute this has been extinct for 60 million years you've discovered the first you know case of the coelacanth and it became this huge international phenom and by the 50s they were being

documented and talked about like this massive discovery but in fact it seemed like a lot of people already had this data it was floating around and there was sort of this question about you know are we really when we look at this data are we really being honest about when we discover things and when we're finding answers so I think it's another warning that we have to be careful when people present data analysis sometimes what they're doing is representing stuff that everybody already knows just in a new and novel way but they're not really helping us with answers so that's basically it big data's 4th V what the data is why it's vulnerable in all the

different ways and then what we're going to do now thank you very much [Applause]