BSidesSF 2019 - Anti-Privacy Anti-Patterns (Sarah Harvey)

Name: BSidesSF 2019 - Anti-Privacy Anti-Patterns (Sarah Harvey)
Uploaded: 2019-03-18
Duration: 28 min 50 s

BSidesSF · 2019 · 28:50 · 621 views · Published 2019-03

Speakers

Sarah Harvey

Tags

Category: Policy

Style: Talk

About this talk

In this talk, we will examine key research findings and technological innovations in the past 20 years that have led to the accepted practice of collecting all of the data. We show a difference between tangible (e.g. PII) and non-tangible data and show how seemingly harmless data can still be used to derive behavior (with examples!). We also examine how privacy perspective can change depending on your role or background and propose a perspective shift if we are to try to maintain digital privacy today.

Transcript [en]

hello everyone good morning thanks for joining I'd like to introduce our first speaker Sarah Harvey who will talk about anti-privacy anti-patterns take it away Sarah hi this is pretty cool because this is actually a pretty full room so thank you all for coming it's very clear that you really care about privacy and you find this topic interesting I will try and not bore you to sleep because I'm formerly an academic and I talk about boring things anyway so today we're gonna talk about privacy my name is Sarah Harvey I assumed that my bio was gonna be said before this but it was not so if you want to look up my bio you can look online in

Sched I'm currently a privacy engineer at Square with about 8 years of combined industry and academic experience in security and privacy I know some of you are into Twitter please feel free to take pictures of my slides and tweet them this is my Twitter if you're not sure you're on the correct account I posted this this morning right in preparation for this talk so what is privacy I think it's important first to establish kind of where I'm coming from in terms of my definition so that all of you are on the same page I think that privacy is an extremely fluid term it can mean a bunch of different things so I'm gonna try and narrow the scope of it so that it's

easier to see specific problems and I'm also gonna try and set up some assumptions so that you know you're not taken by surprise I'm the speaker here so you're forced to listen to whatever I say so I'm just gonna establish my set of assumptions for the rest of the talk about privacy so we'll start off with I think privacy is about the control of the data about an individual so I can say that I'm more digitally private because I have control over the information about me this is a far more rigid definition than what other people may think because it doesn't suffer from the subjectivity of what a person considers to be private or not specifically privacy is different

from secrecy I can be private about my data and it's not because I have anything to hide and privacy is also different from anonymity I can be private about my data but I might still have a presence or a persona on the internet and not be completely anonymous I like to say that data privacy includes inferred or derived data this includes behavioral data we derive from any existing collection of data from an individual an example might be information about my sleep schedule can be derived from when I turn on and off specific lights in my home so these are not tangible pieces not name or email address it might be things that you

derive from those pieces of information so maybe I use Gmail based on the fact that I use something at gmail.com let's assume privacy is not dead I'm not here to debate this please don't ask me questions about this I will also assume that everyone has a right to privacy and so therefore they have the right to and the ability to maintain privacy I'm not gonna talk about ads please don't ask me questions about ads so that's my whole spiel about privacy and assumptions I feel that since my talk also contains the word pattern in it I should define the word pattern so what is a pattern so I come from software engineering a design pattern is a

software engineering phrase it basically means it's a behavior you are using to solve specific problems because they're really effective so therefore an anti-pattern means that you have a really ineffective way of solving problems as illustrated by this diagram where I'm trying to go through the happy trees and the happy flowers and then I end up into like a maze full of monsters okay cool we got the definitions out of the way let's talk about the state of the world today we have breaches everywhere oh wait that's not actually the breach I mean these are the breaches I mean they're data breaches so you know bank and loan bank loan and mortgage documents leaked online health records

leaked online hotel records leaked online we all remember Equifax so what happens in these data breaches so one instance is we get credit card data stolen I'm sure many of you have had credit card data stolen in some way where like Bank of America calls and is like hey I saw you try and use like a ticket meter in New York and I'm like I just went to Safeway in California that clearly did not happen but this is a tangible piece of information social security number is a tangible piece of information if I lose my social security card I get identity theft I get you know people are filing taxes on behalf of me people are

opening credit cards on behalf of me Equifax let's not go into that further other tangible pieces of information might be you know your name your address your phone number your driver license information these are just physical pieces of information that kind of describe you so we saw all those breaches earlier so when we see something like this how do you react to that it says here Facebook says Cambridge Analytica harvested data of up to 87 million users I use Facebook let me open Facebook what was taken oh I don't know I don't know what data my posts like really like do I care about this um actually I do even though like it seems like there's

nothing really interesting there to be taken but the thing is that you know as a member of the general public I don't understand what these news mean but it still feels bad it still feels wrong it still feels like a violation it feels like a breach of trust now we're going to go and examine why that is the case but to talk about why it feels wrong we need to talk about the Internet so how do we use the Internet I don't know about you I like to use the internet to search for pictures of cats so you may have heard me talking about okay I like to use the internet to search so what is

search I come from academia I used to be a PhD student I did information retrieval research and I also did privacy research for us in academia search is an entire field of research it's actually a part of a broader field called information retrieval you're like great Sarah so what is information retrieval simplified an information retrieval system works as follows we have a user we have a system the user makes a query to the system I guess my transitions aren't working super well today and then after they make the query to the system then the system makes a decision to return a particular document or series of documents back and a document here could be a webpage it

could be images here's a web page here's some images it could really be anything it could be videos anything stored in that information retrieval system so you might say well that sounds awfully like a database and you would be right except for one key difference important in the system is a concept of relevancy now when I say relevancy that is not the same as accuracy relevancy is the measure of whether or not a document list is of interest to the user so this means that relevancy is actually a subjective measure because what I consider to be a relevant document might be different from what you consider to be a relevant document so here we go
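As a rough illustration of the retrieval model just described (a user sends a query, the system returns a ranked list, and relevancy depends on who is asking), here is a minimal Python sketch. The toy corpus, the `rank` function, and the `interests` profile are all hypothetical and do not come from the talk; real search engines use far richer scoring than term counting.

```python
from collections import Counter

# Hypothetical toy corpus; ids and text are invented for illustration.
DOCUMENTS = {
    "cat-photos": "cute cat pictures and kitten photos",
    "cat-care": "cat health care and veterinary advice",
    "big-cats": "lions tigers and other big cat species in the wild",
}


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def base_score(query, doc_text):
    """Count how many times the query terms occur in the document."""
    doc_terms = Counter(tokenize(doc_text))
    return sum(doc_terms[term] for term in tokenize(query))


def rank(query, interests=()):
    """Return document ids ordered from most to least relevant.

    `interests` is an assumed per-user profile: every interest word that
    appears in a document adds one point, so two users issuing the same
    query can receive different orderings.
    """
    scored = []
    for doc_id, text in DOCUMENTS.items():
        score = base_score(query, text)
        if score == 0:
            continue  # no query term matched, so the document is not returned
        terms = set(tokenize(text))
        score += sum(1 for word in interests if word in terms)
        scored.append((score, doc_id))
    # Highest score first; ties are broken by id to keep output deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


if __name__ == "__main__":
    print(rank("cat", interests=("photos",)))
    print(rank("cat", interests=("veterinary",)))
```

The same query, "cat", puts a different document first for each of the two users, which is the sense in which relevancy is subjective while a plain database lookup is not.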

ranked system so you know this user prefers the top document versus the bottom document so information retrieval systems unlike database systems present a ranked list based on what the system thinks is important to the user so information retrieval has been around for a very long time since the 1800s lots of stuff has happened and now we're in the world where everything is creepy so it's interesting to talk about I think what's really happened in the past 20 to 30 years we're gonna look at some things on a timeline compared to the state of computers and the state of the web i.e. what everyone manages to see so I'm gonna take you back to the 90s what

happened in the 90s this happened in the 90s you probably saw that movie last night hopefully so computers looked a bit like this this is an example of a computer back in 1990 we were already moving away from the big hulking mainframes of the past and computing power was you know increasing like the computers were getting faster way way faster but also the form factor was becoming smaller and smaller this was the advent of personal computing that's why you have a computer system that is personal to you on your desk in 1989 you may remember Tim Berners-Lee had proposed the idea of the World Wide Web

so by the 90s we had a really really early internet presence or web presence for instance the website for Space Jam which by the way if you go to the web and search for Space Jam this is still up from 1996 however information retrieval was becoming a pretty established field at this time there was already theoretical research into the idea of how to rank results how to translate free-form queries to the point that there were already two established conference venues SIGIR and TREC that were created so it's probably not a coincidence that it was in the 90s that we also start seeing information retrieval research getting more applied because up until this point it was all theoretical is

like you know it would be cool if we had some system that would return documents in some way but you know computers would never become that fast computers will never become that interesting like there's no way to apply this but personal computers are coming into being so now we have applied research TREC for instance is one of those applied conferences you can think of TREC as basically a hackathon for researchers basically researchers come together and brainstorm and try new tactics in search for about two to three days and at the end they all talk about it and being like this is cool this worked or no this was a total disaster so by the mid-90s then we started seeing

all the search engine companies coming into being you might remember that Google was formed in 1998 and then okay so all the search engines came into being we have Google but then by the early 2000s we hit this max of you know what search was capable of which you might remember happens to coincide with the dot-com bust so the dot-com bubble came into being because search was starting to really take off and then it hit this kind of ceiling and then everything crashed so at about the same time coincidentally in the research world we got this paper starting to tie in some ideas about something called context in web search so this is one of the earliest papers it

was published in 2000 and it started suggesting this idea of personalization and context as a way to improve web search why is that that's because with the technology in 2000 all we could do to make search better was ask users to input their preferences you may remember search in the 90s like click on the category you want to find the thing you want it's like I'm going to look at art and we're gonna look at autos so this is AltaVista I don't even know if AltaVista is around anymore but there were bajillions and billions of search engines it was like HotBot AltaVista Yahoo also at one point looked like this

in fact I think Yahoo still looks like this so this is an example what the search engine looked back then and it turns out that the reason why we were trying to do that was because people are really bad at knowing how to search if you read all of the personalization papers after the year 2000 you'll find that they all have the same introductory theme let's look at an example this is one of my favorite pictures to describe my state of work at any given time so I use I often use it for presentations like this one however I never remember where it's from so I often start off by googling you know the guy with the map and then that didn't

help me like at all so let's revise our search maybe it's the crazy-looking guy with the map that still doesn't return anything but I don't know why there's a weatherman up there but then I remember oh it's a meme that's why I like using it in slides so I put the crazy guy with the map meme and I'm like ah that's that's that's the picture I want you might remember that Google was revolutionary because it had none of those weird text links from before right if you notice all you had was just the text box with the buttons and the reason why Google dominated was it got

better at guessing what you were trying to do it didn't need you to specify the preferences you just had to revise your search a bit and then you would get to this so there might be a hint that you might be seeing as part of my slides if you look at the trend of the success of any major information retrieval company it's probably because they adopted personalization I have said some words what is personalization personalization is used to cater things just to you examples of personalization are like this it's basically any kind of recommendation system you see like Amazon or Netflix that's a good example of personalization it's based on

building some kind of profile or context on you so here Netflix recommends some videos based on what I've watched before and based on what I've not watched so you can clearly see I have not watched The Office and down here you can see Amazon recommends products based on what I've purchased or looked at or any kind of interests I've specified so as you can see here because I'm a security nerd that's why there's not just one but two YubiKey products in this list now personalization is made up of two parts the part you see here is individualization i.e. catering to me personally based on my preferences the other part is contextualization which is the act of making a scenario have

context let's look see what contextualization might look like I open Google music a few days ago Saturday actually because that's why it's a Saturday and here's some examples it says look it looks like you're home on a Saturday you must be you must be really sad let's play some party music or you know you've looked at you listen to this in the evening maybe you want to listen to it again because last Saturday evening you were also sad and played this music so the question is how do they know why does this feel weird the reason this feels weird it's all this kind of behavioral inference about you you didn't tell these systems that you

wanted this kind of information it just kind of made a guess for you and implicit in this is that there's a lack of consent it wouldn't be so terrible if the guesses were wrong but the fact that the guesses are becoming better and better it just seems really really creepy so it feels like another violation so this talk is all about anti-patterns so I'm gonna introduce the first anti-pattern it's inference without explicit consent why are we inferring context without consent remember I talked about how we're bad at searching so the paper from 2000 proposed you know okay people are bad at searching what if we could automatically infer

search context to help people find things well it's the year 2000 so the ideas on inferring context are a little weird like okay let's look at the contents of what you're editing in Microsoft Word or let's look at your email messages or research papers while you're editing a document in Emacs or you know let's suggest content from the web based on like what you're reading or editing if you notice all of this is (a) all short-term context and (b) all client based this is a problem because it turns out none of this is implementable because remember this is a machine from the 90s and even

in 2000 it still looked like this and so like if you want to look at some speeds and some computing power around this era you know processors at most were maybe a gigahertz but a lot of them were like around 500 to 800 megahertz I couldn't find an article that talked about disk drives around this period but apparently some sources say the largest disk drives of this era were about 100 gigs most consumers had disk drives in the tens of gigabytes for context Wikipedia as of 2015 if it's entirely compressed I think just text can be stored in 100 gigabytes but uncompressed it's some 10 terabytes so we have the situation of

we're well into the era of personal computing everyone has their own device but trying to store the entire index of the web is basically impossible so like we can't do client things clearly what we need to do is move everything to server based systems so this is a prediction in 2000 that said with the cost of running a large-scale search engine already very high it is likely that server based full scale personalization is currently too expensive for major web search engines however advances in computer resources should make large-scale server based personalized search more feasible over time that is a scarily accurate prediction from 20 years ago so it basically says personalized search would become feasible in the future when CPU

and storage got cheaper so to summarize the research community decided pure search methods can't capture actual intent humans are bad at expressing intent context is valuable for deriving intent and client-side methods are not powerful enough so what does this look like on the outside well by 2008 Google started rolling out personalization on an opt-in basis you might remember this it was called iGoogle before we started talking about iPods I guess so you could opt in to have this customization and you know people were really excited about this here's some guy on the internet writing about it but then by 2012 as we moved more into the personalization world some of that kind of control starts going away and this

guy gets pretty mad about it he's like look I'm having all of this contextualization and personalization and I'm not choosing to sign up for this Google is just doing this for me so we're talking about anti-patterns here's another anti-pattern lack of transparency it it's not clear to the user what's happening behind the scenes we also talked about how it's impossible to control so anti-pattern number three erosion of control I'm going to very awkwardly switch to a different topic here around the same time we started seeing the democratization of machine learning techniques if and this is to do with the fact that the search community and the machine learning community are actually very intertwined so for a lot

of information retrieval researchers anytime they made a new breakthrough in information retrieval it would coincidentally also make a breakthrough in machine learning if you don't remember what machine learning is here's a helpful diagram data goes into mysterious black box derived by computer scientists and results come out the problem with this model is that it suffers from the garbage in garbage out problem now if we look at the fact that the applications of machine learning are just about everything you end up with something like this if you remember Tay which was a chatbot created by Microsoft it would create responses based

on who talked how people talked to it and eventually came up with this so we come up with anti-pattern number four you know we have this trust in black box algorithms we don't know what they're doing they're producing garbage and we don't know why so let's recap personalization is maybe doable in today's resources personalization research has sped up machine learning development personalization research has democratized machine learning machine learning is useful for a lot of these products which require data there's a precedent for collecting context data and there's a general move to personalize everything let's talk about apps and services how apps and services are today this kind of scenario might be familiar to you I'm trying to use a

productivity app because I'm bad at productivity it's called ClickUp it has some means of handling things but you know it tells you it'd be really cool if you know ClickUp could talk to other things ClickUp is a server-based system like all the data is stored in some service remotely it's not like on your personal device so this might look familiar to you it'd be great you know if I could integrate with Slack with GitHub with Bitbucket with Google Calendar look at all of these apps I can integrate with to make productivity even better even within Gmail I can click to get add-ons and integrate with a third party look at all

of these products I have no idea what any of them do let's go to Dropbox Dropbox is a name I recognize let's see what what happens if we try to integrate with Dropbox you will see that Dropbox requires a ton of permissions like trying to view your emails manage your emails connecting to some service it doesn't even even specify what it is I'm connecting to an external service and seeing and downloading all of your contacts I have no idea why Dropbox needs to see and download all of your contacts so anti-pattern 5 we now have a situation where we have an internet of integrations so we have Internet of Things I now propose internet of

integrations we want to personalize and make better products but that cannot be done locally so we do it on the server with all the features prompting us to integrate with other services which brings us back to this oh wait I mean this so remember this article how do I understand this article how do I understand the situation where Cambridge Analytica has taken all of my data even though Facebook was somehow somewhere in between Vox has actually a pretty good diagram and explanation of what happened with Cambridge Analytica and like why it was so devastating so to recap there were two hundred and

seventy thousand users who took a quiz they took a quiz that was provided by an app that was on the Facebook platform but the app was built by a third party which users agreed to and then that third party took that data even though it was from Facebook and used it somewhere else so point being the past twenty thirty years has led us to the situation where a third party needs to collect all the data and it's unclear and because it's theirs and not Facebook's they get to do whatever they want anti-pattern number six unclear data ownership so let's recap again we got here because technology with good intentions has led us to all of these

situations we wanted to build better and better products but as a result these are all the anti-patterns I derived so therefore these are all the things we need to re-evaluate and re-examine before we can truly get a grasp on privacy so things like okay inference without consent maybe we need to ask people you know I'm gonna do this thing with your data are you cool with it lack of transparency for example like by the way this is exactly what's happening with your data here are some controls to you know check box yes I want this data to be transferred here I want this data to be manipulated in this way for black box algorithms maybe let's

not use machine learning for everything or use better algorithms or better statistical methods that actually have proven track records of you know not garbage in garbage out or at least we know how the garbage is getting manipulated and then in terms of like unclear data ownership maybe we want to provide some of that ownership back to the users some of that control back and internet of integrations can we not build more services that talk to more services that would be great that's all I have thank you for listening and if you have any questions I think I have about five questions left five minutes left sorry not five questions I don't have questions you have questions and here's my Twitter

again [Applause] any questions the question for those who didn't hear was do I have examples of companies respecting those so one of the things I've started to notice is that especially with GDPR coming out there are more and more European companies that have very explicit privacy policies and very explicit text about how data is coming in how data is coming out they also have very clear wording there was one that had a suggestion of like you could install Little Snitch which is a firewall thing and say you could block all of the outbound access from this app and we would not care and I'm like that's great this data is not flying elsewhere I know

it's all local on my system I know what it's doing locally so there's a few examples not a ton but I think there's gonna be more coming up have you seen a shift in how end-users perceive these problems in the past few years I have started seeing people caring more I will be quite honest for those of us in the privacy research community events like Cambridge Analytica took us by surprise because we were not expecting a major shift for another five years so the fact that that happened in the past few years and has kick-started a lot of questioning and a lot of the general public asking why is this happening what's going on is actually

great for us because now we can be like we need to build better controls and get a better handle to the situation but we were not expecting this to happen for another five years

so do you know if there is any research on how the explicit consent from the user should be taken effectively because users tend to click on you know just whatever is presented to them that is a very good question that is a very open question and a developing research question I don't have good answers for that I do know a lot of my former colleagues in my research lab are working on that kind of thing but unfortunately I don't have any good answers I'm hoping it improves in the next five years in terms of GDPR and the other privacy regulations coming down the pike do you think there's an

acceleration of this effort coming um I would say so I think from the folks I talked to and the folks I interact with in the community there is a growing I mean okay we're privacy nerds privacy engineering nerds so we talk about this all the time but I think there is a growing interest and from a lot of different folks about okay this is clearly becoming more a more important issue so I would suspect more and more companies are gonna try and be more transparent in the future the problem is is that and I didn't really cover this in my talk is that a lot of the technology and a lot of the stuff that

has been built now it's really hard to retrofit so if you remember like the mantra of you know build security from the start don't retrofit security I think we're in the age of you know you have to build privacy from the start you cannot retrofit privacy and that's the situation we're grappling with today great talk and thanks for presenting is confidentiality synonymous with privacy just curious to see if there's a difference or what the connection is between the two that is a very nuanced question and I'm happy to talk about that after the talk because I know I will spend another 20 minutes talking about that so find me after the talk if you didn't

know I'm staff and a presenter so I'll be wandering around through the venue for the rest of the day so if you find me and you want to chat about this feel free I also have this crazy like bad data earring so pretty easy to spot so feel free to have longer discussions with me during the day
