
Let's get started. Please welcome Christina.
Good afternoon, everyone. I am Christina Lou. Oh my goodness, this mic. Who am I? I am a senior security engineer at Cisco Meraki, and I'm also a Certified Information Privacy Technologist. I'm at Kulu on, I guess it's Twitter, and I have a website if you want to find me later. So, what will this talk cover? First, we're going to get into why this is important. Second, we're going to get into what personally identifiable information, or data, actually is. Then we'll go into the dangers of re-identification, and we'll end with some practical takeaways that hopefully you can go and implement. So, to better understand the power of personally identifiable data, I want to
play an imagination game, and to do this I'm going to make it a game for all the burrito lovers out there. I'm here to espouse the wonders of Burrito Match. This app is the hottest thing in burrito recommendation engines. It will take all of your likes and dislikes, run them through its algorithms, and find you your perfect burrito. To do that, it will need your dietary stuff: are you pescatarian, vegan, omnivore? Do you have any dietary restrictions? Are you gluten-free? Do you need your food to be halal or kosher? Do you, for some reason, hate avocados? It will take all this information, run it through its algorithms, and find that perfect burrito.
But not only is it the perfect burrito, it's the perfect burrito that's closest to you right now, because time is of the essence when you're hungry and kind of hungover. This app is so freaking good that you use it, like, every day for six months, because ain't nobody got time to cook. But what if this app was not forthcoming about its data-sharing policies? What if the information that you like every burrito with extra sour cream and two Modelos, which exceeds the doctor-recommended weekly good-idea levels, gets sold to health insurance providers, and for some reason your health insurance premiums go up? Now, even worse, what if the information
then gets sold to organizations and companies that do religious surveillance, which is possible because the app has location data and halal and kosher filters? Now this app suddenly goes from whimsical and fun to dangerous and disturbing. So thank goodness this app is completely imaginary. But there were, and are, apps that are personal data nightmares. Does anybody remember how, on the super old iPhones, the light only turned on when you took flash photography? That was a feature that nobody liked; what we actually wanted was the light to stay on so we could use it like a flashlight. Because of this user-driven demand, there was suddenly a proliferation of flashlight apps all over the place. I'm going to talk about
one in particular: the flashlight app made by iHandy. In an analysis conducted by Appthority, a mobile security software company, they found that the flashlight app had access to the user's location, could read the user's calendar, could use the camera, and had access to the unique ID of the device itself. And with that information, it also had the ability to send all of it up to advertising networks, all without user consent. And users do care what happens to their data: in a 2022 consumer privacy survey done by Cisco, 76% of the people surveyed said that they would not buy from a company they do not trust with their
data. And not only is this a trust issue, it's a user respect issue, because whatever code you write, whatever software you're involved in building, it will impact people. You want your impact to be positive, and you want to be building better and safer products. You don't want products rife with unintended consequences in the code and in the architecture, because when privacy and security are mishandled, the consequences affect people in very real ways. Here's a chart from 2017 from Experian. It's a little out of date, but it shows the dollar value of people's information on the dark web. Your Social Security number was worth about a buck, which is kind of surprising, but passport information was worth $1,000 to $2,000. So in this chart you can see some real, quantifiable harms that this generates.

It should be obvious by now that privacy is important, but what is privacy? Privacy usually gets talked about in terms of buzzwords and rants in our industry, and we usually hear about it in terms of damages and the millions of dollars lost in data breaches. But at its core, privacy is an individual's right to maintain control over their personal information. Privacy allows people the ability to be themselves; it gives them the ability to control what to share, where to share it, and who they're sharing it with. And,
thank goodness, privacy can be achieved through policy, such as legal and corporate policy, and also through technical engineering controls. And hand in hand with privacy comes security, especially in this modern age. We should all know the answer to this one: what is information security? Again, this industry is ripe with buzzwords and rants, so many rants, and with talk of threat actors, and our most popular threat actors are always hackers, hackers, and more hackers. Now, Hackers the Angelina Jolie movie aside, security at its core is the systems and controls built to protect information. That's it; that's what it is. And the information we're protecting is things like proprietary code, credit card numbers, and yes, personally identifiable information, often referred to in acronym form as PII. So security can help achieve privacy, but it alone is not enough to protect privacy and PII.

When we get into PII, it's usually categorized into two buckets: sensitive and non-sensitive. What counts in which bucket will depend on your country, your jurisdiction, and your laws, so be very careful when you're creating the classification for what is sensitive versus non-sensitive PII. Sensitive PII, as defined by the Department of Homeland Security, is data that, if lost, compromised, or disclosed without authorization, could result in substantial harm, embarrassment, inconvenience, or unfairness to an individual. The TL;DR is: any information that can quickly and accurately identify an individual. Some examples: at the top is the Social Security number, a number that follows us throughout our lives and that we need for everything from employment to housing. Driver's license numbers, which we don't usually change unless we're moving to another state. And biometric information, because I don't have the Men in Black zapper, so I've got to keep these. Then we get to non-sensitive PII, which is basically information that by itself is not generally considered a risk to an individual's privacy or security. This information is generally collected for things like marketing, customer service, and research, but care is still needed to ensure that this data is protected
from unauthorized use, access, disclosure, destruction, all that good stuff. Because even though one piece of data by itself may be harmless, multiple pieces of data together can quickly and accurately identify an individual, and that becomes dangerous. For example, if you just have an address, you might not be able to find that person, especially if they're living in a giant apartment building. But if you have an address, a gender, and a birthday, you may be able to find the person you're looking for.
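To make that linking idea concrete, here is a minimal sketch of a quasi-identifier join; every record, name, and field below is invented for illustration, not taken from any real data set:

```python
# A minimal sketch of a linking attack on quasi-identifiers.
# Every record below is invented for illustration; nothing is real data.

deidentified_records = [
    {"address": "12 Oak St", "gender": "F", "birthday": "1990-04-02", "diagnosis": "asthma"},
    {"address": "12 Oak St", "gender": "M", "birthday": "1985-11-17", "diagnosis": "diabetes"},
]

public_directory = [
    {"name": "Jane Doe", "address": "12 Oak St", "gender": "F", "birthday": "1990-04-02"},
    {"name": "John Roe", "address": "12 Oak St", "gender": "M", "birthday": "1985-11-17"},
]

QUASI_IDENTIFIERS = ("address", "gender", "birthday")

def link(records, directory):
    """Join 'anonymous' records to named people on shared quasi-identifiers."""
    index = {tuple(p[q] for q in QUASI_IDENTIFIERS): p["name"] for p in directory}
    for record in records:
        key = tuple(record[q] for q in QUASI_IDENTIFIERS)
        if key in index:
            yield index[key], record["diagnosis"]

for name, diagnosis in link(deidentified_records, public_directory):
    print(f"{name} -> {diagnosis}")  # the de-identified row now has a name
```

With only three individually "harmless" fields shared between the two sets, every record links straight back to a name.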
So, to protect data while still being able to use the data we collect, there's a concept called de-identification: the tools and techniques that organizations use to minimize the privacy risk of storing and publishing data containing PII. Here are some common de-identification methods. Note that they may be called different things depending on your industry, but the ideas are very similar.

The first one we're going to talk about is redaction, the idea of removing data from a data set. I like to think of redaction in terms of Hollywood military movies, where they'll take a letter and cut out the sensitive information, so you're left with a letter full of holes. Same thing: removing data. For example, if you have a data set with people's names and Social Security numbers, can you remove the Social Security numbers and still get done what you need to get done?

The next one is masking, also known as pseudonymization. It's the idea of obscuring your PII. Instead of storing cleartext Social Security numbers, can you replace them with stars, or run those fields through functions that generate realistic random strings?

Next is generalization, the idea of grouping your data to get rid of specifics. For example, if you have a data set with people's ages, instead of storing their actual ages, can you just say that these people are over 18 or under 18, over 65 or under 65?

Then we get to obfuscation, which is the idea of adding noise to your data. Again, think about the data set with ages: instead of storing the exact age, can you round it up or down to the nearest decade? Can you report everybody's age in the data set as a range, say between 46 and 54, or something like that? A note here: obfuscation can be aggressive and can make your data harder to use, but if you're dealing with data sets with very sensitive information, like health care records, this could be a good way to go.
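Here is a small sketch of all four methods applied to one invented record; the field names, the salt handling, and the noise range are illustrative assumptions, not a production pipeline:

```python
import hashlib
import os
import random

# One invented record; every value is made up for illustration.
record = {"name": "Ada Example", "ssn": "123-45-6789", "age": 47, "zip": "98765"}

def redact(rec, field):
    """Redaction: drop the field entirely if the job can be done without it."""
    return {k: v for k, v in rec.items() if k != field}

SALT = os.urandom(16)  # keep this secret; same input -> same token within a run

def mask(value):
    """Masking / pseudonymization: replace the raw value with a salted hash."""
    return hashlib.sha256(SALT + value.encode()).hexdigest()[:12]

def generalize_age(age):
    """Generalization: trade precision for privacy by bucketing."""
    if age >= 65:
        return "65+"
    return "18-64" if age >= 18 else "under 18"

def obfuscate_age(age):
    """Obfuscation: add a little noise, then round to the nearest decade."""
    return 10 * round((age + random.uniform(-3, 3)) / 10)

safe = redact(record, "ssn")                      # redaction
safe["name"] = mask(safe["name"])                 # masking
safe["age"] = obfuscate_age(safe["age"])          # obfuscation
safe["age_band"] = generalize_age(record["age"])  # generalization
print(safe)  # in practice you would pick one age treatment, not both
```

One design note on masking: a salted hash keeps the pseudonym stable within a run without exposing the raw value, but for low-entropy fields like SSNs, a leaked salt makes the mapping brute-forceable, so the salt needs the same protection as the data.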
So safe data handling and disclosure is more important now than ever before, because companies are getting fined big time. Back in May of this year, the Irish Data Protection Commission fined Meta 1.2 billion, with a B, euros and ordered Meta to stop sending EU users' information to the US. Meta is actually still appealing this, so only time will tell whether they'll have to pay it and what they'll have to do with that data.

Protecting PII is important because we're really not anonymous on the internet anymore, and re-identification can happen from data sets that have been de-identified. This was demonstrated by two researchers from the University of Texas, Arvind Narayanan and Vitaly Shmatikov. They're not actually Lego people; I just couldn't find real images of them. In 2006, Netflix ran a contest called the Netflix Prize, a $1 million contest where they asked engineers to help create a better movie recommendation algorithm for Netflix. To facilitate this work, Netflix released a data set with over 100 million movie ratings from almost half a million of their subscribers, covering six years of data. What our researchers did was take the information from the Netflix data set and cross-reference it with public records on IMDb, and from that they could re-identify users in the Netflix data set by simply matching how someone rated a movie: did they like this movie or dislike it? Two of those ratings could be inaccurate, and the posting dates could differ by up to 14 days, and with just that little bit of information, Narayanan and Shmatikov were 99% confident that users could be re-identified. Kind of scary. They also published that other traits, like sexual preference and political party, could be inferred about a person, because how we rate movies is very specific to our own personal interests.
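The matching doesn't even need to be exact. Here is a rough sketch of that fuzzy-match idea on invented data; the real study used far more ratings and a statistical scoring method, but the tolerances below (scores close, dates within 14 days) echo the ones just described:

```python
from datetime import date

# Invented toy data; the real attack used the Netflix Prize set and IMDb.
anonymous_ratings = {  # movie -> (stars, date rated) for one "anonymous" subscriber
    "Movie A": (5, date(2005, 3, 1)),
    "Movie B": (1, date(2005, 6, 10)),
    "Movie C": (4, date(2005, 9, 20)),
}

imdb_profile = {  # the same person's public review activity
    "Movie A": (5, date(2005, 3, 8)),
    "Movie B": (2, date(2005, 6, 12)),
    "Movie C": (4, date(2005, 9, 30)),
}

def ratings_match(a, b, max_star_gap=1, max_day_gap=14):
    """Two ratings 'match' if the scores are close and the dates are within two weeks."""
    (stars_a, date_a), (stars_b, date_b) = a, b
    return abs(stars_a - stars_b) <= max_star_gap and abs((date_a - date_b).days) <= max_day_gap

shared = set(anonymous_ratings) & set(imdb_profile)
matches = sum(ratings_match(anonymous_ratings[m], imdb_profile[m]) for m in shared)
print(f"{matches}/{len(shared)} ratings match -> likely the same person")
```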
Another example of re-identification from unlikely data sources is an experiment done by Dr. Latanya Sweeney, the founder and director of the Data Privacy Lab at Harvard. She ran an experiment showing that you could match hospital records to newspaper articles. She got a data set from the state of Washington that had information about patient demographics, clinical diagnoses, and procedures. It was de-identified, so names and addresses were removed, but some records still had their zip codes. What she then did was go to LexisNexis, which is a newspaper database, and she found 66 articles that matched her search term and her location, which was Washington. Newspapers are in the business of informing the public of current events, so they do publish specifics like name, age, treatment hospital, and other information. So she basically matched the information from a newspaper article to the record in the patient data set. Here's one example of that. In the newspaper, you can see in yellow that this was a 60-year-old man, which matched back to the record in teal; the location, a Soap Lake man, which matched back to the zip code in blue; the time of the accident, Saturday afternoon, which matched back as well; how this poor soul got in the accident, a motorcycle accident, in green; and then the treatment hospital, Sacred Heart Hospital, in orange.
And in pink you can see that the poor person's name is Ronald Jameson. Now that we know it's Ronald Jameson, we can also see in the patient data set that he was charged $71,000 for care, and that he has a slew of other things he's dealing with, like pulmonary problems from the accident.

So, I am not a lawyer, and there are different laws that companies can be sued or fined under if they have data breaches and mishandle data. At the time of this talk, the US has no comprehensive federal privacy law that standardizes how PII should be handled. There is one called the American Data Privacy and Protection Act, but that's still a bill, not a law; it was introduced last year, and I have no idea where it's going to go from here. As you can see from this map, there are different states with different privacy laws in different statuses, so PII enforcement is really piecemeal by state. So what can we do? Even though this all sounds very ominous, there are five things we can do. The first rule of PII Club is: don't collect or store unnecessary data. The second rule of PII Club is: don't collect or store unnecessary data. If you remember nothing else from this talk, just don't collect or store unnecessary data.
If you do nothing else, this will get you far. Two: if you're going to be storing data, automatically delete old data. Create a schedule for when that data goes away; that's called a data retention policy. Luckily, most cloud storage systems like AWS have configurations to make this a scheduled thing, so you can set it and forget it (there's a small sketch of this below). Three: use only the data needed to get the job done. Advocate to be incredibly selective about the data that gets processed and shared, because we want to make it harder for re-identification attacks to succeed. Four, and this one we all should know: build for privacy and security right in the beginning, because it's never cheaper, less effort, or faster to bolt it on later. If you try to force it in later, you may end up with mission-critical systems that then have to be materially changed or retired because they're privacy law violations. So also build to the strictest standard; for most of us, that's going to be GDPR. Finally, work with a privacy lawyer. Privacy law is complicated, varied, and quickly changing. Even in California there were new laws coming in in January; in Colorado they literally had a new law coming in in July; and with the new stuff, who knows when it will come in and what you'll need to do.
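As a concrete sketch of rule two's "set it and forget it", here is what a scheduled deletion policy can look like as an S3 lifecycle rule via boto3; the bucket name, prefix, and 365-day window are invented placeholders you would replace with your own retention schedule:

```python
import boto3

# A sketch of "set it and forget it" data retention on S3.
# Bucket name, prefix, and the 365-day window are illustrative assumptions.
s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-user-data",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-user-data",
                "Filter": {"Prefix": "user-exports/"},  # hypothetical prefix
                "Status": "Enabled",
                # Objects are deleted automatically ~365 days after creation.
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```

Once this rule is in place, expiration happens on the bucket itself, so the retention policy keeps working even if the team that wrote it moves on.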
So just remember: the code that you write and the systems you work on have a human impact, even if at the surface level it doesn't seem that way. We as security engineers are the stewards of our users' data, so it's important to know how users are expecting us to protect their identity, because it's the right thing to do, even if it takes a little more time or effort to build. After all, at the end of the day, I know that you would want the company responsible for your PII to be handling your data with the utmost care and consideration, and to be doing the right thing too. Once again, my name is Christina Lou.
I want to say thank you to BSides. That is a QR code to my LinkedIn, not anything nefarious. Thank you to everyone here, on the camera, our volunteers, and our AV staff. Thank you. [Applause] Is there time for questions? No? Okay, two, I have two questions. Going once, going twice... oh,
okay. Do you think some form of anonymity could help in applications with saving users' data? Is there any research in that area? Yeah, so there are functions and things, they're called, um, my brain at the moment, like k-anonymity, and, oh, I can't remember, it's like l-something. It's very mathy. But basically a lot of data scientists are already doing these kinds of anonymization functions. They're unfortunately kind of the best things we have right now, but they're not perfect, so they can still be reverse engineered, because true anonymization is actually incredibly difficult. Yeah.
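(For reference, the "l-something" here is most likely l-diversity, a companion notion to k-anonymity.) A minimal sketch of what a k-anonymity check looks like, on an invented table: a data set is k-anonymous over some quasi-identifiers if every combination of their values appears at least k times, so no single record stands out.

```python
from collections import Counter

# Minimal k-anonymity check on an invented table; rows are made up.
rows = [
    {"zip": "987**", "age_band": "18-64", "diagnosis": "asthma"},
    {"zip": "987**", "age_band": "18-64", "diagnosis": "diabetes"},
    {"zip": "987**", "age_band": "65+",   "diagnosis": "flu"},
]

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier combination appears at least k times."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) >= k

print(is_k_anonymous(rows, ("zip", "age_band"), k=2))  # False: the 65+ group has one row
```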
Are there any tips for, uh, unstructured data, like large... Oh yeah, so if you just need data to play with, there is a great website called Kaggle, that's k-a-g-g-l-e. They have data on all sorts of things; my favorite one right now is the Thailand tourism data set, it's fascinating to see what Thailand is doing. And if you just need to generate a JSON blob or something real quick, there is a website called Mockaroo that will absolutely do that for you with actual random data, because you shouldn't be testing with prod data. Ever.
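If you'd rather generate throwaway test records in code than download them, here is a quick sketch using only the standard library; all fields and value pools are invented:

```python
import json
import random
import uuid

# Quick fake-data generator for tests, so prod data never leaves prod.
# All fields and value pools here are invented for illustration.
FIRST = ["Ada", "Grace", "Alan", "Katherine"]
LAST = ["Example", "Sample", "Test", "Demo"]

def fake_user():
    return {
        "id": str(uuid.uuid4()),
        "name": f"{random.choice(FIRST)} {random.choice(LAST)}",
        "age": random.randint(18, 90),
        "zip": f"{random.randint(10000, 99999)}",
    }

print(json.dumps([fake_user() for _ in range(3)], indent=2))
```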
Is that it? Yeah, there was a talk a little bit earlier this morning about machine learning and kind of the [unclear], as people are calling it, right? One of the things that I think is interesting, and I want your perspective on, is that as companies have started tracking employee usage and heuristics data, generating all this stuff to put together insider threat profiles and all kinds of other things, you can make an argument that that could be considered PII, because it's fingerprinted and unique to you, like biometric data. What do you see that kind of turning into? So actually, in California this year, one of the expansions to the California Privacy Rights Act is that there is now a PII category specifically for employees. The old stuff was just for general, regular people, but California specifically has one for employees internally. And I don't know what other
states do or don't. I'm not a lawyer, so hashtag not legal advice. Not legal advice!
I think a lot of people in the crowd are probably in security engineering and things like that. How do you justify a lot of these initiatives to management and the wider organization? Because sometimes these things can fall down in priority or go up in priority. Yeah, that's a good question. Well, the GDPR fines are very expensive. So, okay, they give you choices: for a less severe GDPR violation, it's 2% of last year's revenue or, I think, 10 million euros, whichever is higher; and if it's a severe violation, it is 4% of last year's revenue or 20 million euros, again whichever is higher.
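As a back-of-the-envelope sketch of those two tiers (the revenue figure below is invented):

```python
# Back-of-the-envelope GDPR fine ceilings. The revenue figure is invented.
revenue = 2_000_000_000  # last year's global annual revenue, in euros

less_severe = max(0.02 * revenue, 10_000_000)  # 2% or EUR 10M, whichever is higher
severe = max(0.04 * revenue, 20_000_000)       # 4% or EUR 20M, whichever is higher

print(f"less severe: up to EUR {less_severe:,.0f}")  # EUR 40,000,000
print(f"severe: up to EUR {severe:,.0f}")            # EUR 80,000,000
```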
So not getting fined is probably a good impetus. Is that it? Are you... are you... going once, twice, thrice, sold! [Applause]