
Your Arch-Nemesis is a Data Scientist: What's the Difference Between Security and Privacy Work?

BSidesSF 2026 · 29:39 · 25 views · Published 2026-05 · Watch on YouTube ↗
Style: Talk
About this talk
A practitioner-level comparison of security and privacy engineering, arguing that the two disciplines differ fundamentally in mindset, tooling, and legal grounding. Parker-Wood explains why personal data is broader than PII, why privacy laws function more like a compiler spec than a controls framework, and what security teams must add — richer data catalogs, purpose tracking, irrevocable deletion, and DSAR completeness — to adapt their programs. Along the way she explains why well-intentioned data scientists may be a bigger privacy threat than nation-state attackers.
Original YouTube description
Your Arch-Nemesis is a Data Scientist: What's the Difference Between Security and Privacy Work? Aleatha Parker-Wood Security and privacy look very similar at the 30K foot level, but they have some major differences in requirements, tooling, and mindset. We'll go over what the key differences are, how to adapt your existing data security tools to cover privacy, and why your data scientists should terrify you. https://bsidessf2026.sched.com/event/28709e05e419fb191c4fde15b19872ac
Transcript [en]

I'd like to introduce you to Aleatha Parker-Wood. What she's famous for is that, one, she's been named one of the top women in machine learning three years running. So, it's just an amazing kind of accomplishment, and she's actually really known for building high-powered technical teams, and she's also a CEO whisperer about security. So, let's welcome Aleatha. Cheers. Sorry that was loud. All right. I am here to give you a practitioner level view of the difference between security work and privacy work. Why am I qualified to give this talk? Well, uh I have worked both sides of the house. I'm currently field CISO at Clearly AI. We're amazing. You should look us up. Uh, and prior to that, I

spent five years as a principal privacy engineer at Amazon, which as you may know has data. Um, I spent five years running a security R&D lab for Symantec. And then I did a mid-career PhD in data governance. And before that, I did a lot of very boring things that you don't want to hear about. But, um, I've worked both sides of the house. I've done both kinds of work. Uh, I have spent a lot of time explaining to security teams why what they're doing is not working. So, I thought I would encapsulate all of that into a talk. I'm assuming since you're at a security conference that you're probably familiar with the core concepts of security. So,

I'm just going to speedrun the core concepts of privacy and then we can talk about how that relates to the kinds of work that you do. So, first things first, security people tend to think of personal data as being PII. Uh, and that is a horrible trap to fall into. Uh, in fact, most personal data doesn't even contain PII, or at least not the way that you think of PII. Uh, so it's a very different vibe. I will go into more details in a second. Um, privacy gives you a whole bunch of rights under the law. Privacy gives you the right to ask for your data to be deleted. You get to ask for it to be

corrected. Uh you get to know what kinds of data people have about you. You have purpose-of-processing limitations. If you didn't tell the customer that you were going to use their data for this, you can't do it now. You have to have told them in advance. Uh and then there is consent, which we are all familiar with from those lovely little boxes on every website. Uh, and those, as it happens, are like Chaos Monkey for data pipelines. It's amazing. So personal data, why is personal data not PII? And why do I not like to say PII? There are a handful of laws which do contain PII, HIPAA being one of them, but GDPR, CCPA, CPRA don't have the word

PII anywhere in them. Instead they talk about identifiable natural people as opposed to unnatural people. Uh and personal data is anything where you can identify who this data came from either directly or indirectly. So that could be tied to PII. You could tie it to a social security number. You could tie it to a driver's license. You could also tie it to a user ID or it could be tied to a session ID which is tied to a user ID. So anything that you can reach starting from an identifier that goes with a person is personal data probably asterisk. Also please note I am a security engineer not a lawyer. Please consult your lawyer. So these are the core rights that come

up. These are the things that you're solving for when you are trying to do privacy work. Uh privacy work unlike security work is very much focused around what you do with personal data. It's not just about confidentiality. It's not just about repudiation. It's really about how you use it, where you use it, who you gave it to. Uh so it's a much broader idea of data handling than just the CIA triangle. Um they have the right to know what you have, also known as a DSAR. You can be asked to fix data if it is incorrect. Users can ask you at any time to delete their data. Uh and users can

at any moment in time revoke consent. Even if they gave you consent, they can change their mind. They can take it away. Purpose of processing. This comes up a lot in cookie banners, so you're probably at least a little bit familiar with it. Um, you have things like essential purposes. You have I'm providing you a service. I need to know who you are to provide you service. I need to do security. I need to collect tax data. Um, and then there's all the other stuff. There is analytics. There's ads. There's training machine learning models. There is resale of data. Those are all non-essential purposes. Uh and just like anywhere else in life, you need to ask first and you need to get

affirmative ongoing consent to do anything. And if they say no, you have to stop. And that means having a way to check consent everywhere in your data handling infrastructure, every place, every time the data gets used. Not a small undertaking. So what are some of the core differences between security and privacy in terms of mindset, in terms of everything else? Uh first off, what does bad look like? Security bad looks like I had an outage. I had a security breach. I had ransomware. I had IP theft. Privacy is like I failed a deletion. I failed a purpose limitation. I, God forbid, got child data in my machine learning models and now I'm screwed.

Who is the problem in security and privacy? In security, you have a bad guy. Yes, maybe somebody made an innocent mistake, but there's a bad guy waiting to exploit that innocent mistake. In privacy, there's almost never a bad guy. Your worst adversary is a very well-intentioned data scientist. They're the most likely to reidentify data that was supposed to be anonymous. They're the most likely to keep things they shouldn't. They're most likely to not tell you what they have. And they're doing it because they really, really mean well, and they really want to help the business. So, it's a very different mindset. um you are instead of trying to fight off bad guys, you are explaining to

people why it is that what they're doing is incredibly dangerous to the company. It is a kinder, gentler, more patient practice than doing security work is. Next up, I think I presaged this a little bit, but security thinks of privacy first and foremost as being confidentiality. When you ask a security person about privacy, they will start talking to you about cryptography. They will start talking to you about maybe homomorphic encryption if they're excited about that. Um, they will talk to you about access control. Privacy is about appropriate use of, retention of, and deletion of data. So that starts with confidentiality in some sense. There's a list of people who can know the data. But there's a lot more to it than just

who knows. It's what do they do with it. Um, and then security tends to think of data in sort of these big buckets. We're like, okay, this is critically sensitive data. We have a critically sensitive zone. We keep the critical data in the critical zone. We're good. Um, privacy has a much more granular notion of how to handle data and what's important. You don't want to give child data to your machine learning model. You don't want to give it to the ads targeting system. Uh so you have a much more nuanced view of what the different zones of your enterprise are, what kinds of controls you need in place to keep data from going from one place to another

where it should not be. And then last and not least, security is very concerned with making sure that data doesn't get lost. We want to make sure there's data integrity. We want to make sure there's backups in case there's ransomware. We are very concerned with not losing data. Uh and privacy is all about losing data. We love losing data. We want to delete all the data everywhere. We would prefer you didn't collect it in the first place. But if you're going to collect it, we want to delete it. So again, there's a little bit of a mindset shift. You're like, we want to get rid of this. We don't want to keep this. Unless of

course it is a security log or tax and accounting data or a bunch of other reasons. Like there are exceptions to the rule, but mostly we like to get rid of stuff. So wait, you're telling me that I need to actually know things about what's in my data set? If you think security data inventory is hard, and it is, don't get me wrong. Security data inventory is very hard and just getting to the point where you know what data you have is a struggle. Privacy is worse. So privacy means you have to know not just what you have in what databases. You need to know at a very granular level what it is. And

you need to know what purposes systems are used for. Is this the ad system? Is this the machine learning system? Is this the tax and accounting system? What is this data going to be used for? And can it go there and be used for that purpose? So to get there, you need to have very fine grained notions of access control and filtering and that's going to line up with the laws that constrain what you can do with stuff or the, you know, just being a decent human being and using it in the ways you said you were going to use it. But either way, you need to understand what your data is at a pretty detailed level.
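A minimal sketch of what purpose-aware filtering could look like, in Python. Everything named here is an assumption for illustration: the column labels, the purpose names, the blocked-label policy, and the age threshold are hypothetical, and the real values would come out of legal review rather than code.

```python
# Hypothetical column-level labels for one data set (the catalog side).
COLUMN_LABELS = {
    "user_id": "identifier",
    "age": "demographic",
    "face_id": "biometric",
    "page_views": "behavioral",
}
# Assumed policy: labels that a given purpose must never receive.
BLOCKED_LABELS = {"ads_targeting": {"biometric"}, "tax_accounting": set()}
# Assumed policy: purposes that need consent, and purposes that exclude minors.
NEEDS_CONSENT = {"ads_targeting"}
EXCLUDES_MINORS = {"ads_targeting"}
ADULT_AGE = 18  # assumption; the threshold varies by jurisdiction


def filter_for_purpose(rows, purpose, consents):
    """Return only the rows and columns that may flow to `purpose`.

    `consents` maps user_id -> set of purposes that user currently consents to.
    """
    blocked = BLOCKED_LABELS[purpose]
    allowed_cols = [c for c, label in COLUMN_LABELS.items() if label not in blocked]
    kept = []
    for row in rows:
        if purpose in EXCLUDES_MINORS and row["age"] < ADULT_AGE:
            continue  # row-level filter: minors
        if purpose in NEEDS_CONSENT and purpose not in consents.get(row["user_id"], set()):
            continue  # row-level filter: no consent for this purpose
        kept.append({c: row[c] for c in allowed_cols})  # column-level filter
    return kept


rows = [
    {"user_id": "u1", "age": 12, "face_id": "f1", "page_views": 3},
    {"user_id": "u2", "age": 30, "face_id": "f2", "page_views": 9},
    {"user_id": "u3", "age": 40, "face_id": "f3", "page_views": 7},
]
consents = {"u1": {"ads_targeting"}, "u3": {"ads_targeting"}}
ads_rows = filter_for_purpose(rows, "ads_targeting", consents)
```

With this sample data only `u3` survives for ads targeting, and without the biometric column. The point of the sketch is that both the row state and the column labels have to be known before the data reaches the consuming system.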

Okay, good. This is not hideous on a large screen. I was very worried. Uh so just to give you a detailed example of the kinds of things that come up in data privacy, you need to understand both what's the state of the rows in your data set and the columns in your data set. So for example, here we have a data set where we have two minors: can't use their data. We have one person who has not consented to the use of their data: can't use that data. And because this is ads targeting, and only because this is ads targeting, uh in some jurisdictions we can't use biometrics, like for example a face ID or fingerprint, for ads targeting. So out of this entire

data set we have a little tiny bit of information we can use. Everything else is off limits. So you think about what you would need to put in place in terms of controls to make sure that what finally winds up in your ads targeting system is something that they can use. That's a lot. Next up, scanners. People love scanners. Everybody loves scanners because automation is amazing. We really hate doing manual work. I hate doing manual work. The problem with scanners and personal data is that unfortunately you are not looking for PII. You're not looking for social security numbers. You're not looking for phone numbers. You are looking for a UUID that you assign to a user, which unfortunately

looks like a UUID, and you have UUIDs scattered all over your system. So it's very very difficult to go and just scan and find personal data. So it's much easier to catalog in advance. Uh and once you find it, you then need to figure out, okay, I found a UUID. Is this a consenting user? What are all the sessions linked to this UUID? Or, you know, whatever foreign key relationships you have. Uh and if you didn't capture this information at the outset, you need to be able to answer questions like, I have an integer and it ranges from 40 to 220. If it's a heartbeat, that's health data and I have restrictions on how I can use

it. But it might also be the number of times that US East went down this week. You need to know the difference. LLMs. Everybody wants LLMs to save us. I work on a product that has LLMs at the core. I am not a complete naysayer of the LLM boom. You can do amazing things with them. But uh if you didn't capture enough context when you ingested the data in the first place, you can't recreate it. Nobody can. A human can't. An LLM can't. If your developer uses, sorry, incredibly opaque names, the LLM also can't read their mind. If you can't read their mind, the LLM cannot read their mind. And LLMs are not lawyers.

You should not use LLMs to make decisions about the risk envelope of your business. So, if you're trying to figure out whether something is or is not personal data, you don't want to ask an LLM. You want to go talk to your lawyer. They're surprisingly nice people, as I discovered, but you want to go talk to your lawyer. You don't want to use an LLM to do this. Uh, getting semantics wrong is expensive. It's usually worth the extra time and the extra money to go get a human judgment. Okay, great, Aleatha. You've depressed me, but give me some good news. I just need to follow the law, right? Yes. And uh unfortunately the laws that we have are not a controls

framework. Uh they are, I like to describe them as, more like a compiler spec. Like there's a bunch of things that you should or may or must not do. Uh and there's also a huge amount of undefined behavior in that spec. And the only way that behavior gets defined is through the wonder of case law. So at the time when you are building, there may not be an answer to the question of can I do this or not. It may lie in that vast gray area. Uh the privacy controls frameworks are at least 10 years behind security. They are improving. Lots of people are working very hard to improve them but they're not great. Uh they don't map on to legal

constructs very well. Uh and most of them were developed with the US government in mind, which turns out to handle data very differently than your average enterprise. So uh and last and not least, you can't just go down a checkbox list and say, "Okay, GDPR, CCPA, HIPAA, I'm done." Because you may have other nuances to your legal landscape. Large numbers of companies in the US have a consent decree against them. A consent decree is a law that is just for you. You have promised to the US government that you are going to do this kind of reporting or not collect this kind of data. They're super specific. Uh you get one when you mess up, but then you need

to be able to follow it. So on top of all of these legal constructs you need to follow, you also need to know about these additional legal burdens. So it's not a checkbox exercise. You can't buy an out-of-the-box tool and say, "Okay, I'm done." You really do need to sit down and do the work of understanding what your company has promised to do with data before you can start standing up privacy work. So, if you would like to become the very model of some modern data handling, here are some practical tips for adding privacy into your existing security program. I'm assuming here that you are already doing security work. you're already good at it and for whatever

reason you want to or have been forced to pick up privacy work. What are you going to need that you don't have today? First off, you're going to need to upgrade your data catalog. You've got your inventory and now you need to collect a whole bunch of additional metadata. To figure out what you need, you're going to have to sit down with your lawyers and figure out what your legal obligations are and what metadata you would need to meet them. There's a whole bunch of things in every law that you probably don't need. There's a number of things that you do. You need to figure out what those metadata terms are so that you weave

them into your catalog. Uh you need to have detailed semantics that uses those labels. You need to be able to catalog table by table what is personal data and you need to be able to keep it up to date. Uh and that often looks like integrating into your code stack or understanding what your vendors are collecting on your behalf, that sort of thing. Uh you need to understand what sorts of purposes exist in your enterprise and figure out who's doing them. You want to automate cataloging as much as you can. But again, privacy fines are really expensive. And so getting your data hand annotated is sometimes the right answer.
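As a rough sketch of the catalog upgrade described here, in Python: each table carries its purposes, each column carries a label, and whatever the automated pass cannot classify is queued for a human. The rule table, label names, and the `vitals` example are hypothetical; the actual label vocabulary would come from your legal obligations.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical name-based rules for the automated pass.
AUTO_RULES = {"user_id": "identifier", "email": "contact", "heart_rate": "health"}


@dataclass
class Column:
    name: str
    label: Optional[str] = None       # None means "not classified yet"
    labeled_by: Optional[str] = None  # "auto" or "human"


@dataclass
class Table:
    name: str
    purposes: set = field(default_factory=set)  # what this table is used for
    columns: list = field(default_factory=list)


def auto_label(table):
    """Label columns whose names match a known rule; leave the rest alone."""
    for col in table.columns:
        if col.label is None and col.name in AUTO_RULES:
            col.label = AUTO_RULES[col.name]
            col.labeled_by = "auto"


def needs_human_review(table):
    """Columns the automated pass could not classify."""
    return [c.name for c in table.columns if c.label is None]


vitals = Table(
    name="vitals",
    purposes={"service_delivery"},
    columns=[Column("user_id"), Column("val1"), Column("ts")],
)
auto_label(vitals)
review_queue = needs_human_review(vitals)
```

Here `val1` could be a heart rate or an outage count; nothing in the name says which, so it lands in the review queue instead of being guessed at.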

Like 80/20 rule, automate as much as you can, call your lawyer or call a privacy expert and have the rest of it done. I want to specifically call out child data because child data is the glitter bomb of privacy data handling. Once it has touched something, the thing that it has touched is forever tainted. Uh regulators get incredibly upset when you mishandle child data and you need to be able to prove that you actually did handle it correctly. If you can't prove that, then everything that might have touched child data is now tainted. So, as a case study that I love to cite when I'm explaining to people why it's important to get this right, uh, the FTC

told Weight Watchers they had to not only pay a fine, they not only had to delete their data, they had to delete their entire trained machine learning model and start over from scratch because they couldn't prove they hadn't used child data in it.
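One way to picture the "everything it touched is tainted" problem is as reachability over a data lineage graph. The sketch below is illustrative only: the lineage graph and dataset names are made up, and it assumes you already record which artifacts are derived from which inputs.

```python
from collections import deque

# Hypothetical lineage: each dataset maps to the artifacts derived from it.
LINEAGE = {
    "signup_events": ["features_daily"],
    "features_daily": ["ads_model_v3", "dashboard_weekly"],
    "billing": ["tax_report"],
}


def tainted_by(source, lineage):
    """Return everything downstream of `source` (breadth-first traversal)."""
    seen = set()
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# If child data cannot be ruled out of signup_events, all of this is suspect.
suspect = tainted_by("signup_events", LINEAGE)
```

If the lineage was never recorded, the honest answer to "what did this data touch?" is "possibly everything", which is the position the case study describes.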

Consent. Unconsenting user data poses a lot of the same problems as child data. You need to be able to filter it out everywhere data flows. You need to be able to identify it. Uh unfortunately, unlike child data, consent can change over time. A child is a child until their age of majority, but a user can change their mind three times in a week. So that means you need to be able to check consent status every single time. You can't just build up a cache of users and say, "Okay, here's their consent status." Um, so it is absolutely, you have to check it every single time. I know this is horrible. Sorry. Uh, and once you have all of this in

place, now you have laid the foundations for your access control and your filtering. uh you need to find a way to model purpose within your access control systems. That might look like roles. Uh if you're doing roles, you need to make sure that both the humans and the systems that are consuming data are using the right role for whatever that purpose is. Um or you can do something like attribute-based access control where one of the attributes you're passing is the purpose of processing. uh but you need to do something to model what you're using the data for. And you also need to get filtering controls into the data pathway to make sure that whatever you're getting for that purpose

is suitable. You need to make sure you scrub all those children out, all of those unconsenting users out, all of that pesky biometric data out so that you have a clean data set that your data scientists can sit down and work with without causing issues. Uh, and you really, really, really want to make sure that user IDs and timestamps of when you collected the data go everywhere the data goes. You're really going to hate yourself later if you don't, because not doing that means that it is impossible to know where your personal data got off to because you don't know which users this data set goes with. Uh, and it makes it impossible to know how long

you've retained the data. And I'm going to talk about data retention in a second, but you really want to know when did you get this data. When it comes to deletion, privacy means absolute 100% irrevocable deletion. Hashing the user ID won't do it. Delinking the sessions from the user ID won't do it. Uh backups are unfortunately right out. You can keep them for up to your compliance SLA, for which companies usually pick a threshold of around 15 to 60 days, depending. Uh but you have to get rid of it completely unless of course it is tax and accounting data or fraud and security or litigation holds. So you can't just scrub through all of the systems, find

the user data and say, "Oh, this has user data in it for this user. I need to delete it." You need to know whether or not that system should have user data past the point of deletion. Is it allowed to have it?
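A small Python sketch of that decision, under stated assumptions: the system names, the retention bases, and the 30-day backup window are all hypothetical (the talk only says companies pick something in the 15 to 60 day range), and whether a basis actually permits retention is a legal question, not a code one.

```python
from datetime import date, timedelta

# Hypothetical systems mapped to a retention basis that survives a deletion
# request (None = no basis, so the user's data must go).
SYSTEMS = {
    "ads_profile": None,
    "session_store": None,
    "tax_ledger": "tax_accounting",
    "fraud_log": "fraud_security",
    "backups": None,
}
BACKUP_SLA_DAYS = 30  # assumption: a value inside the 15 to 60 day range


def plan_deletion(user_systems, requested_on):
    """Decide, per system holding the user's data, what deletion means there."""
    plan = {}
    for system in user_systems:
        basis = SYSTEMS[system]
        if basis is not None:
            plan[system] = f"retain ({basis})"
        elif system == "backups":
            deadline = requested_on + timedelta(days=BACKUP_SLA_DAYS)
            plan[system] = f"expire by {deadline}"
        else:
            plan[system] = "delete now"
    return plan


plan = plan_deletion(["ads_profile", "tax_ledger", "backups"], date(2026, 5, 1))
```

The design choice worth noting is that the exemption lives on the system, not on the data: the same user ID is deleted from the ads profile and retained in the tax ledger.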

How long you can keep data depends on what you collected it for. So if you are collecting data for some of these essential purposes, you have a nice long runway. In some cases, you have a really long runway. Tax and accounting data needs to be kept for years. But other types of things like ads targeting have a very short runway. Uh, and when a purpose expires, you can't use data for it. Even if there is a purpose that has a longer expiration date. So when you run out of valid purposes, you need to delete it. Uh, and if you're keeping around tax and accounting data, you can't use it for ads targeting after your runway runs

out. Data subject access requests. Interestingly, this one seems to trip up a lot of security people uh and engineers because they say things like, "But this data isn't interesting." Like, okay, yes, it's related to the user, but it's really boring. That's not the point. The point is that it is their data. It is data that has been generated by their activity and you have to give them all of it. Uh so, you need to follow all of those links I was talking about earlier. You need to follow all of the session IDs. You need to collect the whole ball of yarn and package it up and give it to the user. Testing. Uh, privacy testing is mostly the

same. So you still want canary users. You're going to have perhaps a wider range of canary users. You are going to want to have children. You're going to want to have consent canaries. You're going to want to have canaries for deleted users. But it is not too different. The only thing that's different and very difficult is testing the completeness of your data subject access request. Making sure you're actually returning everything to the user that by rights belongs to them. And this is why I said you're going to hate yourself if you didn't send the user ID everywhere because you won't be able to find it again and you're going to miss stuff. I have apparently talked very fast

because she has not yet given me the five minute warning. >> Oh, okay. Perfect. Good, good, good, good. Okay, in a nutshell, security versus privacy differences. Your nightmare scenario as a security engineer is that somebody has exfiltrated data. Your nightmare as a privacy engineer is that there is child data in the ads system. Security engineers need to worry about nation state hackers. Privacy people need to worry about data scientists with very good intentions. Data set inventory for security people is systems, data sets, big buckets like critical data. Privacy engineers think about columns, row levels, many tiny buckets like biometric, child data, and accounting. And for security, you get to pick what you call the buckets. You can call them

red, you can call them critical, you can call them kangaroo. That's entirely up to you. Uh, as a privacy engineer, the sets of categories that you pick are tied to whatever the regulator thinks is important. So, you don't get to go off-piste. You have to figure out what the categories are. There's a global standard that you need to adhere to. Um, and then that needs to be encoded into your systems. And last and not least, control frameworks for privacy are still a very nascent space. And so uh there is a thriving practitioner community. We all help and support each other. We all try to figure out what needs to be done and how to build it. But uh it is not nearly

as mature as security. There is no 800-53 for privacy that is good. And so, uh, as you stand up the privacy program, you want to reach out to this broader community and figure out what those controls look like, what you need to do. Thanks. >> Thank you, Aleatha. We got uh a couple minutes for any questions. Okay, got one. >> As a cyber practitioner who's interested in learning uh more about how I can reduce uh privacy risks uh and bolt that onto my skill set, how would you recommend getting started in that if you were to be in that boat? >> Yeah. Um so some of those steps I called out around you know how do you do

cataloging, what kinds of metadata do you collect? Um there are a couple of good books in this area. Uh Nishant Bhajaria has a really nice book called something like uh I'm going to say the privacy engineer's runbook, something along those lines. Don't quote me on that. Hit me up after the talk and I can find you the actual name. Uh but that has actually a really nice um set of components in there. It's a good starter kit. Okay, we got a question down here. Yeah. Hi. You talked about tax and accounting data, I think, in a privacy context. >> So, is that tax and accounting data about individuals or is that the

organization's tax and accounting information? >> Since we're coming at it from the privacy direction, uh we're talking only about tax and accounting data that is related to individuals. Obviously, your company has more than just tax and accounting data about individuals. You may not have anything about individuals depending on what business you're in. >> So, >> so just to clarify that if a company operated an accounting system that's not a privacy problem in general, right? I mean, >> no. Uh I would say not in general. Uh and if you're running an accounting system on behalf of other people, then um it is likely that they're the ones on the hook to make sure that they are

handling data responsibly. So you just need to make sure they have the capabilities to do the right thing. But yeah, >> and we had a question down here. Um you mentioned tracing uh user data across the end-to-end life cycle, especially in terms of like data scientists using those data, but with the number of data transformations, data cleanups, and all the different steps of a model training that goes into it, what's a good example of how someone can meaningfully still establish that trace of user data? >> So if you have a machine learning model, uh, a lot hinges on your training process. There's a bunch of fantastic security papers about how you do data

extraction from trained machine learning models because machine learning models are subject to memorization. So a lot of times you can just coax the data right back out again. Uh, and so when you're talking about user privacy and machine learning models, you're really asking, did I train this in a way that is robust against this kind of attack? Can I make sure that data isn't going to come back out of it again? >> That's an overlap of security. >> Yeah, that's that's one of those nice places where they overlap. >> Okay, let's give Aleatha another hand. Thank you so much.
