
Privacy Nightmares while Using ML/AI

BSidesSF · 2020 · 24:49 · Published 2020-03
Speakers: Sameer Ahirrao, Yogesh Karpate
About this talk
Sameer Ahirrao, Yogesh Karpate - Privacy Nightmares while Using ML/AI in Your Applications. Everyone is excited about using ML and AI in applications, but what about privacy while mining personal data within those applications? Privacy nightmares can happen without due care. We aim to address this issue by introducing a framework that blends ML and privacy/data-security technologies seamlessly.
Transcript [en]

All right, folks, welcome to Privacy Nightmares. Quick note: you can use Slido during this talk to ask questions, and if there's time at the end we'll do some Q&A. But Sameer, take it away.

Can you hear me? Yeah. Hey, thank you, Justin. I hope I'm not keeping you from the next party; there is one more session after this. Thank you for coming. The day is winding down and everybody's tired and heading home, but thanks for making it, and I'm pretty sure we have an engaging topic here. We talk a lot about security, and I've been in that field for the last 20 years, but I think it's really time we talk about privacy a lot more. There cannot be privacy without security, but privacy has its own angle, and we need to worry about it in its own right. So this talk is about putting attention toward that, and I thought ML and AI would be a good area to start with. First, apologies: my throat is in bad shape from a cold, so I hope you can't sense it too much. We were haunted by sickness all week, so my buddy Yogesh couldn't make it. We worked on this together, though, so I kept the material as we built it; apologies for that. So: privacy nightmares while using AI/ML in your applications.

A little bit about myself and Yogesh. I started Ardent Privacy, a privacy-tech startup; we call it a data minimization and data-centric security company. The inspiration behind it, after working so long in the enterprise security industry, is that we talk about securing everything around the data, but not the data itself nearly as much. There was a survey out there showing we spend only something like six to twelve percent of the annual security budget on securing the data itself. We put so much money into the perimeter and a whole bunch of other things, and of course malware takes a lot of the attention. I was a security architect at Lockheed Martin, a defense contractor, for almost the last eight years. Before that I did product as well as implementation work at Symantec, and I also worked at Deloitte, one of the Big Four. I take volunteering seriously; I volunteer for (ISC)², some of the BSides events, and so forth. On Yogesh's side, he's a data scientist and has been for the last ten years. He's a postgraduate from IIT Mumbai and did his undergraduate work at the same college I did, PICT. He sends his sincere apologies for not making it.

So what's the impetus here? We're going to go through a few things, and the goal really is: how much do we think about privacy when we build these systems, and what privacy-aware learning models exist? There are big companies and smart people inventing in this area, but I thought I'd at least give you some pointers to it, whatever we can cover in 25 minutes or so. I hope you like it and that it at least gives you a starting point to think about this in general, which, as I said, is important. So we'll go through the impetus, then the problems, then the solutions, and then the privacy-enhancing technologies, the PETs, that are out there; I'll take a few examples of what people are doing in that area. Along the same lines, we'll cover the techniques for PAML, privacy-aware machine learning. That's really about privacy by design: how you can think about privacy before actually doing your project, your software development, your application development, and so forth. And then we'll close with some conclusions: what our responsibilities are, and what my message really is.

So here are two big incidents. There are a lot out there, but I'll point out these two. The Department of Health in Australia published some data in 2016: de-identified health data, released for research purposes. What they later found out was that a group called Madash cracked the logic to identify the PII, and that data was effectively re-identified. That was a wake-up call in Australia: hey, we need to actually think about this, right? If there is a particular medical condition in a particular zip code, you can really narrow someone down. And I believe they did it without even touching the encrypted fields; there were some encrypted fields in that data, but using only the unencrypted fields they were able to identify people.

The other one is the New York Times. This became pretty famous recently: they showed location tracking happening without consent, and they published a very good animated map of how your data and your location are tracked. They followed one person, a school teacher, and during the four, five, eight hours she was at school, her location was tracked something like 850 times. So we see this supposedly anonymous tracking by all the big tech companies, and by everyone making good or bad use of the data, depending on how you look at it.

So it's a kind of privacy invasion, right? They were able to track people right from the White House. That's why privacy by design is so important, and that's what we want to go through. So let's see what we can do here.

First, the problem. There is a status quo, and we talk about this a lot: collect and store everything possible, and monetize it. Data has become such a commodity that we don't really care about deleting it; we just collect whatever is possible and figure out what to do with it later. That status quo we've got to break. We've got to store only what we need, and not one percent more than that. Then there is the lack of a privacy-by-design approach. Privacy is an area where, I think, it took us some time just to get security on the ground first, so privacy comes even later in the phase; it's kind of stuck with the legal people. You sometimes have a privacy officer, but that role is pretty isolated. We don't really have privacy professionals looking through your projects and your data and asking what kinds of privacy impacts different projects have. And there was a lack of regulation, of course, except for GDPR, which came in 2018 or so. In the US it's getting better, with California being the first state to pass the CCPA.

At least that gives us some consumer data protection, but overall there is still a big gap in terms of regulation. More states are passing laws, so it's getting better, but there's still nothing comprehensive outside GDPR; the EU is pretty good in that sense, and Brazil has the LGPD. So it's picking up globally, but it's not finished yet. And then there's the privacy invasion of citizens: the impact of this whole data market, all the data-intelligence gathering by the big tech companies. There is a lot of privacy invasion in everything these companies do, and we really need to look at it, because personal life is at stake in various ways. We need to take privacy seriously at this juncture, and I think the crowd here, the security engineers and developers, can make more of a difference than anybody else, because eventually they are the ones who design these things.

So, the solutions. Number one: I went to an interesting conference from the IAPP in Vegas called Privacy. Security. Risk. Tom Wheeler, the 31st FCC chairman, was there; he wrote a good book called From Gutenberg to Google, which is worth adding to your reading list. He said in particular that we have a duty of care. There have been a few big revolutions: printing and the telegraph, then the railroad, and now the data revolution. So we really have a duty of care; we must own privacy whenever we develop applications and so forth. Regulations are definitely helping the need and the awareness; we covered that earlier. A privacy-by-design approach is an absolute must, and at the very least, data minimization. I don't think we have focused enough on data minimization: what are you storing, and do you really need it? Collecting everything is really not a good idea. We need to look at data minimization while you collect data, while you store it, and throughout the data lifecycle. Then there are the techniques for PAML, privacy-aware machine learning: differential privacy, federated learning, and a lot of other techniques we'll go through. Those are a must.

As solutions they can't do privacy perfectly; there is a trade-off between what you can do and what you cannot. But we've got to try our best, de-identify data as well as we can, and protect user privacy; that's better for everyone. As for the technology landscape, it's still evolving. One piece is statistical disclosure control, which is kind of an ancient method; we'll go over it. Then there are privacy-enhancing technologies, the area that really should develop in the near future. Researchers are still working on it, but I think we need more development there.

So, data privacy: just a little bit of a primer. Why privacy? Data privacy investigates the models and methods that aim to prevent the leaking of sensitive information. Whatever you do, you should be avoiding leaks of information that would compromise privacy. The origin of this whole concept came from the statistics community's efforts with the Census Bureau: they wanted to give statistical data to people, but at the same time they wanted to make sure they did not compromise the privacy of individuals. There should not be a way to recover people's identities from what gets published. The whole goal of the census is to publish good statistics that people can work from, but at the same time those statistics should not be usable to compromise privacy and identify people. This has now been adapted to communications, databases, and traditional applications, though not so much in the machine learning area; I think that's still evolving. I'll say adopted, though not fully; at least there is an attempt to implement it.

So, statistical disclosure control. It's used for research and policy-making on various survey and administrative data: census, tax records, health data, educational data. What statistical disclosure control ensures is that no person or institution is recognizable from the data analysis. You can do as much data analysis as you want, but you should not be able to identify a person from it. And how do you do that? There are whole papers out there, but mainly you introduce noise, or you use ranges; there are different techniques within statistics for avoiding the revelation of people's identities. Because this is sensitive to disclosure risk, the analysis results may be changed to protect confidentiality. What that means is it's not perfect all the time: the results may change, and sometimes you just lose the purpose of the analysis. So it's not a perfect technique, but that's where the whole concept started, and other methods evolved from there.
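[Editor's note: to make the noise-and-ranges idea concrete, here is a minimal Python sketch of those two classic disclosure-control moves. The field names and noise scales are arbitrary illustrative choices, not a calibrated SDC method.]

```python
import random

def add_noise(value, scale):
    """Perturb a numeric value with small random noise before release."""
    return value + random.uniform(-scale, scale)

def to_range(age, width=10):
    """Replace an exact age with a coarser range, e.g. 37 -> '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

record = {"age": 37, "income": 52_000}
released = {
    "age": to_range(record["age"]),                         # generalization
    "income": round(add_noise(record["income"], 500.0)),    # perturbation
}
print(released)  # e.g. {'age': '30-39', 'income': 52213}
```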

Then there are the privacy-enhancing technologies. There are a whole bunch of techniques out there, and more are being added, but what are the principles behind them? First, informed consent. If you look at all these privacy tools and technologies, everything is consent-driven. Every time an app accesses your location, especially on mobile phones, we see more and more consent tracking: the user grants consent for commercial purposes, for everything. It's still misused; I don't think it's 100 percent perfect or even close, but at least there's motivation to ask for consent on cross-application access: what are you trying to access? You try to access location, and more prompts pop up. It's still a developing area. Second, data minimization: if you don't need something, don't collect it, and whenever you do collect it, store it responsibly and keep only what's really needed for the business purpose; we'll go through the definitions for that. Then data tracking and data provenance, that's one area, and data anonymity and control.

So, the definition. Data minimization refers to the practice of limiting the collection of personal information to that which is directly relevant. That's the GDPR definition, by the way; I pulled it straight from GDPR, and I like it because it really conveys the sense of minimization we need to apply whenever we handle data, across its whole lifecycle. Personal data should be adequate: no more, no less. Say you're buying something, or maybe selling something; you should only collect the data you need, not everything you can. It has to be relevant to the purpose, and limited to what is necessary for the purposes for which it is processed. The data minimization principle also represents a best practice for maintaining customer trust. Apple has been pretty vocal about this: we should not think of data minimization and privacy just as a burden that kills innovation. A lot of companies are making the effort to build their brand on it: hey, we really care about your data, and we protect it fully. So we can use data minimization as a company to build consumer trust and show consumers we handle their data responsibly.
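[Editor's note: as an illustration, here is one way the "adequate, relevant, limited to the purpose" idea can look in code: an allow-list per processing purpose, applied before anything is stored. The purposes and field names are hypothetical.]

```python
# Purpose-based minimization: each purpose declares the only fields it may keep.
ALLOWED_FIELDS = {
    "shipping":  {"name", "street", "city", "postal_code"},
    "analytics": {"postal_code", "age_range"},
}

def minimize(record: dict, purpose: str) -> dict:
    """Drop every field the stated purpose does not strictly need."""
    allowed = ALLOWED_FIELDS[purpose]
    return {k: v for k, v in record.items() if k in allowed}

raw = {"name": "Alice", "street": "1 Main St", "city": "SF",
       "postal_code": "94105", "ssn": "000-00-0000", "age_range": "30-39"}
print(minimize(raw, "analytics"))  # {'postal_code': '94105', 'age_range': '30-39'}
```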

So, privacy-aware machine learning. What are the problems in machine learning today? Problems, or opportunities, depending on which side of the aisle you're on, but I'm talking from the privacy perspective. It's very data-hungry, right? The mindset is: I can have as much data as I want, so I can run analytics on it; every piece of data I can gather, whether it's useful or not, I'll figure that out later. It demands data, data, and more data all the time, and we need to be careful about that.

So what can we do about it? De-identification is a good idea. So are anonymization and pseudonymization, and the earlier in the phase you do them in your development or system development lifecycle, the better. Along with minimization: de-identify as much data as you can right from the beginning, as long as you can still make use of it; it's again a trade-off. And data masking. We can use all of these to get around the data-hungry problem: use as much data as you have to, but definitely use some of these techniques to pre-process it, removing the personal identifiers and personal data.
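[Editor's note: a minimal sketch of that kind of pre-processing, assuming a keyed hash for pseudonymization and simple masking for a quasi-identifier; the key handling and field names here are placeholders, not a vetted scheme.]

```python
import hashlib
import hmac

# In practice this secret would live in a key-management system.
SECRET_KEY = b"replace-with-a-managed-secret"

def pseudonymize(identifier: str) -> str:
    """Stable keyed hash: the same user always maps to the same token."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]

def mask_email(email: str) -> str:
    """Keep just enough shape for debugging, hide the rest."""
    local, _, domain = email.partition("@")
    return local[0] + "***@" + domain

record = {"user_id": "alice@example.com", "email": "alice@example.com", "score": 0.87}
safe = {"user_id": pseudonymize(record["user_id"]),
        "email": mask_email(record["email"]),
        "score": record["score"]}
print(safe)  # identifiers replaced before the data ever reaches the ML pipeline
```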

Next, there's a general lack of guidance and practices for privacy in ML and AI. The big companies have started some big initiatives here. (I've got five minutes? Thank you.) So the big companies have started introducing some frameworks, but I think there's still a general lack of guidelines for mass usage of data, and businesses are not adhering to standard data-privacy practices. At the algorithmic level, privacy by design isn't really built into the algorithms; those algorithms want as much data as possible, and they may not work as well if you hold data back. More and more innovation needs to happen there.

Now some interesting points about the standard methods. The statistical methods for data transformation may not be effective in downstream ML frameworks. Then there are cryptographic methods like homomorphic encryption, which deserves a quick introduction here: you can keep the data encrypted and still compute on it. It's still at a nascent stage of development. We do have some implementations, and some startups and early companies are working on it, but it's not perfect yet, and I don't think these methods work with the existing ML algorithms you'd want to use day to day. They're catching up on it.
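[Editor's note: for a feel of what "keep the data encrypted and still use it" means, here is a tiny sketch using the third-party python-paillier package (pip install phe). Paillier is additively homomorphic, so a server can sum ciphertexts it cannot read; this is one example scheme, not the only approach.]

```python
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# Client encrypts its values and ships only ciphertexts.
encrypted = [public_key.encrypt(x) for x in [12, 7, 30]]

# Server adds ciphertexts without ever seeing the plaintexts.
encrypted_sum = encrypted[0] + encrypted[1] + encrypted[2]

# Only the key holder can decrypt the aggregate.
print(private_key.decrypt(encrypted_sum))  # 49
```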

Another one is synthetic data generation. It requires a lot of domain knowledge, so you can do it sometimes, not always. Then there's federated learning, which was introduced by Google: keep the data at the source. With the time restriction I'll go through this a little fast. Defense and healthcare actually use it, and there are two examples that explain it well. The idea is: don't process the data in a centralized repository; train at the individual node points instead. That can protect privacy because you're never gathering the raw data in one place; that's what federated learning does. One use is in personal healthcare, where each hospital works on its local data and you aggregate only at a higher level, where the combined processing happens.
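[Editor's note: a toy illustration of that data flow, with the hospital statistics made up. Real federated learning, e.g. FedAvg, averages model weights or gradients across sites; here each simulated site shares only an aggregate, never its raw records, which is the privacy point.]

```python
import numpy as np

# Each "hospital" keeps its raw measurements on-premises.
local_datasets = [np.array([120.0, 130.0, 125.0]),
                  np.array([140.0, 135.0]),
                  np.array([110.0, 115.0, 118.0, 121.0])]

# Only (local mean, sample count) pairs leave each site.
local_updates = [(data.mean(), len(data)) for data in local_datasets]

# The central server combines aggregates, never touching patient-level data.
total = sum(n for _, n in local_updates)
global_mean = sum(m * n for m, n in local_updates) / total
print(round(global_mean, 2))
```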

Then there's differential privacy, which you may have heard of: roughly, using ranges or perturbed values instead of the actual data. It's a pretty good technique and very easy to apply in pre-processing. For example, take a medical dataset where you have to record a patient's age: instead of the user's actual age, you can record an age range. Using that in pre-processing makes the data more privacy-preserving. In the ideal case, this is the framework we can eventually build: differentially private collection of the user data, k-anonymity in pre-processing, and then, within the ML model, predictions with federated learning in the same phase. If you can do privacy by design across that whole lifecycle, I think we'll be better off.
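[Editor's note: the age-range trick above is generalization, as used for k-anonymity; formal differential privacy adds calibrated noise to query answers. A minimal sketch of a differentially private count with the Laplace mechanism; the epsilon here is an arbitrary illustrative choice.]

```python
import numpy as np

def dp_count(values, predicate, epsilon=0.5):
    """Counting query has sensitivity 1, so Laplace noise scale is 1/epsilon."""
    true_count = sum(1 for v in values if predicate(v))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

ages = [34, 37, 41, 29, 55, 62, 38]
# "How many patients are in their 30s?" answered with plausible deniability.
print(dp_count(ages, lambda a: 30 <= a < 40))
```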

There will always be utility trade-offs, so you have to choose how much privacy versus how much use of the data. Think of the traffic data in Google Maps: you've got to give up your location to get the traffic, but the traffic is such useful information. So we've got to find the sweet spot as far as utility goes.

And the conclusion slide; I think we're on time. We have a duty of care. It's our responsibility to preserve privacy, and I think nobody other than the people in this room, and people like them, can do it, more so than the business side, because we are the ones who code, design, and implement these applications for everyone. Privacy by design should become the de facto standard for most AI/ML-driven applications; there are some initiatives, but we need more work on that. Regulatory efforts are also needed; Washington actually passed a law, which is good, but those laws need to be stricter as far as citizen privacy is concerned. And there is always going to be a trade-off between the inclusion of data and what you need to achieve, so we need to balance that efficiently.