
Our next talk will be Defining a Data Masking Framework at Scale. In this talk we'll explore tokenization and obfuscation as alternate data security techniques, in addition to conventional methodologies like encryption, and the challenges that need to be addressed when implementing a data security framework at scale. We have Sohini on the left. Sohini is a senior security engineer at LinkedIn, a purple team evangelist, and passionate about data security and cloud security. She has been a speaker at BSides San Francisco 2020 and has also spoken at several other conferences. I'll now pass it over to Tim.

Thank you. I'm Tim, I'm a data security architect, and I haven't done this before, so, all right.

Hey everyone, I hope you are doing well and enjoying the BSides conference. Today we will be talking about defining a data masking framework at scale. I am Sohini, and with me I have my co-presenter Tim.

So today we're going to walk you through an example of data security over time: how it evolves, and how the needs and the usage of the data change. We'll be going through data privacy and how it interacts with data security, we'll go over some strategies that we didn't touch upon during our example evolution, and then go through some gotchas that apply to all the strategies we've outlined.

Okay, so the year is 2015. Our example company, RealCorp, is collecting various information about the global workforce. This sensitive data is being used in a variety of jobs and workflows on third-party platforms, it's stored internally, and it's made accessible to all engineers, because it's faster to develop that way. And as far as the company knows, there are no legal regulations that restrict handling of the data.

Fast forward: companies now implement strict technical controls to protect personal data. Using this as an opportunity to re-evaluate data protections, RealCorp's legal team ascertains that salary is indeed super sensitive and that it might need more protection.
After careful consideration, they decide to encrypt the entire data set. This is easier than doing targeted or field-level encryption and offers a similar level of protection.

A feature team wants to start running analytics on this data. They want to see, say, future engagement in region A versus region B, but they don't care who the users are or what their salaries are. So our previously selected protection of encryption falls a bit short here: while the data is protected against unauthorized access, we have a very valid use case, the one running analytics, that doesn't need access to the raw data and yet has the ability to leak keys for the entire data set. So to shore up protection, RealCorp makes a copy of the data and hashes certain fields like user ID and name. This maintains uniqueness without providing the ability to read the raw data.

Eventually, though, someone discovers that we've got two copies of the same data and says, man, we could definitely do better, we need to pare that down to one. So we re-evaluate our protection mechanisms. Encryption has been useful in stopping unauthorized access, but it doesn't allow you to run analytics without access to the raw data; hashing allows you to run analytics, but transforms the data, making those analytics fairly limited.

Hashing is also vulnerable to many known attacks. By adding a record with a known input, such as creating a fake profile with the name Sohini, and finding that entry somewhere in the resulting output, it's possible to de-anonymize the data. Additionally, if the hashing algorithm is known and the input space is small enough, you could potentially create or use an existing rainbow table and de-anonymize the data. And hashing data at scale typically introduces a really large time delay, which is rarely acceptable.

Okay, so the combination of hashing and encrypting isn't ideal, so RealCorp settles for a special type of encryption called format-preserving encryption (FPE).
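To make the known-input and rainbow-table attacks on hashing concrete, here's a minimal sketch. This is our own illustration, not RealCorp's actual pipeline; the field values and the use of unsalted SHA-256 are assumptions for the demo.

```python
import hashlib

def pseudonymize(value: str) -> str:
    # the weak "protection": an unsalted SHA-256 hash of the field
    return hashlib.sha256(value.encode()).hexdigest()

# The shared copy of the data set: names replaced by their hashes (toy salaries)
shared = {pseudonymize(name): salary
          for name, salary in [("sohini", 150_000), ("tim", 140_000)]}

# Known-input attack: plant (or guess) a value, hash it, and find the record.
probe = pseudonymize("sohini")
leaked_salary = shared.get(probe)        # de-anonymized: the salary is exposed

# Rainbow-table-style attack: if the input space is small, precompute every
# candidate's hash and invert the whole column at once.
rainbow = {pseudonymize(c): c for c in ["tim", "sohini", "alice"]}
assert rainbow[probe] == "sohini"
```

A salt or keyed hash raises the bar, but as the talk notes, any deterministic one-way transform over a small input space stays vulnerable to enumeration.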
FPE encrypts the data, preventing unauthorized access to it, while retaining the format of the data in terms of its length and character positions. What that buys us is that analytics can be run on the data without the need to decrypt it. By using this method we can potentially eliminate the redundant second copy of data that we were referring to a couple of slides back, and it comes with an added benefit: we can implement it fairly seamlessly on legacy systems like databases and applications, without needing to change field sizes or formats. We've got a lovely picture here of some of the more well-known sensitive fields, like Social Security numbers and credit card numbers, but format-preserving encryption can be done on other things, like IP addresses, as well.

Okay, so fast forward a couple of years again, and we've discovered that this encrypted data has actually been written somewhere in clear text. The incident response team does an investigation and finds that authorized users wrote that data in clear text by mistake, because the encryption wasn't done automatically for them. So let's go looking for another solution.

We settle on tokenization. In this usage, tokenization involves sending data to a server which generates a random placeholder value, optionally sharing the format, and returns that value to the caller. This data can be used to perform specific operations, but because it's random data linked back to the original data, there's no way for anyone to operate on it outside of the contexts we deem acceptable. We can also essentially monitor all usage of that data, since using it requires accessing the mapping table.

Management has struck up a deal with a third-party ad provider. In order for them to serve ads accurately, we need to permit analysis on our sensitive global workforce data, but we shouldn't be sharing the raw data with the third party just like that, right?
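A toy sketch of the tokenization pattern just described. This is illustrative only; a real token service sits behind authentication, audit logging, and durable storage, and the `TokenVault` class here is our own invention, not a real product.

```python
import secrets

class TokenVault:
    """Toy token server: maps random placeholder tokens to raw values."""

    def __init__(self):
        self._token_to_raw = {}
        self._raw_to_token = {}   # gives referential integrity: same input -> same token

    def tokenize(self, raw: str) -> str:
        if raw in self._raw_to_token:
            return self._raw_to_token[raw]
        token = secrets.token_hex(8)  # random: no mathematical link to the raw value
        self._token_to_raw[token] = raw
        self._raw_to_token[raw] = token
        return token

    def detokenize(self, token: str) -> str:
        # every detokenize call goes through the server, so all usage can be monitored
        return self._token_to_raw[token]

vault = TokenVault()
t1 = vault.tokenize("123-45-6789")
t2 = vault.tokenize("123-45-6789")
assert t1 == t2                      # referential integrity preserved
assert vault.detokenize(t1) == "123-45-6789"
```

The key property: unlike encryption, there is no key whose leak reveals the data, because the token is random; compromising the mapping table itself is the failure mode, which the gotchas section returns to.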
So let's walk through some of the attributes we want our system to have when we are sharing sensitive data with an untrusted partner. This isn't really limited to third-party data sharing; it can be extrapolated to minimizing unintended exposure of sensitive data each time it crosses a trust boundary, say to a lower-security domain, even within the same organization.

We want a solution which is one-way, such that the shared data doesn't get leaked even if the keys used to protect it were leaked, compromised, or otherwise shared. Additionally, since our example here is ads, we need to ensure referential integrity, which means that when the same input is provided, we need to produce the same output. And lastly, we need ongoing key rotation to ensure ongoing protection of that data.

Key rotation poses its own challenges. It's often only done for the most sensitive of data, or when a breach is glaringly obvious, due to the associated cost and complexity, which scales with the volume of data being protected. If you combine this with referential integrity, that is, the need for the same input to map to the same output, you have a very large sphere of compromise if any key is ever leaked. A key update must maintain consistency between the data being masked by the new key and the data that has already been masked by the previous key. Now, a key update might be easily done by simply re-keying the entire data set at once, but that doesn't work in a real-life scenario, as it will pose data usability and availability issues. So let's see if we can push the need to re-key onto the destination rather than doing it ourselves.

We're going to look at a paper authored by some people over at IBM, where they describe a set of algorithms that use encryption and hashing in order to dynamically update the protection of data over time.
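To give a flavor of how re-keying can be pushed onto the destination, here is a toy sketch of the core idea: tokens that the host can roll forward to a new epoch using only an "update tweak", without ever seeing the raw values. This is our own simplified group-exponentiation illustration, not the paper's actual algorithms, and the parameters are deliberately not production-safe.

```python
import hashlib

# Toy updatable tokenization in a multiplicative group mod a Mersenne prime.
P = 2**127 - 1          # toy modulus (a Mersenne prime); far too small for real use
G = 3                   # assumed group element

def h(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode()).digest(), "big")

def tokenize(value: str, epoch_key: int) -> int:
    # owner-side: token_e(v) = G^(k_e * H(v)) mod P  (deterministic: referential integrity)
    return pow(G, epoch_key * h(value), P)

def update_tweak(old_key: int, new_key: int) -> int:
    # delta = k_new / k_old mod (P - 1); reveals neither key's raw use on values
    return (new_key * pow(old_key, -1, P - 1)) % (P - 1)

def host_roll_forward(token: int, tweak: int) -> int:
    # host-side: token^delta = G^(k_new * H(v)), computed without ever seeing v
    return pow(token, tweak, P)

k0, k1 = 5, 11          # toy per-epoch keys; must be invertible mod P - 1
t0 = tokenize("alice@realcorp.example", k0)
t1 = host_roll_forward(t0, update_tweak(k0, k1))
# correctness condition: fresh epoch-1 tokens match the rolled-forward epoch-0 tokens
assert t1 == tokenize("alice@realcorp.example", k1)
```

The final assertion mirrors the correctness condition described next: newly tokenized data from the owner in an epoch must equal the value the host derives via the update tweak.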
As Tim mentioned, the setup considers a data owner producing data and tokenizing it, the one we refer to here as the source, and an untrusted host, the destination, storing only the tokenized data. The scheme operates in epochs, starting with e0. As a first step, the owner generates a new tokenization key for e0, referred to here as key zero, and uses it to tokenize any new values added to the data set. The owner then sends the tokenized data to the untrusted host, and repeats this for the next epoch, e1, and so on. At each rollover the owner also sends an update tweak to the host, thereby rolling the values tokenized in the previous epoch forward to the current one.

It's really a very complex set of algorithms that we have tried to represent at a high level on this slide because of time constraints, but if you're looking to solve similar problem statements in your organization, we highly recommend you refer to the paper, which is linked at the bottom of the slide, and explore it further. Before we move on, a point to note in this context: the correctness condition of the algorithms is that any newly tokenized data from the owner in a particular epoch must be the same as the tokenized value produced by the host using the update tweak, thereby ensuring referential integrity.

Okay, let's switch it up a bit now and talk about privacy. For the context of the next few slides, we're going to refer to security as stopping unauthorized actions, and privacy as stopping unauthorized use cases.

On back-end jobs, let's take the example of salary forecasting, as depicted by this awesome-looking bar diagram. Due to privacy requirements, the salary modeling system has access only to cohort-level data containing anonymized compensation submissions, for example, salaries for security engineers in the San Francisco Bay Area.
Let me try to elaborate a little more. I might be very interested in having salary forecasting as a critical data point while zeroing in on my next job. Now, the gentleman sitting over here and the person over there might have been generous enough to provide their compensation details to this salary modeling system so that we can gain that insight, but of course they wouldn't be comfortable being directly attributed or specifically categorized, with the data traceable to them as individuals. So in this case, the IDs attributable to the user are hashed. Hashing in this instance is being used as both a security and a privacy control, and indeed, security and privacy controls often have overlapping needs and requirements. That means that when you select a control for one, the other needs to be in your mind, or you may end up in a situation where you select a security control and the organization is unable or unwilling to implement the matching privacy control.

Next we'll cover one more example of privacy controls, this time on user-facing applications. User-facing privacy controls are relevant to protect certain categories of private actions a user can take, such as clicking on an ad or reading an article. This is very different from a social action, like posting or sharing an article. Let me elaborate with an example. Let's say I have a startup with a next-gen AI security solution. I post an article about it somewhere, say on Hacker News, and I want to see if it's resonating with leaders in this space, so I go looking at how many CISOs have read the article in the last couple of months. Now let's say my good friend John Doe, the CISO at RealCorp, reads it. John reading the article should definitely contribute towards the stats, right? But because of privacy concerns, I shouldn't be able to explicitly figure out that it is indeed John who has read it.

So how do we ensure this? The platform prevents me from creating a filter for CISOs who have read my article if the results of that filter are too few.
I could still, however, set up a filter for people who work at RealCorp, and people who live in San Francisco, California, and people who have previously worked at another real corporation, and combine those; despite the platform's efforts, I can still attribute this action to an individual. The way to protect against that is generally called differential privacy, and it works by adding a random amount of noise to the results of these queries. Now, you can attack that too, by issuing the same query repeatedly and averaging out the results to remove the noise. To protect against this, you can use pseudorandom noise generation, where a predetermined amount of noise is added to the query and the same noise gets reapplied each time the query is run.

So far we've talked about some common data protection techniques like encryption, FPE, tokenization, hashing, and so on. Let's quickly glance through some of the other options we might have.

Homomorphic encryption allows computations to be performed on encrypted data without first having to decrypt it. Sounds promising, right? This picture shows one such additive operation for homomorphic encryption. However, this technique has some significant limitations in terms of the operations that can be performed, and it also has latency concerns and high storage requirements, so for obvious reasons it's not mainstream. That being said, it's still an emerging solution, especially in high-security cloud use cases.

AI has been exploding in popularity recently, in places such as ChatGPT and Tesla and Uber, and typically models like those are trained on real data while they're being developed. In some cases it may be possible to build those models on synthetic, completely fake data, and then, once the developers have some level of confidence in their models, move them over to real data.

Masking is another commonly used technique to protect data. It involves replacing characters of the data with a placeholder value. It's very easy to do, and it can be done on the fly rather than needing to be done on the data at rest, making it a very lightweight option.
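Going back to differential privacy for a second, here is a minimal sketch of the seeded ("pseudorandom") noise idea, with made-up numbers of our own, showing how it defeats the repeat-and-average attack:

```python
import hashlib
import math
import random

TRUE_COUNT = 42   # hypothetical: how many CISOs at RealCorp read the article

def noisy_count(query_id: str, epsilon: float = 1.0) -> float:
    # Laplace noise calibrated to a counting query (sensitivity 1).
    # Seeding the RNG from the query itself means the *same* query always
    # gets the *same* noise, so averaging repeated runs reveals nothing new.
    rng = random.Random(hashlib.sha256(query_id.encode()).digest())
    u = rng.random() - 0.5                      # uniform in [-0.5, 0.5)
    noise = -(1 / epsilon) * math.copysign(1, u) * math.log(1 - 2 * abs(u))
    return TRUE_COUNT + noise

# The attack: run the identical query many times and average the results...
runs = [noisy_count("csos who read my article") for _ in range(100)]
# ...but every run returns the identical noisy value, so averaging gains nothing.
assert len(set(runs)) == 1
```

A production system would also track a privacy budget across *different* queries; this sketch only shows why repeating one query doesn't help the attacker.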
Okay, so Mr. Cat here agrees that data protection and data security are indeed a big deal. We've talked about quite a few options, each with its own security properties, so how do we rationally categorize which data security strategy aligns better with which specific use cases? Since there are so many options out there, with different security properties and each with its own caveats and nuances, we tried creating a flowchart to help decide. We focused on a combination of data properties and security needs. Note that a green arrow represents a yes to the question, and red is, obviously, a no. The blue boxes denote runtime protections, as opposed to protections applied to the data as it's written to the store.

Two key points about this flowchart. First, it does not take into account business needs beyond the need to reverse the data. Second, it doesn't handle layering options: for example, if you have unstructured data that also needs to be reversed, we could enable encryption to protect the data at rest and mask all data on any reads from the data set to protect it in use.

Let's take the example of data which is structured, that we need to be able to reverse, and that we want to protect both in use and at rest. We say yes to "is it structured" and yes to "is it reversible", which immediately eliminates the options of hashing, synthetic data, and tokenized data. We've said the data needs to be protected at rest, so we can eliminate masking and adding noise. Lastly, because it needs to be protected in use, we can't use traditional encryption or format-preserving encryption. Our vision for this chart is to codify it into something that engineering can use to get some idea of what protections they can apply, without the delays that are typically involved in talking to security about pretty much anything.

Okay, so throughout this presentation we have alluded to gotchas in several of the techniques; we'll now try to cover some of them.
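As a sketch of what that codification might look like, here is our own toy rendering of the flowchart as a function. The option names and elimination rules are our reading of the slide, and, like the flowchart itself, it ignores business needs and the layering of options; the unstructured-data rule is an assumption on our part.

```python
def recommend_protection(structured: bool, reversible: bool,
                         at_rest: bool, in_use: bool) -> list:
    options = {"encryption", "format-preserving encryption", "tokenization",
               "hashing", "synthetic data", "masking", "adding noise"}
    if reversible:
        # per the talk's walkthrough: these can't give the original value back
        options -= {"hashing", "synthetic data", "tokenization"}
    if at_rest:
        # masking and noise are runtime protections, not storage protections
        options -= {"masking", "adding noise"}
    if in_use:
        # traditional encryption and FPE require decryption before use
        options -= {"encryption", "format-preserving encryption"}
    if not structured:
        # assumption: format-dependent options need structured fields
        options -= {"format-preserving encryption", "tokenization"}
    return sorted(options)

# The worked example: structured, reversible, protected at rest *and* in use
# eliminates every single option -- a hint that layered protections are needed.
assert recommend_protection(True, True, True, True) == []
```

An empty result is itself informative: it tells the engineer that no single technique satisfies all the requirements and that options must be layered.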
Let's talk about hashing. Hashing is only really helpful if the data domain is significantly large. For example, hashing cities isn't a good idea, because an attacker can very easily and effectively use rainbow tables to bypass the protection. Format-preserving encryption may leak data types, like IP addresses, and may even allow attackers to infer the values based on the surrounding data. Tokenization has few weaknesses other than the limitations around when it can be used, but if you can successfully attack the token mapping table, it fails catastrophically. Masking is tough to do if you're not doing it to the entire field. For example, it's common to mask email addresses by masking the user alias: if my email address were tim@timsdomain.com and we masked the user alias, we'd be left with ***@timsdomain.com, and that assumes we mask the entire alias, which isn't always the case either.

Lastly, our threat models so far have described problems one at a time, with a single database in mind, but it's conceivable that an attacker could gain access to multiple data sets and join them on unique data that isn't protected. So the bigger picture is important to keep in mind as well.

Data security is tough. There are legal requirements to do it, there are lots of options for doing it, and each of those options has a lot of nuances and limitations, but hopefully we've given you some ideas about what you can do and when. Thanks. Any questions?

So the question was: homomorphic encryption, can you explain it a little more? The questioner has heard about it but not seen it in use. There's the additive operation, the one we had in the screenshot, and there's a multiplicative operation as well, and by virtue of certain algorithms there's no need to decrypt the data; you can perform the computation directly and get the resulting output. What does it actually mean, is the question. It enables you to do computations on the data without knowing what that data is.
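To make that additive property concrete, here is a toy Paillier-style sketch, our own illustration with deliberately tiny, insecure parameters; it is not what any production system would use, but it shows ciphertexts being combined without decryption.

```python
import math
import random

# Toy Paillier cryptosystem: multiplying ciphertexts adds the plaintexts.
p, q = 17, 19                        # real deployments use primes of 1024+ bits
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = math.lcm(p - 1, q - 1)
L = lambda x: (x - 1) // n
mu = pow(L(pow(g, lam, n2)), -1, n)  # decryption constant

def enc(m: int, rng=random.Random(0)) -> int:
    r = rng.randrange(1, n)
    while math.gcd(r, n) != 1:       # blinding factor must be coprime to n
        r = rng.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def dec(c: int) -> int:
    return (L(pow(c, lam, n2)) * mu) % n

# A party holding only ciphertexts multiplies them...
c_sum = (enc(20) * enc(22)) % n2
# ...and the key holder decrypts the sum without the addends ever being exposed.
assert dec(c_sum) == 20 + 22
```

The multiplicative variant mentioned in the answer works analogously in other schemes (e.g. raising ciphertexts to constants, or fully homomorphic constructions), with the latency and storage costs noted earlier.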
The slide that was up for it went a bit into the math behind it; I'm not a mathematician, so I can't explain to you how the math works, sorry.

Cool, one other question. You mentioned some of the pitfalls of the different methods, and format preserving was listed as a pitfall, for example for IP addresses; there are algorithms like Crypto-PAn where you can still perform top-k prefix searches on the anonymized data points. But isn't that precisely the trade-off, because you want to be somewhere in the middle of the spectrum between full privacy and nothing at all? So I think I would still have questions even after the flowchart, when I'm not at the two ends of the spectrum but somewhere in the middle: with any of these options, if I want some extra utility from, say, the structure preservation, what do I give away? To me, navigating that space would be the next thing I'd have to do after hearing your talk.

Yeah, I think the flowchart really should be two flowcharts, one where we walk through the data use needs and one where we walk through the desired security attributes being applied to them, but we didn't have time to break the flowchart in two before this talk. Isn't that a plus of format-preserving encryption? Yes, it is, but only if you are knowingly taking that on, and you say, I'm okay with people being able to deduce information about this particular data. Thank you. Any other questions?

I'm sorry, I couldn't hear you. Which of the masking techniques do the developers like the most? Probably what we have described as masking, because it is the easiest. Any other questions? Not just IPv4? Yeah, that goes more into how we make certain libraries available to the developers to handle that, and I don't have a good answer for you right now.
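Since masking came up as the developer favorite, one last illustration of the alias-masking gotcha from the previous section (`mask_email` is our own toy helper, not a real library function):

```python
def mask_email(addr: str, keep: int = 0) -> str:
    # Replace characters of the user alias with a placeholder, optionally
    # keeping the first `keep` characters (a common, and weak, convention).
    alias, domain = addr.split("@", 1)
    return alias[:keep] + "*" * max(len(alias) - keep, 0) + "@" + domain

# Masking the whole alias still leaks the domain...
assert mask_email("tim@timsdomain.com") == "***@timsdomain.com"
# ...and partial masking of a short alias leaks almost everything.
assert mask_email("tim@timsdomain.com", keep=2) == "ti*@timsdomain.com"
```

With a three-character alias and `keep=2`, the "masked" value is trivially guessable, which is exactly the pitfall described in the gotchas.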
Let's actually end the Q&A right now, if you h…