← All talks

Multiparty Pseudonymization and Anonymization

BSides Budabest · 202321:3873 viewsPublished 2023-07Watch on YouTube ↗
Speakers
Tags
CategoryTechnical
StyleTalk
About this talk
Dániel Zentai - Multiparty Pseudonymization and Anonymization This presentation was held at #BSidesBUD2023 IT security conference on 25th May 2023. Anonymization and pseudonymization may look like very similar concepts, but according to GDPR, it is not the case. Pseudonymized data can be recovered using some secret information (e.g. a decryption key), anonymized data on the other hand cannot be recovered under any circumstances. In this talk, we will examine a hash-based anonymization protocol and a public key encryption-based pseudonymization protocol. https://bsidesbud.com All rights reserved. #BSidesBUD2023 #Anonymization #Pseudonymized
Show transcript [en]

okay to continue ladies and gentlemen we have Daniel Zentai with our second presentation of the afternoon thank you okay welcome everyone my name is Daniel azante and I'm the head of cryptography research at the extender uh we are a small company located here in Budapest and currently we are working on a data collaboration platform what does it mean uh basically we we want to have several clients and customers and they and they want to upload their data in a common database and then run some queries over the encrypted data without gaining too much information about the data uploaded by the other participants so the presentation will look like this we will have some introduction first

then some legal definitions and and then I will describe an a custom anonymization protocol developed by developed by us I will describe it in in details and and then I also will describe a pseudonymization protocol at least the sketch of it we will we won't go that much details into it and then I will talk about a bit about the fully homomorphic encryption and then our future plans so as I said in this data collaboration platform uh we we have some customers who want to upload a bunch of data in a custom database and the and the data should be stored in an encrypted way and and also the customers want to gain some statistics out of it or run some

queries in the database why the data remains encrypted so potential use cases could be gaining statistics out of medical records without knowing the identity of the the patients or a or fraud detection in in insurance companies to detect double claims but if nobody does any suspicious we we don't we don't want to know about the the clients of the other insurance companies so our approach is we want to partition the data into two sets identifiers and attribute attributes and ID will will be anything that that can uniquely identify the data subject like passport number or or ID card number or something like this and and the attributes will be any other data like a hair color color or age or

something that that doesn't identify uniquely the data subject sometimes it's not that easy what counts as an identifier because the height of the tallest person in the world is is a pretty unique identifier but but let's say we can discuss it with our customers and and we can separate our database into base into two sets so we are using a custom anonymization or psycho dynamization protocol on the IDS and we will use fully homomorphic encryption on the attributes now what is the difference between anonymization and pseudonymization so at first sight it doesn't look any different but but in gdpr it's these are two very different definitions because uh gdpr said these are not the most precise legal definitions because

I'm not a lawyer and uh this is just the one this is just how I interpreted these definitions to myself so basically anonymization is a process that is not a reversible so you have some personal ideas and you do something with it and the output of this something can cannot be reversed and and you and you under no circumstances you can you can get back to the original IDs pseudonymization on the other hand it is a reversible process you can you can you can get the original identifier if you have some additional information like a decryption key or something like that so with some additional information you can get the original ID from the pseudonyms

and according to gdpr anonymized data is no longer considered as personal data but pseudonymized data is considered as personal data so the first thing if we want to anonymize the IDS we we we have to discard any encryption techniques because encryptions by definition they are reversible they can be they can be decrypted with a decryption key so the first thing that comes to our mind is is some hashing so the problem with it is uh is identifiers I usually usually pretty easy to Brute Force so for example the in Hungary the ID card number is is six digits and two letters that is not not a very complex password so you can easily root force it if you get the hash of it

so then the second thing that comes to my mind is a is some keyed hash so put the identifier and some secret key in the hash and the output is a is a anonymization technique the problem with it we we don't want to let the the customers to to calculate the the anonymization functions on their own because if they can calculate it on their own they can locally Brute Force the the anonymized data so so we have to protect the customers from each other not only some attacker so our requirements in this anonymization protocol we want a transformation that is one way so you from the from the anonymous output of the anonymization function you

can't get the original identifier the results should be hard to Brute Force and all of the clients are required to to perform this anonymization protocol and the other more tricky stuff the the output of the anonymization should be the same regardless of of which participant initiated the anonymization so if if the first insurance company have an ID of a customer and and and they want to anonymize it then the output will be the same if a second insurance company have the same customer will be the same ID so that's that's what we want to do with the IDS and also on the attributes we we want to enable the clients to to run some queries

now uh we have to talk about some mathematics so it's a good time to run out from the from the room if you were afraid if what you see here is a is looks like pretty complicated but uh but believe me it's not so basically we have two big prime numbers like 2 000 bits or something like this or really big prime numbers and and we choose two values Alpha and beta and they are primitive elements of zp which is the sciatic cyclic cycle group with cardinalityp so if you don't understand what this means then don't worry about it it's a pretty complicated sentence that describes a a pretty easy algebraic algebraic structure so this this zp is just a set of the

numbers between 1 and P minus one and the cyclic group means if you if you calculate anything with these elements audition and or multiplication or raise it to a power or something like that then you always take the the reminder modulo P so for example if if P equals 7 then I don't know two times two times four is eight but if you divide it by seven then the remainder remain there will be one so the result is one so anytime you out or multiply two numbers together you get the remainder module of this prime number P that's all we have to know about it and this primitive element means if you if you start to raise this Alpha and beta

into various powers then eventually you will get every element between 1 and P minus 1. so to give you an example if if again P equals seven then uh and and you choose choose a number two then if you if you start to calculate the powers of two the first power is two the second is four the third third power is of two will be eight but you get the reminder if you divide it by seven so it's one and after one you get two again and four and one so this is a situation that we don't want because if P equals seven that we have a set of six numbers and if you if you calculate the if we

calculate the power powers of two then out of these six possible numbers we we can only get three of them so it's a it's a smaller subset this is what we don't want so after these definitions there is a not so well known hash function this is called fan Heist and fitzman hash and you can see you the hash of two values X and Y is Alpha to the power x times beta 2 to the power why and you take the reminder module of P so that is a hash we we want to use using the anonymization protocol uh this is not a well-known hash uh not because it's not secure it's it's a it's

called it's Collision resistant if the discrete logarithm logarithm problem is hard but uh it's it's much less efficient than than the more widely used used hash functions like the Saj family so it's a bit slower now how does our anonymization protocol work so we have a set of participants let's call them P1 p21 b n and we have a server let's call it s and suppose that a client for example P1 wants to anonymize an identifier so P1 all clients generate two random secret Keys these are x i and y y sub I so these are two private skis for private keys for all of the participants and if the server receives an an upload

request from from the from from the from the participant P1 then the server generates and different random numbers and and sends the encrypted version of e to the participants with the private key sorry the public key of that particular participant so this part is is not for the anonymization this is just a simple challenge response protocol because the the server sends out the random values and oh and if all of the random random values came back to the server then the server can be sure that everybody participated in the computation and that's what we want so this part is just the challenge response this is not for anonymization but what the only anonymization part is

about uh is the the first participant who calculates the the hash of X1 plus the ID and Y one so this is the the hash of the first positive participant the first participant includes into the hash his uh his private keys and the ID and sends it to the second participant every other participant multiply the the value that he that he received from the previous participant multiply it with his own hash value so in participant number seven will multiply the the product of the first six value with the with age of X seven y seven that's what they do and the last participant we will send this value to the server and this value is the multiplication multiplication of

every participant's hashes and because the hash is calculated this way that I showed you so the hash is just a multiplication then uh regardless of who initiated the the anonymization the final result is if you multiply all these values together the final result will be the same denoted by H and so the the exponent in in the alpha base will be ID plus the sum of x's and the exponent of beta will be the sum of Y uh wise regardless of who initiates the anonymization now uh we we have a proof of concept of this protocol and the the running time looks like this uh it's it's not so good if if we have many participants

or or many uh Records that we want to anonymize if we do it over zp but if we are using elliptic curves uh this is much better so if two participants can run the protocol with the with one data record in in 15 minutes milliseconds so it's it's okay not the fastest hash ever but but it's okay so we are using elliptic curves it since the the security of the hash relies on the discrete logarithm problem then every algebraic algebraic structure can work where the discrete logarithm problem is hard so that's the only musician part uh I will say a few words about pseudo-animization just a few words because we are not developing this yet

because anonymization is is a is a stronger technique pseudonymization the only case where it's important if you if you want to recover the original data from the anonymized identifiers but for pseudonymization we can use a simple algonal encryption alcohol is a public key encryption based on the discrete logarithm problem so it it has a similar setup with the with the cyclic groups and the and the private Keys is is the exponent and the public key is a primitive element raised to these exponents and the every every participant can can have their own key and if and if they multiply if they add add together the private Keys then they have a master private key that's why if

they multiply the public Keys together they will have a master public key so it's I won't go any more details into it because this is not the first version of of our platform we are we are not that far from being finished with the with the first version as I said we have we have a working proof of concept and the and it it should be a finished product in in the next one or two months but that's with the anonymization protocol that's our first version so uh I don't have too much time left but uh I want I want to say a few words about fully homework quick encryption basically it full is a type of type

of encryption where you you can perform arbitrary computation on the ciphertext and the and the result that if you if you decorate the result it will be the same as if you perform this computation on the plain text so the we are we are encrypting uh the the attributes with fully homomorphic encryption and this way the participants can run queries on them without revealing the the identities of of the data subject because the identities that are not stored with fully Automotive encryption they are stored with the anonymization protocol so the main problem with it is it's a bit slower not a bit it's a lot slower than than the widely used encryption schemes but uh we are working on this

one so some future directions we we want to have a version where where we can eliminate the the server so decentralized version you may ask why don't we start with this one because from a security perspective the decentralized version is is better than the centralized centralized version it's because of legal concerns because uh if if we have a centralized version then every client have to have to sign a contract with only us and in a decentralized version everybody have to sign the contract with everybody else so it's a premier league perspective it's a much much more complicated version another future future direction will be at the differential privacy into the system differential privacy is a is a technique

where you sacrifice some some accuracy to gain some more privacy for example if if the clients want to protect their data even from each other and and they don't want to don't want the other participants to know the exact output of of the of the queries then they can add some carefully chosen random noise into their data and the and the output will be a somewhat accurate the button but not completely accurate uh answer so another future direction would be a threshold protocol when not all the participants are required to perform the computation but but only a subset of them uh we have a solution that on paper that that works but it's a bit more

complicated and and I'm not sure it will be it will be fast enough to to do it in practice uh an even more interesting future direction will be to make it a post Quantum algorithm and we we don't have too much idea about it I mean we have some idea but uh but it's not even close to being a finished product and currently we are experimenting with the with some Hardware accelerations of because it's it's pretty slow so I think that's all that I wanted to tell you so thanks for the your attention and if you are interested check our web pages uh because we we should finish our our product in the next one or two

months so thank you very much

Daniel