
So the title of this talk is "What the Yandex Leak Tells Us About How Big Tech Uses Your Data," and please welcome our speaker, Kaileigh McCrea. Over to you.

Hello, thank you so much, everybody, for sticking around, especially post happy hour. All right, so this is going to be a fire hose. I have a lot of information to throw at you in a very short period of time, so bear with me.

Today we are going to be talking about Yandex. It's a Russian search giant, a massive platform with over 90 different services under its umbrella. It kind of owns everything in Russia, and its parent
company, Yandex N.V., is actually based in the Netherlands. Some of its products are used globally; its taxi service, for example, is in Israel. That just made the news, big drama there, and we'll talk about it a little bit.

In January 2023, almost 45 gigabytes of Yandex's source code was leaked on BreachForums, which has since been shut down by authorities. RIP. This is the post, screenshot courtesy of BleepingComputer, which had a lot more foresight than I did in capturing that screenshot. Yandex confirmed the leak, so we know it's real. They denied that it was a hack; instead they said, basically, that a former employee leaked this
code base.

So first we're going to go into a teeny bit more background, then we're going to dive into the code. We're going to talk about what they're collecting, what they're doing with that data, and who they are sharing it with. Then we are going to wrap up and hopefully have enough time for Q&A.

There's been much drama this year, and I'm going to try to do the very quick and dirty version of it. After Russia invaded Ukraine in February of 2022, the West began to express concern about Yandex collecting and storing the data of Western users in Russia, where it might be vulnerable to Kremlin abuse. Privacy researcher Zach
Edwards sounded the alarm about Yandex's AppMetrica SDK sending analytics data back to Russia. Now, AppMetrica is sort of like Yandex's version of Google Analytics, so it's supposed to be for growth and product teams to see how their app is working. It's embedded in hundreds of millions of apps globally. Some of them are VPNs, which are, you know, supposed to be protecting your privacy, and some of these apps specifically target Ukrainian users. You can see why that is now a concern. Yandex pushed back pretty strongly, saying that they ask for consent and that their data is anonymized. But even at the time, experts and researchers disagreed, and you can see even their
own website is pretty fuzzy on just how anonymized and non-personalized their data is. So this is their big statement in response to that drama from last year: Yandex acknowledges its software collects device, network, and IP address information, but it called this data non-personalized and very limited. It added that although theoretically possible, in practice it is extremely hard to identify users based solely on such information collected; Yandex definitely cannot do this. Very carefully worded statements. A lot of lawyers put a lot of time into that.

At the same time, thousands of Russian engineers, including Yandex's, were fleeing the country, and Yandex faced increased pressure from the Russian government, through increasingly strict media laws,
to spread propaganda about the war in Ukraine. And then, at the same time, its executives faced sanctions from the EU because they were spreading that disinformation, so they were sort of squeezed on both sides. Because they were getting hit with sanctions, its executive director and its CEO both stepped down last year, raising big questions about who is even running Yandex right now. It appears to be just the board. Yandex's parent company, the one in the Netherlands, decided to sell off first its news products, because of the propaganda drama, to the Kremlin-controlled social media app VK, and then they decided they might as well just sell everything else,
raising big questions about where all of Yandex's user data is going to go when it gets sold to whoever it gets sold to.

Almost immediately, this Putin ally, Alexei Kudrin, the former finance minister and a very close Putin ally, agreed to become Yandex's adviser on corporate development, to advise on this restructuring and the sale, which just further entrenches the Kremlin's control over Yandex. As of now, Yandex is still up for sale. In June, Putin actually approved a bid from a consortium of billionaires (read: oligarchs) and VTB Bank, but that appears to have been vetoed by Yandex's foreign investors, because they have to find someone to sell it to who they
won't face sanctions for selling it to. Not a lot of people left. There's also this really weird new Russian law where foreign investors selling their Russian assets have to do so at a discount of at least 50%, and there's a 10% tax on top of that, so it's not going to be pretty. There aren't too many people left who Putin will approve a sale to and who Yandex's foreign investors will also accept a sale to, so essentially nationalization is rumored to be on the table.

So now we are going to dive into this code. The code in the Yandex leak is broken down by service or application. It's written in a mix of
Russian and English. They use a variety of coding languages, but it's a lot of Python and C++, and then YQL, Yandex Query Language, which is a flavor of SQL. And this leak is just the code itself. It's not a Git repository, so we don't have the version history, we don't have the databases, and we don't really have the machine learning models, just sort of the very basic bones of them. So basically I can say what this code most likely does, but I can't say for sure what was actively being run at what time. Disclaimer.

We're going to start with Metrika. The server-side AppMetrica data is in a service called Metrika, with
a K, which encompasses the data both from that SDK for mobile and from the desktop analytics version. These are some of the raw data fields that AppMetrica collects. Remember when Yandex told the Financial Times that the data it collects is non-personalized and very limited? Sure. You can see at the top that it is going into something called an anonymizer, but nothing about this level of detail is non-personalized or anonymized, and it certainly isn't anonymized when it gets to this point in AppMetrica's servers. You're going to see from how it's used that they never really anonymize it. To start with, these unique identifiers that were at the top, they're getting hashed, and
that's lovely and theoretically anonymizing. They're still going to be very unique and uniquely identifying, because that hash is going to carry a ton of entropy, and that's going to make it very easy to match probabilistically with other data as it comes in, so that they can sort of combine and get a bigger picture of household activity. All you really have to do is hash any incoming identifiers and then see if the outputs match. So it's going to be both private and functional, in theory. But whatever this is actually supposed to accomplish, they render it completely meaningless.

Also, AppMetrica is taking in some really precise location data.
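
Going back to the hashed identifiers for a second, here is a minimal sketch, in Python, of why hashing alone doesn't anonymize much: the same input always produces the same output, so two datasets can be joined on the hash. The function names, the sample IDs, and the use of unsalted SHA-256 are all assumptions for illustration; none of this is taken from the leaked code.

```python
import hashlib


def pseudonymize(identifier: str) -> str:
    """Return an unsalted SHA-256 hex digest (deterministic for a given input)."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


# Hypothetical "anonymized" analytics events, stored by hashed device ID.
analytics_events = [
    {"device_hash": pseudonymize("android-id-1234"), "ssid": "HomeWiFi"},
    {"device_hash": pseudonymize("android-id-9999"), "ssid": "CafeWiFi"},
]

# Hypothetical record arriving from a different source with the raw ID.
incoming = {"raw_device_id": "android-id-1234", "email": "user@example.com"}


def link(incoming_record: dict, events: list) -> list:
    """Hash the incoming identifier and return every event with the same hash."""
    target = pseudonymize(incoming_record["raw_device_id"])
    return [event for event in events if event["device_hash"] == target]


# The email address is now tied to the "anonymized" Wi-Fi event.
matches = link(incoming, analytics_events)
```

Anyone who holds a raw identifier can recompute the hash and look the person up, which is all the matching needs.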
It's not that uncommon for app analytics to take in latitude and longitude, so the product and growth team can see where their users are. But what's not very normal is taking in a user's altitude, direction, and speed, which, together with a timestamp, gives you a very disturbingly accurate picture of a user's movements. Unless the app you're using is like a run tracker or Pokémon Go, there aren't a lot of use cases that justify that. With this information, if someone is using your app on an airplane, you could tell how high it's flying, how fast it's going, and in what direction, and I think that's overkill for product and growth teams.

So let's take a quick look at how
ineffective that anonymization is, starting with these fields. A Wi-Fi SSID, if you don't know (and if you're here, you probably know), is essentially the name of a Wi-Fi network. So if you're connected to the hotel or conference Wi-Fi right now, that's your SSID. Here we have these same fields in Crypta, so straight out of Metrika, that's the source. We have that device ID and we have the original device ID. Whether either of these fields is hashed at this point is unclear, and honestly irrelevant; you're going to see why. Here are those exact Wi-Fi fields again, and thanks to the hashed or unhashed (who knows) device ID, they are attached to a unique identifier. And here both that
device ID and SSID are being matched to a Yandex user ID, which is very important, because that Yandex user ID gets matched to a whole lot of other pools of data in Yandex's servers, and we're going to get into that. One possible motive to select both device ID and SSID, as you've probably guessed, is that an SSID will have multiple devices associated with it, so you can sort of use that to detect a relationship between devices and say, oh, this user has multiple devices, or this is a household, etc. You can do a lot with it, and they do.

Here we have identifiers that come in through a click event. They're being matched with
any IDs, hashed or unhashed, until a match is found in the system, so that the events can be processed and correlated with the pre-existing data about that consumer or household. For example, it's comparing the plain Android ID, the MD5 hash, and the SHA-1 hash. If you're here, you probably know those are not great hashing algorithms at this point; they're pretty out of date. The fingerprint right here is generated using some of the raw fields that we looked at, like client IP and OS version. It's a pretty standard fingerprint: you take device information, put it in a dictionary, hash it, and you get another unique identifier that makes it easy to find that device again when it
uses your app. So even after anonymization, this data is still being effectively used to identify matches. And again, that's how anonymization is supposed to work; it is still supposed to be functional. Then, as this new information comes in, it gets matched with user sociodemographic attributes, and they update them as necessary. Sorry, this is a pretty small example. Here they're using age band, which is a lovely but pointless tribute to privacy, because they do have exact age in other parts of the system, but it's a nice attempt. And then gender; they just have male or female in this case. It's all associated, again, with that device ID, which you can see at the
bottom has to get hashed before it is sent over the buffer, which suggests that it was probably processed and stored unhashed, which is pretty inconsistent and, well, it renders the hashing pretty pointless, doesn't it?

Metrika also has code related to Yandex's Audiences product, which allows users to generate segments for targeted advertising or user profiles using data from AppMetrica, third-party data brokers, or their own data. In the first two cases, consumers who end up in the segment don't have to have used your app before, because it could be used to generate fresh leads. So you can use Audiences to get information about basically whatever users you want.

So let's get into Crypta.
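
To make the ID matching and fingerprinting described in this section concrete, here is a rough sketch in Python. It assumes a hypothetical in-memory lookup table keyed by whatever form of the ID was stored (plain, MD5, or SHA-1) and a fingerprint built by hashing a dictionary of device fields. The function names, field names, and sample values are invented for illustration and are not the leaked code.

```python
import hashlib
import json


def fingerprint(device_info: dict) -> str:
    """Serialize device fields in a stable key order and hash the result."""
    canonical = json.dumps(device_info, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def candidate_keys(android_id: str) -> list:
    """Return every form of the ID that might already be stored."""
    raw = android_id.encode("utf-8")
    return [
        android_id,                      # plain ID
        hashlib.md5(raw).hexdigest(),    # MD5 form
        hashlib.sha1(raw).hexdigest(),   # SHA-1 form
    ]


def find_profile(android_id: str, known_profiles: dict):
    """Try each form of the ID in turn until one matches an existing profile."""
    for key in candidate_keys(android_id):
        if key in known_profiles:
            return known_profiles[key]
    return None


# Hypothetical store in which this profile happens to be keyed by MD5 hash.
known_profiles = {
    hashlib.md5(b"abc123").hexdigest(): {"household_id": 7, "age_band": "25-34"},
}

# An incoming click event carrying the plain ID still lands on the profile.
profile = find_profile("abc123", known_profiles)

# A fingerprint from assumed fields; the same fields always give the same value.
device_fp = fingerprint({"client_ip": "203.0.113.7", "os_version": "13"})
```

Because every form of the ID is tried, it doesn't matter whether a given record was stored hashed or unhashed; the lookup succeeds either way.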
Crypta is Yandex's behavioral analytics service. It analyzes all of this data it has access to, and it has access to a lot of data, and it identifies specific characteristics to put into segments for ad targeting, theoretically for Yandex ads and also that Audiences product. It takes data points from all over Yandex's services, in part because Yandex ads advertises all over Yandex's services.

Here are just some examples of the segments that Crypta generates. Here we have smokers, which seems to track users who purchase specific smoking products. Nothing exciting; we're talking e-cigarettes, tobacco. Summer residents, which tracks which users have dachas, which are retreats and summer homes, and how often they
visit them. And then we're going to look at a few more of these. Travelers uses geolocation to track when you've gone from your main region, which they have already determined, to another region, and whether that travel was domestic or international. Mail data: remember, Yandex has an email service, so it appears to pull from email data to track whether you have any boarding passes, plane tickets, or hotel confirmations in your email. This gas station segment seems to process where you bought gas, like physically visited a gas station and bought gas. And, you know, it seems pretty plausible that if Yandex can make these
segments, then anyone who buys Yandex and gets access to Yandex's data points could easily make a segment like, you know, young men of military age trying to leave Russia very quickly, or generate segments based around vices and blackmail.

This is a very basic example of a household composition that Crypta stores. We've got that household ID, that's important, then size, gender, any elderly (unfortunately they called it "has old," a little cringe), and has children. But of course, we've already seen they have much more precise information than that. The household information is honestly the least creepy thing about Crypta. So once again we've got AppMetrica data. It's associated with a device ID, and it's being used to pull
Wi-Fi information (apparently they haven't heard of 5G yet). This time it's tracking connection types for a little segment. Once again, AppMetrica SSIDs are being used for processing, presumably to deduplicate user records, because they can say, oh, this is associated with a common Wi-Fi access point. They're using it for cleanup, basically.

These are some examples of data pools that Crypta uses for processing, for the purposes of its fuzzy matching in its graph segment. Crypta pulls login and email data... actually, no, I want to back up really quick. So here we have a lot of email data. We've got geolocation for home and work locations. Household, been there, seen that. Search data. SSID; they love it, I
don't know, they can't stop using it. So here we are: Crypta pulls email and login data and associates it with the Yandex user ID. So if you connect any so-called anonymized data from AppMetrica to a Yandex ID, Crypta can associate it with email login information, which pretty effectively re-identifies it.

Here we have Crypta just, like, shamelessly scraping for every type of identifier it can think of. Yandex uses the Passport system. It's like one Yandex login that rules them all; it logs you in across all of its services. This form takes in first name, last name, and phone number. Crypta has some of this data. It can definitely take a Passport user ID and match it to a phone
number. You can see that source type, passport profile; it implies to me that they probably also have all that other information, like first and last name, but I did not catch them actually using it. Passport phone dump certainly suggests that they're just scraping these phone numbers en masse.

One of the things that happens in graphs is that they process the latitude and longitude of your predicted home (again, they've already done some processing here), and they associate it with your Yandex ID and everything associated with that. They plot it on a little geograph, which they then use to find and plot your literal neighbors, their Yandex user IDs, and all of the information associated with
that.

Here we have data from two Yandex products being used by Crypta in a super creepy way. No method should be called "extract children from taxi." That sounds terrible. This is part of a very long process that involves pulling children and ages from search data, then from AppMetrica, and then from this taxi app, and they pull it all together to create this very holistic picture of how many children are in a given household. If you zoom in on that last section, you can see that once you have one kind of ID, whether it's household, Passport, Crypta ID, or Yandex ID, you have them all transitively anyway, and all of the identifying data that is
associated with them, which, as we've just seen, is plenty of identifying data. You can do plenty with that. These profiles also integrate biometric data, most likely from Yandex's smart speakers, which use their smart assistant called Alice, who is supposed to be able to interact effectively with children, play games with them, make up fairy tales. Cute. So Crypta uses voice biometric data to identify children and their age range by voice, to further build out the household profiles. That voice biometric data is probably from this Alice product. It's not unreasonable for a voice-activated product to be able to identify children's voices so that it can interact appropriately with them, but
this isn't in the Alice product. This data is in Crypta. It's been taken out of Alice and its original function, purpose, and intent, and now it's in Crypta being used for behavioral analytics. You can also see from that sociodemographic data that they have birth date; they have specific birth dates.

Crypta has a UI portal to display some of this household and user profile information, like marital status, income, children, and some very basic interests. You can search these individual profiles by Crypta or Yandex user ID, which suggests that they're not just aggregating them; you can search for more information on individual consumers. And Yandex appears to be able to associate all of these IDs, email, Yandex
user ID, iOS and Android ID, Passport ID, etc., with social media accounts: Instagram, Facebook, and VK, which, remember, is a very Russian social media site.

They have code here called matcher that syncs fingerprinting events with major Russian telecoms providers. One of them, Rostelecom, is Russian state-owned. It also happens to provide broadband service to Crimea. So fingerprinting events that are synced with this provider through Crypta could be accessible to parts of the Russian state. Honestly irrelevant, because it's about to be bought by a Russian oligarch, if we're being honest. So here's the matcher. We're going to zoom in a little bit so you can see: they build a connection to the API, pretty basic, they pass in this fingerprinting
event, and then Rostelecom basically looks for a match in its own system. If it finds one, it sends back its own user ID. What you get out, and this is some test data that they had, is like half log, half fingerprint, and it has that new external ID and it has that source.

All right, so wrap-up, just in time. Here is what they have. Here is just some of what they can do with that. They have a UI to display this information, and they're probably about to be state controlled. These AppMetrica SDKs, remember, give Yandex a very broad international reach of data subjects who probably don't even
know that they are Yandex data subjects. Blah, blah, blah, sorry, I have to skip right through this now. Yeah, they make some gestures towards anonymization, but they undo it. Pay attention to who runs your SDKs, and remember that whatever you're sharing with an app, that app can be sold at any minute, or, you know, things can go very crazy in that app's home country, and then who knows who gets access to that data.

Here we have some useless QR codes, but they will be useful tomorrow if you want to take a picture for later. So, any questions?
Sorry, can you... I can't hear.

For the information that's being collected, you were saying it collects Wi-Fi SSIDs and all that?

Yes.

Do you know if there's any difference between Android and iOS in terms of what it can collect?

I don't see them... like, they certainly have iOS and Android IDs, and that's associated with that information, but I don't see them doing anything different with that data per se, other than where it's useful for segmenting.

Yeah, okay, makes sense. Thank you.

Do you happen to know if the Yandex browser was included in the leak, like the source code for that?

I'm pretty sure it was, but I haven't had a
chance to look at it yet.

Okay, because, I mean, it would certainly be nice to look at it for exploit research and stuff.

Oh yeah, there's probably a lot there. The data for Passport and Alice were also leaked, and again, I just looked at Metrika and Crypta, basically, so there's a lot more there.

Any more questions? All right, thank you, and thank you, everybody, for a great