BSidesSF 2026 - Follow the data to learn the secret (Dylan Ayrey)

Uploaded: 2026-05-12
Duration: 35 min 17 s


Mentioned in this talk

Tools used

Comfy UI, TruffleHog

Platforms

GitHub, Hugging Face

About this talk

Follow the data to learn the secret (Dylan Ayrey). The datasets powering our agentic world are littered with secrets. Thousands of training sets on HuggingFace contain live API keys, passwords, and credentials. Whether datasets have duplicate keys with one another, or unique keys, exposes more about the data's origin than we were prepared to learn. https://bsidessf2026.sched.com/event/3ab7437ef8270f2e5a5652f2847c29f4


Everybody ready for a good talk?

So, I'm Dylan. I'm the CEO and co-founder of a company called Truffle Security, which is built on top of the popular open source tool called TruffleHog that you might be familiar with. I give a lot of talks like this, and you can follow me on my Twitter here. As you might imagine, TruffleHog focuses on API keys, passwords, secrets, that kind of thing. So, a lot of the research that I do is related to finding API keys, passwords, and secrets. This talk is centered around Hugging Face, but the more I looked into Hugging Face, the more I realized there was more there than just API keys, passwords, and secrets. And to really get into that, we

need to go back to the origin of Hugging Face. In 2016, three people, Clément, Julien, and Thomas, raised about a million dollars to make a business around an AI chatbot. That company was Hugging Face. The infamous "Attention Is All You Need" paper that Google put out, the one that powered the AI insanity that we're all living through, hadn't dropped yet. That came out a year later in 2017. So in 2016, the chatbot these three co-founders were working on would have been technically limited. Think more like how Alexa and Siri used to feel back then. That's what their initial chatbot would have been like, technology-wise. So when Google finally dropped their Transformers paper

in 2017, and then importantly a year later when Google released BERT, which was like the world's first open-source LLM, well, maybe the world's second open-source LLM, you could say it got the attention of Hugging Face, to say the least. I mentioned it was the second one. The first one was from a lesser-known company that no one had heard of, and they dropped a model called GPT-1 a little bit before BERT was released. Both these models follow the same basic formula. (Can we turn the volume up a little bit? Sorry.) A foundational model following the "Attention Is All You Need" architecture, with two stages of learning. An unsupervised stage where they blindly

sucked up as much data as possible, followed by a supervised stage where they would reinforce it on particular behaviors. Behaviors like acting like a chatbot. Here's me interacting with the real BERT. No one was paying attention to OpenAI's GPT-1, but BERT was getting a lot of attention. Me interacting with it, I asked it what is the capital of New Jersey and it correctly says the answer is Newark. I then asked it what is 2 plus 2 and it correctly says the answer is 1. And so for a company working on an AI chatbot, well, they took notice, and Hugging Face wanted to adopt BERT, tweak it, fine-tune it, you might even

say transform it. And so they released pytorch-pretrained-BERT version 0.1.1 and open-sourced it. That's a mouthful of a library. So they changed the name of that library and, importantly, it became the foundation of the popular transformers library that's become an industry standard for working with open source models today. This represents a transition in the company's history, from a company focused on a single chatbot to a company providing the open-source AI infrastructure that powers all of the open-source models today. And so a couple years later, they launched the hub, a platform that now hosts all of the popular open source models in the world, all the code required to run those models, and a lot

of training data. And to put into perspective just how much training data is on Hugging Face, I want to zoom into one particular data set. The stack. The stack is one data set on Hugging Face. It's three terabytes of permissive code scraped from GitHub. I mean, it's humbling. It's like all of humans writing code for decades. The collection of almost every single line of code that was meant to be given away for free with permissive licenses, all in one collective data set. And I wanted to just visualize how much data this is, how many lines of code humans manually banged out to make this. And I was thinking to myself how I could do that. One idea I had was maybe I could

print it out. Like, maybe I could go to FedEx, ask them to print a giant copy of the stack. Like how big would it be? How big would the stack of the stack be? And so, you know, while I was kind of looking into this, I realized the cheapest way to print hundreds of pages is actually not to go to FedEx. It's to publish a book in the Amazon marketplace. And so, that's what I did. And you can find this book listed on Amazon, and you can see how thick it is. But this is not the entire stack. It's just volume one. Volume one out of a really big Roman numeral. And if you

can't read Roman numerals that big, don't worry. I can't either. I'll help you out. It's volume one of 1,429,866 total volumes that would be needed to print the entire stack data set. And if you're trying to visualize how big that might be if stacked cover to cover on a shelf, I'll help you there, too. It's 288 miles of shelf space that you would need, which would stretch all the way from New Jersey to Maryland, to print the full stack out in the smallest typography that Amazon supports. The publishers of the stack published a second version, a V2. I printed the first copy of the first version. The second version is 10 times larger than the first version, representing 3,000

miles of books if placed on a shelf. And the stack is one of 883,000 data sets currently hosted on Hugging Face. It isn't even the largest data set. In fact, it's two orders of magnitude smaller than some of the larger data sets on the hub. I'm going to tell you some stories about some of this data. But what you have to keep in mind while I'm going through it is there's no metaphor that I can possibly give you that truly wraps your mind around just how much data this is. And there isn't enough time in a million conference talks to tell you all the stories that are contained in this data. And at the time of making this slide,

which was about a week ago, we hadn't had the time to scan all of Hugging Face. It was too much. Like, we thought we could get through it all before this talk, but that just turned out to not be the case. We scanned millions of individual files and found almost 100,000 unique live secrets and API keys using TruffleHog. But my shame was we couldn't scan the whole thing. It was just too much data and it's going to take months more to complete. And if you know anything about me or us or the research that we do, you know, we scan a lot of non-traditional places looking

for secrets. One of my favorites was finding an Amazon key in my smart bed, which we did a blog about. But the reality is all these previous places that we've scanned looking for keys, they exist inside Hugging Face. Data scientists have gone out and scraped all that stuff. GitHub, npm, S3, Postman, public websites, they've slurped it all up into separate data sets and published them to Hugging Face, plus a whole bunch of stuff we haven't thought to scan yet. And they're all just subsets of the 880,000 data sets. Hugging Face really is the big one. And my shame is that I wasn't able to complete 100% of it by the time of giving this talk. In fact, we were

only able to get through 25% of it. Um, and so the stories that I'm going to be telling you are from that 25%, not from the 100%. Um, they say one exposed API key is a tragedy. Well, hundreds of thousands just becomes a statistic. So, I'm going to zoom in on individual tragedies and tell those stories because it's the best I can do. I can't tell you 70,000 different stories today. But before I do any of that, I want to tell you a story that lives outside of the hub. The first story I'm going to tell you is about data and the very definition of what open source means in the world of AI. Like models aren't

exactly source code. They're weights. So how can we, you know, really call them open source? Is open source the right word for an open model? That question almost tore apart a nonprofit called the Open Source Initiative. Who is the OSI? You might have seen their logo before. I mean, you may know who they are through what they've done. In the '90s, their founder coined the term open source. And many governments for years have recognized their definition of what open source is. So why does it matter how a government defines open source in the first place? The short answer to that is regulation. And let me introduce you to the European Privacy Act, which gives special carveouts and special privileges

lowering the legal and technical compliance burden of open-source AI models. So when it came to defining open-source AI models, it almost broke the OSI. Why? Well, the short of it is the definition of open source, and whether or not it should include data. Data like the 880,000 data sets that are already open and hosted on Hugging Face: should those be a requirement in order to call a model open source? A question so charged that the OSI suspended their board election because they didn't want to allow on a new board member who had radically different views on this topic. And the radical view in question was just whether or not the data should be required to be open as a part of the

definition of an open source model. And you can find a petition online that has hundreds of names on it, upset with this outcome of the OSI. Why? Why did they resist opening the data as a part of open models? What is it about this question that seems so impractical and so charged that the OSI board put themselves in, frankly, an embarrassing crossroads? To answer that question, we need to zoom in on some of the data. Way, way in, to one data set and one file in this massive pile of data. So picture this: it's a warm day in March and you are a Nanix employee. You have this great idea of how to make AI better at

identifying security vulnerabilities. Your idea is simple. You're going to go collect a bunch of Semgrep scans of GitHub repositories and then you're going to scrape the GitHub issues from those same repositories. Your thinking is if you fed those two things into an AI model, you could teach it to better learn what's a real vulnerability and what's not, how to filter out the false positives based on the real GitHub issues that are created. And so if you go out and you scan all this different GitHub data with Semgrep and scrape in all those issues, you'd have a pretty compelling data set to do that. And so you get ready to do this massive scrape, but you realize something pretty quick.

If you want to scrape this much data, using GitHub's unauthenticated API is just not going to work. You have 50 requests per hour or something like that that you can send. No, you need to upgrade and use their authenticated API, and you need to generate API keys to be able to do that. So, you make a personal access token and you kick off a massive scrape and a massive Semgrep scan, the world's largest Semgrep scan ever assembled, and you pull it off. Like, a huge undertaking. But you put together the biggest collection of Semgrep scans anyone has ever done and you give it away for free and open to the world. And when I say open, I mean Apache licensed.

You put out almost 10 gigabytes of Semgrep scans and GitHub issues with an Apache 2 license, free for the world to use, to consume, and make their AIs better at identifying security vulnerabilities. Except for one small problem that you didn't notice. That little personal access token that I mentioned at the top of the story. Well, that was a classic personal access token and you gave it the repository scope. Now, some of you might be starting to put the pieces together at this point, but let me lay it out for you. When you generate a classic access token with a repository scope, it gets access to all the repositories that you have access to. Not just your public repositories, but

also the private repositories that you have access to. The 1,800 private Nanix repositories that make up all of Nanix's internal code. And when you kicked off your scrape, you didn't exactly tell it which repositories to scrape and which ones not. You just said pull it all. And so, unbeknownst to you, a part of that 10 GB were Semgrep scans of every single internal repository. But it gets worse. Not only did you Apache 2 license all 1,800 internal repository Semgrep scans, but because Semgrep saves secrets in cleartext in its output when it finds them, all of the hard-coded secrets that Semgrep could pick up got Apache 2 licensed as well. And I was able to go

through those secrets and let me tell you, there's one or two in them that are just about as bad as it gets. And what's challenging here is although they can go and revoke the secrets, they can't really take back the Semgrep scans. They've already been given out. They've already received an Apache license and it was downloaded thousands and thousands of times. They took the data set down, but copies of the data are not only going to live on individual machines, they're probably also going to live on in subsets of other data sets that have picked them up and aggregated them. But let's zoom out. Zoom out of that one file from that one data set.
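One way to guard against both halves of this mistake can be sketched as follows: filter the repository list down to explicitly public repositories before scanning, and blank out matched source lines before publishing a report. This is a minimal sketch under assumptions, not what the data set's authors ran. The `private` flag is part of the repository objects GitHub's REST API returns; the `results[].extra.lines` layout is an assumption about the scanner's JSON report and should be checked against the version in use. Both function names are hypothetical.

```python
from typing import Any


def public_repos_only(repos: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only repositories whose metadata explicitly says private == False."""
    # Fail closed: a repo with a missing or non-False "private" flag is skipped.
    return [repo for repo in repos if repo.get("private") is False]


def redact_findings(report: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a scan report with the matched source lines blanked out."""
    redacted = []
    for finding in report.get("results", []):
        # Assumed layout: each finding keeps the matched code under extra["lines"].
        extra = dict(finding.get("extra", {}))
        if "lines" in extra:
            extra["lines"] = "[REDACTED]"
        redacted.append({**finding, "extra": extra})
    return {**report, "results": redacted}
```

The filter fails closed on purpose: a token with the classic repository scope can list private repositories, so anything not positively marked public is left out of the scrape.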

And then back in again to another file in another data set. This story takes place in 2023. An anonymous developer in the AI community named Comfy Anonymous, it's a weird name, started a project called Comfy UI. And their idea was simple. They wanted to take Stable Diffusion, which was an image generation technology, and make it more flexible. What I mean by that is if you maybe generate a beach with a person on it, and you want to change the person, but not the beach, that's really hard to do with out-of-the-box image generation tooling. And so the goal of Comfy UI was to break this into different stages and

workflows. So you had the RNG values feed into making the beach with a prompt that made the beach. And then in another stage, you had RNG values and a prompt that made the person on the beach. So you could tweak the person without tweaking the beach. That sounds powerful, right? Sounds compelling. The core of it was reproducibility. It was really important to this project that everybody be able to generate the exact same beach from the, you know, from the initial prompts. And so part of that culture was distributing those prompts. And the best way that they could think of to do that was to make it optional where you could have the entire workflow with all the prompts and everything

required to make that image embedded inside the PNG itself. Like in the metadata section, there's a giant JSON blob that has all the prompts. Except when I said optional, that word optional is doing some heavy lifting. It's doing some heavy lifting on the Wikipedia page for a default behavior. So, everyone that went to make images through Comfy UI, well, they might not have realized that all their prompts and everything that went into that image ended up in the final image. And if you're not using an open-source model, if you're using a third party API service, everything that went into generating that image, well, that's going to include an API key, too, isn't it? Keys in PNGs. I thought I

was going crazy when TruffleHog started flagging a bunch of PNGs saying there are live keys in them. Either I was going crazy or TruffleHog was bugging out. It seemed impossible, at least not at the volume that I was seeing it, of all these different PNGs with API keys. But sure enough, when I cracked open the metadata section, it's exactly what it was. How many people use Comfy and just have no idea that all their prompts, all their workflows, and all their API keys are being statically embedded in the images that they're distributing out? I sure wouldn't have known if we weren't explicitly looking for keys on Hugging Face. But it's time to zoom back out again.
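To make the mechanism concrete, here is a minimal sketch of cracking open a PNG's metadata with only the Python standard library. It walks the PNG chunk list and collects the uncompressed `tEXt` chunks, each of which holds a keyword and a string. The keyword `prompt` in the sample is an assumption about what Comfy UI writes, not something confirmed in the talk; `read_png_text_chunks` and `build_chunk` are hypothetical helper names; compressed `zTXt` and `iTXt` chunks are ignored here.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def build_chunk(chunk_type: bytes, body: bytes) -> bytes:
    """Serialize one PNG chunk: length, type, data, CRC (used to build samples)."""
    crc = zlib.crc32(chunk_type + body) & 0xFFFFFFFF
    return struct.pack(">I", len(body)) + chunk_type + body + struct.pack(">I", crc)


def read_png_text_chunks(data: bytes) -> dict[str, str]:
    """Return {keyword: text} for every tEXt chunk found in a PNG byte string."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    found: dict[str, str] = {}
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        # Each chunk starts with a 4-byte big-endian length and a 4-byte type.
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if chunk_type == b"tEXt":
            # tEXt layout: keyword, a NUL separator, then Latin-1 text.
            keyword, _, text = body.partition(b"\x00")
            found[keyword.decode("latin-1")] = text.decode("latin-1")
        elif chunk_type == b"IEND":
            break
        pos += 8 + length + 4  # header + data + trailing CRC
    return found


# Build a tiny sample image header with an embedded metadata blob.
sample = (
    PNG_SIGNATURE
    + build_chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))
    + build_chunk(b"tEXt", b'prompt\x00{"api_key": "EXAMPLE"}')
    + build_chunk(b"IEND", b"")
)
metadata = read_png_text_chunks(sample)
```

Anything that ends up in `metadata` travels with the image wherever it is shared, which is why a secret scanner pointed at image files can turn up live keys.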

Out of one PNG in one data set and then back in again to another file in another data set. It's 2024. We're in Russia and it's May. Finally, some nice weather. It's not freezing cold anymore. You just got a sweet internship at JetBrains as an AI intern. You're going to pull off a big scrape of GitHub for a little AI project that you're working on. And the basic idea is simple. If you pull patch data from GitHub and you correlate it with GitHub issues, you could make a bug fixing robot. You could take the patch data and correlate it with the GitHub issues and feed it into the AI system so it knows when you see this particular

GitHub issue, this is the patch that you use to fix it. It'll be the ultimate bug squashing data set that everyone can use to make their LLMs better at fixing bugs. Now, the amount of data that we're talking about, one personal access token isn't going to cut it. Like we were talking about before, it would take too long, even with the rate limits of the one authenticated token. You're going to need a bunch of tokens. And to make a bunch of tokens, you're going to need a bunch of accounts. I mean, after all, you're an intern and your internship is going to end in a few months. So, you better make sure you can get all this

data as soon as possible. Like, rate limits be damned. You're smart and you're not going to make the same mistake as the last story. You're going to put in explicit logic to make sure that it doesn't scrape all the private repositories, and you get to work making accounts, a lot of them. Here's a fun fact. If you make a GitHub account, you need a unique email that goes with that GitHub account. So, for every GitHub account that this intern made, they made an email account, a throwaway email account to go with it. It's going to take forever to do all that. It does take forever, but finally they do it. They end up with 20 accounts, 20 API

keys, a big pool to do all their scraping, and they only took one little shortcut. But we'll come back to that shortcut later. Ready to go. Finally, we can clone all this data down and we can zip it up and we can put it in with the GitHub issues. We're, you know, perfectly primed to release this big ginormous patch data set. Except you made a little mistake, a small one. You forgot, or maybe you didn't know, that when you do a clone, the .git directory, which contains the config file that had the token used for the cloning, is a part of the directory. And so if you zip up that whole thing and you publish it, well,

your cloning token is going to end up in your data set. That's not a big deal, though, because you made a whole bunch of throwaway GitHub accounts. And so these tokens don't have access to anything sensitive, except for that one little shortcut that I mentioned. You used one of your real account tokens as a part of the pool: 19 throwaways, plus number 20, which was your own account. So even though you didn't scrape any private data, the access token that has access to private data, well, that's now out there and it's copied thousands of times across all the repositories that you cloned. Fortunately, this one's easier to clean up. You just have to revoke that key. All right, it's time to zoom

out again. Way, way out. And then back in again. So, this man by the name of Tom is a random data scientist on Hugging Face. He doesn't work for a big company. He doesn't have a big organization representing him. He just enjoys poking around in his free time, scraping lots of data from non-traditional places. And when I say non-traditional places, I mean Telegram. He scraped 10 million messages from public Telegram channels. How do you even find public Telegram channels? Well, as it turns out, Telegram's app lets you search for them. So, every channel that's set to public is discoverable to the whole world. And he figured no one had done this before. Everyone has pulled from the

internet, from Reddit, from Stack Overflow, from GitHub, but nobody had pulled from Telegram before. So, he could have this unique training set that's specifically curated for this type of message. But what he didn't know is that buried inside those 10 million messages, in a few of the channels, was a criminal enterprise that was rewarding people for submitting stolen Stripe keys inside public Telegram channels. When I say a reward system, that reward system included signup bonuses, special giveaways, all this incentive to post stolen Stripe keys to these channels. Why? Why are these Stripe keys so important to this criminal enterprise? Well, the answer to that is it powered their testing system for all their stolen credit cards. Imagine stealing a

database of millions of credit cards. You need to know which ones work and which ones don't so you can sell the ones that work and prove to your buyer that these keys really work. And so that's what your pool of 70 Stripe keys are for. And the way you're getting new Stripe keys is by paying random people in these public Telegram channels to submit more. And by the way, a week ago we got these Stripe keys revoked. These stories are very, very recent. And like I said, there's no way we can go through every story in hugging face, but this is an entire criminal enterprise, an entire underground network of public Telegram channels rewarding people for stealing

Stripe API keys, posting them so that millions of victims can have their credit card data stolen and defrauded. Okay, we need to zoom out again. Zoom out in space. But this time, we're not going to zoom back in in space. We're actually going to go back in time a little bit. Back in time all the way to the 1800s. Back before they had AI generated music. You can almost imagine back then you had to have a real live performer who actually played an instrument if you wanted to hear music. It's kind of crazy to think about. In 1896, a British statesman, Lord Rosebery, was giving a series of widely attended political speeches all across England.

Now, there were no recording devices at the time. So, think about how that played out. Like, how could you read the speech from the statesman if there was no way to record the speech? The answer was to send a physical reporter and have the reporter transcribe the speech character by character, word for word. And that's exactly what the Times did. The Times had a reporter follow Rosebery around for a whole year and had that reporter document verbatim transcripts of every speech that he gave. And they published all those transcripts in the paper. And this man named John Lane, not affiliated with the Times, wanted those speeches to be more widely available. And so he published

them in a book. He took them from the Times and he put them in a book. And his thinking is, well, these are public speeches. They were given by a public figure. I should be able to do this. The Times disagreed and they sued him almost immediately. And that sounds a little crazy, right? The information was public. It was given by a public figure, but the Times made the argument that they had to pay a guy for a year to follow the statesman around to get that information. And so they made the argument that they should have some copyright protection to that speech. This was 120 years ago, and the courts ruled in the Times's favor. They said he

wasn't allowed to publish that book and that all the copies, well, they were illegal to distribute. Seems like not too much has changed in 120 years. That old book from 120 years ago, I actually have one of the illegal copies right here, copyrighted from 120 years ago. But guess what? There are thousands of copies of this exact book inside Hugging Face, across thousands of different data sets that are used to train AI models. And the Times is still suing people for exactly that, for taking the data that's put out there and training AI models and, you know, incorporating them into data sets that are open. So I want to introduce to you another data set on the hub, Common

Crawl. And put simply, Common Crawl is just a scrape of the entire internet. Actually, Common Crawl is more than one scrape. It's a routine scrape of the entire internet. When I say the entire internet, I mean everything. Reddit, Stack Overflow, GitHub, and the New York Times. One of these scrapes, the last one from February, is 363 terabytes. Unfathomably large. That is two orders of magnitude larger than the stack, just so you can wrap your head around it. Well, what happens when the owners of those websites say, "Hey, our data is under copyright. I don't like the fact that you've scraped it and included it in the stack"? And what happens when that includes speeches from an illegal

120-year-old book or includes just the New York Times website directly? Well, the answer to that is the New York Times, just like they did 120 years ago, sends takedown requests and the stack, well, they comply with those requests and they remove the Times. But what does that mean for all the models, the open models that were trained on the first version of Common Crawl? Now, you can't reproduce those models because the Common Crawl data set is slowly evaporating over time. It's evaporating as all these different copyright holders are striking and dissolving, bit by bit, the original data sets that were used to make the models. So whether we're talking about copyright, security, data

privacy, honestly, these problems are frankly impossible to solve, at least with our current technology, and that's what made the OSI election so explosive. It wasn't a friendly debate over whether or not they should lower the bar for corporations to claim regulatory benefits. It was actually a debate about whether or not any of these open models could open their data at all: legally, in a privacy-friendly way, or securely. And even if they could, if random copyright owners could slowly strike and slowly deteriorate those data sets, what was the point? Like, those models wouldn't be reproducible ever. They would be snapshots in time. They're all different as the data sets slowly dissolve and slowly become less and less from different copyright strikes. There are

workarounds. There are workarounds that don't have this copyright problem. Let me introduce you to LAION-5B, a data set that contains 5 billion images scraped from the public internet. Except this data set actually has zero images in it. So what's the workaround? What's going on here? Well, this data set is just a data set of external links. Links that point to external images, five billion images. I mean, it would be a nightmare to actually host those images for all the reasons you're probably imagining. But a link telling you where you can download the image, well, that kind of works around the copyright problem. But it doesn't really solve the bigger problem because those links are

going to rot over time. And that data set is also going to slowly atrophy in its own way. If you download all those images this week or download all those images next week, there's going to be a different set of 404 messages, a different set of DNS issues, and the data set is just not going to be consistent, and the AI models trained off of it aren't going to be repeatable. Look, to kind of bring this back, we can't look at open data sets the same way we look at open source software. It's just different. It's too different. It's different from a legal perspective, from a privacy perspective, and from a security perspective. While we were

going around looking for API keys, we found credit card numbers, social security numbers, and copyright issues, like tiny little fractions of all the overall issues on the hub. To put this into better perspective, you know, we talked about a couple individual stories, and I mentioned that, you know, if we were to talk about the millions of API keys, it would take millions of conference talks to tell all those stories. Well, here's an interesting fact and a way that we narrowed in on some of these specific stories. Many of these keys exist in more than one data set. Like, imagine a key leaked on GitHub. I told multiple stories to you of different people scraping GitHub from

different angles. That same key is going to end up in multiple data sets. One example of that is Twitter API keys. This was a little surprising, just to see how biased the data seemed to be toward Twitter API keys. We found the same Twitter API keys over and over and over again, or just a lot of different Twitter API keys, in many, many different copies of data sets. Why? Why Twitter? Well, there's a couple reasons for it. One, it's easy to make tutorials out of Twitter API keys. It's basically the only social media site that has a well-supported API, and everybody uses social media. The second reason, that's a little bit more surprising, has to do

with data curation bias. Remember how I was saying, like, our interns and stuff like that were scraping particular types of data? That's called data curation. And so what we found was one of the methods to curate data when it comes to code is you might want the code that you're training your model on to run without throwing an exception. And so when you start collecting all this different data, you might execute it in a sandbox. And if it throws an exception, you get rid of it. And if it doesn't throw an exception, you include it in your data set. Well, guess what? If at the top of your file, you say your API key goes here, it's

going to throw an exception and you're going to exclude it. If you hardcode the API key, the API key is going to end up in your final data set. And so you have a subtle bias that's actually pushing you towards having more API keys in your final data set than in the starting data that you were curating from. And then the last element here is the first generation of LLMs learned from this behavior. And then they started putting more code out into the wild that had that behavior, and that code got sucked into Hugging Face, which was used to generate the second generation of LLMs, creating a vicious cycle or a vicious feedback loop. That's not to say you

can't solve these problems with different training methods, but it's just explaining how we ended up where we are, which is tens of thousands of Twitter API keys, live API keys with access to real people's Twitter accounts, scattered all over Hugging Face. So, that's an example of the data being copied and curated into multiple different data sets. Same data sets with the same keys. And this got us thinking: so many of the keys got scraped from the same data sets, shared across multiple data sources, because they started from public locations. What about the opposite of that? What about the data leak situation? Well, that wouldn't have had multiple people scraping from GitHub if it just came

from your one private GitHub. So, we flipped this on its head and we said: well, what if we start to look for the orphan secrets, the ones that are only in one data set, that aren't in thousands of data sets? Those might be from accidental leaks. They might not be from scraped public websites. So, we filtered out everything that was copied in multiple data sets. And maybe if we just focus on the ones that appear in one data set, that will be some indication that not just a key, but also data surrounding the key, was accidentally packaged up and included in a larger public data set. So we tried it and as it turns out, it worked. And

that's what brought a lot of the examples that we covered to you: looking for these orphaned keys that didn't exist in multiple data sets but only existed in one. And, you know, all of those keys that were scraped from public websites and public GitHub repositories and chats and everything that was on the public internet, we filtered it out. And by focusing just on the private stuff, that's when we started to find more troves of credit card numbers, more troves of Social Security numbers, you name it. Countless privacy issues, countless security issues, all this private data that accidentally got packaged up. It's frankly overwhelming. And the thing is, you can't train an LLM without the pre-training stage. You need

all of this toxic data to have a powerful LLM. And so between legal and security and privacy issues, there's a unique set of problems with opening up data sets. And for the data sets that are already open, well, they're copied, they're replicated. That Nanx story that I told, I mentioned that they took down their Apache license scans of their internal repositories. It was downloaded 200 times per month at the time that it was taken down. So, it's out there and the data is probably copied into other data sets as well. The reality is we can't let the big labs keep the AI all for themselves. We need open-source alternatives, but the only way to make

those open-source alternatives is through these toxic data sets. And that is what almost broke the OSI about whether or not these data sets should be open. I was thinking of a way to sort of end this talk on a good note because I don't really have a solution to any of these problems. I don't think anybody does. The whole industry is barreling forward, you know, moving faster than we have time to even think about. So, this is the best I could come up with to kind of wrap things up. I'll leave you with this. The data sets that are on Hugging Face comprise almost all of humanity. Like, everything ancient Egyptian, like, scraped copies of every

surviving book, every newspaper from the last 2,000 years, every public image, every public website, almost every thought you can think of is there on Hugging Face. It's all there. And I started thinking to myself, is there eventually going to come a time that we appreciate that all that data was archived and put up, like 100 years from now or 200 years from now? Is this huge collection of data going to tell our story? The API keys, the Social Security numbers, the data breaches, for better or worse, are a part of that story. And I'm kind of resigned to the fact we're never going to clean all this up. It's just too much. It's too much to

think about or to stress yourself out about trying to clean up. And there are too many copies of it replicated on too many systems. There's no going back. So, to take a more optimistic view, I think about the archaeologist 200 years from now, going through it all, finding stories that I didn't have time to tell. I won't weigh in on whether or not the data should be open. I'll leave that up to you and the OSI and Hugging Face, but for now, I just imagine that future archaeologist with a smile on her face finding all the API keys from 200 years prior that are still alive. Thank you all. And I also want to give a huge round of

applause for Brad, who played violin. Brad does security at Figma, and they're hiring, is my understanding. So, to give that little plug. Thank you all so much for coming.
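
The orphan-secret filtering described in the talk (drop every key that appears in multiple data sets, keep the ones that appear in only one) can be sketched roughly as follows. This is a minimal illustration and not the speaker's actual tooling: the function name find_orphan_secrets, the (dataset, secret) input shape, and the sample names are all hypothetical, and the scanning step that produces the findings is assumed to have already happened.

```python
from collections import defaultdict


def find_orphan_secrets(findings):
    """Return secrets that were seen in exactly one data set.

    `findings` is an iterable of (dataset_name, secret) pairs, e.g. the
    output of a secret scanner run over many data sets (hypothetical shape).
    The result maps each orphan secret to the single data set it came from.
    """
    datasets_by_secret = defaultdict(set)
    for dataset, secret in findings:
        # A set means repeated hits inside one data set count only once.
        datasets_by_secret[secret].add(dataset)

    # Secrets shared across data sets were likely scraped from public
    # sources; secrets unique to one data set are candidate accidental leaks.
    return {
        secret: next(iter(datasets))
        for secret, datasets in datasets_by_secret.items()
        if len(datasets) == 1
    }


# Hypothetical sample findings, for illustration only.
sample_findings = [
    ("dataset_a", "key_shared"),
    ("dataset_b", "key_shared"),
    ("dataset_b", "key_private"),
    ("dataset_b", "key_private"),
    ("dataset_c", "key_other"),
]
orphans = find_orphan_secrets(sample_findings)
```

Here key_shared is filtered out because two data sets contain it, while key_private and key_other remain as orphans worth a closer look. In practice one would probably compare hashes of the secrets rather than the raw values, so the working set does not itself become another copy of the keys.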
