
Why mirroring software vulnerability data matters

BSides PDX · 21:17 · Published 2024-11
About this talk
CI/CD jobs don’t care about US budget cuts: why mirroring software vulnerability data matters

The National Vulnerability Database (NVD) is the biggest source of information about software vulnerabilities in the world. As more people have been using this data to help secure software supply chains, the servers behind the NVD have struggled to keep up. They implemented rate limiting and added an API to help people download less data at a time, but demand continued to grow, and continuous integration jobs didn’t care about US budget cuts affecting those running the service. But those same test and scanning jobs have managed not to completely overrun the servers that handle software updates for popular Linux distributions. What if rather than convincing people to slurp data more carefully (while somehow also convincing them that vulnerability scanning was great and they should do it), we reached out to some open source mirroring experts and made some magic happen? This is the story of how we mirrored the world’s vulnerability data and why.

John ‘Warthog9’ Hawley is a Linux kernel maintainer (ktest), FOSS mirror operator, wrangler of open source licensing, open hardware designer, and someone who believes in not only over-engineering most things but that 12 gnomes in a trench coat sometimes works pretty darned well.

BSides Portland is a tax-exempt charitable 501(c)(3) organization founded with the mission to cultivate the Pacific Northwest information security and hacking community by creating local inclusive opportunities for learning, networking, collaboration, and teaching. bsidespdx.org
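
A back-of-envelope sketch of why paging through a rate-limited API is a slow way to bootstrap a full copy of the database. The page size of 50 is the figure mentioned in the talk; the total record count and the per-request delay are illustrative assumptions, not figures published by NVD, and `full_sync_estimate` is a hypothetical helper name.

```python
import math


def full_sync_estimate(total_records: int, page_size: int,
                       seconds_per_request: float) -> tuple[int, float]:
    """Return (number of requests, total hours) to page through a whole dataset."""
    requests = math.ceil(total_records / page_size)
    hours = requests * seconds_per_request / 3600
    return requests, hours


# Assumed figures, for illustration only.
TOTAL_RECORDS = 250_000      # assumption: rough size of the full database
PAGE_SIZE = 50               # "50 entries at a time", as described in the talk
SECONDS_PER_REQUEST = 6.0    # assumption: polite delay between requests

requests, hours = full_sync_estimate(TOTAL_RECORDS, PAGE_SIZE, SECONDS_PER_REQUEST)
print(f"{requests} requests, about {hours:.1f} hours")
# prints "5000 requests, about 8.3 hours"
```

Under these assumptions, an ephemeral CI job that throws its data away would pay that cost on every run, which is the situation a static, mirrored file is meant to avoid.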

[Music] Thank you all for giving up your lunch break to come to my talk. So this talk is "CI/CD jobs don't care about budget cuts: why mirroring software vulnerability data matters." I'm obviously not John; I'm the other half of this presentation. Once upon a time, not so long ago, the US government decided that we should be more careful about cybersecurity, and they wanted to improve how that was done among the people who provide them with software. As a result, everyone's security plan all of a sudden looked kind of like this: step one, scan for vulnerabilities; step two, do something about them; step three, profit. So there's a lot to say about

what happens in step two and what happens in step three, but a lot of people got stuck on step one, so let's talk about what happened there. If you're going to see what vulnerabilities you have in your software, you need a list of what software you have, which is shockingly difficult and is an entire separate talk. But you also need a list of all the known vulnerabilities that exist in the world, and one of the premier places to get that list is the US government. The National Institute of Standards and Technology provides this giant database. It's public, it's open, it's free to use; that's where you get your vulnerability database. But what happens if you have

maybe a little underfunded government agency, and your top guy points everyone at your data source? Yeah, you basically denial-of-serviced your own agency, and you did it very effectively. I mean, it's not even just the US people; it's like the whole world is trying to download this database at once. So that didn't work so well. And then on top of that we started seeing news articles like this one, saying: look, they don't have any money, they're not actually updating the database the way they used to, and nothing's ever going to get better. Great, great news. Good job, good job everyone, good coordination. So the fine people at NVD actually did have some plans for this,

and they did things like setting up some rate limits, so people's wackadoodle CI jobs didn't accidentally denial-of-service the entire world at once. That was pretty good. And they did an API so you could get just the newest stuff, which would be great, except if you try to download the entire vulnerability database 50 entries at a time, it takes a long, long time to get a whole copy of the database. So people didn't like that; there needed to be some sort of way to start. And then also a lot of people were running these things in ephemeral CI jobs, where I just grab a Docker container, download my crap, build it, scan

it, done, throw it all away. They were not keeping any of that data, so they didn't like that either. And the NVD folks learned this pretty early, because they were like, "Yeah, we're going to turn off all the old JSON files, it's going to be great," and then they were like, "Well, not this month. Not this month. Maybe not this year." So obviously we were not the only people seeing this. And then the next step was asking for more funding, which is a perfectly reasonable thing to do, but it takes a lot of time. So why did I care? I am here not on behalf of my employer, but this is my

work project, which makes a vulnerability scanner that grabs a bunch of publicly known data and helps you scan your software. The idea was we can't go around whining that open source people never scan their software and then never give them any tools or ability to do so. So that's what it does, and I got to experience all of these NVD changes as a developer on the other end. First we had to add in the API keys, and then we had to change the code to do the backoff for the different rate limits, and then handle the errors when it gives you the rate limits, and then we had to find a way to, like,

bootstrap your data, because if you're in the US, sometimes you can get the whole data through the API, but if you're in, like, India, which a number of our Google Summer of Code students were, you were never getting that data, so you were just screwed. And at the same time my colleagues in France and in the UK were saying, "Look, why do they keep taking the database offline in the middle of the day? I can't get any data." So it was not going well, and I had a moment to make a choice. Now, I could fix this just for me, and I could continue fixing it just for me. I

could change it so that I screwed over everyone, which, as I know, in this audience all of you are thinking looks like a great way to force everyone to get more funding, right? Yeah. Or, I have a green card and I do not want to be deported, so why don't we take the high road and try to fix this not just for me but for everyone. So one of the things that you might realize when we're talking about CI jobs is, those things are just grabbing a container with a Linux image, adding all the software, updating it, and going. Why does it work that they can get an entire software package? It

is faster to get the fix than to find out whether you need it or not. Like, why? That's stupid. And I know the reason why, because I am an old open source hippie: the answer is software mirroring. We built an infrastructure to make that magic happen even though we don't have government funding, we don't have industry funding; we have a bunch of random open source hippies spread around the world. Now, not everyone has the ability to tap into the network as easily as I do, but it turns out that one of the premier experts in open source mirroring and content distribution networks for open source

lives in Portland. In fact, he lives in my house. So I walked upstairs and talked to my husband John, and this is how we solved the problem. When your wife comes to you with a problem and she says, "Gosh darn it, fix it," your first answer is "Yes, ma'am." The second answer is, uh, yeah. So, to explain how I'm actually one of the oldest and probably one of the more premier people who understands open source mirroring: I ran kernel.org for about a decade, long, long ago, in the far distant past, so I've seen how everything has literally been built from almost day one. At some point I stopped doing

kernel.org and I wandered off to do other things, at which point a friend of mine came to me and was like, "Hey, we should do some mirroring," and I got back into it. But I jump ahead a little bit. So, problem number one: as my wife points out, APIs suck and everyone should feel bad about them. And why is HTTP GET not considered an API call? Anybody in the room believe that HTTP GET is not an API call? Oh good, I don't have to actually argue with anybody, that's excellent. So really the answer here was, step one, we need a place to actually host all this data, because one of the things in

the open source mirroring world is you don't just have these magic mirror servers out there serving all of this content; they actually have to get it from somewhere. And if you don't make it easy for people to get the content in the first place to mirror, they don't mirror it. I'm looking at you, Debian. I'm looking at everybody who makes mirroring hard or asks for a bloody SSH session to the mirror server. I'm not joking about that, actually. So step one: steal my wife's project name, create a website, make the sketchiest website ever (which is now different; it's actually been updated so that it's not entirely sketchy), and start actually setting up a

place that has rsync and HTTP access so that people can actually go and find the data. Okay, that's easy. Buying a domain name is not that expensive. I have 42U worth of data center space down in California. Setting this up isn't that hard. Great, problem one is now solved. It also gave me a really good excuse to buy another domain name; I might have a problem. Okay, problem number two. Now I can serve this data in some vaguely useful way, but this isn't mirrored yet, because, as NVD obviously found out somehow, Linux plus sendfile might be too slow. I don't think they actually found that out; I think they use

Microsoft, which doesn't use sendfile. Would you like to know how much more efficient sendfile is? Anyway. So how do we distribute this? Again, getting back to this idea of mirroring: a friend of mine came to me at one point and said, "Hey, we've accidentally created this entire internet exchange" (yes, we accidentally created an internet exchange), "maybe we should put a mirror server on this so that all of our servers that are on this internet exchange can get our data much faster than going out to wherever the heck it is we're going." Great. So we took some money, built a server, and then people started actually

paying us money to put their names on a label on the hard drives in the server. Why, I don't know, but they keep doing it, and I keep being appreciative that I'm not spending my own money. But then my friend, his name is Kenneth, came to me and said, "Look, you've been doing this forever. What if we change this up a little bit?" I'm like, okay, you've got an idea here, clearly. "What if we took thin clients..." Does anybody not know what a thin client is? Oh good. Oh, okay, so there's a couple of you. Thin clients: if you've got a laptop, think something that's 10x slower than your

laptop. This is a machine that's barely designed to be bolted to the back of a monitor and display something on a good day. He's like, "Okay, what if we snag some thin clients?" I'm like, okay, so we can pick these up for like 20 bucks on eBay. "And we throw an SSD in these." This is not going to end well, is it? "And then we put them on the public internet and we use them for mirroring." And my immediate reaction was: this is a dumb idea, it's going to fail. And then I went to sleep, and I woke up in the morning, and I got back to him and I said, okay, I did the math, and yes,

this actually can work, because in reality we've been coming at the mirroring problem slightly wrong. We've been coming at this with a big-iron take, needing more RAM than is sane and faster disk than you can possibly think of to solve this problem. The real answer is we need disk to be ever so slightly faster than the network interface, and once you're faster than the network interface and you've got enough cache to kind of keep your float going, you can keep going and you can mirror. So, yeah, we did this, and then we just kind of vaguely published that, hey everybody, we've got this mirroring as a

service thing. We'll send you this $25 box with a $100 SSD in it that has a gigabit NIC that is awful because it's Realtek, and you can put it in your ISP, and then you can have a local mirror. And you know how that turned out? Right there: that's a data center. I'm not even joking. This is actually a data center in the United Kingdom. One of the mirror boxes lives right there in that green box, and it happens to be an ISP that's doing fiber to the home, and that just happens to be their largest data center location, which is a box in the middle of a

field. So we sent them a box and they put it in there, and in fact it's still running to this day. But they weren't the only ones who did this. In fact, internet service providers, internet exchanges, anybody who had a random amount of spare bandwidth (because we're not talking a lot here; we're talking a gigabit, we're talking 10 gigabits, at least in the beginning) would take these little boxes, which are right there. You can see that's an HP t620, which is a ridiculously low-powered AMD-based box, and they would shove it in wherever they could and then have to pull out a switch that uses, I think it's

this switch, that uses four times more power than our box to actually plug into it, because we use a copper NIC instead of an SFP. That is how ridiculous this has gotten, that ISPs are willing to do this. But they gave us data center space, and they gave us transit, and they gave us power, because we're low power, because we're using a thin client where the CPU is not crazy, the SSDs aren't crazy, and all of that. We're only talking about a gigabit to 10 gigabits of bandwidth, until we got to the mirror in Australia, which has 100 gigabits, and that's a different story for a different time. And so they threw it in there. So,

great, I now have a place to mirror from, I now have a place to mirror to. And before anybody asks, there is a statement that I have to make, and I have to reset my magic counter to zero because I've pointed to the paper. Who thinks BitTorrent is a good idea? You're wrong. I have a paper I wrote in 2008 that says you're wrong, and none of the numbers have gotten any better since 2008. If you would like to read it, it's right there; go have fun. And if any of you say IPFS, you can leave, because it's worse. So my friend Kenneth and I created a worldwide mirroring system that today

has 339 gigabits per second of worldwide network bandwidth. That's a third of a terabit of network bandwidth that we have access to, to be able to deploy this stuff. The NIST databases aren't that big. Onto all of the mirrors this data goes, because now I have a place to mirror from, a place to mirror to, and a worldwide place to start distributing all of this data. And at some point you're going to ask: okay, John, how are you and Kenneth not the sketchiest human beings on the planet? Because now you are taking CVE data that everybody else is now going to depend on, and you're publishing it for

everybody. And the answer is: yes, we are the sketchiest people on the planet, and you shouldn't trust us at all. The data does get mirrored, but how many of you have downloaded VLC in the last week, month, year? Congratulations, you downloaded it from us, because we're like 95% of VLC's bandwidth these days. So if you trusted us to download VLC, congratulations, you probably can trust us enough to download your CVE data. And just to give you an idea, this is from VLC's actual MirrorBits perspective, and that's how much data we moved in 24 hours on a quiet day. So what did we end up doing

here? We've ended up running our own API crawler, because my wife needed it to do the API queries for her own project, but since the API is slow enough, we might as well do it once for everybody so we don't have to do it again. This not only saves everybody else time and effort, it keeps the API server as clear as we can. But it also means that we can generate JSON files that are much easier to transfer, because frankly I don't care if they're even a gigabyte in size: sending a gigabyte blob is easier than having an actual API

query. We're generating the JSON files, we're signing those files. Yeah, the entire GPG keychain is on the machine that's actually doing all of this, so take the signature with a grain of salt. It probably came from me, it's probably safe, but if you actually care about it, go back and verify the JSON files from the original source. We're mirroring this, and we've literally solved the entire API DoS problem for free, because an entire mirroring system is designed for nothing but to handle a distributed denial-of-service attack from all of you, constantly. So with that: the prettiest data center on the face of the Earth. My wife, Terri Oda, can be found out here on

Mastodon; I can be found out on Mastodon. Questions, comments, concerns, confessions, mockery? Please come. I guess there's not a whole lot of way for questions, but if you shout I can always repeat them. Oh, there is. I'm just standing in exactly the wrong spot, I can't see it. So there is a microphone if you want to come say anything. You can also donate to our ridiculous mirroring project if you really get bored. "The slide with your paper?" Oh yes, the slide with the paper, that one. Basically it's from the Ottawa Linux Symposium in 2008, pages 173 to 182. Currently the best mirror is out on Landley's website, or if you

at me on Mastodon I can always point you to the same link. "How did we convince those ISPs to give us the space?" We just kind of said, hey, we've got this. And part of what makes our value proposition here compelling, which is a little bit different than the way this has mostly been done, is we not only send the ISP or the internet exchange or whoever's got the bandwidth the hardware, we manage the hardware, and we deal with all of the mirroring in the back end. So when I say this is mirroring as a service, literally Kenneth and I are managing all 30 of these machines simultaneously. So the

ISPs' job here is literally: plug it in, give us a couple of IP addresses that we can keep moving, and then they're out. They get the sponsorship information from, you know, the URLs and the whole nine yards, and we talk about them in conferences and whatnot. Like Ziply: if you're here in the Portland area, the Ziply mirrors up in Washington are the big ones that you're probably downloading from. So we just kind of keep announcing it and keep talking about it at these kinds of events, at things like NANOG and at other things, and

people... "They also get everyone their data blazing fast." Yeah, yep. It actually cuts down on their external transit as well. So, any other questions before I get played off with a Sasquatch? In that case I'm getting played off with a Sasquatch. One question? "What order of magnitude of boxes do you have distributed across the world?" Say that again? "What's the order of magnitude of units you have distributed across the world?" That 339 gigabits per second is on something like 34 different machines, anywhere from Australia to Europe, and there's a whole pile here in the US and Canada. I think the only continents we do not

currently have machines on are Antarctica, which we're working on, and Africa. [Applause] [Music]
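
The client-side changes described earlier in the talk (back off when rate limited, then retry) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code of the scanner discussed in the talk: `RateLimited`, `fetch_with_backoff`, and `fake_fetch` are hypothetical names, and the fetcher is injected so the example runs without touching any real server.

```python
import time
from typing import Callable


class RateLimited(Exception):
    """Hypothetical error a fetcher raises when the server asks it to slow down."""


def fetch_with_backoff(
    fetch: Callable[[], bytes],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch(), doubling the wait after each rate-limit error."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay * 2**attempt)  # waits 1s, 2s, 4s, ... by default
    raise AssertionError("unreachable")


# Demo with a fake fetcher that is rate limited twice, then succeeds.
attempts: list[int] = []


def fake_fetch() -> bytes:
    attempts.append(1)
    if len(attempts) < 3:
        raise RateLimited()
    return b'{"cves": []}'


waits: list[float] = []
data = fetch_with_backoff(fake_fetch, sleep=waits.append)  # record waits instead of sleeping
print(data, waits)
```

Every client running a loop like this still adds load to the upstream API, which is the talk's argument for crawling once and serving the resulting files from mirrors.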