
Well, I'll get started. Thank you all for coming to my talk. I'll start with an analogy from meatspace. Cyberspace is where we all work, but meatspace is where we all live, and in meatspace, with physical products, when you buy something you usually get some information along with it: where it was assembled, maybe where the parts came from, maybe an organic sticker on the package telling you it met some criterion or another. If you get physical goods from Amazon, sometimes inside the box there's a slip saying it was inspected by quality inspector number 13. So you know some things about where the product came from. A thing people are very familiar with is the ingredients list on a bag of chips, and that's close to what I want to talk about today. An SBOM, a software bill of materials, is much like that: if you're looking at an artifact, what are all the open source components and dependencies that went into it? That's what an SBOM gives you. It's related to the idea of provenance, but not the same. Provenance is more about where the artifact itself came from: not so much what went into it, but what were all the things that happened to it along the way.
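To make that SBOM-versus-provenance distinction concrete, here's a minimal sketch in Python. Everything in it is invented for illustration; it doesn't follow any real SBOM format (like CycloneDX or SPDX) or any real provenance schema.

```python
# Illustrative only: these dicts mimic the *shape* of the two ideas,
# not any real SBOM or provenance specification.

# An SBOM answers: what went INTO the artifact? (the ingredients list)
sbom = {
    "artifact": "myapp-1.0.tar.gz",
    "components": [
        {"name": "requests", "version": "2.31.0"},
        {"name": "urllib3", "version": "2.0.7"},
    ],
}

# Provenance answers: what HAPPENED to the artifact along the way?
provenance = {
    "artifact": "myapp-1.0.tar.gz",
    "built_from": {"repo": "https://example.com/myapp.git", "commit": "abc123"},
    "built_by": "ci-builder-7",
    "steps": ["clone", "test", "build", "sign"],
}

# Both describe the same artifact, from two different angles.
assert sbom["artifact"] == provenance["artifact"]
```

Same artifact, two complementary records: one lists ingredients, the other narrates the journey.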
With software artifacts, we generally don't get that information; if you're consuming anything, it's really hard to come by. SBOMs are becoming more popular: the Python packaging ecosystem is seeing PEPs around providing SBOMs for Python itself, and you'll see them from some vendors. But provenance is not something we've come to expect, and that's what this talk is about: provenance for software artifacts. If you're downloading something like a wheel from the Python Package Index, or anything you get from a vendor, you'll get a binary or a tarball, but clear information about where the sources for it came from is not part of the APIs and not part of the artifact. That's becoming more common now, but it was definitely less so in the past. Take Maven jars: the jars on Maven Central are compiled outputs, built on somebody's laptop, or maybe in a build system somewhere if we're lucky. Being sure you could actually recompile that same jar from its source is an act of faith for most people consuming jars, and nobody consumes the sources anyway; everybody consumes the pre-built jars that are distributed. Same with Python wheels: they're built from some source, and there's often an sdist, a source distribution, published alongside, but to know that the two actually correspond you'd have to open them up and inspect them, and nobody has time for that. Knowing how the artifact got there, whether it came from a laptop or from some build process, knowing whether any checks were run against it: you basically have to go look at the upstream project, analyze their processes, and read their docs to figure out what they do. We don't have a standard for transmitting that information.
Jumping ahead to my next slide: why does this matter? Supply chain threats are real, and supply chain attacks are increasing over time. Notable ones include SolarWinds, where the build system itself was compromised. The PHP community had an incident where they were hosting their own Git servers away from GitHub, the Git forge they operated got compromised, and malware ended up in PHP; they've since moved to GitHub. These are different kinds of supply chain attacks that we have no visibility into, short of communicating about them informally and dealing with them as they come. This is a diagram from the SLSA documentation. SLSA, Supply-chain Levels for Software Artifacts, is an attempt at developing a standard for making statements about software provenance: given an artifact, a way to communicate all the things that happened to it along the way. In their docs they have this diagram depicting the threat vectors: all the different places where an attacker might find a way to get something into a software artifact you're consuming downstream. We typically only think about
security here at point H: the signature. If one of your peers is downloading some software and being, let's say, careless about it, not checking the digest of the binary against a signed checksum file, people will say: you're being careless, you should always check your signature. But that's really the only attack you're mitigating with that practice. And to that point, what does traditional software signing actually tell you? If you have a signature on an artifact, or on a checksum file next to it, you can deduce from the signature who signed it, since you have their public key information. But you don't know what they were trying to tell you by signing it. We typically just interpret it at a high level: the artifact is good, they meant to get this to me, because it's signed. There's no way for them to express anything beyond that general notion; the only fact transmitted is that they signed it. And that's where attestations come in. In the attestation ecosystem there's a building up of layers here
where at the base is the in-toto attestation framework, a way of making statements in general about software artifacts. It's a JSON schema for describing claims, how they should be signed, and how they relate to software artifacts. It's pretty abstract and maybe difficult to use on its own, but think of it as a way to make statements in JSON that others can decode and interpret in a standard way. As it says on the slide, there's the identity of the party making a claim; there's the content of the claim itself, called the predicate; and there's the subject, the thing the claim is about. And all of this is cryptographically signed, just like a signature you'd expect to find on a normal artifact, except what's signed is the attestation itself. It looks something like this: you generally see everything base64-encoded, with a payload type letting you know it's an attestation. Expand the payload and you see the subject and the predicate; I'll show more specific examples later. SLSA provenance is one kind of statement, a description of how the thing was built. But there are other kinds of attestations you can
express in in-toto that are not provenance attestations. SBOMs are one of them: the ingredients label. There are vulnerability scan attestations, for a scanner to say: I scanned this at this time and found these things, published in association with the artifact. There are test-result attestations: if you have functional test results, JUnit results say, there's a way to encode those as an in-toto attestation and publish them with the artifact. Now, I've been framing this the whole time around consuming artifacts from the open source community, and there this seems less relevant; why would I download the JUnit results of some library when I'm pulling it in? But think in the context of a large organization, an enterprise with multiple departments and maybe multiple security teams, transmitting information downstream so that downstream consumers can reason about an artifact: does this meet the expectations I want to see as it enters my quality gate? Having those things expressed in a standard, verifiable way associated with the artifact is a really nice property, because even once it makes its way to production you can still inspect it after the fact: is this still the thing we want to be running, did it pass this test, and at what time did
it pass? Bespoke APIs in your organization solve that problem today; everybody has the Jenkins instance bookmarked to go debug this stuff. But having it machine-readable opens new opportunities. So: provenance predicates in practice; I should be okay on time. Like I said, these are commands I'm going to run live, trust me. You can put anything in a predicate: this hello-world example doesn't look like what you'd expect to see in the real world, but you can put it in predicate.json. Let's say I built a container image and I want to attest something about it. I can use the cosign tool, which comes from the Sigstore project. Running it live is more fun, so let's do that. I have a terminal open here with predicate.json already prepared. I'll go back and copy the command; do I have all my backslashes? In the mode I'm about to run it in (that's really small, I'll make it bigger), cosign uses what the Sigstore community calls keyless signing, which is a little bit of a marketing
term; there's really a key involved at the end of the day. What it's going to do is notice that I haven't passed an option saying I want to use a long-lived key, like a GPG key, to sign. So it defaults to its fun behavior: it prompts me through an OAuth flow, proving I am who I say I am with respect to some identity provider, an OpenID provider. In doing so, it creates a short-lived key and uses that to sign my attestation. Later, when we go to verify, that key will have expired, but we're not trying to verify in terms of the key's lifetime; we're trying to verify that there was a signature proving, provably, that I had access to that OAuth identity at the time I made it. So I run it. Oh my gosh, it's too big. It jumps me over to my browser, where I prove I can log in with GitHub, and in so doing the payload, predicate.json, gets signed and uploaded. There are some other architectural pieces involved here that I unfortunately can't get into. There's a transparency log, which you can
use to detect attackers who have compromised access to the OAuth servers, by seeing evidence of them signing, as you would if you lost access to your key. The transparency log makes it so attackers can't hide; they have to expose themselves even if they have compromised your trust root. So that's that command. I'll close this and get back to my slides. The second part is verifying the attestation, and all we're going to do is get back the thing we just saw. I'll paste it, pause for a minute, and then talk about it. To verify the attestation, we pass a certificate identity, which is me, and a certificate OIDC issuer, which is the github.com login OAuth endpoint. If, for this image, we can find a signed attestation matching these properties, the command passes and exits with code zero; otherwise it exits with code one, saying it failed to verify the attestation. Furthermore, it prints the attestation to standard out, and we pipe it through jq to unpack that base64-encoded blob and just look at it. In doing so (sorry, I have to make it smaller, but it's not that big), what we see is an in-toto statement. You can tell by the type at the top, and the predicate
type. The subject is the image digest I attested to, and finally the predicate, the hello world we generated, which is not all that exciting. But this is the kind of tooling at the bottom of the attestation stack: in-toto, which I was talking about before, is the abstract framework for making attestations, and the cosign tool gives you a way to work with it. Next we'll move up to SLSA attestations, which are statements meant to describe supply chain processes. That's the more interesting part. So, next slide: I want to talk about three different ways they can be generated. There are more, but first, anywhere you can use cosign, you can generate these things; you just have to construct the SLSA provenance statement by hand, which is cumbersome. Second, GitHub Actions: GitHub has put out an action, attest-build-provenance, which uses cosign under the hood to do exactly this. When you build something in a GitHub workflow and invoke this action, it uses cosign to generate one of those attestations, with a structure describing the GitHub workflow that's doing the building. So at the end of the day, some tarball pops out
and you've got a statement, signed cryptographically using that OIDC flow from GitHub, that says this GitHub workflow definitely built this artifact, and it came from this repo. Notice you get information about the commit SHA, the repo URL, the workflow file, and the inputs that were provided to it. But a problem you'll find with the GitHub attestation metadata is that it doesn't tell you anything in detail about what happened inside that build. Meaning: if the workflow referenced actions/checkout, that's just absent from the attestation; it's absent entirely, you don't see it there at all. But what I want to see is actions/checkout and the particular Git ref of that action at that time. Then, when you're looking at an artifact, if we learn later that actions/checkout was compromised between this date and that date, you could extract that from the provenance. With GitHub's provenance you don't have that information, and we've tried hacking around GitHub's API to see if we could pull it out, and we haven't found a way so far. So we need GitHub to do something, I think, to make that information available. You want to see one? Let's do it. Let me check my time.
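As a sketch of why that matters: if provenance did record the resolved ref and a resolution time for each action, a downstream check against a known-bad window could look like this. The data, the compromise window, and the field names below are all hypothetical; GitHub's attestation does not currently expose this.

```python
from datetime import datetime, timezone

# Hypothetical: a known compromise window for a build dependency.
COMPROMISED = {
    "actions/checkout": (
        datetime(2024, 3, 1, tzinfo=timezone.utc),
        datetime(2024, 3, 5, tzinfo=timezone.utc),
    ),
}

def used_compromised_action(materials):
    """materials: list of {"name": ..., "resolved_at": datetime} entries
    pulled from a (hypothetical) provenance record."""
    hits = []
    for m in materials:
        window = COMPROMISED.get(m["name"])
        if window and window[0] <= m["resolved_at"] <= window[1]:
            hits.append(m["name"])
    return hits

# A build that resolved actions/checkout inside the bad window gets flagged:
materials = [{"name": "actions/checkout",
              "resolved_at": datetime(2024, 3, 2, tzinfo=timezone.utc)}]
print(used_compromised_action(materials))  # ['actions/checkout']
```

The point is that this check is only possible when the provenance records per-dependency resolution detail, which is exactly what's missing here.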
I only have six minutes left, man. Here, this is my buddy Louis's image, built in GitHub. I'm going to run this alternative command to grab the blob out of the registry that corresponds to the attestation and just explode it here. In full (it's kind of small) this is the GitHub attestation. It has what I said: we know the artifact we're talking about, that's the subject. We have this predicate that goes from here to here. We have some information on the parameters that went into it, and we know which repo and which commit it came from. But nothing more: nothing about that actions/checkout action that was used.
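The unpacking step from the demo, piping through jq to decode the base64 payload, can be sketched in Python. The envelope below is a fabricated stand-in with the same general shape as a DSSE envelope; it is not the real attestation from the demo.

```python
import base64
import json

# A fabricated in-toto-style statement (subject + predicate).
statement = {
    "_type": "https://in-toto.io/Statement/v0.1",
    "subject": [{"name": "example-image",
                 "digest": {"sha256": "deadbeef"}}],
    "predicateType": "https://slsa.dev/provenance/v0.2",
    "predicate": {"builder": {"id": "https://example.com/builder"}},
}

# The envelope carries a payloadType, the base64-encoded statement,
# and signatures (elided here).
envelope = {
    "payloadType": "application/vnd.in-toto+json",
    "payload": base64.b64encode(json.dumps(statement).encode()).decode(),
    "signatures": [],
}

# The jq step in the talk is roughly: decode the payload, then read it.
decoded = json.loads(base64.b64decode(envelope["payload"]))
print(decoded["subject"][0]["digest"]["sha256"])  # deadbeef
print(decoded["predicateType"])                   # https://slsa.dev/provenance/v0.2
```

Once decoded, the subject tells you which artifact the claim is about, and the predicate type tells you how to interpret the claim.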
A second thing about the GitHub method that's weird is that the key used to sign the attestation is essentially available as signing material inside the build. It's not quite a key: the ability to perform that OAuth flow, to prove you are the GitHub workflow that's entitled to sign, is available from within the GitHub Action. So the build can lie about itself. It can not build a thing and still sign statements about it; it could sign statements about Alpine or something. It's not that it necessarily has rights to push that attestation to sit next to the proper Alpine image everybody uses, but it still feels weird, right? There's this vector where, from within the build system, you can use the signing key, depending on what you've committed, pushed, and asked to be built. In the system my team and I work on, we use Tekton, the CI/CD platform for Kubernetes, and this is how you describe a PipelineRun in Tekton: asking for a git clone and some additional build tasks that come later. The cool thing about this system is that in Tekton there's a controller running next to the main part of Tekton, watching the Kubernetes API for the evolution of these PipelineRuns. Just like you'd see
a GitHub Actions run evolve in the GitHub UI, in the Kubernetes cluster you see these Tekton PipelineRuns evolve, and this controller is called Chains. Chains' job is, when it sees a PipelineRun finish, to capture the state of the run, generate the SLSA provenance predicate for it, and push that into the registry along with any artifact it sees emerge from that same pipeline. So you don't get the problem GitHub has, where the key is available to the build workload; you can't force this system to lie about what it's seeing. It's a neutral observer, sitting outside of the build. Also, it just so happens that the materials it records are very detailed. If we look at the next slide, you can see an example. This is the
full provenance, and it's long. If we pipe it through wc -l, it's many, many lines of JSON, more JSON than we want to read. I don't know exactly what to point at in it, but you see all sorts of detail about every task in the process. It wasn't just one git-clone task that ran; there were really something like 18 tasks in this pipeline doing various things. The point is that for each of them, you have resolution all the way down to the digest of every image that went into it, and of every task that pulled those images in. It's very detailed. I'll jump off of that and back. There's one last tool that's cool, which I don't have much experience using, called Witness. I see it's popular in GitLab CI/CD. Like the GitHub one, it has the property that it runs inside the pipeline, so the signing key material is available to whatever your developers pass in; assuming you can't trust your developers, or that the repos might get compromised, they can sign things any way they like. But it has a really cool set of plugins for figuring out what's happening in the build environment, right around the build: capturing all the environment variables,
the states of files, the data, everything, in a way that Tekton can't. Tekton is just looking from outside the process: here are all the things I saw run, not here are the nitty-gritty details of the file system. I have run myself out of time; I had a whole section where I wanted to talk about my project, but I'll have to blow through it. The point I want you to take away is this: with these attestation things, you might feel, as with SBOMs, oh man, I have to start producing SBOMs, I have to start producing provenance stuff in my pipeline, because somebody told me I have to for compliance reasons; it's just a thing we've got to do. But if you have those really detailed attestations, and you can trust them, you can do cool stuff you couldn't do before. We use them for policy at release time: any time an artifact is coming through, we look at that super detailed attestation and ask, first of all, did this come from one of our trusted repos? Second, the tasks used to clone, prefetch dependencies, build, and scan: did they happen in an order in which no other tasks were running
in parallel with access to modify things or interact? We can describe all of that as regular policy that we use for analysis, and we build up this body of policy over time. It gives us, on this next slide, crazy flexibility. In our older build systems, we always made it the case that doing an insecure thing was impossible, and that created an incentive for teams who wanted to innovate and try new things to work around the system; we got shadow IT out of it. In this system, we can permit an insecure thing to happen in the build system but still catch it at release time, which is then a business trade-off: do we want to allow this thing to ship even though it can't do a hermetic build, if that's a requirement for us? Maybe it's a new language package manager we can't support yet, and we can say yes to it without it having to create its own shadow pipeline. I really like these layers of flexibility. There's also a scheduling element: in the policies, we can put dates on rules. We can put a rule in place and say this goes into effect in 60 days, and all of our developers receive feedback in their CI: okay, this is valid, you can ship it today, but you won't be able to ship it in 59 days, 47 days, and so on, counting down. We used to do flag days: big emails saying, everybody get ready, have your builds secure by June 1st, and invariably people would break. Those are the things I like about this. And that's it. Thank you for coming to my talk at the end of the day. If you have questions, I'd love to talk about what I know. [Applause] One question: so, is it always tied to a build? Is that where the
attestation and the signing happen? Is it always build versus commit? I know you said they're tied together, but is it really focused on the build process? That's what it's tied to? Yeah, I think for SLSA in particular it's been focused on the build, but they have these different tracks of the specification they're building out, and there's one for source attestations that's in draft right now, not yet published. The tooling, cosign, definitely supports that kind of behavior, like a developer using cosign to sign their commit and say whatever. But incorporating that into the broader SLSA chain is still coming. Gotcha. Yeah. And the idea that you can sign something well after the build, I think that's totally in play too. You could build and ship, and then later publish an attestation of this summary kind, called a VSA, a verification summary attestation, that says: now, on July 1st, even though this thing was built a long time ago, it still meets our requirements today. You can put that in the registry and kind of stack up these VSAs. Or: this ran in staging for 3 days, meeting our soak-time requirement; add another attestation saying it passed that criterion. So no, it's not just the build.
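That "sign well after the build" idea can be sketched as data: a later, summary-style statement stacked next to an old artifact. The field names below are simplified stand-ins in the spirit of SLSA's Verification Summary Attestation, not the actual VSA schema, and the verifier name is invented.

```python
# The artifact was built and shipped long ago; its original provenance
# (simplified stand-in, not the real SLSA schema) looks like this:
provenance = {"subject": "myapp@sha256:abc123", "built_on": "2023-01-15"}

# Much later, a verifier publishes a VSA-style statement saying the
# old artifact still meets today's policy:
vsa = {
    "subject": provenance["subject"],
    "verifier": "release-policy-bot",   # invented name
    "verified_on": "2024-07-01",
    "policy_result": "PASSED",
}

# And another claim can stack on top: a soak-time result from staging.
soak = {"subject": provenance["subject"],
        "claim": "ran-in-staging-days", "value": 3}

# All refer to the same subject, so a consumer can correlate them.
later_claims = [a for a in (vsa, soak) if a["subject"] == provenance["subject"]]
print(len(later_claims))  # 2
```

The shared subject digest is what ties the growing stack of statements back to one artifact.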
So, could you have in that pipeline a release for, like, developers, and then a release that meets all the security requirements, that you might ship to, say, the DoD? Yes, let's say developers can hack up, in their own pipelines, ways of building that don't meet requirements. The only time that matters is if they try to release it to our customers, or to our managed SaaS environments, at which point, okay, we have slightly different requirements and policies for those two destinations.
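That split can be sketched as policy data: different destinations require different properties from the artifact's attestations, and the check only bites at release time. The destination names and required properties here are invented for illustration.

```python
# Invented requirement sets per release destination.
REQUIREMENTS = {
    "dev": set(),                                  # anything goes internally
    "customer": {"trusted-repo", "scanned"},
    "managed-saas": {"trusted-repo", "scanned", "hermetic-build"},
}

def may_release(artifact_properties, destination):
    """artifact_properties: set of claims extracted from the artifact's
    attestations; destination: where we want to ship it."""
    missing = REQUIREMENTS[destination] - artifact_properties
    return (not missing, missing)

props = {"trusted-repo", "scanned"}        # what this build attested to
print(may_release(props, "dev"))           # (True, set())
print(may_release(props, "customer"))      # (True, set())
print(may_release(props, "managed-saas"))  # (False, {'hermetic-build'})
```

The same artifact passes for one destination and fails for another, without anyone having to forbid the build itself.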
Cool. All right. Thanks, everybody. Yeah.