
Life of a System Call — Why Containers Are So Hard To Secure

BSides Buffalo · 49:03 · Published 2025-06
Category: Technical
Style: Talk
About this talk
This talk follows a system call from within a container to the host kernel, uncovering how the container runtime and Linux security features handle it. It explores techniques for real-time syscall interception and techniques attackers have used to break out of containers. We will also talk about the defensive side, showing what it takes to be a "secure" container and some newer techniques and features coming out of the Linux kernel. Attendees will gain an entertaining understanding of low-level aspects of the Linux kernel and understand when a container is/isn't hardened.

About the speaker: Mark Manning. Break in. Break out. Whatever, just expect it to be broken. Mark Manning (@antitree) is currently on the security team at Chainguard, a secure software supply chain company, working on secure containers and sandboxing. He enjoys beating up on containers and Kubernetes. He has previously presented at DEF CON, ShmooCon, and various B-Sides conferences. He is the founder of Security BSidesROC, Rochester 2600, and other local hacker gatherings.
Transcript [en]

All right, I'm gonna get started. Everybody, can you hear me over there? Mile away. Cool. I've never worked a room like this horizontally before, so bear with me. This is the Life of a System Call: Why Containers Are So Hard to Secure. I'm just going to give you some of my takeaways up front, or the stuff we're going to talk about, because I don't want to trap you in this room. Containers are not a security boundary, but can they be? That's the question that I'm kind of trying to answer. I've worked for a bunch of years now trying to secure containers, and then customers will often ask me, hey, can

you turn it into a sandbox? And we'll talk about what actually has to go into that, and why are all these customers asking me to build remote-code-execution-as-a-service services, and why are we using Kubernetes to do that stuff? We'll talk about system call filtering with seccomp, and hypervisor stuff, and how to do it correctly, and how to turn a container into a sandbox for some deck machine. So, for security architects, if you're building stuff, we'll talk about here's how you can do this stuff. For pentesters, we can show you here's how you can do it wrong so you can exploit it. And if you're

just trying to run a Kubernetes cluster and keep the lights on, like, I'm with you. That's kind of what most of us are doing. So, how many people have heard this in some way, shape, or form: containers are not a security boundary? We've kind of recited it, right? We just go, containers are not a security boundary, containers, we can't trust them. And it's like this, you know, pool lane that you have to swim in. Nobody's stopping you from breaking out of it. Whoops. To answer the question, I'm going to set this up with these little anthropomorphic metaphors throughout this presentation to make some of the

esoteric topics slightly more interesting. At least that's the hope. If you think of an application, it's kind of like a factory that has access to memory and storage, and inside the factory it's doing a bunch of stuff, but the things that are interacting outside of the factory, with the rest of the operating system, are the system calls, the open and the write stuff. So you think of an application, it's just a process that's running. If you have a hello world application, you statically compile this. The factory is going to have some system calls that go back and forth. So in reality it's like, here's a Linux kernel, here's your

process. It's constantly interacting with the Linux kernel to get your process done, whatever it needs to do. So these processes are made up of, again, this is just normal process stuff, executable code, memory mappings, file descriptors, and these system calls that interact with the Linux kernel. Because when you think about it, at the processor level, the CPU doesn't understand what a file system is, or what a file is; these are all metaphors that are abstracted by the kernel. So if we want to use any of those things, like opening a file, we have to use system calls to talk to the kernel and say, please go figure out what

a file is and open it. So then what's a container? A container is like this factory that's, I don't know, it's got, you know, garbage around it, but we still have the same idea that the system calls need to come in and out of the container. So we can't say it's completely isolated. It's not restricted. We still have stuff that comes out of the container and interacts with the rest of the kernel, because at the end of the day, containers are just these filtered views, right? Like each of these two processes. I've got two instances of nginx up there. These are just things that the Linux kernel looks at as two separate processes.
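The point that system calls are just the kernel's API can be poked at directly. Here's a minimal sketch, assuming a Linux-ish system where `ctypes.CDLL(None)` hands back the C library the process is already linked against:

```python
import ctypes
import os

# libc functions are mostly thin wrappers around system calls: calling
# getpid() through the C library and through os.getpid() lands on the
# same kernel getpid syscall, so the answers agree.
libc = ctypes.CDLL(None)
assert libc.getpid() == os.getpid()

# os.write on file descriptor 1 is the write syscall on stdout. The
# process doesn't know what "stdout" really is; the kernel resolves it.
n = os.write(1, b"hello from the write syscall\n")
assert n > 0
```

Running the same script under strace shows the getpid and write calls crossing into the kernel, which is exactly the traffic a container runtime has to police.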

Whether or not they're in namespaces, whether or not they're just running on the host, those are just labels for the Linux kernel. So from the kernel's perspective, a container is just a process that happens to be running on the host, and you get a couple of extra things on it, like this, where we get into the kernel namespace and the cgroup stuff and all the things you kind of bolt on. But those are mostly, if anybody's worked in SELinux, kind of like the idea of labels, right? We just attach some metadata to the process and now we call it a container. But what's a sandbox? I'm interested in other people's perspectives. Somebody

tell me. I'll take wrong answers too. What is a sandbox, or what is not a sandbox? A testing environment. What's that? A testing environment. That is a very good answer. That is one type of sandbox. Yeah. So trying to get into the isolation world. Least amount of damage. Anybody else? Let me explain what is not a sandbox. You know, I was going to say a sandbox could be like a controlled space where you're free to basically test and troubleshoot. Yep. Yeah. And in a lot of ways that's how sandboxes started. The original concept of a sandbox, like a lot of things in infosec, started in the military, where they literally would bring a box

of sand and, like, put toy, you know, tanks onto it, like, oh, where should we put the bombs today, and you kind of line them up and everything. So this was a sandbox, and this matches kind of what your definitions were talking about, where it's just a play area, messing around with stuff. Salespeople do sandboxes all the time. But the sandbox I'm talking about is more like the one you're talking about. The one I'm interested in is more like a tool that helps guarantee that something inside of it, whatever it is, can't break out and get to the underlying host. So you can imagine a malware sandbox. You've got malware, you want to download it. You're not

going to just run it on your host. You don't know how it's going to affect stuff. So, you put it in whatever a sandbox is. And that word has been used a whole bunch of different ways. There's the Chrome sandbox. There's virtual machines, which, are they sandboxes? There's gVisor. There's Kata. There's a bunch of things that could kind of be called a sandbox because it meets this definition. But also, a container kind of looks like a sandbox, right? If I just take that definition, I go, it's something that confines you and prevents you from doing anything to the rest of the host, and a container is attempting to do that, right? But we don't really trust it. So what would it

take to turn a container into a sandbox? And that's kind of the base of what we're going to be talking about. eBPF, seccomp-BPF, emulation, virtualization, these are all kind of some options. So, backing up. My name is Mark Manning. I go by antitree on the interwebs and stuff. I used to work for NCC Group, where I was running their container practice, which just means anywhere you crammed a container, we could do a review of it, and some people in the crowd know how this works better than others. There would be weird places where containers existed, in firewalls and in cars and whatever, whatever it was, we would kind of do an assessment of it. I

worked at Snowflake, where I was trying to build their Java UDF thing, and the Python UDF thing was isolating with gVisor. You don't need to know about any of these words, but just, I was attempting to build RCE as a service. And that's kind of what I've been doing the last few years of my career. I spun off into a consulting company where I'm focusing on doing a bunch of gVisor and Kata and building these RCE-as-a-service things, because there are more and more customers that are demanding it. Even though I tell them, hey, don't do this, it's probably not the best idea.

I can still help you figure out the best way to do it. Today, I work at Chainguard, and we secure software supply chain stuff through secure containers, and I'm working on some of the secure build stuff. When we think about secure build, it's kind of RCE as a service, right? It's somebody else's code running in your environment. So I spent most of my career as a breaker, and the last four years more trying to be a builder, or an aspiring builder. I've worked in, I tried to do the math, more than 100 different environments doing penetration tests on quote-unquote container environments in

some way, shape, or form. I've worked with six teams building these RCE-as-a-service things. I accidentally got a couple of patents. You know, if anybody works for some of these companies, you know how patents just fall in your lap sometimes. I started BSides Rochester. I run Rochester 2600. Anybody from Rochester in here? Okay. Nice, represent. And over 9,000 is the amount of times that people have asked me, why do I keep working on Kubernetes? So, we're going to talk about system calls, talk about hypervisors, we'll talk about seccomp, and then I'm going to show you some new tools that I think are interesting. So, how do containers actually work? Um,

we probably go, hey, containers, a bunch of system calls, it just executes some stuff, and I run that docker command and it just works, right? When I say system calls, I'm making these little anthropomorphic system calls, because every single system call has a personality, and everything has a different purpose, and they have these different parameters. But the simple thing to think about: these are just the API mechanism for talking to the Linux kernel. We don't need to overcomplicate it. I made them anthropomorphic and friendly just because they feel like more fun to talk about. Nobody really wants to talk about clone and execve.

But each one of these is important, and each one of these has a different impact. Each one of these is not necessarily nefarious, and each one of them has a different purpose. But we can use different system calls for different purposes, and attackers can use them too. So, like, the ptrace system call isn't inherently evil. It's just that if you allow ptrace in production, it's probably going to damage you more. It's going to cause some problems for you, because you can hook other processes, you can hook other system calls and tell them to do different stuff. So we don't want to allow ptrace in production. And similarly, like the write command, you go

write's not ptrace, there's nothing really wrong with it, I need to write to displays, I need to write to different outputs. But if you're writing something malicious, an attacker can use write in the same way that a good person can, right? So there are different kinds of perspectives. And then there are system calls that operate in different modes. eBPF, anybody heard of eBPF? I'm just seeing if you're awake. Everybody must have heard of eBPF. Okay, imagine you have not. Okay. eBPF is an underlying technology that's designed to work in a couple different ways, to either do networking controls or do low-level kernel system

monitoring. So, I like to think about it in these two modes. One's just watching stuff in the kernel as it runs. The other one's more like policing, where you go, I'm going to block a certain system call, I don't like it. I'm going to watch what's going on in the kernel and I'm going to restrict things. So, like I said, this is how a container is created, right? You start Docker, run the image, some magic happens, and poof, container. And if you get this joke on the left, then I think we can be friends. If not, that's okay too, because it's pretty hyper-specific. So, let's

supposed to be clones, but they're getting older. Well, they're getting older. It's the parent-child relationship of the processes. That's what I was trying to figure out. Oh. Oh, I like that too. Oh, I didn't realize this was so deep. I'm a pretty deep guy, I don't know. You're getting deeper. So, let's look at actually how a container is created. I spent a bunch of time actually doing the analysis, at the system call level, of what a container looks like. And you don't need to memorize this, but it is a great interview question that I've actually had a couple of times: how does a container work? How do you start a container? So if you imagine I'm

starting Docker, Podman, or even the kubelet in Kubernetes, it interfaces with containerd through a gRPC call, and it just says, hey, start container, run container. It makes a fork and exec, which is a common thing that happens in processes, let's just make a separate process underneath my main one, I'm going to run this containerd-shim process. And this is a completely different binary. This containerd-shim process's main purpose is to manage all the containers that are running and all the container processes that are running. So when I run containerd, I'm going to do a fork-exec, and when I'm running the shim, I'm going to do another fork-exec to run the runc process. And if this

feels complicated, it's because it actually is. runc is the part that does all the security. So I want to be clear: the kubelet hasn't done anything, containerd hasn't done anything, and the shim hasn't done anything. This is the part that does all of the containerization stuff, where you do the clone operation, which takes a clone of the original process and all the memory mappings, and it goes into, I'm making a new network namespace, a new process namespace. So all the stuff that we think about with containers and namespacing, it's done by this runc init process. Now we haven't even talked about the thing that we want

to be running, right? At some point we want to run our own process in this container. So this is just the setup. We now have a kind of hollowed-out namespace to run our own processes in. runc has created that whole thing. It cloned itself, and then it gets into whatever binary you want to run. So there's an exec in here that says, I'm going to do some hello world command or process. What I find interesting is, okay, you go, okay, this is complicated, but who cares? This is the important part for security in a lot of ways: the runc init process kills itself after it

creates the shell for the container. It doesn't keep going. It doesn't maintain itself in the background or anything like that. It kills itself, and now the hello world process, it actually abandons the child and redirects it back to the grandfather process, which is the containerd shim. So, one of the security benefits of doing this: there are a bunch of attack vectors of being able to go back up through, when you think about what is PID 1 and some of the UID management stuff that they do. This blocks out a whole bunch of attack vectors. So, when we talk about namespacing, that's all cool, but this is really one of the

bigger security controls for containers. Okay, if you didn't catch that, it's okay. Understand it's complicated and there are system calls involved. We'll come back to that later. So still we will say containers are not a security boundary, right? But we're not really doing that classic thing of security people, right, where we just go, it must be the most secure possible thing in the whole world. In reality, containers are not always designed to be secure. When Docker was created, it was a development tool, and I think of Docker as having a house, and we've invited our containers over to our house, and they're our friends, and we've vetted our friends, and they're hanging out,

and, okay, maybe they could potentially be malicious, but if they were malicious, it'd be kind of more like a dick move as opposed to a CVE, right? If someone's dropping a 0-day on your container that's running at your house, you know, is that really a security flaw? Is that really our threat model? Like, no. It's really not its main use case. And then you have Kubernetes, which is more interesting, because we run Kubernetes in production. We have customer workloads on this thing. Kubernetes is kind of like an apartment complex that you own, where you've vetted the things that are running there. You go, yeah, I'm running Alpine, I'm running nginx, cool. But if anything goes rogue in

the middle of the process, there's nothing stopping that. If you've got a bunch of neighbors, and you did a background check, like, cool, this family's in there, but then one neighbor's fighting another neighbor, what are you supposed to do, right? You just kind of go kick them out after they've done something bad. And the real risk here is less about breaking out of containers, like someone knocking down the wall. It's more, you know, unless you live next to, like, Ted Bundy boiling heads in the next apartment, right? If you don't know that reference, there's a documentary. The real risk is the noisy

neighbor stuff, where you go, and this metaphor works too: in our apartment complex, somebody's playing music too loud. In the real world, it's like, I'm consuming too many resources or, you know, filling up logs. But this is what I really want to talk about: the RCE-as-a-service scenario of Kubernetes, where you go, this is more like a prison. I know they're going to be nefarious, or at least I have no control over who's going to show up. So, I need to lock them down, confine them, restrict the resources that they have access to. They don't do mounts, they don't do everything. And we're going to, you know, hire prison guards to make

sure everything is cool. And in the same way, if anybody remembers The Shawshank Redemption, if you have a spoon, you want to dig through the wall. We can imagine that with enough effort and enough time, somebody like an APT, a nation-state, can probably dig their way through one of these containers and break out. But we're hoping that after they get out of the sewer, we have other controls in place, like BPF, to kind of monitor some of the bad actors. So fine, how do we actually secure all these dangerous containers? Which is what I mostly want to talk about. The first one is

hypervisors, and the premise here is, you know, I used to do talks that would start off with, containers are not virtual machines, because it seemed like they kind of were, and it didn't make sense at the time, and I'd be like, no, no, no, containers are not virtual machines. But now we're kind of back saying, well, some of them are now, we now support different types of things like Kata. So if you have bare metal, you imagine you've got hardware, you've got a kernel, you've got binaries and libraries, and you've got processes running. This is just how computers worked in, like, what, the '90s. And it's

ironically, we're coming back to this. You know, we can now buy bare metal instances on AWS. And then if we want to do a virtual machine, it's the same idea, but somebody was like, hey, can I stack more software and more fake operating systems onto my machines to optimize some of the hardware consumption? And they said, yeah, sure. So you have a base hardware, you have a hypervisor, something like ESX, then you have kernels that go into each of the virtual machines. And this is the important piece: there's a separate kernel for each of the VMs. So when we talk about breakouts and exploits, they have to exploit that kernel for that instance of that VM,

which is what's different from a container, which is more like, okay, we've got hardware, we've got a kernel, we have this really light supervisor thing, and that's the container process, just watching processes, watching the containers, are they still running or not, should they be doing something else? And one of the other nice things about containers is when you have two instances of nginx running, right? You would go, I don't need to have two copies of nginx, I can just maintain the same base layer so that my two containers can both use the same instance of the nginx binary to load up into memory. So between these two things, one of them

is a hardware-backed isolation model. If you look at Intel VT-x and all the stuff that goes with virtualization, that is a stronger boundary that's enforced by the processor to do memory separation and process separation. Hypervisor stuff and virtual machines are a strong boundary, let me be completely clear. This is the best option for doing any kind of sandboxing stuff. But we also have gVisor. Has anybody used gVisor, just testing it out or anything? Okay, cool. So, gVisor is, if virtualization is actually some hardware-backed thing, gVisor is running a kernel inside of userland, in a normal program, and

faking all the system calls. So, what if you could fake the kernel, so that whenever I ran everything, it would go to my fake kernel? So if you exploited anything, it just exploits my fake kernel. This came out of Google, and it's what Google uses for what they call Borg internally. And it's basically, you can run any kind of workload that you want on any kind of environment. It doesn't need virtualization. It has a bunch of different modes that you can run it in. You can run it in virtualization mode, but it has this systrap mode. So, if you've ever heard of gVisor, usually your reaction is, oh, this thing is

too slow, I'm not going to be able to use it. But there's been an update in the last couple of years that makes it faster. Yeah. Well, you just said that you're emulating it. So, I mean, the slowing comes down, cuz you're literally streaming something off the cloud, or you're not running the actual machine. Like sometimes VMs are slow because you're on your home laptop or your home computer, right, running a machine that could be in, like, California or something. Yeah. So, the main reason that this is slow is IO, because they have to fake some of the IO system calls for files. So when

you do a lot of file-based stuff, it's like three more hops to get to the actual IO operation. But the takeaway that I want you to have is that gVisor is good. Nobody configures gVisor. They click the button, turn on gVisor, and they go, it either works or it doesn't. But there are 132 config options that have major differences in the performance and the security of gVisor as it's running. My whole point is, if you take nothing else away: gVisor is a viable option for people that are in the cloud or on bare metal or wherever you want. You can run gVisor and you can tweak it

and there's a bunch of cool things that you can do. I know this because at Snowflake this was a thing, we were in a highly performance-sensitive environment, and we still were able to make it work. So which ones do you actually choose? So we've got gVisor, we've got virtual machines. Kata Containers is an example of a container runtime that's like a micro virtual machine, and that is a really good security-performance benefit. This is like the best one of the ones I'm going to talk about today. It works great on bare metal, but you run into problems when you have to do nested virtualization. If you're using EC2, you

need to buy the nested virtualization extra option on EC2, or you can do it on bare metal, that's also possible. If you have that environment, this is point and click. I'm just going to do this, and poof, everything's solved. gVisor is another great option. You can run it anywhere. So if you don't want to do the bare metal, you can do it on GKE, Google's managed Kubernetes environment. They call this GKE Sandbox. You don't have to do any configuration at all. You just say, do you want sandboxed nodes? You click this little radio button and you go poof. You now have a much

stronger isolation boundary. And a lot of the other tools that Google provides are using gVisor under the hood. The only warning is heavy IO stuff. If you're doing tons of file operations, it's going to be slower. And whether that matters to you, it all depends on the scale. If somebody was expecting a 100-millisecond response time and now it's at 200 milliseconds, if that matters to you, then maybe gVisor isn't going to be the best option. And the thing I'm going to come back to more is seccomp. Seccomp-BPF is kind of like the de facto: if you want to secure your

Kubernetes cluster, this is what you're supposed to do, and I'm going to tell you about some of the sharp edges here. The amount of effort to get this done is high, in my opinion. The performance, though, is the best. So if you're really in a performance-sensitive environment, or if you don't want to deal with all this other stuff, seccomp-BPF gives you the best bang for the buck, if you can invest the time to do it. It's just highly, highly prone to misconfiguration. One more thing to circle back on with gVisor. gVisor has these different modes, and when we say gVisor, we're actually thinking about different modes of gVisor. Right

now, like I said, people just install gVisor, they go, gVisor, it's a binary question, yes or no. In reality, you can run it in paravirtual mode. You can run it as a virtual machine. You can run it with its own network sandbox, which you've probably never heard of before. Why would you need that? Google found out that in the Linux kernel, in the past like 10 years, there have been two of these, where the Linux kernel had a vulnerability in the way that it was parsing packets, and they said, well, we can never have this vulnerability ever again. So they built an entire dedicated network stack just

to sandbox it; anything from the TCP stack will just run in this thing. Very cool property. You can also run it in non-root mode, so the entire thing can be run as a non-root user. There is a feature there, if anybody's interested, and you can talk to me after about some of the details. So let's talk about seccomp-BPF, because this is something that I've had tons of customers say, well, this is the solution for Kubernetes. I think if you take one of the Kubernetes exams, they tell you this is how you secure Kubernetes. So let's go into some of the detail, the stuff they're not really

telling you. Because this is the idea of, hey, I want you to protect the kernel integrity at all costs, but, like, not that much cost, right? You know, it's like, I don't want you to actually have to install anything or buy anything, I just want you to tweak what you have. So back to the kernel. The Linux namespacing problem with containers is just this: in virtual machines, I was telling you, there's a separate kernel for every single virtual machine. For containers, it's one kernel. It's shared across everybody, and the shared kernel problem is kind of the main premise of everybody saying containers can never be

secure. Containers are not a security boundary. So let's lean into it a little bit more. How could we secure it? Because we know that every container, you know, that factory metaphor I was trying to use, every container has these system calls that are going in and out of it all the time. execve, read, accept, all these different things that would be non-nefarious, just part of normal operations. But this also implies that every attack, every breakout, uses these same system calls. So we end up coming to this conclusion of, hey, if I can block the system calls, I can block the attack, or at least it limits the attack surface of

even the potential 0-days that are coming out, if I can restrict what my container needs to do and confine it to only exactly that. Seccomp started in 2005. It has one of the most pretentious names that I've ever heard. It stands for secure computing. Like, in 2005 they're like, "Hey, we've solved it, guys. All we need is secure computing. We just can't turn this thing on." Secure computing was created in 2005 to just limit the system calls that are allowed to go in and out of your application. And then seccomp-BPF came along, and it just complicates everything, because we go, seccomp-BPF is related to eBPF, right? Well, no, not at

all. Actually, it's related in the same way that Java and JavaScript are related to each other. It is a BPF format, the Berkeley Packet Filter format, that seccomp-BPF is in. I could probably spend a half an hour describing how stupid this is, but let's just move on. To get a seccomp profile, all you have to do is, remember that hello world application, I'll show you in a second, you just modify your own code and import the libseccomp stuff and say, here are the filters of system calls that I want to run. What changed is, Docker came around and they're like, "Hey, I'm managing the processes. I don't need to care about

the programs. I don't need to change the code itself to do seccomp profiles. I can do the entire outside of it, I can do the container part as a seccomp profile. Was this a good idea?" So, here's our hello world program, and here's an example where we just make, you know, a seccomp allowlist. So, here are a couple of system calls that we want in there, and it just loads up the seccomp filter. Now whenever this process is running, it will be confined to only run those two system calls. Now, whether or not that causes your application to crash is up to you.
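The talk's example does this in C with libseccomp. As a rough stand-in for the same idea, here's a sketch that builds a raw seccomp-BPF filter from Python with ctypes. Everything here is an assumption-laden demo, not production code: it assumes Linux on x86_64 or aarch64, hard-codes the getpid syscall number, skips the architecture check a real filter needs, and installs the filter in a forked child so it doesn't confine the parent:

```python
import ctypes
import errno
import os
import platform
import struct

# Per-architecture getpid syscall numbers (assumption: x86_64/aarch64 Linux).
SYS_getpid = {"x86_64": 39, "aarch64": 172}[platform.machine()]

PR_SET_NO_NEW_PRIVS, PR_SET_SECCOMP, SECCOMP_MODE_FILTER = 38, 22, 2
RET_ALLOW, RET_ERRNO = 0x7FFF0000, 0x00050000

def insn(code, jt, jf, k):
    # One classic-BPF instruction (struct sock_filter: u16, u8, u8, u32).
    return struct.pack("HBBI", code, jt, jf, k)

prog = b"".join([
    insn(0x20, 0, 0, 0),                        # ld [0]: the syscall number
    insn(0x15, 0, 1, SYS_getpid),               # if nr == getpid: fall through
    insn(0x06, 0, 0, RET_ERRNO | errno.EPERM),  #   ...deny it with EPERM
    insn(0x06, 0, 0, RET_ALLOW),                # everything else: allow
])

class SockFprog(ctypes.Structure):
    _fields_ = [("len", ctypes.c_ushort), ("filter", ctypes.c_char_p)]

libc = ctypes.CDLL(None, use_errno=True)
r, w = os.pipe()
if os.fork() == 0:                 # confine only the child
    fprog = SockFprog(len(prog) // 8, prog)
    libc.prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)
    libc.prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ctypes.byref(fprog))
    rc = libc.syscall(SYS_getpid)  # filtered: returns -1 with errno EPERM
    ok = rc == -1 and ctypes.get_errno() == errno.EPERM
    os.write(w, b"EPERM" if ok else b"ALLOWED")
    os._exit(0)

os.close(w)
verdict = os.read(r, 16).decode()
print("getpid under the filter:", verdict)
assert verdict == "EPERM"
```

The filter's verdict for getpid is SECCOMP_RET_ERRNO, so the call fails with EPERM instead of killing the process, the gentler of the two enforcement options.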

Docker was like, you know what, I like that idea, but you know what it needs? More JSON. So they abstracted seccomp away, and they said, we're going to make a JSON format for you that does the exact same thing as what we were just doing in C. And you can kind of think of this as firewall rules, right? You can see there's a default rule that says either allow or deny, and then you put in a bunch of rules underneath it that says, okay, maybe I'll do default allow everything except for this, or default deny except for these specific ones. And you can get into very complicated conditions. You can say

I want to log the openat system call, but only when the third argument is the number four, just as an example. So you can get super granular, and it doesn't need to be this binary thing of I'm going to allow the system call or I'm not going to allow the system call. You can say I'm going to allow the system call in this case, and confine it that way. So if I said we needed all these system calls, we needed to build a profile for our seccomp thing, how do we build a list of all the system calls that come out of a container?
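What that JSON looks like, roughly (the syscall selection here is invented for illustration; the `SCMP_*` action and comparison names are the OCI runtime-spec constants, and the talk's "third argument" becomes `index: 2` because arguments are zero-indexed):

```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {
      "names": ["read", "write", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    },
    {
      "names": ["openat"],
      "action": "SCMP_ACT_LOG",
      "args": [
        { "index": 2, "value": 4, "op": "SCMP_CMP_EQ" }
      ]
    }
  ]
}
```

A file like this replaces the default profile when passed with `docker run --security-opt seccomp=profile.json`.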

Anybody have any ideas before I tell you? What would you use to find the system calls for your application? "ptrace?" ptrace, yeah, we could do that. "strace?" strace, be more specific, yeah, that's good. This is a scenario that's been around for a while. I'll get to the tools in a second, but it is going to be strace and ptrace: just dump the list of system calls that come out. So, years ago, I had a company come to me and say, "Hey, we've got this complicated thing. I've got users that want to upload an image of their own avatar. Like,

every time they upload an image, I want to resize it, or I want to tag it with, you know, a different political thing that I support at that time." And I went to the internet and found this really cool tool called ImageMagick. Sometimes when I say ImageMagick, people go "oh," because everybody kind of knows that it's impossible to secure ImageMagick, because of all the different file formats that it needs to parse. So this customer also knew that, and they were like, listen, I don't think I'll ever be able to secure ImageMagick, but I want you to build a seccomp profile that just

restricts it to only the things it needs to do. Don't worry about the parsing stuff, and prevent any kind of container breakouts. So which system calls do we get? We can write that hello world program, but which system calls does that actually translate to? How do you know you've got them all? Let's exercise the application. It's the same idea as fuzzing, if anybody does fuzzing, where we just want to navigate every single branch and then see if we can't knock out the system calls from executing. So Josh mentioned strace. That's the classic one that's been around since the '90s, right? Just strace a binary and it dumps out all the system calls that you

need. There's a tool that's now been archived, but it's still interesting how they tried to make it work, called Zaz. No one's used Zaz before, right? I just found this thing. This is a tool that will attempt to spin up your application, like your Alpine image, and do whatever stuff you need to do in it, but it also lets you feed it random things to do. So if you want to exercise the whole application, you can run the web application, visit it, make it do whatever it needs to do, and then it'll automatically generate these seccomp profiles for you. The official way, and

what you've probably seen in any of the CKA exams, is this OCI seccomp BPF hook. It just rolls off your install, and now you can annotate your Kubernetes environment or your Docker environment. It says something like, trace this workload and then output a seccomp profile that you want to store, and done. All of this is supposed to generate a seccomp JSON profile that you can use on Docker, on Kubernetes, on Podman. It's all standard. My favorite tool is Inspektor Gadget, which you can install, if you haven't ever installed kubectl's krew, by just doing krew install gadget, and it'll install this BPF-based

tool from the really smart people that used to be at Kinvolk; they're now at Microsoft. It lets you do background processing: what if you could just scan your entire Kubernetes environment and generate seccomp profiles for everything that's running? Those tools are complicated, so let me give you a really simple tactic. All you do is create a seccomp profile that says the default action is log, and then don't put any system calls in. And all it's going to do is start spitting out all the system calls that your container uses. And you can just imagine running this in production. Whatever you're going to do in production, just do it, and then maybe I'm going to

come back in a week, and it should have all the system calls from that container, and then I'm just going to make a seccomp profile for those. Right. "So it's like auto..." Yeah, no, exactly. It's the exact same idea; it just ends up in actually the same function as that stuff. So, do we feel that Kubernetes and Docker kind of go together? How many agree that these are best buds, best friends? Like, yeah, they're kind of doing stuff together. Under the hood, what Kubernetes has done is more like this. Kubernetes has now forked away from

Docker completely. So Docker is a product and a company, and they're kind of like, oh, we've got a community thing, but they have split apart. This is going to be important in a minute, but I just want you to understand that Kubernetes now runs containerd. Docker has runc under the hood, but they do dockerd, which is based on containerd, I think. So, I don't know, it's complicated for Docker. Let's focus on Kubernetes. And back to our hello world program. So, Josh said, let's do strace. Let's do strace on our hello world program. Down at the bottom you can see it says system calls, just a summary

of what it found: 17 system calls to make just a hello world that prints out, statically compiled, the smallest list I can imagine. Now let's do the same thing for a container. I work for Chainguard, so let's use a Chainguard container, because I know it's the most micro, you know, restricted container. I'm just going to copy in the hello binary, and I'm going to say run the hello binary. There's nothing complicated about this thing. And I'm going to do strace in that container. What you're going to see at the bottom, I'll zoom in for you, is instead of 17, we now have 1,811 system calls. Wow. And you go, what

are all these system calls? And do I need to add all of them to my seccomp profile? I'm doing this the wrong way on purpose to demonstrate the important piece. I'm going to admit right up front that you will get the wrong answer, but let's get some interaction to see if you're still awake. A seccomp profile is applied to a process, and here are the processes that I told you were involved in containers. Who thinks that the seccomp profile is applied at that first containerd spot? It's okay, you can say the wrong answer. We had no idea. Who thinks the seccomp

profile is applied at the containerd-shim spot? No? Yeah, a couple of arms. Thanks. You're still awake, that's good. What about the runc init process spot? That's where seccomp is? Sure, sure. More hands. What about at the hello world spot? That's where the seccomp is? How many people just have no idea? Okay, thank you. Okay, we do have hands. I appreciate that. So, here's the complex situation that we have. It started out, if you're using Docker Community Edition, it's here, which is absolutely insane. If you do it on the outside here, you have to capture all the system calls from containerd to containerd-shim to runc init to hello

world. So, what you're going to see is all the system calls that you need to start up the container, including really dangerous stuff like the clone operation, mount, unmount, all the things that are part of the container process. It goes back out here. So runc 1.2 and below, which is what Docker's using right now, does it out here. Newer runc does this, which is doing it at the containerd-shim level. But like I said before, all the main operations are happening at the runc init part. So it still doesn't make any sense to do it out here, because you're going to be getting a whole bunch of extra

unnecessary operations. As of, well, containerd and runc 1.3, which just came out in the last month, they refactored this again, and now the seccomp profile is actually in a more sane place. It's applied right after you do the clone operations I was talking about, the building of the container; then it actually applies the seccomp profile. Why am I describing that? Why does that matter? We are not comparing custom seccomp profiles with nothing. It's not like, well, it's better than nothing. We're actually saying there's a built-in runtime seccomp profile and SELinux profile and AppArmor profile that's pretty good. It was built by Jess Frazelle, and it hasn't been updated

that much, and it's got some really smart things in there. Some of the system calls are dangerous, but the most dangerous ones are confined to only do a specific thing. Versus, if you just took that seccomp profile that I gave you with the 1,811 system calls in it, you'd be adding a whole bunch of dangerous things in there, like: why is my hello world opening a socket? Why is my hello world doing unmounts, and why can it load BPF programs? This is the problem with building these seccomp profiles: it's highly prone to foot-shooting.

So here's an example. I ran all the tools that I just listed, Zaz, the BPF hook, Inspektor Gadget, and I said, generate a profile. I even did Claude. I was like, I don't know, do everything with an LLM, so why not just generate one of these things? Every single one of these is different in some way. Now, which one is more secure? Which one is actually accurate? We're going to figure that out. So, I built a new website that we're just opening up today called secmpair.com. If I can figure out how to switch over to it, I'll do a demo. And we can go to the site right now if

you want to play along. Imagine we've got two profiles. Here's the Inspektor Gadget profile versus, we'll do, the Zaz one, and I want to compare these two different things. What we're going to see is, okay, here's a substantial difference in the amount of things going on up here. Let's focus this down a little bit, on just the system calls that are the most dangerous. So one of these allows clone. One of these, we've completely lost clone. Let's swap out for the log one, and let's just compare it to the default for

containerd. You'll see that there's a whole bunch of things that are allowed. These are all the dangerous ones being allowed over here, and these are the ones that are blocked. The main problem I'm trying to draw out here is: for all this work that you put into these custom seccomp profiles, you don't know if you're actually making it less secure than the default. So this website is designed to analyze that stuff. You can upload the URLs and the images, and it'll actually even try to build them for you. So, a whole bunch of dangerous fun stuff. I've also built a tool called seccomp-diff that I'm not going to go too

much into now, but seccomp-diff is designed to actually ptrace the processes running in your Kubernetes environment to tell you what the seccomp profile is as it's being applied. So this one's public, and you can do kind of fun stuff. So, I wanted to talk about system calls today. I wanted to talk about hypervisors; we did some of that. We talked about seccomp, and I'm pointing out a couple of new tools. And we talked about how your local environment is kind of like your house, your Kubernetes environment is kind of like an apartment complex. And the RCE-as-a-service

stuff is obviously more fun. So, in my experience, let's do a poll here. How many people are doing, like, normal Kubernetes, just keeping the lights on, where their environments are like, I'm deploying an nginx service, right? It's not being over-engineered. How many people are exposed to Kubernetes doing RCE-as-a-service and doing dangerous stuff inside of clusters? Okay, that's good. Last time I did this, it was way too many people saying that they were doing this. Because there's this other sliver in here: there's a bunch of people doing that, and then there's a bunch of people

not doing it correctly, where the number of people doing it safely is extremely low. And it's because of this problem of seccomp and performance and all the balancing that you have to do. So, to summarize, these are the main options. We can keep a container just a container: do the standard stuff, run your CIS benchmark tools, and what comes out says, don't run privileged pods, run as non-root. Kubernetes, in the most recent version, just came out with user namespaces being supported. That seems crazy to me, but it's a super strong security

boundary if you want to turn that on. Don't do host mounts. I mean, there's a whole list of things not to do, but this is just standard hardening stuff. Don't mess around with seccomp; just use the defaults, the ones that are already there. Just turn them on. Just make sure that AppArmor, SELinux, and seccomp are turned on. That's kind of your baseline if you're just keeping the lights on for your Kubernetes cluster. One of the fallacies that I try to go through is: how useful are custom seccomp profiles, and how many times did seccomp protect us from

previous exploits in Docker and containerd or anything? And there have only been three times that it protected us from a CVE. That's not to say that it doesn't protect us from other things, like other container breakout strategies, right? Just misconfigurations. And it's still useful. I don't want to take away from seccomp, but let's not overinvest in what seccomp is, and the amount of time and effort to do it correctly is harder than it looks. So, option B: if you care, if you've got RCE-as-a-service and you can pay for virtualization and you can pay for the bare-metal instances, Kata Containers is

like a point-and-click virtualization container technology you can run in your Kubernetes clusters today. gVisor works universally. If anybody's using Google Cloud, does anybody use Google Cloud in here? Yeah, there's one. Not many folks are. I mean, I personally use it, but most people are on AWS, or even Azure seems to be more popular. But with gVisor, it's just point-and-click to turn on shielded nodes for GCP. And something I've been working on lately is doing more QEMU for micro-kernel stuff, but it's all the same thing: it's virtualization, a strong boundary. Option C is the containers-plus-seccomp

stuff that I've been trying to get at. Yes, you can do this. Yes, you can install Inspektor Gadget. There's an operator called the seccomp operator. I was mentioning that BPF has a problem of not enough JSON. Well, Kubernetes came along and they're like, well, your JSON doesn't have enough YAML. So it allows you to take the JSON file and convert it into a YAML object in Kubernetes. So now we're ten layers of abstraction away from the actual assembly that's in the kernel, but you can run the seccomp operator to manage your seccomp profiles in the cluster, so you don't have to manage them on disk. So, a bunch of fun details about

how to do this at scale. I've seen a couple of environments that do this. It's challenging. It's possible, especially for super high-risk stuff. The question, and this is often how my talks end up at the end, is: maybe not Kubernetes? It's not a popular opinion, but why don't we step back and ask why Kubernetes was chosen in the first place? Somebody needs to convince me that an EC2 instance, or Fargate, doesn't provide the same if not better level of security and scalability. It's really nice that we've got this nice little Kubernetes API, and everybody uses containers, but maybe it's really not

designed for the RCE-as-a-service thing, and has it become this attractive nuisance? So, we said containers are not a security boundary, right? We've kind of validated why, specifically, they're not. I think there is some value in saying they're sometimes a security boundary, or some level of security improvement. Ephemeral processes that are built and destroyed all the time and that are isolated as a first principle is a good security posture, something that has real security benefit. But we can't rely on it as much as virtual machines. So why do people keep coming in and saying, I want to turn my

Kubernetes environment into a bunch of sandboxes? I think it's because of this. Has anybody heard of the attractive nuisance problem? It's a legal thing. What's that? There's a Curb episode about this, where it's the scenario of, yeah, you've got a pool in your backyard and it doesn't have enough fences, and somebody breaks into your house and your land and dies in the pool because they fell in or something like that. It's your responsibility, because your pool was an attractive nuisance: it was so pretty-looking and shiny-looking, and it looked so fun to use. And Kubernetes is kind of like that:

like, people are just looking at Kubernetes going, ooh, it's pretty and shiny, I think I want to use this thing, and they should know about the dangers. That's one of the other properties: developers have known about the dangers of Kubernetes for years, and they still don't build security into Kubernetes on day one; they're kind of bolting it on now. The other problem is that if you just don't understand the risk, it becomes an attractive nuisance, and Kubernetes is so esoteric and complicated that I can only say I might know, you know, 20% of how Kubernetes works, and I don't know if anybody completely knows how the whole thing works at

this level. And then the last thing that makes this an attractive nuisance, that makes it your legal responsibility, is that it's easy to secure: you could just do a couple of things. And I've now described Kata Containers and some of this point-and-click stuff that makes it easy to secure. So I feel like this is what Kubernetes really is. We've got eBPF to monitor the pool, we've got these containers running in the pool, but it's really this, I don't know, body of murky water where anybody can just fall in, accidentally doing Kubernetes, and then you go, oh, I died. But now, who's responsible for

it? Like, there's a container breakout. Who's responsible for that? Is it because Kubernetes didn't provide the level of security that you expected, or is it because you were just messing around with Kubernetes, didn't know how to use it, and now it's trying to do way too much? So, in summary: for performance, if you really do care about RCE-as-a-service, the seccomp stuff and limiting system calls is the most performant way of handling most of this. gVisor is the universal standard for doing this across any cloud, any environment. I really like that as a tool, but it balances performance and security kind

of in between. And if you care about security the most, above all costs, then a hypervisor, virtual-machine stuff, is the solution. And that's my time. So I appreciate you listening and looking at my system call characters.
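For reference, the "keep the lights on" baseline he outlined earlier maps onto a pod spec along these lines (standard Kubernetes securityContext fields; the names and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: lights-on              # placeholder name
spec:
  # the Kata/gVisor option would additionally set:
  # runtimeClassName: kata
  securityContext:
    runAsNonRoot: true
  containers:
    - name: app
      image: cgr.dev/chainguard/nginx   # placeholder image
      securityContext:
        privileged: false
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
        seccompProfile:
          type: RuntimeDefault   # the built-in default profile, not a custom one
  # and no hostPath volumes, hostNetwork, hostPID, etc.
```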

I have some questions. Does anybody want to talk about my cute little characters, or do they actually want to talk? "So, that list of CVEs you had: at least based on your background and knowledge, is that more or less exhaustive, or are there more? Is that just a sample?" I feel like it's exhaustive. "Okay. Because the takeaway for me was that there are not a lot of container breakout CVEs. It's not like we were told at the beginning of containers, that any Linux kernel privilege escalation is going to be a CVE and all containers are going to fall over. Not really what ended up happening, right? But

that's kind of the list that we've seen of bugs that caused a container breakout." "If you're an AWS and CNAPP shop, do you think that Fargate is a better solution than the other options that we have?" Um, do you care about money? "No." Then it is the best. It provides really strong isolation. Firecracker is really great, and the fact that you don't have to manage it, perfect, checks all the boxes, as long as you are willing to pay the bill and scale up with it. It's probably a great solution. "I'm not paying the bill." Yeah, exactly. Cool. Appreciate your time. I'll be around. Glad to see you in Buffalo, all the out-of-

towners. Thank you.