BSidesSF 2026 - Sandboxes, Seccomp, and Syscalls: Chasing Isolation in Kubernetes (Mark Manning)

Name: BSidesSF 2026 - Sandboxes, Seccomp, and Syscalls: Chasing Isolation in Kubernetes (Mark Manning)
Uploaded: 2026-05-12
Duration: 43 min 28 s


Mentioned in this talk

Tools used

AppArmor, bpftrace, cAdvisor, Falco, gVisor, seccomp, SELinux, Tracee

Platforms

Kubernetes

Concepts

eBPF

About this talk

Sandboxes, Seccomp, and Syscalls: Chasing Isolation in Kubernetes, by Mark Manning. Containers aren’t a security boundary, but we love dangerous things. This talk dives into real-world k8s sandboxing: sharp edges of seccomp, lines between virtualization and isolation, and what is "secure enough" for your workloads. Learn how to build, or break, sandboxed workloads at scale. https://bsidessf2026.sched.com/event/84c9559937f1248112c7a7853727b63c


Okay. Well, thanks for joining us today for our next session. We have Mark Manning all the way from Rochester, New York. >> All right. Represent. >> All right. He'll be speaking to us today about Sandboxes, Seccomp, and Syscalls: Chasing Isolation in Kubernetes. So please go to Slido to post your Q&A questions, if we have time at the end. The QR code's up there; it's very tiny. If you can't see the QR code, it's at bsidessf.org/qna (Q-N-A: Quebec, November, Alpha). And then go ahead, take it away, Mark. >> Thank you. Thanks for coming to this talk right after lunch, too. I

appreciate that. I felt like everybody was going to be skipping this because they were going to eat something. And I hope you don't fall asleep. It's kind of my job to keep you awake, but those chairs are extremely comfortable, so I also understand. Sandboxes, seccomp, and system calls: we're going to be talking about Kubernetes and isolation, stuff that in my experience has been top of mind for a lot of organizations that I've been working with. I'll give you some takeaways up front, to lead you in the direction of where I'm going. Kubernetes in quote-unquote hard mode requires additional sandboxing or some kind of levels

of isolation, but what actually works? What's actually practical? I've been researching a bunch of different ways to do seccomp and how people are doing seccomp. And I'm going to go into why this [ __ ] is actually pretty hard: doing seccomp in Kubernetes. We'll talk about some alternatives, in case you don't want to use seccomp. I'll talk to you about some tools that I think are useful for it. And then at the end we'll talk about "one is none and two is one." I'll tell you what that means later. So I'm Mark. I go by antitree. I've been working most of my career

as an offensive person working in containerized environments. I worked for Snowflake, working on their Java UDFs and some sandboxes they were doing there. I do some side consulting; we're working on some fun stuff with gVisor and Kubernetes things. I work for Chainguard. They are a supply chain security company trying to build secure container images and really anything in your supply chain. But one of the things I'm working on over there is how we build software correctly and what isolation model we need. >> And between all that stuff, I've been running this service called seccompare, seccompare.com. You can check it out right now. It's just a

thing. I've been collecting people's seccomp profiles that they've been using in Kubernetes, and we have some interesting takeaways from that, I think. So I've played with different definitions of sandboxes: places where they do seccomp-filtered workloads, things using ptrace supervisors, QEMU microVMs, things like gVisor. All of them have different security boundaries and do different things. Which one works the best? Which one do you want to use? So I'll talk about system calls, because it's everybody's favorite topic. We'll talk about seccomp specifically for a lot of this. We'll get into more actual sandboxes, things you can kind of rely upon, and then I'll hopefully give you some practical

advice at the end. How many people have either heard this or believe it and kind of agree with it: containers are not a security boundary? I can see some hands. There's, like, a leg; I think I see it too. This is something we've said in the industry, containers are not a security boundary, and we agree, right? But also, what do you do? If it's not a security boundary, what's next? Okay, so what is a security boundary? I'm going to start this off with a story. It was the year 2016. And I'll preface the story with: I really don't know if this

is true. This has been passed down from Googlers through beers and through coffees and things like that. I wasn't in the room for any of this, so I could be wrong, but the moral is still fair. 2016: there was something called Google App Engine. Google App Engine, if anybody remembers, was something that would take Java and Python. You could just take arbitrary code and throw it to Google and they would run it inside of their environment. Kind of a nice feature. And one of the things that they were doing is they were calling it a Java virtual machine. You know, if anybody knows anything about Java, it's neither

virtual nor a machine. And there were some researchers in Poland who, at the end of December of 2016, decided that they were going to spend some time and poke at this Java virtual machine thing and this Google App Engine thing and just see what they could do. And it turns out they could do a lot. They compromised the Java runtime, broke out of the quote-unquote virtual machine, and compromised the underlying base operating system that Google App Engine was running on. This was a really interesting exploit. And as a reaction to that, Google came back and said, "Listen, we've got to have a hardened model for this." So they

started going through and saying, "Let's take Java and let's rip out all the dangerous bits." Yeah, exactly. And you go, "What's left of Java?" But they did that for Java and Python. They ripped out all the dangerous bits and said, "There, I'm going to make a hardened Java and a hardened Python." And they did this for every version. Back then it was Python 2.7, I think, that they were still working on. They maintained these patches to filter out all the bad stuff, and inherently kind of go, "Well, then only the good stuff is going to be possible." And that was part of their security model. There's more that

goes into this in the book Building Secure and Reliable Systems. They go into some of the detail there. We'll come back to that towards the end, but their original thing was, hey, we can put a security boundary on something that we kind of know is dangerous. But how did they do that? If anybody's working with containers, there are kind of three modes in my mind. There's the Docker mode, local containers: good luck, we're not really going to talk about that much. Whatever you're running on your local machine, I can't figure out your threat model. There's normal Kubernetes, which is anybody that took a CKA exam, or

the standard Kubernetes of "this is how you should Kubernetes." It looks pretty normal. And then there's a bunch of us doing Kubernetes hard mode, and these are my people. You're kind of doing something you know you shouldn't do, and they're not going to tell you how to do any of this stuff. These are the people that are trying to do remote code execution as a service, running arbitrary things in a Kubernetes cluster. You go, "Containers are not a security boundary, but this is all I've got." So if I can see some hands again: how many people are dealing with a Kubernetes cluster, or have worked in the past with

a Kubernetes cluster in hard mode? It's doing something dumb or dangerous? Yeah. So there's a lot of these. I've been asking this question over and over, and the answer is evolving over time, or devolving, however you want to think about it. In 2018 I asked this question and a couple of hands came up. But I keep asking, and more and more people keep raising their hands, and I'm getting more and more nervous. But, you know, it's great for me. The issue is: why are we doing all this stuff? We obviously have a problem of: I need to run

Kubernetes and do dangerous things with it. Why are we doing that? The answer I've seen most often is: that's the platform that we have. It's working over there, it's doing everything we need it to do, so why not just reuse it for the super dangerous things? This could be build servers: I'm taking arbitrary code off the internet, I'm going to compile it, I know it's untrusted, but let's just throw it in the cluster anyway. Or people doing what Google App Engine was doing, right? Taking arbitrary code, executing it, and hoping for the best. So I'm trying to break this down into

two schools of thought. There's the Google approach of filtering out all the bad stuff. We can just harden it. We can say all of the containers are inherently okay, but we need a little bit more hardening to remove the dangerous things that are happening there. Or we can go and fake the good. We can virtualize the thing. We can emulate the thing. We don't necessarily need to give direct access to the thing. So, in some of these diagrams, the blue boxes always represent the kernel. That's going to be

the crown jewels. And the red is always going to represent the filtering or the security controls we can put on. To make this slightly more relevant, I want to give you the lay of the land in terms of the people that are attacking us, because I've been doing this for a while. With attackers in 2023 and before, we had these really cool Black Hat talks and really novel research that was like, "Here's how you compromise a Kubernetes cluster," but nine times out of ten it was just script kiddies that would make, like, eight bucks in Monero. So it was really depressing. You do all this research and then

they just run a crypto miner inside your cluster and move on. That's starting to change, at least I hope, because I'm getting kind of excited about the different ways that they're attacking stuff. If anybody's heard of the LinkPro rootkit, it merges two things that I like: I love eBPF and I love to hate on Jenkins. This was a publicly exposed Jenkins server, and there's always going to be a CVE for Jenkins out there, so it got compromised. I don't really care so much about that. What they did after the fact was they compromised it with a BPF

rootkit that could hide the process list and do classic things like hiding netstat output, hiding the BPF program list, and hiding audit logs. So they were actually putting in some post-exploitation effort. They even did things like: if a SYN packet was sent with a certain window size, it would create a C2 channel for you. All of this is really cool. If you haven't seen it, it's just a fun article to read. But I don't need to come up with these really dangerous attack scenarios. I can talk about the more plausible thing, which is

that we're trying to put AI in all kinds of things, and we can imagine it's going to be inside of Kubernetes soon, right? We're trying to put sandboxes on top of AI because we know it's non-deterministic, so we're trying to control it. I'm going to keep repeating this: they're not really sandboxes when we're trying to confine the weird agentic slop that gets generated. It's not necessarily the same threat model. So I've been trying to get the term "slop box" to catch on, because it's trying to stop the slop

from hitting your operating system. I had this line in there, "we can imagine they're going to go into Kubernetes and start running this stuff on there," and then Nvidia Open Shell came out, like, last week, a couple of days ago. And that's exactly what it's designed for: run it inside of Kubernetes, run all your slop code in there, and then, perfect. So, filtering out the bad. We go, what could we do to remove all the bad stuff that's happening? Can we use something like seccomp to filter out something malicious that might be happening? So, an operational

definition of a sandbox: it's something that is doing something inside of an environment that we want to prevent from getting out. These are container breakouts. That's the dangerous bit. You can imagine a malware sandbox is designed to make sure the malware can't just spread throughout the entire operating system. Or virtual machines, or gVisor, Firecracker, Kata: obviously there's a strong security boundary for those things. Hopefully you don't need to be a Linux kernel expert to understand some of this system call stuff, but let me give you a quick overview just so we understand what we're talking about. A

system call, for any operating system, is kind of the API between your process and the rest of the OS. For things like "I want to open a file," you're going to see system calls like openat and read, and accept for socket connections, and a bunch of these fun things. Whether or not you call any of these yourself, and I don't care if your process is a Python script or whatever, under the hood there's a bunch of these system calls interfacing with the kernel. And you start going, well, cool: I know what my program needs, and I know what attackers need. They like BPF, like

that LinkPro rootkit I was talking about. So you start going, can I just filter out all the dangerous system calls? Can I filter out ptrace, which would normally hook a process, look at its memory, and be able to manipulate it? If I know what I'm doing over here and I know what attackers are doing, can I come up with a restriction for that? And that's where seccomp came in. Seccomp was created a long time ago, back in 2005, right? And it was the most pretentious name, because in 2005 they were like, "Listen, guys, we've got it. We should call this thing secure computing."
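As a rough illustration of the "what my program needs versus what attackers need" idea, here is a minimal Python sketch. The syscall sets are illustrative assumptions, not measurements from the talk, and `plan_filter` is a hypothetical helper name.

```python
# Illustrative sets only: real lists depend on the workload, libc, and kernel.
NEEDED_BY_APP = {"openat", "read", "write", "close", "accept", "exit_group"}
DANGEROUS = {"ptrace", "bpf"}  # the two examples named so far in the talk


def plan_filter(observed, dangerous=DANGEROUS):
    """Split observed syscalls into ones to allow and dangerous ones to review."""
    observed = set(observed)
    return {
        "allow": sorted(observed - dangerous),
        "review": sorted(observed & dangerous),
    }


# Pretend a trace of the app also picked up a bpf call from somewhere.
plan = plan_filter(NEEDED_BY_APP | {"bpf"})
print("allow:", plan["allow"])
print("review:", plan["review"])
```

The point of the split is that anything landing in `review` deserves a human decision before it goes into an allowlist.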

And this was going to solve the entire industry. So this is secure computing mode for the Linux kernel. But Docker came along and they were like, hey, no one's really using this thing because it's kind of complicated. You've got to import a library and do all this crazy stuff. We've abstracted the processes away, so you can put a seccomp profile on the outer boundary of a container. And Docker came along and was like, you know what the problem with seccomp is? It doesn't have enough JSON. So they're like, great, let me fix that for you. And then Kubernetes comes along. It's like, you know what the problem with seccomp is,

right? You don't have enough YAML. So now we've got this abstraction where you can put in YAML that includes JSON, and this is how we're running it inside of our Kubernetes clusters right now. But you can see it's a really simple idea. I'm going to take something like openat, to open a path or something, and I'm going to give it an action. I'm going to say, when I see this system call, I want you to log it. That's the simplest thing you could say. I want you to kill it. I want you to return an error. I want you to allow it. There's a bunch of these

actions you can take, and you can do complex things like: when the third argument of my system call equals four, or whatever it is, then take this action on it. So it's actually a well-designed system. It's complicated. It's deep in the Linux kernel. And now a lot of people are using it. So here's a quick demo. I know the text is small, but it's the biggest screen I've ever presented on. What we're trying to do is restrict socket, bind, and listen. We're going to make a seccomp policy that says block all of those. So here's my example: inside of a

Kubernetes cluster, I'm connecting to port 4444. I see there's a secret there. Now I do the same thing with a seccomp-restricted one. Oh, operation not permitted. I can't make a socket. It's the exact same code, the exact same containers, but one of them has a seccomp profile. That's the simplest example of using seccomp to restrict something that is potentially bad. So, one of the things that I've heard over the years is, hey, if you really care about security in Kubernetes, you should just use custom seccomp profiles. Has anybody ever heard that, or taken the CKS exam and been taught that? Yeah. So there's two goals for this part of the presentation.
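The profile from the demo is not shown in the transcript, so the following is only a sketch of what a deny-list profile for socket, bind, and listen could look like in the JSON format that Docker and OCI runtimes accept. The helper names `build_denylist_profile` and `arg_rule` are hypothetical, and the `ioctl` rule is a made-up example of the "third argument equals four" kind of condition mentioned above.

```python
import json


def build_denylist_profile(blocked):
    """Allow everything by default, but make the named syscalls fail with an error."""
    return {
        "defaultAction": "SCMP_ACT_ALLOW",
        "syscalls": [
            {"names": sorted(blocked), "action": "SCMP_ACT_ERRNO"},
        ],
    }


def arg_rule(name, index, value, action):
    """Rule that fires only when argument `index` (0-based) equals `value`."""
    return {
        "names": [name],
        "action": action,
        "args": [{"index": index, "value": value, "op": "SCMP_CMP_EQ"}],
    }


profile = build_denylist_profile({"socket", "bind", "listen"})

# Hypothetical conditional rule: log ioctl when its third argument (index 2) is 4.
profile["syscalls"].append(arg_rule("ioctl", index=2, value=4, action="SCMP_ACT_LOG"))

profile_json = json.dumps(profile, indent=2)
print(profile_json)
```

In Kubernetes, a file like this would typically be placed on the node and referenced from the pod's `securityContext.seccompProfile` with `type: Localhost`; the exact path depends on how the kubelet is configured.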

One is, if you're doing this, it takes a lot of effort, and I want to validate that. You're doing a great job, and I want to applaud you for doing seccomp correctly. There's a lot that needs to go into this. If you're doing it and you're doing it well, it's great. It works out really well. And then for those of you that have maybe heard about seccomp for the first time today, or you're going, "Hey, maybe this weekend I want to play with it," and you've got hopes and dreams of what you want to do with seccomp: I want to stomp on that really

hard. So my only failure for this section is if you're in between, lukewarm about the entire thing. I'd like you to have a strong opinion one way or another on how to do it correctly. So, I'll give you a real-world scenario. I've done this myself. Years ago, there was an organization that said, "Hey, I want you to do a security review of my environment. I've got this one microservice that's really dangerous. It takes in customers' images." It just crops them, reformats them or whatever, for their avatars or something. We know it's dangerous. It's the dangerous software. Well, maybe besides Jenkins.

It's called ImageMagick. ImageMagick, we know, is going to have every CVE. In fact, if anybody wants to fire up Claude and point it at ImageMagick right now, I guarantee at least a dozen CVEs will come out of it. So they knew this at the time: there's just so much parsing that the thing does that it's inherently dangerous. So they said, build a seccomp policy for this so that you can confine it to do only exactly what we need it to do. So I said, great, well, strace is a tool for this, and this is a classic Linux tool that most people use.
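To make the "strace is a tool for this" step concrete, here is a small Python sketch that pulls syscall names out of strace-style output, which is the raw material for a profile. The log below is a made-up sample, not output from the talk, and `count_syscalls` is a hypothetical helper.

```python
import re
from collections import Counter

# Made-up sample in the style of `strace -f` output (assumption, not real data).
SAMPLE_STRACE = r"""
execve("/usr/bin/hello", ["hello"], 0x7ffc2b1c8a10 /* 12 vars */) = 0
brk(NULL) = 0x55d0c8a4e000
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF", 832) = 832
close(3) = 0
[pid  4242] write(1, "hello world\n", 12) = 12
--- SIGCHLD {si_signo=SIGCHLD} ---
exit_group(0) = ?
+++ exited with 0 +++
"""

# Optional "[pid N] " prefix, then the syscall name, then an opening parenthesis.
SYSCALL_RE = re.compile(r"(?:\[pid\s+\d+\]\s+)?([a-z_][a-z0-9_]*)\(")


def count_syscalls(text):
    """Count syscall names, skipping signal and exit lines that are not syscalls."""
    counts = Counter()
    for line in text.splitlines():
        match = SYSCALL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


counts = count_syscalls(SAMPLE_STRACE)
print(sorted(counts))
print(sum(counts.values()), "calls,", len(counts), "distinct syscalls")
```

A list like `sorted(counts)` is what would end up in the `names` array of an allowlist profile, which is why anything the tracer misses or picks up by accident matters so much.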

This will just dump all the system calls for a given process. But we have a lot of Kubernetes context-aware tools now, like Tracee, Inspector Gadget, kubectl-trace, and the Security Profiles Operator. These are tools that are designed to run in the background with something like BPF, or even auditd, capture all the system calls, and store them in that beautiful JSON file and YAML file. Then you can run them in the cluster and apply them to your workloads, and the goal is that it can automate a lot of this. So we'll take a simple example, hello world. Let's run strace on it. We see at the

bottom there are about 170 system calls that were made by my hello world program. Okay, that seems fine. Let's run strace again. It's 172. Let's do it again. It's 171. The number of system calls emitted is not always the same, even though my code is static. So it's ever-evolving; depending on the execution environment, you have to pay attention. But that was cheating. I just ran it as a process. Let's put it in a container. I'll put it in a Chainguard tiny container. I'm going to throw in the hello world binary and just run it. I'm going to do the same strace. So,

I'm going to copy out how many system calls are in there. It turns out it's 1,811. So we went from 170 for a hello world program to almost 2,000 system calls that need to go in there. So we've got these tools, and Inspector Gadget is one of my favorites. The team is smart. They know what they're doing. Unfortunately, they got acquired by Microsoft. Sorry. But the team was smart, and they were doing cool stuff. They were using BPF to hook the tracepoints of the Linux kernel and doing really cool stuff in Kubernetes to be aware of what's running in the cluster, being aware

of the pod that's executing and saying, "Hey, I'm going to trace all the system calls for that single pod, and then I'm going to store it inside the cluster or let you do stuff with it." So this is a really easy tool. It doesn't require a ton of deployments, Helm charts and stuff. You do, like, a kubectl install for most of this stuff and it'll work great. So now we've got a custom seccomp profile. It's going to look like this. It's just a pile of system calls that my container is going to be using. We go, great, mission accomplished. Ship it to the customer. Here's your little ImageMagick, you

know, pod that you want to restrict. But there's a bunch of scenarios where the system call tracing process is lying to us. Depending on the tooling, they will hook different tracepoints. Sometimes it's ptrace, sometimes it's the eBPF kprobe stuff. So depending on what you're using and what you're hooking, you'll see different system calls. Or there's a risk you'll just miss a system call, which is going to result in the container crashing in production, and you've caused more problems than you wanted. There's also the scenario where system calls get automatically converted by compilers. Depending on which

compiler you're using, it'll go, "Oh, I see you're using fork, but the cooler version of that system call is actually clone now, and they're effectively the same thing at this point. So I'm going to convert that system call for you." So depending on how you compiled it and when you ran it, you're going to get a different result. And then there's the classic problem that all of these, especially eBPF, have ring buffers, and if you have too many system calls going through the system, you're just going to miss them. So we still have to use our tools for a lot of this, and we still have to rely

upon this for automation, because there's not really a good way to artisanally craft these seccomp profiles. So which tools are good? I don't want this section to be "this tool is better than that tool." But what I did is I took about a hundred images and configured them to run in a reproducible way as a test: here's NGINX, here's just Alpine, here's whatever. And then I ran through all of them and said, here are three tools, I want you to trace them all. And the results were kind of

eye-opening to me, because it was just all over the place in terms of which ones were consistent about the system calls they would get. It's the exact same images, run in the exact same way, in a minikube cluster, using similar tools. They all gave completely different results. One of the things that was happening, too: I was mentioning, what if you skip a system call? That means a crash. But there's another problem that's actually more dangerous from the security perspective. There's a scenario where a bunch of system calls get added in that you don't even know about. For example, the profile for this hello world program will add in the BPF system call

and you're like, why does my hello world now allow a BPF backdoor to be installed? It doesn't make any sense. One of the reasons is that if the containers you're profiling are having things happen to them, you'll capture the things that are happening to them. If you're crazy enough to run CrowdStrike in your Kubernetes cluster, you're going to see a bunch of real-time monitoring stuff and BPF system calls being run against the container, because it's analyzing it. But what you're attempting to do is capture all the system calls that are coming out

of the container. So depending on your tools, you might add in a bunch of dangerous system calls. And it turns out all the tools do this. Between Inspector Gadget, kubectl-trace, and Tracee, what I'm showing on the left-hand side is the BPF system call, which allows you to install a rootkit using eBPF or run anything in there. io_uring, keyctl: these are all just dangerous, basically straight container breakout exploits. And each one of these tools allowed that in by default. There are compensating things you can do. And again, it's not about the tools; it's about the way that we're tracing things. So all of these tools are

making our lives harder. So there are these three perils of system call filtering. If you want to do this stuff, you can still do it, but there are things we need to overcome. One is the risk of a missing system call. We talked about that: it causes a crash. Okay. One is an unrelated system call, like BPF, that just shows up in the system call list, and you go, I didn't want to run that. That impacts security. Now my hello world program has a BPF backdoor. This is caused by things like sidecars that are running next to the pod, or your

Falco and your Sysdig. Not to point blame at any of these; anything that's running in the background trying to watch those pods might get wrapped into the system call list. And there's another scenario, especially for databases. MongoDB does this. There's a bunch of tools that will attempt various different system calls. They won't even pay attention to the kernel they're running on; they'll just say, "Hey, I'm going to try something like io_uring," and if it fails, they'll go into a backup mode and use a different one. It's expecting you to block it or not block

it. But when you're tracing it, you capture all of these io_uring things. So it's very difficult to generate, in a consistent way, a seccomp profile that is actually more secure than the one we had before. So we have to measure it. We have to actually go in and say, is my seccomp profile successful? How well am I doing? And one of the problems is that we're not building from ground zero. We're not saying, let's build a new seccomp profile to harden it. We're competing with an existing seccomp

profile that's already there by default in any cluster today. There's a default seccomp profile. There's an AppArmor one. There's an SELinux one. You might not even realize that it's there. And it does a pretty good job. I'm here to say it's not great; you can definitely make it better. But it's not allowing any 0-days that we know about, and it's actually a pretty complex profiling system, for most runtimes, I should say. Everything except Podman, but that's a separate talk. So I created this service called seccompare that was trying to just say, hey, I've got these two seccomp profiles, I

want to compare them to the default one. What if I build a custom profile and then compare it to the default one: which one's better? Am I actually improving security at all? It's just a JSON comparison thing, but you can give it different capabilities and stuff, and it shows: here are all the dangerous system calls that your custom one introduced, and here's what the default one blocks, or vice versa. So now you at least have some kind of metric where you can say, hey, am I doing a better job than the default one? Let's compare these two

things. But in doing that, what I noticed is, if anybody's studied for a Certified Kubernetes Security thing (what is that, administrator, engineer, whatever), they give you a bunch of useful information about securing Kubernetes and stuff that you want to know. But one of the things I keep seeing that really bothers me is, "And if you care about security, make sure you do seccomp." They don't really say how. And in fact, if you look at any of the books, you'll see a picture just like this, where it's like, hey, if you want to secure your cluster, make sure you block mkdir. And that's literally just making a

directory inside of your container. So are we missing the point on what seccomp can do and how we're measuring the risk reduction? Because we can talk about benign system calls: mkdir, read, rename, just basic, normal operations your process might run. And then there's this set of dangerous ones: ptrace, bpf, and init_module, which would allow you to install a Linux kernel module. So are we even measuring success in the right way? One of the things about seccompare is that the underlying engine is this really complex, detailed, multi-dimensional seccomp analysis tool. And it has to be multi-dimensional, because I'm measuring the dangerous system

calls that it allows, and I'm measuring the number of system calls as a whole that your profile is allowing, and comparing it to the default one. So you can take your JSON, your YAML, and pipe it through this. This is a command-line tool. It's all open source. I don't care if you tell Claude to rewrite it. I'm not bothered by that. Go ahead and use it. But this is designed to be a measurement of how good your seccomp profiles are, like a grader for them. So you get some interesting results. If you take all the seccomp profiles you can find on the internet and measure them, you get

some interesting stuff like this. And here's a nice graph. Green means good, red means bad. But I think I know my audience a little bit better, so let's talk about it more like Dungeons and Dragons characters, right? We can imagine lawful good seccomp profiles are ones that are more restrictive, meaning they'll have fewer of even the benign system calls. That just reduces the attack surface. And they'll be less dangerous: things like ptrace and bpf are not going to be there. So you've got the good character that represents everything, but then you've got

these subtle weird things that would happen, where maybe it's less restrictive, in that it actually allows more benign system calls, but do we necessarily care? That's a scenario that could happen. And in fact, that happens a lot of the time with these custom seccomp profiles. But it's still less dangerous. You dropped things like ptrace and you dropped things like bpf. And then you've got the top right corner, which is just full chaotic evil. Less restrictive means I allow a whole bunch of additional system calls, and I'm allowing the dangerous system calls. So you get characters like this, right? So how are we doing on that whole sandbox

thing? I've been talking about how we can capture system calls, but it's not that easy. We need to measure it. We need to iterate over this stuff. So, does anybody have a little hope? There's got to be. Come on. One person still has hope, right? We got hope over here. So, for that guy, I've got this section. So, I've been talking about the operational burden of seccomp, and some scenarios, just some sharp edges that you can compensate for, that you can overcome. But there's a whole bunch of things around seccomp that allow certain bypasses. So I'm going to ask a question here and

I need some interaction. What's the difference between these two? The one on my left is privileged equals true, and I'm still applying the runtime default. So this is running a Kubernetes pod in privileged mode. The other one is running the same container with all capabilities, but also with the runtime default. Anybody want to say what they think is the difference between these two? Effectively we go, privileged, we don't care about any security, we've just given up on it, right? Anybody? Anything? Anybody want to give me wrong answers? All asleep. That's all right, I understand. In privileged mode you truly give up on

security. You remove seccomp, you remove AppArmor, you remove SELinux, you remove cgroups, you just give straight access to the /dev environment on the Kubernetes node, or on the Linux kernel. So you've dumped everything. If you did cap-add ALL, you're still granting the same capabilities inside the container. So it's still not reducing the risk. The container can do anything that it wants to. We're granting that ability, but you keep seccomp, you keep AppArmor, you keep SELinux, and you keep a bunch of the other restrictions. So there's this interesting scenario that could come up of, hey, we've never really tested

seccomp, because it's always been a layer of defense, and how do we test a layer when the main container layer is there? So let's remove the container layer and see how seccomp really does at providing any level of security for our container. So on the right-hand side you're seeing I'm doing a custom seccomp profile with hello.json, but I've granted all capabilities. So the container can do whatever it wants, but my custom seccomp profile specifically says block socket, bind, listen: block network connections, effectively, is what I was showing in the previous demo. So there's a problem, though: there's a bunch of scenarios where we can bypass seccomp if we're not paying

attention. And there's about a dozen of these, and maybe people know more. There used to be this way of flipping the different architectures so that seccomp would only apply to, like, x32 but not x64 like you'd designed it to, which would allow it to be bypassed. There's a bunch of things, like just running it on ARM means you need to pay attention to which architecture your seccomp profile is for. Time of check, time of use for system calls in general. There's a bunch of these different scenarios, and we'll go through io_uring as a practical scenario. Does anybody know about io_uring? Is anybody into that

kind of stuff? Okay, I've got to thank you. I just want to make sure I'm not completely insane. MongoDB and a lot of databases will use io_uring. And it was this idea that MongoDB was coming out and saying, I'm doing a ton of transactions constantly. That's what the database is for. I don't want to constantly have to do these system calls back and forth to the kernel. I've got a great idea. Why don't I multiplex system calls into one single system call? So io_uring allows you to take socket and connect and a bunch of normal I/O system calls and cram them into one single system call. So this becomes

interesting because I just blocked socket and I blocked connect and I blocked send. But io_uring is a completely different one. So when you multiplex them together, you've got this scenario of, okay, I'm attempting to block socket, bind, and listen, right? That's what we were doing before. But then I go through and I say, "Hey, I've accidentally allowed io_uring," and, like, oops, or maybe there's a default accept. So inherently you can allow this set of system calls. So if I want to make a connection by default using netcat or just wget, just a straight connection, my seccomp policy says block socket, block accept, block connect. So it's going to

return blocked on that connection being made, right? Okay, mission accomplished. Seccomp did its job. It says exit code 7. I do the exact same thing, and I've just rewritten that wget or that netcat listener to only use io_uring and send only io_uring socket calls, and it does the exact same thing. It makes the exact same connection, but seccomp doesn't see it. In fact, it doesn't even see that entire set of io_uring stuff that's happening. And now you can bypass seccomp and that network restriction. So I said there's two goals, right? If anybody's doing seccomp, I want to validate all the effort that you just

put in, because all the stuff that I was talking about, the operational details and the burden that goes into it, building consistent profiles and the automation to do this stuff, takes a lot of effort. And usually the way that this gets done at scale is that you find the most atomic workloads, the smallest pieces. And this is kind of what the Chrome team does. They'll find the renderer and they'll say, "I'm going to put a seccomp profile on that piece, because I don't want to put a seccomp profile on every container in my cluster." That wouldn't be easy to maintain. So you find the most dangerous components that are

usually static and you manage those, and then we have to measure the success. We don't even know if we're doing anything correct here. If you don't measure it, it doesn't exist, right? So all of the seccomp profiles that we're generating, are we doing a good job? Are we doing a 5% improvement? What are we actually doing here? So we're constantly reassessing. The seccompare and seccompute tools are supposed to help with that. And then, to stomp on the dreams of people that are just trying to think about seccomp: we were talking about missing system calls, the problems of just catching unrelated system calls, unapproved

system calls. We talked about how the default seccomp profile gives you a base layer. So we're not just building from scratch. We're saying, okay, if there's some base level of security that the default profile is providing, we have to go above and beyond. And we often have to say it's worth all that effort to go above and beyond. And the other problem is we have no inherent measure of this layer as a security control. Are we doing a good job? We have no clue. And then beyond the operational burdens, there's just the bypass scenario. So okay, I just [ __ ] all over this thing, right, for a

while. Let me talk about some alternatives and some ways that we can fix this a little bit. What actually comes next? So a couple of what-if scenarios for you. What if we had better registry support for seccomp profiles, so we could load them along with the image? We can go, okay, here's nginx, but somebody has curated an nginx seccomp profile for you. Wouldn't that be convenient? So you didn't have to go through all this dangerous stuff. What if you had dedicated CI and tools that were designed to more consistently capture system calls and bypass all the challenges that I was trying to

explain to you. And then this other one that I've been talking about to people is, what if we had common behavior profiles? Web servers always look like this: these are the sets of system calls that they need. Shells always look like this. We could come up with these common things even without profiling the environments. We can come up with default seccomp profiles for different types of behaviors and different templates. But one of the things that I was coming to realize is there's another angle to this, where we say that we could potentially harden container profiles. So another question for everybody, to wake you up for a second.
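[Editor's sketch] The common-behavior-profile idea above could look roughly like the following. This is a minimal sketch and not something shown in the talk: the `TEMPLATES` table, the `build_profile` helper, and the syscall lists are hypothetical placeholders, and a real web server or shell would need a longer, measured allowlist. The output uses the JSON layout of Docker/OCI seccomp profiles (`defaultAction`, `architectures`, `syscalls`).

```python
import json

# ASSUMPTION: illustrative placeholder lists, not vetted allowlists for any
# real web server or shell. A usable template would come from measurement.
TEMPLATES = {
    "web-server": ["accept4", "bind", "close", "epoll_wait",
                   "listen", "read", "socket", "write"],
    "shell": ["close", "execve", "read", "wait4", "write"],
}


def build_profile(behavior, extra=()):
    """Build a deny-by-default seccomp profile from a behavior template.

    `extra` lets a workload add the few syscalls its template lacks.
    """
    names = sorted(set(TEMPLATES[behavior]) | set(extra))
    return {
        "defaultAction": "SCMP_ACT_ERRNO",        # block everything not listed
        "architectures": ["SCMP_ARCH_X86_64"],    # pin the architecture explicitly
        "syscalls": [{"names": names, "action": "SCMP_ACT_ALLOW"}],
    }


if __name__ == "__main__":
    # Emit a profile that could be saved as a .json file for the runtime.
    print(json.dumps(build_profile("web-server"), indent=2))
```

Starting from deny-by-default and adding per-workload extras keeps the template small, which is the "more restrictive" axis the talk measures; the architecture is pinned because the talk notes that a profile written for one architecture may not apply to another.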

How many of you are running a cluster that doesn't have a privileged workload in it? Okay. Inverse, there's a couple of you. Okay. How many of you are running at least one privileged workload inside your cluster? Yeah. I don't believe any of the people that say they're not running any privileged workloads, first of all, because even kube-proxy, if you looked at it from three or four versions ago, still ran in straight privileged mode, and minikube still runs it this way. So even the core components are running in privileged mode, which you have to figure out how to evade and control a little bit. So this is like

this pervasive thing. If you've ever done anything with a GPU inside of a Kubernetes cluster, they go, "Hey, by the way, step one is run it in privileged mode." So my question is, what if we attacked the things that were already dangerous, the ones that we know are super scary and are already running in privileged mode? Why are we working on the things that are already hardened and restricted? What if we took the things that are privileged and then just changed them into just the capabilities they need? They'll still be dangerous and risky: CAP_SYS_ADMIN, CAP_NET_ADMIN give you a bunch of capabilities to do dangerous stuff, but

what if you added in seccomp-bpf restrictions on what they can actually do in there? Does this move the bar at all? This is something that I think is worth exploring. So, I said there's two camps. So, let me cover the second camp, and we can wrap this up. The second camp is, what if we could fake the good virtualization and emulation? Because we started out with bare metal, and then came virtualization, and basically the only difference between virtualization and containers was this thing called a hypervisor. We don't know what it does, but it seems to be kind of important to a lot of cloud environments, right? If a

hypervisor falls over, most of AWS, GCP, every cloud environment kind of falls over as well, so we know that this is a hardened, secure boundary. And then with containers we're like, hey, we don't need the hypervisor thing, let's just call it a supervisor, and it's barely even that. It's a process that just manages other processes for containerd. But now we've got this next layer of, hey, we went from bare metal to virtualization, virtualization to containers, but what if we went from containers back into virtualization and crammed a hypervisor into a container? So we've got a bare metal base, we've got the Linux kernel, we've still got the supervisor

layer of containers and stuff like that, but now what if we ran a virtual machine inside of every single one of our containers? And maybe this sounds crazy. When you draw it as a picture, it does look super crazy, but this is what a microVM is, and this is what gVisor essentially does. So you get all the security properties of something like a hypervisor, with a strong security boundary, but you still get the orchestration properties of a container. So a practical attack would be: I need to compromise an application, and then I need to compromise the kernel, and then I need to compromise the hypervisor, and then the

namespace. So it does add a strong layer of protection, and if anybody was asking me, I would go, obviously try this first. Okay, we've been through a lot of stuff. Let's summarize some of the things and some of the points that I've been trying to make here. So, which runtime do you choose? We still have that core question of, I need to do something dangerous inside of Kubernetes, what's it actually going to look like? Hypervisors end up being the best. If you've got bare metal environments, or you've got something that does nested virtualization, this is the way to go. Kata Containers,

uh styly, there's a bunch of things that are doing hypervisor-based, still container runtimes, and same thing with gVisor. The reason I like gVisor a lot of the time is you don't need nested virtualization or bare metal machines. While it works on those, you can apply it in any Kubernetes cluster. It emulates the kernel as opposed to leveraging hypervisors and virtual machines. So there's a cost to this in terms of performance, but you get a lot of opportunities to accelerate that with some of the configuration options. If you haven't looked at gVisor in the last 5 years, take a look at it, because there's

some new features that they've just added. And then I don't want to sleep on seccomp. I've been dumping on it for a lot of the time. It has a really good use case and it can be used in an effective way. It's great for high performance environments. It's great for when you want to control it. It's also a pretty big foot gun, right? So, you just need to invest heavily into it, not just go halfway. So we talked about Kubernetes in hard mode. What do we do? Seccomp, microVMs, gVisor, or something like that. The fact that there's so many organizations that are still trying to do hard mode in

Kubernetes means that there's still a practical discussion that needs to be had here. We don't have really good tools for seccomp. So seccompare and seccompute help out with that. And we need to figure out how to invest into this thing to do it completely correctly. And I still said one is none and two is one. So, two last points I want to make. This is something that, maybe with a military background, you already know about, but practical complexity is a peacetime disease. It was this idea that in the military, people learned that if you hadn't seen war, conflict, and, oh, we don't have that problem anymore, you were making these brittle

processes and procedures because you hadn't really spent time in the field and experienced all the stuff that really happens when you're out there. So they realized that this was happening. I think seccomp kind of falls into that. It's like, ideally, we could do that, but it's not one of those things where you just go, I'm going to make a seccomp profile this weekend and I'm going to secure it, because we end up securing the benign, and then we become the bigger risk to our organization, because we think that we're improving it, but we're kind of fetishizing these security controls sometimes. And I said, one is none, two

is one. And this came from, somebody was just talking about, like, we're going on a hike. They would say, "Oh, well, we need to have a backup flashlight." Okay, great. Two flashlights. But the important part of that is you have two working flashlights. You don't go and say, "Hey, I've got two broken flashlights in my backpack. I'm ready to go." You test them. You put the batteries in. And it's the same thing with seccomp. We have to verify the different layers. My very last point is, remember that story I was telling you about Google? Well, they went on and they said, "Hey, I need to add in a bunch of

security controls on top of the system. I'm going to start isolating the environment. I'm going to make a ptrace supervisor. I'm going to build this thing called Gofer." And because of this, they built gVisor, because of those hackers in Poland that were doing all this stuff. They created gVisor as a reaction to that Google App Engine exploit. So, if you take nothing else away, take the fact that every December, at the end of December, everybody should be super scared, now that Claude exists and there are hackers with time on their hands, that they're going to be compromising all of our environments. That's the time that I have. If you want

to talk to me after, we can talk about this. If you want to scan this QR code, you can have all the slides if you give me feedback. Thanks very much.
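[Editor's sketch] The two measurements described in the talk (how many system calls a profile allows compared with a baseline, and whether dangerous ones are allowed) plus the io_uring gap can be sketched as below. This is a hedged illustration, not the speaker's tool: `allowed_syscalls`, `grade`, and the tiny `UNIVERSE` set are hypothetical names, the syscall sets are illustrative and far from exhaustive, and argument filters, rule ordering, and architectures are ignored. Profiles are assumed to use the Docker/OCI JSON layout.

```python
import json

# ASSUMPTION: illustrative sets only. The talk names ptrace, bpf and
# init_module as dangerous; a real grader would use the full syscall table.
DANGEROUS = {"ptrace", "bpf", "init_module"}
NETWORK = {"socket", "bind", "listen", "connect", "accept"}
IO_URING = {"io_uring_setup", "io_uring_enter", "io_uring_register"}
BENIGN = {"read", "write", "mkdir", "rename"}
UNIVERSE = DANGEROUS | NETWORK | IO_URING | BENIGN

# Actions that let the syscall run (SCMP_ACT_LOG logs it but still allows it).
ALLOWING = {"SCMP_ACT_ALLOW", "SCMP_ACT_LOG"}


def allowed_syscalls(profile, universe=UNIVERSE):
    """Return the syscalls in `universe` the profile lets through.

    Simplification: per-argument conditions on rules are ignored.
    """
    default_ok = profile.get("defaultAction") in ALLOWING
    verdict = {name: default_ok for name in universe}
    for rule in profile.get("syscalls", []):
        rule_ok = rule.get("action") in ALLOWING
        for name in rule.get("names", []):
            if name in verdict:
                verdict[name] = rule_ok
    return {name for name, ok in verdict.items() if ok}


def grade(profile, baseline, universe=UNIVERSE):
    """Score a profile on both axes and flag the io_uring gap."""
    allowed = allowed_syscalls(profile, universe)
    base = allowed_syscalls(baseline, universe)
    return {
        "allowed_count": len(allowed),
        "baseline_count": len(base),
        "more_restrictive": len(allowed) < len(base),
        "dangerous_allowed": sorted(allowed & DANGEROUS),
        # Network syscalls are all blocked, yet io_uring can still reach them.
        "io_uring_gap": not (allowed & NETWORK) and bool(allowed & IO_URING),
    }


if __name__ == "__main__":
    # Hypothetical reconstruction of a "block the network" profile.
    hello = {"defaultAction": "SCMP_ACT_ALLOW",
             "syscalls": [{"names": sorted(NETWORK), "action": "SCMP_ACT_ERRNO"}]}
    baseline = {"defaultAction": "SCMP_ACT_ERRNO",
                "syscalls": [{"names": sorted(BENIGN | NETWORK),
                              "action": "SCMP_ACT_ALLOW"}]}
    print(json.dumps(grade(hello, baseline), indent=2))
```

A profile that blocks a few named calls on top of a default-allow policy lands in the "less restrictive and still dangerous" corner, and the `io_uring_gap` flag marks the case from the demo where the network calls are blocked but the io_uring calls are not.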

Well, thank you very much, Mark. Maybe time for one or two questions, if they're brief. Let's try the first one. Can these tools intercept direct syscalls to the kernel in a program, otherwise known as inline assembly? I mean the ones bypassing libc trampolines, which can be easily hooked. >> So some of them can. All the ones I've been talking about, like, no, seccomp does that inside the kernel. There's a new set of challenges when you try to do inline system call interception, and I mentioned kind of briefly the time of check, time of use problem. So there's a whole different thing, but seccomp, no, it inherently will block it

at the kernel, inside the kernel level, but it won't do inline, like you're kind of talking about, with eBPF or something. >> This one might be a quick yes or no. Is there a CLI version of seccompare? >> Yeah, that's what seccompute is. So to be clear, seccompute is a command line tool that you can just run right now and point it at any seccomp profile. Okay. >> Well, awesome. Thank you. Going to swap that out. All right, we're going to throw up a QR code for surveys. Thank you very much, Mark. A pleasure. Just a few reminders before we go. Um,

actually, yeah, round of applause first before the reminders, please.
