
All right, can everyone hear me? Okay. First off: Kubernetes talk at four o'clock. Thanks, that's about all I've got to say there. So, the agenda: we're going to skip the brief primer and cover it on the fly as we go. Normally this is a longer talk and I've shrunk it down. We're going to talk about logs, and then we're going to get into a real Kubernetes incident and walk through the flow of working that incident from start to finish. All right, so we skipped the slides. Perfect. So: the anatomy of an event.
So in Kubernetes there are all kinds of logs from all kinds of sources, and some of the most critical are the audit events. Those events are basically the who, the what, the where, everything but the why of what's happening on your Kubernetes cluster. Each one is a JSON blob, and over the next three slides we've got the whole event. We've got the request received timestamp, which is when the request came in; as you build your timeline of what's occurring and when, that field is critical. The requestURI is essentially which resource in the cluster is being modified.
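Those fields, plus a few more we'll hit in a second (the verb, the user, the source IP, the response code), look roughly like this. A minimal parsing sketch: the values here are invented for illustration, but the field names match real audit.k8s.io/v1 events.

```python
import json

# A trimmed example audit event (audit.k8s.io/v1). The values are made
# up, but these are the real field names you pivot on during an incident.
raw = '''{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "stage": "ResponseComplete",
  "requestURI": "/api/v1/namespaces/default/secrets",
  "verb": "list",
  "user": {"username": "system:anonymous"},
  "sourceIPs": ["10.0.0.12"],
  "userAgent": "kubectl/v1.29.0",
  "responseStatus": {"code": 200},
  "requestReceivedTimestamp": "2024-01-01T12:00:00.000000Z"
}'''

event = json.loads(raw)

# The "who, what, where" you build a timeline from:
who = event["user"]["username"]           # who made the request
what = event["verb"]                      # read (get/list) vs. modify (create/patch/delete)
where = event["requestURI"]               # which resource, in which namespace
when = event["requestReceivedTimestamp"]  # when the request arrived
source = event["sourceIPs"][0]            # where it came from

print(f"{when} {who} {what} {where} from {source} -> {event['responseStatus']['code']}")
```

One line per event like this is usually all you need to start laying requests out in order and spotting the first pivot point.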
There are all kinds of resource types. I think by default something like 66 get installed with Kubernetes just to make Kubernetes function, and then as the DevOps teams extend the cluster's capabilities, even more get added. The way the request URIs are structured is very important as well. This one says namespaces, default, secrets, and what we can read out of that is that this request is interacting with the Secrets in the default namespace. Namespaces in Kubernetes are an organizational structure; you can think of them as super, super lightweight security boundaries, but in reality they're just buckets to organize workloads.

Moving down the event, we have a response status code of 200. (Oh, the noise changed up here, and I was like, yeah.) Anything 200 is going to be success, and then you've got your typical web codes. The source IP is where the request came from; sometimes this will be an external IP, sometimes an IP internal to the cluster. And then the stage is ResponseComplete for this particular request. Whenever you see ResponseComplete, that means we're done. You'll see other stages like RequestReceived and things like that; I think there are four stages in total, but when you see complete, it's done, and there won't be any more follow-up updates.

Then we've got user.username. This is who is making the request. In this particular case, it's system:anonymous. If it's a service account, it'll say system:serviceaccount: and then a namespace; we'll get into that a little later. The userAgent is just like a standard web user agent. And then the verb. There are about six verbs in Kubernetes: list, create, get, delete, patch, and then one more that I always forget, but you don't see it very often. List and get are both read commands, so if you see a list or a get, that's just someone extracting information. Patch, create, and delete are all modifiers that change a resource on the cluster. And that's basically the anatomy of an event.

So, into the incident. You come in in the morning (you can't read this slide, that's fine) and you have an alert that basically says: critical, XMRig has been found running on your cluster. We see the command, we see the PID, and then the container with its container ID. Further down we have it broken out: the image is xmrig, the tag is latest, the container name is coredns, and then the pod name. As we start pulling on the thread of understanding what's happening in our cluster, this pod name is going to be critical, because it's basically the first bit of information we can start pulling on. The k8s namespace name at the bottom is also very important, because that's kind of like which bucket you need to start looking in.

And then just a quick flashback moment: this is the rule that detected our XMRig container. This is Falco, a Falco rule, and it's kind of cheesy. It's basically looking for any process named xmrig that's exec'ing. Like I said, a cheesy rule, for the demo. But Falco is free, and you can extend it and make it whatever you want it to be; it's basically an open-source EDR, is what it comes down to. And then we're also sucking in all of those audit events, plus events we haven't talked about yet, such as the actual execution events happening in the pods themselves. So, spoiler alert, we'll see some web server logs here in the future. We're bringing in those events as well, so any of the access logs and
things like that. Okay, back to it. So, here's our alert boiled down to the things we actually care about: XMRig, the pod name, in the kube-system namespace. If we do a search, and this is basically a LogScale query, this is what we'll search for: the audit verb create, because we're looking for a pod-creation event and we want to know where the pod came from, and then we just dump in the pod name we saw on the previous slide. We get this table back, and at the very bottom we've got a verb of
create, a response status code of 201, which means the pod was successfully created. We see the requestURI is namespaces, kube-system, pods. That all aligns with the alert we got. But then we see the username is system:serviceaccount:kube-system:daemon-set-controller. So in Kubernetes there's this concept of DaemonSets, and basically, when a DaemonSet gets deployed, you're telling Kubernetes that you want a pod deployed on every single node in the cluster. You can think of a node just like a virtual machine or an EC2 instance; it's that physical server, the compute capacity, basically. So a DaemonSet is saying:
deploy this pod on every single one. As we piece this back together, we're seeing that XMRig has essentially been deployed on every single node in the cluster at least once. So, pulling on the thread even more, we can now ask: where did this DaemonSet come from? A quick note about this particular service account and how we got to it being a DaemonSet: the daemon-set-controller service account is the controller service account responsible for managing DaemonSets. Kind of confusing, but it all plays together. The name is not necessarily unique, though, so keep in mind as you're doing incident response that someone malicious could
just typosquat the same name. But it's the only thread we have at the moment, so we have to pull on it. As we search for this, you may ask yourself: how do I know what that API call looks like? Because that's where we're going next; we're going to look at the requestURI for the DaemonSet creation. When you run kubectl with -v6, it will actually show you the API calls it's making on your behalf to the cluster. You probably can't see it in the screenshot here, but as we ran
kubectl create daemonset, or kubectl get daemonset, it gives you the full URI that you can look at. That way you know where to hunt as you pull on the threads. So, back to our LogScale query: we're still looking for the audit verb create, but we've modified the requestURI to now say, show me any new DaemonSet that's been created. And then we get this event back that says the username system:serviceaccount:kube-system:kube-admin created it, along with a JSON blob that we break out down here. So
we've got kind: DaemonSet, which aligns; we've got the name coredns, which aligns with the pod name we've been seeing; it's in the kube-system namespace; we have xmrig arguments down below; and then, aligning with the alert, we've got the xmrig:latest image. So at this point everything lines up. What we know is that in the kube-system namespace, a DaemonSet called coredns was created running XMRig, and it was created by the kube-system kube-admin service account. So we're pivoting now: we've gone from a pod, to a DaemonSet, to searching for activity from a particular service
account. So we ask: what else did this service account do? We pivot the search and say, show me all of the events where user.username is kube-admin, filter out some noise we don't care about, and we get all kinds of stuff back. We'll go through some of these; some we'll just end up skipping. The first one is really, really important, because it's a create event for a SelfSubjectRulesReview, which is essentially someone asking, what can I do? When you run kubectl auth can-i --list, you'll get an output like
this right here, which basically says: these are your permissions on the cluster. And that very first API call that we see, the SelfSubjectRulesReview, is the result of that. So we know that whoever got access to the service account, the very first thing they did was ask, what can I do? After that, we see a GET request checking whether that coredns DaemonSet exists; they get a 404 back. After a couple of other GET requests, we see another GET against that specific DaemonSet, and then they list the DaemonSets in kube-system. They're saying, show me all the DaemonSets in that particular namespace. Remember, the namespaces are those buckets. They get a 404 back because there are no DaemonSets at this point in time. And then on lines 6, 7, and 8, we see a pattern: a 404, a 200, and a 201. That is the behavior of someone using kubectl to create something. On line 8 we see the actual create happen. kubectl is smart enough that, if the resource already exists, it will issue a patch instead of a create. So the first thing it does is ask: does this resource exist? That way it knows whether to
do a create or try a patch. Malicious tools, if someone has it all scripted, will bypass this particular behavior; instead of checking existence like on lines 6 and 7, they'll jump straight to the create on line 8, and if the resource already exists, they'll just get an error back saying so. It doesn't hurt anything. But in this case it didn't exist; they got a 201 here. Okay, so what else did they do? The font is getting a little smaller. We see that they did a bunch of list commands and then some get commands. Essentially, what they're doing here is enumerating
secrets from the cluster. You'll see several calls to /api/v1, then a namespace (in this case, on line 3, it's argocd), and then secrets. What they're doing there is dumping secrets from the cluster. They first start by walking individual namespaces, and then by the time they get down to line 11, they kind of get tired of walking individual namespaces and the call changes. Now it's /api/v1/secrets, which basically means they're dumping every secret across every namespace in the cluster. From a responder's point of view, your investigation just got really, really big, because now you need to understand what those secrets are. A lot
of times, developers will stick secrets in their workloads or in their namespaces so their workloads can query those secrets; they'll be API credentials, things like that. So essentially, you've got to go in and unpack every secret to understand what it's for, so you can understand the entire scope of the incident, because you're going to have to rotate all of those; they've all been compromised. With kubectl, when an attacker makes a query, there is no difference between dumping all of the output and just querying for it. You can say kubectl get secrets --all-namespaces, and then you can add a -o yaml or -o json to dump
all that to your terminal. But no matter what, you're getting all that same data back; you're just filtering and telling the binary what you want to see. So anytime you see a call like this, just know that any data associated with that resource has been sent to the client that requested it. At this point, we know a lot of what's happened based on pivoting off that one particular service account, the kube-admin service account that exists and is doing weird, malicious things. It created XMRig. So we can pivot again. We can pivot off the IP that was used to make all those calls, it's this
17085 address, or we can try to figure out where the service account came from. In some cases, you're going to have to do both; it just depends on the attacker and how they've been operating. In this case, we'll pivot off the IP. So here we've got a search: basically, Kubernetes audit sourceIPs[0], then we dump in that address, and then a table command to make it look pretty for us. And we get a bunch of different activity back, including a new user account called tester-admin. At the very top we've got system:serviceaccount:playground, which means this particular
service account that was compromised, initially or at least in this next step, is from the playground namespace, and it's the tester-admin service account. The number one request there, again: SelfSubjectRulesReview. The first thing they did was essentially ask, what can I do? Then we see them doing some enumeration of a cluster role and a service account, and creating their new service account. So in Kubernetes, there are three things that have to happen for a service account to be able to query resources on a cluster. One, you need a service account. Two, you need a role that has permissions to do things. And
three, you need a binding, and that binding ties the role and the service account together so it's one happy family, basically. And that's the behavior we see here: they're going through and creating all of that. Then on lines 3 and 4, they're creating a token for that particular service account. On line 3, we see a token request for the kube-admin service account in the default namespace; they get a 404. They try it again, another 404, but then they realize: oh, I created that kube-admin service account in the kube-system namespace. And so you see the URI path modified on line 5 to the kube-system
namespace, then the request to generate the token, and they get a 201: they successfully have a new token. At this point, line 6 is the start of the behavior we've already investigated; they're asking "what can I do" again, just to make sure they have everything correct. So now we have another service account that did weird things. Where did this one come from? What is this particular service account associated with? We know it's in the playground namespace, so we can run kubectl get pods in the playground namespace, and change the output filter so that we see
the name of each pod as well as the service account associated with it. In this case, we get lucky: there's only one single pod that exists with that service account, and it says nginx-deployment. So we can pivot and say, show me the logs for this. We run kubectl logs against the pod, and we see, of course, it's a local file inclusion, through which they've been able to access the service account token and the local certificate for trust, because everybody loves certificates, and trust and certificates are what Kubernetes runs on. And then we see that the very first request is for the service
account's namespace; they're just asking which namespace this service account is associated with. So now we know: someone compromised a service account through NGINX, and then they used that to pivot. Actually, my next slide shows this. Great. So we've got an LFI, with the www-data user accessing the locally mounted credentials for a Kubernetes service account, the k8s tester-admin account. They pull all that information off so they can access the cluster at the API level. They enumerate permissions. They create persistence: they create a cluster role and a binding and all the things required to access the cluster. They pivot into that new identity they created. They enumerate permissions
and then they look for their existing mining DaemonSet. It's not there, so they create it. At that point, we get the alert from Falco saying, hey, XMRig was detected, and that's when we come in and start doing our investigation. But during all of that, the attacker has also gone in and listed namespaces, accessed various secrets on the cluster, accessed a ConfigMap at some point, and then dumped all the secrets as well. We didn't talk about ConfigMaps, but usually those are application configurations. In the world of NGINX, because Kubernetes is designed to run web servers, so that's what we're going to
do, the ConfigMap is basically going to be your nginx config. That's where developers will store it, and what that allows them to do is keep the pod very bare-bones: they don't have to have configuration stored within the manifest that creates the pod. They can modify just the ConfigMap, and when the next pod comes up, it pulls in the new config. But you'll also find sensitive information in ConfigMaps as well. Okay. So, other things we can do now that we kind of have a story. We could checkpoint the container image. We could go in and
say, hey, this XMRig, well, not the XMRig one, but the NGINX container, we could try to take a container snapshot of that. Checkpointing is very new and very touchy. Make sure you have it as a well-documented process, because chances are it's not going to work the way you think it will, or it's not going to play out very well. We can cordon off a node. For XMRig itself, since XMRig got deployed as a DaemonSet, we don't want to cordon those nodes off, because that's every node in the cluster: you'd just cordon everything off, and then Kubernetes, as it manages state,
will just spin up 12 more nodes for you, XMRig will be on all of those as well, and it becomes this never-ending process. But where the NGINX pod is, you can cordon that node off, which basically sticks it in a corner and says no new workloads can be scheduled here. This buys you time to breathe and to do deeper forensics, because very quickly, if you want to do deeper forensics, you're going to go to host-level forensics, since it's all just Linux under the hood, and you're going to start dumping memory the way you normally would in Linux incident response. You can also
isolate the pod. You can create what's called a NetworkPolicy, and you can think of network policies as essentially firewall rules that go around your pods. This one here basically says ingress and egress are nil, and that will snip all network traffic for your pod: nothing new in, nothing new out. So that isolates just the pod itself and, again, gives you time to breathe. So, additional after-actions. We can fix the security gaps, because obviously NGINX has a problem; we need to go figure out what that is and get it patched, or pull the
website down if that's possible. More than likely it's not, so we need to patch it or stick a WAF in front of it that stops local file inclusion attacks. We need to remove all of the artifacts: the kube-admin service account that was created, the cluster role, the role binding, the service account, all of that has to get removed, and we have to remove the DaemonSet. After that, this is pretty much cleaned up, because it's a pretty straightforward incident. Alternatively, you'll see attackers, instead of creating a DaemonSet that just runs XMRig, actually escape to the host. They'll
create a DaemonSet that does a host mount so they can access the underlying host filesystem, in which case they start doing things at the host level of the node itself. They'll create something like a Linux cron job or along those lines, and your scope and the amount of cleanup you're going to have to do is significant at that point, because now you're full-fledged into Linux persistence techniques that you've got to deal with, understand, and unpack. Okay, that was a very short version of this presentation.
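One footnote worth automating from the secrets walkthrough earlier: the shift from namespaced secret reads to a cluster-wide dump shows up cleanly in the requestURI. Here's a small sketch that classifies those URIs; the example URIs are hypothetical, but the path shapes follow the real core-API layout.

```python
# Classify secret-related audit requestURIs. A namespaced list
# (/api/v1/namespaces/<ns>/secrets) is an attacker walking one bucket;
# a bare /api/v1/secrets means every secret in the cluster went to the client.
def classify_secret_access(request_uri: str) -> str:
    # Drop any query string, then split the path into segments.
    parts = [p for p in request_uri.split("?")[0].split("/") if p]
    if parts[:2] != ["api", "v1"]:
        return "not-secrets"
    if parts[2:] == ["secrets"]:
        return "cluster-wide-dump"            # scope: every namespace
    if len(parts) == 5 and parts[2] == "namespaces" and parts[4] == "secrets":
        return f"namespaced-list:{parts[3]}"  # scope: one namespace
    if len(parts) == 6 and parts[2] == "namespaces" and parts[4] == "secrets":
        return f"single-secret:{parts[3]}/{parts[5]}"
    return "not-secrets"

# Example URIs, shaped like the calls we walked through in the incident:
for uri in [
    "/api/v1/namespaces/argocd/secrets",
    "/api/v1/namespaces/default/secrets/db-creds",
    "/api/v1/secrets",
]:
    print(uri, "->", classify_secret_access(uri))
```

Any hit on the cluster-wide shape is the "your investigation just got really big" moment: every secret in every namespace now has to be treated as compromised and rotated.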