
Hey folks, I'll keep this short because I know I'm the last thing between you and a beverage, so hopefully it won't be too bad. Before we get going, I also want to say: don't save questions for the end. Interrupt me, shout out. This was an hour-and-a-half thesis defense presentation that has been condensed into 20 minutes with AI-generated images, so let's make it fun and laid-back. I don't want it to be uptight or anything, so please ask questions and interrupt.

To get started, I want to make a quick point about what self-healing even means in terms of computing. If you hear the term self-healing, it's oftentimes in the context of the physical world, like materials science, and the concept there is usually a return to a base state: you damage something, and it returns to its original form, or a form that could be accepted as good. With computing, a lot of the time when you find something wrong, that base state is where the issue was: the vulnerability, the misconfiguration. So when I started researching what it means to have self-healing compute a few years ago, I caught on to the fact that it's not necessarily about bringing you back to a base state the way it is in materials science, but rather about making fixes to an already imperfect state to take you further forward, toward an improved state.

Now, of course, self-healing computers as a whole is a nice topic and very far-fetched, and you can sit around and talk about it for a while without ever actually getting anything done. We have to have some sort of target, something specific to talk about. So I started looking at Kubernetes: something easy, something very prolific in cloud computing, something everyone uses whether you know it or not. Then, looking at this even further: how do we frame this in a way where we actually know when something bad happens? You can scan for vulnerabilities and know that vulnerabilities exist, but maybe they aren't exploitable, maybe they're not what you should focus on, and it's definitely not going to get funding. So I took a look through the lens of incident response: right now this is a long, drawn-out process, so how do we actually make it better and faster, such that we can automate the incident response pipeline?

A quick introductory slide on who I am. I'm Ross, and I went to College of Charleston. I was on the cyber team here for a while, and then I went up to RIT in Western New York. That's where I did all of this research.
I have to thank them for being an awesome research institution; they truly pour into their students there, so I had a great time. As you can guess from the topic, I love Kubernetes, cloud computing, and containerization. I also love artificial intelligence, though less so artificial intelligence and more traditional machine learning, and you'll see that through this talk. I'm also a big supporter of free and open-source software, and I'll use that to say that everything I talk about here is my own opinion, and everything I talk about here is free and open source online. Please come talk to me afterwards if you want to contribute: there are two main projects that came out of this work, both are on GitHub, and I would love to have more maintainers helping me out.

To kick things off in the incident response realm: I use this diagram in every talk I've ever given and every paper I've ever written, because I think it perfectly illustrates what incident response looks like for a single cycle. We all know incident response isn't just one incident that you respond to once; it's cyclic in nature. As you can see here, at the very end, when you leave the funnel of fidelity, you come back out, redeploy, and
start collecting again. So, starting on the far left, we collect: we deploy sensors across all of our machines and across our network, and then we write some sort of detections on the data we're collecting. We then have to triage all of those detections down into alerts. Those alerts go into someone's queue to investigate, where you go through picking out what's a false positive and what's actually something worth taking further as a lead. Now that we have a lead, we have to come up with some sort of remediation for what happens next. How do we fix the way the attacker got in? How do we fix the thing that caused this outage? It might not have been an attacker; it could have been some sort of insider threat, not even on purpose.

Now, if you're trying to automate this pipeline, you might already be sitting here thinking to yourself: all of these things, if I'm working on a security team, I'm already doing. I'm already collecting with sensors. I'm already writing detections on the data collected by those sensors. And maybe we have some cool tooling on top of it to automate all of this. Some of that tooling, like SOAR (security orchestration, automation, and response), or XDR, the fancy new version of EDR: all of these things were marketed to us as "hey, you collect the data, we'll analyze it and automatically fix everything for you." The orchestration and response is built in; XDR uses AI and machine learning to automatically respond to threats and mitigate them. Or even if you have old-school EDR, the same thing was promised back in the golden ages. And now, since 2022, all of our CISOs are going to people like this who say, "Hey, look, if you don't have it fully automated, just buy our AI product, and our AI language models will fully automate this pipeline."

Now, you might have implemented a tool your CISO bought, and it ended up looking a little something like this: it hallucinated some sort of Python package, so it created a PR to my code repository; I committed it, because it was fully automated and I'm lazy; and all of a sudden our servers are all down, our website dropped, whatever it may be. Now, why is that? If we're looking at the whole pipeline, all of those tools are trying to automate the last three steps in one fell swoop, all in one tool. When I started in on this research, when I first got to RIT, I was told: hey, look, everyone in industry is doing this, but there's really no academic writing to back it up. What does it look like to actually implement one of these tools and have it be fully automated, no human in the loop? And I kind of realized: we're really good at detection. We can detect things on computers by deploying sensors and harvesting data really, really well. But as we move down the funnel of fidelity, we get less and less good at what we're trying to do, because we're lumping it all together and approximating a single function instead of the individual functions that each of these steps represents. So we first started with the question:
well, if we're good at detection, how can we be better at triaging? What can we do to actually automate the triage process? So myself and three other researchers implemented what we call autonomous cyber alert triage: A-CAT. It's a fun name; we like cats and kittens, so we thought it was pretty nifty. At the end of the day, it takes an alert queue as input. We built this specifically on top of Suricata, but the mathematics extends to anything else. The very first thing it does is cluster all of your alerts together. Now you might say: hey, if I'm just clustering alerts, what good is this? What am I actually getting out of these clusters? Well, what we're actually doing is clustering the alerts into attack paths, and these attack paths allow us to build kill chains.

Let's say, for example, we start on the far left. That's maybe where the attacker first got in, and alerts fired there. They then moved on to the next computer in your network, and that computer happened to have access to your code repository; for this example, the code repository is your crown jewels. Then maybe they pushed some sort of malicious logic to exfiltrate code through something like a CI runner to another computer in your network, and since that computer didn't have an internet-facing connection, the data was pushed further somewhere else. That is the attack path the attacker took to exfiltrate the data they originally wanted, because they almost never have access to the crown jewels to begin with: they're going to make noise as they make their way to your crown jewels and on their way out.

After we've created this graph representation of the attack paths attackers are taking, we know what an actual alert is, because we know an attack happened. If we just have a detection that isn't plugged into any graph whatsoever, we can probably guess it might be a false positive, or just noise in our data. But if we can link it together and see that it belongs to one of these attack paths, then we know that's actually an alert we should look into. Well, we then moved on to the next step, because we can't just say we're done here and these are alerts; that's still an ever-growing queue, and you're still going to need a human to manage it. So how do you actually go about investigating these leads, understanding that
every alert is not going to be an actual true positive. How do we filter out these false positives even further? We extended the system with a neural classifier. It's nothing exotic, just a simple deep neural network, but it lets us assign one of four possible labels to each step in an attack path: recon action, exploit action, exfiltrate action, or false, meaning something was included in your graph that doesn't actually belong to the attack path. So after we've generated each of our attack paths, we can layer on top what we think each of the steps actually represents.

Looking back at that example from before, we can say: okay, first they exploited this computer, and now they're taking recon action, so the alert that fired from this computer was maybe them scanning for other computers in the network. Then, after they found another computer, that's when they actually exploited your SCM, where you're storing your source code. Then they exfiltrated it out to this other computer, which also fired an alert, and from there it was exfiltrated to another host where it was passed along. And we can see that these other two alerts don't actually go with anything the attacker was really after, so we can exclude them from the final attack path and from the kill chain.

Now, the important piece is that being able to build these kill chains lets us essentially know what the robbers are going to do. You can predict an attacker's next step by looking at what typically happens as they chisel their way through your network. So now that you know, okay, you've pivoted through our network to exploit our SCM, and you've exfiltrated the data to a host that is not internet-facing, we can guess: there's only one internet-facing host that host is connected to, and they're probably going to try to shuffle the data over there. So maybe we need to apply some sort of preemptive remediation to that host to prevent them from actually being able to do anything, which lets us be like the police officers stopping the robbers before they even step outside the bank.

Now, this is really cool, but it still leaves the final piece of the pipeline: how do you actually remediate? We can know what an attacker is going to do next, but how do you actually stop them from
doing that next? So I built a tool I called RadKube, because, you know, it's a radical look at Kubernetes. You have a pod, a container, that's under attack, and we know this is where the attacker is going to pivot next. Well, what do we do? We can quarantine it. That's super simple, right? Now they can't pivot to this next container. But if you look at how this guy looks, he's kind of sad, and that's because customers can't use him either. What's the point of security if you don't have a business to secure? You still need that to be an active and running Kubernetes pod.

So I looked at it again and said: well, isolation isn't enough. We need to apply some sort of active remediation to the hosts, to the services running on those hosts, and to those applications. That's where I extended it with this new idea, KubeHealer. What it does is take dirty YAML files. Again, we're narrowing the scope; this is a framework, and we can apply it to a whole lot more, but what we're looking at first is Kubernetes configuration files: how your actual infrastructure is configured. We apply some sort of tool call against that insecure file, in this case a linter, layered with the intelligence we've gathered from the steps before. We then pass it through a language model to suggest edits, and finally, hopefully, you have a nice clean file. It's a loop you can run multiple times.

But at the end of the day, just throwing a language model at it, even though we've narrowed the scope to this final small piece, is not going to be enough. You need superhuman accuracy. You can't have something that applies changes and deploys them to prod when a human could do the same thing and break things at the same speed. The way we introduce extra accuracy into these models is two steps. First, we quantize the model down so it's smaller, easier to train with, and easier to train live on your company's data. Then we apply what's called a low-rank adapter (LoRA), which is just a super easy way to retrain a model to introduce new knowledge into it. So now we have accuracy introduced into the model we're using, but it still needs to outperform humans. Your typical language models will output 80 to 350 tokens a second running on consumer-grade hardware, which, if you type really, really fast, maybe you can keep up with. But we need to push even further.

The way we push it further is to not clean one Kubernetes configuration file at a time, but rather do it in batches, using the shared memory of your GPU to lint multiple files at once. Pull the tool call out of your agentic AI workflow: don't have your language model call a tool; rather, use a Python script before you even invoke your language model to generate the context you're going to feed into it. And then there's a nifty tool called vLLM, a shared-cache and shared-memory
scheme for your language models that can massively improve speed. So, have we successfully done it? We detect, we triage, we investigate, and we remediate, right? Well, not quite yet. We've still only done this for the actual infrastructure piece; we haven't done anything for application vulnerabilities or operating system vulnerabilities. Those are the next steps that researchers, and folks here, can continue to contribute to and work on.

Moving forward, there are a couple of lessons to take away about how we apply automation in a security context. First, let's learn from dynamic programming and introduce dynamic problem solving: don't only look at a problem as a whole, but break it down into smaller problems that are easier to solve with the tools we already have at our disposal. Second, let's not rely on the fuzzy logic of Bayesian systems, and don't trust the reasoning of a language model; implement some sort of actual decision matrix instead. Third, speed really does matter: if you're going to automate something, it had better be faster than the human doing it themselves. And finally, language models are not world models, and they're not humans. Use them as tools, not as replacements for a human; augment the human, and augment your pipelines, with them instead.

Here are just a couple more fun AI-generated images to illustrate: "I have a problem, fix it," passed straight to AI, doesn't really give you anything good. Instead, you want to break the problem down, with your tool call outside the language model, to understand the discrete issues to solve; solve those (you can even use language models there); and hopefully that leads to success. Same thing with fuzzy systems: accuracy drops the more of them you chain together. Speed matters: sometimes you can just use regexes, you don't have to use AI. And finally, there's a whole lot more machine learning algorithms to use beyond language models; look to apply some of these in your day-to-day automation pipelines instead, for maybe a higher-accuracy outcome or a lower compute cost.

There's a lot more reading; some of my favorite papers are linked here as well. I'll be sure to share these links out, or feel free to take a picture. Special thanks to my advisers, who helped with and funded this research. So, any questions?

>> Those links, can I type those links into my browser on my computer?
>> Yeah, yeah.
>> When I go home, like, I can type those links?
>> Yep, yeah. They're all open on the internet.
>> Right, and they all relate to...
>> Yeah. So the first two are papers. The first one is a paper that is way too long; I don't expect anyone to read it, but that's where this came from. The second one is the paper I talked about in the middle. The third and fourth are the two tools that are now released on GitHub, so feel free to check those out and play around with them. And following those are open-access papers that I personally love and pulled a lot of information from during this work.
>> What was a use case, or, more so, an application stack that you were using as a demo?
>> Yeah, so: Kubernetes configuration files. The dataset I tested this on is on GitHub; I published it as well. It's 300,000 Kubernetes configuration files, representing 300,000 of the most commonly deployed Kubernetes stacks I was able to scrape from GitHub. So if you set all your Helm charts on GitHub and you put your Kubernetes configuration file there with them, I scraped that config file, pulled it down, and that's what we ran to test against. Now, of course, that was just in virtualized environments. Actual Kubernetes deployments with this running, we only tested around 50, and those were the 50 most common ones.
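As a rough illustration of the lint-then-suggest loop described in the talk (this is a sketch, not the actual KubeHealer code; the lint rules, the `heal`/`lint_config` names, and the `call_model` stub are all hypothetical), the linter runs as a plain Python step before the model is ever invoked, and its findings are packed into the prompt context:

```python
# Sketch of a KubeHealer-style loop: lint first (outside the model),
# then hand the findings to a language model as context.
# Rule names and function signatures here are illustrative only.

def lint_config(manifest: dict) -> list[str]:
    """Tiny stand-in linter: flag a few risky pod-spec settings."""
    findings = []
    for container in manifest.get("spec", {}).get("containers", []):
        name = container.get("name", "<unnamed>")
        if container.get("securityContext", {}).get("privileged"):
            findings.append(f"{name}: privileged container")
        if "resources" not in container:
            findings.append(f"{name}: no resource limits set")
    return findings

def build_prompt(manifest: dict, findings: list[str]) -> str:
    """Assemble model context *before* invoking the model (no in-loop tool call)."""
    issues = "\n".join(f"- {f}" for f in findings)
    return f"Fix these issues in the manifest:\n{issues}\n\nManifest: {manifest}"

def heal(manifest: dict, call_model, max_rounds: int = 3) -> dict:
    """Run the lint -> suggest-edits loop until the linter comes back clean."""
    for _ in range(max_rounds):
        findings = lint_config(manifest)
        if not findings:
            break  # clean file, done
        manifest = call_model(build_prompt(manifest, findings), manifest)
    return manifest

# Stub "model" that just applies deterministic fixes, standing in for the LLM.
def stub_model(prompt: str, manifest: dict) -> dict:
    for container in manifest.get("spec", {}).get("containers", []):
        container.setdefault("resources", {"limits": {"cpu": "500m"}})
        container.pop("securityContext", None)
    return manifest

pod = {"kind": "Pod", "spec": {"containers": [
    {"name": "web", "securityContext": {"privileged": True}}]}}
print(lint_config(pod))     # two findings before healing
healed = heal(pod, stub_model)
print(lint_config(healed))  # [] once the loop converges
```

In the real pipeline the manifest would come from parsed YAML and the linter would carry the intelligence gathered upstream; the point of the structure is that the tool call happens before the model sees anything, as the talk recommends.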
>> Awesome. Thank you all.
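The alert-clustering idea from the A-CAT discussion earlier can be sketched at a very high level: treat each alert as an edge between two hosts, and take the connected components of that graph as candidate attack paths. This is only an illustration of the concept, not the paper's actual mathematics; the alert field names are assumptions:

```python
# Toy illustration of clustering alerts into attack paths: each alert links a
# source host to a destination host, and the connected components of that
# graph become candidate attack paths (isolated blips are likely noise).
from collections import defaultdict

def attack_paths(alerts: list[dict]) -> list[set[str]]:
    """Group alerts into connected components over the hosts they touch."""
    adj = defaultdict(set)
    for a in alerts:
        adj[a["src"]].add(a["dst"])
        adj[a["dst"]].add(a["src"])
    seen, components = set(), []
    for host in adj:
        if host in seen:
            continue
        stack, comp = [host], set()
        while stack:  # simple DFS over the alert graph
            h = stack.pop()
            if h in comp:
                continue
            comp.add(h)
            stack.extend(adj[h] - comp)
        seen |= comp
        components.append(comp)
    return components

alerts = [
    {"src": "laptop", "dst": "build-server"},    # initial foothold
    {"src": "build-server", "dst": "scm"},       # pivot to the crown jewels
    {"src": "printer", "dst": "printer-queue"},  # unrelated noise
]
paths = attack_paths(alerts)
print(len(paths))  # 2: one multi-hop path, one isolated blip
```

A detection that lands in its own tiny component is the "not plugged into any graph" case from the talk, a likely false positive, while a multi-hop component is the kind of path the classifier would then label step by step.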