Incident Response in containerized and ephemeral environments

Name: Incident Response in containerized and ephemeral environments
Uploaded: 2023-01-31
Duration: 40 min 35 s

BSides Charleston · 2022 · 40:35 · 61 views · Published 2023-01

Speakers

David Mitchell, Adrian Wood

Tags

Category: Technical

Topic: Container Security, Detection Engineering, DFIR

Team: Blue

Style: Talk

Mentioned in this talk

Tools used

AppArmor, iptables, KeePass, SELinux, SentinelOne

About this talk

David Mitchell and Adrian Wood examine incident response challenges in containerized and ephemeral environments, covering preparation strategies, containment and quarantine techniques, and advanced defensive tools like eBPF and machine learning for detecting anomalous behavior. The talk explores practical IR workflows in Kubernetes, volatile artifact collection, and emerging threats such as eBPF-based malware.

Original YouTube description

Security BSides 2022 Folly Beach, SC November 19, 2022 @BSidesCHS Title: Incident Response in containerized and ephemeral environments Speaker: David Mitchell and Adrian Wood

Transcript [en]

containers and making sure that it doesn't exhaust the resources in a fleet it moves them around and gives them certain barriers and parameters of what they're allowed to do and once again we're talking about your infrastructure being cattle not pets so if something goes wrong you just get another one so all of these things introduce a number of complexities primarily from an IR perspective tracing what happened in a container all the way up through the stack back through like a kubernetes infrastructure or something like that is a real headache it's a very difficult thing to do and manage and you're also increasing the number of layers of identities that you have to manage in your

environment and so the configuration of those identities and just tracing who did what where and when just gets a lot harder you're talking about a massive increase in logging for those who were in our workshop the lab that you stood up an empty kubernetes cluster with full logging enabled is generating three megabytes of logs a second it comes out of the box with hundreds of secrets already in it so keeping track of logging in a production environment is incredibly expensive and difficult to the point where many companies don't actually log everything like people tell them to just because it's not affordable you have a change in your attack surface
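
To put the three megabytes a second figure in perspective, here is a rough back-of-envelope sketch. The only input taken from the talk is the 3 MB/s rate; the decimal units (1 GB = 1000 MB), the 30-day month, and the function name are assumptions made for illustration.

```python
# Back-of-envelope log volume for the 3 MB/s figure quoted in the talk.
MB_PER_SECOND = 3  # from the talk: an empty cluster with full logging enabled


def log_volume_gb(mb_per_second: float, days: int = 1) -> float:
    """Return the log volume in decimal gigabytes over the given number of days."""
    seconds = days * 24 * 60 * 60
    return mb_per_second * seconds / 1000  # assumption: 1 GB = 1000 MB


daily_gb = log_volume_gb(MB_PER_SECOND)
monthly_gb = log_volume_gb(MB_PER_SECOND, days=30)
print(f"{daily_gb:.1f} GB/day, {monthly_gb / 1000:.2f} TB per 30 days")
```

That works out to roughly 259 GB a day, or a little under 8 TB over 30 days, before a single real workload is scheduled.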

I'm not going to say if it's better or worse it's just different and different always means starting over with trying to you know go through the process of finding everything that is wrong you've got a lot of people who are migrating to these ephemeral environments migrating to clouds doing hybrid clouds you know most of the time that's being done by people who haven't done it before they're learning so you know stuff happens identity management features again because it is a real pain and ephemeral instrumentation what we're talking about there is if you're running your business on shared hardware there are certain things from an IR perspective that are not possible to do because you can't snapshot a tenant that

you share with other people so let's cover a bit about preparation for an issue when you've got like hackers or some kind of malware in these containerized and ephemeral environments so preparation comes down to two main things we're talking about prevention that's sort of like your vulnerability management side of things but also your collection how you get ready to deal with these kinds of problems and there's a lot of steps to prevention but we're just going to cover a couple of the pink ones the first thing that generally speaking people don't do is create an incident response project within their cloud provider which has pre-configured sets of rules for containment within that environment so that this project space

has the capability to pull any container or any host out of another part of the business and pull it over here so that your forensics people can stare at it and do things to it or just see what's gone wrong most people don't have that in advance and when you don't have that in advance then you have to do it on the fly which is difficult and a pain because you probably don't have permission CI/CD controls are really important if you are using your live environment to detect issues where attackers are trying to enumerate permissions like seeing if they can mount the host file system or trying to

start a privileged process those look the same as when developers are figuring out how to do kubernetes so you're going to have thousands of log events a day of people trying to do a thing while they learn but then you can't tell the difference between them and an attacker um David would you like to cover sandboxing and quarantining yeah um I just don't know about that so yeah the idea with these controls is just make sure the developers are blocked before it gets into your environment so you don't have those alerts firing um the sandboxing and quarantine pattern is basically the idea of isolating your processes so you already have containerization you

already have things isolated um you want to have sort of what he's talking about having that project you want to have something where you can label an environment label a workload say this is quarantined and make sure that no automated processes or other people come in and touch this workload because it's under investigation think of it like putting crime scene tape up around a crime scene obviously so the idea here is there's a set of things you want to put in place we'll cover later to make sure that that stuff doesn't get touched yeah one of the beauties of kubernetes is you can in advance through

preparation leverage a lot of automation in the IR process you can literally type a one-line command to quarantine a host and have kubernetes automatically go through a bunch of steps to move workloads around and make sure that other pods or containers aren't issued to that node that you're having problems with so you're not having your workloads run next to a machine that you think might be infected security tools on the host is really important sometimes you're seeing people that are running their security tools in the container with the workload which means that you're giving the attacker a lot of opportunities to tamper with that and you're also giving the attacker a lot of

opportunities to better understand what's happening in your environment and work around that so you really want to make sure that your security tools are up one layer out of the container looking in now the problem when you do that is you've now got lower fidelity in what your alerting looks like because it's outside of the container so a lot of traditional security tools really tend to fall over here because they're giving you low fidelity information and you can't tell what's going on very well there's some new technologies like eBPF that we'll cover in a moment that really increase the level of visibility you get whilst keeping your tools further

away from attackers the incident response process hasn't really changed in its fundamentals it's just that the processes you need to go through look different you still need logs you still need live info and you still need disk info but the process of getting that is just a little different so in many of these cloud or ephemeral environments you have some new logging sources that weren't available to you before most of them are really well structured really easy to read and pretty good data sources but they can be kind of expensive to collect so you need to make some decisions about some of these sources and what you want from where because you really want to

drop down on the amount of duplication of the same information and with that said it's like you don't want to ignore your application logs just because you can burn down the application and start over a thing you hear a lot of people say about ephemeral environments is oh if it gets hacked I'll just burn it down and spin up a new one and then it's like no harm no foul well the attacker will just come back through whatever mechanism got them there in the first place likely they'll just keep repeating that until they hit their end goals so you really don't want to just like neglect those sources of like data that you think you don't need

anymore so collecting live info in these kinds of environments is certainly more difficult than it used to be and I think the main takeaway from this slide is you need to really be prepared that the reality is that you're going to have more than one container infected at once it's probably going to be multiple for a variety of reasons and being prepared to take live info snapshots from multiple infected places at once is going to be key generally speaking one of the tools that helps you with this in these kinds of ephemeral environments is called a container sidecar this is basically a container that lives slightly outside of your workload container that say all the

network traffic passes through and it gives you a place to log and collect all that information the issue is that at many places by the time the security team is brought in and asked hey what do you need from these hosts to do your job and you say oh I need a sidecar that does XYZ the business turns around and says we're already running three sidecars we can't add a fourth because latency because cost because complexity and then now you have to figure something else out so involvement and preparedness early on is going to be key so with disk collection generally speaking this is a lot easier in cloud environments because you

can just literally type a one-liner and snapshot a disk in like two seconds with containers it's a little different but the underlying file systems are going to be available through the provider's snapshot and snapping a disk in a regular environment you know there's no change there what I would say is you want to make sure that your snapshots you're going to be doing incident response in an environment you're not familiar with it could be an application that's been affected at your business you've never even heard of because you have 3,000 of them you're going to need to be able to diff the bad one against the good

one or the known good one to just narrow down the amount of stuff you have to look at so if your plan is to do snapshots across your fleet coming back to my earlier point about having an IR project that has permissions across the project space or across the fleet you know in AWS or gcloud or whatever it is do you actually have permissions from a well-protected account to snapshot some host that you're worried about somewhere in like the retail side of the business right or are you gonna have to go talk to seven different people in order to do that that's not going to be good enough when the time comes
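
The idea of diffing a suspect snapshot against a known good one can be sketched roughly as below. This is a minimal illustration that assumes both snapshots are already mounted read-only as ordinary directories; the function names are made up for this example, and a real workflow would also compare ownership, permissions and timestamps, not just content hashes.

```python
import hashlib
from pathlib import Path


def hash_tree(root: str) -> dict[str, str]:
    """Map each regular file under root (by relative path) to its SHA-256 digest."""
    base = Path(root)
    digests = {}
    for path in sorted(base.rglob("*")):
        # Skip symlinks so a link pointing outside the snapshot is never followed.
        if path.is_symlink() or not path.is_file():
            continue
        # read_bytes() loads the whole file; fine for a sketch, not for huge files.
        digests[path.relative_to(base).as_posix()] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def diff_trees(known_good: dict[str, str], suspect: dict[str, str]) -> dict[str, list[str]]:
    """Return the files added, removed and changed in the suspect tree."""
    good_paths, suspect_paths = set(known_good), set(suspect)
    return {
        "added": sorted(suspect_paths - good_paths),
        "removed": sorted(good_paths - suspect_paths),
        "changed": sorted(p for p in good_paths & suspect_paths if known_good[p] != suspect[p]),
    }
```

Calling `diff_trees(hash_tree(good_mount), hash_tree(suspect_mount))`, where both mount paths are placeholders, leaves only the files that differ, which is usually a much shorter list to review than the whole file system.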

and definitely worth noting that any incident response tooling or process is like hella powerful so make sure that those accounts are well protected and audited themselves so that you aren't the cause of the problem here's a visualization of what I'm talking about here so in Google Cloud or AWS this is the typical structure of what the project spaces look like each line of business or project within your company is going to have different project buckets it's mainly done this way for billing purposes so we can see if the factory is spending a lot more money than the retail side or if the IT team is spending too much money but within this you'll have a forensic space

that'll have permissions to reach out and touch things in other places and pull them in so a quick note on kubernetes logs so here's an abstraction of a very basic kubernetes cluster with one node you know a normal business might have you know 10,000 nodes but each node has these logging components and you can see there your container runtime and your pod logging these are additional logs on top of what you were collecting before and all of this is being logged through a DaemonSet to the control plane now a quick note on DaemonSets that's really really neat is a DaemonSet is set in

like the manifest file that you deploy with kubernetes which is the list of instructions of what you want kubernetes to do and where and this DaemonSet basically says that every resource needs to have these minimum things installed on it so you can use this as a CI/CD control where you can say in GitHub or whatever you use as a CI runner every single thing in my business better have a manifest file where a DaemonSet is listed for our SIEM like Splunk or you know whatever tool you use for doing that that means some application you've never even heard of within the business won't be able to deploy unless they meet your minimum requirements
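
A CI gate along these lines could look something like the sketch below. It assumes the manifests have already been parsed into Python dictionaries (for example from YAML by an earlier pipeline step), and it also rejects the privileged containers and host file system mounts mentioned earlier in the talk. The function names and the choice to identify the required DaemonSet by name are assumptions for illustration; a production check would more likely verify the image or a signature, and would also need to handle CronJobs, which nest the pod spec one level deeper.

```python
def pod_spec(manifest: dict) -> dict:
    """Return the pod spec of a Pod, or of a templated workload such as a Deployment."""
    spec = manifest.get("spec") or {}
    if manifest.get("kind") == "Pod":
        return spec
    return (spec.get("template") or {}).get("spec") or {}


def gate_violations(manifests: list, required_daemonset: str) -> list:
    """List the reasons a set of manifests should be blocked before deployment."""
    problems = []
    daemonsets = {
        (m.get("metadata") or {}).get("name")
        for m in manifests
        if m.get("kind") == "DaemonSet"
    }
    if required_daemonset not in daemonsets:
        problems.append(f"missing required DaemonSet: {required_daemonset}")

    for m in manifests:
        name = (m.get("metadata") or {}).get("name", "<unnamed>")
        # The logging DaemonSet itself usually needs host access, so it is exempt here.
        if m.get("kind") == "DaemonSet" and name == required_daemonset:
            continue
        spec = pod_spec(m)
        for container in spec.get("containers") or []:
            security = container.get("securityContext") or {}
            if security.get("privileged"):
                problems.append(f"{name}: container {container.get('name')} is privileged")
        for volume in spec.get("volumes") or []:
            if "hostPath" in volume:
                path = (volume.get("hostPath") or {}).get("path")
                problems.append(f"{name}: volume {volume.get('name')} mounts hostPath {path}")
    return problems
```

An empty list means the deployment may proceed; anything else fails the pipeline, so a developer who is still learning gets a denial in CI instead of generating alerts in the live environment.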

and that's going to be really really important and easy and fun so audit logging in kubernetes you see advice all the time from you know everyone that's just like turn on kubernetes audit logging if you go to your business right now and turn on kubernetes audit logging full audit logging you'll probably send the place broke by the end of the month it's extremely expensive you need to make really careful decisions based upon the logging you're getting elsewhere about what you can afford to get out of kubernetes logs when you're running like full request response logging like I said an empty cluster is generating three megs of logs every second so a very large business might be running you know 50 million

dollars worth of kubernetes logs a month like that would be pretty much par for the course when we talk about what these logs in kubernetes look like it's really nice human readable JSON and all kubernetes is doing on the back end to do things is making a bunch of HTTP web requests so you can quickly look at this and see what is happening with a you know like a create command you can see the who what when where why so you can see why everyone says turn on full audit logging because all the information is right there and it's super easy but yeah you're going to want to balance this against joining logs from different sources in order to save

money unfortunately so container forensics you know you've used your logging and you know you've got a problem unfortunately we can't just burn it down and start over we do need to be able to tell a story about what happened it won't be good enough to just say yeah we just burnt that down we don't really know what happened and we're starting over that's probably not gonna you know cut it so some general notes don't log into the container that you think is infected you can stay on the host machine but really you want to rely on your tools don't literally log into the container ideally you should never SSH into anything right yeah or anything yeah
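
For the human readable JSON audit events mentioned a moment ago, pulling out the who, what, when and where might look like the sketch below. The field names follow the Kubernetes `audit.k8s.io/v1` Event format; the sample event itself (user, namespace, pod name, address, timestamp) is invented for illustration and is not taken from the talk.

```python
import json

# Made-up sample event, for illustration only.
SAMPLE_EVENT = """
{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "Metadata",
  "stage": "ResponseComplete",
  "verb": "create",
  "requestURI": "/api/v1/namespaces/retail/pods",
  "user": {"username": "dev-alice", "groups": ["system:authenticated"]},
  "sourceIPs": ["10.0.0.12"],
  "objectRef": {"resource": "pods", "namespace": "retail", "name": "checkout-7c9d"},
  "responseStatus": {"code": 201},
  "requestReceivedTimestamp": "2022-11-19T15:04:05.000000Z"
}
"""


def summarize(event_json: str) -> dict:
    """Reduce one audit event to the fields an investigator usually asks for first."""
    event = json.loads(event_json)
    ref = event.get("objectRef") or {}
    return {
        "who": (event.get("user") or {}).get("username"),
        "what": f"{event.get('verb')} {ref.get('resource')}/{ref.get('name')}",
        "where": ref.get("namespace"),
        "when": event.get("requestReceivedTimestamp"),
        "from": event.get("sourceIPs") or [],
        "status": (event.get("responseStatus") or {}).get("code"),
    }


print(summarize(SAMPLE_EVENT))
```

Keeping only a summary like this per event is one possible way to trade detail for storage cost, at the price of losing the full request and response bodies.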

ideally yeah so some of the strategies Dave yeah so when you're dealing with incident response you sort of have to have a plan ahead of time you've got your playbooks and so what we have here is sort of the logic or the reasoning behind the four main reactions to an event the first one is isolating that's the one we're talking about that's your best case scenario that's where you take a workload that's infected you move it somewhere or you isolate it in such a way that it can't talk to other workloads and you're able to take your time look at what the attacker is

doing find out what they're doing from there figure out what else they might be doing elsewhere in your environment so that's your ideal that's what you want to do but there's cases where you may end up doing the other things the other one is pausing this is where you stop the workload but you don't destroy it that's the next best case scenario the third one is probably technically the laziest which is just to restart the workload but you may end up getting overruled by somebody in the business who says I just want things back up and running and you don't

have the choice of stopping things you have to just get the business back up and running so you sort of need to be prepared but even then you need to have a game plan for collecting the forensics trying to keep the attacker cordoned in and not spread throughout your environment the last one is killing everything killing your workloads this is basically the nightmare scenario where you're trying to deal with something where the attacker's like leaking data you have information compromise that kind of thing so I'll talk through these in a little more detail when you talk about isolation there's a couple of

strategies for this kubernetes has a concept called cordon which I'll talk about in a minute but the idea is the first thing you want to do and this applies to VMs this applies to kubernetes as well is apply a label to your environment maybe it's a label you don't want to be obvious to the attacker like oh I'm under investigation it's not something you want the attacker to see but it's something that would immediately say okay do not stop or shut down or move or reschedule this workload keep it here keep it running and then based on that you might have some other automated processes that follow the first

thing you want to do is if there's any security credentials tied to the pod or the workloads in there you want to immediately revoke their access maybe keep them around but change the roles revoke access to keep the attacker in so they can't use those credentials elsewhere you can also create network policy rules to block ingress or egress especially to other portions of your environment like you don't want pod to pod communication going on that kind of thing and then of course cordoning the node and draining other workloads off of it so if you can't move it to

a project what you do is you make the node that the workload is running on effectively your crime scene you get everyone else out and you you know start examining the body more or less the other thing you want to do is start capturing any volatile artifacts ASAP this is basically what's in memory if you have logs where the retention policies because of expense are very short-lived you want to go ahead and start snapshotting those within a certain window okay great um thank you uh so yeah this is just a quick demo of

like doing a kubernetes cordon so yeah the idea behind cordoning is you're just telling the scheduler inside of kubernetes do not put any more workloads on this worker node and once that's done then you can start draining other workloads off the ones that aren't infected or compromised so yeah sorry they're not compromised um next one the next strategy is pausing your workloads now if you're taking a snapshot generally the processes get paused anyway while you know capturing memory that kind of thing the other thing you might want to do this for is if you have a workload like a crypto miner that's consuming a lot

of resources that's where you want to pause a workload or at the very least throttle it so it's not running up your bill but yeah it's possible you may not want to do this because it may tip off an attacker that you're on to them again this is the one we talked about the one you don't want to do is restarting it it won't fix your problems the attacker's just going to come back the only reason you may want to be doing a restart is you're actually rolling out a fix like you're redeploying a new version of your application to fix the problem that allowed them in in the first place
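
The isolation steps described above (label the workload, block its network traffic, cordon the node) can be prepared ahead of time as plain Kubernetes objects. The sketch below only builds those objects as JSON; it does not talk to a cluster. The label `ir-state: hold` and the policy name are hypothetical, chosen to be unobvious as the talk suggests, and a NetworkPolicy is only enforced when the cluster's network plugin supports it.

```python
import json

# Hypothetical quarantine label; deliberately vague so it does not tip off an attacker.
QUARANTINE_LABEL = {"ir-state": "hold"}


def deny_all_policy(namespace: str, labels: dict = QUARANTINE_LABEL) -> dict:
    """NetworkPolicy that blocks all ingress and egress for pods carrying the label.

    Listing both policy types without any allow rules denies traffic in both directions.
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "ir-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": dict(labels)},
            "policyTypes": ["Ingress", "Egress"],
        },
    }


def cordon_patch() -> dict:
    """Patch body that marks a node unschedulable, which is what a cordon does."""
    return {"spec": {"unschedulable": True}}


print(json.dumps(deny_all_policy("retail"), indent=2))
print(json.dumps(cordon_patch()))
```

Having these generated and reviewed in advance means the response on the day is applying known objects rather than writing policy under pressure.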

and then of course this is the final option your last resort option which is just to kill the processes now a quick note here is that when you do something called a docker stop it sends the process a nice kill signal well this may tip off the attacker that the process is being killed and they may have something automated to stay in your environment or whatever you're tipping it off so if you're doing a kill you want to do the hard kill basically the kill -9 where it just kills the process right away this is the thing you would do if you see data exfil or they've

compromised your information they're changing the information in your systems um so let's talk a little bit about some fancy detection technologies that have come along to help simplify the process of seeing what's happening inside some of these ephemeral workloads where things have been abstracted and it's getting a little bit more difficult so eBPF was introduced into the Linux kernel in like December 2014 and it lets you extend the kernel that's the operating system without having to literally own maintenance of the operating system which is something that most businesses do not want to do it's extremely involved and time consuming and breaks a lot it's very expensive so this was introduced as a way to

get kernel level observability and control without having to clear that huge hurdle of maintaining a kernel module so how eBPF works is it's kind of like if you are sitting in a restaurant and you want to know what's happening in the kitchen you know like in the operating system of the place you're not allowed to interact with the stove or anything else inside the kitchen of the restaurant you don't work there you're not allowed with eBPF you're basically putting an observing chef in the kitchen who can touch everything in the kitchen open up and see what's going on and react appropriately and then let you know

like a health inspector yeah like a health inspector so what is happening here is we can use eBPF to monitor anything that's happening within the kernel that could be system calls or network events or any call to any kind of piece of hardware or anything like that and we can report bits of that information that we choose back into user space like back into the normal place where users live um this technology is being used for rootkits and malware as well and so when you're talking about eBPF malware you're talking about some of the most resilient and stealthy pieces of malware that you can possibly imagine um I have some information about a

particular instance of this that happened over the course of like eight years undetected but if you've ever used tcpdump you've used a tool called BPF under the hood basically and with eBPF we've taken the concept of tcpdump and just applied it to every single thing within the operating system so when you make a little observing chef to put in the kitchen what you're basically doing is creating a list of hooks for things that might happen within the system that you care about and saying when you see this event or this number of events or this type of event let me know so eBPF gives you really really deep and

detailed views into what is happening within the operating system and because attackers can use eBPF too it means that if an attacker is using eBPF and you are not they have better control of your computer than you do because if you are using a little program like ps to see a list of running processes well if the attacker is running eBPF malware they own whatever response that ps decides to give so you can't fight eBPF malware without eBPF and because of the amount of really really detailed viewing you get it really is one of the best ways to get a handle on what's happening within your containers

um in this gif here let me see which one is this okay what we've got here is a gif of an open source tool called Tracee which is kind of like an open source EDR you could think of but for containerized workloads and various things of that nature and all of its rules and detections are eBPF based meaning that they can't be tricked by really really fancy rootkits and you also get really really high fidelity alerting and detection that's very difficult for attackers to get around an example would be a lot of tools like osquery and different things like that that are say pulling the bash history off a machine

and then looking for times where someone catted some secret file could just be avoided by using a space in front of your command or a little bit of obfuscation or you know calling it through another program first with these kinds of things because you're really looking for a call made from the kernel you can't trick that right because at the end of the day all of those abstractions are doing system calls under the hood oh wait just to add a note on eBPF we talked earlier about having sidecar pods or systems for monitoring and also having to have security tools running inside your kubernetes environment with like

elevated permissions eBPF because it's running outside of it saves you that trouble you don't have to worry about giving some service account that the attacker might target or exploit because it's in the kernel it's going to see everything from outside of kubernetes at the host level one really cool thing about eBPF is because you're talking about well-structured data there are a number of machine learning approaches or techniques that make quite a lot of sense to leverage alongside eBPF from a detection point of view a research point of view and like a reversing point of view you see a lot of strengths with behavioral profiling and anomaly detection um and I've been using eBPF in some

reversing related things to monitor processes and I'll give you an example of that shortly so we can use eBPF alongside machine learning tools in this example we're just using TensorFlow to look for anomalies in the behavior of a process so this could be used in a defensive capacity or a research capacity in the example we've taken a baseline of the behavior of a password tool called KeePass so for 30 minutes we use KeePass normally and we just log all of the system calls that it makes into a histogram which is basically just a CSV and what happens then is we train a model on that data it's just completely

just literally take the data as it is and train a model and then we start messing around with KeePass doing things like you know trying lots of wrong passwords consecutively and you can see after the learning step we save all of these samples and we do our malicious activity and the tool will start reporting to us any system calls that KeePass makes that it wasn't making before so there's a lot of use cases for this as you can imagine if you're trying to find errors or weird behavior you know you could pair this with something like AppArmor or SELinux to refine your rules to get yourself like a nice runtime application security

protection happening um I use this also against EDRs the place I work uses you know an EDR something like you know CrowdStrike or SentinelOne or something like that you can actually just monitor the EDR process while you build malware and since you can't see inside you know the EDR is proprietary software but if you're monitoring its system calls you can be aware of when it starts opening a lot of sockets and phoning home to find out what to do or when it starts trying to read things about your files you can literally see the uptick in system activity as an anomaly and then you know if you're on the right track or not with

like hiding your malware so eBPF malware as I mentioned yeah one thing to note too is when you buy machine learning products they're going to have their stuff trained on like other environments you're going to get more value out of training your models on your own environment first yeah yeah thank you so eBPF malware is a really really exciting space so people make the argument that eBPF malware isn't that new or exciting because it's not that different to a ROP chain but anyone who's ever used a ROP chain to try and maintain persistence at a business is probably never particularly shocked when they come to use their beacon one day and it's just

not there right but with eBPF malware you have something that's embedded into the operating system it's passed static analyzer checks it's now like part of the computer um it's extremely stable it's fileless and very very sneaky it only wakes up in order to react to something that happened so it's not just actively burning all the time in around 2014 Equation Group started using BPF malware the piece of malware is called Bvp47 if you want to look it up I have nine dots on the map but there actually should be 45 dots um I just got tired of adding dots but this malware actually never got detected in the wild it was only discovered and

fully uncovered by Pangu Lab this year as a result of secret keys that were leaked as part of the Shadow Brokers leak a few years ago so even with all of that leg up it was still so stealthy that it took a very long time to uncover and basically how it worked was it monitored the TCP drivers of the system and how it received its commands and its instructions was the attacker would send a TCP packet to initiate a handshake that was actually out of spec they added a data frame outside of you know the initial handshake and what that meant was the driver would

say oh hey I've received this out of spec thing what do I do that exception woke up the BPF malware it read the commands executed them and then you know phoned home if it needed to phone home and then it just went back to sleep again until it received another out of spec packet now if you're using iptables rules to protect and close ports well the network driver is still receiving those packets even to a closed port meaning that BPF malware can listen on a closed port so I hope that you found some useful preparation steps within this talk and you can leverage those in the future we have labs and resources all the labs that we did last night and at DEF CON are

available at that URL this talk will go up there shortly too but I believe it's being recorded anyway but you can use that to gather any additional resources you need yeah and then to cover the last slide the main thing we want people to get is your first time doing incident response should not be in a live event there's lots of preparation and consideration that needs to happen ahead of time especially in terms of preparing your environment for the event and having things like the labeling automated processes and stuff like that set up so that you can react very quickly because the workloads are

ephemeral they're not going to sit around so long so the evidence may be destroyed by the time you're even able to react to it uh we can take a couple of questions I have 50 questions go ahead um CI/CD so I was on the other side of people stack a lot of containers and all that um so number one are there any good tools to scan your containers before you get them to production like what would you recommend and number two how do you monitor CI/CD because I would run my CI 50 times to see something accomplished and deploy containers and then you know how can you monitor that

and say it's not malicious it's a developer yeah so the first question is what are some tools that I can use in my CI pipeline to you know basically do kind of like vulnerability scanning on the containers or pods I'm going to deploy there's a few the first one that comes to mind is Semgrep it has a paid and open source version and it has out of the box on GitHub a number of rules of like safe container behavior and a number of like patterns and anti-patterns that you can use for that there's also a number of other products that are escaping my thoughts right now I know that there's a security tool in

GitHub that does it as well yeah Snyk uh S-N-Y-K is another really good one with good sets of container rules the second question is how do I within my CI pipeline tell when a developer is the one behind the like malicious attempt and not an attacker yeah exactly because they could be deploying from CI to pre-prod or something yeah I mean many times per day because they're testing yeah yeah so the answer to that is in Kubernetes you can usually set up uh like webhooks or they have like sort of an admission controller so it's like when you're trying to apply a pod to an environment it'll do checks on the pod spec to say are you defining

resources and the other things you can check are like are you mounting host volumes are you trying to uh make it a privileged pod which means the containers inside will be running with those privileged permissions um you can have those things set in place ahead of time so that the developer will get denied before deployment and then all your detections will only fire if it gets deployed it's also a question of provenance your question is like is the person pushing code the person pushing code so that really comes down to like what kind of authentication mechanisms are involved and what signing is involved in pushing code like are you using WebAuthn and things like YubiKeys in order to authenticate to

your Git repos in order to push code and is there any signing on top of that right like these are the kind of things that help you like validate that the person doing the thing is that person thank you any other questions
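
[Editor's sketch] The admission checks described in the answer above (are resources defined, are host volumes mounted, is the pod privileged) can be illustrated with a few lines of Python. This is only a minimal sketch of the decision logic, not the speakers' tooling and not a complete admission webhook. The function name admission_violations is hypothetical. The manifest fields it reads (spec.volumes[].hostPath, spec.containers[].securityContext.privileged, spec.containers[].resources.limits) are standard Kubernetes pod fields. It assumes the pod arrives as an already-parsed dictionary and, for brevity, ignores initContainers.

```python
from typing import Any, Dict, List


def admission_violations(pod: Dict[str, Any]) -> List[str]:
    """Return reasons to deny a pod manifest; an empty list means admit."""
    spec = pod.get("spec") or {}
    reasons: List[str] = []

    # Host volume mounts expose the node's filesystem to the pod.
    for volume in spec.get("volumes") or []:
        if "hostPath" in volume:
            reasons.append(f"volume {volume.get('name', '?')} mounts a host path")

    for container in spec.get("containers") or []:
        name = container.get("name", "?")

        # A privileged container runs with host-level privileges.
        security = container.get("securityContext") or {}
        if security.get("privileged") is True:
            reasons.append(f"container {name} requests privileged mode")

        # Missing limits means the workload can exhaust node resources.
        limits = (container.get("resources") or {}).get("limits")
        if not limits:
            reasons.append(f"container {name} defines no resource limits")

    return reasons


if __name__ == "__main__":
    # Hypothetical manifest a developer might push from CI.
    risky_pod = {
        "spec": {
            "volumes": [{"name": "host-root", "hostPath": {"path": "/"}}],
            "containers": [
                {"name": "build", "securityContext": {"privileged": True}}
            ],
        }
    }
    for reason in admission_violations(risky_pod):
        print("deny:", reason)
```

In a real cluster this logic would sit behind a validating admission webhook or a policy engine, so the developer is denied at apply time and runtime detections only fire for workloads that actually got deployed.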

correlations yeah uh the gentleman in the back just said you can use the CI pipeline logs as well to help establish provenance about region one um

so the question is does behavioral profiling get easier or harder in ephemeral workloads and I think the answer to that really has to do with how you can get your baseline profiles in the first place like are your test cases like built out enough to help you get quality uh quality baselines so I think the ephemerality doesn't really change too much what kind of yeah what kind of baseline you can get but really a lot of these like these demos I showed of doing anomaly detection they definitely work really well if you're just like messing around with it on things that don't change too much like let's say you ran a crypto exchange and you wanted to

get like really detailed behavioral profiling you might put that just on like certain key features that don't change very often like access to very particular resources rather than on everything so you start with like the keys to the kingdom that people don't touch much and then expand as your resources allow so authentication and then monitor those behavior sets yeah like yeah if you have a very well-defined like authentication mechanism that's established and doesn't change much that's also very easy to instrument because there's only like three or four things that they actually do and we've run out of time if you want we can ask more questions offline so let's

thank our speaker [Applause]

