
All right, thank you, and welcome to our talk on practical Kubernetes security at scale. This is based on joint work between two teams in Schibsted: Developer Foundations, and Product and Application Security. My name is [inaudible], and I'm a software engineer at Schibsted. I've been focusing on infrastructure and on how development teams can effectively use cloud platforms. And I'm [inaudible], a security engineer. This is work I was involved with while at Schibsted, and I'm currently at Norway's wealth fund.

All right. Over the past few years, a lot of companies and organizations have embraced running containers. Containers provide a standard way to package your application code, configuration, and dependencies into a single unit that you can launch wherever you like. Containers are lightweight and portable, as they already contain everything the application needs, and they can run on any platform that supports containers. Kubernetes is an open source platform for managing containerized workloads and services, and it is widely used in the industry, with more and more organizations adopting it as their platform of choice.

Schibsted is a Nordic family of digital consumer brands with a mission to empower people in their everyday lives. We have leading brands in the Nordic market across news media, online marketplaces, and technology ventures, and many of these organizations are adopting Kubernetes. New technology brings new challenges, and securing our platforms as they move to Kubernetes is one of them. In this talk, we'll outline some of the steps we've taken to enable security measures and controls for Kubernetes configurations across different organizations within Schibsted.

Organizations within Schibsted are autonomous when it comes to technology choices, which often leads to dissimilar setups and to brand-specific solutions. Schibsted has different building blocks that are shared and can be leveraged by different brands. Some examples include identity
platforms, privacy services, payment services, and data and analytics capabilities. Developer tooling and infrastructure components provided centrally are similarly shared building blocks that support the software development lifecycle and product teams across different organizations. Together, these building blocks form a foundation that organizations can leverage and build on top of.

AWS is a cloud provider that's widely used in Schibsted. It provides a lot of services, which now include EKS, its managed Kubernetes offering. Although EKS provides some things out of the box, there is still a learning curve to operating a Kubernetes cluster and running workloads efficiently. Not all teams have the same capacity or expertise when it comes to investing time and effort in infrastructure, so being able to leverage a shared building block can be appealing to teams both small and large. For this reason, we've introduced Skates, a managed Kubernetes configuration that comes with batteries included, to create a fast track to a Kubernetes configuration and runtime that's ready for production. Skates is built on top of EKS, comes with integrations to existing CI/CD systems, has a lot of capabilities provided out of the box, and is a setup that is similar across the different organizations. We are currently operating close to a hundred individual clusters across multiple organizations, with a steady growth of workloads being migrated to and hosted on these clusters.

One of the challenges this presents is how to ensure a base level of security across the clusters we manage for different organizations, and how to drive improvements over time.

Kubernetes is a complex topic with a steep learning curve. That complexity makes it possible to simplify applications, as Kubernetes now takes care of many of the things traditionally handled
inside applications.

Kubernetes consists of several different parts. First, we have the control plane, which works to maintain the desired state of the cluster. It has several components. etcd is a key-value store where the state of the cluster is persisted. The scheduler is in charge of scheduling pods, or workloads, onto worker nodes. The API server is the core of the control plane, and it is how users, external components, and parts of the cluster all communicate with each other. Lastly, we have controllers that watch the shared state of the cluster and make changes, attempting to move the current state of the cluster towards the desired state.

Then we have worker nodes, which run the applications and workloads. A pod represents a group of one or more containers running together. Each worker node can run multiple workloads, and a cluster can have multiple worker nodes, or groups of worker nodes, as it scales over time. Inside a worker node there are a couple of components. There's the kubelet, an agent that connects to the control plane and registers the worker node. Then we have kube-proxy, a network proxy that runs on each node in the cluster. It maintains network rules on nodes, and these rules allow network communication to your pods from sessions inside and outside of the cluster.

There also need to be components to support incoming and outgoing traffic to the cluster, and in a cloud environment, cloud-specific components become relevant as well. Further customization can be done by installing external modules that extend and enhance the behavior of the cluster and shape it into something that's usable for you or your organization. So in summary, there are a lot of components and parts to Kubernetes that make it possible to run applications on top of it. Kubernetes does provide some basic security features, but it is in the hands of the cluster
operator to implement robust security measures when it comes to security and compliance enforcement. Ensuring we have security measures and controls in our platforms, and in the building blocks I mentioned earlier, is key to being able to operate them efficiently. But where do we start implementing security in Kubernetes in a way that won't introduce hurdles that kill the momentum development teams have gained from adopting the platform?

There are a lot of best-practice and security guides readily available on the internet. There are many commercial solutions from security vendors, and there are also a lot of open source tools and solutions readily available. So we have a lot of options, which is great, but deciding what to do and how to do it is the challenge. Depending on your business and compliance requirements, different aspects may need to be considered; it's not one-size-fits-all. What we will be talking about are the steps we have taken on our journey to improve the security posture of Kubernetes configurations in Schibsted.

The Cloud Native Computing Foundation maintains an overview of cloud native projects applicable in the Kubernetes context, covering a lot of different topics. There are many established projects out there as well as new and up-and-coming ones, both commercial and non-commercial. The CNCF keeps track of the majority of these projects, and the landscape can serve as a guide. Exploring open source solutions and evaluating what gives us value has been our approach so far. Going by what is out there, we have a lot of options, but finding the right fit is the challenge.

Our approach is as follows. First, repeatable security, according to NIST Cybersecurity Framework Tier 3. Second, build as little as possible by leveraging existing tools. Third, since we already have clusters out there with production workloads
or teams developing solutions that are deployed to clusters, we need to ensure that we don't disrupt them in a way that hampers them. We need to recognize that there may be different requirements in different teams and organizations, and align the work with existing efforts in CloudSec and AppSec, as they all contribute to the overall security posture. And lastly, we want to learn from others: there are other Kubernetes setups in Schibsted at the moment, and we try to incorporate the learnings from those into Skates.

AWS defines a shared responsibility model that states what AWS is responsible for and what's left to the customer. In the case of Kubernetes, a lot is left to the user, but with a shared building block like Skates, we are able to encapsulate some of the complexity that users would otherwise be exposed to. Similarly, we want to clearly define the responsibilities of a central team operating clusters and of the product teams running workloads on them. Broadly speaking, the operator is concerned with the operations and security of the cluster overall, and the users are responsible for the operations and security of the applications or workloads they launch in the cluster. We as operators ensure guardrails are put in place that enable the product teams and developers to do so. This continues to empower developers, who can still use the "you build it, you run it" mindset, with some additional support.

Automation is key to scaling security. We use infrastructure as code to get repeatable security across all of the clusters. This lets us use the same hardened setup everywhere while also ensuring that the state does not drift. Furthermore, it enables a gradual rollout of guardrails and security measures, and the ability to customize them per team, per organization, or even down to the cluster level. We use Checkov, which is a static code analysis tool
for scanning infrastructure-as-code files for misconfigurations that may lead to security or compliance problems. It includes predefined policies for common misconfiguration issues, it supports many different infrastructure-as-code flavors, and it allows us to evaluate our Terraform modules as well as our Terraform plans before we apply them. Checkov can be integrated into existing CI/CD pipelines, which gives us a way to catch issues before they are deployed. This gives us more confidence when making changes and applying them across the board, knowing we won't compromise existing setups.

Before we get into how to secure Skates clusters, it's worth noting that the best isolation comes from separate clusters: if you have workloads that should be completely isolated from each other, run them in separate clusters. Schibsted is a collection of a lot of brands, and they already run in their own AWS accounts within a single AWS organization, so it made sense for us to also have individual clusters in those accounts. The benefit is that we get the extra isolation, but we miss out on some of the benefits of Kubernetes, such as the scaling.

This is our target baseline. We say "target" because we are still working on it, and "baseline" because we expect some teams to go beyond it; for instance, there are already teams that do egress filtering as well as RASP-type tooling. We're going to focus on the Kubernetes parts here, and even though it's pretty AWS-heavy, we think there are learnings here for any Kubernetes setup. The goal is to get as much security as possible from each measure, while not getting in the way of users, and with the least amount of effort from our central teams.

First off, securing the control plane. There is complexity in operating and securing the control plane, and you can largely outsource it by using EKS, which, as was mentioned, is the managed Kubernetes offering from
AWS. Basically, in terms of the control plane, you get everything you need out of the box: it's secured well enough, and you don't have to do additional hardening. You might want to do IP allow-listing if you're concerned about having the API endpoint exposed to the internet, but it should be safe by default.

Having solved the control plane, we can look at the data plane, which mostly centers around the pod: how can you break into a pod, and once you're there, what else do you have access to? The pod might be exposed as a service inside the cluster, and there might also be external traffic coming into it. Another way in for an attacker would be through the supply chain, introducing malicious code into the pod. Once an attacker is there, we as cluster operators want to make it as hard as possible to move laterally to other things, either via a container escape or over the network.

Before getting more into that, let's take a brief look at the history of isolation. We started out with processes, running different apps on a single operating system, and there are some hardware features to back this up. The next step was to run multiple operating systems on a single physical machine, so we got VMs, and again, new hardware features were added to enable this. More recently, some have tried to get the best of both worlds of processes and VMs, and we got microVMs. More broadly, new isolation features tend to come from separate hardware, and there is also research into new CPU instructions to improve isolation.

AWS, like the other big cloud vendors, is built on VMs, so a lot of effort has gone into hardening those. They used to rely on less custom hardware, but the trend for AWS over the last few years has been towards more custom hardware, which is the Nitro System. The current generation of VMs you get on AWS is based on Nitro. AWS also has a microVM
option, called Firecracker, which is used for serverless functions like Lambda, and which you can also use for containers.

In contrast, Kubernetes builds on software features in the Linux kernel. You still have the process model, but the isolation is implemented in software, so you lose out on all of the hardening and innovation that has been happening in the VM space. That's why we need to do some extra work when trying to secure Kubernetes. You can run it in microVMs as well, but you lose some flexibility, and it's unclear how much security you gain from that.

Kubernetes is a leaky abstraction: the low-level kernel features I talked about are exposed to users deploying to the cluster. When you deploy your pod, you can also specify low-level Linux settings such as namespaces, SELinux options, and capabilities. Basically, these settings limit how easily your application, your pod, can escape into the system. If you do escape, you can get to other pods running on the same machine, and you can also get to the control processes, like the kubelet and kube-proxy, which are also just processes. The incentives are misaligned, because these settings protect the rest of the system from your application rather than the other way around. If you're just doing a one-off, you might not want to do a lot of hardening to protect the rest of the system from your application.

The solution we're going for is a combination of two things. First, a container-optimized OS called Bottlerocket, which hardens the VM. Second, limits on what low-level security settings are allowed into the cluster, enforced via admission control. Bottlerocket was made for exactly this purpose: running containerized workloads more safely. It's an open source, Linux-based operating system from AWS, and it is focused on preventing persistence rather than on isolating different pods from each other.
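To make the admission-control idea more concrete, here is a minimal sketch in Python of the kind of check an admission controller performs on incoming pods. It assumes pods arrive as plain dictionaries mirroring the Kubernetes Pod manifest, and it covers only a small subset of the Pod Security Standards Baseline controls: host namespaces, hostPath volumes, privileged containers, added capabilities, and custom SELinux user or role options. The function names and the example image are hypothetical; this is not how Pod Security Admission is actually implemented, and in a real cluster you would enforce such rules with PSA or a policy engine rather than hand-rolled code.

```python
"""Minimal sketch of an admission-style pod check.

Loosely modelled on a few Pod Security Standards "Baseline" controls.
Hypothetical helper names; NOT the real Pod Security Admission code.
"""

from typing import Any, Dict, List

# Capabilities that the Baseline profile allows in securityContext.capabilities.add.
BASELINE_ALLOWED_CAPS = {
    "AUDIT_WRITE", "CHOWN", "DAC_OVERRIDE", "FOWNER", "FSETID", "KILL",
    "MKNOD", "NET_BIND_SERVICE", "SETFCAP", "SETGID", "SETPCAP",
    "SETUID", "SYS_CHROOT",
}


def _all_containers(spec: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect regular, init and ephemeral containers from a pod spec."""
    return (
        (spec.get("containers") or [])
        + (spec.get("initContainers") or [])
        + (spec.get("ephemeralContainers") or [])
    )


def _selinux_problems(where: str, ctx: Dict[str, Any]) -> List[str]:
    """Baseline forbids a custom SELinux user or role (type checks omitted here)."""
    opts = ctx.get("seLinuxOptions") or {}
    return [f"{where} sets seLinuxOptions.{key}" for key in ("user", "role") if opts.get(key)]


def baseline_violations(pod: Dict[str, Any]) -> List[str]:
    """Return reasons why `pod` fails this subset of Baseline-style checks."""
    spec = pod.get("spec") or {}
    problems: List[str] = []

    # Sharing the node's network, PID or IPC namespace is disallowed.
    for field in ("hostNetwork", "hostPID", "hostIPC"):
        if spec.get(field):
            problems.append(f"spec.{field} must be unset or false")

    # hostPath volumes expose the node's file system to the pod.
    for vol in spec.get("volumes") or []:
        if "hostPath" in vol:
            problems.append(f"volume {vol.get('name')!r} uses hostPath")

    problems += _selinux_problems("pod", spec.get("securityContext") or {})

    for c in _all_containers(spec):
        where = f"container {c.get('name')!r}"
        ctx = c.get("securityContext") or {}
        # Privileged containers disable most of the container isolation.
        if ctx.get("privileged"):
            problems.append(f"{where} is privileged")
        # Only capabilities on the allow-list may be added.
        for cap in (ctx.get("capabilities") or {}).get("add") or []:
            if cap not in BASELINE_ALLOWED_CAPS:
                problems.append(f"{where} adds capability {cap}")
        problems += _selinux_problems(where, ctx)

    return problems


if __name__ == "__main__":
    risky_pod = {
        "spec": {
            "hostPID": True,
            "containers": [{
                "name": "app",
                "image": "example/app:1.0",  # hypothetical image
                "securityContext": {
                    "privileged": True,
                    "capabilities": {"add": ["NET_BIND_SERVICE", "SYS_ADMIN"]},
                },
            }],
        }
    }
    # An admission controller would reject the pod if this list is non-empty.
    for reason in baseline_violations(risky_pod):
        print("deny:", reason)
```

For the example pod, this reports the host PID namespace, the privileged flag, and the SYS_ADMIN capability, while NET_BIND_SERVICE passes because it is on the Baseline allow-list. Returning a list of reasons rather than a bare yes/no makes it easier to tell developers why a pod was rejected.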
Bottlerocket does also help with isolating pods, though, using a custom SELinux policy.

A couple of its features: it minimizes the number of binaries, since if an attacker gets some sort of foothold, having binaries available can be useful to them, and it also hardens the binaries that are there. It uses a combination of read-only and ephemeral file systems to make it hard to tamper with files and to ensure that after a reboot you're in a clean state. Updates are done atomically: rather than mutating individual files, you get a fresh set of files that make up the whole version, which makes it really easy to roll forward and backward. It also has convenience features for speeding up security patching in the cluster. We went with Bottlerocket because it's a great combination of security and flexibility: there's a lot of hardening, but you can still install custom security tools as well as components from the larger Kubernetes ecosystem.

The second part of the solution is the admission controller, which limits what is allowed into the cluster in terms of these low-level Linux settings. You could write custom policies, going through each setting and deciding what should be allowed. We actually did that exercise, and what we came up with was pretty close to the existing Kubernetes standard called Pod Security Standards (PSS). PSS is implemented by something called Pod Security Admission (PSA), which is currently in beta in Kubernetes, and you can also use it with older Kubernetes clusters. So you can get a lot out of the box using those. PSS basically has three levels: Privileged, which is unrestricted; Baseline, which is unlikely to break your applications while still providing good security; and Restricted. If you need more flexibility than that, you can use something like OPA Gatekeeper. OPA Gatekeeper allows you to write other types of policies as well, including
limiting which registries images can be pulled from, and verifying signatures. We actually implemented both PSA and OPA Gatekeeper, and OPA Gatekeeper is both harder to integrate and harder to use, so if PSS is good enough for you, that's where you should start.

To further drive home that Bottlerocket and PSS are a good match, we have this list of recommendations from the Bottlerocket GitHub page. The first column has the recommendation, the second the priority, and the third whether PSS covers it or not. As you can see, PSS covers a lot, and it's a great fit with Bottlerocket. For instance, look at the second one, about privilege escalation: that's restricted in PSS Baseline. A couple of the settings involved: there is a setting called privileged, true or false, which is blocked by this, and there are various Linux capabilities that can be dangerous, which are also blocked.

Container images are a convenient way to package and distribute applications, but they may include many attack surfaces. Image scanning is the process of identifying known security vulnerabilities in the packages listed in a container image. Scanning images at build time enables us to fix any vulnerabilities identified before the images are deployed anywhere. There are multiple tools available for this purpose, one of which is Trivy. Trivy is an open source vulnerability scanner from Aqua Security. It detects vulnerabilities in operating system and language-specific packages, and it can also scan an image for hard-coded secrets like passwords, API keys, and tokens. Trivy has a database of vulnerabilities, which includes remediation recommendations and is periodically updated. But not all vulnerabilities have fixes, and depending on the context where an image is running, a vulnerability may not necessarily be exploitable, so we need to ensure that
there is a good sign