
Hey there, thanks for hanging in there with us. I'm Olivia Hillman, on the security data engineering team, part of security operations at Benchling, and this is my colleague David. Today we're going to go over the concept of a cloud security baseline layer: how we define it, how we built it, and ultimately how you can benefit.

First, some intros to get everyone familiar. Benchling is a cloud-based platform for biotech research and development. As part of our work helping our customers unlock the power of biotechnology, we're responsible for the security of the sensitive data associated with that scientific work, and for identifying and addressing any consequent risk. That's where security operations comes in. As Benchling's detection and response function, the job of security operations is to detect threats and malicious activity against the company and its customers and remediate them as quickly as possible. Security data engineering is a subset of security operations, responsible for building the key services and platforms that enable that D&R workflow. For example, we run a threat detection pipeline that collects security telemetry from various sources and feeds it to our detection team so they can write tailored rules to identify threats. David and our colleague Brian presented on this yesterday, so definitely go back and watch that if you missed it.

So who's this talk for? It's for security teams looking to scalably manage their organization's cloud environments without causing friction and slowing down engineering. It's for engineers responsible for deploying security services. And it's for everyone else in this room, because there's something for everyone. We'll be discussing technical details, particularly around Terraform, but we'll aim to keep everyone on the same page, so you'll still get the gist. By the end of this talk you should have an understanding of common real-world threats to cloud environments, and be familiar with the concept of a baseline
security layer in a public cloud environment, along with some idea of how to build one. And ideally you'll stay awake, but this late in a long conference weekend, I'll claim success at 50 percent.

Now that we're aware of who we all are, we'll get on the same page about what typically goes into a cloud account, define a security baseline, walk through our implementation strategy and challenges, and wrap up with a summary and maybe some time for questions.

Great, let's get started by going over cloud environments in a general sense. Here's an outline of your typical cloud account or environment: we've got the base infrastructure, the network layer, and the compute and application layer. But is there anything missing? Hopefully the name of our talk tipped you off: the security layer, layer 0, the foundation. Why do we need a security layer anyway? Let's take a step back and look at how cloud environments are breached today. Google's Cloud Threat Intelligence team publishes a quarterly report on real-world actors and compromises, and that provides a great source of information to use when shaping our security policies. Per the report, nearly 60 percent of cloud compromises can be attributed to a lack of authentication or to basic misconfigurations. With the sprawl of cloud services it's easy to get configuration wrong, especially when you have many accounts managed by different business units without a standardized set of guardrails and policies. And developers shouldn't be the ones having to worry about getting those things right; they should have paved roads and cloud environments set up with secure defaults.

But a baseline security layer encompasses much more than just standardization and base defaults. Let's take something as basic as a public S3 bucket. These are so easily identified by malicious actors that, according to Wiz's 2023 State of the Cloud report, if your bucket name is in GitHub and you've accidentally made the bucket public, it can typically be exploited within seven hours. If your bucket has a predictable name,
like company-secrets or benchling-users or my-super-secret-data, hackers can usually find and access your misconfigured bucket within 13 hours. So we know private data should not be in public buckets, but it happens all the time. One incident with airport data, uncovered by Skyhigh Security, exposed everything from employee IDs and photos, which could present a serious threat if leveraged by terrorist groups or criminal organizations, to information about planes, fuel lines, and GPS map coordinates. The security report itself highlights how simple, yet how harmful, a misconfiguration like this can be.

So it's obviously important, but what is a security baseline layer? A security baseline is a guardrail that puts the onus on the security team, or on engineers explicitly responsible for deploying security services, to define, manage, and maintain those services. That removes the burden from other developers of needing to make these decisions on a per-account basis. And security teams need visibility into cloud environments, to enable things like cloud log sources, and to identify and manage configuration and inventory and more, in order to detect and respond to malicious activity; the baseline provides a standardized way of doing that. The configuration should be consistent across the company, providing a foundation of hardened environments that makes developers' lives easier.

But how do you know if you and your company need one? Well: if you have multiple cloud accounts or environments; you've got more than one person working in the cloud; you want repeatable security configuration across accounts; you need to enable collection of logs and other security telemetry for threat detection and incident response work; or you have expectations as a company to uphold particular standards for configuration and security. After reading through this list, I'd bet the majority of you fall under at least one of these categories. Great, so let's help you
figure out how to make one. The cloud security baseline should be a centralized platform: we want everything managed in one place for simplicity, clear division of responsibility, and uniformity in our foundation. This gives us consistency across accounts and a repeatable process, which instills confidence in the deployments that are going out. The baseline should be self-healing and out of the path of production impact: it needs the ability to autonomously resolve issues based on a desired state, and any hiccup in the deployment process of your telemetry aggregation should not stand in the way of business as usual for developers. And finally, we want this baseline to apply the security-as-code philosophy. We'll get better quality security deployments by following software engineering principles, and we gain auditability, with code as a source of truth for what's in place across accounts at any given time.

So once all these requirements are achieved, what does this enable for us? Security teams gain insights, telemetry, and centralized cloud governance. Engineers can work without worrying about deploying security services and defaults, or having their work interrupted to do so. The business can move fast knowing that the baseline is covered. And your life becomes easier when security is a platform: you develop and deploy security services faster and keep up at scale. Now I'll pass it on to David to discuss in detail how we built this to enable these wins for ourselves.

So, Olivia outlined the goals for what we want to see in a cloud security baseline; now let's talk about some implementation details and how to actually build something like this. With all these things in mind, we had a couple of initial challenges we knew we needed to solve. We want to build something that continues to enable people to do their jobs; you don't want something that's continuously going to get in their way.
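To make the "secure defaults" idea concrete before we dive into the implementation, here's a minimal, hypothetical sketch of one piece a baseline module might ship: enforcing the S3 account-level Block Public Access setting, which closes off the accidental-public-bucket exposure discussed earlier. The resource label is illustrative, not from the talk.

```hcl
# Hypothetical fragment of a baseline Terraform module: turn on the
# account-wide S3 Block Public Access setting so an accidentally
# public bucket stays unreachable by default.

resource "aws_s3_account_public_access_block" "baseline" {
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```

Because this is declared once in the baseline rather than per team, developers never have to remember to set it, and drift from the desired state shows up in the next plan.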
As we scale and continue to grow, we want to make sure this security deployment pipeline keeps working successfully. We're an AWS shop, so there are a lot of different regions and environments to support depending on business use cases, and we need the ability to handle that as the business continues to grow. Then there are roles and responsibilities. Olivia touched on this, but what's the distinction between infrastructure-team and security-team ownership? Not the technical side of it, but the people and process side: who responds if something breaks, who gets the ticket, who fixes it, what does ownership look like?

Honestly, the hardest thing for us wasn't even the technical side of this. The technical side was challenging, and we'll talk about it, but it was really a culture shift, a paradigm shift, in how we wanted to operate. We're going to keep saying this throughout the rest of the presentation: we really believe in a security-as-a-platform approach rather than a consultant style. Pretty often you have a business with an infrastructure team and a security team, where the security team operates in an ad hoc manner: they create docs, they write tickets, they give recommendations, and then it all goes over to the infrastructure team. Most infrastructure teams are underwater; they have their hands full with plenty of things to do, and the last thing they want is a continuous barrage of security requests thrown over the fence that they then have to prioritize. There's never enough time in the day for anyone to get their work done, so a lot of security work goes a lot slower than it could. We want to shift this model and say: we have a security team that is more than competent, more than capable, of doing these things. Let's manage and
separate these duties out a little, and reduce the decision fatigue of infrastructure and developer teams having to worry about the security aspect. It's a win-win for everyone.

To build this platform we decided to leverage Terraform. This is what our company uses, so we wanted to standardize on that stack; you could do something similar with CloudFormation, or leverage your IaC tooling of choice, but Terraform is what we use. If you've never used Terraform before, I have one quick slide, which looks like it has much more text, so I won't read through all of it; depending on your familiarity, there are really only a couple of things I want to call out. State refers to your collection of resources: you have your IaC, and you have this representation of what those resources actually look like in the cloud, and this is what you look at to see what exists, what doesn't, and what the delta is. There's also the concept of a Terraform provider, which is the most important piece I want to call out. A Terraform provider is essentially an API plugin for communicating with whatever third-party API you're integrating with; in our case that's the AWS control plane, and if you use GCP or Azure or something else, there are specific providers for those. There are tons of Terraform providers for a huge range of services, not even necessarily cloud providers. I really want to make that definition as clear as possible, because this is where the crux of our technical challenge ended up being.

So there's a problem in Terraform, which we're calling the Terraform state problem: how do you interact with different accounts in different regions to deploy a large number of resources? Remember that a provider is a plugin used to interact with third-party APIs. So if we want to manage AWS resources, we would use the Terraform AWS provider,
which takes our resource declarations and calls the relevant AWS APIs to get resources into the state we requested. Terraform also requires us to create a unique provider configuration per account and per region, because that's how the AWS control plane operates. So if you want to deploy an EC2 instance in account one in us-west-1 and in us-west-2, you need two separate Terraform provider configurations to handle the different regionality. You can imagine that this starts to break at scale: with N accounts times M regions, that's the math that ends up happening, and you're going to end up copy-pasting stuff over and over again. If you have something as simple as 20 accounts, which is not a huge AWS deployment, and 10 regions you want to support across different parts of the world, you have 200 different module instantiations. Good luck trying to keep track of that, and good luck copy-pasting that code over and over again; it's going to be a nightmare.

When we first hit this we were worried we were missing something, so we went looking for a fix. We found the GitHub issues for Terraform, and you can see we were not the only ones to run across this and not be stoked about it. You can see folks saying they love Terraform, but that this is a major drawback; it's an issue that's been open for a while and not really addressed. The HashiCorp support pages, whose documentation is really great (we love Terraform in general), give a couple of workarounds for this specific problem of dynamically generating providers for different regions and accounts. There are three options. One is to use external tooling to auto-generate your Terraform files; they recommend bash or PowerShell, and a popular option is also Terragrunt. Essentially the concept is "don't repeat yourself": how do you dynamically avoid the problem of repeating yourself? The second one is to
pay for Terraform Cloud and leverage their provider to create the concept of a workspace, which was defined on an earlier slide, and they handle that for you. You create different workspaces, use them to manage your Terraform state, and then you don't need to re-instantiate providers; you can seed that context into each workspace. Lastly, there's the CDK for Terraform. At the time we didn't really want to spend an innovation token on this; in my personal opinion it's not as fully fledged and developed as it needs to be, so we didn't want to risk our production security workflows on it. We prefer standard Terraform, which is much more proven and has a lot more support.

So let's look at all these options and see what we chose to do. The first option was to use some sort of tooling to auto-generate all of the Terraform for us: write some custom bash scripts, or maybe use Terragrunt, to dynamically create things so we don't repeat ourselves; we define some config and have code spit it all out. In this example, the way that looks is: we have four AWS accounts, and we want to deploy resources across two different regions. Regions are represented with A and B; accounts are numbered one through four. For this we need a bunch of different providers, and we wanted to shard things vertically: divide things into dev, staging, and prod, take all the dev accounts, put them in a single workspace, and auto-generate the Terraform for it, then do the same for staging and production. Everything flows vertically. There are a couple of cons with this approach. You're going to need very heavy cross-account authentication, because your workspace needs to reach into all the accounts defined in it
and apply Terraform changes there, and we didn't love that from a least-privilege standpoint. More importantly, there's very tight coupling between different accounts and regions: if you have a failure in 1B on the very left side, none of your changes get propagated to 2A and 2B. So a failure in one account and one region, for some arbitrary reason (things happen all the time), could result in serious issues and break your deployment.

The other option is a horizontal approach, where we still split things into dev, staging, and prod so that, following software engineering principles, we can test our security baseline over time. The biggest benefit here is that you don't have to auto-generate any Terraform providers; you can just bootstrap the workspaces and seed all the context in there, so you have one workspace for each account in each region. This was cool, but it would also leave us with a ton of workspaces: if you want to figure out all the configuration for a single account and you have 10 regions, you have 10 different workspaces tracking state, which we felt might be challenging.

So we decided to come up with our own approach, a hybrid that combines approach one and approach two. This was the final model we settled on. Again, it's super basic here, with only four accounts in two regions, so it looks pretty simple; but picture an environment with a thousand accounts and 15 regions you need to support, and the benefit of this approach at scale becomes clear very quickly. Here we have a one-to-one mapping of account and security configuration: one account, one workspace associated with it, and all the regions and all the resources that need to exist in it live in that single block. It's a much easier mental model to look at an account holistically.
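The hybrid layout described above can be sketched roughly as follows: within a single account's workspace, one aliased AWS provider per region, with the same baseline module instantiated against each alias. The account ID, role name, regions, and module path are all illustrative assumptions, not details from the talk.

```hcl
# Hypothetical single-account workspace: one aliased provider per
# region, each passed explicitly to a baseline module instantiation.

provider "aws" {
  alias  = "us_west_1"
  region = "us-west-1"
  assume_role {
    role_arn = "arn:aws:iam::111111111111:role/security-baseline" # illustrative
  }
}

provider "aws" {
  alias  = "us_west_2"
  region = "us-west-2"
  assume_role {
    role_arn = "arn:aws:iam::111111111111:role/security-baseline" # illustrative
  }
}

module "baseline_us_west_1" {
  source    = "../modules/baseline" # illustrative path
  providers = { aws = aws.us_west_1 }
}

module "baseline_us_west_2" {
  source    = "../modules/baseline"
  providers = { aws = aws.us_west_2 }
}
```

Everything for the account lives in one workspace and one state, so auditing an account means reading one file, while a failure in a different account's workspace can't block this one.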
I wanted to show what this architecture looks like, specifically how we have different bake cycles for dev, staging, and production environments. What we do is bootstrap the creation of all these workspaces in an infrastructure repo managed by our infrastructure team; this is the only reliance or dependence we have on our infrastructure team to deploy our infrastructure. As a new account gets added, it gets added to a configuration, and a workspace automatically gets created depending on whether the account slides into a dev, staging, or production configuration. From there, we do all the management in the security baseline repo. We have workspaces created from the working directory in the infra repo, everything is mapped by environment, and we use the security baseline repo for orchestration; that's where the actual logic lives. On the application side (even though this is infrastructure), on the very right we have all of our Terraform modules. These are things like telemetry, whether DNS logging or VPC flow logging; you can have auto-remediation modules; pretty much any security function you need to accomplish or deploy gets put in those modules, gets sharded out across different environments, and is very easy to manage.

We wanted to show a real-life example of how we implemented network telemetry, specifically Route 53 logging, which gives you DNS logs inside of AWS VPCs. If you think about rolling out DNS logging across all of your cloud infrastructure and all of your cloud environments, it might sound like a daunting task: maybe you have a bunch of different VPC instantiations, different service owners, different service teams. That's actually part of the reason we decided to build this. It could be a multi-month project; with this change, you can roll it all out in less than a day. What you do is establish your Route 53 logging module and test it in a dev environment. Once
that looks good, you'll prom