
WhizBangLambdaFix: where AWS Misconfigurations meet Auto-Fix-It Antics

BSidesSF · 2024 · 32:23 · Published 2024-07
Category: Technical
About this talk
WhizBangLambdaFix: Where AWS Misconfigurations Meet Auto-Fix-It Antics. Lily Chau, Lakshmanan Murthy. Explore our AWS Lambda-powered tool for the cloud that not only automates misconfiguration remediation, but also cuts costs and reduces attack surfaces. Dive into our dual-angle approach, fusing secure defaults with impactful playbooks and user Slackbot responses for hands-free AWS management. https://bsidessf2024.sched.com/event/9c524a596374007fa48b5a98fa64a42f
Transcript [en]

Our next talk is on Lambda fixes, I think, from Lily and Lakshmanan. Take it away. A reminder before we start: questions are on the Slido. Please submit your questions on the Slido (the bsidessf.org Q&A) and I will present them to your presenters. Thank you.

All right, hi everyone, I'm Lily Chau, and I'm here with Lakshmanan Murthy. We're here to talk to you about the chaos we created at Roku with our WhizBangLambdaFix framework, where AWS misconfigurations meet our auto-fix-it antics. The agenda for today: we'll go over the problem we're trying to solve, the auto-remediation framework, and some practical playbooks, and we'll conclude with some metrics.

So we have more visibility than ever into our cloud environments, but what that also means is there are too many findings: too many findings that aren't critical enough to make it into a Jira ticket, too many Jira tickets, too many untouched Jira tickets, too many Slack alerts, too many unresponded Slack alerts, and too long a time frame to remediate with traditional bug fixing via Jira and Slack. Despite all your efforts diligently writing detailed steps explaining the vulnerability and the remediation step by step, timely

resolution still remains a problem. One obstacle is time and resource constraints within teams, which can delay the quick attention needed for critical security issues. This is compounded by the lack of security expertise on the developer side, especially with the growing number of AWS services and the possible misconfigurations each one can have. So we recognized a need for change. Security professionals are uniquely positioned to identify, understand, and remediate AWS misconfigurations. Our aim in this talk is to empower security professionals by providing the framework and the playbooks to safely and effectively mitigate the AWS misconfigurations that are impactful to your organization: not stopping our automation workflows once we file a Jira ticket, and not stopping once we file a Slack alert, but going beyond to fix the issue at scale.

Okay, so before diving into the solution, let's outline the typical direction of an organization's cloud environment. For new services: adopting service mesh architecture for better availability, scalability, and security (at Roku we use Istio). For existing services: migrating them to a service mesh architecture while ensuring uninterrupted service, because business continuity. And then everything else, new and existing, that bypasses service mesh adoption. This can include spinning up your EC2s manually, containerizing your applications with Docker in Kubernetes, using Lambdas, or using a platform as a service such as Heroku to deploy your applications. While it's technically possible to securely deploy your applications in these non-approved environments, it's uncommon.

More often than not, these practices are manual and introduce security vulnerabilities. Now, for your organization, the approved golden standard may not be Istio; it could be a golden image, in which case the direction is still the same: golden image for new services, migrating existing services to your golden image, and then somehow controlling everything else spun up outside of the golden image.

Okay, so moving on to addressing security in AWS, we take a dual approach: security via secure defaults and guardrails, and security via auto-remediation for non-compliant or invariant services. Part one, on the left, is your perfect dream of security foundations. You set up AWS Organizations with GuardDuty, CloudTrail, and least-privilege IAM roles; you set up guardrails with service control policies; and you have thousands of secure-by-default infrastructure-as-code templates, frameworks, and modules. And because everything is in code, any security issue will just be in the code, which you can scan for, block, correct, and even ask AI for a remediation solution.

So okay, wait: why are people spinning up EC2s manually? We've heard it all before. "I'm new to AWS." "I just want to spin up something really quickly for test purposes." "I'm more familiar with Docker and Kubernetes." "I still don't understand service mesh." Okay, cloud is hard. So if you're going to spin up something in EC2, let's make sure you are doing it securely. While you learn service mesh and make a plan to migrate to service mesh, let's make sure any of your manual click operations don't introduce vulnerabilities we can't live with. This is the focus of our talk, the stuff on the right: auto-remediation against issues introduced by manual actions, and also securing configurations that we could not secure via SCPs, because SCPs are not great if you have a lot of edge cases, and SCPs are not great at providing usable error messages. I mean, take a look at this cryptic error message: "You are not authorized to perform this operation." You would never have guessed that what it means is that even as an admin, you cannot launch an EC2 that uses the vulnerable IMDSv1 setting.
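On the secure-defaults side, the guardrail behind that error can be expressed as a service control policy. The following is a minimal sketch, not Roku's actual policy; it relies on the documented ec2:MetadataHttpTokens condition key to deny launches that do not require IMDSv2 session tokens:

```python
import json

# Illustrative SCP: deny ec2:RunInstances unless the launch requires
# IMDSv2 session tokens. The Sid and structure are example values.
REQUIRE_IMDSV2_SCP = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyIMDSv1Launches",
            "Effect": "Deny",
            "Action": "ec2:RunInstances",
            "Resource": "arn:aws:ec2:*:*:instance/*",
            "Condition": {
                "StringNotEquals": {"ec2:MetadataHttpTokens": "required"}
            },
        }
    ],
}

print(json.dumps(REQUIRE_IMDSV2_SCP, indent=2))
```

The catch is exactly the one above: when this denies a launch, the user only sees the generic not-authorized error, which is one reason to pair SCPs with remediation playbooks.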

Hi. Let's take a generic workflow diagram of how an auto-remediation playbook looks. First we start by defining the config policy. Once we define the config policy, we scan for the misconfigurations in the AWS accounts. Now, depending on the risk appetite: does the misconfiguration need approval from the owner? If not, just fix it, tag it, and record it, in case we need to roll back. If the misconfiguration does need approval from the owner, we send them a Slack message asking them to review and fix the misconfiguration. If they don't, they have two options: they can apply for a one-month exception, or they can do nothing, and in 48 hours we go and fix it. Now, if the misconfiguration is in a production account, since we don't want to touch production accounts, we send a Slack message with three options: apply for a one-month exception, apply for a one-year exception, or do nothing, ignore the Slack message, and your misconfiguration will be auto-remediated in 48 hours. The whole workflow is performed just by Lambdas, and each remediation workflow is itself its own Lambda. In addition to our goal of no Jira, we also have a goal of no Slack.
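That decision flow can be sketched as a small function. The names, fields, and return values here are my own illustration of the branching the speakers describe, not code from the talk:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    needs_owner_approval: bool
    in_production: bool
    hours_since_notified: float
    has_exception: bool

def next_action(f: Finding) -> str:
    """Decide the playbook's next step for one finding.

    Low-risk findings outside production are fixed immediately; everything
    else gets a Slack notice and is auto-fixed after 48 hours unless an
    exception (one month, or one year for production) was granted.
    """
    if not f.needs_owner_approval and not f.in_production:
        return "fix-tag-record"      # fix it, tag it, record it for rollback
    if f.has_exception:
        return "skip"                # owner applied for an exception
    if f.hours_since_notified >= 48:
        return "fix-tag-record"      # owner did nothing in 48 hours
    return "wait"                    # Slack sent, still in the grace period
```

The 48-hour constant and the exception flag are stand-ins for whatever state store the real Lambdas consult.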

We don't want to bombard users with millions of messages, one per misconfiguration, so we keep Slack messages to a minimum once we have tuned out the false positives in the initial stages. Before jumping into the architecture, we need two basic IAM roles in every AWS account where we intend to run the playbooks. The first, as standard, is a security auditor IAM role with the SecurityAudit permissions, which gives read-only permission over all AWS configurations; this role can be assumed only by the AWS security account. Similarly, we require a second IAM role which allows the security team to tag the misconfigured resources in any AWS account that the organization uses.
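A minimal sketch of the auditor role's trust policy; the account ID is a placeholder, but the shape (only the central security account may assume the role) follows the standard cross-account pattern:

```python
import json

SECURITY_ACCOUNT_ID = "111122223333"  # placeholder central security account

# Trust policy for the read-only auditor role deployed to every member
# account: only principals in the security account may assume it.
AUDITOR_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{SECURITY_ACCOUNT_ID}:root"},
            "Action": "sts:AssumeRole",
        }
    ],
}

print(json.dumps(AUDITOR_TRUST_POLICY, indent=2))
```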

So why do we use tagging? Tagging easily helps us track resources. We also follow a strict tagging policy in our organization; for example, the Roku security tag prefix is allocated only to the security team. This is so we don't override any user tags, and so users can know who applied these tags. Another use case for tagging is ownership tagging: we can parse the CloudTrail logs for the owners, based on the create-related events, fetch the email addresses from there, and tag them onto the resource. This helps us quickly track resource owners in case of an incident.
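A sketch of that ownership tagging. The event shape follows CloudTrail's userIdentity layout, but the tag keys and the session-name parsing are illustrative assumptions, not Roku's actual convention:

```python
def ownership_tags(cloudtrail_event: dict) -> dict:
    """Derive owner tags from a CloudTrail create-style event.

    CloudTrail records the caller under userIdentity. For assumed-role
    (e.g. SSO) sessions the principalId often looks like
    "AROAEXAMPLE:user@example.com", so the session-name half after the
    colon is a usable owner hint.
    """
    identity = cloudtrail_event.get("userIdentity", {})
    principal = identity.get("principalId", "")
    owner = principal.split(":", 1)[1] if ":" in principal else principal
    return {
        "roku-security:owner": owner,  # illustrative tag keys
        "roku-security:source-event": cloudtrail_event.get("eventName", ""),
    }
```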

So we have two types of remediation. One is scheduled remediation, which happens daily at a fixed point in time, and then we have real-time remediation. First let's take a look at scheduled remediation. As I said before, each misconfiguration playbook is itself a Lambda. The Lambda iterates through the list of active AWS accounts and the list of active regions for each account, then iterates through the targeted misconfigured resources. When we find a misconfiguration, the Lambda assumes a third IAM role that has the permission to remediate the misconfigured resource in the targeted AWS account. Some examples of scheduled auto-remediation include deleting elastic IPs that are not in use, deleting unused EBS volumes, and updating EC2s to use IMDSv2 over IMDSv1. Note this architecture is very bare: no EventBridge, no CloudWatch, no SNS, no SQS. This is because the misconfigurations we are trying to fix here are not that critical (we can wait 24 hours), so we can get rid of any components that may incur additional cost. Next we have real-time remediation. It's almost the same as scheduled remediation; the only difference is the component on the left, which has CloudTrail involved.
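The account-and-region scan loop reduces to something like the sketch below. The role name is a placeholder, and the actual sts.assume_role and remediation calls are left out since they need boto3 and live credentials:

```python
from typing import Iterator, Tuple

REMEDIATION_ROLE = "security-remediation"  # placeholder role name

def scan_targets(accounts: list, regions: list) -> Iterator[Tuple[str, str, str]]:
    """Yield (account, region, role_arn) for every account/region pair.

    In the real playbook each yielded role ARN would be passed to
    sts.assume_role() before listing and remediating resources there.
    """
    for account in accounts:
        role_arn = f"arn:aws:iam::{account}:role/{REMEDIATION_ROLE}"
        for region in regions:
            yield account, region, role_arn
```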

So we have a separate AWS account at Roku for centralized CloudTrail logging from all the member accounts in the organization. We configure EventBridge to ingest these logs, then create multiple filters for specific events and, depending on those events, invoke the appropriate Lambda. Some examples of real-time remediation include manual creation of EC2s in production accounts, manual creation of IAM users, and manual deletion of EBS volumes; these just let us know that infrastructure-as-code practices were not used.
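One of those EventBridge filters might look like the pattern below. The structure follows EventBridge's documented "AWS API Call via CloudTrail" event shape; the account ID is a placeholder:

```python
import json

PROD_ACCOUNT = "444455556666"  # placeholder production account ID

# EventBridge rule pattern matching EC2 launches recorded by CloudTrail
# in the production account; matches would invoke the quarantine Lambda.
RUN_INSTANCES_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventName": ["RunInstances"],
        "recipientAccountId": [PROD_ACCOUNT],
    },
}

print(json.dumps(RUN_INSTANCES_PATTERN, indent=2))
```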

Let's take a deep dive into one scheduled remediation playbook. In this playbook we try to find unused EBS volumes: volumes that have not been associated with an EC2 instance for the last 45 days, meaning the EBS volume is just sitting idle and no one is using it. First we scan for EBS volumes in all the active AWS accounts, then we look at the existing tags to see whether the last-seen tags are present. If not, that tells us the playbook is looking at this EBS volume for the first time, so we add an invariant tag defining which auto-remediation playbook this is, and we also create a last-seen tag with the current timestamp.

Next we look for any exception tags; we define one like security-ebs-allowlist equal to true. If those tags are present, it means the resource owner has acknowledged to us that the EBS volume is in use, so we can end the workflow there. If not, we have to parse the CloudTrail logs for AttachVolume and DetachVolume events, because in scheduled remediation we are not looking at the resource continuously; we look at it at a single point in time every 24 hours, so the resource could have been used in the past 24 hours without the playbook having visibility into it. So by analyzing the

CloudTrail logs, we can make the correct decision about whether this EBS volume is in use. If the volume has attach or detach events, we just update the last-seen tag. If it doesn't, we check whether the last-seen tag is older than 45 days, and if it is, we delete the volume, store the logs, and send a notification email to the admins.
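Putting the tag checks and the CloudTrail evidence together, the per-volume decision might look like this sketch (the tag keys are illustrative guesses at the convention the talk describes):

```python
from datetime import datetime, timedelta

LAST_SEEN_TAG = "roku-security:last-seen"   # illustrative tag keys
ALLOWLIST_TAG = "security-ebs-allowlist"
IDLE_CUTOFF = timedelta(days=45)

def ebs_decision(tags: dict, had_attach_detach: bool, now: datetime) -> str:
    """Return the playbook's action for one EBS volume."""
    if tags.get(ALLOWLIST_TAG) == "true":
        return "skip"                 # owner says the volume is in use
    if LAST_SEEN_TAG not in tags:
        return "tag-first-seen"       # first visit: add invariant + last-seen
    if had_attach_detach:
        return "update-last-seen"     # CloudTrail shows recent use
    last_seen = datetime.fromisoformat(tags[LAST_SEEN_TAG])
    if now - last_seen > IDLE_CUTOFF:
        return "delete-and-notify"    # idle > 45 days: delete, log, email
    return "wait"
```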

Okay, so let's go through how we are detecting manual actions outside of approved infrastructure-as-code workflows. Previously we used to analyze the user agent in the CloudTrail logs to infer whether the event was a manual action. Now we look at events initiated by employees, that is, where the role session name ends with @roku.com, because then we know there is a responsible user, making the alert actionable. We also look at whether the CloudTrail event has the readOnly flag unset, which means the event is a mutating action, a create, update, or delete type of event. That is much better than trying to think of everything to filter out of the List, Get, and Describe type of actions.
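That filter is a few lines over a CloudTrail record. The field names follow CloudTrail's schema; note that in real records readOnly can arrive as a string, which this sketch glosses over:

```python
def is_actionable_manual_event(event: dict, employee_suffix: str = "@roku.com") -> bool:
    """True when an event is a mutating action by an identifiable employee.

    readOnly=False means a create/update/delete style call, and a role
    session name ending in the corporate email suffix means a responsible
    human we can Slack, rather than a service principal.
    """
    if event.get("readOnly", True):
        return False                  # List/Get/Describe: not mutating
    arn = event.get("userIdentity", {}).get("arn", "")
    session_name = arn.rsplit("/", 1)[-1]
    return session_name.endswith(employee_suffix)
```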

Now let's go through an example of how we would quarantine an EC2 that is created manually. First we create an EventBridge filter to capture the RunInstances event in CloudTrail. We then filter on the user identity to know who spun up the EC2, and we apply a threshold of five EC2s spun up by the same user within a 15-minute time frame. If this threshold is exceeded we terminate the workflow; otherwise we invoke the EC2 quarantine Lambda playbook. The playbook involves tagging the EC2 as invariant and then quarantining it by restricting ingress traffic via security groups and attaching an IAM policy to deny access to AWS resources. After one month, if it still exists, we delete the instance. Now, the reason we have the 15-minute time frame is that we've had cases where a user was using a Python script on their laptop, with SSO credentials, to bulk spin up EC2s for one minute of data processing before deleting them.
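The threshold check is a simple sliding-window count; this sketch assumes we track launch timestamps per user:

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=15)
THRESHOLD = 5

def should_quarantine(user_launches: list, new_launch: datetime) -> bool:
    """Quarantine a single manual launch, but back off when the same user
    exceeds THRESHOLD launches within WINDOW: that pattern looks like
    scripted bulk provisioning, which is handled by other playbooks."""
    recent = [t for t in user_launches if new_launch - t <= WINDOW]
    return len(recent) + 1 <= THRESHOLD
```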

So even though that is technically a manual action, they have a common way of deployment; if they introduce other security vulnerabilities we can't live with, we will deal with those via other Lambda playbooks.

Okay, so let's go through how we would perform residual remediation after deleting an S3 bucket manually, which is a sign of bad cleanup and release management. First we create an EventBridge filter for the DeleteBucket event. If the event is detected, we invoke the S3 subdomain takeover Lambda playbook. The playbook first involves formulating all the possible domain names that bucket can have. If my bucket is named foo, the possible domains are foo.s3.amazonaws.com, foo.s3.<region>.amazonaws.com, and foo.s3-website-<region>.amazonaws.com. Now that we know all those domain names, we need to check both the Route 53 CNAMEs and the list of CloudFront distributions to see if any match, and if so, we delete them. Now, if your S3 bucket has a domain that is centrally managed by IaC, we need to extend our remediation efforts to the code repository, so this playbook is blurred between the secure-default track and the manual-remediation track. Our Route 53 adheres to the principle of secure defaults, ensuring there is no drift between the code Route 53 and the cloud Route 53, so if someone manually deletes a Route 53 record, it just fixes itself.
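Formulating the candidate domains is pure string work; a sketch following the three endpoint styles named above:

```python
def s3_takeover_candidates(bucket: str, regions: list) -> set:
    """All domain names a deleted bucket could have answered on, per the
    three S3 endpoint styles in the talk: legacy global, regional REST,
    and static-website hosting."""
    domains = {f"{bucket}.s3.amazonaws.com"}
    for region in regions:
        domains.add(f"{bucket}.s3.{region}.amazonaws.com")
        domains.add(f"{bucket}.s3-website-{region}.amazonaws.com")
    return domains
```

Each candidate is then checked against Route 53 CNAME targets and CloudFront origins, as described above.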

But if it's the resource itself that is being deleted, in this case the S3 bucket, we cannot perform a CLI-based fix, because the subdomain takeover vulnerability would just keep reopening, so we have to apply a Terraform fix. Same thing with an ELB deletion. I'll just go through the possible domain names that you may not be aware of. If your load balancer is named foo, you have three possible domain name combinations: foo.elb.amazonaws.com, foo.<region>.elb.amazonaws.com, and dualstack.foo.<region>.elb.amazonaws.com. The dualstack prefix means your load balancer supports IPv6; otherwise it's the same. You still need to check Route 53 and CloudFront, and apply your fix via Terraform or the CLI depending on your company. And if you want to solve subdomain takeover completely, you need to follow this same remediation playbook for other resources such as elastic IPs, CloudFront distributions, API gateways, and Elastic Beanstalk.

Okay, so let's walk through how we would reduce lateral movement. First we need to create IAM mappings between a principal and a resource. A principal can be an IAM user, role, or user group; a resource can be another IAM user, or it could be an EC2, EKS, or a Lambda. Every time you see a principal with admin IAM permissions over a resource, you create a mapping. Every time you see a principal with impersonation permissions to a resource, you create a mapping. You are now able to chain these mappings to detect lateral movement: an EC2 is associated with an IAM role that points to another role, which points to another role, which points to another role with admin IAM permissions. When I say admin IAM permissions, what I mean is CreateUser, CreatePolicy, wildcard IAM permissions. As you can probably guess, a resource such as an EC2 has no legitimate reason to create an IAM user. Okay, so for remediation, let's pull all the resources with admin IAM permissions, and all the resources that are chained to have admin IAM permissions.
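Chaining the mappings is a graph reachability problem. A small sketch, with mappings and names invented for illustration:

```python
from collections import deque

def resources_reaching_admin(edges: dict, admins: set, max_hops: int = 10) -> set:
    """Return every node that can reach an admin-privileged node within
    max_hops, following principal-to-resource mappings (admin or
    impersonation permissions) as directed edges."""
    reachable = set()
    for start in edges:
        queue, seen = deque([(start, 0)]), {start}
        while queue:
            node, hops = queue.popleft()
            if node in admins and node != start:
                reachable.add(start)   # a chain from start ends in admin
                break
            if hops < max_hops:
                for nxt in edges.get(node, []):
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append((nxt, hops + 1))
    return reachable
```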

You can do this for two, three, or ten hops. Then let's double-check in IAM Access Analyzer or CloudTrail whether those admin and impersonation permissions were actually legitimately used within the last year. If not, we create a new IAM policy with all those permissions stripped out, attach the new policy, and detach the old policy; and you can do this for any IAM permissions. Now, as we scale up to a large number of Lambda playbooks, managing them can get out of hand, and you may have observed there are certain recurring modules within these playbooks. This is where you can

consider wrapping your playbooks in Argo Workflows or any orchestration tool, to just ingest your misconfigurations and, boom, boom, boom, remediate.

Okay, so you may be thinking: how is this different from AWS Config? Config is a service that keeps track of the configuration, the metadata, of your AWS resources. It has a predefined list of Config rules to check against, but it also provides a means to remediate via AWS Systems Manager documents, or SSM documents. To reiterate our point from the beginning: ideally you should have no use for Config, and you should instead use infrastructure as code. If you're using IaC, you know the resources that are being managed and can correct the configuration there. And if you need to do configuration tracking, you can do it much more cheaply with EventBridge instead of Config. But let's learn from our mistake and go through some of the pain points that we saw. Data loss: we've had many missed configuration changes because of Config. Config also makes a lot of API calls running all of its rules against your resources, and that can actually stop some of your more business-critical API requests from working. And if you write your own remediation playbooks, you need to make sure you deploy them to all regions in all accounts if you don't want to miss anything. And cost: the cost is way too high for just logging configuration changes. Note I said configuration changes, not logging actions or events, which are more useful for incident-response type scenarios. Config is also prone to cost spikes in the event of restart loops, or if your deployment method involves a lot of creating, recreating, and deleting resources. You may also be thinking: how is this different from Cloud Custodian? Actually, our framework, like Cloud Custodian, Config, and even Twilio's SOCless, operates on the same principle; it's just that the remediation playbooks are defined in a different format, whether that's a YAML file, an SSM document, or some other wrapper file format. The key difference between us and the other frameworks is in the playbooks. Okay, so let's say you buy a

cloud tool. You get millions of findings, but then you get stuck: what do I remediate? What do I remediate that will actually make a significant impact? And then you start to notice that the only thing people fix, the only thing people jump up and down about, is public S3 buckets. So you basically bought a tool to fix public S3 buckets. I mean, it's the same thing with buying a SOAR or some orchestration tool. What did you do with it? The only thing you automated is phishing. Okay, so then the other problem is we don't know what to fix. We just don't know how to prioritize, and when we do know how to prioritize, we know how to fix it for one system, but not necessarily systemically for the whole organization. So this is where we come in. I'm going to list out exactly what you can remediate that will make an impact, and it revolves around cost optimization, minimizing attack surface, and reducing the blast radius. I've broken the list of playbooks into different categories, so depending on your security maturity you can implement them all, start from the basics, or just pick and choose based on the needs of your company. Okay, so let's go through some, not all, of the security configurations that would make an impact. Anything related to IMDSv1, in your EC2s, auto scaling groups, AMIs, EKS: this is the

single best thing you can do to reduce SSRF attacks. Note that AWS now defaults to IMDSv2 for new instances, so it's important to clean up your tech debt now. For Route 53: if the only records you have for a subdomain are NS records, nothing else, no TXT records, no CNAMEs, no A records, just delete the whole thing. This mitigates the risk of a future DNS zone takeover. To save costs, anything you can do with RDS is the biggest cost saving. If there have been no connections within 45 days, delete it; realistically, if people are keeping it around for storage, they don't need to make a connection. So we snapshot it, delete the database, and if the snapshot is not used within a year, we delete the snapshot. Same thing with Redshift, and with stopped EC2s: snapshot and delete after 45 days, and delete the snapshot if it's not used within a year. For anything unassociated, such as elastic IPs, load balancers, and EBS volumes: delete it. For EC2s that are underutilized, it's a little harder. For newly launched EC2s we enable CloudWatch metrics and analyze whether the CPU usage is less than 10% and the network I/O is less than 5 megabytes over the course of 45 days; if that's the case, we'll delete it, because it was probably forgotten. Now, for existing EC2s, it's still hard to tell whether they're needed.
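The idleness check for newly launched instances reduces to a couple of comparisons; the thresholds are the ones quoted in the talk:

```python
CPU_IDLE_PCT = 10.0                 # below 10% average CPU
NET_IDLE_BYTES = 5 * 1024 * 1024    # below 5 MB of network I/O
OBSERVATION_DAYS = 45

def looks_forgotten(avg_cpu_pct: float, total_net_bytes: int, days_observed: int) -> bool:
    """A newly launched EC2 becomes a deletion candidate once CloudWatch
    metrics have covered the full 45-day window and both CPU and network
    stayed under the idle thresholds."""
    return (days_observed >= OBSERVATION_DAYS
            and avg_cpu_pct < CPU_IDLE_PCT
            and total_net_bytes < NET_IDLE_BYTES)
```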

An instance may be running a cron job requiring no user connection, so for existing EC2s it's safer to just snapshot and relaunch on a more optimized instance type, which AWS luckily gives us a recommendation for. Threat detection: we always have the possibility of false positives, so we make sure to require Slack user responses. Some high-signal, low-noise things you can detect are CloudShell (if someone is downloading secrets, more likely than not it's an attacker) and honeypots: in our cloud environment we sprinkle AWS keys everywhere, so when they are triggered it's probably a human, and we quarantine the EC2 and the IAM principal. Now, your indicators of compromise for threat detection can quickly go out of date, so it's important to focus on the attack tactics that cause the most damage, such as credential exfiltration and attack paths; we went through this with lateral movement. Standards: if you have a tagging standard, or are mature enough to enforce a tagging standard, including for public resources, remediate when they deviate. Invariants: if you are further along with your infrastructure-as-code enforcement, then control any sort of manual create and delete events.

Metrics. Okay, so what I want you to get out of this slide is that while we were trying to remediate something like IMDSv1, we noticed a lot of deployment methods. Fixing IMDSv1 via the CLI would solve 43% of issues, whereas CloudFormation was 50% and Terraform was 2%. So make sure you are targeting your remediation playbooks at the one or two methods that your company uses the most. But in the case of Terraform, you should note that 2% may not seem like a big number, but it was just the one change that affected a large number of clusters all at once; that's the good thing about golden standards. In terms of the biggest monthly cost savings, it was RDS, and we had 33% of our databases sitting idle. Note that some of these databases only saved us 12 bucks, whereas others were 2K a month; your actual savings may vary. But note that we saved this much even though we had a dedicated cloud financial operations engineer on the

team. Let's compare this to what Config cost us, which was 50K month over month. And as you see, as time goes by, even as you reduce your attack surface, you'll see diminishing costs as the attack surface decreases; this is in comparison to Lambda, which was consistently costing us just a dollar a month. Okay, challenges: company buy-in, but we'll skip over that. And prevention versus remediation: prevention is the only thing you can do to significantly close the gap between your huge attack surface and your small remediation capacity, and one day remediation shouldn't fire unless something goes really, really wrong. Lessons learned: tagging is our unsung hero. By tagging something and then looking for it in Cost Explorer, we are able to estimate the savings of our remediation efforts. And when it comes to remediation, keep a close eye on surges, because they tell us to reassess our remediation strategies; maybe fixes are getting rolled back for whatever reason, or someone is persisting in spinning up a non-compliant resource. And finally, even for something as universal as encryption and TLS standards, there are still edge cases. Future work: AI, please.

In summary: you can do this. You can remediate. You have the framework, you have the playbooks, so don't stop your automation at Jira, and don't stop at Slack. Go beyond and actually fix the issue. But first solve as many problems as you can with secure defaults, and then focus on remediating invariants to those defaults, and don't feel bound by AWS built-in services; Config is expensive. And just know that we can't fix everything. Bugs will always exist. And with that, as they say in the movies, it's time to fade to black, unless it's a PowerPoint, in which case I'm going to click and show. Okay, thank you so much. We have a few questions here on the Slido. I note that we are at time, but there's a 30-minute break before the next session, so we can use that time to take some questions. First question: what was your

biggest pain in using SCPs? So yes, the biggest one is usable error messages. You think SCPs are going to solve all your problems, but then when the user tries to spin something up, how can we even format the error so that it's more usable? There's no way, at this moment. What else? Oh, and SCPs are not great if you have a lot of edge cases. If this one account needs this one region, you can't really use SCPs; it's all or nothing. If you have edge cases, and ideally you want to tag and say "I want to exclude this edge case," you can't. It's all or nothing.

Cool, next question: do you limit and control egress traffic as well? Yes, and we do that via security groups, and as a double measure we'll attach an IAM policy to ensure it doesn't talk to any other AWS services. And finally: did you encounter many instances where auto-applying a fix broke a service, and how confident would you be doing this in production without manual review and approval? All these questions were anonymous, by the way. That was a good one. So, I know I was kind of willy-nilly when I said delete this, delete that, but that was actually not the case. All of these remediation playbooks took six months to a year each. So first we say: okay, company, we are now enforcing this new standard, whatever the compliance item is, and you can't do this anymore. We then send them a list of all the resources that are non-compliant: you need to review this, otherwise we're going to fix it ourselves. And then we send weekly emails as a reminder: hey, this is coming up, your thing is going to be fixed, so please, please, please check, because there may be false positives. And then even on the day of the auto-remediation we don't fix all of it, only for dev, because as you can guess, production is not great to touch. We say we're going to fix, but we don't touch production, and we need to constantly hound, hound, hound: hey, please review, is it okay, is it okay? Each of these playbooks was just like that, and eventually you get to the stage of: okay, we didn't break anything, and you can trust us to continue to do more remediation. All right, thank you very much. Lily and Lakshmanan, thank you so much for this talk, really appreciate it. Big thanks to our sponsors, especially Socket Security, who donated these very nice speaker gifts.