← All talks

Overwatch: A serverless approach to orchestrating your security automation

BSidesSF · 202327:371.4K viewsPublished 2023-05Watch on YouTube ↗
Speakers
Tags
CategoryTechnical
StyleTalk
About this talk
Overwatch: A serverless approach to orchestrating your security automation Sanchay Jaipuriyar We've gleaned many benefits from shifting security left, but some problems it has brought upon security teams include managing a plethora of CI files, or actively maintaining infra across a lean team. To simplify this we built Overwatch, our serverless security orchestration approach written in Go. https://bsidessf2023.sched.com/event/1Hzt7/overwatch-a-serverless-approach-to-orchestrating-your-security-automation
Show transcript [en]

hey everyone theater15 let's give it up for Sanchez our next speaker hey everyone thanks for coming to this talk about OverWatch my name is Sanchez I'm super excited to be here uh b-sides is where I've learned I think most of my security Knowledge from so it's very humbling and you know I'm extremely honored to be able to speak here today um before I actually talk about OverWatch or any of these other buzzwords that you see on the slide just with a show of hands how many of you have heard of the phrase shift left before okay yeah I the lights kind of bright but I think I saw pretty much everyone's hands go up and obviously shift left is this very powerful concept that's been used in our industry quite a bit and it's generally just to refresh everyone's knowledge it's this idea that the earlier or left that we go into the software development life cycle um the cheaper uh security bugs are to fix and you know just the more traction we get and that's fundamentally why we're here right we're trying to stop security problems before they become a real problem post-production and last I checked which was this morning AI has not completely eliminated our need for shift left just yet so I think this talk is still relevant today um and it was about you know some of the buzzwords on that slide reference security automation right and when we're talking about security Automation in you know while referring to shift left typically we're talking about the pull request stage uh it's sort of a phase where developers are pushing some code up they're looking for some feedback and we can kind of inject our security feedback somewhere over there and so today we'll talk a little bit about that pull request phase and so what's a practical implementation of shift left add a pull request phase well it could look something like this a simple GitHub actions file that runs some you know my Custom Security Automation and in 15 lines of yaml we have shifted left right we have a way to funnel feedback to our awesome developers that are just waiting to hear about all these security issues that you know we're detecting in their pipelines but what's what's one of the challenges with this well this is installed in every repository and security is an ever-evolving landscape so as things inevitably change we have to go modify this in every single repository that we've gone and install this in and that's not the only problem sometimes we have one version of our automation that needs to run on a set of repositories and a different version on another and again we're managing managing all of this on a repository by repository basis right which can obviously get a little bit dicey you might say hey that's fine I only have 10 repositories and I'd say God I wish I was in your shoes um because standard positives I think this approach makes complete sense but as our organizations grow what happens when we hit 100 what happens when we hit a thousand and at chime we're closer to that number and also we understand that there's organizations that vastly outsize us too right it just can get pretty dicey I think again we're all bought into this idea of shift left but I don't think we all signed the dotted line that said hey I want to be a glorified pipeline manager for the rest of my security career um in addition to this you know as we're you know adding different forms of security automation as we identify all these Cool Tools that we want to just inform developers at their pull request about this is unfortunately what a triaging experience can look like for some people it's you know navigating through a CI Pipeline and parsing the output figuring out if something went wrong or if the tool just broke and for other people it's like logging into single sign-on for the first time of that tool finding their dashboard and the necessary security vulnerabilities and this is obviously just the frustrating experience right so we're trying to combine these two ideas this like non-unified vulnerability management you know way of surfacing issues at the pull request level and just this growing SRE devops type busy work that we didn't really sign up for we're like can we do something better because we feel like we can so let's quickly throw on our product manager hats and go like what can we build to kind of solve some of these problems and so uh one approach which obviously the paypath concept very well popularized by Netflix we're saying can we use that concept of the pave path but apply it to enabling and managing our security automation right create One path where we're not reprimanding people for not using this what we want to do is create an attractive enough solution that people will voluntarily come and choose our paid path to spin up their own forms of Automation and if we truly are able to get a paid up and running then we might have a chance of unifying and kind of having an intuitive triaging experience that can come of it you know of like one single path so we're going to try to solve that um we're also trying to going to use some other aspects right like automation should be isolated from each other we don't want it to impact each other we also want to ensure that we're lowering the operational burden we don't want more responsibilities to come so really any failures that come from this should be async we don't want anyone to be on call and lastly you know by default we don't want to play whack-a-mole with our GitHub inventory it's constantly changing so by default everything should just be covered new and existing and so that last point is actually the simplest one to address first if you're on GitHub and familiar with GitHub apps you can just kind of install this org wide and you can get events from everywhere so this part's fine if you're not familiar with GitHub apps uh basically think of them as glorified service accounts with some amount of permissions you know fine grade permissions you can have it just do very specific things like on pull requests and receive events um just related to pull requests so as we're receiving all these events like for example maybe we receive an event that a certain developer has pushed to commit to a pull request and now we want to do something about it well we have to send these events somewhere so what we're going to do is spin up a very simple go web server and we're going to send events to like this GitHub events endpoint right so we have some place that all these events are landing and and as a name suggests it says that blue box says orchestrator right so it's going to do some orchestration like things it's going to decide for a given event should I run any forms of automation or not and these forms of automation as you saw from our requirements very easily kind of lead us to this Lambda route they're isolated there's less Ops with them they're ephemeral they kind of fit that requirements very well and as we're going through this remember we're trying to reduce the burden on anybody spinning up or using our pave path what we're going to do is take any common functionality so think of parsing a git diff or cloning some code or you know manipulating in any way that is shared across multiple forms of automation we're going to have our orchestrator just handle that don't put it on the lambdas let's take the burden on the platform and just create that pave path for everyone to walk through there's two names of lambdas up there somewhere up and break man but they're not the only ones we've integrated we have six today one around the corner and more coming and I promise this isn't some type of flex of like we have more automation than you like that's not the purpose of this at all this list will grow and reduce in size as required we're just chasing high confidence stuff that fits into our program but it needs to be very flexible this is just transparency um a little bit more insight into this orchestrator itself uh I mentioned you know four given requests should I trigger some automation or should I not um so some examples are like there's depend about pull requests and we get tons of them we don't really need to trigger any automation for that um any security automation for that because really they're they're actually opening pull requests to solve security problems usually in the first place um and so this orchestrator is going to filter out these events and decide for any given event do I actually need to invoke a given Lambda or not that's one of its core responsibilities and everyone's definition of scale is you know something slightly different so again for transparency we receive you know like roughly 22 000 events per day uh which is plenty for the go web server to handle I mean you can handle that just fine spikes can look like you know 3x 4x's volume mostly depending on the GitHub activity that's coming in and roughly 15 of these events are worth acting on like at least one Lambda will get triggered for any of these given you know for this volume of events basically and as you can see from the slide they're just very cheap to run in case you were wondering about cost um so this is all great okay for a given event it's deciding what to do but the orchestrator still needs to know what lambdas are available which ones to trigger and so kind of emulating that CI experience we have one yaml file essentially inside the OverWatch repository where you can Define and place your lambdas that you want to spin up these have additional certain options you know for example for Brakeman or sembrap you can add extensions you know certain file names that you want to trigger or Lambda on these are in addition to some of the core you know event parsing that our orchestrator is already doing but remember we're only defining this in one place one yaml file and you can deploy your Lambda out to our entire fleet through this flow and so going back to our diagram a request came in orchestrator decided okay maybe this one's worth acting on you know based on the configurations we triggered some forms of automation some of these lambdas and using some shared flow like the S3 flow and then these lambdas perform their operations they they go ahead and you know drop a in this case a report to another S3 bucket so as they're finishing they're they're dropping their reports and as those reports are being generated we're going to go respond in a sort of unified way back to the apis we've integrated with in this case GitHub and slack these are the two that we've integrated with yours might look different but just think of these as replaceable apis in case anything ever changed and uh fortunately unfortunately this is not a system designed one-on-one class so I'm not going to talk about every single failure point that could go wrong but just think about this diagram as anything should be allowed to feel here we're not trying to add burden or Ops work on any you know individual or engineer that is maintaining any part of this diagram so just add you know your SNS topics or retries as you see fit over here and that's something we can talk about later too okay so we talked about unification of alerts a lot so what does that actually look like so here's an example of an alert coming from the semcraft Lambda it's reported inside GitHub and has a pretty intuitive experience there are some triage buttons here if you want to know how to fix the alert you can click on show more details now a lot of this is actually powered by our usage of GitHub Advanced security we are partnered with them and we're using these API calls so it's really not super magical that this is this alert is showing up but what I think is cool is that for every Lambda that you integrate with our platform there is a unified playing field so now our Breakman alerts our mob SF alerts and every other Lambda you saw integrated they can all report alerts in the exact same way you're not logging into a multiple dashboards and you're all trying to find how to fix vulnerabilities if you've seen even one of these alerts you probably know how to triage them with you know some uniqueness to to each of the lambdas in addition we want to take the Ops down on our security team we mentioned that and so our security team generally consumes alerts in slack and these are also looking strikingly familiar at this point and so here's alerts from two of our lambdas and Piers alerts from two other lambdas and they're all using a very default template that has been defined by our orchestrator and like anything in this flow you can override it if you need to so need something custom feel free to go at it but here's the pave path or default mode of operation um one note about those buttons our source of Truth we're using as GitHub so any triage you do whether it's a security engineer on the slack side or if it's somebody triaging on the GitHub side fundamentally they're mapping to the exact same alert we're not syncing alerts across multiple sources and so this lets us do that and any on fixed alerts or anything again they're living in GitHub it's our source of Truth for code related you know security vulnerabilities and that's how we use that and so you might say okay this is all fine cool that you found a way to slightly unify your alerts um but I haven't dealt with Lambda before and like what does that take I kind of miss my 15 lines of of yaml at this point um so let's let's kind of break this down like what's in this Lambda how do you set it up so we've gotten to the point where all you have to do is add basically one line which is your Lambda function name with some optional parameters like timeout or memory it's going to automatically configure all the necessary resources and permissions that Lambda needs to do its job and just as a reminder its job is essentially pulling some code from S3 running its automation whatever that means and then uploading a report to S3 if it kind of fits that mold and so again if you're not familiar with the code here this is totally fine there's basically just three functions mapping to those exact operations I just destroyed get code do some automation dump the code I mean sorry dump the report and so the only bit that's actually unique is that section in the middle that's the one that all Lambda authors really control and they can do whatever they want in sem group's example it could be as simple as this one line let's look at the difference in Breakman it's just a different line but that run scanner interface is essentially identical and depending on your automation needs it's in scale upper up or down but the point is it's fairly simple you have the terraform module and then you have this one line that you're fundamentally controlling everything else is shared and imported and you kind of probably need a way to test this right like hey I have a Lambda I want to see if it's working with the orchestrator I want to see how the alerts are showing up or what it looks like in GitHub and so if you're familiar with versions and aliases and Lambda this will be a little bit obvious too but if not just know that there's a way that every time you push a commit on your test Branch or your custom Branch within your Lambda repository we are creating versions and aliases of your Lambda that you can pin to and you can use so as you merge to main or you know your main production Branch there's a prod version now available and you can you know in the orchestrator remember that one yaml setting file not just the Lambda function name you can also add this custom version or Alias to pin to that exact version that you want to run with additional options and so why is this powerful why am I telling you about all these versions and aliases well just because we found a way to spam GitHub in the pull request comments it's not what we want to do with all our lambdas so a typical rollout strategy would probably look like hey I want to take the test version of my Lambda I want to run it in slack for a while feel confident about the rules you know maybe a test version on my test Branch once I'm confident I can merge that in I have a prod version now available and maybe if I'm happy I can turn on GitHub commenting for that and that's a little bit more thoughtful that is not going to ruin trust with your engineering counterparts and gives you a lot to play with one thing to note is you can actually run multiple versions of the same Lambda right like your prod Lambda and a test Lambda all in one place in parallel it's it's all up to you we're just giving all the options to everyone that wants to roll out this Automation and once again remember that anytime you define anything here it's getting deployed Fleet wide across all your GitHub repositories and so all this also might sound great but really let's highlight what is happening by separating out all these automations like by not rolling it up into like one golden image or keeping them all playing together um really the brakeman Lambda can just fail and it should not impact anything else right and so typically if you were rolling a Breakman check let's say in CI uh this is a flag or like you know a pipeline failure that a developer might see um if you know even if github's down or breakman's not operating well anymore and we really don't want to add this cognitive load to developers in GitHub at all if it's something that they can't fix if it's something that only the owners of Breakman or that Lambda can fix like why even bother people in in the pull requests so because we control every API call in this entire flow we can actually just suppress it we can suppress from all the pipeline checks that hey Breakman actually never even ran just check out the other stuff that did if there's any action to take on that and and this doesn't mean we're suppressing the alert from Brakeman at all what we're doing is reducing the burden on the team managing brake man because we're filling up their dashboard so here's a default data dog dashboard for the Breakman Lambda they can come and debug this as they see fit in their own time they're not on call they don't have to respond immediately and they're not blocking any CI pipelines in GitHub um and so one of the cool things of setting up a flow like this is we actually have an extremely talented team at chime working at service to service authorization authentication if you don't know what that means that's totally fine just know that they had a custom enough way that they couldn't just roll up rules in semcraft it required spinning up their own automation very custom configuration checks that had a lot of dynamic elements to it and really what they wanted to do is nudge developers to use the right authorization schemes over time they wanted slack alerts for every invalid configuration and you know spoiler alert do we have a tool or a pave path that can kind of solve their problem for them and that's exactly what they chose so we rolled their form of automation fleetwide via OverWatch and it's not exactly an appsec example right it's not like running semcraft or Brakeman this is slightly modified but still fits the pave path model going even further and this one excited me a lot is we had engineering teams outside of our security team reaching out to us they were like hey we heard about a tool that can nudge people to do the right thing in a different direction it can give us slack alerts you can deploy it Fleet wide on you know just one yaml configuration we're deprecating a database and we don't want people to be using this database pattern anymore can we add a rule to OverWatch of course we said yes and so here's a Sam rep rule that basically they added to dynamically you know prevent people from changing a deprecated path and they had a rollout strategy for them and that team completely owns that rule but they're not taking any SRE burden they're not they don't have to flag all these anti-patterns or spin up their