
Scalable security: how to win friends and not burn out everyone

BSidesSF · 2023 · 22:35 · 319 views · Published 2023-05
Style: Talk
About this talk
Eric Chiang, Brandon Weeks

Brandon and Eric have been involved in numerous security efforts over the last 5 years at Google. Some successfully, others… less so. Hear lessons learned scaling processes for lots of users, pissing off as few coworkers as possible, and (when we’re lucky) doing a little bit of security.

https://bsidessf2023.sched.com/event/1JrCm/scalable-security-how-to-win-friends-and-not-burn-out-everyone
Transcript [en]

Everyone, let's give it up for our two presenters to wrap up the evening.

Hello everyone, my name is Eric, this is Brandon, helping to introduce things. Welcome to Scalable Security: How to Win Friends and Not Burn Out Everyone. As I said, I'm Eric, this is Brandon. We work in the Privacy, Safety and Security org at Google. That's a really big org, so that doesn't really mean very much. I work on internal firewall management, and you do fun stuff with hardware security and device attestation.

I think one thing that we feel, and that hopefully some of you do too, is that we're not really on the red team or the blue team or the purple team or the orange team or whatever color you've picked today. We're on that team that just gets given security problems and is told, "Hi, could you please go fix that?" And we were really lucky, because we work at a really big organization, so they say, "Can you go fix this for hundreds of thousands of people? And then can you do this again? And again and again and again?"

Over the last five years (you're almost at seven, which is insane) we've done this a bunch. We've had some mixed success, but that's okay. We've only burnt out a few times, and we're always in a constant state of either burnt out or not burnt out, but that's okay.

Anyway, this talk is going to be about some lessons we've learned burning out and trying to deploy controls, and we're going to be all confident and say, "Yeah, this is how you don't do this."

The agenda today: first we're going to talk about what we mean by scale. I recognize that's a buzzword, but hopefully that's why you're all here, because of buzzwords. Then we'll talk through some general themes we think are important when you're doing this kind of grunt work of taking a control, rolling it out, and trying to piss off as few people as possible. So we're going to
talk about measurement, good controls and bad controls, being proactive and not reactive, and communication. All of this is going to get repeated, so it's not worth going over again. Anyway, Brandon.

What do I actually mean by security at scale? Google is a big company. We have hundreds of thousands of employees, billions of users, dozens of chat apps, and the company has grown incredibly quickly. In software engineering at Google, we don't ask, "Can you scale your product by 2x or 3x?" We ask, "Can you scale it 10x or 100x?" That's what we mean by scale in this talk.

This presentation is about how scale applies specifically to security. Could your team handle 10x more work? If you review every feature before it launches, or every SaaS app before it's deployed to your company, could you handle 10x more of those? If you have to manually process every vulnerability on one platform, could you handle four more of them? If users reach out to you every time you break them and they can't do their work, could you handle having 10x more employees? We once ran a remediation where we were getting hundreds of tickets a day from users requesting exceptions, and that was a good remediation.

These problems aren't unique to security. Every team at Google has these problems. This slide deck had to be reviewed by a PR team that has to review every slide deck ever presented.

If your workload increases with the number of users you have, the number of devices in your organization, or the number of vulnerabilities that are discovered, you won't be able to keep up without burning out. You cannot work harder to keep up with scale. Your team cannot hire fast enough to keep up with scale. If you try to be a hero, you will burn out. Your team cannot adapt to scale by itself. It turns out talking to people is actually really important, and as an organization scales, your job quickly becomes convincing other people to do security, not actually doing security yourself.

Everything in this presentation is
going to seem super obvious, and we understand that. But actually doing these things consistently and well is much harder than it seems. So we're going to talk about measurement, controls, being proactive versus reactive, and strategies for communicating with your users.

And again, to reiterate the point about talking about simple things rather than complicated things: I don't know what my idea of what a security engineer did was before I got this job, but afterwards it's definitely doing Excel spreadsheets and writing SQL queries. Measurement turns out to be a big part of basically all we do: figuring out how bad this is going to be, or how much we're going to mess up people's days. While it seems super simple, the number of times we've had people come to us and not be able to answer these questions means it bears repeating. One second, because I can't see. All right.

Before you begin on your security journey, you should be able to answer very basic questions, like: what is the size of your problem? If you have some app that is malicious and being installed on your devices, do you know how many devices it's installed on? Is it two? Is it 500? If you're looking to turn down a feature because it's insecure, how many people used it in the last 90 days? I've been in a million meetings where people are talking past each other, where one query would have solved that problem immediately.

And of course, once you have that data, it's never enough to just get the initial value. You have to go deeper. Is 300 a big number in your organization? For us it might not be, or it might. And then, are you impacting a single team or a role? When we're doing analysis, we're never just thinking, "What is the number?" It's, "Is this number all from one team? Is it from a particular product that somebody's working on?" Answering these
questions will let you know: can you just go talk to that one team and solve the problem, or is this something where you're going to have to involve lots and lots of people?

And then you're just going to hit limits at some point. I forget when this was, but we were working with somebody who was giving a presentation to us and said, "Look, I know that this impacts somewhere between 10,000 and 200,000 people at the company, and I know that's a big range, but at least it's a range, and at least it's somewhere you can go." And finally, you're just not going to have data at some point, and that's okay. We've had plenty of scenarios where you're presenting something to a VP and you say, "Hi, I need a little bit more time to get better data to figure out how many people we're going to break," and the VP says, "Look, just break them. We'll figure it out later." That also happens.

There's also a shocking amount of actual engineering work that goes into getting this data. I have written controls that have never been deployed anywhere in enforcing mode, but we write them in audit mode to see, if we deployed this and tried to turn up the pain, how many people would actually come and get mad at us. That is really informative, because it puts you in the firing line without actually putting you in the firing line.

And we do not roll out controls all at once. You take one percent of the company, you roll it out, then two percent, ten percent, and so on. This is great because it gives you a period where the size is actually manageable, where you can go talk to every person that you might be impacting.

A little trick we've learned is that if you just change the defaults, you'd be surprised. If you're a company that's hiring a lot, or if you're rotating hardware, and you just flip the default for a particular control and don't tell anyone (or tell people, but whatever),
if you wait a couple of years, you might get 80% compliance. So don't do the hard things before you do the easy things. And speaking of good controls, or, Brandon, more importantly, bad controls.

This is a bad control. Telling users not to click on suspicious links doesn't work. But what is a good control? It may seem obvious, but having a structured way to evaluate what is actually a good control in your organization will help you avoid inadvertently deploying bad controls.

So what makes a good control? You cannot rely on your users to do the right thing. Telling users "don't click on suspicious links" doesn't work. Your users don't know what suspicious links are. Having a security policy that tells users not to use unapproved VPNs doesn't work. Your users don't know what the definition of an unapproved VPN is. And to that point, if your users can install unapproved VPNs in the first place, what are you even doing? If you actually care about something, enforce it. Don't shift the blame or the burden to the users.

If deploying a new control requires talking to everyone it impacts, you won't be able to keep up. Spend the time up front to develop documentation and tools to help users migrate themselves.

Security controls inevitably increase over time. In site reliability engineering, they have a concept of an error budget, where they predetermine how much pain they're willing to inflict on users before they take action. Consider having a control budget: how much friction are you willing to cause your users before you have to change something? For example, if your team is going to deploy yet another agent onto the platforms, how does this impact performance, and is there another agent you can get rid of instead?

Be wary of controls that have to be updated every time software changes. Examples of this are kernel extensions, SELinux policies, and application allow lists. The ecosystem moves faster than you can update these policies.

And when you introduce new controls, eventually your team
will have more controls than it can maintain, so be willing to turn down controls. Some of them probably don't matter that much. Consider using controls that are integrated into the platforms you're securing instead of buying controls from vendors or building tools yourself. And if you can't do either of those things, find another team that's in a better position to maintain these controls, like your IT team.

And just on the last slide: I think we've ripped out more controls from Google than we've ever written, to give you a sense of the scale of that problem, for us at least.

Inside of Google we have this motto that goes around. It's from OG, Aurora-era Google: "secure by default, insecure by exception." I really like this because it packs quite a lot into just six words, so we're going to go through it.

The first part, secure by default: if you're going to do something at a company, say you're going to deploy something to production, or you're going to share some information with a user or a customer, and you search the wiki and get a top answer that says "do it this way," that had better be secure. If you're not working on securing those well-lit paths, you're doing it wrong. Those should always be the safest way to do something. The easiest way should be the safest way.

But inevitably your controls will break at some point, or get to the point where a user needs to do something to do their job and the control blocks them, and you still need to do your job. This is where we get into the idea of exceptions. Exceptions at Google are actual things that are built into every control we have. The idea is that if we implement a control, we always have a mechanism to turn it off for a particular set of use cases. These are not just an email thread from a security engineer that you then quote later on. These are going to be tracked
in group management systems somewhere. We have an idea of when they expire. We also know what the justification for that exception was. So when we implement a control, we might have a bug, and a lot of people will need to get exempted. We will track that, so when we fix the bug we can go back to those people and fix it up. And again, when we're rolling out controls, we try to hand out exceptions like candy. If you are implementing a control and you know somebody will be blocked by it, the expectation is that you give them an exception ahead of time and clean it up later. So, again, always trying to think user first about these kinds of things.

But there are some anti-patterns with this. This is a slide from France, actually. I don't think these things have a name, but these little metal things on the concrete are basically there to stop people from either sleeping on it or sitting on it. If you've lived in San Francisco for even five seconds, you've probably seen stuff like this around. This has a general term: hostile architecture. Hostile architecture is architecture that serves no purpose other than to discourage something. It's not there to add value; the value comes from preventing someone from doing something.

Exceptions in general feel bad, because you're turning off a control. It feels like a security risk, and it is. But I've seen a lot of teams effectively give users grief when they come asking for a security exception to do their job. It's very important, mentally, when you're thinking about this: exceptions are a good thing. They let you roll out a control without worrying about every detail. And when a user is asking you for one, that should not be the time when you're having a security conversation. Again, this does not mean that you should give everyone an exception. You should be clear when you don't. But definitely don't
treat this as an opportunity to cause your users pain. Exceptions are a good thing. People who need exceptions should get them.

Proactive, not reactive. Some security incidents are genuinely "oh" moments. This is one of them. We both spent a week over Christmas 2021 responding to Log4j, and that's okay. Attempting to plan for these types of incidents isn't worth your time. However, every day there's another 9.2 CVSS vulnerability, there's another breach at one of your vendors, or there's another super scary tweet that your leadership saw. These are not "oh" moments. They're your job, and treating them as such will cause you to burn out.

Be proactive. A threat model tells you what risks really matter, but more importantly, it tells you what risks don't matter. When your VP asks about that scary tweet, use telemetry to put the risk in context so you can get back to proactively defending your organization instead of dealing with this. Establish remediation SLOs before vulnerabilities happen. For everything that's not Log4j, don't panic. Stick to your SLOs. Pushing your team or partner teams to patch faster will lead to burnout.

There are security teams, like incident response teams, that are built to be constantly handling emergencies. If you aren't on such a team, constantly jumping from emergency to emergency will burn you out. If a problem isn't worth involving your incident response team, it's not worth working nights or weekends. Go home, Eric.

And then finally we're going to talk about communication. I'm really lucky to work somewhere where we hire a bunch of smart people. It's actually kind of a gift and a curse, but that's a different talk. I've been in rooms where I have the security title and I am the least qualified person to talk about security for a particular problem, and that's okay. But what this has really taught us is that these teams don't put up with your BS of "this is important because security," which, while that obviously
sounds like a straw man, that is the essential argument for a lot of the things I sometimes hear from security teams. "Why is this important?" "Because security." We want to get you past "because security." One of my biggest pet peeves is when anyone says "principle of least privilege" or "mandatory access control versus discretionary access control." If you ever say that, I'm muting the thread and I'm not listening to you.

And then, be transparent. Why is more important than how. Expressing what you're trying to do, or what you're trying to protect, is more important than how you are trying to do it. Going back to those rooms, I've said, "I want to do this control because I'm scared about this data," and some senior SRE shows up and says, "Oh, I can get to that data five different ways." That's actually a good conversation, because you've expressed to them what you're trying to do rather than how to do it.

And then, be consistent. It's really hard on these teams, because you're going to have to hire junior engineers occasionally, and they're going to have to sit in those meetings too. If they don't have a reason other than "because security," it's hard for them to have the same sense of consistency with those teams and get the same outcome. You cannot be the person triaging every single conversation; you're going to have to go on vacation sometime. And at a more meta level, if you can't explain why something is important, it might not be. So this is actually a really good exercise for yourself, to figure out why this is important to us at all.

And then finally, be rigorous. Our friend Guillaume came over from Mozilla to our team and introduced this thing called the rapid risk assessment. This is a two-page template that says, here's how you write up a security assessment. The craziest thing about it is that it just asks: what is the data you're concerned about, what are some examples of how an
attacker might get around this control and get access to that data, what are the mitigations, what are the things we should get to eventually, and what is "over my dead body." That doesn't seem like a lot, but it is dramatically different just to have some structure, particularly for any of your users or any of your partner teams. That drives conversations so much better. So use a framework if you can.

We've found that security is as much about workplace culture as it is about actually defending the organization. Users in different types of organizations have radically different expectations. As an extreme example, the NSA gets away with sticking their users in a windowless box with no cell phone. Google cannot get away with putting our users in a windowless box. Getting the balance of security and culture right can sometimes be incredibly difficult, and getting it wrong can lead to backlashes that destroy your company culture and burn out your team.

That leads me to my next point: you can inadvertently have a huge impact on your company culture just by deploying security controls. As an example, if you start requiring your users to use MDM on their personal devices, this fundamentally changes their relationship with the company. That isn't necessarily a bad thing, but be honest with yourself when you're rolling out security controls that have cultural context in your organization.

And yeah, I think that's basically the rest of the talk. Hopefully what this talk has at least described is this: I feel a lot of pressure a lot of the time, with a lot of alarm bells going off about things that seem critical or important, and for me that always drains me and burns me out. It is rough sometimes to be having security conversations, because all conversations kind of feel like that. Scale is unforgiving. You can keep having those conversations, but that doesn't mean that you're actually going to be solving the
core security problems and you