
Two Sides of the Same Coin

BSides Boulder · 2023 · 33:51 · 38 views · Published 2024-02 · Watch on YouTube ↗
Tags: Talk
About this talk
Cybersecurity and site reliability engineering (SRE) often seem like distinct fields, each with its own practices, norms, and challenges. However, this talk will explore the idea that they are actually two sides of the same coin, both part of the broader field of systems safety. By recognizing their shared foundation, professionals in both areas can benefit from each other's insights to better safeguard our digital systems and maintain smooth operations. In this talk, we'll delve into SRE concepts like the myth of human error, Safety-II, and Service Level Objectives, and discuss how cybersecurity experts can apply these ideas to enhance their own work. Additionally, we'll examine the struggles SRE faces and the valuable lessons it can learn from the cybersecurity field.
Transcript [en]

We'll go ahead and get started. We have Alexander Weiss, who's going to be talking about the interplay of site reliability engineering and cybersecurity, so please give him a warm welcome.

Yeah, thanks. So, I'm Alex. My talk here is "Two Sides of the Same Coin": what security can learn from reliability, and vice versa. I'm going to start with some introductions, then talk a little bit about the relationship between SRE and cybersecurity. I spend most of my days on the reliability side of the house right now, so I'll share some cool conversations I'm having there that I think are interesting, especially for the defensive side of security, and then conclude with what I think security, as an industry, as a field, as a discipline, does a lot better than site reliability, and try to create some communication there.

So, me: I'm Alex Weiss. Most of my career has been spent flip-flopping between security and reliability and SRE. I started at a help desk, just kind of went back and forth and back and forth, and never really looked back. Find me on Twitter as AWS architect, on Bluesky as AWS architect, on Mastodon as AWS starchitect; if I get a third dog, I will probably name it AWS starchitect. Easy to find there.

On the weekends I like to teach for a local group here called the Software Freedom School, where we teach people how to use open source software. Next month I think I'm teaching a class on OpenTelemetry, and at the end of the summer we're looking at putting together a study group for a Kubernetes certification. If that sounds interesting, hit me up; we have a couple of our regular learners at this conference today, so find us if you want to chat about that.

For my day job, I work for a company called Verica. We're a small startup, and we kind of live in this space between security and reliability. We were founded by some of the early figures of chaos engineering. We also run an incident database, the VOID, which I'll talk about later: it tracks software outages, the technologies involved, and any postmortems or writeups about what happened. And we have Prowler Pro, the managed version of Prowler, the open-source cloud security posture management tool; its creators and maintainers are part of our company and have built an easy-to-use, inexpensive security offering in Prowler Pro. So that's what I do with my days: I live in that space between security and reliability.

I won't go too much into the takeaways of this talk now. I have a list of cool conversations I've been having, things I've been thinking about, things that come up a lot in SRE these days that I think would be interesting to security, and we'll dive into them, because that's what the talk is about.

Why is this talk important? It's neat that SRE and security have this kind of overlap, and it's neat that we can learn from each other. But I also think that in a world where software is becoming such a critical part of people's lives, where software incidents can strand people in airports and ransomware hits hospitals, learning from other fields about safety, about how to build safety-critical systems, and about how to talk about safety is really important for all of software, and I think SRE and security have the biggest part to play there.

So that's me. I'm not the first person to talk about this; a lot of smart people have covered it, and probably said it more succinctly and better than I will. The title of this talk, "Two Sides of the Same Coin", actually comes from my boss, Casey Rosenthal (that's him over there on the right, getting attacked by a goose), in his 2021 SREcon talk, where he argues that security and SRE specifically are both parts of systems safety, and in the same boat as far as being able to deliver success, and explain that success, to the business.

Also, I submitted the proposal for this talk back in February, and in the intervening months the brilliant Kelly Shortridge released her book, Security Chaos Engineering, which has a whole chapter on SRE and security and the lessons they can learn from each other. She's incredibly brilliant, she pulls from a number of fields, and the book is thoroughly researched, so buy her book; it probably has better things to say about this than I do here. So those are two people who've talked about this before, and there are lots more, but let's dive into the meat.

These are four things, four lessons, that I deal with in my reliability work, conversations I have and hard things we do, that I think would be useful in a security context.

The first one is: measure meaningful things. (Let me make sure I've got my time up here.) This is the 2022 VOID report. I mentioned in the introductions that we run the VOID, an open, free incident database of all the software outages we've found, tagged and curated. Courtney Nash, I should say, is our incident librarian, our senior researcher, and she also releases an annual report where she dives into that data, looks at what types of outages are happening and at patterns across them, and tries to pull out details. It's kind of like a DBIR, but for SRE.

In this most recent report she talks a lot about mean time to resolution. Every SOC I've been in has had a wall of screens with dashboards on it, lots of metrics, graphs, pie charts, all that sort of stuff, and I know MTTR is tracked a lot. MTTR is the time to resolution, recovery, remediation, reactivating LinkedIn Premium, whatever you want the R to stand for: the time between knowing a thing is wrong and the thing no longer being wrong in your environment.

And that first word, "mean", is exactly that: the arithmetic mean; we're taking the average. So I want to talk about the mean a little bit. The mean is a measure of central tendency, used alongside the median, the mode, and others, and measures of central tendency are super useful when you have a bell-curve-shaped, normal data set. You can track your mean, or whichever measure of central tendency you like, and if your data set starts trending in some direction, your measure of central tendency will move too, so you can track that one number as representative of the whole data set.

But notice I said you have to have bell-curve-shaped data. What if I told you that your incident data actually looks like this? This is a graph from the VOID report, and it's what we call positively skewed: it's all pushed over to one side. This is the rollup view across all companies' incident data in the VOID, and you might be saying, okay, the rollup across all companies, that distribution doesn't prove much; you could have rollup data that looks like this while the individual companies are still normal. Well, your company is probably not special: every company we looked at had a data set that looks like this little skateboard ramp of a graph. We didn't really use the rollup data anyway; methodologically, we looked at individual companies to see how effective MTTR was against their incident data.

The problem with using a mean on data like this is that the mean sits way over here and doesn't track changes in the data set the way you would expect. Really, something like 53% of the incidents in the database are resolved in under two hours, which is reassuring; everyone's pretty good at their jobs. But the mean is just difficult to use here: it doesn't behave the way you want it to, and tracking it doesn't tell you the thing you think it's telling you.
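To make that concrete, here is a minimal sketch, using only the Python standard library and illustrative log-normal parameters (nothing here is fit to the VOID data), of how the mean misbehaves on positively skewed duration data while the median barely notices:

    import random
    import statistics

    random.seed(42)

    # Hypothetical incident durations in minutes, drawn from a log-normal
    # distribution: most incidents resolve quickly, a few drag on for days.
    durations = [random.lognormvariate(mu=3.5, sigma=1.2) for _ in range(500)]

    print(f"mean:   {statistics.mean(durations):8.1f} min")
    print(f"median: {statistics.median(durations):8.1f} min")

    # A single five-day outage drags the mean around while the median barely
    # moves: tracking the one number no longer tracks the data set.
    durations.append(5 * 24 * 60)
    print(f"mean, one outlier:   {statistics.mean(durations):8.1f} min")
    print(f"median, one outlier: {statistics.median(durations):8.1f} min")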

Now, you might say: hey, we've got 200 years of brilliant statisticians who must be able to take this distribution and do something to it so we can use some measure of central tendency to track it. So if I zoom in on Big Brain Bay over here: yes, you can log-normalize the data, and in the VOID report we do this. A Google researcher, Štěpán Davidovič, took the same approach in a separate experiment on a similar data set: you can log-transform your incident duration data and make it look like this. The drawback is that when you do this, the data lose some of their explanatory power. How much? Well, we have the raw data and the log-transformed data, so we can run experiments and see, if we decrease our raw durations by, say, 15%, how that affects the mean. Courtney did this for the VOID report, and Davidovič ran similar experiments, and the finding is that you need an awful lot of incidents. There's really only one company in the VOID data set having enough incidents, and that's a rollup across a very large company, for it to be meaningful to transform their data like this and track MTTR. So unless you're having thousands and thousands of incidents, you probably can't take this approach either. If you're tracking MTTR for security incidents, for time to resolution, be very careful about using the mean.
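As a rough illustration of why (a toy simulation under the same illustrative log-normal assumptions as the sketch above, not the VOID's or Davidovič's actual methodology), you can ask how often a genuine 15% across-the-board improvement still fails to show up in a quarter-over-quarter comparison of mean durations:

    import random
    import statistics

    random.seed(7)
    TRIALS = 300

    def quarterly_mean(n, improvement=0.0):
        """Mean duration of n simulated incidents, optionally all 15% shorter."""
        return statistics.mean(
            random.lognormvariate(3.5, 1.2) * (1 - improvement) for _ in range(n)
        )

    for n in (25, 250, 2500):
        # How often does a quarter with a real 15% improvement still report a
        # HIGHER mean than an unimproved baseline quarter, from sampling noise?
        misleading = sum(
            quarterly_mean(n, improvement=0.15) > quarterly_mean(n)
            for _ in range(TRIALS)
        )
        print(f"{n:5d} incidents/quarter: mean points the wrong way "
              f"~{100 * misleading / TRIALS:.0f}% of the time")

At a handful of incidents per quarter the comparison is close to a coin flip; the signal only emerges at incident volumes few organizations ever see.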

What would it mean if you did use the mean? For me, it means that if you're on a team with KPIs or targets tied to MTTR, and MTTR goes up quarter over quarter, you suddenly have to do a lot of make-work to figure out how to get that number back down. If you have bonuses tied to it, you're really better off rolling dice than trying to impact that metric in a way you can control; it depends more on whether that quarter you get apples or Volkswagens for your incidents than on anything you were going to do.

But it's probably not fair to say "don't use MTTR" without giving you some pointer toward what to use instead. There's no drop-in replacement for MTTR that we've found. There were some great conversations about this at the Learning From Incidents conference down in Denver earlier this year. (If Prezi supported heart emojis, there would be a ton of heart emojis here, but it doesn't support Unicode, which is interesting to me.) So if you can't use MTTR, what might you do?

Learning is probably the best thing you can gear toward in your security org or your reliability org. There's no magic wand you can wave that makes your incidents definitely shorter, but you can do a lot to make sure that when you have incidents, you're looking at them, you're understanding why they happened, and, most importantly, you're sharing those learnings, disseminating them in your org. At the LFI conference earlier this year, Eric Dobbs of Indeed gave a good talk about how surprising it was, when they started reviewing their incidents, looking at the risks and challenges they were dealing with, and opening those learnings up to the crowd, how many different parts of the organization came in. People were unexpectedly eager for it; it increased communication across teams, really opened things up, and built a strong community that understood the challenges facing the organization. So learning is a super cool thing to focus on instead of MTTR.

The other thing you can look at is concrete changes. We're professionals; we don't need permission from MTTR going up or down to tell us, hey, we're missing a bunch of boxes in our inventory management system, or we have all these unpatched things, or our monitoring has a big hole right here. You can just go do that work, track those metrics, and know that they're making your system more secure, without waiting to see how MTTR is affected.

The last one here is maybe more situational, but humans are very good at taking in subtle changes and cues in their environment; even if they don't understand why, they can aggregate and assimilate them into a good sense of changes in the quality of things. It's this uniquely cool capability of people's brains. So you can ask the people who are in the system every day, "are we getting more secure? are we getting better?", and trust that feedback; this is called the modified Kirkpatrick method. So that's my first point: measure meaningful things, plus a bit of throwing shade at MTTR as a metric.

The next one I want to talk about is system health and success. (And I'm bad at Prezi; this is my first time doing a Prezi thing, thought I'd try something new.)

Is the system healthy? This is a book by Erik Hollnagel, Safety-I and Safety-II, from the field of human factors and systems safety, resilience engineering, which deals with safety-critical domains and how safety works in real environments, including software. In both SRE and security, we tend to think of safety as the absence of these, right? Your system is really good unless you're having an incident, unless someone's getting paged, unless you're in an active breach; other than that, the system is great. And we organize all of our work like this: it's how we run our businesses, how we describe our success, all in terms of these tickets. This is what Hollnagel calls Safety-I: the absence of incidents is what makes something safe, and if you have an acceptably low number of these events, the system is safe, for whatever your number for "acceptably low" is.

That's an okay model for thinking about work and systems. The real limitation is that thinking about work that way tends to come with some assumptions. Specifically, it assumes that systems are decomposable. We build our systems piece by piece: we take the EDR and wire it to the SIEM, that goes to a SOAR, the SOAR notifies a human SOC analyst, the analyst has escalation tooling, and there's probably someone reasonable to escalate to. The idea is that because we put all those things together, we can take them apart again and examine the individual components in isolation, with fixed inputs and outputs. Systems don't really behave like that: things take multiple inputs, and once you wire them together, their behavior tends to be an aggregation, some cumulative output that isn't just the two binary things you glued together.

The other assumption of Safety-I is that all those little components are either in an up state or a down state, thumbs up or thumbs down, with no way to account for states of gray, like "this thing is running, but it's shipping so many logs that it's killing our logging system and we're blind to everything else", or "this thing has been down for three weeks, but no one's noticed; is anything really wrong?"
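As a toy illustration (hypothetical names and types, not any particular monitoring product's API), here is the gap between the thumbs-up/thumbs-down model and one that keeps those gray states visible:

    from enum import Enum

    class Health(Enum):
        UP = "up"
        DEGRADED = "degraded"  # running, but e.g. flooding the logging pipeline
        SILENT = "silent"      # down for three weeks and nobody noticed
        DOWN = "down"

    def binary_view(h: Health) -> bool:
        # Safety-I style: thumbs up or thumbs down. A liveness probe still
        # counts a DEGRADED component as "up", and SILENT never surfaces.
        return h in (Health.UP, Health.DEGRADED)

    def gray_state_view(h: Health) -> str:
        # Keeping the gray states visible so someone can go clean them up.
        return {
            Health.UP: "fine",
            Health.DEGRADED: "working, but eroding something else",
            Health.SILENT: "no signal; is anything actually wrong?",
            Health.DOWN: "page someone",
        }[h]

    print(binary_view(Health.DEGRADED))      # True: the nuance is lost
    print(gray_state_view(Health.DEGRADED))  # the gray state stays visible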

The other assumption Safety-I makes, per the Safety-I and Safety-II book, is that there's a chain of causality. In security we can kind of do this from the attacker's perspective: we can look at the kill chain and say, okay, I had SQL injection on this host, then I found a service account, got root, sprayed the network, pivoted, and eventually got domain admin, or something like that. You can understand that chain of causality. From the defender's perspective, it's really hard to run that chain backwards: okay, the attacker got domain admin; is the issue that we use computers differently than we did in the year 2000? I don't know, right? It's really hard to say what caused the attack, or what caused the component to fail, from the defender's perspective. That's why Safety-I is a tough model to work in. It tends to have gaps, and we feel that friction when we try to cast all of our work in terms of these tickets, when we feel like we're responding to all these incidents in a timely way but things aren't getting better. That's typically the Safety-I way of working.

What Hollnagel is really advocating for is more of a system like this. This is a model of our system with risk flowing and pooling in different parts of it: different kinds of risk collecting in different places and getting released in different places, with humans going in, cleaning things up, and making sure everything flows non-destructively. This, I think, is the model of safety that makes a lot more sense. You could imagine what some of those pools are: this one is your personal data; the upper-left blob is weak firewall rules; the bottom right is the unpatched CI/CD server still running a Docker version from the early twenty-tens. There's risk pooling all around, and you need to actively find it, go in, and clean it up.

So that's Safety-II. Safety-II is marked by the presence of successes. Safety is not a non-event; it's an active thing that we do. We create safety moment to moment, through the things we do in our organizations, through our daily practice at the sharp end, which is different from the ticket view here. We want to make that kind of work visible. A slightly jokey summary: imagine a Zoom call. Safety-I is "I thought I was on mute, but I wasn't." Safety-II is "at least I was wearing pants." It's the things we do every day to ensure we don't have a bad outcome, and noticing that, making it visible, and celebrating it.

So what's the advice, the learning, the takeaway here? We can't entirely get away from tickets; that's how we structure our work. But we can do more to make the other kind of work visible: going in and actively reducing the risk pooling in our systems, and making that work visible and first-order, is critical to success. Cool, so that's that one. Time check: okay, the other two are pretty quick. You can always tell good incident review from bad incident review.

And it's pretty easy, because good incident review tends to focus attention up your org chart, toward places of leverage, toward power in your systems, while poor incident review tends to focus things down your org chart, onto small changes or low-level employees who don't have much control over the systems they operate in. For an example of bad incident review, think about a Five Whys (I don't know if everyone here is familiar with the Five Whys framework). Say you start with: the server didn't restart. Well, why didn't the server restart? Because the developer never made an autoscaling group for it. Why did the developer never make an autoscaling group? It was part of a POC and was never really productionized. Why is that? Well, John didn't make the ticket after he finished the POC, so it just sat there and we never did the work. And now you're blaming someone who may not even be at the company anymore. There aren't many action items you can take away from that, nothing you can really do to change how your organization does work, because you've scoped things way down toward a place without power.

So the real takeaway there is: if you find yourself talking about an issue and you keep going further down your org chart, or down to specific Jira tickets, try to redirect that upward. Try to figure out what created the conditions for that thing to happen, and whether there are higher-level changes to make, because not much good is going to come from five-whysing yourself down to a single Jira ticket that should have been made.

My last takeaway here is the mantra of chaos engineering, which is kind of the water I swim in.

This is an excerpt from the Chaos Engineering book; I'd also recommend the Principles of Chaos website as a good quick read on this. The idea behind this mantra of embracing complexity is that increasing complexity is a natural byproduct of doing our work: of our systems being more secure, more reliable, serving users better, making more money. Whatever it is we want, it requires complexity; complexity is the currency we pay for it with. So instead of trying to fight that complexity, and I love this quote, you surf it like a wave: you figure out the tooling and processes to manage that complexity rather than fighting it, because everything we want to do is going to create complexity. The question is how we avoid mispricing it, and how we understand it as we add it to our systems.

So those are my four things, fun conversations we're having in SRE right now that might be interesting to security folks. I'll conclude with what I think security does really well.

I mentioned Casey's SREcon keynote, "Success in SRE Is Silent," and the crux of that talk is that the improvements we make, whether we're in security or SRE, will get eaten up by the business parts of our organizations. As we make something more reliable, we'll put more users on it, or add more features, or shrink databases, and eat into that safety margin. The same happens on the security side: as we get better at paying down technical debt and vulnerabilities, as we get better at responding to incidents, that frees up room for the business to add more users, add more features, and move faster. And that really is what we're enabling: letting the business move faster.

So SRE struggles with this delicate balance. We do a lot of work to make systems more available, more reliable, faster, but that inevitably goes away; any metric we track will eventually be consumed, because it's a currency being spent to create more value for the business. So how do we tell a narrative that explains our success? I think security does this pretty well. Not everywhere, but being able to hand the business frameworks and mental models that say "we didn't get breached", being able to claim success in the absence of a negative and celebrate it, is something SRE needs to learn. If you have other tricks and tips, or any secret insights into why security seems to do this better than SRE, I'd love to hear them. But yeah, that's my talk; that's what I've got. I can throw it out to questions. [Applause]

Any questions?

Cool. All right, thank you very much. Thank you, Alex. [Applause] I'm not sure whether or not the sandwiches are here yet, but we can at least... oh, they are, okay, the sandwiches are here. So we'll stick with the schedule and just have a longer lunch break. Thank you, Alex, and come back around 1:25 for another round of giveaways. So, thanks.
