
Chaos Engineering: Break It On Purpose by Morgan Carter

BSides London · 2022 · 18:08 · 145 views · Published 2022-01 · Watch on YouTube ↗
About this talk
Morgan Carter explores chaos engineering—the discipline of deliberately introducing failures into systems to build confidence in their resilience. The talk traces the history of chaos engineering from Netflix's pioneering Simian Army through contemporary implementations at LinkedIn, Facebook, and AWS, explaining the benefits of graceful degradation, improved monitoring, and disaster recovery testing. Carter provides a practical roadmap for implementing chaos engineering at enterprise scale, covering estate understanding, health metrics (traffic, latency, saturation, errors), blast radius management, and retrospective practices.
Transcript [en]

Thank you. Can you all hear me okay? Yes? Okay, cool. I was feeling okay about this, and then Andy very kindly informed me that somebody from the chaos engineering team at AWS is here, so now I'm kind of bricking it a little bit. Bear with me. We're going to be talking today a little bit about site reliability engineering. First, a little bit about me. I'm a security analyst at a food-tech startup in the West London area. My background isn't actually in tech at all: I did an English undergraduate degree and then moved afterwards into mostly risk placements, working in audit and so on.

I have a little bit of experience in the finance sector, so I'm really interested in fintechs. Absolutely no cryptocurrency, though, please. I really like cloud infrastructure, specifically AWS; I don't have a lot of experience with Azure or Google, so please don't ask me any questions about those. I'm currently studying on the Information Security master's program at Royal Holloway. That was supposed to finish this year, but because of COVID it's been delayed, so it's taking about three times longer than it was meant to. And I am not a platform engineer, so please do bear that in mind when you're asking me super technical questions at the end. So, a bit of an agenda for today.

I'm going to talk about the history of chaos engineering and where it originated; a couple of contemporary projects at other organizations, the benefits of those, and the kinds of value an organization can get out of them; a bit of a road map to implementing this at an enterprise level; some sources and nerd stuff if you want to go away and read more about this; and a little further reading I've come across during my research into this for my dissertation so far. So: we rely on distributed systems, public cloud infrastructure, and microservices more now than we ever have before.

Critical national infrastructure leverages these kinds of systems hugely, and with the advent of, and reliance on, things like Facebook and social media for consumption of news, it's a really big deal when there's an outage on one of these platforms. Any change you make on a platform like that is going to have a disproportionately large impact on society. A few examples of recent outages: Cloudflare broke their own backbone last year during a routine network change; due to latency on part of the network, they redirected all of the traffic to one access point and basically DDoS'd themselves.

And then everybody saw the Facebook outage due to BGP issues last month, which also had a massive impact on secondary forms of social media and other sites that hadn't stress-tested for the scenario of Facebook going down. And obviously there are specific sites like Downdetector that check for outages on things like AWS and Azure. So: chaos engineering is the discipline of experimenting on a system, usually in production, in order to build confidence in the system's capability to withstand turbulent conditions. That's from Principles of Chaos, which Netflix wrote when they pioneered the discipline.

A bit about the history. A lot of this builds on the site reliability engineering work that Google did in the early 2000s. Google have a massive SRE team, hundreds of engineers, and they do some really cool things with the storage of data to ensure that data loss is minimal in the event of an outage or failure. When Greg Orzell was overseeing the migration of Netflix to Amazon Web Services in about 2011, he had the idea of building tooling that would force engineers to develop secure and reliable systems. So the Simian Army suite was developed, which is the forerunner of most chaos engineering tools.

A lot of what exists in Chaos Monkey, Chaos Gorilla, and Chaos Kong is now redundant, because the large public cloud infrastructure providers have brought out their own tooling that replicates much of that functionality. But initially they wrote Chaos Monkey, which would basically shoot an EC2 instance; if you're not an AWS person, that's basically a virtualized server. Chaos Gorilla will take down an availability zone, and Chaos Kong simulates region failure. There are also teams at the likes of Facebook, LinkedIn, and Twitter doing site reliability engineering and chaos engineering experiments, and third-party consultancies like Gremlin and Blameless offer this as a service.
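
To make that concrete, here's a minimal sketch of the Chaos Monkey idea in Python; this is not Netflix's actual code, and the chaos-opt-in tag is a hypothetical safety convention. It picks one running, opted-in EC2 instance at random and terminates it with boto3.

```python
import random

import boto3  # AWS SDK for Python: pip install boto3

# Hypothetical tag: only instances explicitly opted in are eligible victims.
CHAOS_TAG = {"Name": "tag:chaos-opt-in", "Values": ["true"]}

def terminate_random_instance(region="eu-west-2"):
    """Pick one running, opted-in EC2 instance at random and terminate it."""
    ec2 = boto3.client("ec2", region_name=region)
    reservations = ec2.describe_instances(
        Filters=[CHAOS_TAG, {"Name": "instance-state-name", "Values": ["running"]}]
    )["Reservations"]
    instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if not instances:
        print("No opted-in instances running; nothing to do.")
        return None
    victim = random.choice(instances)
    ec2.terminate_instances(InstanceIds=[victim])
    print(f"Terminated {victim}")
    return victim

if __name__ == "__main__":
    terminate_random_instance()
```

If your auto scaling and health checks are doing their job, a replacement instance should appear on its own; if one doesn't, that's the finding.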

AWS launched their Fault Injection Simulator in March of this year, and Azure Chaos Studio came out in preview mode as of last week, I think. So, a couple of projects. We've mentioned Netflix's Simian Army tool suite. A really cool one is Project Waterbear at LinkedIn, and they actually classify and split their experiments into infrastructure and application chaos engineering; I think their infrastructure project is called FireDrill and their application project is called LinkedOut. They did something really cool that introduces this concept of graceful degradation: you can identify the core workflows on a particular page that people will be using, and choose to prioritize, say, the search function over loading third-party ad content, which is really helpful if they're seeing an increase in traffic and demand for their services.
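
As a rough illustration of that graceful degradation idea (the flag and every name below are hypothetical; this is not LinkedOut's implementation): wrap the non-essential calls so they return a cheap fallback when the system is shedding load, while the core workflow always runs.

```python
import os

def degraded_mode() -> bool:
    # Hypothetical switch: in practice this might be a feature flag service
    # or a saturation metric, not an environment variable.
    return os.environ.get("DEGRADED_MODE") == "1"

def non_critical(fallback):
    """Decorator: shed a non-essential dependency when degrading gracefully."""
    def wrap(fn):
        def inner(*args, **kwargs):
            if degraded_mode():
                return fallback  # e.g. an empty ad slot
            return fn(*args, **kwargs)
        return inner
    return wrap

def search(query):
    return [f"result for {query!r}"]  # stand-in for the core search workflow

@non_critical(fallback=[])
def load_third_party_ads(page_id):
    return [f"ad for page {page_id}"]  # stand-in for a slow external call

def render_page(page_id, query):
    results = search(query)              # core workflow: always runs
    ads = load_third_party_ads(page_id)  # dropped automatically under load
    return results, ads

print(render_page(1, "chaos"))  # set DEGRADED_MODE=1 and the ads disappear
```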

Obviously there are also Blameless and Gremlin, which have platforms and tool suites that will let you run these sorts of experiments on your own infrastructure. And then there's Project Storm at Facebook, which I would really love to know more about, but unfortunately Facebook are secret squirrels and all of their infrastructure is super proprietary, so there's not a great deal of information available about it. There are some cool articles about how they simulate data center failure.

And then there are Azure Chaos Studio and AWS Fault Injection Simulator, which we mentioned. So, the benefits: this directly promotes building resilient, reliable, self-healing infrastructure. You can use things like auto scaling to deal with increased demand so that your services basically don't fall over. It makes your engineers' lives easier and makes them less likely to be called out out of hours, and you don't need so many manual fixes in the event of an incident. It highlights issues with your monitoring and alerting thresholds, too. I saw an instance of this recently: a friend had built a platform and deployed auto scaling, but if auto scaling broke, there wasn't any monitoring or alerting to let them know that had happened, and in that situation they would need to run a script to manually rebuild the service.
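
A hedged sketch of the kind of watchdog that would have caught that case: compare an Auto Scaling group's desired capacity against what's actually in service and alert on the gap. The group name and the alert sink here are assumptions.

```python
import boto3

def alert(message: str) -> None:
    # Placeholder: wire this to PagerDuty, Slack, CloudWatch, whatever you use.
    print(f"ALERT: {message}")

def check_asg_health(asg_name: str) -> bool:
    """Alert if an Auto Scaling group can't reach its desired capacity,
    i.e. the 'auto scaling silently broke' case from the anecdote above."""
    asg = boto3.client("autoscaling").describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )["AutoScalingGroups"][0]
    desired = asg["DesiredCapacity"]
    in_service = sum(
        1 for i in asg["Instances"] if i["LifecycleState"] == "InService"
    )
    if in_service < desired:
        alert(f"{asg_name}: only {in_service}/{desired} instances in service")
        return False
    return True

if __name__ == "__main__":
    check_asg_health("my-web-asg")  # hypothetical group name
```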

It improves your understanding of your architecture and workflows and how everything interacts, and it allows you to test your DR responses in real time, so you can identify bottlenecks, key-person dependencies, procedural issues, and improvements you need to make. That's super cool. And it provides the opportunity to control how, and to what extent, your system fails; we can see that in the LinkedIn graceful degradation concept. I'd definitely recommend reading more about that if you like that sort of thing.

So, yeah, I just generally think it's really cool. As a bit of a road map to implementing this: primarily and fundamentally, step one is understanding your estate. When I started researching this, I originally said that step one would be an asset database or a configuration management database, but that's not realistic. A lot of organizations now lean heavily on serverless technologies. How do you log that on an asset register? What if something doesn't have a static IP? How do you account for that? That's a real challenge when it comes to this sort of thing. So as long as you have an understanding of your estate and how things interact, you're in a reasonably good position to get started with this.

Then you implement some kind of monitoring, potentially an observability solution that will give you more insight into how your systems interact with each other, and establish some healthy baseline metrics for those systems, which we'll talk more about in a minute. It's probably a good idea to be able to conduct some business continuity and disaster recovery testing beforehand, not just for real-world scenarios, but because if one of your experiments does have an adverse impact and actually causes a massive outage, you need to be able to recover from that. And then you can identify infrastructure improvements in advance, to say: we suspect this will be a bottleneck, or this will need to scale.

So then you can hypothesize what the potential impact of one of these experiments is going to be, run the experiment and observe using your monitoring, health metric thresholds, and observability solution, document your outcomes, definitely hold a retro, and re-architect or make changes to improve.
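
One way to make that loop concrete, as a rough Python skeleton rather than any particular tool's API: capture the hypothesis, the injection, a rollback, and a health check, and roll back as soon as your thresholds are breached.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Experiment:
    hypothesis: str              # e.g. "if one web server dies, p99 stays < 500 ms"
    inject: Callable[[], None]   # the failure, e.g. terminate_random_instance
    rollback: Callable[[], None] # recovery path, exercised if things go wrong
    healthy: Callable[[], bool]  # checks your golden-signal thresholds
    notes: list = field(default_factory=list)

def run(exp: Experiment) -> list:
    """Hypothesize, inject, observe, record; abort early if already unhealthy."""
    if not exp.healthy():
        exp.notes.append("aborted: system unhealthy before injection")
        return exp.notes
    exp.notes.append(f"hypothesis: {exp.hypothesis}")
    exp.inject()
    if exp.healthy():
        exp.notes.append("system degraded gracefully; hypothesis held")
    else:
        exp.notes.append("thresholds breached; rolling back")
        exp.rollback()
    return exp.notes  # feed these straight into the retro template
```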

On the retro: I think the Blameless blog and their Twitter wrote some really interesting content about accountability; I think the Blameless article is "Accountability in the War Room". It's focused heavily on the people side of things rather than on learning lessons about the technology: identify whether you've got a key-person dependency. Is one person getting the alert that kicks off the whole incident management process? Are we unable to respond if they're not available? That's a really big piece. So, for health metrics: a lot of this comes from work that Google's SRE team did previously. They developed the concept of the four golden signals, defined as traffic, latency, saturation, and errors. Say an increase in traffic, or demand for a site, is going to cause higher saturation of a resource, which will increase the latency experienced; and then you have things like cumulative latency, which impacts workflows and processes and will cause an increase in errors, dropped packets, failed requests, and timeouts.

So I've got a really cool radar graph here that I spilled some coffee on yesterday because I got really excited. The smaller kite shape in the center demonstrates potential metrics for a healthy system, and the larger kite shape on the outside demonstrates what it could look like under failure. The really important piece here is to identify which metrics you care about, what healthy looks like for your system, and what you want to preserve in the event of an incident. Similar to how for disaster recovery you need to identify and define your recovery time objective and recovery point objective, define here what tolerable latency is for you, and what your systems can cope with before they fail.
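
As a toy example of turning the four golden signals into numbers you can set thresholds on; the request records, capacity figure, and baselines below are all made up:

```python
from statistics import quantiles

# One minute of hypothetical request records: (latency_ms, succeeded).
window = [(120, True), (95, True), (310, True), (2400, False), (88, True)]

CAPACITY_RPS = 50  # assumed figure: what one instance can comfortably serve

def golden_signals(requests, window_seconds=60):
    latencies = sorted(ms for ms, _ in requests)
    traffic = len(requests) / window_seconds      # requests per second
    p99 = quantiles(latencies, n=100)[98]         # 99th-percentile latency
    saturation = traffic / CAPACITY_RPS           # fraction of capacity used
    errors = sum(1 for _, ok in requests if not ok) / len(requests)
    return {"traffic": traffic, "latency_p99": p99,
            "saturation": saturation, "error_rate": errors}

# "Healthy" is yours to define; these baseline numbers are invented.
BASELINE = {"latency_p99": 500, "saturation": 0.7, "error_rate": 0.01}

signals = golden_signals(window)
breaches = [k for k, limit in BASELINE.items() if signals[k] > limit]
print(signals, "breaches:", breaches)
```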

The next concept is the blast radius of experiments. I haven't seen any models for this besides Netflix and LinkedIn splitting it out into whether they want to simulate server failure or availability zone or region failure. But you can simulate things like degraded performance, and then split that out further into one component or service, multiple components, or degraded performance across a whole region; service failure of one asset that might be critical, where you can implement something like auto scaling to bring it back up; or global failure of a whole service, which does occasionally happen.
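
A small sketch of how you might encode those blast radius tiers so an experiment can only ever touch the scope you chose; the estate structure, service name, and AZ below are invented for illustration:

```python
from enum import Enum

class BlastRadius(Enum):
    INSTANCE = "one server or container"
    SERVICE = "every instance of one component"
    AZ = "everything in one availability zone"
    REGION = "everything in one region"

def targets_for(radius, estate, service="search", az="eu-west-2a"):
    """Return only the assets an experiment at this radius may touch."""
    instances = estate["instances"]
    if radius is BlastRadius.INSTANCE:
        return instances[:1]  # a single victim
    if radius is BlastRadius.SERVICE:
        return [i for i in instances if i["service"] == service]
    if radius is BlastRadius.AZ:
        return [i for i in instances if i["az"] == az]
    return list(instances)  # REGION: the lot, so be very sure first

estate = {"instances": [
    {"id": "i-01", "service": "search", "az": "eu-west-2a"},
    {"id": "i-02", "service": "ads", "az": "eu-west-2b"},
]}
print(targets_for(BlastRadius.SERVICE, estate))
```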

So, in summary: understanding your estate is super important. You can't develop health metrics without that, and you don't know how an experiment is going to impact your estate if you don't understand your estate in the first place. Big tech companies are doing this, but you can do it as well; all you really need is a couple of pieces of infrastructure deployed with some massive cloud provider. Yeah, I know, it's really easy, right? AWS and Azure for sure have free tier resources, so you can start playing with this if you want to.

The goal isn't to build something perfectly the first time; it's to build something that's reasonably strong, test it, break it, and then rebuild it so that it's better. Oh, and just because big companies are doing this, it doesn't mean they're super mature. I saw a really cool talk last week by an SRE at Elastic, and they'd initially started their chaos engineering journey with a script that predefined how many instances they wanted to scale to at particular times of day. It wasn't auto scaling at all. They've moved on massively since then.

So, yeah, if you want to get started with this individually, build a lab. Don't over-engineer it like I did; you can do this really easily with a couple of web servers, and use something like Chaos Monkey to take one down and see what happens. You really don't need a massive environment to play with this stuff. Health metrics are really important, so develop some of those and decide what it is you care about. Identify the expected or desired blast radius of your experiment, and try to limit the impact so that it doesn't reach wider than that. Come up with a good retro template so that you can record any lessons learned from these experiments.
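
For instance, a lab really can be this small. The sketch below assumes Docker and two throwaway nginx containers on your machine; the container names and ports are just examples:

```python
import random
import subprocess
import urllib.request

# Assumes two local web servers started beforehand with something like:
#   docker run -d --name web1 -p 8081:80 nginx
#   docker run -d --name web2 -p 8082:80 nginx
CONTAINERS = ["web1", "web2"]
ENDPOINTS = ["http://localhost:8081", "http://localhost:8082"]

def kill_random_container():
    victim = random.choice(CONTAINERS)
    subprocess.run(["docker", "kill", victim], check=True)
    print(f"Killed {victim}")

def check_endpoints():
    for url in ENDPOINTS:
        try:
            urllib.request.urlopen(url, timeout=2)
            print(f"{url} OK")
        except OSError:
            print(f"{url} DOWN")

if __name__ == "__main__":
    check_endpoints()
    kill_random_container()
    check_endpoints()  # what happened? is anything hiding the failure for you?
```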

And, as always with everything in tech, keep reading about this stuff and keep doing research. Some really good sources I've come across so far: I think there's a guy on the SRE team at Red Hat who has two GitHub repositories, one for chaos engineering and one for SRE. They're both brilliant and have pretty much every source you could want; I can add the links or post them on Twitter afterwards if anybody wants them. There's an SRE Weekly newsletter that I subscribe to, which covers recent outages, incidents, lessons learned, that sort of thing, and that's really interesting reading.

The Gremlin podcast, Break Things on Purpose, interviews SREs and chaos engineers at quite high-profile tech orgs, and that's really cool, because they go into what they've learned from their experiments, how that went, and any challenges they faced along the way with implementing this stuff. Adrian Hornsby was actually how I discovered this: he gave a talk at the AWS Summit in 2019 called "Creating Resiliency Through Destruction". It's on YouTube and it's a really good watch; I'd highly recommend that one as well.

And then there was a conference a couple of weeks ago called SREcon, and there's a playlist of videos from that on YouTube too; it's by USENIX, I think. Again, super cool. Thank you very much. Please don't ask me any questions about Azure, and thank you very much to Kevin and Alex.

Okay, thank you very much, Morgan. We are going to field a couple of questions only, just so that we can keep the talks moving, as we're already behind schedule after two talks. Actually, we were out of schedule after the introductory talk, so it's definitely not the last speaker's fault. With that said, if you've got a question, make sure it's really good, because we've only got time for two now; otherwise Morgan will probably be in a bar afterwards as well if you've got questions. Any questions? Oh, don't start. "Can you explain..." No, no, I'm not answering that one. Somebody else, please. He's a troll.

Don't give him the mic again.

How popular is chaos engineering getting, and also, what is your... Well, I'll just shout really loud: how popular is chaos engineering getting, and also, what is your... I don't know what Azure is, I'm afraid; I've no idea. Chaos engineering is growing. At the moment you only really hear about large tech companies doing it, but there is a Gremlin 2021 State of Chaos Engineering report that goes into what size of companies are running these sorts of experiments and have a chaos engineering function or an SRE team and so on. That has some actual stats in it if you're really interested; I'd recommend reading it.

I had an answer to that question too, which was...

No.