
Our next two speakers are from Netflix. We have Sandy and Prachi. The title of their talk is Orchestrating Resilience, Composing a New Score for Netflix Service Reliability. Sorry, I need to get close to the mic. I'm trying to give them the limelight. That being said, I'll have some announcements at the end of this, but I'm looking forward to your talk. So, please, Sandy and Prachi. Thank you. Welcome, everybody. Let's start with a story here. Imagine you are a conductor. You have composed a beautiful, complex symphony. But you're worried. What if the orchestra makes a mistake? What if something goes wrong? So, you make a decision.
Simplify the score. Remove the complex passages. Reduce the number of instruments. Fewer moving parts means fewer things can break. It's safe. It's predictable. But you have also removed what makes the music beautiful. Netflix faced the same choice. During critical periods, we would freeze everything. No changes. No deployments. It kept us safe, but it also kept us stuck. Welcome, everyone, to Orchestrating Resilience, Composing a New Score for Netflix Service Reliability. I am Prachi, an SRE in the Security Engineering org, focused on keeping our services secure and reliable during those big moments. And I'm Sandhya Narayan, a TPM in Security Engineering. I partner with teams to make sure our launches are smooth, secure, and predictable. I know we are the last presentation,
keeping you from lunch. So, here's what we'll do in the next 25 minutes. We'll talk a little bit about the wrong notes we were hearing when security and reliability fell out of sync: the problems we were seeing with blanket freezes. We'll talk a little bit about how we understood the impact and aligned on what good looks like. Then, from signals to score, how our framework for secure, resilient decisions came together, and how we empowered our teams with autonomy while staying in key with security. And lastly, we'll talk about the encore, where we listen to the data and make sure that we stay in harmony even while we hit those occasional
sour notes. So, let's talk about the discordant notes we started hearing during our launches. Prachi, can you share what was going on with our old approach of blanket freezes and why it wasn't working for us anymore? Sure. All right. Let's talk about why blanket freezes were the wrong instrument for us. On paper, they sounded great. We'll just stop all changes around big events. What could go wrong? Well, a lot could go wrong. Think about the innovation slowdown. Teams started timing everything around freezes. Features slipped, experiments got shelved, and people were basically composing to the calendar instead of to the customer. So, if you have had a perfectly ready feature sitting in a branch behind a
freeze, you know that feeling. And that takes me to the risk pileup. That one surprised us. While we were slowing down the changes, risk didn't slow down. Vulnerabilities, dependency updates, last-minute fixes, they were all piling up behind the freeze window. So, the moment it got lifted, we had this giant release wave. And that's not safety. That's just deferred chaos. And finally, the operational pain. All that pent-up change meant noisy launches, tense on-call, and a lot of "Please don't page me tonight" energy. We were exhausted, still not confident. So, I want you all here to take a moment and think about the last freeze you have lived through. Not whether it was stressful, I already know the answer to that. But what
specifically broke your trust in it? Was it the post-freeze deploys? The things that fell through the cracks? Or something else? Hold that thought, because in the next few sections we'll cover the solution to exactly that problem. So, Prachi, where were we hurting with these blanket freezes? Was this control actually creating fragility? That's exactly what was happening, Sandhya. And let me walk you through the three ways it broke down for us. First, critical fixes and security patches, the ones I talked about earlier, getting stuck behind the freeze. We'd be in a quiet period for a big launch, and a high-priority security patch would land. The rule said, "No changes." So, now we are in this absurd situation where we
are knowingly running with risk because our safety mechanisms say, "We can't fix it." We had cases where teams were Slacking us like, "Hey, is there a way we can push this out or sneak it in?" And that's when you know your control is working against the very thing it was designed to protect. And then the very assumption that one size fits all. A tiny low-risk config tweak, blocked. A high-risk architectural change, also blocked. We treated every change as the same risk, which meant we weren't actually focusing on the right ones. And when you do that, teams do what teams always do. They invent workarounds: shadow deploys, side channels, "we'll just flip it after hours." All the stuff that's way harder to see,
secure, and operate. So, instead of clean, observable, well-planned changes, we got hidden complexity and surprise behavior. So, we weren't actually reducing the risk. We were just reshaping it in a different form. And that's the moment we realized the problem wasn't the change itself. It was unmanaged risk. Instead of stopping the music, we needed to retune our instruments. We didn't need more no's. We needed a smarter how. And that takes us to act two. Up to this point, our default was basically everything is critical, freeze everything. Which is kind of like saying every instrument in the orchestra is a soloist. And if you've ever heard an orchestra where everyone thinks they're the soloist, it's usually not great.
In reality, our systems aren't all playing the same role. Some are right in front of the member. Some are way back, deep in the background. And some are mostly there to make the lives of our engineers a lot easier. Quick gut check. How many of you have felt that half your services get treated as mission critical even when they're clearly not? Show of hands. There you go. So, in this act, we started asking a different question. What if we actually tuned our controls to the real risk of each service instead of freezing everything the same way? And that's where the idea of risk classification came in. Okay, so I'm going to walk you through
the core idea. Classify risk so not every service is critical. Some services should absolutely be turned up during critical launches. Others can stay in the background at a nice chill level. Yeah, for Netflix think of sign-ups, playback, payments, security credentials. Those are our lead instruments. If they're off, the whole performance falls apart. But an internal reporting tool, a low-stakes config UI, those don't need the same level of drama during a big event. So, if you all think about your own world for a second here, think about what's your "if this breaks, we are all getting paged" service versus what's your "we'll fix it tomorrow, it's fine" service. So, Prachi, I'm guessing once we started
naming that aloud, it became obvious we didn't need one giant no changes rule. Exactly. And we built a scale of risk for our services and we anchored it on three really simple questions. First, what is the impact on our customer experience? If things break, does a member immediately feel it? Like, I can't press play or I can't log in. Then, what is the blast radius during critical events? If this misbehaves during a big premiere or a live event, does it take down the whole experience or just a slice of it? The blast radius tells us how careful we need to be when we are touching that service. And lastly, what is our acceptable risk threshold here? Some
services can tolerate a bit more experimentation than others during those big moments. Writing down the answers to all these questions gave us permission to say: these can keep shipping, these need tighter guardrails. And I'll urge you all to think about it for a minute. If you had to answer those three questions for one of your services right now (impact, blast radius, acceptable risk), would your current change controls actually match your answers? Think about it. So it means once we had the scale, we could stop treating everything like a tier zero service. Correct. So we have talked about why we needed to tune the instrument and think in terms of risk and not freezes. The next
logical question is how do we actually use this in practice? How do we turn the three things we talked about (impact, blast radius, and acceptable risk) into something teams can actually use day-to-day? And that's where service tiers come in. This part is about setting the volume for each service. Some services should be cranked up to "do not mess with me" during critical events. So we built a simple tiering model. For tier zero, we kept the definition brutally simple. If this breaks, the user definitely feels it. Think of playback, core streaming experiences, sign-up, payments, and the control planes that keep all of that together. If any of those wobble, it's instantly visible to our members.
And these are the services we give the strictest guardrails to. In musical terms, they are our lead instruments and we really cannot afford for them to be out of tune. Next on the scale, we have tier one, the first chairs. These are the services that might not be in the main spotlight, but they lead their section. If they are off, everyone notices. For us, that includes things like discovery, recommendation services, key UI flows. So, tier one is right behind tier zero, but still tightly coupled to those core experiences. If tier zero is your "absolutely getting paged" service, the question we asked earlier, tier one is your "I'm probably getting paged" service. So, for these tier one services, we
still apply strong guardrails, but with a little bit more flexibility, and we'll talk about it. And then comes the supporting section, tier two. These are the services that are really important to how we run our services, but their impact on our members is usually indirect or delayed. Think platform services, shared infrastructure, internal systems that keep our engineers productive. If they have a bad day, the members might not feel it right now, but they will feel it later through slower deployments, more bugs, or maybe a degraded experience. So, tier two is still important, but it doesn't sit right there at the front of the line. That means during critical events, we can often keep tier two services shipping with some guardrails, because
the blast radius is more buffered. So, our controls are focused more on stability and good hygiene than on absolutely no change. So, again in musical terms, they are the sections that keep the rhythm and texture of the orchestra. And lastly, we have the backstage section, tier three. These are internal, low-impact services where, honestly, members don't care about them. Think of your reporting tools, your internal tools, low-risk utilities. You care about them, but the member doesn't, because they can't see the impact in real time. And they are the safest to ship even during big moments. So, if something goes a little sideways with any of these, we don't need to worry too much about
that. And tier three is where we preserve velocity for our teams. That's like keeping the backstage humming while the spotlight is on tier zero and one. So, I'm going to leave this up here for a second and give you all an opportunity to think about your lead instruments and your first chairs. Think about what is important for you in terms of criticality. And if you heard Anna's keynote this morning, she did mention, "Don't treat everything the same way. Everything is not critical." And that's the whole theory behind this. Think about, in your world, what is your tier two? What is your tier three? If this service breaks, it doesn't break the member experience. That's your
backstage. So, if we zoom out on this whole tiering story, it really comes down to one thing. Not every service is critical. Not every service is a play button. And that was our big realization. Once we tuned each instrument, or each service, based on criticality and risk, our decisions around risk and deployment got a lot cleaner. We knew exactly where we needed to be strict and where we could be way more lenient. So, Prachi, it looks like if there's one thing our audience can take from this, it's: don't design your controls as if everything is critical. Figure out which parts of the orchestra really are, and let them keep playing. Absolutely. So folks, we tuned the instruments, we
learned that not everything is critical. Now the question in act three is, what do we actually do with that information? Correct. And this is where we go from tiers on paper to real decisions, from signals to score. Instead of one giant yes or no, we started looking at a small set of risk signals and letting them drive the decisions for us. To make this practical, we focused on four core signals: deployment confidence, test coverage, change type, and historical behavior. On its own, each signal is just a hint. Together, they tell you a story about how you should be playing out your deployment. So deployment confidence is about how we ship. Do you trust your pipelines? And
not only when you or your team is making changes, but also when others are suggesting changes. That is very important, because that signal unlocks automation. Dependency updates, stock updates, and a lot of other things can be automated if you have data for that signal. And when confidence is low, we work on mitigations. Next up, test signals. They are our early warning system. We are looking at test pass rates, flaky tests, code coverage. If those signals are noisy or degraded, that's telling us that the safety net has holes and we shouldn't be taking risks during big moments. Change type is powerful but simple. Is this a simple config flip, a minor dependency bump, a core
library upgrade, or are we talking about a schema migration? What type of change it is will help you decide what the blast radius is. Don't treat all those changes the same way. A reversible feature toggle gets a lighter touch. A database migration that touches many services and probably millions of rows gets a much tighter review and rollout control. And finally, historical behavior. That's essentially asking how the service has behaved in the past during our critical events. Was it noisy? Were there too many incidents? Were previous deployments for this service healthy or unhealthy? So if a service has a pattern of instability around launches, we are going to be more conservative. Smaller batch sizes, tighter bake times
maybe, and maybe manual checkpoints. It depends. So for you all, we recommend picking any two signals and beginning from there, then layering the remaining signals on top. You will be surprised how far you can go with just two of these signals. So basically, these signals are what let us walk away from blanket freezes and move towards more targeted risk mitigations that actually match the real risk. As we grew this muscle, we started seeing a shift. We stopped shipping on "I think it's fine" or "We've always done this during launches" and started shipping on actual signals. I remember, Prachi, before this a lot of decisions sounded like: it's a big event.
I just don't feel good about this change right now. Or the opposite: we really need to push this feature, let's push it and hope for the best. That's all vibes. And now we've got information on tiers, risk signals, deployment confidence, test coverage, and historical behavior. Today, this data drives our mitigation strategy, not whoever is the loudest in the room. This also helps us move faster and feel safer during big moments. Up to now we've talked about tiers, signals, and all of that good stuff. Act four is all about empowering our engineering teams, empowering our soloists: letting teams keep shipping even during these big performances without putting the show at risk. All right. So, what did we do to actually empower
our teams? So, we leaned on three main resiliency tactics: canaries, regional staggering, and synthetic testing. Canaries can be used to verify that changes in code or configuration will not cause a deviation in performance or behavior. So, instead of flipping a giant switch, you take a tiny slice of your traffic and make it see the new version. We watch metrics and error rates, and only then do we roll out wider, especially for higher-stakes changes. Then comes regional staggering. We don't light up every region at once. We roll out region by region. So, if something goes wrong, the blast radius is contained. We can pause, roll back, maybe fix forward before it actually becomes a global incident. And then comes synthetic testing.
We test our critical user journeys. For us those are like: can a user sign up? Can a user make the payment? Can they watch? Can they play back? So we know if something breaks, and we can fix it before the member actually feels the pain. During big events, these are our always-on listeners. The important part is this isn't just tribal knowledge. We have wired all of this into our pipelines. For certain services and tiers, the pipeline kind of nudges you towards, "Hey, use a canary. Roll out regionally. Make sure you have synthetic checks built in." So, risk awareness is built into the pipeline and not just a conversation over Slack or email or a meeting. Okay.
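As a rough illustration of the tactics just described (a canary check, a region-by-region rollout, and a pipeline that nudges teams toward the tactics their tier calls for), here is a minimal Python sketch. The tier-to-tactic mapping, the function names, the region names, and the error-rate tolerance are all hypothetical choices made for this example; the talk does not describe the actual implementation.

```python
from enum import IntEnum
from typing import Callable, Iterable, Optional


class Tier(IntEnum):
    """Service tiers as named in the talk (0 = lead instruments, 3 = backstage)."""
    TIER_0 = 0
    TIER_1 = 1
    TIER_2 = 2
    TIER_3 = 3


# Hypothetical mapping: which resilience tactics a pipeline expects per tier.
REQUIRED_TACTICS = {
    Tier.TIER_0: {"canary", "regional_stagger", "synthetic_checks"},
    Tier.TIER_1: {"canary", "regional_stagger"},
    Tier.TIER_2: {"canary"},
    Tier.TIER_3: set(),
}


def pipeline_nudges(tier: Tier, configured: Iterable[str]) -> list[str]:
    """Return the tactics this tier expects that the pipeline has not configured."""
    return sorted(REQUIRED_TACTICS[tier] - set(configured))


def canary_passes(baseline_error_rate: float,
                  canary_error_rate: float,
                  max_delta: float = 0.005) -> bool:
    """Accept the canary only if its error rate stays close to the baseline.

    max_delta is an assumed tolerance, not a recommended value.
    """
    return canary_error_rate - baseline_error_rate <= max_delta


def staggered_rollout(regions: Iterable[str],
                      deploy: Callable[[str], None],
                      healthy: Callable[[str], bool]) -> tuple[list[str], Optional[str]]:
    """Deploy one region at a time and stop at the first unhealthy region.

    Returns (regions completed successfully, region that failed or None).
    """
    completed: list[str] = []
    for region in regions:
        deploy(region)
        if not healthy(region):
            # Contain the blast radius: later regions are never touched.
            return completed, region
        completed.append(region)
    return completed, None


if __name__ == "__main__":
    # A tier one service that only configured a canary gets nudged to stagger.
    print(pipeline_nudges(Tier.TIER_1, ["canary"]))  # ['regional_stagger']
```

In this sketch the rollout halts as soon as one region looks unhealthy, which leaves the choice between pausing, rolling back, or fixing forward to the caller.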
Prachi, let's talk about how all this comes together in practice and who can actually bypass the freeze and ship during an event. Yeah. Since we've already talked about risk signals and resilience tactics, this is where we're going to bring it all together. We use something we like to call the risk-aware launch rubric. Think of it like a scorecard. It takes everything that we have discussed in earlier sections and brings it all together for us. Event type: how critical the moment is for us. Service tier: how close you are to the core experience. Risk level: tests, history, size of change. And resilience: have you built canaries, staggering, and tests into your pipeline? And it turns all of that into a clear, "Can you
bypass the freeze?" decision. If you meet the criteria for your tier and event, the rubric says, "Yes." If you don't, it blocks you or nudges you to put more guardrails in place. The best part, as I mentioned earlier, is that this is not a hallway conversation. It's wired into our pipelines. So, the system applies the rubric consistently, the same way every time, and gives us data-driven, safe deployments. So, it's about understood risk, engineered mitigations, and pipelines that enforce the score for us. Everything we have shown you thus far (tiers, signals, pipelines, tactics) really only works if we learn after each big event and feed that back into the score. Even a great performance can be improved
for the next time if we keep evolving the playbook. After big events, we don't just say "that's over" and move on. We actually run post-event reviews and look at the whole picture. What worked? What broke? What was just unnecessarily noisy and stressful for us? Then we do signal and tier tuning. Did that tier two service behave like a tier zero? Was it mis-tiered? Did our red signals fire too often or not enough? And then we adjust. And finally, we update our playbooks and the pipelines. If we learned that a certain type of change really needs a canary or synthetics, we don't just write it in a document. We actually wire it into our
CI/CD flow. The goal is that every big event leaves us with a system that is a little smarter than before, and not just tired humans. So, the other half of continuous improvement is keeping your stakeholders in sync, so nobody's guessing how safe it is to play. During these big events, we share clear status. What's safe to ship? What's paused? Why? That builds trust in the framework. And post-event, we send out after-action summaries and say, "Here's what went well. Here's where we took the risk, and here's what we're improving before the next big performance." And we anchor all of this in shared metrics. Not just reliability, not just security, not just velocity, but all three
together. This keeps the conversation balanced. And so, this act is really about making sure the music keeps getting better. Every launch, every live event, every oh no moment, feeds back into that score, so the next performance is a little smoother, a little safer, and a lot less dependent on heroics. All right, I think it's time to wrap up. So, we're going to leave you with three things here to take away. Embrace risk-based decisions. Not everything is critical. Not everything is a play button. If you treat every service like your tier zero, you will burn your teams out, and still miss the real risks. Be explicit about impact and the blast radius. And aim your strictest controls where
they actually matter. Then, automate your risk signals. If your launch decisions still sound like, "I don't have a good feeling about it," or, "We have always done it this way," that's vibes, not data. So, let tiers, change type, and history speak for themselves, so the answer to, "Can we ship?" is grounded in metrics and signals. And lastly, use resilience tactics to ship safely. Canaries, regional staggering, synthetic tests: these aren't just nice-to-haves anymore. They are how you keep shipping securely during some of the biggest moments for your company. And you don't need to hide behind blanket freezes. So, if you remember nothing else from this conference and this presentation, here's what we would like you to remember.
Don't freeze everything just because you are scared. Know what truly matters, measure the risk, build the safety nets into your pipelines, and let your team keep playing. That's how you make your orchestra more resilient. Thank you, everybody. Thank you. >> We hope you enjoyed the talk. And please provide feedback. We would love to keep improving. And let us know if there are any other questions. >> Great talk, Prachi and Sandy. Give it up one more time for our speakers, please. Thank you.