Confessions of Disaster Recovery Padawan

Name: Confessions of Disaster Recovery Padawan
Uploaded: 2020-08-28
Duration: 34 min 45 s
Description: Matt Fisher describes building a bottom-up, community-driven disaster recovery program in a startup environment. The talk covers tabletop exercises, datacenter failovers, incident readiness, internal communication, and strategies for sustainable DR program growth through volunteer stewardship and mentorship.

BSides Columbus · 2020 · 34:45 · 7 views · Published 2020-08

Speakers

Matt Fisher

Tags

Category: Technical

Style: Talk

About this talk

Matt Fisher describes building a bottom-up, community-driven disaster recovery program in a startup environment. The talk covers tabletop exercises, datacenter failovers, incident readiness, internal communication, and strategies for sustainable DR program growth through volunteer stewardship and mentorship.

Original YouTube description

This is not your Grandpa's Disaster Recovery (DR) program. Describing the successful evolution of a bottom-up, community-driven Disaster Recovery program in a start-up culture. Includes an introduction to the dark arts of tabletop exercises, datacenter failovers, incident readiness, internal communication and publicity, improving engagement, shifting left in DR, and more. Attend this talk if you want to boot up your own DR program or to see if there's anything worth stealing from a Galaxy Far, Far Away. (Spoiler: this is working for other programs too, not just DR.) This Presentation was featured at BSides Columbus 2020 on August 21st, 2020

Transcript [en]

Welcome to BSides Columbus and Confessions of a Disaster Recovery Padawan: a tale of loss and regret, but also of resilience and hope. My name is Matt Fisher, but I go by Fisher because there are always a lot of Matts. Here are some things about me. If you paid for a BSides ticket and you want to contact me after today's session, you can email me at fisher cmh gmail.com, and let's say this offer is best before 2021.

I'm being a little bit coy about my employer because I didn't ask for permission. You could probably do this better than me, so you should give it a go. I'm not a DR expert; I don't even play one on TV. I'm not a lawyer, not a doctor. Any opinions or typos are my own, definitely not my employer's. You can disagree and correct me in the time for questions, and that's all good, but please be kind and respectful so we don't spoil it for anyone else.

So I'm going to assume that you know something about disaster recovery and that you can google for things like RTO and load balancers and server affinity if that's the help you need. I'm sure there are many great resources out there, but this isn't a DR explainer; it's about building a DR program. I have about three hours of material on the cutting room floor.

So, prequels and sequels: the order is not strictly linear here, and it's not an homage to the franchise, it's just the way it tumbled out. We'll begin with a narrative that sets the scene, maybe where you are right now, and then we'll look at where we're heading: the documents, the tabletop exercises and the failover tests. This is a really big topic, so these are pretty skinny. And then we have your quest: how do you succeed at this, how do you begin, how do you grow, how do you excel? And then we're done.

The first confession is I don't know much about Star Wars. I watched the first three movies back in the day, but the rest of it sort of passed me by. If you're watching this in a Star Wars costume because you thought this was about Jedi, then I apologize; you were lured here under false pretenses. But for those of you that came here for disaster recovery, let's get started.

Let's talk about these graphics. They're poster ideas for promoting DR and business continuity events internally. It's quite in keeping with our culture, but it may not fit yours. If it does, think about adding some matching stickers and whatnot. That particular poster is a riff on the movie 28 Days Later. We had this pandemic tabletop exercise scheduled for March 2020, but real-life events kind of overtook us, so we canceled it. There was really no need. And that's worth noting: if your event is eclipsed by a real-life situation, then you can skip your event. You already tested your plan in the real event, and no one's going to thank you for making them relive it in practice mode right afterwards. Something to think about.

I had this big story-arc narrative that was going to explain some fictional startup, not the one I work at. I'm going to say you're not brand new, so you've been around for five, ten, fifteen years. You're not cloud native; it's so much easier these days, right, for these cloud kids. You had startup priorities, so did it make sense to invest in DR when you maybe weren't even going to make rent next month? Then you get this big customer that really makes you up your game. It moves you out of your garage and into a proper data center, so you're colo'd out there, and maybe there's a second data center in time. But this big customer is making you jump through compliance hoops, and that feels like a compliance risk, like a pressure on the business forcing you to do the right thing. I'm kind of okay with that; that's a control.

So, data centers: first data center; second data center for resilience; third data center is the cloud, but with no second region in the cloud, because that would be a fourth data center and that's just crazy. But the resilience implication of running out of one data center is that you're still tied to a geographical region, and then there are those unlikely threats, and the compliance. I had a lot to talk about here, but it kind of ended up on the cutting room floor, so we're going to gloss over this.

Now we have the evolution of the controls themselves, and this pretty much follows the arc of the company, of your startup. We start off quite modestly, and then more sophistication is added: we've got the clustering and the load balancing, we've got the pools and the farms, and we're able to ride out those single-node failures pretty well. But then you get some larger techpocalypse and your fault tolerance just can't save you. You've just cratered in your data center, and you don't yet have that second data center, so you've got nowhere else to go, and your customers are raising eyebrows at the time it takes for you to recover.

And that's when that second data center becomes a real thing.

Business continuity: there's some discussion about this. Disaster recovery is certainly a component of business continuity, but if you don't have business continuity already in place, you can't draw from it; you're going to be feeding it. You're going to be finding out the criticality of things, and the single points of failure, perhaps before business continuity comes along.

Clouds and containers: we may not have started with them, but it's definitely something we have to do. Our dev and engineering talent pretty much demands it. They've got to future-proof themselves, and if we can't let them play with those features here, then they're going to go somewhere else and do it. So every time we replace some system or build some new product, we're thinking cloud first. Retention of that talent is a business continuity concern, and it's legitimate.

Okay, and now everyone's favorite: documentation. You are going to need different documents for different purposes.

You're going to need an executive summary of the program, no more than two pages. I recommend you work on that now instead of waiting until you're asked to deliver it in five minutes' time. Your future self will thank you.

The DR plan is the big kahuna. It's an enormous document. Start now, even if you're just putting in the skeleton of headings and placeholders for the things you think you're going to need. I'm wary of describing this too well because (a) it will take up all of the time and (b) these plans are usually proprietary and protected by non-disclosure agreements. But of course you can find templates and advice around the place, and you can look at what your customers are asking for and what your framework is looking for, and you can key off that. Just get started, because you're going to be adding and adding to this over time, and it is going to be a huge collaboration with many other people. Even if they're not wanting to come to the table early, they are at some point going to have to own their piece. This document is likely going to summarize things with overviews at the beginning that describe all the things, and then each of the things is going to have its own appendix or chapter that somebody owns. It's their piece, and it might be on a need-to-know basis, so that might be the only piece that they see. So you're going to need some rigor around the collaboration.

One of the things that's definitely going to be in there is the criticality: what does the business care about the most? What's the most important thing? Make sure those are outlined. If there are any metrics around them, such as SLAs and commitments, and requirements like your regulatory stuff and your contractual stuff, describe what compliance looks like, what non-compliance looks like, and some idea of what the impact is if you don't meet compliance.

The failover runbook is, for us today, the most important of those appendices. It is the step-by-step instructions for how to get from data center A to data center B once a disaster has been declared: how to turn down A, how to get things ready, how to stand up B, how to validate. That gap is the latency between running from A and running from B.

And then compliance records. At the very least it's your version histories of the documents, and there will be review, perhaps, and sign-off. Maybe there are minutes recorded, if that's your culture. Again, you can look at the sort of things that your framework or customers or compliance people are asking for: what is the document set that they expect to see for a DR program? Then achieve that checklist.
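To make the gap between running from A and running from B a little more concrete, here is a minimal Python sketch. It assumes the scribe's record can be reduced to timestamped runbook steps; the `StepRecord` shape, the step names and the two-hour RTO are all hypothetical and are not taken from the talk.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class StepRecord:
    """One executed runbook step (hypothetical shape for illustration)."""
    name: str
    owner: str
    started: datetime
    finished: datetime


def failover_gap(steps: list[StepRecord]) -> timedelta:
    """Elapsed time from the start of the first step to the end of the last."""
    return max(s.finished for s in steps) - min(s.started for s in steps)


def meets_rto(steps: list[StepRecord], rto: timedelta) -> bool:
    """True when the measured gap fits inside the recovery time objective."""
    return failover_gap(steps) <= rto


# Made-up timings for the three broad phases named in the talk.
t0 = datetime(2020, 8, 21, 22, 0)
log = [
    StepRecord("turn down A", "network", t0, t0 + timedelta(minutes=20)),
    StepRecord("stand up B", "platform", t0 + timedelta(minutes=20), t0 + timedelta(minutes=65)),
    StepRecord("validate B", "qa", t0 + timedelta(minutes=65), t0 + timedelta(minutes=90)),
]

print(failover_gap(log))
print(meets_rto(log, timedelta(hours=2)))
```

Recording an owner per step also reflects the point that each piece of the plan belongs to somebody.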

Tabletop exercises are the real bread and butter of your DR program. They're just pretend, right? It's a paper exercise: the intellectual process of going through the runbook line by line and making sure that it's right, that it's going to work, and doing that regularly. So you're testing the runbooks, the documents, to make sure they're still fit for purpose.

You're going to do this with real participants. You're going to pull the sort of people that are going to get paged, the engineers and the devs that get paged in a real-life situation, into a room or onto a call, and you're going to go through this together. They're going to speak to their part of the failover process, and it's all to improve their readiness. You want those people essentially getting trained up in this, and that's incident response training: the disaster is the incident and the failover is the response, so this is simply an incident scenario that we rehearse. So don't have it be the same person from the team turning up every time; make sure you rotate through the team.

And be clear: are we going there and back in our tabletop exercises? I really think you should, because there's an outbound leg and a return leg, and they're not just mirror images of each other. It's always more complicated than that, so make sure you practice in both directions.

And compliance: what do your compliance folks need to properly register the fact that these are taking place? Just ask them, and then provide it.

Let's look at some of the key roles that are needed in the tabletop exercise, because you really can't do all of this on your own; you're really going to need some help. I will say that, for reasons of culture, some of these names might not be the sort of thing you'd find in your corporates, because you don't have, like, a chief unicorn officer or something. So don't worry about the titles so much; just appreciate what the roles are. We have a quarterback, we have a scribe, we have comms, we have a party planner, we have a witch hunter general, and we have a disaster master.

Let's look at the quarterback role. That's what we call the person running the incident, the incident lead. They're going to be calling out what happens next and really directing the effort. Now, that should all be written down in the runbook, so they're going to be following that runbook. It's a trained position; there's some training involved in leading an incident, and our incident response program trains incident leads, or quarterbacks. So we simply reach out to them and ask for a freshly minted quarterback to be nominated to come and join the next tabletop exercise. It's good training for them, and it's a win-win.

Let's take a look at the scribe's role. I think it's arguably the most tiring role, because you're keeping the running record of the event, so you're listening and typing at the same time for most of the time. Chat is really your friend. I like to open a document and copy-paste from the chat, and I scrub out the banter and the emojis and whatnot, and preserve that record as we go along. If there's any wrinkle in the runbook where we say we're going to have to fix that and update that, we keep going in the tabletop exercise, but we need to circle back and fix it. So, scribe, you keep a running list of all of those things.

And that's going to be the basis of the retro at the end. The scribe needs to be paying attention, so when there's too much chatter going on, you've really got to hush people so you can concentrate. The retro is the last 30 minutes of the tabletop, and we're going to work through that list very quickly. We just need a high, medium or low determination, a name on each item, and the time, so that anybody who is unsure about how to work that ticket can refer to the record and get some context. Then at some point in the coming days the scribe's going to have to publish that record as part of the documentation for the event. So scribe is not a difficult role, but it is continuous work.

The comms role, the communications role, doesn't really have a lot to do during the tabletop exercise, because nothing's really happening; we're just pretending. There's nobody banging on the door for updates. But in a real event their job would be to shield the quarterback from that sort of outside pressure. They've probably got one foot in the failover room, and they've probably also got one foot in the marketing and communications room. They're in both meetings, and they're the ears in between the two. They're going to have to summarize the situation for the non-technical stakeholders outside of the room. It's important, because executives are going to need to know what's going on: where we are, ETAs, updates, all that kind of stuff that people are going to need coming out of the room. The comms person doing that means the quarterback can focus on the actual failover. And it goes both ways: there might be external events that are pertinent, so the comms person can communicate from outside into the room as well. Now, in practice everybody's on Twitter and email and the grapevine anyway, so it's all going to get there before the comms officer, but that is their role.

The party planner's role is huge; you should really chop it up across a few people. There's so much work here. You've got all the room booking and scheduling that has to happen, and that in itself is a bit much. You've got to announce it and invite people, and if you're going to invite people, you have to have the right people there: the right mix of people from the different tech teams, rotating the participation, and all of that kind of stuff.

I will say it's tempting to just write to the managers or the directors and say, hey, here are the people that I need. But in a bottom-up, community-driven culture, that is not going to work; you're going to get some backlash. So you call for volunteers and you let people volunteer themselves, and if that doesn't work, then as your last resort to get rears in seats, maybe you try that top-down approach.

Schedule the day. Have some sort of plan for how it's going to work out, and then you've got to manage that day, you've got to manage that schedule. If anybody's interested, hit me up for how I do it, just as a suggested starting point. A lot goes into managing the day, with all sorts of things to consider. It's not a technical role: anybody who's reasonably organized and assertive can be the party planner. Roles like this are a great way of getting non-technical people into the DR program as well.

Witch hunter general: the most unglamorous and thankless task. You're going to chase those retro tickets. That's it, really. There are different kinds of issues that we're going to find in the retro. The vast majority of them are going to be updates to the runbook: tweak the documentation or get it up to date. Sometimes we tweak the process itself, and that's all good. Sometimes there's a legit technical problem that surfaces, and okay, that needs some engineering or code or something. And then there are those icebergs, those real big blockers that are really hindering our ability to fail over, those architectural and systemic or legacy issues that are really slowing us down. Capture those too.

And those retro tickets, you've got to be able to explain them. You've got to chirp at people: hey, this ticket is three weeks old, or three months old. Then the person who got it assigned is going to say, what does this even mean? You've got to be able to explain that. Thankfully, in your retro somebody put a name on it and somebody put a time on it, so you can refer to the record and get some context, or at least bring in somebody else who can give that context. But remember, witch hunters: it is not on you to fix the thing. Your job is to make sure that all the findings get some sort of attention. If they're not going to get done, then you can just discard and close out those tickets, but don't just let the tickets sit out there.
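The chasing job can be sketched in a few lines of Python. This is only an illustration, assuming each retro finding carries the priority, name and record time described above; the `Finding` fields, the three-week threshold and the sample tickets are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Finding:
    """A retro finding as captured by the scribe (hypothetical shape)."""
    summary: str
    priority: str          # "high", "medium" or "low", decided in the retro
    owner: str             # the name put on it in the retro
    raised: date
    record_time: str       # where in the scribe's record to look for context
    status: str = "open"   # "open", "done" or "discarded"


PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def stale(findings, today, max_age=timedelta(days=21)):
    """Open findings older than max_age, most urgent and oldest first."""
    old = [f for f in findings
           if f.status == "open" and today - f.raised > max_age]
    return sorted(old, key=lambda f: (PRIORITY_ORDER[f.priority], f.raised))


findings = [
    Finding("Runbook step 12 names a retired host", "low", "ana", date(2020, 5, 1), "00:41"),
    Finding("No tested way to fail back the queue", "high", "bo", date(2020, 7, 20), "01:10"),
    Finding("Pager list is out of date", "high", "cy", date(2020, 8, 10), "01:25"),
    Finding("Rename the chat channel", "medium", "di", date(2020, 5, 1), "01:40", status="discarded"),
]

for f in stale(findings, today=date(2020, 8, 21)):
    print(f"{f.priority:<6} {f.owner:<4} {f.summary} (see record at {f.record_time})")
```

Discarded tickets drop out of the report, which matches the advice that closing a ticket is fine but letting it sit is not.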

The disaster master role is, for me, the most fun. It's like being dungeon master in Dungeons & Dragons. You come up with the scenario, some reason why we're being forced to fail over. Make it something plausible, realistic enough but different; don't have a fire every time. You really want to take the tabletops into places that are uncomfortable and surface issues. Not in the very beginning: just keep it simple then, you don't need to do that, there's enough work to do. But once you get into a regular pattern and you're feeling reasonably on top of it, then challenge everybody by going to the hard places, and you'll get a lot more out of it. Throw in some curveballs: take away phones and video conferencing halfway through, or have a storage array go down, just something to get people thinking on their feet. And do make it fun. It's the biggest opportunity to make this fun, so come up with something interesting. And don't forget comms: as part of your scenario, say the CEO needs regular updates on the hour and at half past the hour about what's going on, where your progress is, ETAs and stuff like that. If you can, try to make the communications person work for this.

That brings us to the data center failover test itself. It's a real event: you're really going to swing over to the secondary data center, and you're really going to have downtime. It's a planned event, it's in the calendar, as opposed to an unplanned event where lightning strikes, disaster's called, and we're going to fail over in the middle of the day.

Customer communications: this is key. You're going to have to get the message out there. You're going to have to shepherd that through your marketing and communications folks, and then through your customer relationship managers, if you even know who they are, and that's going to take some time. And you need to be giving your customers 60 days' notice, or whatever your contract says, so you need to start this months ahead of time. You really need a plan for this, and you really need somebody on top of it; if you don't, the business is going to hear about it from customers. And don't neglect your unmanaged accounts either, those users that signed up on the website. Although their terms of service don't necessarily say they get notification, you want to get that message out there with web banners and emails and stuff. Otherwise they're going to be flooding your 1-800 number voicemail or help desk email address, and it's hard to declare a successful failover test when some other department is jumping up and down about the impact that you caused them.

The night itself: there's plenty to do. Consider having a party planner just for the night, just to make sure that garage doors are open, pizza is delivered, the rooms are cleared up, all that kind of stuff.

The retro: don't hang around for the retro. The invite should have gone out weeks ago, so that a few days later, midweek the following week, is when you bring people together and go through that retro list. You're not going to keep people around, and you're not going to keep your scribe around to tidy up the record or the retro list. Give them a couple of days; as long as it happens before the retro, that's fine.

Prepare for failure. Just to state the obvious, dear people: failovers don't always succeed. Plan for that failure. This really goes all the way back to selecting a date and communicating it out to customers. You really need at least two dates communicated and accepted out there. If it all works great on the first night and you don't need that second, do-over date, that's fine, just cancel it; that's easy. But if you have to back out and you need that do-over, it's really hard to scramble to get that in at short notice. It's just horrible.

You have all these decision points where you do get to back out, in the run-up to the failover test and in the failover test itself. So you need to identify those opportunities to back out and put them in the timeline, along with the criteria under which you should back out, so that the quarterback doesn't have that judgment all on their shoulders. Make it easy. What happens if you exceed your change window? Maybe it's no big deal; maybe there are penalties for that and you need to declare an incident instead. Know these things. If you need to send out any communications, make sure they're pre-canned and pre-approved and they're in the runbook, so it's easy to execute on that too.

The DR event calendar: I recommend scheduling your events a year out, even if you haven't communicated them a year out, just to know where they're going to fall, in which quarter. It's mostly tabletop exercises, so aim for one per quarter. Two of those are scenarios for your failed data center: in one of them your primary is still present, and in the other it's gone, it's a smoking crater or something, it's just not there. One of your tabletop exercises is the rehearsal for the failover test, so that's a freebie; there's no scenario there. And then you've got your actual failover test. If you don't think you can do it biannually, buy yourself some time. So, pretty straightforward.

So how do we get started? Well, we start small. It could be just a couple of engineers that are really worried about how they're going to be able to fail over their systems if disaster strikes. I call these the concerned citizens, and they're very committed. They're worried about their own systems primarily, but they have dependencies as well, and they can reach out horizontally to other engineers and bring them in too. Don't worry if you hit some seemingly intractable problem; just pivot to something that you can work on. It's important to just keep moving. That big problem, just mark it as such and keep going. You've got to be documenting, so just start putting this runbook together. Version one can be scrappy; you're going to iterate and refine. Just write it all down. You knock on some doors and get more people involved, and explain what you're trying to achieve and why, and it will make sense to people. Try some small mini tabletops. Maybe it's just you and one other engineer working step by step through their system until they're happy, and you add that to the runbook, and maybe you build from there in small groups.
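Going back to the failover test dates and the event calendar discussed above, the date arithmetic is simple enough to sketch in Python. The quarter layout in `YEAR_PLAN` is one possible reading of the calendar, and the one-week do-over gap, the function name and the example date are assumptions made for illustration; the 60-day notice period is the talk's own example and should be replaced by whatever the contract says.

```python
from datetime import date, timedelta

# One possible reading of "one event per quarter" (an assumption, not a rule).
YEAR_PLAN = {
    1: "tabletop: scenario with the primary data center still present",
    2: "tabletop: scenario with the primary data center gone",
    3: "tabletop: rehearsal for the failover test",
    4: "failover test (primary date plus a do-over date)",
}


def failover_milestones(primary, notice_days=60, do_over_gap_days=7):
    """Key dates implied by a chosen failover test date.

    notice_days is the contractual customer notice period;
    do_over_gap_days is a hypothetical gap before the fallback date.
    """
    return {
        "notify_customers_by": primary - timedelta(days=notice_days),
        "primary_date": primary,
        "do_over_date": primary + timedelta(days=do_over_gap_days),
    }


milestones = failover_milestones(date(2021, 10, 16))
for label, day in milestones.items():
    print(f"{label}: {day.isoformat()}")
```

Working backwards from the test date like this makes it obvious how early the customer communication has to start, and it keeps the do-over date in the plan from day one.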

So how do we grow the team? Well, you ask for help. You need some volunteer firefighters that are interested enough to donate some part-time hours in addition to their day job. You're aiming for that first full tabletop exercise, so you need to get to the point where your runbook covers the failover process pretty much end to end, and then you can schedule this first tabletop. It's as much to introduce people to the process of the tabletop, and to the idea that a planned failover is something that everybody needs to practice. And then it's continuous improvement. You're going to have a ton of findings every time, but that's continuous growth: you're going to be drawing more people into this process, and your coverage is going to grow. You need to keep documenting, you need to keep maintaining that runbook, and you need to delegate ownership of the different parts of that runbook to some of these teams. Then you've got to iterate, and you've just got to keep doing it.

So the failover runbook is in pretty good shape, you've had a couple of tabletop exercises, and the process is working, but it's only been possible because of all the volunteer help. There's no way one person was going to do this on their own, part-time; they would just burn out. So how do you sustain that volunteer help?

We introduced the idea of stewardship. Stewards are co-owners of the program; they get to shape it, and it's a one-year commitment. We ask maybe a dozen volunteers if they will make this commitment of four tabletop exercises and a failover test, if we think we're ready. After that it's somebody else's turn, but we would like them to hang around for a second year to mentor the newcomers so they're not thrown in at the deep end.

We also have a wider community of people interested in or concerned about DR. Their level of commitment may not rise to stewardship, but if we create some sort of forum or chat room or intranet space where people can bring their concerns and ideas, then we can still crowdsource a lot of this, and this is where we're going to recruit from in the future. So stay positive; people will find this.

Sustainability means that the program should survive you, and should survive any of the stewards, so you really need to be recruiting your successors. We ask every steward to recruit their replacement, and those replacements are the Padawans.

Padawans need training, and we do that in three ways: I learn from reading, I learn from seeing, and I learn from doing. We start off with some explainers and introductions to some ideas and some definitions of terms: some book work, some reading, maybe some quizzes, some homework. Then, I learn from seeing, so they're going to attend a tabletop exercise. This will be the last tabletop exercise organized by the stewards, and the Padawans will learn from it. They'll watch and ask questions, and maybe they'll shadow, and they'll say, okay, I get it, this is what we're trying to achieve. And then they learn from doing: now the Padawans themselves are going to organize tabletop exercises and tests, mentored by last year's stewards. That really is the cycle of life, the pipelining of the generations: a Padawan graduates to a steward, a steward graduates to a mentor, and a mentor ages out.
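The generational pipeline described above behaves like a small state machine, which a short Python sketch can show. The role names follow the talk; the `roll_over` function, the roster format and the example names are hypothetical.

```python
from __future__ import annotations

from enum import Enum


class Role(Enum):
    PADAWAN = "padawan"
    STEWARD = "steward"
    MENTOR = "mentor"
    AGED_OUT = "aged out"


# Each yearly rollover moves everyone one step along the pipeline.
NEXT_ROLE = {
    Role.PADAWAN: Role.STEWARD,
    Role.STEWARD: Role.MENTOR,
    Role.MENTOR: Role.AGED_OUT,
}


def roll_over(roster: dict[str, Role], recruits: list[str]) -> dict[str, Role]:
    """Advance every member one stage, drop those who age out, add new Padawans."""
    advanced = {name: NEXT_ROLE[role] for name, role in roster.items()}
    kept = {name: role for name, role in advanced.items() if role is not Role.AGED_OUT}
    kept.update({name: Role.PADAWAN for name in recruits})
    return kept


this_year = {"ana": Role.MENTOR, "bo": Role.STEWARD, "cy": Role.PADAWAN}
next_year = roll_over(this_year, recruits=["di"])
for name, role in next_year.items():
    print(name, role.value)
```

Passing an empty `recruits` list shows the sustainability risk directly: the Padawan tier disappears after one rollover unless every steward has recruited a replacement.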

So how do we become a legit program? I'm not sure there's any graduation ceremony for that, so I think you just have to walk like a duck and quack like a duck. And I realize that this slide feels like playing defense.

You can't do all of the things, so define what your scope is, and then make sure that your scope aligns with the business's priorities; that's always helpful. Come up with some KPIs, because if you don't, somebody else will foist them upon you. They should be realistic and achievable. Just measure things, whatever you can measure, really.

Have a policy and a standard, and do that before the policy writer cuts and pastes somebody else's policy from elsewhere. The policy just needs to explain that DR is happening and that you're achieving all the criteria set out in the standard, and then the standard will list all of the goals and objectives and the measurements. The golden rule of policies is that we say what we do and then we do what we say, so bear that in mind.

And then connect with other programs. If there's a risk register in the risk management program, make sure you're registering items on it. If somebody is doing in-house training modules, make sure they turn your Padawan training into a module. If there's business continuity, make sure you reach out and connect with those folks, and that you're aligned.

So what is next level for our DR program? Honestly, just hitting all of your targets is achievement enough.

Shifting left means getting earlier into the process. We want to be in the design phase, we want to be in the requirements phase, we want to be in the planning phase, so that other people are building in DR much earlier in what they do and you don't have to try to bolt it on at the end. I've had a lot of success with threat modeling. When we threat model some new proposed product, I make sure that I insert something about the DR requirements: tell me about your vendors, tell me if you're doing anything that isn't our standard DR practice, explain it to me. It's a great way to pick up findings.

Being a differentiator: when you have a customer tell you that your DR program is a cut above, it's really validating, more validating than it should be, but it also gets the attention of the business. You'll find that with success coming from outside, from an auditor or a customer, people will start to associate with that success, and things get much, much easier. If that happens, you will probably attract an executive sponsor who wants to further your success, and they're going to put backing and resources behind you. That is great, because it means you get to play at a more strategic level, and some of those big intractable problems, now you get to start tackling a couple of those. It's a solid improvement to your readiness.

And you get to present at BSides. So that's the end. Thank you, everyone.
