
a San Francisco before we get started I just want to thank the organizers volunteers and speakers for BSides San Francisco and to all of you for spending time with me today my name is Miranda and this is my first conference talk ever so thanks for being so warm and inviting to me I go by oh hi Miranda on Twitter with a 0 and a 1 in my handle if you'd like to tweet at me and I'll also be making a BSides Slack channel called arcades and audits after this to post my slides and to answer any lingering questions you may have and I'd also really like to hear your stories and connect with you this avatar was made by
the amazing Tim Walsh and thanks to him for accurately capturing the stickers that live on my Chromebook although you can tell it was made a little while ago because I have even less real estate on that thing now I'll link his contact information in the Slack channel if you'd like to look him up I work at Duo Security a company that helps protect all users devices and applications using a zero trust platform I specifically work on the team responsible for Duo's infrastructure and operations or as some people call us the guys who work on resolving issues in the middle of the night when everyone else is sleeping here's where I give a quick shout out to
the ladies and enbies who work in operations and security but being fast on our feet is necessary when our service goes down our customers who have configured Duo to fail closed are protected but that means that when our service goes down so does theirs to make sure that we're fast and efficient we practice responding to outages we're obligated to perform business continuity disaster recovery drills twice a year for our EPCS audit but we also have other obligations throughout the year and we also just practice incident response to make sure that our ops game is strong our team has individuals from varied backgrounds in operations and security and having these drills hones cross-departmental collaboration root
cause analysis and also just builds up confidence levels around these sorts of alerts so storytime think about where you were the last time a natural disaster affected you or your work if you live or work north of here perhaps you were affected by the California wildfires that happened in October and November last year I consider myself really fortunate because I haven't been as impacted as that I've never needed to evacuate my home when I was first hired at Duo though I remember how I wasn't allowed to have computers my first week on the job it is incredibly difficult as an engineer to go through onboarding without a computer but it was because of the devastation that was
happening in Puerto Rico because of Hurricane Maria the agency there responsible for reporting as part of my background check was unreachable and Duo didn't want to give me access to production systems without verifying my background check understandably it was the least I could do to be patient and thankful that my friends and family were safe while I was frustrated it was later found that the total death count attributable to Hurricane Maria exceeded 4,600 people I don't know the name of the agency but the fact that this organization was able to resume business operations to ensure the security of their customers is a testament to their disaster recovery plan in action I was able to get my computers in about a week
and that's pretty incredible here's a less devastating example fire causes company customers to lose access to accounts this one happened recently just last month and I've blocked the name of the company here because this is a blameless presentation we see examples like this all the time when everything's fine companies using reliable services and microservices are able to maintain three or four nines of uptime for most companies that level of uptime is appropriate however when your company's service is being relied on for the standard of living of other people or basic human needs you have a responsibility to build resiliency and redundancy into your platform identifying single points of failure and planning for disasters is paramount and that's why we
plan for this just as no service is immune to outages no one is exempt from natural disasters either your company doesn't have to be using chaos engineering right now in order to rock at BCDR planning all it takes is a plan and proper preparation now that we have our frame set let's put your big beautiful brains to work and make a business continuity disaster recovery plan now if you're wondering what a business continuity disaster recovery plan is think of it as a plan that assists during disruptive catastrophes that would otherwise interrupt business operations but there are really two parts here so let's break them out business continuity is about making sure essential functions continue to operate
with minimal or no downtime whatsoever during and after a disaster or disruption a business continuity policy can be for a specific business unit or it can apply to the whole organization on the other hand a disaster recovery policy accounts for restoring data or applications if infrastructure is damaged or destroyed these two concepts can get smooshed together but it's important to keep them separate and distinct you can see the examples here of payroll during an earthquake and a data center availability issue if California were just to fall into the sea as addressing different issues the continuity of business ops and the data restoration in order for a drill to be a true BCDR test it has to involve
disruption of operations and prove that your team is capable of data recovery an important aspect of BCDR drills is to learn your acronyms the recovery time objective or RTO is the maximum time within which your service is expected to be restored when your service goes down what is your service level agreement to your customers how long would it take for you to recover your data and reinstate it in the new system the RTO is related to but different from the recovery point objective the RPO describes the limit of time that can pass during a disruption before the quantity of data lost during the period exceeds the BCDR plan's maximum allowable threshold when you lose
customer data what is the oldest allowable backup that you're willing to restore your RPO defines the policy like no backups older than a day old then there's time to resolution or TTR TTR is defined as the elapsed time in which you can fully recover a file or a full system from the disaster and this measure has a certain RPO and RTO associated with it here's what I hope is a simple example of RTO RPO and TTR in action suppose your company backs up databases to tape and the process takes two hours you've scheduled your system to back up the information twice a day outside of peak traffic hours at 6 a.m.
and 6 p.m. with this in mind your company doesn't believe that it should restore a backup that's older than a day 24 hours old because that information will be too stale and it's already getting backed up twice a day so you have an RPO of 24 hours they also believe that restoring a backup should take no more than two hours so you have an RTO of two hours now imagine you're living your best life going about your business when you realize through monitoring alerting or otherwise that one of your tapes failed at 1400 hours or 2 p.m. the most recent backup should have started at 6:00 a.m. and with the two hours to back up it would have finished at
8:00 a.m. counting from when that backup started at 6 a.m. until 2 p.m. means that your recovery point actual would have been 8 hours and if it takes 2 hours to restore data from tape then the recovery time actual would have been 2 hours so you would hit your RTO and RPO if you were able to restore data in one hour your TTR would have been one hour and it still would have satisfied your RTO and RPO but what if your backup at 6 a.m. was faulty and you had to use a backup from 6:00 p.m. the day before that would be a 20 hour recovery point actual but hooray for you that's still under your RPO if your team took three hours to resolve it
however your TTR would be three hours and you wouldn't have satisfied the RTO that was a lot but let's move on okay so the first step in designing your organization's BCDR plan is to inventory your company's assets and prioritize what to protect start with some basic questions like what does your company do what do your customers do what do they rely on you for how do they use your service across the board restoring your primary service to customers is probably priority zero or the most critical need and expectation that your customers have at Duo we focus on whether or not one of our customers can authenticate we collect key metrics around
authentications what kind of authentication is it is it a mobile passcode a push a phone call a hard token or something else what is the latency associated with that authentication is the latency associated with a specific customer or does it encompass an entire deployment depending on what in the production path has failed and is impeding authentication we also prioritize bringing certain services up again the question then becomes how do you test the team responsible for bringing these services back up naturally the National Institute of Standards and Technology has a standard around these exercises they have a standard for everything while 800-34 is specific to contingency planning for federal information systems these standards can be used for a multitude of
audit requirements they break it down into two realms tabletop and functional exercises tabletop exercises are discussion-based participants meet in a room or video channel and for a given example or given scenario explain what steps they would take to respond to an emergency scenario this is akin to the Dungeons & Dragons tabletop game in which a person called the dungeon master or DM sets the scene for players maintains pace and provides dynamic feedback the facilitator of the tabletop exercise drills participants on what roles responsibilities coordination and decisions they would make a tabletop exercise is discussion-based only and doesn't involve deploying equipment or other resources for this reason tabletop exercises are easy to set up don't have
infrastructure that needs to be spun down after the event ends and might be more comfortable for less experienced participants on the flip side tabletop exercises can be tricky if scenarios have a lot of complex moving parts participants must say what roles they're assuming what actions they're taking and who they'd communicate with which can be difficult for a large team taking concurrent actions across multiple departments understandably facilitators or DMs must also be experienced and able to improvise with target objectives in mind lastly I've found that it can be harder to get a sense of urgency if participants are unfamiliar with role-playing tabletop exercises or games and if you saw Mel Masterson's presentation on tabletop exercises
you're all experts at this now the other kind of exercise that NIST recommends is the functional exercise participants in a functional exercise are given a simulated environment in which they need to respond to an emergency scenario and the exercises can vary in scope from validating specific aspects of a BCDR drill to full-scale exercises that address all elements of the BCDR plan this could be something as simple as restoring a database or it could be as complex as an entire availability zone outage a functional exercise provides a more practical experience with systems in the event of a catastrophe it's also way easier to get a sense of urgency when you can send test VictorOps or
PagerDuty alerts to people in the same room as you which I get great personal satisfaction from it's kind of like the movie WarGames if you haven't seen the movie a high school student connects to a system that doesn't immediately reveal itself in an attempt to change his grades in actuality he connects to a supercomputer capable of go/no-go decision making in the event of a nuclear attack the supercomputer was programmed to continuously run war simulations and learn over time in perfect 1980s cheesiness David the high school student thinks he's really playing games when he's about to start World War 3 I won't spoil the ending although the movie is 36 years old and you all should
know that there are spoilers on the Internet but the computer concludes that nuclear war is a strange game and the only winning move is not to play unlike WarGames your functional exercise should be separate from production systems and not accessible by the public while still providing that exhilarating experience that will help your BCDR drill participants learn what their roles and responsibilities are in a disaster scenario however functional exercises can take a large time investment to spin up safe systems parallel to what you're using in production all artifacts and infrastructure will need to be deconstructed after the event as well so if you're short on time a tabletop exercise may be the way to go at this
point you may be thinking I know I need to do a BCDR drill for compliance reasons which could be for Electronic Prescriptions for Controlled Substances or Service Organization Control 2 or FedRAMP or a dozen other regulations out there but unlike you Miranda I'm not a huge gamer nerd why should I make this into a game and I would say well friend I got news for you and by news I mean science-backed peer-reviewed journals there are studies being published all the time that tout the benefits of playing video games in Eichenbaum Bavelier and Green's article Video Games Play That Can Do Serious Good the authors discuss how video games teach
players to succeed at a set of tasks that are initially quite difficult levels in an arcade game start with more simple and introductory controls and increase in difficulty over time as players begin to master them not dissimilar to the concept of spiral curriculum in educational psychology when a game is constructed correctly players have been shown to have higher levels of retention and are able to call upon those skills more quickly in the future in addition applying skills in a variety of contexts helps with decision-making for example if you have operations engineers they might be familiar with the concept of bringing up services in a new availability zone from company documentation but if they haven't done
that themselves they might not be super familiar with it and a functional exercise or a video game would help with that the article argues in favor of the idea that games that constantly challenge different aspects of perception attention and cognition in a variety of contexts are likely to result in enhancements of these base abilities the interactiveness of a game also has an effect over methods of learning in Debiasing Decisions Improved Decision Making with a Single Training Intervention a team of researchers from Cass Business School Carnegie Mellon University and others pondered whether different learning tools would affect participants' cognitive biases participants received a single training intervention in which they either watched a video or played a video game the interactive
games provided participants with personalized feedback on how biased they were during gameplay with the opportunity to practice decisions and learn strategies to reduce their propensity to commit various biases the six tested biases were bias blind spot confirmation bias fundamental attribution error anchoring projection and representativeness it was a page-turner the first experiment addressed three of these biases and the second experiment built on that foundation and addressed the other three biases the average reduction of bias over these two experiments was 17% more effective using video games over just a video in the short term and 10% more effective over a longer term two or three month period in other words gamifying the lesson instead of using traditional classroom
style learning or videos provided more of a benefit for the same time investment so don't get overwhelmed when thinking about preparing a BCDR drill or participating in one if you're new to the field it might seem a bit overwhelming however gamifying drills is positive reinforcement it'll be fun while you're training and when the real thing happens and you need to be quick on your feet the steps you'll take will be second nature to you now that we've looked into the why and how of organizing a BCDR drill I'd like to share my three best practices for running a successful event care make it remote and timebox it care this is more than just having
empathy for the people that you're orchestrating this BCDR drill for it's actually caring about your drill making sure that the details are accurate that it tells a whole story and that you're able to define what is and what is not part of the drill that part is key make sure you provide enough context for participants and let them know what the boundaries are for the drill your people might get distracted by a red herring so make sure there are enough people scribing or not actively participating who know what the scenario is and can guide participants I find that one facilitator to three participants is a good figure but scale up with complexity also keep
in mind that your teammates are human they are mere mortals who require sustenance when the adrenaline's going they might not think about basic human needs like eating I love to eat I have found that food is an excellent bribe to get volunteers to participate and goes a long way when energy starts to wane during the drill there can be a lot of excitement in the drill so just like with phishing campaign exercises make sure your people know that they don't have to feel ashamed if they make a mistake it's all part of a learning experience even if they don't hit their RTO or your company's RTO lastly in my experience there's always something to iterate and
improve on during a BCDR drill and when you care about your people and the exercise others will be more likely to help out and take on action items after the drill is over make it remote it's difficult to have everyone on the team get together for a one-hour meeting let alone one that spans a few hours or all day with this in mind plan your drill so that people can participate without having to be in the same room while the convenient part of a drill is that you don't have to evacuate or get to a safe spot in a real disaster you might have to relocate or work remotely even if you don't have a distributed team there
might be a chance that someone who wanted to participate had their own little disaster crop up and so making it remote will help them be a part of that also make sure that your people know where to look for clarification or resources in a shared folder directory or safe branch of your repos to work in on my caring slide I also mentioned how food is a great bribe and that I love food if it's within your company's means letting remote employees expense a lunch goes a long way and lets them feel like they're in the room even if they can't be timebox it all arcade games have a set limit for you to complete a level in
having a BCDR drill that lasts all day is inhumane and you will make many enemies if you do that unless you want to fight in some epic boss battles keep your drills timeboxed to a couple of hours enough time for participants to take a solid try at hitting your company's RTO and if you feel up for it a debrief where participants can share feedback on the drill and processes to improve on I like to schedule the debrief later in the same day so that participants can decompress and compile what they just went through but not so far in the future that it isn't fresh I would also caution organizers to avoid scheduling the
debrief more than a week or two after the event in summary business continuity disaster recovery drills may be mandatory for audits but just because you and your team have an obligation to practice doesn't mean that it should feel like a chore establishing feasible recovery time and recovery point objectives is important to balance your service level agreements to customers and practicing often will help your team's mean time to resolution making sure to prioritize the right aspects of your business operations and gamifying the experience for your team will help set them up for success when the real disaster comes remember what formats are available to you when conducting your BCDR drill the tabletop exercise is discussion-based and while it's easier to set up it
requires that the facilitator is experienced and can provide dynamic feedback a functional exercise uses infrastructure similar to what you'd be using in production while participants may get a more hands-on experience with a functional exercise spinning that infrastructure up and down might be harder with a time constraint also remember the three best practices for running a BCDR drill care about the drill and your team and support them with the proper resources make it remote friendly since if you had to do a disaster recovery for realsies you wouldn't have the benefit of being co-located plus you can never prepare for all the little disasters that might crop up lastly make sure you timebox the event to a few
hours so that your team can try to hit your company's RTO and TTR and all of those other acronyms as well but not so long that it's insufferable these are the video games that I referenced and their release dates and thanks to you all for being here [Applause] awesome thank you so much Miranda we do have a few minutes for questions if you have one you can raise your hand and I can bring the mic to you it's oh hi Miranda with a 0 and a 1 instead of an O and an I you're welcome you're getting a lot of fun
the question was what is the best way to get buy-in when doing a tabletop exercise for that I would say making a great argument and getting buy-in from leadership goes a long way at Duo our BCDRs usually involve our operations team and our security team because we try to bundle like incident response with security response and so we have a really great relationship across departments and that helps a lot but there's also a really strong argument for you know if you're feeling a time crunch a tabletop exercise is definitely the way to go great question yes the question was what elements of gamification do you find most valuable among employees my team gets really competitive and having that
sense of urgency is really great for finding things to really hone in on we have a lot of subject matter experts and I try to construct the drills so that the subject matter experts don't make decisions for the entire team like I had one drill where I split the team up into three sub teams and I named them Atari Bally Astrocade and Commodore teams A B and C and I had one subject matter expert on each sub team and I found that that was really great because the subject matter expert could disseminate information on their sub team and it wasn't like those three subject matter experts were driving the drill I hope that answers your question
I can't see in the back if there are any questions up there I think we're good awesome thank you so much once again Miranda and here [Applause] [Music] [Applause]
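
The tape backup example from the talk (RPO of 24 hours, RTO of 2 hours, a failure noticed at 2 p.m.) can be sketched as a short calculation. This is only an illustration under stated assumptions: the function name `assess`, the dictionary keys, and the calendar date are hypothetical, and the recovery point actual is counted from the time the backup started, which is what gives the 8 hour and 20 hour figures mentioned in the talk.

```python
from datetime import datetime, timedelta

# Objectives from the talk's example.
RPO = timedelta(hours=24)  # oldest backup the company is willing to restore
RTO = timedelta(hours=2)   # longest acceptable time to restore service


def assess(backup_started, failure_detected, time_to_resolve):
    """Compare one incident against the objectives (hypothetical helper)."""
    # Assumption: the backup reflects data as of the moment it started.
    recovery_point_actual = failure_detected - backup_started
    return {
        "recovery_point_actual": recovery_point_actual,
        "meets_rpo": recovery_point_actual <= RPO,
        "meets_rto": time_to_resolve <= RTO,
    }


# The date is arbitrary; only the times of day come from the example.
failure = datetime(2019, 3, 4, 14, 0)          # tape failure noticed at 2 p.m.
morning_backup = datetime(2019, 3, 4, 6, 0)    # 6 a.m. backup
evening_backup = datetime(2019, 3, 3, 18, 0)   # 6 p.m. backup the day before

# Case 1: the 6 a.m. backup is good and the restore takes two hours.
case1 = assess(morning_backup, failure, timedelta(hours=2))

# Case 2: the 6 a.m. backup is faulty, so the previous evening's backup is
# used and the team takes three hours to resolve the incident.
case2 = assess(evening_backup, failure, timedelta(hours=3))

for name, result in (("case 1", case1), ("case 2", case2)):
    print(name, result)
```

In the first case both objectives are met. In the second case the 20 hour recovery point is still within the RPO, but the three hour resolution misses the RTO, which matches the outcome described in the talk.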