
Okay, we're going to stay on schedule, so our next speaker, Wolfgang Goerlich, is a hacker strategist. He's an Advisory CISO for Cisco Secure. Prior to his role he led IT and IT security in the healthcare and financial services verticals. He's also held vice president positions at several consulting firms, leading advisory and assessment practices. He is an active part of the security community, co-founding and organizing security conferences and events. Wolfgang regularly advises and presents on the topics of security architecture and design, identity and access management, data governance, secure development life cycles, zero trust security, and more. So I am going to welcome Wolfgang to the stage with "And the Clouds Break: Continuity in the 21st Century."
Thank you so much. Let me first make sure my clicker works here — ha, it does. Okay, great, back works too. We're going to be talking about disaster recovery, we're going to be talking about business continuity. As was already mentioned, I did a lot of consulting in my past. I worked in healthcare, I worked in financial services, so I got to build these programs and I got to advise folks on these programs way, way back in the early 2000s. Now I could say "at the turn of the century," which sounds much more impressive than it felt at the time. At the turn of the century we were building disaster recovery solutions, and it's interesting, because twenty years later it feels like we still make some of the same mistakes. So I'm going to share some of those mistakes and share some context. Now, two decades ago we didn't really have good data, we didn't really have good insights as to what made a program good or bad. We had some gut feel, we had some expertise. We were like, ah, this worked and that didn't, and oh, that person seems to be doing okay, and oh my goodness, don't talk to them, the week they've had. We had a lot of those types of stories, but we didn't have really good data. So in this talk I'm going to be referencing a survey
that Cyentia did in partnership with Cisco. The Cyentia Institute does a lot of different security studies — fantastic analysis of data. They worked with Cisco on the Security Outcomes Study, effectively asking: what works in security? They surveyed around 5,000 people, double-blind — they didn't know us, we didn't know them — crunched the data, and gave us some results. And surprisingly, business continuity and disaster recovery jumped out of that; resilience jumped out of that. We thought what people were doing that would make a good security program would be something like cloud security, or SASE, or zero trust — something cutting-edge. No: one of the main drivers was continuity and disaster recovery. Weird. Weird for those of us who've been doing it for a while and kind of feel like we're the dowdy cousin no one wants to talk to. It was a little bit weird, but also a little bit reaffirming. So I'm going to talk about some of the data today and show where we need to go when we build our programs. First off — first off, in terms of gut checks — in 2016, 2017, 2018, no one really thought there would be a pandemic. The idea that we should do pandemic planning was met with: really? Why would we do that? Well, we should, you know, for
continuity and recovery — okay, okay, I'll humor you. But no one really thought it would happen, and no plans were engaged. At the same time, if you remember, The Walking Dead was popular; there were some things on television people were finding fun. There were some "good" movies out there, like Z Nation — good in air quotes; I mean, it's a popcorn movie, we can debate whether or not that was actually a good movie — but there were some movies about that. And so what my team did was we used the United States CDC's zombie preparedness kit. If you haven't seen it, it's kind of cool: it walks you through what happens if there's a zombie outbreak. And we thought, this is great, I'm going to prepare folks for the pandemic. And we built some plans, we did some tabletops. I'm a big fan of tabletops: get me in a room, get me to tell some stories, give me a good table — preferably with a nice top — and some coffee, and we're going to have a good time. So when the pandemic happens, when 2020 happens, I thought, this is good, this is my chance. I'm going to reach out to some of the folks I worked with and be like, hey, what happened? We did this zombie preparedness kit. And they're going to be like, Wolf, you have no idea, you saved us. And I'm going to feel good about myself, right? I'm going to have some good case studies, some good anecdotes that I could share one day at BSides Cayman. However, I'm sorry to report, that wasn't the case. I reached out to folks: do you remember how you guys did? Remember we did the zombie thing? And they're like, oh yeah, you know, funny, I forgot about that. You forgot? Well, I was busy at the time; people were going home; I didn't know what was going on; we figured it was two weeks, then it was two months, I don't know. And to a person, no one remembered the plan. And I think that
first was humbling, because the kit — I'd spent a few years prior to joining Cisco working on this. But it also really, in my mind, illuminated part of the problem with continuity and recovery: we all have plans, we all have tabletops, and then something happens and it doesn't work. So I'm going to talk through today how to prepare, how to get ready, but I encourage you: please, please, please put this into action in your organizations. Test it, keep it fresh, because if it goes too long, it's gone — even if it's something as exciting as fleeing zombies, it's gone. All right. I'm going to talk about continuity and recovery in general. I'm going to delve into some common mistakes where people go wrong — including, again, forgetting there's a plan. And in part three I'm going to talk about how to use this information strategically: how to use it to do more than just recover, but to position the security team to have good conversations with the executive team, to get sponsorship and buy-in, to build better relationships with our peers. So, part one: continuity and recovery with cloud services. Now, some things of course haven't changed. In the beginning of doing continuity and recovery, if you're just getting into it, if you're wondering what to do, you need to do a business impact analysis: what do I have, and what does it mean? We're going to talk about that in depth in this session.
We need to protect against impacts, not just events. A lot of good planning is built around: what happens if it's a hurricane, what happens if we lose power, what happens if there's an adversary, what happens if there's malware. And then the scenario comes down and plays out, and it's not quite what we thought — it's a little bit to the left, a little bit to the right. When I was in financial services, our CFO used to always ask me, Wolf, what happens if an asteroid falls from the sky and takes out the data center? And I'd say, Peter, I would retire. I'd be done — if I'm not in that data center. But when we actually had an outage, we had planned so hard for a full down — like, everything is down — that when things were partially down, we had corrupted data, and services coming up out of order, and services trying to talk back to the failing services in the data center that was affected. It was really interesting to see how we had over-optimized for that scenario, and it really resulted in trouble getting back up. So it's important to think about the impacts — short term, medium term, long term — of things being down, instead of over-focusing on events. It's also important to focus on a few proven strategies; I'll get to why that is in just a minute. Now, some things have changed.
Two decades ago we wouldn't all get to declare an international holiday when AWS is down. Today, what are we doing? We're playing video games; we're waiting until they get back up, and until the services that rely on AWS get back up, and then our services can come back up. We didn't have that dynamic before. Watson of IBM — not the Watson mainframe, the Watson who led IBM, whom the machine was named after — once famously said there's a world market for maybe five computers. He was saying no one wants a whole bunch of computers, and everyone laughed, and we trot that quote out again and again — ha, look how wrong he was, we have so many more computers. But when you think about it, we're really down to, like, four main computers these days, right? We're on Azure, or AWS, or GCP. We're really down to this tight, tight consolidation, which hadn't happened before. So we have that to contend with. We also have the fact that a lot of assets are outside of our control. In the good old days it was my employee, on my device, sitting in my office, on my network, talking to my data center, on an application that I partially wrote. Of course today that application is probably SaaS; if there is a data center component, that's infrastructure as a service — both outside of our control. Very few people are going into the office anymore; very few people are even on corporate-owned equipment anymore — oftentimes it's BYOD. So we've got a lot of assets out of our control that we need to account for. There are new strategies, which I'll cover. And more and more, it's a team sport: it can't only be IT, or only be security, overseeing this — especially when we have supply chain issues, especially when we have DevOps teams and SRE teams. So some things have changed, some things haven't. When we talk about business impact analysis, it's really where we need to start.
I think this is the most overlooked aspect of blue teaming, the most overlooked aspect of organizational security: what do we actually have, and what does it mean? And what does it mean from a business process perspective? What is our business process, who are our key people, what technology are they using, do they have facility dependencies, and what does that data flow look like? Mapping those things out, and then taking that data flow and mapping it further into the enabling IT — our ability as security professionals to say: any nut and bolt within a data center, any virtual instance within infrastructure as a service, any DNS and IP address pairing in the cloud, on a SaaS application — what does that mean to my business? If I'm in the business of money management, it should be about the funds. If I'm in the business of manufacturing, it should be about the widgets. Can we draw that line? I was telling you about how some things have changed and some things haven't. The genesis of this: I was doing a business continuity and disaster recovery exercise for a headhunting firm out of Chicago. So we're on the top floor of a Chicago high-rise — beautiful view — people are coming in from different parts of the business, and I was letting my consulting team do most of the interviews; I was sitting in on them. And we had positioned ourselves with the window behind us, so that
people would come in, sit down, and have the light in their eyes. Is that a good feeling? No, it's not. It's like one of those mafia intimidation scenes: what do you do, where did you hide the money, what applications are you using? Very, very effective when you're doing a BIA. Also completely by accident — that's just where we sat, and we realized halfway in that this was a bad choice, as everyone was squinting at us and trying to answer the questions. But I kept asking, okay, what happens if the technology goes down? You use this asset, you use that SaaS app — Workday or ServiceNow or whatever it may be — what happens if that app goes down? They're like, well, then we don't work. No, but what's your fallback procedure? In the old days there'd be an operations team. No, we just — we don't work. But what do you do? And finally one person — an accountant — says, Wolf, you need to understand: we're not going to haul out the big piece of metal, put your card down, and pull the slide to run a credit card. We're an international company; that's not how we work. If that enabling IT is down, we are down. Well, that sucks. I mean, it makes their plan easier — if you're down, you're down — but it doesn't allow for much flexibility. Unfortunately, that's where a lot of us are today: there's not a lot of slack in the system. So we have to know what we have, and we have to know what it means. Now, I will say before I jump to the next slide: how much do we need to know? Is it 10 percent? Is it 100? I don't know. For a long time, when you talked to someone in business continuity and disaster recovery — up until today, when you talk to a security professional — they'll say something like the following: we've got really good controls around our crown jewels. And you'd be like, oh yeah? What about
everything else, though? Oh. So, like, just your crown jewels? No, we've got really good controls — just around the crown jewels. It's one of those wink-wink, don't-ask-me-any-more-questions things. And how much is your crown jewels — five percent, ten percent of the environment? Not very much. But maybe that's okay: if we just get those systems up, we're in a good spot; maybe we just recover those systems. When I was at the money management firm, that was exactly the tack I took. We recovered our crown jewels, and we had this backup war room with monitors and everything you would imagine — if you've seen The Wolf of Wall Street, it was that thing. And yes, my name's Wolfgang, there may be a Wolf of Wall Street joke in there; I'll take it. But that's not my point. The point is: imagine that, and all my traders come in — we've simulated that asteroid — all my traders come in, they sit down, they pull up to their desks, and screens start popping and charts start shining and the market starts trending, and we're online and we're wired into New York. And I'm like, yes — I'm feeling even better about this than the zombie story, if I'm telling you the truth. I'm like, yes, we did it. And the head trader turns around and goes, okay, this is great, but where are all my plugins? I go, what do you mean? He goes, well, you don't think I actually, like, click on things to trade? I need all my plugins. I'm like, you never told me about plugins; there was no survey that said you had plugins. He goes, well, I told you I just need everything on my desktop, and it's not all on this desktop. So we missed those plugins; they were not within that ten percent. So we haven't known, for a long time, how much we need to recover. Cyentia looked at this in their study: how much of our systems in scope should we recover? And they plotted it against how likely organizations were to report
successful recovery. And as you can see — this is one of the things that really surprised me — the most likely to maintain business continuity were people covering 80 percent or more of their systems. I would have figured 20, but from 20 through 50 and up to 80 it's almost flat; 80 or more is where we saw the big jump. I don't quite know how to wrap my head around that, because how do we get to 80 percent of those applications if we don't have good visibility into what they are? I think that starts with things like CASB; that starts with having a dynamic asset inventory as opposed to a static asset inventory; that starts with having more conversations with folks to figure out what we're missing. But it is very concerning to me, because there are a bunch of blind spots we miss. I told that story — oh yeah, everyone came down, where are my plugins — to a CISO I was coaching recently, about a year ago. He stopped listening about halfway through. He's like, ah, don't worry about it. Don't worry about it? Yeah, don't worry about it — I already figured out browser plugins; I audit for those. That is good. That is good, but it also misses the point. The point wasn't, let's add one more layer of visibility. The point is: there's always going to be something we don't see. How do we get to the reality of what we don't see? Part of that is asking people, surveying people, assessing their devices, having visibility. But the point is, we need to be humble about what we know — especially if we're trying to get to 80 percent. Okay, next thing: threat scenarios. Threat scenarios, and that asteroid. We need to be very careful not to build specific plans for asteroids and earthquakes and fires and floods — with the caveat that where I am, hurricanes are a real thing; hurricanes we should probably have a very specific plan for. But in general, when organizations are building out their plans, they should look at these threat scenarios and group them by effect. Otherwise we end up with very big plans, conflicting procedures — and the longer a plan is, we know, the harder it is to maintain, and therefore the more likely it is to be out of date. We then look at the different processes and we tier those. If payroll and HR are down, if marketing is down, what does that mean? Probably about the same impact to the business. If production is down, if I can't trade, if I'm in healthcare and I can't deliver patient care — that's a much worse, much larger impact. So we want to tier those processes accordingly, something like the sketch below.
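To make grouping-by-effect and tiering concrete, here's a minimal sketch in Python. Every scenario name, effect, and tier here is hypothetical — a stand-in for whatever your BIA actually surfaces, not anything from the study:

```python
# A minimal sketch: collapse many distinct events into a handful of
# effects we actually plan for, then tier processes by business impact.

# Many events, few effects -- one plan per effect keeps plans short.
SCENARIO_EFFECTS = {
    "hurricane": "facility_unavailable",
    "fire": "facility_unavailable",
    "ransomware": "data_unavailable",
    "storage_corruption": "data_unavailable",
    "cloud_region_outage": "platform_unavailable",
}

# Tier 1 hurts the most; tier 3 can wait. Hypothetical processes.
PROCESS_TIERS = {
    "trading": 1,
    "patient_care": 1,
    "payroll": 2,
    "marketing": 3,
}

def plans_needed():
    """One plan per effect, not per event."""
    return sorted(set(SCENARIO_EFFECTS.values()))

def recovery_order():
    """Bring processes back in tier order, whatever the triggering event."""
    return sorted(PROCESS_TIERS, key=PROCESS_TIERS.get)

print(plans_needed())    # ['data_unavailable', 'facility_unavailable', 'platform_unavailable']
print(recovery_order())  # ['trading', 'patient_care', 'payroll', 'marketing']
```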
So we've got a few different things coming out of that; we need to build a strategy. But first, what's important: the disaster is the impact. Think about resilience. Resilience isn't that nothing bad ever happens; resilience is that we can get punched in the face and keep going. I had a storage array way back in the day — I had a storage array, and we had partnered with this organization because it was super fast, but for whatever reason they had bought a bad batch of drives, and we were losing a couple of drives a day. The drives would fail; we had a big stockpile of spares; we'd pull the bad drive, plug in the new drive, rebuild the RAID array, and go. And this went on for about six weeks — one of the most stressful periods of my life. Every day, losing a couple of hard drives. And a couple of weeks in, you start getting used to it: oh, it's fine, we lost three today. Okay, that's fine. No — you need to understand, the three we lost, if we lose one more in the next 12 hours, we're going to be down. I was really used to two being okay, and three doesn't seem that far from two. Yeah, it shouldn't be, but it hit this disk, this disk, and this disk, and now my RAID's at risk if I lose one more disk. And so you're just sitting through the night going, please don't fail, please don't fail, please don't fail. But we made it through; we actually didn't have an outage. We were very fortunate. Now, that's resilience: we had failures, we had things break again and again and again, but that was not a disaster. The disaster is not the event; the disaster is the impact. So when we think about DR, what we're trying to do is mitigate that impact. We think about DR, and we're trying to say: this is going to break, that's going to break, the other thing's going to break and be unavailable — how do we make sure that the
organization doesn't feel that, financially? We do that with some strategies, and I'm going to cover a few in just a minute. But think about a strategy as something that looks at the various scenarios and maps them to the various recovery tiers — so that if we've got those scenarios happening on the assets in those recovery tiers, we know what to do. Think about a strategy that way. Now, things like the pandemic happen — black swan events. I had an email system fail one time, and it was the stupidest thing: one guy executed the wrong command because he thought he was in test and he was in prod. And after that, they're like, we will never lose email again, and we ended up with the biggest email cluster in, like, the Midwest of North America. It was stupid. But these black swan events happen, and the executive teams want to make sure that it never happens again. What's really important with black swan events — I believe good security is built on the wings of black swan events. We take that black swan event, we abstract a lesson, and we build a strategy on it that would sure enough address the black swan, but also address other bird-shaped problems in our environment. We build strategies that take what they're interested in and make it important: make the interesting important, and make the important interesting. That is the trick of security. We take the concerns the executive team has and turn them into a strategy that can address multiple different scenarios and tiers. That's really the trick to funding this and making the wheels go around on these programs. All right, from a strategy standpoint, there are some myths — old myths. We don't need backup, we've got RAID, we've got the data striped: that works great until there's corrupted data, and now I've got corrupted data everywhere. We don't need backup, we've got replicated data: awesome — now I can take the corrupted data from one data center and send it to another. That's great: if I'm in AWS East and I've
got corruption, AWS West gets it too — sweet. We don't need backups, we've got image-level backups: awesome — try to restore one file off a terabyte VM snapshot. That's terrible. Bonus points if you're trying to do it at the exact same time everyone else in the cloud console is trying to recover something. There are some new myths, too. I don't need backup, I've got infrastructure as a service. I've been in multiple conversations with organizations that say, you know what, it's all right, I've got infrastructure as a service, they're handling everything. You realize it's a shared responsibility model, right? You've got your responsibilities. No, no, it's good, because everything's running, and if I need anything, it's in my S3 bucket or whatever. Okay — and if the bucket is corrupted, or emptied, or any number of things? We don't know. We don't know. Similar with cloud instance snapshots. I've seen, especially with these site failovers: okay, we timed it, we can rebuild and reload our environment from snapshots, it takes 15 minutes. Great. Now AWS goes down, AWS comes up, everyone's trying to bring stuff up — it's no longer 15 minutes, it's now on the order of hours, and they're trying to explain why they had an SLA for 15 minutes. We've got to be aware of these myths; we've got to be aware of these concerns. There are some new strategies evolving, and of course some good strategies remain, like failover connections — how do I get to my environment: Direct Connect, ExpressRoute, those sorts of things. Backup and recovery continues to be a good strategy. Be aware that a lot of open source tools will have backup and recovery in their freemium model but won't encrypt it, or will have very limited backup and recovery — I can back up everything, but I can't back up at the item level. You want to dig into the open source software you're using, figure out where those edges are, and make sure that you've got the ability not only to back up, but also to restore at the item level — and to encrypt that data. Something like the sketch below.
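As a minimal sketch of what item-level, encrypted backup looks like — assuming the third-party Python `cryptography` package, with hypothetical paths, and leaving out everything a real tool needs (key management, path-preserving catalogs, retention, restore testing):

```python
# A minimal sketch of item-level, encrypted backup. NOT production code.
from pathlib import Path
from cryptography.fernet import Fernet

key = Fernet.generate_key()  # in practice: keep this in a KMS, never beside the backups
fernet = Fernet(key)

def backup_items(source_dir: str, dest_dir: str) -> None:
    """Encrypt each file individually, so one item can be restored
    later without pulling back an entire image or snapshot."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for item in Path(source_dir).rglob("*"):
        if item.is_file():
            token = fernet.encrypt(item.read_bytes())
            # Flat namespace keeps the sketch short; real tools preserve paths.
            (dest / (item.name + ".enc")).write_bytes(token)

def restore_item(backup_file: str, target: str) -> None:
    """Item-level restore: decrypt one file, not the whole backup set."""
    Path(target).write_bytes(fernet.decrypt(Path(backup_file).read_bytes()))
```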
I've seen too many incidents — especially talking to red teams, they're like, oh yeah, we just stole all their backups, got all their data; it's perfect, there are no controls that way. Ground-to-cloud still exists. That was one of the first infrastructure-as-a-service moves I did, with Azure: if anything failed in my Hyper-V environment, I spun it up in Azure. Awesome. We're moving away from that now, of course — as more and more workloads are in the cloud, it doesn't really make sense — but those possibilities still exist. Region-to-region is really where it's at: making sure you've got AWS East and West and those sorts of regional recovery options when something fails. Manual processing — is that even a thing anymore? Maybe. What I do like, in the manual-processing vein, is fail-down ability: if my identity provider is not available and my customers can't log in, can they still purchase things? Maybe they can't log in and see their order history, but they can still buy something with their credit card. Those types of gently degraded services, I think, are a good way of looking at things — something like the sketch below.
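Here's a minimal sketch of that fail-down idea, assuming the Python `requests` library; the endpoint and the `process_order` function are hypothetical stand-ins:

```python
# A minimal sketch of gracefully degraded service: if the identity
# provider is unreachable, fall back to guest checkout rather than
# failing the whole purchase. All names here are hypothetical.
import requests

IDP_URL = "https://idp.example.com/authorize"  # hypothetical IdP endpoint

def checkout(cart, credentials=None):
    try:
        resp = requests.post(IDP_URL, json=credentials, timeout=2)
        resp.raise_for_status()
        user = resp.json()["user_id"]
        return process_order(cart, user=user, show_history=True)
    except requests.RequestException:
        # IdP is down: degrade, don't die. No login, no order history,
        # but the customer can still pay by card as a guest.
        return process_order(cart, user=None, show_history=False)

def process_order(cart, user, show_history):
    # Stand-in for the real order pipeline.
    return {"items": cart, "user": user, "history_available": show_history}
```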
Or insurance — insurance is always a thing. Cybersecurity insurance is still a thing; how long it will be a thing, we'll have to see, with all the changes in the market and ransomware hitting it all the time. But of course insurance is always a recovery strategy. An important thing is to adopt as few of these strategies as possible and do them very well. The fewer we have, the better we can make sure that everyone knows them, socializes them, practices them, those sorts of things. There are also a couple of do-nothing strategies I really liked. I was helping a SaaS company get certified through HITRUST, and my team kept asking, how do you back up this, how do you back up that? And they kept saying, we don't. Why would we? We don't — we have an Ansible script. And I was really pushing my team: that's not an answer, you can't just say "we don't," we have to have an answer. So I sat down with the lead DevOps architect: look, you can't just tell me you don't — what are you doing with this? He goes, well, this is a Kafka pipeline; any data in it is immutable; any one of these hosts we can spin up in a heartbeat. If we lose any one of them, it doesn't matter. If we lose all of them, that's probably a short-term problem, but I'm never going to recover from backup. And that was the first time I heard of what Sounil Yu has called the DIE pattern. If you haven't looked at this, I highly recommend it. Distributed makes sense: multiple different nodes. Immutable: nothing on the node itself changes — this is really good for data pipelines, very good for microservices. And Ephemeral: once we're done with it, we can wipe it out and destroy it, and it has no impact on the environment at all. Do nothing, right? I don't have to do anything with that environment; just let it sit, and if I've got to rebuild it with an Ansible script, I can. Cool. I like that. A minimal sketch of the idea follows.
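Here's a minimal sketch of the DIE pattern in practice. The node-management commands are hypothetical placeholders for whatever your orchestration actually provides (Ansible, Terraform, an autoscaling group):

```python
# A minimal sketch of DIE (Distributed, Immutable, Ephemeral):
# nodes are rebuilt from a versioned image, never patched in place,
# so "recovery" is just re-provisioning. Hypothetical names throughout.
import subprocess

IMAGE = "registry.example.com/pipeline-worker:2024-05-01"  # immutable, versioned

def replace_node(node_name: str) -> None:
    """Ephemeral: destroy the node and stand up a fresh one from the image.
    Nothing on the host itself is worth backing up."""
    subprocess.run(["terminate-node", node_name], check=True)              # hypothetical CLI
    subprocess.run(["launch-node", node_name, "--image", IMAGE], check=True)  # hypothetical CLI

# Distributed: losing any one worker is routine, not a disaster.
for node in ["worker-1", "worker-2", "worker-3"]:
    replace_node(node)
```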
I think moving forward we'll probably see a layer of the DIE pattern around most of our infrastructures, because most of them are still going to have data and things we need to maintain. We still have DevOps; people are doing a lot of in-house configuration, getting on the systems and wiring them up and loading config — all that sort of stuff we don't want them to do, but they're still doing. That's one approach. Another approach, of course, is the CI/CD pipeline approach: if I lose it, that sucks, but I can run a script and rebuild it. Awesome approach. There are scaling issues if everyone's trying to run it at once. There's also an issue I ran into — the poor guy. He was using this pattern, and he was using Jenkins, and they lost some systems. He's like, no problem, I know exactly what to do: I will now run my Jenkins job to rebuild it on another cloud. Really quite clever — multi-cloud DR. Awesome. He went to run Jenkins, and Jenkins had the same dependency his infrastructure did, and was down. He had no way to actually run it and restore. So the key thing there: when we're using automation, we need to remember that automation means everything. Include it in tier zero, and make sure it's not in the same fault path as your underlying infrastructure — something like the check sketched below.
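One way to sanity-check that tier-zero lesson is to grid it out in code. A minimal sketch, with a made-up component-to-fault-domain mapping:

```python
# A minimal sketch: verify that recovery automation does not share a
# fault domain with the things it is meant to recover. Hypothetical data.
FAULT_DOMAINS = {
    "jenkins": {"aws-us-east-1"},
    "app-fleet": {"aws-us-east-1"},
    "dns": {"aws-us-east-1", "aws-us-west-2"},
}

RECOVERS = {"jenkins": ["app-fleet", "dns"]}  # tool -> what it rebuilds

def shared_fault_paths():
    """Flag any (tool, target) pair that can go down together."""
    findings = []
    for tool, targets in RECOVERS.items():
        for target in targets:
            overlap = FAULT_DOMAINS[tool] & FAULT_DOMAINS[target]
            if overlap:
                findings.append((tool, target, overlap))
    return findings

# Both targets share aws-us-east-1 with Jenkins: if that region dies,
# so does the tool meant to rebuild them.
print(shared_fault_paths())
```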
That really covers disaster recovery. It's bottom-up: we want to make sure we don't get lost in the weeds, we want to make sure we know what our technology is, we've got a few key proven strategies to recover that technology, and we know what it means to the business so we can tier it accordingly. Next, I want to talk about some mistakes people make in business continuity. When I was doing consulting, my team read 176 different continuity plans — which I will now summarize: the mistakes of 176 continuity and recovery plans. Also, I would encourage you to never read 176 continuity plans; every one of the team members on that team has since left that company. I kind of blame myself. But here were the top three failures we saw again and again and again. Wrong person. This one did not surprise me. When I was at the money management firm, business continuity and disaster recovery was owned by this lovely lady who was a project manager in the operations team, and she had no visibility into the technology, no understanding of the technology; no visibility into the compliance requirements, no understanding of those requirements; no visibility into the executive team, no sponsorship at that level. She was just really good at being a PM and owning this document. Not good. Not good. I oftentimes see this on the IT side, too. I was talking to Preston, who put out the book on data recovery in the cloud — really brilliant guy, highly recommend it, out through O'Reilly. He's like, you know what I find? Oftentimes backup and recovery is pushed to the junior IT guy: congratulations, you're fresh out of college, you get the backups. It's treated as a junior task. And I was like, you know, it's funny, because in business continuity we do the exact same thing: congratulations, you're out of college, we need you to update these reports and these forms and this plan. We push it down to very junior people, and then we wonder why it doesn't scale, why it's not appropriate, why it's not accurate. So the top failure is having BC and DR owned by the wrong
person. The next one is having the wrong technology. So I say that I am protecting my databases — I've got SQL in Azure, I'm protecting my Azure SQL — because I'm using a bunch of Palo Altos. Wait a minute, what? Yeah, our Palo Altos; we've got load balancers. Okay, that's great — the resiliency, you know, props — but how does that protect your database? Well, our application won't go down, and people can get to it. Okay, but what about your database? Wrong technology. We need to make sure we've got the right technology supporting those systems. And the last one is the wrong drills. The lovely lady I mentioned, the project manager — she really was great, did an awesome job with what she was given, every time I worked with her — her drill, the drill that was reported up to the SEC for business continuity before I took over the plan, was the fire drill. The fire drill. Yeah, we do a twice-a-year fire drill. The fire drill is how you're saying we've got business continuity and disaster recovery? Yeah, because I do it: I've got people, I've got floor captains. That is awesome. Doesn't tell me a thing about IT, but cool. So: wrong person, wrong technology, wrong drills. Who's the right person? We did not have good data on this for years —
again, a lot of gut check; no one really knew. This data came from that Cyentia study I mentioned earlier, and they found a drastic increase in organizations' ability to get back up and operational if there was board-level oversight of the program, C-suite ownership, and CISO involvement. So we want to make sure the board knows what we're doing to keep the technology up; we want the C-suite to have engagement and sponsorship and to support what we're doing, especially if we have to take time away to do drills and interviews and those sorts of things; and we want to make sure the CISO is ultimately the person accountable for it. That's what the data shows. All right, other ways we saw these plans mucked up when we did this massive study. I already mentioned shared responsibility: it's outsourced, I don't worry about it. Okay — shared responsibility is a thing; we've got to make sure it's not that abstracted. Non-standard language — again, very common if we have junior people doing this: mixing up recovery time objective and recovery point objective, not really understanding what RPO means. Any sort of non-standard language I see in a plan is like a code smell, if you're a developer. I'm like, all right, this doesn't smell right; there's probably something rotten in this plan. Non-standard approach: I'm not saying we need to follow NIST or ISO 22301 — those are very heavy frameworks meant for international companies, for very large-scale companies — but they do have good life cycles, they do have good processes. It should be a similar approach, and if we're just making something up that doesn't align with a standard, again, it doesn't smell right. Using HA as opposed to recovery: I love me some high availability, but that's not going to get our data back; that's not going to keep our systems running. Having overly broad plans — my favorite was: something bad happened; address the bad thing; bring systems back
online. That's a great plan — three bullet points. Something bad happened, fix the bad thing, bring things online. Okay, I can work with that. Can you give me an example? They couldn't, of course. Well, we'll figure that out on our own. I don't think you've really tested those strategies, then; I don't think you know what you're going to do, if it's that vague. The worst is the opposite: very, very specific plans. Bad thing happens, therefore Joe executes this command on this virtual machine at this time, and then Sally runs this command in the AWS management console, and then Jane is going to — wait a minute. That's very, very specific. When was the last time this was maintained? Is it up to date, is it accurate? What happens if Jane's not available? What happens if the management console is unavailable? How did that virtual machine even get there — do we account for it being restored? That's the black swan effect I mentioned: if we get really focused on just addressing that one black swan instead of looking at bird-like threats, we can fall into this trap. And then finally, handled in isolation. If we've got different asset management programs and categorizations, if we have different risk rankings — our risk function has tiered our applications differently than our BCDR function, or our BIA function has a different tiering than our SOC function — that handled-in-isolation approach really indicates we're probably not recovering 80 percent of the environment, because there probably aren't good tie-ins with the asset folks; we're probably not doing good risk ranking, because there aren't good tie-ins there either. And if we've got to pull a whole bunch of teams together really fast to figure these things out, we're going to be talking different languages, looking at things in different ways, and that's going to slow down recovery. So, all that to say: a few good strategies, the right person, board-level oversight, the right technology depending on the scenario, and then having a good
plan, a good test — and testing that plan again and again and again. Again: how often do we test these plans? Is it biannually we do a drill? Annually? Quarterly? What's the good amount? I used to really like tabletops every month; then once a quarter we would do an exercise for a different strategy, and two of those four we would select as evidence for the SEC. That's how I did it — but again, that's "hey, what do you do? How do you do it? Okay, I do it this way." We really didn't have good data. Cyentia asked the question: how many events are you doing on a monthly basis for your plans? And that gave us some insight. One of the things that immediately jumped out at me: I figured one or two would be good. It's actually five. Five is the magic number: five events per month makes an organization 2.5 times more likely to maintain resiliency. That's a lot. That is a lot. I don't want to be pulling the plug a couple of times a week — that doesn't make sense. I don't want to be doing full tabletops a couple of times a week — that probably doesn't make sense either. When I started peeling behind this data and talking to organizations that had five or ten events a month, what I found was something different: they were aligning those events and embedding them in the business, aligning them in a way that pushes information up. So think about Bloom's taxonomy. We don't need everyone in the business to be creating and evaluating our BCDR plans; we probably just need them to remember we have one — see also my zombie problem. We probably need a few people to remember and understand who to call when something's on fire or not working. So we can start lining those up to different activities, and here are some of the ways. Look for the process owner — the business process owner:
finance, accounting — their team meetings are a perfect opportunity for us to embed. Jump in once every couple of months, or once every couple of quarters, and say: oh, by the way, remember we're the BCDR function, supported by the board — the board is interested in this. We just want you to know, here's what the plan is at a high level; when this happens, here's how you'd contact us. That can be something the DR team is doing monthly, but if you've got 12 processes, that's once a year that you're making the full cycle. All you want those folks to do is remember and understand that there's a plan. For the system owners — the AWS owner, the DevOps team, whoever owns my ERP system, whoever owns my HRMS — what we're doing again is looking at the team meetings or standing all-hands where we can get engaged. We're reviewing the plan, making sure they're aware of it. We're also engaging them — probably quarterly, maybe a little less or more often depending on the criticality of that system — in BCDR meetings, doing tabletop exercises, and engaging them in full exercises. This is one of the main things I learned early on at the money management firm. We had a great way of recovering the technology: yeah, the technology is up, we've done it, we've restored what we have to restore, replicated what we have to replicate, restarted services, yes I can ping it, everything's good. And then the whole rest of the business is like: what? What does that mean? Ah, yes — we actually have to have operations come in and certify this and turn it over. Okay, fine: we need a few more steps, a few more people involved. This is where system owners are engaged — and thinking back to Bloom's taxonomy, this is where we're looking at apply and analyze. For the BCDR owners — CISO on down, whoever the function is that owns this — these are the folks who are
going to have specific BCDR meetings, who are going to be doing after-action reports, planning the updates, distributing that information. These are the folks who are going to be evaluating the strategies, adjusting the strategies, and really driving things forward — leading those tabletops, capturing information. This is how I would recommend getting to that magic number of five per month. I do want to say one thing about after-actions. They're incredibly important. When do you do an after-action? Well, after, obviously. Yes, okay, sure — but how soon after is really important. I don't have good data on this; this is my gut, your mileage may vary. What I've noticed is, if I do the after-action right as we've recovered — the event happened, we've recovered, I've asked people to take notes, now tell me what to fix — I get highly specific, highly tactical answers: this command didn't run, this person was unavailable, this virtual machine's name has changed, the system owners didn't tell us they're using these other parts of the application. If, on the other hand, I wait two days — and this is another pro tip: you should get approval from HR that when we've had an incident response or a business continuity exercise, I've got free PTO I can just hand out to my people. You'll burn yourself out otherwise. Go back to your families, spend 48 hours, do not look at a keyboard, do not pick up your phone, do not touch a piece of technology. Do whatever you need to do: sit down and watch cartoons, open your favorite bottle of bourbon, sit by the beach — whatever it is, just chill. Get HR to approve that and make it part of the plan, so when it happens we can give our people a break. Then I bring those people back after two days — that's it, don't look at it, don't think about it — open up your notes, do an after-action. I find when I do that, the after-action is much more strategic: you know what, we missed some things in the command line, so maybe we should write a script that automates that; you know what, there were some things we did not have included in assets, so I want more of a touch point — when I do the all-hands meetings with the process owner, I'm going to ask some questions about what's changed in their environment. It tends to be much more strategic after the event, after we've gotten out of firefighting. Your mileage may vary; I don't have good data on that, it's just what I find.
In terms of progress, there are a couple of key metrics. Business impact coverage: of all those different processes in the BIA, how many are covered by our recovery strategies? Scenario coverage: of those scenarios, how many have been covered by recent exercises — how many have we tested, verified, cut our teeth on, walked through tactically? And the percentage of controls that have been covered. A minimal sketch of the coverage arithmetic follows.
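The arithmetic behind those coverage metrics is simple; here's a minimal sketch with made-up inputs:

```python
# A minimal sketch of two coverage metrics: what share of BIA processes
# have a recovery strategy, and what share of scenario effects have
# been exercised recently. All inputs here are hypothetical.
bia_processes = {"trading", "payroll", "patient_care", "marketing"}
covered_by_strategy = {"trading", "payroll", "patient_care"}

scenario_effects = {"facility_unavailable", "data_unavailable", "platform_unavailable"}
exercised_recently = {"data_unavailable"}

def pct(part, whole):
    """Percentage of `whole` that appears in `part`."""
    return 100.0 * len(part & whole) / len(whole)

print(f"Business impact coverage: {pct(covered_by_strategy, bia_processes):.0f}%")      # 75%
print(f"Scenario exercise coverage: {pct(exercised_recently, scenario_effects):.0f}%")  # 33%
```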
Then we get to things like RTO and RPO. Now you may wonder: hey, Wolf, why don't you talk about RTO and RPO first? RTO and RPO are what we use to talk about recovery time and recovery point; SLAs are how we cover how many nines we're measuring to — which, incidentally, was what drove those 176 plans: we wanted to make sure everyone in that organization was meeting the SLAs and the metrics of the organization. Really fun project for me, anyway — maybe not for the people doing it. But I don't believe we can meet those numbers, I don't believe we can meet those SLAs, if we don't first have BIA coverage, scenario coverage, and tactical coverage. I go right back to that data I showed in the beginning: 80 percent of systems covered provided the highest level of assurance. I think we need to start first with coverage, and then make sure we're able to hit those numbers.
And then, once we have that, we can start embracing chaos. If you've been following along with chaos engineering, it's been picking up steam again — there's a great book from O'Reilly about chaos engineering. This comes from the fine folks at Netflix, back when Netflix had some great content — sorry if there's anyone from Netflix in the audience. Netflix said: hey, we're going to release a Chaos Monkey; it's going to wander around, and the script is going to unplug things and restart services and make bad things happen. Why? Because then, if there's an outage, if there's a bad bug, we're already familiar, as a DevOps team, as a function, with how to respond and recover from it. A toy version of the idea is sketched below.
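In that spirit — and to be clear, this is a toy in the style of Chaos Monkey, not Netflix's actual tool — a minimal sketch that restarts a random service in a test environment, with hypothetical service names:

```python
# A toy chaos injector: pick a random service in a *test* environment
# and restart it, so the team practices detection and recovery on
# their own schedule. Hypothetical names; never point this at prod.
import random
import subprocess

TEST_SERVICES = ["orders-api", "search", "checkout"]

def unleash_monkey() -> str:
    victim = random.choice(TEST_SERVICES)
    # Restart via systemd on a test host; a real setup would target
    # containers or instances and honor an explicit opt-in allowlist.
    subprocess.run(["systemctl", "restart", victim], check=True)
    return victim

if __name__ == "__main__":
    print(f"Chaos injected into: {unleash_monkey()}")
```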
We're starting to see that same methodology, that same thinking, come over to security: how can we use chaos engineering to inject some problems into the environment, and then make sure we can recover? When Cyentia looked at this, they found a big jump for people who said chaos engineering was part of their standard process. I put this last because those same folks were already reporting a very mature BCDR program. What I'm very concerned about is: we push this down to the junior IT guy, we push this down to the junior security guy, and the junior IT guy and the junior security guy are like, you know what's cool? Chaos. And then they adopt chaos engineering and bypass everything else we've talked about, and now we've got things going up and down, because they saw a number that said it made organizations twice as likely to be highly resilient. That is going to happen — you know it's going to happen in some org. So I want to put this out there: that is the direction people are going in, and it's a really powerful direction, but we've got to get the fundamentals first. We've got to have strong BIAs — a good pile of bananas — before we turn the monkey loose. Okay, so that's business continuity: top-down, and don't forget to connect the dots. Let's talk a little bit about using this
program strategically. For years and years and years, business continuity and disaster recovery — I've liked it. It's fun: we have a drill, we have a clock, we put that clock up there, we count down to the SLA, people are running, it's exciting. But a lot of people didn't really think about it, oftentimes ignoring the capability — until we did the study with Cyentia and asked what actually works in security. I honestly had started forgetting about BCDR myself. I was very surprised when they came back and said: look, when we look at this, there are five practices tied to everything we want to do, from enabling the business to managing the risk — five practices that correlate with driving outcomes — and one of them is continuity. Now, when they mapped it out, they came up with this great chart. You can explore this data; it's a bit of an eye chart. The higher the number, the more likely organizations were to do these various activities well. I went and crunched this out and peeled off just resilience: of the five practices most related to driving security outcomes, two are resilience-related — incident response and disaster recovery. I peeled those two out and looked at what organizations were able to do better if they reported they were more resilient. First thing: they're more likely to gain
the confidence of the executive leadership, and they're more likely to have peer support. That makes sense, right? If you're talking, like so many technologists do, about this instance or that instance, this service or that service of the Kafka pipeline, executive leadership glazes over: okay, sounds like you've got that under control, I'll see you later — and they're already out the door. If instead we're talking about technology in terms of what it means to the business — every single time I was in healthcare, I tied it back to protecting the patients; every single time I was at the money management firm, I tied it back to funds and reputation; I've got a good friend in trucking, and as a CISO, every time he talks about security, he talks about the trucks. The way we get there is by tying that enabling technology up through data flows, up to business processes. I think that's what's driving it. Organizations with resiliency are more likely to keep up with the business and to manage top risks — same thing: we've already thought about some of the impacts, and whether it's a new version of ransomware, or a worm, or something funky in DNS, if it takes the systems down and we've got to manage that downtime, it's a comparable impact. So I think that's why we're better able to manage risks. Reducing unplanned work and wasted effort is pretty much exactly what we're trying to do; same thing with losses. More likely to be running cost-effectively — this one is intriguing to me. I was working with the North American headquarters of an international company. They were doing a BIA, and in doing it they found a large number of redundant applications — a significant number. I think the amount of money they were able to save by consolidation was in the millions. So obviously that money went to the security department, right? That's the way it should work? No, it did not. I think they got a gold star on the review — you know how it goes. But you're more likely to be able to run cost-effectively if you know where those redundancies are and you can work with the teams to rationalize them. So first, get continuity and recovery right, and then use the program strategically — use the program to achieve these types of outcomes. We know statistically, from a data perspective, that these outcomes are possible and very likely. So get the program running and then make that happen. All right, final thoughts. Continuity and recovery: recovery we do bottom-up — don't get lost in the weeds. Continuity goes top-down — make sure we're tying those threads and that things come together. Common mistakes: get the right person, with the right strategies, and do the right drills.
And then strategically — I just mentioned all the different things these programs can do. So: elevators. When you leave here, when you go up to your room or back to wherever you came from, the next time you're in an elevator, check out what elevator manufacturer it is. It might be Otis. I've got a bad habit of doing this: I took my wife, for Mother's Day, to the San Diego Zoo, and we had a private tour, and the tour guide ended up telling me all about the elevators while my wife walked behind me, just shaking her head. I feel really bad for her. But I did find there's some really good stuff about the elevators in the zoo, in case you want to know. Why Otis? Otis is interesting because a hundred years ago — maybe a hundred and fifty — Elisha Otis, the namesake, was creating elevators. He'd go up on a platform four to six stories high, they would cut the rope, he would fall, the elevator would miraculously stop, and the crowd would cheer. Again, before Netflix there wasn't a lot of entertainment; this is what they had, and everyone loved it. And what he had invented was not the elevator. Otis did not invent the elevator — he invented the elevator brake. He invented the resiliency that gave people the confidence to use the elevator, and that enabled the last major migration: people moved from country to city, and from the ground floor to the top floor, because they could trust those elevators. The last major migration, about a hundred years ago. Of course, we're in the middle of a new migration — not one that's physical, but one that's digital, one where we transact most of our work online, most of our shopping online, our food online, everything online. They asked people in the Valley recently: hey, where are you going to build your next startup? Before the pandemic, the answer was San Francisco; after the pandemic, online. We're in the middle of the next big migration. What that means is resiliency: the resiliency we build today, the elevator brakes we develop today — all of us in security, all of us looking at cybersecurity and
infosec — that resiliency is effectively what's going to make tomorrow's wonders possible. With that, thank you very much. I believe I'm at time. I don't know — are we taking questions or not? I can't see anything. Sure. No, it's trademarked — you have to pay me two bitcoin. Yeah, of course, of course, absolutely, one hundred percent. All right, well, I hope you guys have a great time. Oh, I really can't see, so you've got to yell it out. [Inaudible audience question about BCDR across cloud and on-prem.] Yes, sir.
Yeah, cloud and on-prem. So, I've seen ground-to-cloud, where if the on-premises environment is impacted, there is a process to restore key services in cloud containers — that is one strategy. The traditional hot site, or multiple sites, still works for on-prem; if we can move that over and we still have direct peering with the cloud, that will work. So what it effectively becomes is the traditional strategies for on-prem, ensuring they also move with the cloud. Where we get into trouble: is the cloud down? We can probably handle that. Is on-prem down? We can probably handle that. Are the cloud and on-prem both down? That's where we end up in scenarios where I've had to move from East to West and now I don't have direct peering back to my DR data center. So if you grid that out and look at the different fault paths, you can follow each path and make sure you've got the right recovery in place.
All right, with that I will get off stage. Thank you so much. [Applause] Thank you.