
PW - Zero downtime credential rotation

BSides Las Vegas · 41:59 · Published 2024-09
About this talk
PasswordsCon — Tue, Aug 6, 17:00–17:45 CDT

Credentials are one of the most vulnerable components of any software system, and yet they're notoriously difficult to change. More specifically, developers are often loath to change credentials for two reasons: either they don't know how to do it safely, or they know that doing it safely means rebooting the entire system, which causes expensive downtime. Fortunately, things need not be this way! By applying a few basic strategies, any complex codebase can be designed to handle credential rotation with no redeployments and practically zero downtime. Additionally, even just going through the exercise can teach valuable lessons about system failure points and design weaknesses, which can better inform incident response.

People: Kenton McDonough
Transcript [en]

My name is Kenton; I go by Kent. I work for Viasat. You probably saw us in the news a couple years ago when we got hit real hard in Ukraine; there was a talk at DEF CON last year from our CISO, but I'm talking about something else, unrelated, today: zero downtime credential rotation, designs and lessons learned. I graduated from Virginia Tech in 2021, so I'm fairly new to the industry, go easy on me. I had a paper about container resource fuzzing that I thought was pretty cool, but now I'm a security automation engineer at Viasat, and I like to say I get paid to break things before somebody else does.

So one of my first big projects was on credential rotation and credential management for one of our newer systems. Let me run you through a few scenarios you might want to think about. You recently fired a disgruntled employee who had access to your secrets manager, and you suspect they may have copied some things before they left. Maybe a developer laptop with a bunch of plaintext credentials on it was infected with malware, and you think the disk may have gotten owned. Your SOC has detected one of your service accounts being used from an unusual IP in a country you don't operate in: impossible travel, you know, Google was just talking about that. Or maybe you

find out that a third party with access to your infrastructure was breached, and you don't know what credentials they had access to, because why would you? You trusted them, right? In all of these scenarios, once you say "oh crap," your next move should be: let's rotate some credentials. So maybe we go ask our DevOps teams, "hey, I need to change all the credentials for your backend system," and see what they say. You might get a response like "I can't change that without a system reboot, so I'll schedule it for the next maintenance window." Downtime is possible; that's why the team is doing it in a maintenance window. Maybe you'll hear,

"I can try to change that, but we might break everything. Is the business okay with that?" Those magic words, "the business": that means downtime is likely; this app isn't going to handle it. And a third response you might get: "why do we need to do that? You're going to tank my SLAs for this month, leave me alone." Downtime is guaranteed; this app is held together by twigs and chewing gum. So here's the unifying sentiment behind all of this: credential changes are complex, doing complex things introduces risk of downtime, which means loss of money, and the best way to reduce risk is to not do things, I guess. So we should

never rotate anything, ever, once we've created it? No, there's an obvious answer here: we should be practicing. Why don't we just practice? So here's my big idea. Credentials are going to leak; it's a question of when, not if. Didn't we just have the biggest password dump in history recently? It keeps getting bigger. Since credentials will leak, you should be able to change them quickly and painlessly, and since you need to be able to change them at any time, you may as well practice frequently. How frequently is up to you: monthly, weekly, daily, hourly, whatever you want. But the point is you should be able to do it, and

the act of practicing will teach you a lot about your system and show you how to improve resiliency. It will teach you things you didn't even know could go wrong, because you're destroying some fundamental assumptions. Anyone who's familiar with the chaos engineering principle: this falls firmly into that bucket. The best thing you can do is hire a monkey to just poke things and see what falls apart. Okay, so I'm going to run through some terminology now about credential management. Again, I'm not talking about humans here; this is all backend systems: service credentials, service users, certificates, the stuff that holds your system together. You probably have a secret manager, and if you don't, you should. A secret manager

is a place to store secrets. Here are four very common ones. HashiCorp Vault is pretty good. There's a company called Thycotic, now known as Delinea, with a product called Secret Server that you might have used. AWS has one built in, if you have a big presence there. Even GitHub has one now for your CI pipelines. These are everywhere. They offer encrypted storage, they offer logs to tell you who's been accessing your secrets, what and where, and hopefully some controls, granular access, to prevent them from getting out where they're not supposed to be. What are you keeping in your secrets manager? There are two main types of credentials. A symmetric credential is a credential

that must be known by two parties to validate, typically a client talking to a server. Examples are a password, an API key, and in some cases symmetric key material, although, disclaimer, I'm not going to talk about that, because symmetric key material is its own world of management and complexity, especially in how you transfer it across the network; that's a special case. But you might be keeping it in your secrets manager. With symmetric credentials, it's just a string; there's typically no lifetime built in, so a client that has one doesn't know how long it lasts. As far as the client is concerned, it could last forever, or it could

change 5 minutes from now. The other type of credential is what we call a derived credential, which is typically "short-lived" in quotes (you could make something last a year if you wanted). These are derived from a symmetric credential: think RFC 7519 JSON Web Tokens, OAuth bearer tokens, X.509 certificates. Clients are aware of the credential lifetime. Typically you don't store these in a secret manager; instead, you store the symmetric thing in your secret manager that is used to derive one of these. You could pass them around in a secret manager, but it's a bad idea, because they have times; they expire. The important thing about

these is that a client that has one knows how long it lasts, if it cares. You can crack open a JWT and check the expiration time, same with an OAuth token, or read your certificate; it's in plain text. (That is not the right slide, I skipped ahead, so we're going back to where we're supposed to be.) All right, the last thing you need if you're managing credentials is a secret rotator of some kind. This is whatever you use to rotate your secrets; most likely it's a collection of scripts or API libraries or something that changes the credentials and puts the new one in the

secret manager when it's done. Some secret managers have APIs that let you do this: some of the Vault backends will let you call an API to rotate the credential for you and push the result into LDAP or something. Managed cloud services sometimes just do this for you transparently and it all just works, like the AWS ACM certificates you can put on a load balancer: they rotate them for you and you never notice. The design of a rotator varies depending on what kinds of secrets you have and isn't covered here; I think there's a talk tomorrow where we might cover some of this. But the important thing is that you have one

and it works. Okay, so now that you have all of this, here are three algorithms for zero downtime credential rotation. If you take nothing else away, these are very simple things you can put into your apps to make this work. Before we actually talk about the algorithms, I want to talk about the big picture of how they work. For credentials, you have either a push model or a pull model. In a push model, when you deploy your app, your deploy pipeline, whatever it is, reads secrets from your secret manager and holds them in memory; the secrets get injected into

your environment somehow. They go into environment variables, configuration files, or some ephemeral storage in Kubernetes (technically that's memory). They go somewhere your app can read them, and they're pretty much static: they're injected into the environment and they don't change after the app reads them once on startup, and if you want to change them, you have to reboot the whole thing. Contrast this with a pull model, where the app environment is provisioned with a mechanism for interacting with the secret manager directly. Your app either has a library it can use to talk to the secret manager on startup,

or maybe it's a proxy server in your Kubernetes deployment that has a mechanism for reaching back to your Vault or something. The point here is that your app re-reads the secrets as necessary during operation. Instead of giving it the credentials to start, you tell it where it can read the credentials, and it knows how to read them whenever it needs to. So, big picture: in a push model, if you want to change creds, you have to redeploy the thing; in a pull model, your app can recover from cred changes automatically, although there are caveats, and a lot of corner cases you can get into. Okay, so here's strategy number one.
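The pull model described above can be sketched in a few lines. This is a minimal illustration, not any particular vendor's client: `fetch` stands in for whatever hypothetical call reads the current value from your secret manager.

```python
import time
from typing import Callable, Optional

class PulledSecret:
    """Pull model: the app holds a handle to the secret manager and
    re-reads the credential on demand, instead of reading it once from
    an environment variable at deploy time."""

    def __init__(self, fetch: Callable[[], str], cache_ttl: float = 60.0):
        self._fetch = fetch          # hypothetical secret-manager read call
        self._cache_ttl = cache_ttl  # avoid hammering the manager on every use
        self._value: Optional[str] = None
        self._fetched_at = 0.0

    def get(self, force_refresh: bool = False) -> str:
        stale = time.monotonic() - self._fetched_at > self._cache_ttl
        if force_refresh or self._value is None or stale:
            self._value = self._fetch()      # round trip to the secret manager
            self._fetched_at = time.monotonic()
        return self._value
```

The `force_refresh` flag is what the fail-and-retry strategy later in the talk hangs off of: on an auth failure, the app calls `get(force_refresh=True)` and picks up the rotated credential without a redeploy.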

Most people are probably familiar with some form of this; it's what we call a blue/green deployment. Over on the left is the secret manager, where we have two credentials, which I'm calling blue one and green one. These are for all purposes identical: they're separate sets of credentials, but they're valid for the same thing and act the same way; we're just going to rotate them at different times. On the right I have some replicas of an application; it's not really important what they are, but the blue one credential is used in three separate places for this app that's running behind a load balancer.
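A sketch of that blue/green rollout, with entirely hypothetical replica and secret-manager APIs, might look like this: push the standby credential to one replica at a time, verify health after each step, and only rotate the old credential once no replica uses it.

```python
def blue_green_rotate(replicas, secret_manager, standby_name, retired_name):
    """Roll the standby ('green') credential out replica by replica,
    then rotate the now-unused ('blue') credential. All object APIs
    here are illustrative stand-ins, not a real library."""
    new_cred = secret_manager.read(standby_name)
    for replica in replicas:
        old_cred = replica.current_credential
        replica.deploy_credential(new_cred)
        if not replica.healthy():
            # Built-in rollback: only this one replica saw the new credential.
            replica.deploy_credential(old_cred)
            raise RuntimeError(f"replica failed with {standby_name}; rolled back")
    # Every replica now uses green, so blue is unused and safe to rotate.
    secret_manager.rotate(retired_name)
```

Note the con the talk mentions shows up right in the loop: if `replicas` is incomplete because a node was unreachable, the final `rotate` call is unsafe, and the code has no good way to know it's done.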

Right, so when I decide that I want to rotate blue, which is in use, I start by pushing out green, and I can do this bit by bit. Since they're both valid, I push out green to the first replica and make sure nothing breaks. That tells me green works and I haven't made some fatal assumption that will break my whole app. If it didn't work, I've still got two replicas: I take my one replica offline, figure out what I did wrong, and nothing bad happens. So we've already built in a rollback scenario. I push out green again; I don't have to do this

immediately, I could wait, but why wait? I push it out, then push it out a third time. Blue one is now gone: nobody is using it, because all three of my replicas are updated, so now I can change it to blue two. Pros: it's easy to do, it's all segregated, you can do it in pieces, and you have a built-in rollback mechanism. The one con here is that it's hard to define when you're done if you don't know how many replicas you have. Or say one of my replicas was unreachable because the network was down or I was having a problem somewhere: I can't

rotate blue if I've only got two out of three, unless I'm going to kill that replica and decide everything's okay. It puts you in a funny state of can-I-finish-now, can-I-not. Maybe that credential is being abused right now and I need to get rid of it, but doing so takes down part of my system. It can also be slow: if you can't do them all at once, you might be there for a while. So that's strategy number one. Strategy number two is fail and retry. I have a slightly different picture here: on the left I have a secret manager where I still have

blue one. I've got my three app replicas that are all using blue one, and I have some backend service that also has blue one; this is like an LDAP database or something, it doesn't really matter. In this scenario, I change blue one to blue two, and I do it in both places: my backend is updated and my secret manager is updated, so those are synchronized. The thing that's out of sync now is the apps. One of them is going to try to use blue one and fail, because I've gone to blue two and blue one is no longer valid. So what's the app going to do? Okay, well,

it failed, so it goes back to the secret manager, retrieves blue two, tries it, and finds out it works. The pro here is that I didn't need to do anything to my apps: they all knew how to read the secrets, they know how to interpret a failure case, they know the credential rolled over; they just have to go get it and try again. They can also do this asynchronously: if they don't talk to this backend very often, I don't need to force-synchronize all three replicas; they'll recover when they get around to it, and they have to recover in order to keep providing service. Now, an astute viewer may note

that I said zero downtime, and technically there's a failed network connection in here. Do we really count that as downtime? Well, the network could go down for unrelated reasons, or it could be buffering, or you could have a packet drop. So I don't really consider this downtime; it's almost an adverse network condition, and I just needed to refresh my credentials. My underlying argument is that this should be a normal operation you can recover from, just the same as if your network blipped, or you had a route switch, or you lost connectivity for a moment; you just come back up. So in that case it's not

really downtime. Another caveat is that your application code has to be more complex, because you have to teach it how to read from the secret manager; it can't just read a file and be done. And if you're using commercial off-the-shelf code (more on that later), support varies and it's not always pretty. Anyway, that's strategy two. On to strategy three. I probably should have mentioned that strategies one and two are really only for symmetric credentials, that's mostly where you'd use them; strategy three is hands down the only thing you should ever be doing for a derived credential, where you know how long it lasts. Same setup: I've got my secret

manager with blue one, I've got my app replicas with blue one, but this time my backend doesn't have blue one in it, because we're using a derived credential. Each one of my apps derives blue one prime (I know the naming convention sucks, but it gets the point across). When I present blue one prime to the backend, it's accepted, and I can do this for as long as I want. Blue one prime has an expiration time in it, because it's derived. So we're going to wait for 75% of the lifetime; that's probably about fair. For 12 hours that'd be 9 hours; for an hour that'd be 45 minutes,

whatever. So we set a timer and we wake up with 25% left. At this point I've already switched over to blue two, asynchronously, so my secret manager is updated. The app, when it's time to refresh, goes to the secret manager first before it does anything else and re-pulls the cred, so now it picks up blue two. I can derive blue two prime while I still have blue one prime, and blue one prime is still valid; I've still got 25% of its life left. Blue one still works, I haven't done anything to it, so I can now try blue two

prime and find out that also works. Great: I just reset my lifetime with a totally new credential. Now I can throw away blue one, I've replaced it, and I can throw away the old derived credential. Done. There's really no reason not to do this; there should never be a problem. If your secret manager had an issue and you couldn't generate blue two prime, you've got whatever 25% of the lifetime is to figure it out, and if you can't figure it out in that time, you have other problems to worry about, because something is really out of whack. All right, those are three pretty basic algorithms, at least we think so.
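Strategies two and three both reduce to a few lines each. Here is a sketch under the same caveat as before: `do_call` and `read_secret` are hypothetical stand-ins for your API call and your secret-manager read, not a real library.

```python
class AuthError(Exception):
    """Raised by the backend when a credential is rejected."""

def call_with_refresh(do_call, read_secret):
    """Strategy 2, fail and retry: try the cached credential; on an auth
    failure, re-read the secret from the manager and retry once, on the
    assumption that the credential rolled over underneath us."""
    cred = read_secret(cached=True)
    try:
        return do_call(cred)
    except AuthError:
        cred = read_secret(cached=False)   # credential must have rotated
        return do_call(cred)

def next_refresh_delay(lifetime_seconds: float, fraction: float = 0.75) -> float:
    """Strategy 3: derive a fresh credential after 75% of the old one's
    lifetime, leaving the last 25% as margin to diagnose any failure."""
    return lifetime_seconds * fraction
```

The 75% figure matches the talk's examples: a 12-hour token is refreshed after 9 hours, a 1-hour token after 45 minutes, and the remaining 25% is your budget for fixing a broken secret manager before anything actually expires.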

Here's the rough stuff. At Viasat we took one of our next-generation ground systems, and we had a requirement to do this frequently: we wanted to rotate credentials basically whenever we wanted, with a maximum of 90 days, or wherever you want to set the time limit. Implementing all of this across a variety of apps with varying support for it taught us a lot, so here's the fun part, where you can see all the ways we shot ourselves in the foot. I have these structured as a problem and then the lesson we learned. Problem number one is what we called ubiquitous

credentials. How many of your service credentials are shared between totally different things? For example, say you have a read service account: client one uses it to read from some database, and because it was convenient at the time, and it was a Friday and I didn't feel like making another service account, we also stuck it on client two, which uses it to query information from a totally separate API. Alternatively, say I slapped an identity certificate on a box where two servers are doing totally different things; it was just convenient to put them on the same box for reasons, and they're using the same identity cert. If both clients, or both apps

or processes, whatever, use the same credential, they become inexorably linked. They are stuck together forever, and you might not have thought about that when you did it, but it will probably cause a problem later. When you rotate the credential, they rotate together, and if something goes terribly wrong and there's downtime, that happens together too, so your blast radius is increased. And if one client gets compromised, both systems are at risk: if I lose service account A from client A and it's also used by client B, my attacker might figure that out, and now they've got two for the price of one. This flies in the face of the least

privilege principle that we espouse a lot, but it's convenient sometimes; I think we've all done stuff like this. So lesson number one: granularity is important. Service credentials should always be split along logical boundaries, whatever that means for you. Applications: probably a good idea. Deployment environments: you should have separate credentials for dev, test, pre-prod, prod, however many environments you have. Backend systems: if they're different, use separate credentials. RBAC level: if somebody only needs read and all you have right now is full CRUD, make a new one; it's not worth sharing. Any time you need to make a new credential for a system,

you should ask: what is the current blast radius of the credential I already have, and how would sharing it increase risk if it gets compromised? Again, this is very basic, it's the principle of least privilege, but as developers we sometimes get lazy and cut corners, and you'll find that out once you start rotating credentials and stuff breaks in totally random places you had no idea about, and you get a call from somebody asking why their app broke in pre-prod when you rotated a prod credential for a totally different system. Here's a picture I like of a gate that has six locks on it, but you only need one key to open it.

Yeah, I lifted this off Reddit. This is a ranch thing people do when they have lots of people with different keys and different access, but they all need access to one place. Any one key opens the gate, but if somebody loses their key, you only have to replace one lock instead of all six. Okay, problem number two, this one's a little more fun: do you know what your client software does if it doesn't have valid credentials? Well, I can tell you what ours did. Scenario one: retry with no delay. The client got stuck in a CPU- and network-bound retry loop, bashing against

the API it was trying to talk to with the bad credential. The case that the credential might be wrong wasn't even considered. This overwhelmed the target API, or actually at one point we overwhelmed the secret manager in a self-inflicted DoS attack; we DoS'ed ourselves. Scenario two: so we fixed the DoS problem, maybe, and implemented some backoff, but then you can end up with this thing called a thundering herd. If you have 30 replicas and all of them need to do a refresh cycle at once, maybe they all wait for 1 second and then try again, then wait 2 seconds and try

again, then 3 seconds. You're just creating a little wave effect: instead of a maximum-efficiency DoS attack, you're DoSing very slowly over time. This is the well-documented thundering herd problem. So yes, jitter is important: when you do time calculations, you always need to add a little randomness to fudge the timing so you don't do this. Here's another one that's kind of fun and a little more insidious: what we called leaky threads. We had some designs where, especially in Go or a language that uses a

user-space threading model, you just spin off a thread for your connection and assume it will raise an error of some kind when there's a credential problem. If you don't actually terminate that thread once it raises the error, you just leak threads over time. Whenever we did a credential rollover, we'd have maybe 20 threads running on connections; we'd leave all 20 of those still running and bring in another 20 with the new credential, and if we rolled it again we'd be at 60, or whatever it was. Over time these threads would pile up, and they just filled up the network.

Whole boxes, and we were wondering where all this traffic was coming from when we were only running three apps; it was because one app had like 500 threads in it that weren't doing anything but spamming credential errors over and over. So if this sounds like a problem you might have and you've never tried it, you might want to find out, because this is bad. Lesson number two is that WAFs, web application firewalls, are always worth it. You should never assume your internal clients will behave themselves. If the service you're talking to is critical, it needs a WAF, because if an internal client on a fiber link can knock

you over, like if it can generate a gigabit per second of traffic, that's insane; you've created a liability for yourself, because if I'm an attacker and I get inside your network, I don't even need to poke anything: if I can just generate an internal DoS attack, you're dead. This might be showing my age, but there's a clip from SpongeBob that I like: "if I were to die right now in some sort of fiery explosion due to the carelessness of a friend, well, that would just be okay." Don't say that; build a WAF. Okay, here's another fun one. Problem three: lockout events. We've

talked a lot about human credentials today, and typically if you have a human credential and you just spam passwords against an API, you get locked out. The parameters may differ, like how long you're locked out or after how many tries, or whether a reset clears it. But if you have service accounts that look a lot like human users, do you know whether their password policy is different from your human users'? Because for a while ours wasn't. If that password policy is the same, all it takes is one client with a bad password to lock the account out, and then nobody can use it,

and if it's shared among, say, replicas of an app, you've just taken all of your replicas offline, and all it took was one bad client. In fact, you don't even need the password, just the username: if your API supports lockout by username alone, then anybody who has the username can do it; the username becomes a secret. That's kind of insane. A human can do this: if I can grab one of your usernames, and you have a public API exposed to the internet that the service accounts can auth to, I don't even need to be inside your network. If I can learn the name of one

of your service accounts, I can just spam your API and lock it out whenever I feel like it. Ask us how we found that out. Now, this is an easy problem to fix: you just change your password policy for service accounts. But the same general problem applies to WAFs and client IPs behind a NAT. Say you have a bunch of apps running on a node, and one of them is just spamming an API because it's broken, or it's an old deployment in a test environment that you forgot about. If that creates enough traffic to trigger a WAF IP block on a service you're using for legitimate

clients, that whole node is dead, because the WAF just blocked the node's IP. This is particularly insidious on Kubernetes, where apps move around between nodes: you can have one pod that moves from node to node, spamming and locking out each whole node in turn. Ask us how we found that one out, too. So the lesson here is that one bad apple can spoil your bushel, literally. Service account lockout policies can easily be replaced by monitoring and high complexity requirements. If you treat your service accounts like you treat humans, you're probably doing something wrong; you probably shouldn't be using service accounts in general, but if you're stuck

with them, you do what you can. The NAT blocking problem is a lot harder to fix, because in a lot of places we don't have anything better than a client IP. So ask yourself: how would you figure out what's doing it? If you get an alert from your WAF that an IP internal to your network has been blocked, how quickly can you figure out what on that node is causing the problem? It's not as easy as running ps and looking at a process listing; we've tried that. If you're on Kubernetes, you've got to find a pod, reverse your way through

the NAT to the pod, figure out the pod IP, and then start looking at pod logs; it takes forever. Get good at doing this if it sounds relevant to you. All right, problem number four. We talked about secret managers: do you know what happens if your secret manager has bad data in it, either because your rotation pipeline broke down, or because a user decided to put "haha funny, I'm quitting" in place of your service password? Or here's a worse scenario: say your rotator is running, it's updated a password, and it's waiting to write it into the secret

manager, and the network goes down at the critical moment. We have the new password, but we can't give it to anybody, because we literally can't talk to anybody. This is also an avenue for attack: a lot of the time we think about secret managers being compromised, read out, or dumped, but I don't need to dump it if I can just wipe it or fill it with garbage; eventually I'll crash your system either way. How reliable is your rotator program? If you're waiting for a reverse alert from a customer or a developer team saying "hey, something isn't working," then you're going to have downtime. So the

lesson to be learned here is that for everything credential related, whenever you rotate anything, you need layers of logs and alerts. Reverse alerts are the final layer of protection for credential-related issues; if you've waited that long, you've already failed. You should be getting forward alerts or events from your rotator: if it knows it failed, it should let you know, maybe it should even page you through xMatters, because this could be really bad. And your application: if it isn't logging "hey, I tried to use this credential and it didn't work, help" before it retries, that's a problem; this should never happen silently. You

also probably want some big-picture alerts to monitor what a credential is doing in your system. We often think about this for people, to see where they're authenticating from and what they're doing, but you should be doing the same thing for your backend systems too. After you rotate a password, you're probably going to see failures of some kind as the apps figure out that the thing rotated and they need to refresh to the new one. Do you know how many failures you're expecting to see, and for how long? What's your stagger window? If it's 12 hours, then after that everything should have

recovered, so anything I see beyond that is probably weird, or something that's not recovering. How can you detect that a system is not recovering? Can you detect it without even looking at the app logs, just from seeing that a credential keeps failing from one IP long after it should have recovered? Can I raise the alarm to the devops team to go take a look, because the security log indicates there's a problem here? This is a picture of modern tank armor: you have your titanium alloy, then a bunch of ceramic, then more titanium alloy. You should think about your

logs for credential-related stuff the same way: if you only have one layer, the first bullet that hits you kills you. Okay, problem five. Everybody these days is probably using a lot of commercial off-the-shelf software, and a lot of it, for various reasons, doesn't include hooks for what we'd call touchless credential refresh. A lot of COTS software is designed around the assumption that you deploy it and the cred never changes. Some of it, especially more modern stuff, especially for certificates, says "hey, we put a watcher on the disk to see if your certificate has changed; we'll re-read it every 5

minutes, so just drop it on the disk and we'll be fine." But take something like Apache, where you have to do a soft reload to get it to re-read the cert. Apache is fine, it's a wonderful product, but they didn't think to include this, so you have to reload it. These kinds of assumptions are built into a lot of COTS products, and they can make the products challenging to use when you're really trying to be robust about credential automation, because you have to start bolting on additional steps. It's like: okay, after I rotate this credential and push it into the secret manager, I have to log onto this node and do this thing, or I

have to write an API to let me remotely reboot all of my proxies, because they don't understand that the credential has been changed. In general, I think that's just a relic of assumptions we made a long time ago — that credentials don't change very often, so there's no need to keep up with this kind of thing. But dealing with this in COTS software is instant technical debt. If you have to write bolt-on code for "hey, I've updated my password, please reload it," it's just silly — at best it's flimsy, at worst it's technical debt. And even some software that does let you change the password doesn't let you tune anything that would be good to tune: how long do you want to back off for, what backoff strategy do you want to use, what jitter calculation do you want to use to prevent these thundering-herd problems? It's just not exposed.

So, lesson five. This is a plea that we made to all of our developers, and a plea that I'm making to all of you: if you write software for security purposes that interacts with credentials, let this stuff be tunable. Assume I'm going to want to change every credential at runtime unless there's some critical reason why I can't and I must reboot. If I have the option to change it for networking or whatever, just let me do that — make it easy. Give me a callback mechanism in your client library that I can use to refresh. Add file-watch functionality so you can reload from disk. Give me options to control backoff, jitter, and retry. Do not hide credential-related logs at debug level — they belong at warning, error, or critical. The number of times we've had to turn on a debug log just to see that a credential was failing is ridiculous, because I can't run debug logs in production; they produce terabytes of data. Those messages don't belong there. And never swallow errors — some applications will fail hundreds of times before they even log an error or let you

know. If this is me trying to use your client application, you've done something wrong — you're failing me.

Okay, so I have some final thoughts. I'm a little ahead of schedule, so we'll have time for some Q&A. Credentials are only useful if you can change them. Rotation algorithms are pretty basic to implement — at least I hope I made them seem basic — but they're really hard to get right, because there are a lot of corner cases and conditions you need to worry about. Suddenly you need to worry about: what if the network is down? How long am I retrying, and under what conditions do I retry?

Practice makes perfect, and it requires collaboration between developers, security, and operations — hey, if you look at the highlighted letters, there's a buzzword in there: DevSecOps. Doing this can turn a DevOps team into a DevSecOps team. You make them care about their credentials, and if they don't care, break stuff — break their test deployments and make them care. And finally, once you gain confidence and can start doing these rotations with zero downtime and everything just works, don't stop. A lot of the time we look at requirements that say, oh, I need to rotate these every 90 days, or once a year. Why? If I can do it once a week, why not do it once a week? And if your developers only want to let you do it in a maintenance window, fight that. I want to be able to do it during business hours, so you don't have to call me if something goes wrong — I'm already at my desk. All right, time for questions. [Applause]
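The tunable client behavior Kent pleads for in lesson five — refresh on demand, exponential backoff, jitter, a rotation callback — can be sketched in a few lines. This is a hypothetical illustration, not any particular library's API; every name here is made up:

```python
import random
import time

def backoff_delays(base=1.0, cap=60.0, attempts=5, rng=None):
    """Full-jitter exponential backoff: each delay is drawn uniformly
    from [0, min(cap, base * 2**attempt)], which spreads a fleet of
    clients out instead of letting them retry in lockstep."""
    rng = rng or random.Random()
    for attempt in range(attempts):
        yield rng.uniform(0, min(cap, base * 2 ** attempt))

class RefreshingCredential:
    """Wraps a credential value and re-fetches it on demand.
    `fetch` is any zero-argument callable (e.g. a secrets-manager read);
    `on_rotate` is an optional callback fired when the value changes."""
    def __init__(self, fetch, on_rotate=None):
        self._fetch = fetch
        self._on_rotate = on_rotate
        self._value = fetch()

    @property
    def value(self):
        return self._value

    def refresh(self, base=1.0, cap=60.0, attempts=5, sleep=time.sleep):
        """Re-read the credential, sleeping with a jittered backoff after
        each failed attempt; raises the last error if every attempt fails."""
        last_err = None
        for delay in backoff_delays(base, cap, attempts):
            try:
                new = self._fetch()
            except Exception as err:  # real code: catch the client's specific errors
                last_err = err
                sleep(delay)
                continue
            if new != self._value:
                self._value = new
                if self._on_rotate:
                    self._on_rotate(new)
            return new
        raise last_err
```

The full-jitter strategy (delay drawn uniformly from zero up to a growing cap) is one common way to avoid thundering herds; the point is that `base`, `cap`, and the attempt count are exposed to the operator rather than hard-coded.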

[Audience] Hey, thanks for your time. I used to work in a database-monitoring role a long time ago, and one question that kept coming up again and again was: IBM Guardium flags this event as a credential-failure event, but I'm looking at my application, and my application doesn't flag it as a credential failure or a password error. So how do you reconcile your common definition of what a bad-credential event is? I know it's up to the individual monitoring tools to flag that behavior, but how do you come up with a common standard for classifying a bad-credential event?

This is actually a really good question, because we built detections for this stuff — to understand when a client is recovering and when it's not. The best I can tell you is that it took a lot of practice, because the numbers change depending on how many replicas of an app happen to be deployed when we do the rotation. If we're doing a weekly rotation and the developer team doubled the number of apps they're running since last week, maybe that's 4x the number of credentials floating around — like two per thread or something. I wish I had a magic answer, a heuristic you could use, but literally we did it a bunch, saw what the numbers looked like, saw what was normal, and then figured out: okay, here's the line in the sand, let's build a detection around anything significantly higher than this. Basically we have two types of alerts. We have a threshold alert: if we see way too many failures in a very short time period, that means a client is broken and in DoS mode, and we need to go deal with it now. And then we have a separate check for what we call persistent misconfiguration — something didn't update, but it's not causing enough trouble to disrupt the network. So we have a different alert that looks back over the last two days and asks: did anything fail at least once an hour, in six-hour buckets? We built a thing to do queries like that, and it turns up things like broken deployments in pre-prod as distinct from a real failure event. And if the threshold alarm triggers, and then triggers again, we take that as a cue to go look into it. I hope that answered your question.
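The two detections described here — a fast threshold alert for clients in DoS mode, and a slow look-back for persistent misconfiguration — might look roughly like this toy sketch. The window sizes and limits are placeholders; as Kent says, the real numbers came from practice:

```python
from collections import Counter

def dos_mode(failure_times, window_s=300, limit=50):
    """Threshold alert: true if any sliding window of `window_s` seconds
    contains `limit` or more auth failures -- a client stuck in a hot
    retry loop, effectively DoS-ing the secrets backend.
    `failure_times` must be sorted unix timestamps."""
    lo = 0
    for hi, t in enumerate(failure_times):
        while t - failure_times[lo] > window_s:
            lo += 1
        if hi - lo + 1 >= limit:
            return True
    return False

def persistent_misconfig(failure_times, now, lookback_s=2 * 86400, bucket_s=6 * 3600):
    """Slow-burn alert: true if every bucket of the look-back window saw
    at least one failure -- something never picked up the new credential,
    but isn't failing hard enough to trip the threshold alert."""
    buckets = Counter(int((now - t) // bucket_s) for t in failure_times
                      if 0 <= now - t < lookback_s)
    return all(buckets[b] > 0 for b in range(lookback_s // bucket_s))
```

A broken client trips `dos_mode` within minutes; a quietly stale one only shows up in `persistent_misconfig` after the look-back fills in.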

[Audience] Yeah, great talk, man — I don't even know where to start; really enjoyed it. I'm curious whether you actually had to deal with incentivization to make the DevSecOps reality happen. Did money — if we cut the BS — have to become part of the conversation for the DevOps side of the house to be properly incentivized to take the security requests in tow?

So I'm actually pretty pleased — the culture at Viasat has been, I think, pretty great about this. But we did have to, quote-unquote, grease some palms, in the sense that I wrote a lot of this stuff myself. We'd write a ticket and have a hard time getting it prioritized, especially for inner-source stuff — libraries that people work on sparingly — so we just had to step in and write it. Once we had it written, the teams were pretty open to implementing it, and to helping us tune the algorithms and find where we got stuff wrong, and it kind of rolled from there. We've gotten a lot better about it, but we did need a burst of developer effort to get it going.

[Audience] Hi — great talk, by the way. My question was maybe just partially answered by your

earlier comment around detections, but how do we prevent an internal CrowdStrike event — a bad change going out, or bad credentials being pushed as part of the rotation? How do you see that happening?

Practice — practice makes perfect. Like I mentioned with these DoS events, we also had scenarios where we just desynced the secrets manager. I can think of one where we had a network condition: we tried to rotate the password and the response got lost coming back, so we retried and rotated again — and then the first response came back. So we wrote the first one in, but the request had already gone out to change it again, and the second one just vanished into the ether. There's almost no way to stop that unless you have a rollback mechanism — something like HashiCorp Vault lets you roll back to a previous version, so once you figure out what's going on, you have a button to go backwards. If you've already updated your app's local store, you're stuck; you just need to redeploy. So you fall back to basic DevOps principles: I need to be able to redeploy really fast, and I should be able to do restarts really fast if I'm not going to do a full redeploy. Your detections should catch this stuff, and the app teams — if you're not warning them, they should be coming to you.
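The lost-update race described here is the textbook case for check-and-set writes: Vault's KV v2 engine exposes the idea through an optional `cas` option on writes, and keeps old versions around for rollback. Below is a toy in-memory illustration of the mechanism, not Vault client code:

```python
class CasConflict(Exception):
    """Raised when a write names a version that is no longer current."""

class VersionedKV:
    """Toy versioned secret store illustrating check-and-set writes.
    A write must name the version it expects to replace, so a delayed
    retry can never silently clobber a newer rotation; older versions
    stay around, giving you the 'button to go backwards'."""
    def __init__(self):
        self._versions = []          # index + 1 == version number

    def read(self):
        """Return (current_version, value); version 0 means empty."""
        if not self._versions:
            return 0, None
        return len(self._versions), self._versions[-1]

    def write(self, value, cas):
        """Append a new version only if `cas` names the current version."""
        if cas != len(self._versions):
            raise CasConflict(f"expected version {cas}, store is at {len(self._versions)}")
        self._versions.append(value)
        return len(self._versions)

    def rollback(self, version):
        """Re-publish an older version's value as a new current version."""
        old = self._versions[version - 1]
        self._versions.append(old)
        return len(self._versions)
```

With check-and-set, the delayed duplicate rotation in Kent's story fails loudly instead of silently vanishing into the ether.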

We shouldn't be waiting for a customer to tell us something is broken; the app team in the middle — the DevSecOps team, if you will — should also catch it and come let us know: hey, critical problem, what did you do recently? We actually built dashboards for this, because to a certain extent my team runs a lot of the automation — like our rotator — on behalf of other teams, and we don't always watch it closely, because it just works. So we built dashboards for all of the teams that just post the data: here's all the credential work we've done for you recently. And we've trained them: if they see a problem they suspect is credential-related, the first thing they go to is that dashboard, and if they see that something happened recently, they call us immediately and say, hey, help.

[Audience] How often do you verify that the app asking for something from the secrets manager knows the old secret? As in: I want a new secret, and I prove to you that I'm a legitimate app by giving you the old secret. Do you find that necessary?

No. In general, the connection to the secrets manager is a totally separate credential from the credential being consumed by the app, and we do monitor that. In our case we use a lot of Vault, so aside from the secrets in Vault being consumed, we also watch Vault itself for weird events: apps that aren't authenticating to Vault properly, or constantly failing to log in, or trying to read paths that don't exist, or reading a path over and over again when they probably have no reason to. We have stuff to flag all of that and, again, poke the DevOps team and say, hey, what gives — are you doing this on purpose, or do we have a broken deployment? Thankfully, almost all of this has been constrained to non-prod environments for us; by the time stuff got to prod it's actually been pretty well behaved, which has been very good. Fingers crossed.

[Audience] Hi, yes — I'm a DevOps engineer who regularly hangs out on the security side of town, and most of what I do with my team is kind of DevOps-as-a-service. So how would I win over the hearts and minds of the developers to be able to — for lack of a better phrasing — take care of their stuff and just do what they need to do, with that last plea you made, the call-out to developers? How do you win over those hearts and

minds?

So I'd say the formula that's worked for us is this: there's the idea of the security dragon that guards the gate and says, I won't let you into prod unless you do X, Y, and Z. You have to use that very occasionally, when there's a blatant crypto violation or a critical vulnerability, but for all the squishy stuff in between, you've got to meet the developers on their level and ask: how can I help? Do you need me to write the code for you? Do you need me to sketch the algorithm? Can I help you write the pipeline to test it? Can I supervise the test? I'll stay on call with you when we try this for the first time. Work within the constraints they've got, but also push them to be better. It's almost more of a quest for excellence than just meeting the bar. The bar would be: in a maintenance window, I can change my creds, and I'm done. The excellence is: let's build confidence, let's be able to do this at any time, and let's understand how to respond. That last bit is what really scares people — hey, my credential changed out from under me, I don't know what to do, who do I call, how do I recover? We've written pages and pages of docs about how to get hold of us, how to handle it yourself, and how to interpret the alerts we're giving you. Over time things have gotten better, because the organizational knowledge builds up — how to handle this stuff, and that we should be expecting it — and the excellence snowballs from there, I feel.

[Audience] In one of your slides you said to implement a file watch. A file watch doesn't really work well in a containerized environment, especially for some implementations like

Spring, where you usually have to poll for file updates and see whether the file has changed. Have you found a way around that, or what's your experience with it?

No — polling. In containerized environments specifically, I've seen a couple of approaches. We have a sidecar model — I could probably talk about different options for COTS if you want to chat afterwards — but a very basic way to do this, if you're not interested in a full-blown framework like cert-manager, is to implement a sidecar that uses a shared memory-backed mount to pass files between containers, and then have both sides poll it every once in a while. Again, how you configure those parameters is up to you. I understand where you're coming from, though — certain frameworks make this harder, and that's where you have to bolt on custom stuff and just try to keep the technical debt as low as you can. Or, if you have a refactor coming down the line that would make it easier, make sure this is at least on the list of requirements when you're thinking about your next redesign or refactor.
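A minimal version of the "both sides poll the shared mount" approach might look like this sketch — the function and parameter names are hypothetical, and the hooks (`keep_going`, `getmtime`) exist only so the loop can be driven deterministically in a test:

```python
import os
import time

def watch_credential_file(path, on_change, interval_s=30.0,
                          keep_going=None, getmtime=os.path.getmtime):
    """Poll a mounted credential file and call on_change(path) whenever
    its mtime moves. Plain polling works in places where inotify-style
    watches don't (overlay filesystems, volumes shared with a sidecar).
    By default the loop sleeps `interval_s` between polls and runs
    forever; pass `keep_going` to control the loop (e.g. in tests)."""
    last = getmtime(path)
    while True:
        if keep_going is not None:
            if not keep_going():     # injected loop control: stop when False
                return
        else:
            time.sleep(interval_s)
        mtime = getmtime(path)
        if mtime != last:
            last = mtime
            on_change(path)
```

The `on_change` callback is where the app would re-read the file and swap in the new credential; `interval_s` stays tunable, per the earlier plea.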

[Audience] So, this is not a question but a follow-up to your question: if you store a certificate as a secret in Kubernetes, you can trigger on that event to reload the application. At least, that's what we did.

Yeah — certificate management in Kubernetes is a whole ball of wax. Just make sure you have your authenticator configured correctly for your domains.

[Moderator] A last question for Kenton before we go for a break? No more questions — one comment.

[Audience] Okay, well, one comment, if that's okay — you can correct me if I'm wrong. Everyone has a hard time getting the business involved in security. If you can get your board into the mode of security, then you can ask them to make sure everyone below the CEO has it on their performance review that they do their work securely, and then they will come to the security department and ask for help. That's my suggestion.

Microsoft just did that, right? C-suite pay is now tied to security performance, because they had too many critical cyberattacks. Everyone should be doing that.

[Audience] Oh yeah — and I have to be honest as well: during your talk, several times it was a thing about time, and short-lived credentials and such. I just saw some research coming out — I think it was academic research, I'm not sure, maybe it was from Germany — where they had been looking into GPS, and systems using GPS, because GPS doesn't only send you your position, it also sends time. And they found out that in most systems using GPS and GPS time, there is built-in protection so you cannot turn time back, but there's no protection against turning time forward. They figured out there are quite a few airplanes whose systems use GPS for both position and time, and that also rely on certificates. So they said: if you can just sit outside McCarran airport and do GPS spoofing — you don't have to care about location, just spoof time and put it, say, one year into the future — there won't be a single plane leaving that airport for quite some time, because you'd have to update the firmware by connecting to all the systems physically. So, have a nice flight back home — and thank you.