PKI - Avoiding common pitfalls - Rick Davis

BSides Peru, 36:31, published 2022-09


So, a little bit about who I am. As you know, my name is Rick Davis. I'm a senior customer engineer with Microsoft, and what that means is I don't work internally; I'm out working with our customers, generally our largest enterprise customers. That covers everything from the very reactive, such as incident response or threat hunting, to the very proactive, where I teach workshops, do assessments, and help customers deploy products, with design, architecture, all of these kinds of things. Over my career I've spent a tremendous amount of time with PKI. I probably spend about 30 percent of my time on it now, and it's been as high as 75 and 80 percent. I have a hand in developing our workshops, our assessments, and most of our proactive offerings around PKI. If you're on Twitter, you can find me there at the ratelin.

With that said, even though I work for Microsoft (and I do like to mention it, because it's a cool company), they're not represented here; they didn't send me here. All the thoughts, opinions, bad jokes, and sarcasm are mine, so I just like to call that out.

All right, so why are we talking about PKI? Over the last year in particular there's been a tremendous amount of press and new things happening. In roughly reverse order: golden certificate attacks, new tools like Certifried, the PKINIT patches (vulnerabilities, if you will), and then the tremendous work that was put out in Certified Pre-Owned, an amazing paper, if you haven't seen it. All of this is really good research, and it had a lot of organizations doing a fire drill, but there was a bit of a problem of putting the cart before the horse.

In all the time I've been working with PKI, I'd guess I've seen about 600 or so different PKI implementations in about 20 years, and across all of them I see a lot of the same problems over and over. That's the purpose of this talk: I want to call out a couple of the most common security and operational issues that I see. All of these new things are really great, and they're things you should pay attention to and deal with, but if you haven't done the basic things, they're not going to help a whole lot. You're not going to get the ROI, and there are better approaches. So that's the goal here: to call out some of the big things I see, things that are very actionable, that anyone in an infrastructure or certainly a security role can go and look at and hopefully drive some change and improvement.

With that said, we'll start with a little bit of terminology around PKI and how a couple of things work at a high level, just so we can level-set everyone's knowledge and make sure we're on the same page. If you look at a certificate in the GUI on a Windows system, it'll look like this. It's relatively straightforward. All of the data represented there you can also see on the command line, and the same would be true in a Linux environment or just about any other system. PKI is standards-based, so you're always going to see essentially the same fields and roughly the same structure.

A few items are going to be very important for our discussion today. The first is what we consider the subject, or the identity: a certificate has to be issued to something. After that, we have who issued the certificate, generally a certificate authority. And finally, there's the lifetime of the certificate: when a certificate gets issued, it has a time range baked in, when it's valid from and when it's valid until. Keep these elements in mind, because we're going to talk more about them as we go.

Another big item we'll talk about is what's called a CRL, a certificate revocation list. Even though a certificate has a lifetime, it can be revoked, meaning an administrator can mark it as not to be trusted and not to be used before it would naturally expire. Let's say I issue a certificate to a server, and that certificate is good for a year, but two months later, for whatever reason, that server needs to be decommissioned and is no longer used. I shouldn't let that certificate live another 10 months; I should revoke it. That's just good care and feeding of the environment. A lot of times that doesn't happen, but it should.

The CRLs have to be somewhere clients can get to them, and that's what we're illustrating here: one of the fields in every certificate instructs the client where to go to get the CRL when it's checking a certificate.

So when we check a certificate, what happens? A couple of key items. First, we do a subject verification. If I go to www.microsoft.com, I'm going to be presented a certificate in the browser, and that certificate has to be issued to www.microsoft.com or an acceptable wildcard. If it's not, I'm going to get a warning. Now, a browser is kind of a special case, because I get a warning and, like any user, I'm just going to click OK and keep going. But most things that use certificates happen programmatically, so any failure stops something from happening. Either way, there's always that subject check to make sure the verification is correct. Then we check the time: is today's date and time between the valid-from and valid-to fields? Those are very quick checks that happen right off the bat.

Then there's the big one: the revocation check. This is a really big deal. It's one of the biggest reasons PKIs go down and one of the biggest causes of PKI-related outages in enterprises. The revocation information is very important regardless of the structure and size of your PKI. Every certificate authority has to create these CRLs, and clients have to go and get them and check, any time a certificate is used, whether it's been revoked. Even if I checked 10 minutes ago, it might have been revoked eight minutes ago. There are differences in how that's configured: the timing, how often CRLs are issued, how clients can cache them. There's a lot of complexity there, but one way or another, clients have to be satisfied that they've done that check according to the rules of the system and that the certificate hasn't been revoked.

Okay, just got brighter. All right.

So I would argue that CRLs are the most important part of a PKI. I've been at large organizations and seen CRLs fail: either they become unavailable, or someone drops a network rule or changes permissions on where they're supposed to live, any of those kinds of things. I've seen a CRL going down or being unavailable for a couple of hours cost enterprises hundreds of millions of dollars. For most organizations with an internal PKI, it's a critical piece of the infrastructure, just like email, just like domain services, but most organizations don't treat it that way, so when there's a problem, it's big.

If the CRL itself is expired, certificates will fail by design. If I need to trust this certificate, I need to know whether it's been revoked, and if I can't check, I'm not going to trust it. I would much rather not trust a certificate that may still be valid and trustworthy than possibly accept one that's been revoked, because PKI is a system of trust. In the same way, if the client can't find the CRLs, it's as if they weren't there at all. If the CRLs are misnamed or misconfigured in any way, if they don't match what the client expects when it does that check, the certificate is going to fail. So these CRLs are incredibly important.

That brings us to the biggest problem I see out there: operational issues around these CRLs, and specifically the CDP, the CRL distribution point. If we think back a few slides to that screen clip of a certificate, with its list of CRL locations, that's what the CDP is: the set of locations where clients can go and get these CRLs. Ideally there's a lot to it. It's not just one file server somewhere, and hopefully it's not an actual file server at all. When you look at that list, you've got a number of different options, and these options get configured at the CA level, so if you have a number of different CAs, you configure all of this per CA. Some of the things you might consider are how long the CRL is going to live, different types of CRLs, and where the CRL lives specifically. There are a lot of options: it could be served over HTTP from an IIS system or something similar, it could be stored in Active Directory as an LDAP location, and in some cases it could be a file system location, FTP, all of these other things.

The key with this is that we don't want points of failure. A lot of times PKIs get built by someone following a guide online and clicking Next, Next, Next, Next, and yes, that works: you're going to have a CA, and it's going to issue certificates. But when things go wrong, they're going to go really wrong. Design and architecture of a PKI for an average organization is generally about six months of work: whiteboarding, discussions with different teams. You can script out and build a whole PKI in five or ten minutes, but you need all of that good design work to get it right.

So that's our problem: there are points of failure within these locations. More specifically, what do we see, and what can we do about it?

One is out-of-band movement of the CRL files. Every CA has to make these files, and the CA should be configured to write them directly to wherever clients expect to find them: those IIS boxes, the LDAP locations, whatever that happens to be. A lot of times I don't see that happening. Instead, we see the CA write the file to its local file system, and then some script or other process moves that file where it needs to go. That's generally a bad idea, and here's why. CAs do a really good job of letting you know when something's going wrong, if you look for it. But if you're relying on something out of band, the CA doesn't know about it. If you tell the CA to just write to the local file system because you're going to move the file later, it does that, it does its job, and it thinks everything is fine. It doesn't actually know whether the file gets where it needs to go. So you're introducing more things that can go wrong: the script could break, something could change with permissions on the script or the target location, or, in really bad cases, the script gets hijacked by an attacker or some other process and becomes a way to escalate privilege. All of this adds complexity that didn't need to be there, because the CA already knows how to write the file. We see this a lot, and it shouldn't happen. The CA should be writing the file where it needs to go.

In addition to that, we see the concept of putting all of your eggs in one basket. The CA itself can actually host these CRLs (you can run IIS on a CA), but you shouldn't, again because of points of failure. If I have a separate server that hosts my CRLs and my CA goes down, I do have some problems. Obviously, if that CA is down, I can't issue new certificates, I can't revoke certificates, and I can't renew any certificates. But my CRLs are somewhere out there, safe, likely with a lifetime of days or more, so any certificate that's already out there works just fine, and I haven't really taken the enterprise down. If I put everything on the CA and the CA goes down, I've lost everything. Within PKI there are a number of different roles and ancillary services you might use outside of the certificate authority itself, and the recommendation is generally to separate all of those roles.
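The difference between hosting CRLs on separate servers and hosting them on the CA itself can be put in numbers. The sketch below computes how long already-issued certificates keep validating after the CA goes down, under two simplifying assumptions: clients need a CRL whose nextUpdate has not passed, and client-side CRL caching is ignored (so a CRL hosted on the failed CA counts as gone immediately). The function name and the 9-day validity (for example, a weekly publication interval plus some overlap) are illustrative choices, not recommended values.

```python
from datetime import datetime, timedelta


def outage_headroom(last_publish: datetime, crl_validity: timedelta,
                    outage_start: datetime, cdp_on_ca_host: bool = False) -> timedelta:
    """How long existing certificates keep validating after the CA goes down.

    Simplified model: clients need a CRL whose nextUpdate has not passed, and
    client-side CRL caching is ignored.
    """
    if cdp_on_ca_host:
        # The CRL files were hosted on the CA itself, so they went down with it.
        return timedelta(0)
    next_update = last_publish + crl_validity
    # The already-published CRL keeps serving clients until its nextUpdate.
    return max(next_update - outage_start, timedelta(0))


# Hypothetical numbers: 9-day CRL validity, CA fails 4.5 days after publishing.
published = datetime(2022, 9, 1, 0, 0)
validity = timedelta(days=7) + timedelta(days=2)
failure = datetime(2022, 9, 5, 12, 0)

print(outage_headroom(published, validity, failure))        # 4 days, 12:00:00
print(outage_headroom(published, validity, failure, True))  # 0:00:00
```

With separate CRL hosting, the outage clock is the remaining CRL lifetime; with everything on the CA, there is no clock at all.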

They all have a vastly different attack surface, different ports and protocols, and they may need to live in different places in the enterprise and in different parts of a tier model, so they should be separated for that reason as well. As far as the PKI is concerned, you separate them out to avoid these points of failure.

Poor location design is probably the most important thing to consider, and it's a little bit harder to wrap your head around. By location design, I mean the locations where these CRLs live. There are a lot of different types, and there are a lot of places where a lack of redundancy doesn't get noticed. We want redundancy within a location type and between location types. When you configure these at the CA, you configure them as a list. It's an ordered list: the client parses the locations in the order they're defined, and that order matters. Think about this for a second. Let's say I'm using two kinds of locations: I'm storing my CRLs on a web server, and I'm storing them in Active Directory as an LDAP location, and let's say the LDAP location is listed first. What happens when a client on the domain tries to get the CRL? Hopefully everything is great for them: they're on the domain, the domain controllers talk with everything and converge very quickly, so that client is really happy and able to get to those CRLs.
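The effect of list order in that two-location example can be simulated. The sketch below walks an ordered CDP list for one client, using the timeout split the talk goes on to describe (a 20-second default budget, with each location allowed up to half of what remains) and assuming the worst case, that an unreachable location consumes its whole share. The URLs, the client functions, and `walk_cdp_list`/`usable_locations` are hypothetical.

```python
def walk_cdp_list(cdp_list, can_reach, total_budget=20.0):
    """Simulate one client walking an ordered CDP list.

    Each URL may wait up to half of the remaining cumulative budget; an
    unreachable URL is assumed to use its whole share (worst case).
    Returns (url_used_or_None, worst_case_wait_seconds).
    """
    remaining = total_budget
    waited = 0.0
    for url in cdp_list:
        share = remaining / 2
        if can_reach(url):
            return url, waited
        waited += share
        remaining -= share
    return None, waited


def usable_locations(cdp_list, can_reach):
    """Locations this client can actually use; fewer than two means no redundancy."""
    return [url for url in cdp_list if can_reach(url)]


# Hypothetical CDP list with the LDAP location first, as in the talk's example.
CDP = ["ldap:///CN=ExampleCA,CN=CDP,DC=example,DC=com",
       "http://pki.example.com/ExampleCA.crl"]


def domain_client(url):
    return True  # domain-joined: can bind to LDAP and use HTTP


def linux_client(url):
    return url.startswith("http://")  # cannot (or may not) bind to LDAP


print(walk_cdp_list(CDP, domain_client))         # equals (CDP[0], 0.0)
print(walk_cdp_list(CDP, linux_client))          # equals (CDP[1], 10.0)
print(len(usable_locations(CDP, linux_client)))  # 1
```

The domain client never notices a problem, while the non-domain client pays the LDAP timeout first and then has exactly one usable location, which is the designed-in point of failure.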

But what happens if I have a Linux device or a mobile device, something that can't, or that I won't allow to, bind to LDAP to get that information? Eventually it's going to time out, and there are some values built in for that. For this list of CRL locations, by default you get 20 seconds for a timeout, and it's divided among the locations in the list: the first location will wait up to 10 seconds, and each subsequent one gets half of the remaining time. So with the default 20-second maximum, you get 10 seconds for the first location, five seconds for the next, then two and a half seconds, and so on. In most cases this check happens very quickly, and clients cache this information based on the configuration; if everyone waited 10 seconds for every CRL check, the internet would grind to a halt. It's a maximum wait time, so a client that can't talk to Active Directory might figure that out very quickly, or it could wait out the maximum timeout. That's not a good experience.

But there's a bigger problem. If my two locations are LDAP and a web server, and the LDAP location isn't available to some group of clients, I've in effect designed in a point of failure, a lack of redundancy, because the only place those clients can go is my web servers. If those go down or become unavailable, they don't have anything else. Having an LDAP location and an HTTP location gives me redundancy between locations, and LDAP itself provides redundancy within its location, but that HTTP location on its own is a problem. So a lot of times organizations will build multiple HTTP clusters, and this is a big point in design and architecture, usually one of the biggest decisions you'll face. We can do some things to make it better right off the bat: we can build multiple IIS boxes and put them behind a VIP and a load balancer, but we still have that one URL we're hitting. We can use GTM and LTM technologies if we have other data centers or we're leveraging the cloud. But more than likely we're still going to want multiple groupings of these systems, in effect multiple locations.

When I say HTTP, it's important to note we're not talking about a website that you'd browse to. Even though it can be made available that way, that's not the intention. We're simply using the HTTP protocol to pass data around, because it's very firewall- and proxy-friendly, it's easy to monitor, and every system, every operating system, embedded devices, everything knows how to deal with HTTP.

So a big decision point with these locations is: do we use LDAP at all? Seven or ten years ago we wouldn't even have asked the question; everyone was using it, and it made a lot of sense. In the last three to five years, I see fewer and fewer new PKI implementations using an LDAP location, precisely because everything can use an HTTP location and not everything can use an LDAP location. Now, if you have a large PKI and some CAs that you know issue only to domain-based users and systems, then it makes a lot of sense to use that location. That's something to think about as well, because all of these settings are specific to a CA, and one of the reasons you might have multiple CAs is to change these kinds of configurations based on the clients they serve. But it is a tremendously big decision point.

And then we have the overall lack of redundancy. Regardless of the locations I use and regardless of what I design, you have to think through whether there are enough options. This is a tough question, because you're never going to feel like you came up with a really good answer. There's no cookie-cutter approach with PKI; it's very dependent on your needs, the kinds of certificates you want to issue, the design of your architecture in general, and a lot of other things. What's right for organization number one will certainly not be right for number two or number three. I've been doing this for decades, and I regularly see new designs that just wouldn't be appropriate anywhere else, and maybe I won't see them again. There's so much design and architecture work that has to go into doing this right for your organization. But regardless of what's there, you should be able to take a look, do a little thought experiment, and ask: if this location goes down, what else can the client do? Do I have more than one location? You don't necessarily need privileges on the PKI or the CAs to see these kinds of settings, because they're baked into the certificates. Hopefully you have some of this documented, not only the settings but the reasoning behind why it was done the way it was. If no one can answer that, it's probably time for some more design work and discussion.

As good as you can be with the design and architecture, at the end of the day you've got locations that exist outside of the PKI itself. We rely on these IIS boxes or LDAP, which means forces outside of the PKI can wreck your CRLs as well: a network change that blocks access to these locations, a permissions change where you store the files, a DNS change to one of the VIPs (and you know it's never DNS, so that couldn't happen). All of these things are outside the control of the PKI, and the PKI team probably isn't looking for them, but you want to be. You need a way to monitor that, either by giving that visibility to the PKI team or to a NOC or a SOC, depending on what your organization has. You want a way to know, from the outside, from the client's perspective: does everything look normal, is everything still accessible?

All right, switching gears a little bit: that's the big operational side. Now let's talk about a big item on the security side, which for the most part is privilege and permissions. CAs specifically, and really the PKI as a whole, are probably the next most important thing after your domain controllers. By that I mean that if an attacker compromises a CA, they're going to own everything, just as if they'd compromised a domain controller. More than that, it's going to be very hard to notice. Most PKI is treated a little bit like a black box; it seems like magic to a lot of people. People don't know it very well, but attackers know it, and with all the new research and tools there are a lot of ways in. So PKI and certificate authorities are what we consider a tier zero resource, just like AD FS, AD Connect, or the domain controllers: if an attacker hits one, you're done. Most organizations don't treat it that way, or when they do, they don't do their due diligence following all the paths to the CA like they would with a domain controller. So rogue permissions and odd privileges give an attacker an easy escalation path. But it can be hard to manage, because every CA is basically its own little island. You need a way to monitor not only the CAs but other things, like templates, which I'll talk about here as well. You need process, policy, documentation, and monitoring, and there's really nothing out of the box to do these things. You've probably already done this with other applications and critical services in the environment, but more often than not, PKI gets left out. We want you to start doing it with your PKI.

So let's start with templates. What is a template? Templates are a construct of Windows Server-based PKI, Active Directory Certificate Services. Microsoft didn't invent PKI; we follow the X.509 standard for the most part.

But templates are primarily a Windows construct, and they're a way to rapidly issue certificates, in a lot of cases without any real user input or impact. Without templates, if I want a certificate, I have to supply all of the information the CA needs to build it: specific fields in a specific order with specific syntax, tens and possibly hundreds of different items that have to be just right. At worst, the certificate won't get issued and the request will error out; or, in really odd cases, it will get issued and weird things will happen. So it was very, very difficult. With a template, I can configure all of those things as an administrator, and then all of my users, or all of my SQL servers, or all of my domain controllers get cookie-cutter certificates for very specific use cases that I can control. Templates are set once in the environment, and then I load them wherever I want the CAs to service them.

But templates come with a lot of permissions. Take a look at a few of them in those little check boxes. There's not a lot of variance and there aren't a lot of off-the-wall options; they're pretty straightforward. Read means we can use the command line or a GUI tool and see that the template is listed. Enroll allows us to do a manual enrollment. Autoenroll means that with something like a GPO, I can turn that on for a template, and without anyone's input or impact I can issue certificates to anything domain-joined. But there are a few other items there, and they're a problem: Write and Full Control. There's no reason those get given out. Ninety-nine percent of the time there should never be a Write privilege given; it just doesn't make sense. Full Control will happen by default: the last administrator who made a change to the template, or perhaps the person who created it, is going to have Full Control, but that should only be an administrator. So there's a lot of cleanup needed there as well.

First, think about overly broad rights. I see this a lot. If you look at the second item from the bottom, just above Enterprise Admins, it says Domain Computers. That was probably a default in this template, and Domain Computers means every computer, every server, literally every machine in the domain gets whatever those rights are, as I'd see if I clicked on that entry. If I'm only issuing certificates from this template to certain systems, such as Windows 10 devices, SQL servers, or Exchange servers, then Domain Computers is too broad. A lot of people don't trim that down, and they absolutely should. Any Write and Full Control permissions should definitely be looked at, and this is something you should be alerting on, auditing regularly, and giving good care and feeding in the environment. A lot of defaults are there, and they're really bloat; they're there to get you started. For example, Domain Admins and Enterprise Admins have rights on every single template by default. Generally those people are not your CA admins: the people who are my experts in PKI are probably not my experts in domain services. There might be some crossover, and there may be reasons to use them, but it's probably not everyone in that group. So it's very common to have a separate group for PKI admins and strip some of that out. Either way, Enterprise Admins likely doesn't need to be there. It'll be a little different per template, based on your architecture and your organization, but all of these should be reviewed. I guarantee that for every one of you, I would find at least one item of bloat if I looked through the templates in your environment.

With templates aside, let's think about the CAs. The CAs have their own set of permissions, just like templates do, and we see a lot of the same problems, but now with much more impact. Looking at the CAs themselves, here's a clip of the GUI: a normal security tab, a normal ACL, but with just a few options. Manage CA is basically the service administrator for the CA: they can change configuration, stop services, revoke certificates, back things up; they can really do anything they want. Issue and Manage Certificates is a right you would delegate if you had someone who needed to work with certificates, approving requests or revoking certificates, without making them the service administrator. Again, we see bloat here: a lot of the same defaults, Domain Admins and Enterprise Admins, which probably don't need to be there. In some cases other entries fall in based on the template you used or how you configured the CA. Over time this is going to change as well. PKIs generally live for decades in an environment, sometimes longer, so if something got built five or ten years ago, you've probably gone through different group naming structures and organizational changes. You probably have groups that aren't right anymore and people who aren't in the organization anymore, and you're just not catching it. So again, it's important that this gets audited and monitored.

Service accounts are a big one. Having a service account with privilege at the CA level is pretty dangerous. There are certainly reasons to do it, but think about where the CAs fit in that structure: tier zero, critical resources, just like a domain controller. If I have a service account that's an administrator on a CA, follow that attack path: wherever that service account goes, whatever applications or servers it's part of, there's now an attack path from there to my critical systems. That also means all of the administrators of that application are likely in that attack path. That's a very broad stroke. So putting service accounts for other applications at the CA level is something you have to be very careful about. I'm not saying don't do it; there are absolutely reasons to. Things like mobile device management tools will need exactly this. But when you do it, you have to be aware of what's happening, and perhaps even build your structure so there's a separate CA with only that function. Then if there is a breach, an attacker, or something goes wrong or doesn't behave well, you can cut it off without impacting the larger organization.

Enrollment agents are the big case for things like service accounts. An enrollment agent is something that can get a certificate on behalf of someone else, which is also very dangerous. If I could do that, I'd ask for a certificate in the name of your CFO, or a certificate for my computer in the name of one of your domain controllers. There are reasons to allow this, but when you allow this kind of behavior, it has to be tightly controlled and monitored. Most organizations don't have any of that, and we regularly see attackers abusing exactly this kind of escalation path. By and large, local groups and bloated administrator membership are a problem on any Windows server, but there's absolutely no place for it on a CA. You should absolutely be turning the knobs on that and bringing it down to the bare minimum.

An extension of these security issues: what happens when powerful certificates become available? There are a couple of categories I'll mention. One is allowing the CA to accept input from the user when they request a certificate. As I just mentioned, I'd request a certificate in the name of your CFO; if, as a user, I can put in anything I want, that's an issue. In that case we say the CA is allowing enrollee-supplied information in the request. There are certainly reasons to do that in some cases, but in most cases there aren't. So if and when you allow it, it needs to be monitored: those certificates should not be issued automatically, and administrators should be checking each request. For templates for very powerful systems like domain controllers, you should be very clear on what's being issued and when. There should never be mistakes where a domain controller certificate lands on some other server, or a code-signing certificate goes to a random user. Anything that powerful should be very carefully controlled.

And then SANs, which are an interesting topic. A SAN, a subject alternative name, is basically an alias. I issue a certificate to a subject, say a certificate literally issued to Rick Davis, but it has a SAN configured with my email address. Things like that can happen, and SANs can be configured for whatever you want. There are applications and services that require very specific SANs, so there's a real reason to use them, but again, if I can put in anything I want, that becomes a security risk. Misuse of SANs is really hard to catch, and there's not a lot of good monitoring for it, so it comes down to the hygiene of the CAs and your administrators.

At the end of the day, for all of these things, monitoring and alerting are incredibly important. I kind of beat the dead horse on that because just about everyone fails at it initially. Once someone realizes, hey, we need to do better with our PKI, and starts looking at it, only then do they realize: we don't really have any monitoring, any auditing, any alerting; we don't really know what's going on, even with dedicated staff.

So, closing some of this out. Get an assessment of your PKI. I'm not trying to sell you one; there are plenty of places that offer them, and there are even free ones you can get online, scripted or through other services. Just get an idea of where things stand. Whoever you have responsible for PKI, whether it's a dedicated team or a person, get them some formal training; there are a lot of good resources out there, both free and paid. Level-set: if you've got someone who's been doing it a long time, I guarantee there's new stuff they don't know about, and if they haven't been doing it long, there's a steep learning curve and stuff they should know. And finally, if you have a PKI, it's probably critical for your organization, and unfortunately you're probably not treating it that way. It should be part of your DR plan, it should be tested, and it should be just as important as domain controllers, SharePoint, and email, because if those things use certificates and your certificates fail, it's a big problem. That's all I had for you. Hopefully you can take some of this information back, take a look at what you have, and push for some improvement.
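As a starting point for the review the talk recommends, here is a sketch of a permission check over template and CA ACLs that you have already exported into simple (principal, right) pairs. It flags the patterns called out in the talk: Write or Full Control outside a dedicated PKI admin group, leftover Domain Admins/Enterprise Admins entries, broad groups holding Enroll or Autoenroll, enrollee-supplied subjects with no approval step, service accounts holding Manage CA, and principals that no longer exist. All group names, account names, and policy sets are hypothetical, and it does not read from Active Directory or call any real AD CS API.

```python
from dataclasses import dataclass

# Hypothetical policy inputs; replace with your own group and account names.
PKI_ADMINS = {"PKI Admins"}
DEFAULT_ADMIN_GROUPS = {"Domain Admins", "Enterprise Admins"}
BROAD_GROUPS = {"Domain Computers", "Domain Users", "Authenticated Users"}
ADMIN_RIGHTS = {"Write", "Full Control"}


@dataclass
class Template:
    name: str
    acl: list  # (principal, right) pairs, e.g. ("Domain Computers", "Enroll")
    enrollee_supplies_subject: bool = False
    requires_approval: bool = False


def review_template(t: Template) -> list:
    """Return human-readable findings for one certificate template."""
    findings = []
    for principal, right in t.acl:
        if principal in DEFAULT_ADMIN_GROUPS:
            findings.append(f"{t.name}: default group {principal} still holds {right}")
        elif right in ADMIN_RIGHTS and principal not in PKI_ADMINS:
            findings.append(f"{t.name}: {principal} holds {right}")
        if principal in BROAD_GROUPS and right in {"Enroll", "Autoenroll"}:
            findings.append(f"{t.name}: {right} granted to broad group {principal}")
    if t.enrollee_supplies_subject and not t.requires_approval:
        findings.append(f"{t.name}: enrollee-supplied subject/SAN with no approval step")
    return findings


def review_ca(ca_name: str, acl: list, service_accounts: set, known_principals: set) -> list:
    """Return findings for one CA's security ACL."""
    findings = []
    for principal, right in acl:
        if principal not in known_principals:
            findings.append(f"{ca_name}: {principal} ({right}) no longer exists in the directory")
            continue
        if principal in DEFAULT_ADMIN_GROUPS:
            findings.append(f"{ca_name}: default group {principal} holds {right}")
        if right == "Manage CA" and principal in service_accounts:
            findings.append(f"{ca_name}: service account {principal} holds Manage CA")
    return findings


web = Template(
    "WebServer-Internal",
    [("PKI Admins", "Full Control"), ("Domain Computers", "Enroll"),
     ("Helpdesk", "Write"), ("Enterprise Admins", "Full Control")],
    enrollee_supplies_subject=True,
)
ca_acl = [("PKI Admins", "Manage CA"), ("svc-mdm", "Manage CA"),
          ("Domain Admins", "Manage CA"), ("jsmith-old", "Issue and Manage Certificates")]
known = {"PKI Admins", "svc-mdm", "Domain Admins"}

# Prints four template findings followed by three CA findings.
for finding in review_template(web) + review_ca("ExampleCA", ca_acl, {"svc-mdm"}, known):
    print(finding)
```

Each finding is a prompt for a human decision, not an automatic fix: a service account with Manage CA may be legitimate (the talk's MDM example), but it should be a known, documented, and monitored exception.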

I'll be around for most of the day, so if you have questions, track me down. I'd love to talk PKI. Shoot me a message on Twitter, whatever works, and I hope you enjoy the rest of the conference. [Applause]
