Care and Feeding of HSMs: Key Management in Hard Mode

Uploaded: 2025-06-04
Duration: 31 min 31 s
Description: Care and Feeding of HSMs: Key Management in Hard Mode, Nick Pelis. Cryptography's dirty secret: your security is only as strong as your key management. Dive into the treacherous world of HSMs, which promise salvation but deliver operational nightmares and hidden costs. HSMs: not for the faint of heart!

BSidesSF · 2025 · 31:31 · 524 views · Published 2025-06

Speakers

Nick Pelis

Tags

Category: Technical

Topic: Cryptography

Style: Talk

Mentioned in this talk

Tools used

OpenSSL

Standard

PKCS#11


Transcript

So with that, we find ourselves here at about time, and we'll be presenting to you "Care and Feeding of HSMs: Key Management in Hard Mode" (and I will agree with you there) by a former colleague of mine, Nick Pelis. Okay, thank you all very much. Thank you, Jeremy. I'm actually utterly shocked that there are this many people in this room. We're going to talk about a lot of stuff today, but before we get started, I'll give you a quick intro on myself. My name is Nick. I work in the security department at Verata. I've always been interested in embedded systems and the security of

embedded systems. I don't know how it happened, but at some point in my career I got involved with HSMs. It was not intentional, but here I am today, and I hope to share some of these war stories and lessons learned with you during our time together. More often than not, we as security practitioners deal with a number of problems, and the solution to some of those problems is cryptography, right? You want to send all your cryptocurrency to some scammers in a foreign country? What's the solution? Cryptography. You want to buy some illicit products off of some illicit market? What's the solution? Cryptography. You want to send a message

over the internet and not have the NSA listen in? That's cryptography. So when we talk about cryptography, the thing to keep in mind is that the problem almost always has a key in play. Most of us are familiar with this type of problem. We have Alice on the left. Alice wants to send a message to Bob, but Alice is afraid that Eve in the middle is going to read that message or potentially modify it. So what's Alice going to do? She's going to take that message and encrypt it, perhaps with her favorite email encryption client like GPG, and all seems well,

right? It seems like we've solved the problem. But actually we haven't, because we've just traded the problem of eavesdropping for the problem of what to do with the key. So what do we do with the key? We could ask around on the internet. We could ask our favorite AI LLM assistant, and it gives some advice that sounds pretty good on the surface, right? Always use trusted cryptographic libraries. Yeah, most of us are told it's not a good idea to roll our own cryptographic library. Choose appropriate key lengths. Yeah, that also sounds like good advice. And then right down there: use hardware security modules. Hm. More on that in a

little bit. Ultimately, what we're getting to here is a discussion about key management. So you do some more research on the internet, and eventually you'll come across this document from NIST, SP 800-57, the Recommendation for Key Management. Like many NIST documents, this one is both very comprehensive and not specific, so you have to read between the lines to understand what it actually means for the problem you're trying to solve. But if you read it at a high level, there's some really good advice. SP 800-57 is going to tell you about the different types of

cryptographic keys that are available, right? You have keys for encryption, keys for signing things, and key-encryption keys. There's probably the most important lesson in this document, which is key usage: use one key for one purpose. Why do we do that? We want to minimize the blast radius. Say you have a single key that you use for encrypting the data on your workstation, and that's also the key you use to encrypt all your customer data in the cloud. It would be really bad if that key got compromised somehow, so you want to limit the blast radius there. The document also talks about terms like cryptoperiod, which is that you

want to use that key for the shortest amount of time possible. Makes sense. Security strength: if keys are longer, they can stick around for a longer period of time, and stronger keys are better. Archive and recovery: if we lose that key, maybe we want the ability to bring it back. And revocation: what do we do if that key gets compromised? It sounds complicated to think about all of this, especially if you're just reading this NIST technical document, which is difficult to read, but ultimately the concepts are quite simple to grasp if you think about cryptographic keys in terms of your house key, right? If you have a key for

your house whose lock has a single pin in the tumbler, that's not going to be as strong a key as one for a lock with seven or eight pins, the kind where the LockPickingLawyer takes more than 60 seconds to break in. That's probably a better key. If you have one key for every single lock in your house, and it opens all of them equally, that's not as good as having a single key for a single lock, because if that one key for all locks is compromised, anybody can get into any door in your house. So something to keep

in mind as you read through this. Now that we've talked a little bit about SP 800-57, let's walk through some examples and think about how we view keys and key management. You go to your favorite website, bsidessf.org, you click the little box in the web browser, and it tells you that this is an HTTPS connection and it's secure. What is it about this connection that makes it secure? That comes down to the TLS handshake. In TLS 1.3, the whole dance of events begins on the client side. The client will generate an

ephemeral key pair and send the public key over to the server. The server will also generate an ephemeral key pair and send its public key to the client, along with a certificate and some identifying metadata. Now, with each side holding the other side's public key, we can do a key exchange, typically Diffie-Hellman, and both sides can derive a session key from the shared secret. At the end of this, in step four, what we have is a session key that encrypts all the application data, the HTTP messages, and that is effectively what provides the cryptographic protection for the session. So based on that, what can we say about that session key?
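A minimal Go sketch of the four steps just described, assuming X25519 through the standard `crypto/ecdh` package (Go 1.20 or later). This is not TLS 1.3: certificates, handshake signatures, and the HKDF-based key schedule from RFC 8446 are all omitted, and as a simplifying assumption a single SHA-256 over a hypothetical label and the shared secret stands in for that key schedule. The names `deriveSessionKey` and `handshake` are illustrative only.

```go
package main

import (
	"bytes"
	"crypto/ecdh"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

// deriveSessionKey is a toy stand-in for the TLS 1.3 key schedule (an
// assumption, not the real HKDF derivation): it hashes a fixed label
// together with the ECDH shared secret.
func deriveSessionKey(shared []byte) [32]byte {
	return sha256.Sum256(append([]byte("toy session key"), shared...))
}

// handshake generates one ephemeral X25519 key pair per side, swaps the
// public halves, and returns the session key each side derives.
func handshake() (clientKey, serverKey [32]byte, err error) {
	curve := ecdh.X25519()

	// Step 1: the client generates an ephemeral key pair and sends its public key.
	clientPriv, err := curve.GenerateKey(rand.Reader)
	if err != nil {
		return
	}
	// Step 2: the server does the same and replies with its public key
	// (a real server also sends a certificate and a signature).
	serverPriv, err := curve.GenerateKey(rand.Reader)
	if err != nil {
		return
	}

	// Step 3: each side combines its own private key with the peer's public key.
	clientShared, err := clientPriv.ECDH(serverPriv.PublicKey())
	if err != nil {
		return
	}
	serverShared, err := serverPriv.ECDH(clientPriv.PublicKey())
	if err != nil {
		return
	}

	// Step 4: both sides arrive at the same session key; the ephemeral
	// private keys can simply be dropped when the session ends.
	return deriveSessionKey(clientShared), deriveSessionKey(serverShared), nil
}

func main() {
	c, s, err := handshake()
	if err != nil {
		panic(err)
	}
	fmt.Println("keys match:", bytes.Equal(c[:], s[:]))
}
```

Running it should print `keys match: true`. Because both key pairs are generated fresh on every call, each run yields a different session key, which is exactly why losing or discarding it costs nothing.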

Well, the cryptoperiod, the length of time for which the key is valid, is short, because those keys are thrown away at the end of the session. The security strength of that key is nominal: it's chosen by the specific cipher suite, but if you're using TLS 1.3, let's say it's strong enough. What about key revocation? Compromise of that session key is only going to leak the data for that session. What about archive and recovery? Again, this is just a session key for a single session. If you lose that key, you just revisit the website. It's not a big deal. And what about the

cost to replace that key? Your browser invented it on the fly. It's an ephemerally generated key, so you can just generate another one; there's no cost to replace it. And what about auditing? If somebody gets hold of that key, do we need to worry about it? It's not really applicable in this case. In other words, by setting the cryptoperiod to a very short amount of time, we've sidestepped a lot of these thorny key management issues, right? But what happens if you can't always do that? What do we do with the session key at the end of the session? We throw it away. But what if you can't do that?

because you need that key around for a very long time. So let's talk about another example: IoT devices. Most of us have these in our houses, our buildings, our offices. Most of you in this room today have one in your pocket, and there are some in this theater. These IoT devices last for a very long time. They could be installed in buildings, in the ceilings, in the air ducts. They could be monitoring things in your house. Wherever they're located, they're going to be there for a long time. Most IoT devices implement a feature called secure boot, and secure boot is just a

firmware validation feature to ensure that the device is booting legitimate, authorized firmware instead of something like botnet code, as we saw with the Mirai botnet in 2016 and 2017. The way this works is pretty straightforward. There are a lot of boxes here, but the thing you need to understand is that at the beginning of this system there is a public/private key pair used to protect the very first piece of code that loads. There's the boot ROM private key, the box in red, and the corresponding public key, which is typically stored in write-once memory on the chip. It might be a

hash of the public key, but either way it's in write-once memory, programmed when the device is manufactured. When the device comes out of power-on reset, that public key is used to validate the first piece of code that loads. That's the shim. The shim contains the public key for the next piece of code that loads, which is ATF, or Arm Trusted Firmware. This process continues down the chain until we have a chain of trust and the system is up and running. Now we know that our IoT device is running the legitimate firmware we meant to put on there. But we still have a key that we need

to manage here. If we look at this red box, it's a private key for the firmware on this device. This device is going to be out in the field for who knows how long: years, maybe decades. So we can't just throw it away like we could with the session key; we have to do something with it. So what can we say about this ROM private key? First of all, we know the cryptoperiod is long, right? It's going to be out there forever. The security strength is determined by whatever cryptographic algorithm the chipset uses, but let's just say the security strength is 128 bits. So we'll say

it's fine for now. What about key revocation? Remember, we had to write that public key into write-once memory. On many IoT devices it is possible to revoke that key if it gets compromised, but it's not an easy process. Write-once memory is usually very limited in size, from hundreds to a few thousand bits, so you don't get many opportunities to write a new key, right? In other words, we don't want to revoke that key if we can at all avoid it. What about archiving and recovery? That seems very important in this context. If we lose that private key, that's going to block future software upgrades from

getting out to that device. If you're an IoT device manufacturer, you lose the ability to update your devices in the field. That seems really bad. What about the cost to replace that key? If we lose it, we might have to recall every single device we've shipped, and that could be cost-prohibitive. That could bankrupt the company. And then what about auditing? Think about it: say you're the operator of a fleet of 10 million devices all around the world, in businesses and homes and wherever. It's very important that those devices are running the firmware you want them to run, right? If they get

infected with malware, that would be really bad. So we want a lot of auditing. We want to protect this key for a very long time, and we need to be able to back it up and recover it. Okay, so we need something for this ROM private key. How should we protect it? Let's think about our options. If we just leave it on a laptop, that's great, because anytime somebody needs it, they can just compile it and use it. But then the laptop could die or get stolen. We could check it into Git, and then you're not tied to a single laptop. All your developers

can access that key whenever they need it. That's nice and convenient, but once it goes into Git, you're never, ever deleting it. Or that GitHub repo could get leaked to the internet, which actually happened with UEFI Secure Boot keys very recently. Okay, so what are some other options? We could be super paranoid and write that private key down on a piece of paper. That's nice because it's offline; a number of crypto exchanges actually do this. But pieces of paper can be copied, they tend to yellow and degrade over time, and there's

really no auditing around them. So that doesn't sound like a good solution. What if we put it on a USB stick? Somebody could steal the USB stick. What are the other options? We could use a cloud service provider's key management service. That sounds really nice. Somebody else gets to deal with the availability problem, the backups, the restores, and we get whatever identity and access management the cloud provider has for free. But basically you're renting someone else's equipment, so over time that cost is going to increase. What happens if five years from now you decide to change cloud service providers? Maybe there's some

functionality you need that isn't there, or that gets taken away or moved over time. So what's the grand solution here? If we listen to what Claude told us earlier: use an HSM, a hardware security module. With that, you own your whole key management story. That's great, but you're signing up for a lot when you decide to operate an HSM, which we'll talk about now. Hardware security modules are purpose-built physical computers designed to do one thing: create, store, manage, and modify cryptographic keys. You can buy them in many different form factors.

Yubico has one the size of your thumb. They come in little handheld units that look like small tablet computers, and in PCI cards and rack-mount systems. The cost is as much money as you want to spend: the cheap ones are around 600 bucks, and the more expensive ones can be over 150 grand. As for how you interact with them, you could use a username and password, or you could use a smart card. There are many different options out there, and you really have to sort through them to figure out what's going to work best for you.

Now, let's say you're crazy and you decide to operate an on-premises HSM. Why would you do this? Like we talked about, maybe you have some high-value keys, and if those are lost or leaked, that's going to completely disrupt your company or your operations. Or maybe you don't need access to those keys frequently. That's nice, because then you can keep those keys offline. You basically sidestep a whole portion of your threat model when the keys are not connected to the internet. That actually makes a lot of things a whole lot easier, at the expense of an operational nightmare. Maybe you want to keep them under your physical control: I know my keys

are here because they're in that box right over there in the corner. And maybe you're willing to do this because you have the patience to deal with PKCS#11. PKCS#11 is a standard published by a group called OASIS, and it is an interoperability standard that tells you how to work with different HSMs, so everything is interoperable and everything works magically. It's a lie. It is a lie. You cannot port your keys between different HSM vendors. You cannot port your keys from an HSM vendor to a cloud like AWS or Azure. Once you choose a path, you are locked in for life. So you need to keep

that in mind if you're going to go down this route. The other important thing to know about PKCS#11 is that it defines a library interface in the C programming language, and most people are not writing production C code these days. That's fine: you search GitHub, and maybe you find some PKCS#11 bindings written in Go, or an engine that works with OpenSSL. But every time you go through one of these layers of software, you lose information. It's like exporting a JPEG to a JPEG to a JPEG to a JPEG: eventually you're losing information, and you lose some of the core

functionality that brought you to the HSM you chose in the first place, right? Maybe you really want the ability to do smart card authentication, but the library you've chosen doesn't support it, because you're going through seven layers of indirection. So, just a bit of warning about PKCS#11. The other thing you're going to have to deal with if you operate an HSM is a key management ceremony. There's a lot of prior art, which we'll mention briefly, but mostly a key management ceremony is just a procedure in which you create, distribute, and authorize cryptographic keys. It's usually attended by multiple people at the same time, and the reason for that is you

want to reduce collusion. You want to reduce insider threats. Fundamentally, at the end of this you have a chain of custody: these people were here, they created this cryptographic key, and because they all witnessed it, that makes it valid and true. If you're interested in creating your own key ceremony, there's a lot you can read about it online. Probably the most famous public example is the DNSSEC root zone key signing key ceremony. They do this four times a year to update the DNSSEC root zone. It's livestreamed, and if you ask them nicely, you

can go attend it in person. There's also a Radiolab episode about the Zcash key ceremony. I highly recommend listening to it if you've got half an hour. There are also some other standards and blog posts out there. But fundamentally, you have to create your own key ceremony if you're going to operate an HSM; it just comes part and parcel with operating one of these things. Early on, when I was first learning about HSMs, an HSM was just something you had on your desk, right? I work in hardware, so people have development and prototyping boards. And

so I was like, "Oh yeah, you're going to do some setup work with the HSM? Go ahead, why don't you borrow this unit for the weekend?" They were really close to getting it working. They stayed late, trying to get everything done. And then they left for two weeks, and this unit is $15,000. So the next week we're running around the office in a panic: where's the HSM? A lot of questions were raised. Why did they take it? Where did they take it? Did they go on vacation? What's going on? So fundamentally we were like, okay, we don't trust anything

in that HSM anymore. Let's just zeroize it and start over from scratch. So the takeaway there is: control the equipment. After that, we put the HSM and all the smart cards in an access-controlled place. We put them in a safe, and the safe is in another room, so there's a separation of responsibilities between who can access the safe and who can use the keys. The other thing to keep in mind when you're operating an HSM is how roles are segmented. When you authenticate to an HSM, you're not authenticating as yourself, a user; you're authenticating as a role. And in the HSM world, according

to PKCS#11 and the vendor role schemes built on top of it, there are different users that can perform different types of operations. Sort of the granddaddy of them all is the security officer. They can administer the HSM, they can do backups, and they can instantiate and modify other users. Then you've got the crypto officer: they can create keys, use them, and destroy them. And the lowest user on the totem pole is the crypto user. How you choose to set these up for your specific circumstances is 100% up to you, but the thing to remember is that ultimately it's people operating

these things, and people make mistakes. At one point I had to reset the crypto officer role because the crypto officer forgot their PIN. To do that, you have to authenticate as security officer, and we had that set up as a two-of-five quorum; we'll talk about that shortly. But one of the two of us entered the wrong PIN three times in a row. When that happens, there's nobody higher than the security officer, so the HSM just freaks out and wipes out all the keys. And then we had to go through all the

operational hassle of an emergency key rotation. In this case, that meant pushing new firmware out to the factory, which used up one of our few key slots on the IoT devices these keys were for. That wasn't very good. The takeaway from that experience is that people make mistakes all the time. So now when somebody participates in a key ceremony, we say, "Okay, I'm glad you know your PIN. I'm glad you have it memorized. I don't care. Please write it down somewhere. Don't tell me where you wrote it down. Please keep another copy somewhere else." The other thing we do is we

operate multiple HSMs in multiple places. On site, I always have a primary and a hot spare, which may hold another copy of those keys. I may also have backups off site, just in case the building burns down, or there's an earthquake and California falls off into the ocean, that sort of thing. The other HSM horror story I love to share is the story about the battery. HSMs are interesting devices in that they keep all of your cryptographic key material in battery-backed SRAM. The idea is that if you lose power, the battery is the only thing keeping those keys alive. And that means

periodically you have to replace the battery, because batteries die over time. The instructions for doing this are insane. The manufacturer gives you this little temporary battery holder, and you're supposed to take the plug, plug it into P8 there, and hold it in the case while with your other hand you unscrew the main battery. You have to pop it out, put in the replacement, and screw it back in. Never mind the fact that you chose to put your keys in a $45,000 HSM because they're high value; this is like defusing a bomb. If you

accidentally yank the cable out, you've lost all your keys. If it falls out of your hand and grounds against the case, you've lost all your keys. If you install the temporary battery backwards and mix up the polarity, it's all gone. It is literally insane that they want you to do this. We had a related experience one time. We had an HSM model that can be set up so that if it thinks it has lost power, or thinks it has been touched or modified in any way, it goes into a tamper state, and then you have to

recover it using some other form of authentication. On this particular model, you can set it up so that you need a smart card to recover the HSM. The implication of that wasn't apparent. When I set this up the first time, I thought, "Okay, great, I have a recovery smart card." Three months later, I had forgotten all about it and overwrote the card because I needed it for something else. Eventually that battery died, and the HSM and its keys were completely unusable. You cannot reset it; you have to send it back to the manufacturer to get it repaired. It's

completely insane. So the takeaway here is: when you see something in the manual that looks really enticing, like "you can have a smart card recover the HSM," and you think that's a great idea, you need to read it multiple times and think really carefully about the implications. The other thing I do now as a result of this experience is always keep multiple operational HSMs handy, so if one of them dies, I can just flip over to another one. That has saved me more times than I can count. The final horror story I want to share

with you: I mentioned earlier that HSMs authenticate by role, not by user. What that means is that, using Shamir's secret sharing, you can split a role secret, like the crypto officer secret, among any number of key shareholders. That's nice, because you can say, "I'm worried about somebody modifying the contents of my HSM, so I always want two of five (or two of whatever) crypto officers or security officers present." Then if somebody wants to modify the HSM, one person can't do it alone, so there's less insider risk. But the problem with assigning these smart cards to people is that

eventually they quit. They get sick, they leave groups, they're not interested in doing it anymore, and eventually you have to recreate the role, which means writing a new secret. We had an experience like this where a new person joined the team and we wanted to rekey the security officer role, going from two of four to two of five. As we're rekeying this role, I have everybody in the room for a ceremony, and we're going around round-robin. One person goes in and reprograms their smart card, then we go to the next person, the next person, the next

person, and when we got to the third person in the set, the HSM just threw an error and locked up. That was really bad, because at that point we were reusing every single person's smart card. That left things in a state where three of the four original smart cards had been reprogrammed with the new secret, while the HSM was still set up for a two-of-four quorum on the old one. With only one old share left, I did not have enough key shares to authenticate as security officer. So the takeaway is: always rehearse critical operations on a test HSM. Do not rewrite smart cards that could still be

used in production. And always create copies of everything and rotate between the copies, because eventually they're going to fail, or people are going to fail. So what do we take away from this? In conclusion, HSMs are a very serious commitment. If you're going to take this route, only choose it because you have the budget, the staffing, and the patience to see it through. Definitely have backups of everything, and backups of those backups, and build a process that accommodates mishaps, because they can and will occur. Thank you.

All right, thank you, Nick, for that hilarious set of anecdotes. We'll definitely stay here for questions. As a reminder of how we do questions (and I see one has already landed): go to sli.do, or slido.com, which gets you to the same place. Put in the conference code BSidesSF2025 and Theater 15. We are in the Theater 15 category, at least. So with that, this one looks really targeted. This might be a colleague asking a funny question, but let's go with it: How do you design the provisioning process for Verata Dome series cameras to ensure long-lived ROM private keys remain resilient and easy to audit through

their life cycle? That's a great question. That's a great question. Hilarious. We'll definitely let some more slide in here. Is that just a planted question, or is there a point in there? That's a great question, and, yeah, that's private information. I can't share that. All right, so that was definitely a colleague egging you on. No worries. If you're particularly shy and you're in the 50, I'm not going to do this a whole lot, but you could raise your hand if it's a short question and I'll ask it directly; we do want to prioritize people on the streams and possibly in the overflow, though. Let me

look one more time. Okay, we'll go for this one.

So you don't really need to

Can you repeat the question? Yeah, so the question was about how a lot of PKCS#11 is effectively legacy technology: these days we're used to authenticating as ourselves rather than having people write PINs down on slips of paper, so how do I defend that? I don't work for an HSM manufacturer. I think these things are very difficult to use, and I think there's an area here that's ripe for innovation. I also think the world market for HSMs is not big enough to justify that innovation. And so for those

specific use cases where you actually need something like this in place, where the business reasons, the risk, and everything else justify an HSM, you're just left with technology that's old and difficult to use. We are unfortunately at time and do have to cut off. Nick will definitely be available for questions out on the floor. Thank you, everyone, for attending this great talk, and thank you, Nick.
