← All talks

bscrypt — A Cache-Hard Password Hash

BSides Las Vegas · 202228:39196 viewsPublished 2022-09Watch on YouTube ↗
Speakers
Tags
CategoryTechnical
DifficultyAdvanced
StyleTalk
About this talk
Steve Thomas presents bscrypt, a new cache-hard password hashing algorithm and key derivation function designed to resist GPU-accelerated attacks. The talk covers the evolution of password hashing techniques, the distinction between memory-hard and cache-hard approaches, and bscrypt's design decisions including use of overlapping S-boxes to target specific CPU cache sizes. Benchmarks compare bscrypt against PBKDF2, bcrypt, and pufferfish2.
Show original YouTube description
PW - bscrypt - A Cache Hard Password Hash - Steve Thomas PasswordsCon @ 17:00 - 17:55 BSidesLV 2022 - Lucky 13 - 08/09/2022
Show transcript [en]

i'm really happy to have uh steve scooped thomas back on stage uh first time was in paso scott in oslo in december 2012 so almost 10 years ago uh a pretty uh fun experience and a good talk back then uh he's also been with pascal several times before and uh about steve i can say for you know for certain that i understand maybe two percent of the stuff he's talking about because i basically suck at math and crypto but i can also say that if you have anything that is using crypto that you intend to put online if you haven't hired steve to look into it first to look for vulnerabilities you're doing it wrong because he is

incredibly good at finding vulnerabilities and now it's steve and via script a cash hard password hash so go ahead okay uh just for clarification that's crypto as in cryptography

okay so uh bscrypt uh it's a new password hashing algorithm and kdf so uh password hashing and password kdfs follow on a fall under this umbrella of key stretching so key stretching can be used for other things like so if you have a small key you know you have something with like 30 bits of entropy and you want 40 bits of strength you can stretch for 10 extra bits you can use that for like um uh fingerprints so you can have a shorter fingerprint but it's harder to make a collision or whatever but we're only going to be talking about uh password uh key stretching in relationship to passwords so uh key stretching um so

uh the first thing that people came up with was uh computationally hard so what this is is you just hash your password then you hash that and over and over and over and memory hard you just use the resource of memory and you hash stuff and you know uh whatnot here are some algorithms so pvk df2 uh you may have heard of maybe not and you know uh there's argon 2 s crypt uh balloon hashing which doesn't actually have like an official uh spec or anything like that uh yes crypt um so computationally hard uh there's two sub sections one that i explained before where you hash the password then you hash that then you hash it again it's a

very sequential operation i realized that i should probably uh so it's like kind of a new algorithm people are kind of afraid of it but pbk df2 has this foot gun where um so if you ask for so pbktdf2 sha256 uh that's 32 bytes of so sha-256 is 32 bytes uh if you ask for 32 bytes from pbk df2 uh you run through all the iterations but if you ask for any more it will do all the integrations again but with like a different counter at the top so all those are done in parallel if you ask for a large amount of output so i just this is the entire algorithm uh you just ask for a large amount and then you xor

each of the blocks together and then you do that uh this should you know uh nist should um like fast track this into pbkdf3 or you know uh something so that we can get something that's better than pbkdf2 df2 at the end you'll see a benchmark between this and other algorithms and yeah [Music] so this brings us to cash art this is relatively new basically uh the difference between memory hard and cash hard memory hard uh reads like sequential blocks of memory so you'll read like a kilobyte from memory randomly and then do something with it and then you know what not uh cash hard you want to do small fast uh random uh lookups so you're looking up

uh eight bytes versus a kilobyte and doing a quick operation so decrypt this is minimally catch art it was kind of accidentally cache hard because it used this encryption algorithm called blowfish which had uh 4k of s boxes and we didn't know that you know cash hard was a good thing until gpus came out and uh you know people were noticing that you know the speed to generate one of these hashes from a defender versus how slow they are for a gpu attacker you know it was way better than other things so uh bs grip that's what we're basically going to be talking about uh then uh puffer fish two uh pufferfish was uh submitted to the password hashing

competition and then uh i forgot exactly when it changed i think was in the tweaking round uh or something like that but we'll also be comparing uh that uh yes script and argon 2 ds those are memory hard algorithms that have this extra thing where they do some cachard operations um you probably haven't heard of argon 2ds if you've heard of argon 2 because this was actually uh created after the password fashion competition ended uh i i even keep forgetting about it uh so we're basically just gonna be talking about bs grip and puffer fish too so uh how to do a password hashing uh how to do key stretching well so what you want to do is hash all the

inputs then do some sort of work with that uh so you generate the seed then you do some amount of work you generate this value called work and then you just put it in a kdf and stretch it for whatever length uh you know if it's a password hash you probably want to have that be a fixed width of like 32 bytes and if it's kdf you know go whatever [Music] so um uh bcrypt they didn't do that first step uh where you hatch all the inputs so what ended up happening is um there's bugs with null characters where uh you know it so a lot of the implementations are in c they expect c strings but if you give it

you know utf-16 then you have no password part um then uh there were bugs where um the uh with uh unicode characters and like utf-8 where the sign got extended and cleared out parts of the password um but basically this uh right here is blowfish encryption uh so the block right there that's done like 16 times s uh zero through three those are s boxes and so pufferfish made these changes to uh uh blowfish uh also blowfish is 32-bit uh you know all the variables are 32-bit and these are 64-bit so that you can have a larger larger masks so you can have larger s boxes also the uh mask allowed for scaling the s boxes to

um you know nice binary numbers of sizes so like 64k 128k stuff like that um so in uh april 2019 before bs script was called bs corrupt i figure it was called but um i was like oh i don't really like how pufferfish was not using some of those bits to do lookups on so i was like i'll do eight bit lookups and then just have uh so this right here is uh 64k of s boxes uh 16. did i say 64. whatever uh 16k of xboxes um and when you wanted to do 32 or uh some weird thing like 48 or you know something multiple of 16 you just do eight more s boxes and then you just do

a for loop around this to do multiple iterations um so what i didn't know was that cpus were going to have weird sizes for cash like uh you know 48k or 1.25 meg uh so um i tweeted out this uh benchmark in uh sonya team uh and so two different cpus one has only two layers of cache one has three uh you can probably guess but um so the what i noticed was uh the pink line that one little duh it it came uh that was at quote six uh 256k uh which is the cash size the l2 cache size for uh one core on that and i noticed a jump and i was like oh

that's weird um so i realized that it was the p boxes because i think i actually was using multiple p boxes uh permutation uh s box is a substitution box it's like a cryptographic term whatever uh anyway so i removed those because we're not using this for encryption and that's like where the key bits were and stuff so um because the only part about this being cash hard are these s boxes so uh removing those um it then uh you know got rid of that little bump before uh when you are at that limit um later uh i was like you know what uh i should switch to accumulators because the only thing that matters is that you're alternating xor

and add um so you just do lookups and then xor or add into the accumulators like this uh but there are some issues with them like if it's zero and then you if your accumulator is zero which is one into uh two to the 64. so very rare so it doesn't really matter but if you add a value and then you xor the same value it'll be zero um basically i was thinking of having just like one s box that i you know scaled uh because you could actually do that with an accumulator but you could it's possible to read the same variable twice so if you did that with you know a non-zero value then the least significant bit

stays the same which is also not really a big deal so i was looking at different sizes for masks you know how many look-ups i could do and like what the limits were and whatnot and so

uh then uh i had like a aha moment so i have two s boxes that are overlapping and you can shift them over to get those weird cache sizes so now i can hit like any size you really wanted with within a eight byte boundary granted the way that i did things uh it makes it nice for like 128 byte boundaries and then i really like how easy it is to read the settings and know what they mean so i just made it uh kilobyte boundaries so the m parameter if you see 256 that's 256k and whatnot so it makes it uh nice and easy to understand so um the largest size that i could do with

these overlapping uh s boxes where a special case where they're you know totally separate for the largest size would be these values and um you know so obviously 4k that's kind of bad um and 64-bit masks you know that that's kind of excessive so i i got rid of those and then i was looking and i'm like well a megabyte that's kind of low so and then um depending on whether you wanted to push out into l3 cache some gp some cpus have more than 32 megs now so i guess this one wins [Music] so this is the main operation uh basically uh all right so green and blue you can basically consider those like uh l one

uh well l zero through l three and then r zero through r three uh anyway so the reason why i do four of these in parallel is uh password crackers for decrypt what they did was they interlayed two um password uh you know they interleaved two um uh decrypt hashes with two different passwords and you got like a slight speed bump because it was hiding latency a little bit better so uh i wanted to do that but now these operations are just parallel so i do a block of that then i rotate them and then i rotate them again and again so it's basically four blocks of this and uh it mixes all the uh

uh you know all the different ones that are done in parallel and this is basically a free well it is a free operation uh so and then i just read sequentially and then i also do uh you know a rotate again when you write and um again you know rotating them at the end it doesn't really matter too much but it's free so uh so following a uh an accumulator variable a uh you know you get the add xor and then the s boxes oh i never noticed that or whatever oops uh add x or add x or in the s boxes and then xor and then add so uh this was uh a version i had before

then i realized that those two values could be the same so i'd be accumulating with add twice um so i then changed it so that you have a block that starts with adding and then every other and then a block that starts with xor and then every other so when you look at what's being accumulated it's always uh add x or add x or add x or and then what you're writing add xor uh so on um then uh so those are two different blocks uh then i rotate all the values this is what made me uh so remember when i said that uh i didn't like how some of those bits weren't being used for lookups well when you

rotate the variables uh those bits that aren't being used will then later be used in you know other calculation uh other lookups and all i did was uh do a through d after the first block and then e through h after the next so uh initial fill and finish um you know i'm pretty sure you're all bored with code so you know you can go look it up if you want but basically uh i used uh blake 2b for the hashing of inputs and the kdf at the end to expand it and you might be wondering why i use blake 2b instead of like like three um well blake 2b is 64-bit and uh blake

3 is 32-bit so i wanted to basically have a native write function on like a native read so um it doesn't have to do any uh bite endian swaps or anything like that so you it basically just deals with only 64-bit integers and then the finish is really simple i just run over the entire uh data by adding an xoring like 16 in parallel and then you know combine do some you know stuff so uh where can you find this code well it's

is let's hope this works

it is on github as soon as this finishes

yeah yeah there we go so this could be you i bet no one knows that i was there when bs crypt was released

all right so benchmarks um so uh pbkdf2 shot 256 uh sha 512 decrypt and bs crypt so uh these the settings for all three of these were uh set to what a gpu attacker could do uh for uh bcrypt uh cost nine because that was the one that you couldn't have as much leeway with the settings so it's basically 53 kilohashes per second per gpu and so as a defender you can spend almost 800 milliseconds doing pbk df2 uh sha 256 or you could spend uh four milliseconds running the script so you know uh so this is on uh actually this laptop i realized that i should have i thought of the slide last night

and i was like oh i need to get a benchmark so i can do this but yeah pufferfish2 isn't on this uh multiple reasons i didn't have the code and two um the uh settings are also not as forgivable like uh jeremy wanted to have it such that uh you know he had a t constant m cost which were uh two to the whatever that value is like decrypted so it'd be hard to get an exact comparison but uh so okay uh so this is again this laptop um so it only has uh three megs of l3 cache and uh if you look at the green line um bs crypt p equal to so this is with two

threads so it's running on two cores um with each with uh two megs of cash each so it's actually now pushing into memory which uh as you can see gets a lot slower basically uh oh right so uh pufferfish too fast basically what puffered fish does after every round of encrypting you know encrypting all the uh s boxes uh it runs it hashes all those so it has to run through them with uh hmac sha 512 it's a hashmall which is a slow operation and it's not exactly helping its cache hardness so it then gets a little bit faster for uh the defender so uh different cpu um this one uh has six megs of l3 cache and

you know uh for bs grip uh p4 uh the last two are now hitting memory um so uh the nice thing about it is uh with multiple cores uh you know so these are all uh i should mention that each one of these points is uh quote the same hardness against a gpu attacker technically uh you might be able to be faster on the uh slower side uh the smaller side so like 16 or 32 64k those might be not enough cash hard to you know push gpus into doing memory lockups i forgot to mention that earlier so um so minimum settings for password hashing you should aim for being a gpu being at most

10 000 hashes per second per gpu and that's not like just any gpu the best gpu for password cracking at the time uh which currently is uh rtx 3080 uh 12 gigabyte version there's two uh there's one with 10 which is slightly older whatever um but so when you put in settings for vscript with p1 p2 and p4 it takes roughly one millisecond to get to the minimum strength that you should have so um basically this means that we can push well beyond and make gpu cracking really really slow so if you're familiar with decrypt cost 15 is ridiculously high it takes a very long time to generate one of those hashes but basically the equivalent settings for

bscrypt to hit less than 85 hashes per second per gpu uh i ran this benchmark uh as you can see uh i guess it doesn't really help too much going with a fourth uh thread even though it's a quad core but um you know uh you can get under 100 milliseconds on uh this cpu is like seven years old so it's not even that new so uh it's definitely good because usually your maximum time to do a password hatching algorithm is 100 milliseconds you don't really want to go much higher than that but with this you know uh well technically if you're doing this on a server you wouldn't want to use uh parallelism because the uh parallelism

is actually going to uh lower your throughput so if you measure with all your cores doing it and then you know you're like okay it takes 100 milliseconds well that means your server can only authenticate 10 users at a time per second anyway uh questions that was i definitely skipped something i don't know what happened because i was running overtime before when i was yeah well that was that was ahead of schedule i don't know what happened yeah questions for steve or is everybody totally clear with this any thoughts on performance against an fpga or asic accelerated factor yes so the question was uh performance versus uh fpga or asic attacker so with fpgas um

there's this one uh i forget what it's called but it basically uh it's four fpgas on a board you can you can only get them used uh they're like uh 200 dollars used but um uh so bcrypt on it is really is really nice and fast because of the onboard cache on it well uh that cache is limiting how many uh units of bcryp can be put on it can be running at the same time on the fpga well when you that's only 4k if you were doing 256k you would then be 64 the speed 1 64th the speed so um that would drop it well below gpus and then i assume a6 are similar

but also no one's using a6 question for me steve um let's be paranoid do we have reasons to think to believe that specialized government agencies in unnamed countries could have specialized cpus that are not generally known or available that can be in existence today that have either incredibly large cash sizes or also can deal with the memory hardware the cash heart problem yes but you know you're not worried i'm not worried about them no one really should be you should be worried about some kid with a gpu like and and and the team have hashcat i guess yes yeah you know some kid that gets a gpu for christmas and then you know oh i can crack passwords that's

that's really any more questions for steve nope okay well then thank you steve and we are finished a little bit early but the last talk of the day why kids couldn't care less about your password advice is at six o'clock i would really recommend that one as well see you back then