PBKDF2: Performance Matters

Name: PBKDF2: Performance Matters
Uploaded: 2016-12-14
Duration: 24 min 30 s
Description: Joseph Birr-Pixton examines PBKDF2 implementations across popular cryptographic libraries, exposing significant performance inefficiencies that stem from unclear standards documentation. He demonstrates that naive implementations waste 2–10× computational work compared to optimized versions, effecti

BSides Las Vegas · 201524:3068 viewsPublished 2016-12Watch on YouTube ↗

Speakers

Joseph Birr-Pixton

Tags

CategoryResearch Technical

TopicCryptography

StyleTalk

Mentioned in this talk

Tools used

OpenSSL

Platforms

FreeBSD

Frameworks

BouncyCastle Django SJCL

Languages

PHP Python

Concepts

PBKDF2 scrypt

About this talk

Joseph Birr-Pixton examines PBKDF2 implementations across popular cryptographic libraries, exposing significant performance inefficiencies that stem from unclear standards documentation. He demonstrates that naive implementations waste 2–10× computational work compared to optimized versions, effectively handing attackers an equivalent advantage, and releases a high-performance reference implementation to address these design flaws.

Show original YouTube description

PW - PBKDF2: Performance Matters - Joseph Birr-Pixton Passwords BSidesLV 2015 - Tuscany Hotel - August 05, 2015

Show transcript [en]

to you take it away so I'm not actually here to talk about pbkdf2 I'm here to talk to complain about it I want to I I just it comes from that era of RSA Publications um late 9s where most of them are kind of crappy um like pc1 encryption which was really crappy uh rc4 maybe a little bit before md5 a lot before they were all kind of knackers and

can everyone hear me at the back no no let me just is that better okay good right um did you get any of that so I'm here to complain um pbkdf2 I think has like um longed outlived its welcome um it comes from a historic background of broken crypto um and I think the various standards that were written to describe it there are two um don't really tell implementers what what what is important and how to do what is important correctly um so with that um I'm going to give you a quick introduction to pbkdf2 and this may be kind of way below what a lot of these PE people in this room um already

understand um so I have cats I have cats I have picture of my cat cat yeah um then I'm going to describe why the standard is bad and then I'm going to as a direct result of that I'm not blaming implementors because they've done basically exactly exactly what what is in the standard that a lot of implementations are bad and then I'm going to uh release some software um so the [ __ ] cat's not there my cat is missing right so imagine there's a cat up there she's a beautiful cat um so the it's a secret oh I'm very disappointed with my cat I was hoping to sh off my cat she's very pretty um so Merle Dam damgard hash

functions this is construction for making hash functions out of this a a relatively simple unit called a compression function and you feed in your message in um blocks with a particular size and then at the end you get um the result this is a kind of um simplification because there's some padding on the right you have to do to make sure that um message is going in a prefix fre but whatever um next is hmac which is where you take one of these hash functions and you make signatures out of it um um and it looks like that this is probably the most um complicated slide so uh yeah if you can understand this then that's

great so going from the top you start with your key you EXO it with a block size constant and you hash that and then you hash your message separately you hash your key you EXO your key with another different uh block size con and you hash that and then you hash the thing you originally got and that's a signature very good um and then we come on to pbkdf2 um and this is this is one of the problems is that they try to be clever and they try to parameterize the whole standard with this prf which the idea being that you can swap in and out different prfs um in practice everyone uses hmac um I think Apple want to use

CMAC for something but that was crazy of them yeah a CAC I think for some weird reason they [Music] have right um so uh yeah so the main thing is you put assault in the side and a password there and you get a key out and you can tune it to make it take a different amount of time and yeah it came from RSA kind of a long time ago so this is the main slide when you're using slow hashing you come up with some kind of budget for how long you want to spend doing this say 50 milliseconds and then you work out how to tune this the the function you're using to take exactly that time

and the problem is if your pbkdf2 that you're using is say 4 to 10 times slower than it could be then you're either giving a four to 10 times advantage to your attacker who has a like a boatload of gpus or you're just wasting your user's time and power and general energy and it's just a disaster um so yeah the the main thing that Defenders want to do is Maxim attack a work for the work that they have to do and I think they're kind of failing at this um uh the rest of my slides kind of assume that all the salts and passwords are less than the block size this is like 64 bytes so it's generally true um

and we're only going to take like the first hash outputs length of pbkdf2 because actually this thing this capability of pbkdf2 is kind of crap and is broken I'll come on to that later in the last slide um so pbkdf2 looks like this you put you get hmat you put the password in the top repeatedly and you um sign the Sal and then you sign the output and the output and the output and you get all the signatures and you X all them together and that gives you your output so it's fairly easy to see that this um has very strong data dependency um so you can't C ccate u3 without calculating U2 and you can't do

that blah blah blah blah blah so you have to do them all in sequence so it's not easily paralyzable by your attacker which is nice um well yeah so yeah there's some algebra at the bottom I tried to remove most of the algebra this all these slides were 100% algebra um until someone who I gave it to a couple of weeks ago told me to draw diagrams instead um and this is the generalization um of of that previous uh diagram uh to a round which is neither at the end nor or nor at the start um so the question is how many compression function applications per iteration well you can count them 1 2 3

4 and this is what most people do they go there's one 2 3 four there's four box boxes for excellent um but the standard doesn't tell you that this is kind of off by a factor of two um there is one paper for from a long time ago which points this out in passing there doesn't seem to be a lot of documentation otherwise um that tells people that they should be uh paying attention to this um the problem is that these two um compression function applications don't have any data dependencies to the others um and therefore what you do is you do it you do those two first as a pre-computation and then per iteration

you only need to do these two um so uh yeah this is not so to be very clear this is not a new result this is known about by lots of people um it's especially known about by people with Bo loads of gpus but I think it's slightly less known about by people who write Defender pbkdf2 implementations which is the people I'm talking about mainly here um so yeah you end up with two plus two I for ey iterations uh this kind of begs the question of how um Defender implementations are fairing um so I looked at some um so uh interesting things on this slide escript and the escript so escript and the escript most

people know has a um pbkdf2 uh pre-processing and postprocessing processing step but it only runs it with iterations one so it's kind of interesting but it's not it doesn't change script in any way um however if you want a pbkdf2 you probably shouldn't go and take theirs um and apple core crypto as well which was mentioned yesterday I think um all the rest were from code review this one's from dis assembly because Apple hates freedom I don't have um so the winner is sjcl Standford JavaScript crypto Library um and they win because um I look through the entire version history of uh the library and they got it right from the very first commit so excellent very good

job that that's what happens if you have your crypto Library written by uh like a world class cryptographer I guess um other people open SSL and python core so these are basically this they share the same code um written by a German python core developer uh named Christian Hees and he noticed this 2013 time and made sure that both were quite good Jango um uh Scoops did the same thing but in a kind of different context so this CV was actually a denial of service so one thing I didn't cover um is if your password is very long and you do a hm for each step with the key scheduling and everything uh hmac has to pre-process

very long Keys and that means that if you if you send a Django uh web server uh say half a megabyte of password it will duly uh hash that every single iteration and this will take um minutes hours of of CPU time and it will not well and then the B castle on Apple cor crypto uh I I cannot say whether Apple cor crypto has always been like this or whether they fixed it um and yeah so the rest are kind of just yeah are all relatively slow um FreeBSD is particularly interesting because that's uh dis encryption but what they do is they measure at runtime how quick this thing is going and then determine the iteration count

based on that so if it is really really slow um you're immediately throwing away any advantage of of of having fast Hardware by uh well you just give away a two-time advantage to your attacker and that's probably not what you wanted to do so um don't blame implementers for this um there's I mean this I think demonstrates so ignore all these because this is is has been fixing the problem maybe Apple cor crypto don't know um you can't say that one person has got this right and everyone else is incompetent I don't think that that flies um so don't blame them for for doing this because they've done exactly what the standard said and yeah the next question is how much

practical um difference does this make does a two-time advantage does it get lost in the noise does it um is it covered over by other inefficiencies by slow compression function implementations and stuff so well let's measure it for huge iteration count this this all scales linearly with iteration count so whatever um and yeah so back open a cell and Python 3.4 roughly the same because they're the same kind of implementation and then there's the slow ones PHP 5 which is not too bad um and then those two which are maybe twice as slow kind of so I wrote a patch uh which is Upstream for PHP 5 and that made it that quick and I thought well that's quite

good it's kind of concerning that it's suddenly much quicker than open a cell cuz I in my brain my my opinion of the open cell uh project is that they optimize for Speed first and and that's basically it um yeah so if if you imagine applying that um Improvement that I saw in real life to the other ones then you kind of think maybe there's a a slight a line happening there where it's kind of 12 seconds on this particular machine for that many iterations um that result from PHP 5 got me think thinking well why is that uh why is that kind of so much quicker slightly quicker and it turns out you can do a lot better

um so that's the code I'm releasing today um it's there and it's public domain so just take it and please take it away um and oh it's on the last slide I'll leave the last slide up so that's right yeah so I thought I should illustrate why um I is so slow um basically this diagram show is area C CPU time versus area so on the right there's fast pbkdf2 and you can see almost all the time apart from a tiny blue sliver at the top is spent doing compression functions um on the left is openness cell and this um beautiful cyan um is uh the compression functions and everything else is just random crap and so like

memory allocation why they they do a memory allocation and a free at a secure memory allocation and a free so they clean everything out and then they immediately copy the same thing back in every iteration it's really great um so yeah so hopefully this this illustrate so they spend 50% of their time not doing anything useful um yeah uh so um I said that I kind of wouldn't talk about a long uh pbkdf2 outputs because it's broken um and it's broken because well all Defenders have to do this because they want the output but most attackers don't because if they're guessing passwords then they can uh prove a guess from the first block and if they

are if they know the the format of your output then they can say well the the interesting thing I wanted is in the second block so I'm just going to generate that and test that and then if it's right then I can um generate the rest and that's great um you can you can actually fix this with pbkdf2 you can glue different things onto the side you can have a step where you spend time make one one block of and then you can have a step where you don't spend any time and you merely expand it uh that's fine that works uh but out of the box it's not good but if you really do want that and you're kind

of crazy and maybe you have a a a deployed system which just is dependent on this working then uh it optional this Library it's not a library it's a single file um can paralyze this with um open MP and so I have a graph which is the last graph so here we've got um open a cell for one two three four blocks um this is wall time now not super time um and yeah it's kind of 12 seconds which is what on the on the last side and then 24 blah blah blah and then same thing single thread so the the slope is less because it's quicker and then uh multiple threads so this is on a two core machine so

um you can see that between one and two blocks it doesn't take any more time and that's because this is wall time if it was CPU time then it would kind of be going up as well but um and you can also see I think due to the effects of hyper threading that the slope is less so it's able to kind of parallelize but not completely um some of related blocks uh so yeah that's quite kind of interesting because if you if you are using open a cell with four blocks then there might be say uh 10 times performance difference between what you could um achieve if you were willing to throw a couple of calls at

it uh and that's it so I I think it's a poor design um a lot of these things could have been fixed very very easily with very small um design changes um probably which would have I mean some of them some design changes that you could consider would ruin um uh various uh security proofs applied to hmac say about um the use of related keys and stuff but whatever um it's described unhelpfully um they could have made a lot better job of saying well you should do this to start with pre-process the key and then save the result and then use that as a starting point on every iteration that would have been really great

um uh most implementations waste time and power particular interest map in power because my day job is uh in Mobile and I think if you're not paying attention to power and you're developing apps then your users hate you um they probably don't know it yet but they do um um and yeah you can probably drop in a faster one and that's you

lot I like possibly imagine there are any questions and comments on this uh questions and comments Ste Ste Steve was actually first this time Jeff yeah so I I asked you on Twitter uh and you said that you were going to use uh Vector instructions right yes um that's one thing I didn't cover so um the whole implementation is uh designed in a way that um everything is trivially inlinable and that means that things like xoring together the intermediate values um with GCC will get transferred compiled to S2 um uh block exos so they're very quick and the main thing is everything is available to the compiler so it knows exactly what's happening there are no um

uh function boundaries about across compilation units where it can't um optimize without whole program optimization which is a disaster and no one uses it um one thing I would say is I don't I'm not claiming the well the thing back there is a performance record because I think there's still some space to do that and the main reason is well using um open ssl's hash functions um unchanged means there's some um swapping back and forth of endianness um so the hash function input wants things in network Andi inness and it wants to process me host end in this um and so you have to kind of swap back and forth in me in a loop and that's

kind of annoying it's not necessary but I don't also don't want to um start maintaining a load of uh assembler uh hash functions in all different platforms because uh yeah it's just boring hi um sorry yes I enjoy life and just not doing that um um I've got too many questions so I'll try um do yeah yeah okay um uh uh okay one quick one um are you aware or have you tested how any of this stuff works on Android or just recing and that that just happens to be where I'm facing a bottleneck yeah on pbkdf2

so I think um right there's a couple questions there I think uh bouncy castle um is quick but Android has its own Fork effectively bcy Castle which they haven't updated so I did look at that and it's kind of way way behind and it's slow and it's eventually going to fall back down to some hash function implementation which is also slow um so we uh we I'm not here talking to my my employer but we use um open cell PK F2 called through um ffi and that works you know reasonably quickly um but to answer your main question no I haven't tested this on arm or compared um arm pbkdf2 arm open cell pbkdf2 um to

my want to expect kind of you get similar results yeah okay so I would say we should give him of course one more time and you'll be around for some most har I guess oh yeah yeah yeah excellent so now there's um you know uh the ability to relax there's a happy hour in the CH out room we will will be uh starting up here again at 5 the next speaker is Andrew hins from Google and he

PBKDF2: Performance Matters

Related talks