
Right now we have a great talk: Rohan McLure, going out on a limb, accelerating elliptic curve cryptography. Let's welcome him to the
stage. Thanks for coming to see me present what is basically a maths lecture from a terminal. Hi, my name is Rohan; I work for IBM. I do software support for the Power architecture, predominantly Linux kernel work, so arch/powerpc; the Power architecture is the instruction set our modern server-class processors run on. As an aside, I also work on cryptography libraries in the general Linux-on-Power software ecosystem, the largest of which is OpenSSL. That's a library which probably doesn't need an introduction at a BSides: it's a cross-platform cryptography library, in fact the most widely used one, and it serves as an upstream for some of its forks. It's heavily geared towards performance, written in some pretty low-level C as well as architecture-specific optimizations in assembly, and the assembly is written in Perl as well, which is a lovely caveat.

Anyway, my work is to do with elliptic curve crypto. I took one particular algorithm and completely reworked its implementation: that's 2,000 lines of C I managed to ship into OpenSSL, and we're seeing some pretty strong speed-ups across the board with it, as well as a fair candidate for assembly optimizations on your relevant architecture.

Why this curve is relevant: the NSA has put forward its advice on the suggested curves for every different commercial purpose, and for digital signature generation and verification, as well as key exchange, this is the recommended one; its key size sits in a sort of Goldilocks zone. Funnily enough, though, it's actually the last such NIST curve in OpenSSL to receive sweeping optimizations. So: against an already power-optimized baseline, which I'd written using Montgomery arithmetic, we're seeing about a four-and-a-half-times speed-up, which is more than I could possibly have expected, and there are already assembly optimizations in that baseline number. If you remove those, that brings us up to about a 5.2x speed-up off baseline, and in verification we're seeing a 1.7x speed-up. These are substantial, and this talk is about where they come from: how do you make crypto that much faster?

A little bit of background on elliptic curve crypto. It's a public-key cryptography scheme; equivalently, you may have heard of asymmetric cryptography. It's largely in the process of supplanting RSA: for a smaller key size you get a comparable degree of security, which is nice. And the upshot I'm going to give is
just this: it allows for better performance, really. So, what is an elliptic curve? This will appeal to the mathematicians in the room, although I'm not going to go into too much detail. Essentially, an elliptic curve is the set of solutions to the following polynomial equation, and those solutions come equipped with an algebraic group: you use a corollary of Bézout's theorem and you get a notion of what it means to add points together on the elliptic curve, to subtract points, as well as a notion of identity through the point at infinity. For performance, however, it matters not what operations are going on up at the level of the group, or how you get a discrete logarithm problem; it matters what the underlying mathematical objects are, and that is the level at which I've put my optimizations. So you need only know what x and y are. Well, they're elements of some sort of finite field. There are not very many types of finite field at all, but there are two types which matter to us in the cryptosphere, as far as I'm aware: there are binary fields, which we won't be talking about, and there's basically just integer arithmetic modulo a big old prime. One real advantage of ECC is that these big old primes don't change, unlike in RSA, where your modulus is determined by key setup.

Programmers in the room will be familiar with the percentage operator in most languages, but we need our modular reduction, and all of our arithmetic, to be constant-time, because it's crypto. A bit of a primer on modular arithmetic: essentially, you can think of the modulo operator as dividing by our n and taking the remainder. But modular arithmetic is really a ring homomorphism (I shouldn't have said "basically"), and that triple-equals operator you're seeing means that the left-hand side and the right-hand side share the same remainder, which is preserved so long as you only ever add or subtract multiples of the modulus.
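That ring-homomorphism property is the whole trick: reduction mod n commutes with addition and multiplication. A minimal illustration (my own sketch, not OpenSSL code; the helper names are mine):

```c
#include <assert.h>
#include <stdint.h>

/* Reduction modulo n commutes with + and *, so operands may be
 * reduced before the operation, after it, or not at all at
 * intermediate steps, and the final residue is unchanged. */
static uint64_t mod_add(uint64_t a, uint64_t b, uint64_t n) {
    return (a % n + b % n) % n;
}

static uint64_t mod_mul(uint64_t a, uint64_t b, uint64_t n) {
    /* safe from overflow here because n is small in this demo */
    return ((a % n) * (b % n)) % n;
}
```

This is exactly the freedom the implementation exploits later: intermediate values may be left partially reduced as long as the final result is reduced.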
The lovely thing here is that when you do arithmetic, additions or multiplications, you can reduce your numbers mod n at any stage, or partially reduce them, et cetera; and when n is a prime, you get a finite field. Delightful.

Now, zooming back a bit, just in fixed-width integer arithmetic there are some performance pitfalls. Suppose you have some A and B bounded by some n. The upper bound of their sum grows arithmetically: two such terms added together are bounded by 2n. But the upper bound of a product grows geometrically: three terms are bounded above by n cubed, four terms by n to the fourth, and so we're going to need double the number of bits just to store a product of two. A bit of a solution here is to use some sort of data structure which is permissive of variable-width integers, and you can build that up using what's called a limb representation. The idea is that you have your limbs, which are themselves fixed-width integers, and you concatenate these fixed-width integers together just like you might the digits of a base-10 number. So 2^n, in this case 2^64, is our radix, and we're concatenating our limbs together: every bit in limb one is less significant than any bit in limb two, and so on.
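As a sketch of that idea (illustrative types and names, not OpenSSL's): the value of a multi-limb number is just the limbs weighted by powers of the radix, exactly as decimal digits are weighted by powers of ten.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;

/* value = limb[0] + limb[1] * 2^64: every bit of limb[1] occupies a
 * strictly more significant position than any bit of limb[0], just
 * as the '1' in "12" is more significant than the '2'. */
static u128 eval_two_limbs(const uint64_t limb[2]) {
    return (u128)limb[0] + ((u128)limb[1] << 64);
}
```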
One strength of this representation is that we don't have to do carries eagerly; we can leave carries to the last possible moment, we can carry lazily. One function we'll want to write is a multiply or a square kernel, where we're going to want to add together these cross terms, these a_i times a_j. If each of the a_i is bounded above by 2^64, we run into an obvious problem: we need some sort of larger backing type, a cast to 128 bits, in order to store each product. What we've just done there is redundant limbs: we allow each of the limbs to actually be larger than the radix, the difference in place value between limbs. All right, well, with the radix still at 2^64, the cross terms fit neatly inside an unsigned 128-bit number, but we don't have any space left over to actually add the cross terms together.

OK, so we need to introduce more redundancy, and I like to call these unsaturated redundant limbs. The idea here is that we set a precondition on our arithmetic: the limbs which compose A are going to be no larger than 2^56. What that exponent is (in our case the limb width is 56) doesn't really matter, in as much as we can promise that we'll never overflow in our calculations. And now, when the limbs of A are suitably small, we can't possibly overflow our 128-bit type. Here is a little Unicode diagram of what unsaturated redundant limbs can look like: the limb width, that is the difference in the starting bit position of each successive limb, is 56 bits, but the underlying type is 64 bits, so to the left of those dotted lines each limb has a little bit of room, a few spare high bits.
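Here is a minimal sketch of such a multiply kernel under those preconditions (illustrative only, with a toy limb count; the real code uses seven 56-bit limbs for a 384-bit field element). Each cross product of two sub-2^56 limbs fits in 112 bits, leaving 16 spare bits in the 128-bit accumulator, so a whole column of cross terms can be summed with no intermediate carries at all:

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;

#define NLIMBS 3  /* toy size; a P-384 element would use 7 */

/* Schoolbook multiply over unsaturated 56-bit limbs.
 * Precondition: every input limb < 2^56, so each a[i]*b[j] fits in
 * 112 bits and up to NLIMBS of them sum safely into a 128-bit
 * accumulator. No carrying happens here; carries are deferred. */
static void limb_mul(u128 out[2 * NLIMBS - 1],
                     const uint64_t a[NLIMBS],
                     const uint64_t b[NLIMBS]) {
    for (int k = 0; k < 2 * NLIMBS - 1; k++) out[k] = 0;
    for (int i = 0; i < NLIMBS; i++)
        for (int j = 0; j < NLIMBS; j++)
            out[i + j] += (u128)a[i] * b[j];
}
```

Note the output columns are themselves "fat" 128-bit limbs; making them skinny again is the job of the reduce and carry steps described next.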
Those high bits redundantly store contents that belong to higher limbs, so basically, wherever two limbs overlap vertically, we're going to have to do some sort of addition to remove the redundancy.

This brings us, finally, to the overall implementation I wish to use, called the Solinas method. The basic premise is that we keep writing these kernels until we've comprehensively produced our algorithm, and the kernels have to compose. So the inputs and outputs of each function should be reduced modulo the prime, and each of the limbs should be minimal. We can then do some operation, like a square or an addition, something which could possibly overflow, and pad out into these larger 128-bit limbs; and so long as we're able to reduce the final state of the temporary, we can arbitrarily compose these square-reduce kernels, or something similar.

A little bit of a primer to understand the rest of this talk. We know what a polynomial is: a polynomial is just a weighted sum of powers of some variable, let's say t, and you'll observe that in summation form this looks very similar to the value represented by a multi-limb representation; we've just swapped out
our radix, that is 2^n, for t. So we can actually think of arithmetic modulo a Solinas prime, that is, a prime easily expressible as a low-order polynomial in powers of two, and re-express our problem in terms of reduction in polynomial rings, basically. A smaller prime prescribed by NIST is p224: if we let f(t) be this degree-7 polynomial, then our prime itself is just the polynomial evaluated at 2^32. And because we're now doing arithmetic mod f(t), we can manipulate these polynomial expressions. The upshot here is that whenever we get a limb seven, a limb eight, or anything higher than that, we can re-express that limb in terms of lower limbs: t^7 has the same remainder mod f(t) as t^3 - 1. So the summation from a couple of slides ago is saying that a limb representation is just polynomial coefficients (a limb is a polynomial coefficient), where t is 2 to the limb width. Our inputs to a multiplication, A and B, are just polynomials; we polynomial-long-multiply them; each of them was in terms of 1 through t^6, and now their product is in terms of 1 through t^12. So we take the reduction rule from earlier, t^7 is expressible in terms of smaller powers, apply the same idea for everything up to t^12, and we can reduce all the high limbs using some slick maths.
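Here is a hedged sketch of that p224 folding rule in C (my own illustration, not the OpenSSL patch). With f(t) = t^7 - t^3 + 1, any coefficient at index i >= 7 folds down as c[i]*t^i === c[i]*t^(i-4) - c[i]*t^(i-7). Because each step only subtracts a polynomial multiple of f(t), the congruence holds for *any* base t, which conveniently lets us sanity-check it with small numbers instead of 224-bit ones:

```c
#include <assert.h>
#include <stdint.h>

/* Solinas folding for p224 = f(2^32), f(t) = t^7 - t^3 + 1,
 * i.e. t^7 === t^3 - 1 (mod f(t)). Coefficients are signed so the
 * subtraction is harmless. Illustrative sketch only. */
static void solinas_reduce_p224(int64_t c[13]) {
    for (int i = 12; i >= 7; i--) {
        c[i - 4] += c[i];   /* + c[i] * t^(i-4) */
        c[i - 7] -= c[i];   /* - c[i] * t^(i-7) */
        c[i] = 0;           /* limb i eliminated */
    }
}

/* Evaluate the coefficient vector as a polynomial in t0, mod m.
 * Handles negative intermediate values correctly. */
static int64_t poly_eval_mod(const int64_t c[13], int64_t t0, int64_t m) {
    int64_t v = 0;
    for (int i = 12; i >= 0; i--) v = ((v * t0 + c[i]) % m + m) % m;
    return v;
}
```

Since the folding preserves the value mod f(t0) for every t0, evaluating before and after at a toy base like t0 = 4 (where f(4) = 16321) must give the same residue.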
The problem is that we're actually dealing with a harder prime, which is likely why there had been no Solinas implementation of it in OpenSSL prior to this patch. For p224 all the indices are neatly divisible by 32, but on a 64-bit machine we'd really rather not be wasting half of every register on empty arithmetic. So let's actually pick our limb width to be one byte less than the maximum size, so 56. This choice is somewhat arbitrary; it's an optimization for architectures that provide fast instructions on byte-aligned quantities. We end up with this absolutely horrendous-looking polynomial in terms of our t, where t is two to the limb width. So yeah, things have blown up in sophistication, basically, and it was not at all clear from the outset that this would yield better performance results, but it did.
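For reference, here is the shape of that polynomial. This is my reconstruction from the NIST prime, assuming the curve is P-384, which matches the 2^384 bound and the 56-bit limbs discussed in the talk: the prime is tidy in base 2^32 but grows awkward coefficients once t = 2^56.

```latex
\begin{aligned}
p &= 2^{384} - 2^{128} - 2^{96} + 2^{32} - 1 \\
  &= t^{12} - t^{4} - t^{3} + t - 1 && (t = 2^{32}) \\
  &= 2^{48}\,t^{6} - 2^{16}\,t^{2} - 2^{40}\,t + 2^{32} - 1 && (t = 2^{56})
\end{aligned}
```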
A bit of a high-level overview of what we're going to want to do. We've just done a multiply or a square, so we've got a 13-limb number, and each of these limbs is going to be pretty fat, because we've just done a sum of cross products. Essentially, we want to add or subtract multiples of our prime so as to remove all bits whose significance is greater than or equal to 2^384. So limbs seven through 12 we're going to want to get rid of entirely, and then the high bits of limbs four, five and six straddle that dotted line. So here's our game plan. We're going to want to subtract at some stage, so to avoid underflow we're going to have to do something about that first. Then we get rid of limbs seven through 12 with our first substitution. Next, we carry limbs four and five up into six, so that we can remove the high bits of limb six all in one go; that's going to be our second substitution. And then, finally, we carry out all remaining redundancy in our representation, and profit.

I got this lovely review comment telling me that I'd put a whole bunch of random constants in critical crypto code and not explained anything, and I don't know why they'd say that. But here, basically, is a telescoping sum: all these constants collapse to just a large multiple of our prime. We're going to add
a large multiple of our prime, so that whatever subtractions we do, we won't underflow. It's telescoping in that we're adding a copy of 2^124 to one limb and then subtracting a copy of 2^68 from the next limb up; 68 + 56 = 124, so we're adding and subtracting the same quantity. Not only is this a large multiple of the prime, but each one of its limbs is sufficiently large that we can subtract limb from limb. Then we do our first substitution: a bit of modular arithmetic later, and we end up with the ability to substitute t^7 in terms of t^2, t^1 and t^0, which gives us a rule to eliminate limbs seven through 12.
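That substitution can be reconstructed as follows (again assuming p = 2^384 - 2^128 - 2^96 + 2^32 - 1 and t = 2^56, so t^7 = 2^392 overshoots the modulus by eight bits):

```latex
t^{7} = 2^{8}\cdot 2^{384}
      \equiv 2^{8}\left(2^{128} + 2^{96} - 2^{32} + 1\right)
      = 2^{24}\,t^{2} + 2^{48}\,t - 2^{40} + 2^{8} \pmod{p}
```

which is exactly a rule in terms of t^2, t^1 and t^0; the rules for t^8 through t^12 follow by multiplying through by further powers of t and folding again.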
To eliminate limb seven: again, vertical overlap of limbs means we're going to have to add. We add a shifted copy into limb two, a shifted copy into limb one, and we subtract and add shifted copies into limb zero, and this is what our code looks like. We actually start with limb 12; its reduction rule can be derived from limb seven's just by multiplying through by t^5. And sure enough, we add into limb six, subtract into limb five and add into limb five, as the formula would say; but to avoid overflow, this time we need to not only add into limb seven but also push a little bit of limb 12 into limb eight. So yeah, lots of complications here, lots of caveats. After our first substitution we've gotten rid of limbs seven through 12, having redistributed them into the lower end of the number by, in effect, subtracting a multiple of our modulus, which is lovely. What remains to be done is that the high bits of limbs four, five and six need to be substituted. So let's carry forward those high bits from limbs four and five; it's just some bit shifting, clearing and adding. Everything is bit shifting with this; that's the lovely thing about Solinas
methods: everything is bit shifting, no proper multiplies. Then, finally, we do one last substitution of all those high bits in limb six (notice that four and five are skinny now), and that's what our substitution looks like. Those bottom three blocks of code share the same format as the last substitution: we bit-shift limb six to get the high bits and then redistribute them, avoiding overflows. And then, finally, we set the pre- and post-condition for each of our kernels to be that we've reduced and gotten rid of redundancy, that is, all of our limbs should be less than 2^56, and so we just do a whole bunch of carries.
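A minimal sketch of such a carry pass (illustrative, toy limb count): each limb's bits above position 56 are added into the next limb up and then masked off, restoring the "every limb below 2^56" postcondition without changing the represented value.

```c
#include <assert.h>
#include <stdint.h>

#define LIMB_BITS 56
#define LIMB_MASK ((1ULL << LIMB_BITS) - 1)
#define NLIMBS 3  /* toy size for illustration */

/* Propagate redundant high bits upward so every limb is "skinny"
 * (< 2^56) again. The top limb is assumed not to overflow here;
 * the real code has already folded it down modulo the prime. */
static void carry_propagate(uint64_t c[NLIMBS]) {
    for (int i = 0; i < NLIMBS - 1; i++) {
        c[i + 1] += c[i] >> LIMB_BITS;  /* spill high bits upward */
        c[i] &= LIMB_MASK;              /* keep the low 56 bits   */
    }
}
```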
We actually haven't fully reduced our numbers modulo p, but we don't need to: we just need the pre- and post-conditions of each of our kernels to be compatible, that is, able to compose, and we make everything skinny; that's what our carries do. And yeah, that's our algorithm from 30,000 feet.

Now for the bit which will be relevant to a room of cyber people. I added 2,000 new lines of C into OpenSSL, and the code standards were kind of high; they told me off when I commented incorrectly. But there's this hilarious bit where they said, oh, I have a mate who knows maths, let him get
back to me, and then he never got back to me. So they just merged 2,000 lines of obfuscated maths from a new contributor. It turns out, if you wanted to (and I don't), this would be a great way to backdoor most of the internet. Don't get any ideas.

So, to what's next. Again, I've set up these functions: multiply, square, reduce. They're upstream, ready to go in version 3.2 of OpenSSL, but there are further assembly optimizations we could be doing on this curve. Also, Power is a bi-endian architecture; it can run little-endian or big-endian, and there are some trivial patches to bring big-endian support. I continue to provide community support for OpenSSL on PowerPC, and I hope to touch on these curves and see if we can get some extra performance somewhere. For anyone who's interested, that's my talk. Anyone got any questions? [Applause]

Are there any questions in the audience? One over here. Thanks for a great talk, Ro. Before the Solinas method, the previous method that you compared against: could you give us a brief overview of how it was running, and in particular what sort of arithmetic operations it used? Was it mostly shifts and adds as well, or was it
doing some multiplications and all this nonsense? Yeah, good question. So essentially the previous method is called Montgomery reduction; it has a nice Wikipedia page. In short, it's again a ring homomorphism on modular arithmetic, but it has this nice property where you can avoid doing any explicit divisions by a non-power-of-two; basically, things become bit shifts again. As for where we actually see the performance gains, here are some bonus slides. Basically, Solinas methods are hyper-optimized to the particular choice of prime, whereas Montgomery is a generic approach. The real advantage, however, is that we can write code like this, where we actually leave our carries to the very end. Conceivably you could write that into a Montgomery implementation as well, but the primary advantage of leaving carries to the very end is that you already have an arithmetic bound on the computation, and if you keep really loose dependency chains and front-load a lot of the arithmetic, then you open up your code to running really fast on a superscalar processor. So instruction-level parallelism is basically the answer, in short. The P-256 code has an optimized Montgomery multiplication method, and conceivably you could get similar optimizations out of that, but this seemed simpler, actually. But yeah, there's that.
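For the curious, here is a one-word sketch of the Montgomery reduction being referenced (my own illustration; the real baseline is multi-limb): with R = 2^64 and odd modulus n, REDC computes T * R^(-1) mod n using only multiplies, adds and shifts, never dividing by n.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;

/* REDC with R = 2^64: given T < n*R, returns T * R^-1 mod n.
 * m is chosen so that T + m*n is divisible by 2^64 exactly,
 * turning the "division" into a plain right shift. */
static uint64_t mont_redc(u128 T, uint64_t n, uint64_t n0inv) {
    uint64_t m = (uint64_t)T * n0inv;   /* m = T * (-n^-1) mod 2^64 */
    u128 t = (T + (u128)m * n) >> 64;   /* exact: low word cancels  */
    uint64_t r = (uint64_t)t;
    return r >= n ? r - n : r;          /* t < 2n: one subtraction  */
}

/* n0inv = -n^-1 mod 2^64 for odd n, via Newton iteration
 * (each step doubles the number of correct low bits). */
static uint64_t neg_inv64(uint64_t n) {
    uint64_t x = 1;                     /* correct mod 2^1 */
    for (int i = 0; i < 6; i++) x *= 2 - n * x;
    return (uint64_t)(0 - x);
}
```

To multiply in this domain you keep values in Montgomery form a*R mod n, so that mont_redc(aR * bR) yields abR; note redc(a * R) recovers a itself, which makes the round trip easy to check.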
Thanks.