
Matt Wood - How Not to be Seen: Creating Non-Speculative Side-Channel Resistant Code

BSides PDX · 2019 · 42:17 · 132 views · Published 2019-11 · Watch on YouTube ↗
Category: Technical
Style: Talk
About this talk
Software side-channels have been a hot topic recently, and with good reason. Many of the techniques are used to liberate secret information from other processes or trusted execution environments (TEEs) such as Intel’s SGX, ARM’s TrustZone, and the like. Some of the techniques making headlines are related to speculative execution properties of modern processors, but there is an entire class of non-speculative techniques also receiving a lot of attention in recent research. Luckily there are a few techniques available for implementing algorithms that use secrets—like cryptography—so they present as few opportunities for leaking information as possible. In this talk you will learn the anatomy of a few classic non-speculative side-channels on mathematical algorithms used in just about every system in modern computing, followed by industry best practices for mitigating them, and finally what you can do to help minimize the risks for your applications. Matt is a Principal Engineer in Intel’s System Software Products (SSP) division, focusing on open source security. Part of his day to day job involves helping project teams understand how to implement and use cryptography properly.
Transcript [en]

Hi everybody. I think we're here to learn things, correct? Well, if you aren't, that's too bad; you're probably gonna learn something by accident just sitting in the room. So come on in, have a seat, silence your devices, and let's learn about how not to be seen. Matt Wood is a Principal Engineer in Intel's System Software Products division. He focuses on open source security, and part of his day-to-day job involves helping project teams understand how to implement and use cryptography properly. He's got some great tips on how your code cannot be seen. Everybody please welcome Matt.

Thanks for showing up, either before lunch when you might be getting hangry, or after lunch when you might be getting sleepy. Hopefully everybody will be happy here. Who knows where the title "How Not to be Seen" comes from? A few people, okay. So as background, it's the title of a Monty Python sketch. It starts out with a scene where you don't see anybody, and the announcer says, "Could you please stand up?" As the guy stands up, he's summarily shot by a sniper in the distance, and this happens over and over until people learn how not to be seen. Side channels, when it comes to secrets, are sort of like ways of nicely asking a secret to stand up and get plucked.

So the idea here is that we want to use our secrets in a way that cannot be seen. Speculative side channels are what's getting all the sexy headlines nowadays, but there's this whole class of side channels out there that have been around for a couple of decades now, practical for about 15 or 16 years, that don't take advantage of speculative execution. They span every architecture out there. Some of them work at an API level; for some of them you have to be on the same machine and take advantage of a cache attack or something like that. The headlines on the screen here cover the gamut: getting secrets across different processes on an OS, on ARM and Intel, breaking them out of TrustZone, out of SGX, and so on. It's a pretty common problem.

So what are we supposed to do about this? Pretty much every paper about these non-speculative attacks, or traditional attacks, or whatever you want to call them, uses a phrase similar to "use constant-time programming and this will solve your problem." But most of the time they don't tell you what that means; they assume you know how to do it or where to go find it. So that's what we're going to talk about today: what it means to use constant-time programming, and how to make your code so it can't be seen.

When I talk about a non-speculative software side channel, what do I mean? Typically they are mechanisms that take advantage of timing differences in how your code executes: you watch what those timing differences do to derive something about the secret that's being used. It's a long sentence, but it makes more sense as you go through it. Like I said, it can operate at an API level, or it can operate across processes on the same machine via shared caches, shared pipelines, and things like that, and it has no dependency on speculative execution. I'll actually talk about a specific one coming up pretty soon, how it works and how we defend against it; it worked well before a lot of the current generations of processors.

So what causes a lot of these timing differences? Some of them are pretty obvious. Taking early exits, whether on errors or even on the normal, happy path at run time: I can detect a lot of those early exits even over a network interface, on a local network, sometimes even across the internet, depending on how long the timing difference is. Then there's variable processing of the input based on some data I want to know about, usually a key or something like that; variable processing is where you're gonna find it.

When you get to the local attacks, you start running into mutual access of shared resources, and this is where we're going to spend the majority of the time today. By far the most popular resource attackers watch is the cache, whether it's L1 through L3, data or instruction cache; cache attacks are definitely the most popular. There was a paper recently, though, that I found really cool, about an attack called PortSmash that took advantage of port contention in the pipeline to liberate secrets. And here's one a lot of people may not think about: there are some instructions you want to stay away from in your constant-time code, because they do different things depending on the parameters. One you would think might be safe, because it's basic math, is integer division. That's a tough one to use, because if you pass in a divisor with a Hamming weight of one, i.e. a power of two, most processors will translate that into a shift instead of actually executing the divide logic, and the difference in cycle count for that instruction is quite noticeable.

So, constant-time principles to the rescue. There are well-known constant-time principles out there, if you know the right terms to search for on Google.

You can find references here and there, on Wikipedia and other places, but generally they aren't all brought together in one place, and they don't have that great an in-depth description. JP Aumasson, a well-known cryptographer, recently published a crypto coding guide; I'd recommend going and looking at that, because he covers these as well. Also keep in mind that "constant time" on modern processors is kind of a misnomer. Nothing in a modern processor actually runs in constant time, because of the way pipelining, out-of-order execution, and so on work. On early generations of processors, where things ran synchronously, you could actually do it: you could make sure your code was always going to execute in 237 cycles, or whatever the case may be. That's never going to happen today. In modern research this is called data-oblivious computing, which is a little harder to Google, but the idea is that no matter what secret you're operating on, the code always takes the same path and always accesses the same data in the same order. By doing that, even if an attacker tries to disrupt your process, maybe they flush a cache or something like that, they're not actually going to gain any information, because they're going to see the same cycles on the bus.

Here are the main principles; I'll cover them in a little more detail. First, the overarching thing: make sure your runtime is independent of the secrets. If you're thinking at an API level, don't take those early exits on error; make sure you always do the entire operation and then say what happened. Make sure all your code access patterns are the same: if you were to trace through the execution of a crypto algorithm, any instruction pointer sequence that involves using that secret had better look the same for every secret of the same size you pass into the algorithm. All your AES-256 should look the same, all your RSA-2048 should look the same, and so on. And ensure your data access patterns are independent of the secret value: no matter what the exponent in your RSA key is, it should always access the data in the same general pattern.

A little more detail on that first one. I'm sure this one's obvious: avoid flows that are detectable based on the secret input. Pay attention to your error states. One of Daniel Bleichenbacher's first attacks, detectable at an API level, actually used the error states in TLS to recover the RSA-protected secrets of the session. It was the first attack of its kind to show that chosen ciphertext could work against asymmetric algorithms, and it happened because the protocol basically gave too much information away about why it failed. Look at your entire data flow: the upper-level algorithm and execution of, say, RSA may look like it's doing the same thing every time, but every little instruction of your multi-precision math library, for instance, has to be checked, because it may have something that's off and can be used to differentiate the operation. And pay attention to what instructions you use, because there's variable behavior; a lot of times you can find this in the development reference manual from the manufacturer.

For code access patterns: don't let the secret influence your conditionals. If you have an if-then that takes different branches based on a secret value, we're gonna find it; we're going to take your secret. Indirect branches, basically a call through a pointer: make sure you're not calling two different pointers based on a condition, and so on. Pay attention to the secret data, but also pay attention to any values derived from the secret; a lot of times people forget that one. AES, for instance: you take the key and put it into the init function, and it actually derives a bunch of subkeys. You have to pay attention to the use of every one of those subkeys as well, because they are a variant of the secret, and there are attacks that have looked at how subkeys are derived and how subkeys are used. And your instruction pointer sequences, as I mentioned before, have to look exactly the same if you were to trace them.

Then, ensure your data access patterns are uniform; this looks very similar to the previous principle, and there's a pattern here if you haven't figured it out yet. Your data sizes and loads have to be the same: if you're dealing with one secret, don't access it in 64-bit chunks in one case and 32-bit in another. Always access the same bit size, because that costs the same cycles over the bus, the same timing, the same cache effects, and so on. Pay attention to the secret data and anything derived from it, same as before. And this one's phrased a little differently because of heap allocation and ASLR: if your algorithm uses a static resource, ASLR may move it around in the process address space, so from one execution of a process to another, that static resource may be at a different place. In the case of heap allocation, even if you run something in a loop, your key value is gonna be at a different place in memory each time through the loop

because of malloc, but the accesses within that allocated buffer have to be at the same offsets, or else they'll give away the secret.

So let's look at a simple pattern you can often see at an API level: comparison timing. It's the simplest one, and everybody does it. How many people have written something that compares a password or something like that? Pretty much everybody. And it's quite possible that most of us have done it wrong the first couple times in our career. Certain things require constant-time comparison: comparison of MAC values, comparison of passwords, things like that. Using a non-constant-time comparison creates what's called an oracle. In classic mythology, an oracle is a priestess or a shrine that can be queried and will reveal some piece of hidden knowledge. In cryptography, an oracle is a mechanism that will statistically reveal secret information. It doesn't have to be 100% accurate; you just have to be able to query it enough times that you can eliminate the bias, and out pops the secret. The basic data comparison function looks like this: check that the lengths are the same, and then compare the data.
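A sketch of that naive comparison in C (illustrative; the function name `naive_compare` is mine, not from the slides):

```c
#include <stddef.h>

/* Naive comparison, typical of first-attempt password/MAC checks.
 * It leaks through two timing oracles:
 *   1. the length check returns immediately, so timing reveals
 *      whether the attacker guessed the right length;
 *   2. the loop exits at the first mismatching byte, so timing
 *      reveals how long the matching prefix is. */
int naive_compare(const unsigned char *a, size_t alen,
                  const unsigned char *b, size_t blen)
{
    if (alen != blen)          /* oracle #1: length */
        return 0;
    for (size_t i = 0; i < alen; i++) {
        if (a[i] != b[i])      /* oracle #2: early exit on first diff */
            return 0;
    }
    return 1;
}
```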

The problem is, we've got a couple of oracles here. The first one is the length comparison: it's data dependent. And the second, you'll notice, is that it takes early exits based on the comparison: the first time it finds something that's different, it exits, and I'm going to be able to use that to attack the data. So here's a sample scenario: raw password comparison. Short passwords are allowed, only three to five characters, because it's a super secure system; you can only use ASCII a through z; and we're assuming the attacker can pass in any password they want, as frequently as they want.

In this case, we know the target password is "oops". So how would the attacker go about finding it? The first thing they're going to do is try to figure out what length the password is. They call it a number of times within the constraints and wait until the timing gets a little bit longer; I'm assuming for this example that every comparison takes one time increment. So first you find the length: you query with the minimum, you query with the next one up, the timing changes, and you go, "I found it." Then you move on and try to find each character: you start with the first letter and just iteratively call until your timing gets that next little time bump, at which point you know the first letter. Then you move to the second letter, then to the third. This kind of attack has been used not only on passwords; it's been used on MAC values too, in cases where a protocol makes a MAC comparison and there's no timestamp, a bunch of protocol violations that should have been red flags in that scenario. But this was deployed code.

It turned out you could attack it in significantly less time than brute force. So how do we go about silencing one of these oracles? What you'll see is code very similar to this in just about every crypto library out there. Instead of doing a direct comparison, it accumulates differences of bits to determine its final answer. The first thing it does is compare the lengths, folding any difference into an accumulator; then it runs through the loop, XORing the bytes together and ORing the result into the accumulator; and at the end, if you have a value that's nonzero, you know the buffers differed. You'll notice the computation here is all math ops: there are no logical ANDs or logical ORs. There are comparisons, but that's taking advantage of the fact that the C compiler turns a comparison into a result of 0 or 1 that feeds into the math. It turns out that GCC will generate the code that you think it does here; it doesn't take early exits. That doesn't say clang won't be clever, and I'll get into that kind of thing later, but this is the kind of gymnastics you go through to get the code to actually execute in constant time.
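A constant-time version in the style the talk describes (a sketch in the spirit of common crypto libraries; `ct_compare` and the exact accumulator shape are mine, with the `volatile` qualifier the talk returns to later):

```c
#include <stddef.h>

/* Constant-time comparison: differences are accumulated with XOR/OR
 * instead of compared directly, so every byte is always touched and
 * there are no early exits.  The volatile qualifier discourages the
 * compiler from re-inserting shortcuts (see the compiler discussion
 * later in the talk). */
int ct_compare(const unsigned char *a, size_t alen,
               const unsigned char *b, size_t blen)
{
    volatile unsigned char acc = 0;

    /* the != comparison yields 0 or 1; fold it in, no branch taken */
    acc |= (unsigned char)(alen != blen);

    size_t n = alen < blen ? alen : blen;
    for (size_t i = 0; i < n; i++)
        acc |= (unsigned char)(a[i] ^ b[i]);  /* every byte contributes */

    /* acc == 0 iff lengths and all compared bytes matched */
    return acc == 0;
}
```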

So how do you get rid of that length oracle? I can still do the length attack, because I can very easily figure out when the lengths match. Really the only way to do it is, as the caller, making sure you always pass in same-length buffers: you always store your raw password, or rather its hash value, in a buffer of a specific length, you bring your comparison value into a same-length buffer, and you compare the same number of bytes every time. You can't solve it inside the comparison function, because if you always try to compare a maximum length, you'll have a read-past-the-end-of-buffer problem. So make sure the caller always passes in equal-size buffers; in the case of passwords, actually do the right thing and store password hashes, and that way you also know the comparison will be the same, because the hash lengths are the same.

So let's shift gears a little and start looking at the local attacks, with a classic local cache attack. Who's read the paper that "Cache Missing for Fun and Profit" title comes from? A few people. So, some background on that paper: it was published by Colin Percival

in 2005, and it's the first paper that really got notoriety for this kind of side channel. Odds are there were other researchers that had done this, but this one got press and exposure. In the paper he targeted a Pentium 4 with hyper-threading enabled, and showed that with that configuration you could mount a cache attack and extract the exponent of a private key from a separate process. It takes advantage of a cache observation method called prime and probe, and he used it to determine whether a square or a multiply was being done, which could subsequently be used to figure out what exponent was being used in the computation.

So how does this work? In prime and probe, the attacker trying to figure out what's happening in another process starts up an attack loop. The first thing it does is prime the cache: by referencing its own data, it pushes everything that's not its own out of the cache. How you do this can be a little different on different architectures, but there's a general pattern you can follow to get the cache to pull all of your data in. Then you let the target process run for a slice of time; from a practical standpoint, this just means you immediately go around the loop again, you don't wait at all. Then the attacker goes back and accesses the same data it pulled into the cache, looking for timing differences in how long it takes to access each cache line. If it takes longer to pull a cache line in, it realizes: aha, some other process on the system accessed memory that aliases to the same place in the cache. Once you have that data, you can start looking through it, do some signal analysis on it, and quite often you can detect when something like a crypto algorithm is running, and what it's doing.

If it hasn't learned how not to be seen, you will see it. So what does the data from one of these attacks look like? This is a heat map representation of a capture of an RSA operation, actually a small sliver of it: one modular exponentiation of a Chinese Remainder Theorem computation. I'm not going to get that deep into the math, but you can start to see certain patterns in this cache map. Look down the right side and you start seeing a cadence; there's another one on the left side where you can see a cadence of actions happening. Then you start looking at the other lines across, and you start noticing which function is being called, because of instruction cache effects and so on.

To read this: each column is a cache set. Caches are organized into what you'll hear described as so many rows deep by so many columns, and different architectures choose different organizations for different types of workload optimization. What we're looking at is the time it takes to read through each cache set, going down the map as successive samples, with darker spots representing a longer access. It turns out that if you look at the way the columns are accessed, and how dark they are, you can start noticing where a square operation is happening and where a multiply is happening, because they touch different code lines and bring those into the cache.

So how do we apply this? Modular exponentiation, computing a = b^e mod m, is the basis of pretty much every major algorithm you use: RSA, Diffie-Hellman, ElGamal, basically anything that's not

elliptic curve or a block cipher. You compute b^e mod m using multiple steps, and if you can identify what the steps are, you end up reading the bits of the key out right away. We used to make a game out of looking at the heat maps and figuring out which one of us could identify the key in the cache block quickest.

So let's go through, real quick, how you compute a = b^e mod m. We'll start with the simplest solution, because it's easiest to understand, and it's also easiest to see where things go wrong. We'll make some simplifying assumptions: we'll treat a multiply of 1024-bit numbers as if it's a basic operation, assume it's actually implemented in constant time, and assume that multiplies and squares take the same amount of time (in reality, squares are quite a bit quicker than multiplies when you optimize them). This method is called left-to-right exponentiation. The names may seem a little technical; they're taken directly out of the Handbook of Applied Cryptography, a great book if you want to learn how the numerical methods work, and you can get it free on the internet, so definitely check that out. It's the most naive method, and its run time and how it operates are directly related to the number of ones in the exponent. If you're familiar with RSA or Diffie-Hellman, that exponent is the secret key; e is what we're trying to find. You'll notice there's a branch based on the exponent bit: if I can detect that branch, I know when you're processing a one and when you're processing a zero. And if I'm just looking at data accesses, I can look to see whether you're accessing the value containing b or just the one containing a. Bit size is also something you have to be careful of, because the loop starts at the most significant one bit in the value
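The naive left-to-right square-and-multiply can be sketched on machine words like this (real implementations use multi-precision arithmetic; `modexp_naive` is my name, not the book's):

```c
#include <stdint.h>

/* Textbook left-to-right binary exponentiation: a = b^e mod m.
 * Word-sized sketch (keep m small enough that a*a fits in 64 bits).
 * The branch on each exponent bit is the leak: square-only vs
 * square-then-multiply is visible in the cache. */
uint64_t modexp_naive(uint64_t b, uint64_t e, uint64_t m)
{
    uint64_t a = 1 % m;
    for (int i = 63; i >= 0; i--) {
        a = (a * a) % m;          /* always square */
        if ((e >> i) & 1)         /* secret-dependent branch! */
            a = (a * b) % m;      /* multiply only when the bit is 1 */
    }
    return a;
}
```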

So if your exponent is nominally 1024 bits but bit 997 is actually the highest one bit, the bit size had better still behave as 1024. So how do we fix this? You can do it this way, though I wouldn't recommend it: for every bit of the exponent, compute a value r that turns out to be either b or 1, depending on whether the exponent bit is 1 or 0. It accesses the same data every time and does the same computation every time, but your efficiency greatly suffers; you probably don't want to do this.
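The always-multiply variant could be sketched like this (again word-sized and illustrative; the branchless select making r either b or 1 is one way to do it, and `modexp_multiply_always` is my name):

```c
#include <stdint.h>

/* "Multiply always": every bit does a square and a multiply, and a
 * constant-time select picks either b or 1 as the multiplicand, so
 * the same operations run and the same data is read either way. */
uint64_t modexp_multiply_always(uint64_t b, uint64_t e, uint64_t m)
{
    uint64_t a = 1 % m;
    for (int i = 63; i >= 0; i--) {
        a = (a * a) % m;
        /* mask is all-ones when the bit is 1, else zero */
        uint64_t mask = 0 - ((e >> i) & 1);
        uint64_t r = (b & mask) | (1 & ~mask);  /* r = bit ? b : 1 */
        a = (a * r) % m;                        /* multiply every time */
    }
    return a;
}
```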

In the standard method you do approximately 1.5 multiplies per exponent bit; now you're guaranteeing two, so you've taken roughly a one-third hit. So how do actual crypto libraries do it? They use a method called left-to-right k-ary exponentiation. You choose a bit slice, and you process the exponent that many bits at a time; in this case, let's say k is three. You precompute powers of b up to b^(2^k - 1) and have those ready. Then, now that you know those powers, you go through the exponent k bits at a time: you always square the same number of times, k times, then you pull out a precomputed value and multiply by it. Now processing the key takes the same time for every exponent. The bigger k is, the better the approximation for the number of non-trivial multiplies gets, but you do have to pay attention to the size of your exponent, because if you choose a big k with a small exponent, you spend all your time precomputing and not actually doing the computation.
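A fixed-window (k-ary, k = 3) version might look like this (word-sized sketch; `modexp_fixed_window` is my name; note this direct table read still leaks its index, and a real library hides which entry it reads, as the talk goes on to discuss):

```c
#include <stdint.h>

/* Left-to-right k-ary exponentiation with k = 3: precompute
 * b^0..b^7, then for every 3-bit window do exactly 3 squarings and
 * one table multiply, so the operation sequence no longer depends
 * on the exponent bits. */
#define K 3

uint64_t modexp_fixed_window(uint64_t b, uint64_t e, uint64_t m)
{
    uint64_t tbl[1 << K];
    tbl[0] = 1 % m;
    for (int i = 1; i < (1 << K); i++)
        tbl[i] = (tbl[i - 1] * b) % m;

    uint64_t a = 1 % m;
    /* 64 bits rounded up to a multiple of K, processed high to low */
    for (int w = (64 + K - 1) / K - 1; w >= 0; w--) {
        for (int s = 0; s < K; s++)
            a = (a * a) % m;                      /* always K squares */
        uint64_t digit = (e >> (w * K)) & ((1u << K) - 1);
        a = (a * tbl[digit]) % m;                 /* always one multiply */
    }
    return a;
}
```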

You'll notice that certain crypto libraries will actually look at the exponent and pick k based on the size of the exponent. But who's noticed that we now have a data access order problem? The way you get around it is that you don't access the memory directly: you have an access function that touches every byte of every precomputed entry, every time through the loop, in the same order, and hides the fact that it's choosing a certain index within that table. Accomplishing this is a bit architecture dependent. I didn't want to throw code up here, because I'd be getting down into assembly, and into multiple assemblies: on x86 without SSE instructions, for instance, you can use a loop with CMOV to scan through the table, but if you have SSE you'll be doing load-and-mask operations on fairly large chunks of data, 128, 256, even 512 bits at a time. So you'll have different loops depending on what instruction set your processor has to optimize this table access, and on ARM, of course, you'll use a different one again. There's a lot of code dedicated to doing this type of operation in the various crypto libraries.
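The scan-the-whole-table idea can at least be sketched in plain C (real code uses CMOV or SIMD loads per architecture, as described above; `ct_table_load` is my name):

```c
#include <stdint.h>
#include <stddef.h>

/* Constant-time table lookup: instead of tbl[secret_idx], read
 * every entry and mask, so the data access pattern (and therefore
 * the cache footprint) is identical for every index. */
uint64_t ct_table_load(const uint64_t *tbl, size_t n, size_t secret_idx)
{
    uint64_t r = 0;
    for (size_t i = 0; i < n; i++) {
        /* mask = all-ones when i == secret_idx, else zero */
        uint64_t mask = 0 - (uint64_t)(i == secret_idx);
        r |= tbl[i] & mask;    /* every entry is read every time */
    }
    return r;
}
```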

I alluded to the compiler doing weird things, so: beware your compiler. There is a problem for cryptographers with compilers, and the problem is that just because your code looks like it's constant time doesn't mean that's what the compiler is going to generate. Compilers get very clever, in ways where they will do path analysis of all the data through the code and figure out where to add early exits; they will figure out how to recompute the math you've done to be more efficient on the current architecture, possibly via quite a different path than you think: functionally equivalent, but not the same. Most of the time we want that help; here we don't. So you basically always have to verify that your generated code actually does what you intend. Typically that means your validation person, or you, or whoever has that role, is going to disassemble the code and actually check the assembly, or use some tool to check it. And as you update your compiler, or change your crypto library or anything else, you're going to want to revalidate that it still has that property, because GCC, for instance, will add things to its optimizer even from minor version to minor version at times. So you have to be careful, unless you want to be absolutely sure and write it in assembly, which some of us love doing; I have people that like it, but they're typically masochists.

So, at a slightly higher level: if we look back at that example I gave, who noticed the volatile keyword in there? One or two people. If you don't put that volatile keyword in, the compiler actually generates code that's equivalent to the early-exit version. I ran this same function through the compiler at different optimization levels, GCC 7.4

on Ubuntu 18.04, if you want to do this experiment yourself. You'll find that without the volatile keyword the code gets kind of convoluted in places and looks a bit odd, but it actually has the same logic in the end. What the compiler does is see that x == 0 comparison at the end of the implementation and go: "oh, any time I see this x variable go nonzero, I can just exit, because I know I've already lost; it can't go back to zero because it's all ORs." So it inserts for you exactly the early exits you were specifically trying to avoid. So you put the volatile keyword in there. Volatile is what's called a type qualifier, and it tells the compiler: don't make any assumptions about this variable, because something outside the control of this particular code may modify it. Typically it's used for, say, memory-mapped registers, because you never know when the hardware is going to change the value, but it can also be used to influence the code generator: it effectively tells it to avoid anything that might optimize on this value, to make no assumptions until the end. It's useful in some cases; again, you have to verify that it produces what you think it should. But there's a big caveat: technically, the compiler doesn't have to honor this qualifier. There's another one like that called register. You can give the compiler hints, and it can completely ignore them; it can assume it knows better and just do what it's going to do anyway. The other thing is that it's also not necessarily going to generate code as efficient as you think, because since it doesn't know when the value might change, it may introduce extra writes that it wouldn't normally have to do, where it would normally keep the value in a register, and so on.

make sure your code make sure it all generates the same way and a lot of times you crypto libraries all all crypto libraries and a lot of times your OS will give you constant time implementations certain basic functions comparisons of different data sizes and so on take advantage of them don't don't mess around of doing it yourself right so you know how do you find these things right typically code inspection is how people did it right code being both the assembly disassembles and you know the the main code that you're looking at it's painful it's error-prone the compiler can get you the architecture can can get you if you're not familiar with that architecture a lot of the papers that

are coming out that the authors would write the spoke tools that would search things so that heat map what it's actually from a tool that I wrote when Percival came out with his paper so that I could visualize data I didn't have to go much further with that one because it turns out that I could you know today I would run through some libraries and the data would have popped out right earliest earlier this year that was actually a nifty tool that came out called my car walk the authors actually use this to expose some non-constant time implementation in Intel's crypto library right they reported its Intel you know the responsible disclosure and you know in telling through and found it

released a fix to the code before the paper came out, which is the thing people are supposed to do. The code is open source. It's actually built on a dynamic binary instrumentation tool that Intel has released and has had out there for seven years now, and it uses information-theoretic methods to go through timing data and find hot spots that are likely to contain some sort of side channel. But you still have to do some work: you have to pull the results into various analysis tools and do that analysis yourself. So to that end, to make it a little bit

easier, yesterday we pushed code to GitHub: a tool called the Pin-based Constant Execution Checker, and I'll talk about this a little bit. It's also a dynamic instrumentation tool, so it's also based on the Pin framework, but it uses a different analysis mechanism: in this case, taint analysis and diffing. It looks at your code executions and basically points to the instructions where the execution went wrong. It targets Linux, and you can find it at the URL here: github.com/intel/pin-based-cec. So what do you do with it? You write a test program that

wraps the function you're calling. You mark your buffers to say "here's a secret," you call the function, and then you tell it to finish the traces. You run your traces, and then there's a post-processing script that runs through all the traces and basically spits out the function that had the problem in it and the index within the function where it happened. So, a sample script here; this is actually an excerpt from the sample in the repo. You have your secret, you generate your secret in the loop, you mark it, and you call the function; in this case it's a key setup that's

being instrumented. And then when you do the post-processing, you get script output that tells you, right here, that your SubWord, which is an S-box substitution, has a problem. When you do the decompile, it pops right out at you that, yep, you're accessing a table based on the secret data. So this helps you: if you have a function that you want to verify, you can write a set of test cases around your code that will tell you when things are going wrong. So, what have we learned?
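Before summarizing, the table-lookup finding just described is worth a concrete sketch. This hypothetical C fragment (my illustration, not the code the tool flagged) contrasts a leaky lookup, whose memory address depends on the secret index, with a masked scan that touches every entry so the access pattern is independent of the secret:

```c
#include <stddef.h>

/* Leaky: the address read depends on the secret index, so a cache-probing
 * attacker can learn which entry was touched. */
unsigned char lookup_leaky(const unsigned char table[256], unsigned char secret)
{
    return table[secret];
}

/* Constant-time alternative: read every entry and keep only the wanted one
 * using a branch-free mask. The memory access pattern no longer depends on
 * the secret, at the cost of scanning the whole table. */
unsigned char lookup_ct(const unsigned char table[256], unsigned char secret)
{
    unsigned char result = 0;
    for (unsigned int i = 0; i < 256; i++) {
        /* mask is 0xFF when i == secret, 0x00 otherwise */
        unsigned char mask = (unsigned char)(((i ^ secret) - 1u) >> 8);
        result |= table[i] & mask;
    }
    return result;
}
```

Production libraries often go further (cache-line-wide loads, bitslicing, or hardware such as AES-NI), and as the talk stresses, even this kind of code has to be checked against what the compiler actually emits.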

We have these constant-time principles: constant runtime, the access patterns are the same, the code paths are the same. I gave you some simple examples to illustrate what's going on. Pay attention to your compiler; this is the one people always forget about, so I sound like a broken record: don't trust your compiler, always check it. Trust but verify, if you will. And there are tools now that do it; I pointed you to a couple that you can use, including the one that, like I said, Intel just released yesterday so that I could give it to you guys to start hacking on some code. We welcome any pull requests; we want to

hear how you're using it, and we want to know it's been useful. So, final thoughts: please don't implement your own crypto, for the reasons we've talked about. If you look at any crypto library, they go through all sorts of somersaults, bending over backwards to trick the compiler into generating the right code and to adapt to different architectures. They do all that work for you, whether it's OpenSSL, Google's libraries, and so on, so take advantage of that work. They've been getting these things reported, and they continue to get them reported, for fifteen years. Personal aside: I pushed a patch into OpenSSL fifteen years ago now, and that was about when

we started seeing these things hit, and we've seen them ever since. Use the well-maintained libraries; don't grab a crypto library off somebody's PhD website. I've seen code using implementations that say, "I haven't really tested this for side channels; it worked and my friends liked it, so I put it here," and I find this in real code. Luckily we stopped it before it went out the door. But pay attention to the libraries you're using; use the well-known ones. We've also published a white paper on the website. The URL is kind of long; I think the slides will be available for people to get. But basically,

it's the content in here with more detailed examples, and it will be updated over time with new examples, new references, and so on. So, questions? If you have questions, I'm told to tell you to come to this mic here in the center.