
Spectre and Meltdown: Speculative Execution Considered Harmful

BSides Canberra · 2018 · 43:15 · 662 views · Published 2018-05
About this talk
BSides Canberra 2018
Transcript [en]

And now we are privileged again to have another amazing speaker. In fact, Yuval Yarom was one of the authors of Spectre and Meltdown, in terms of the research, and the only Australian on that list. He's going to talk today about speculative execution being considered harmful. It's such a low-key topic — speculative execution considered harmful — but it has such significant impact. So let's all welcome Yuval, and we'll go from there.

Thank you for the introduction; it is a pleasure to be here, and thanks everyone for coming. I was not the only one to do this work — there were a lot of co-authors and collaborators. In fact, we were five different research groups that found parts of these vulnerabilities, each of them independently, and I'll just introduce them. Jann Horn from Google is probably the smartest of us all: he found both vulnerabilities, he was the first to identify them, the first to exploit them, and the first to report them to Intel.

Another group — Daniel Gruss, Moritz Lipp, Stefan Mangard and Michael Schwarz from Graz University of Technology — discovered Meltdown. They are probably the strongest research team in the area of microarchitectural attacks that we currently have in the world. Werner Haas and Thomas Prescher are both ex-Intel, working at the security company Cyberus in Germany. Anders Fogh is a private researcher: he just does hacking, publishes his results, and works a lot in the area of microarchitectural attacks. And our group was Mike Hamburg, Daniel Genkin, Paul Kocher and myself. For those who don't know the name, Paul Kocher is the person who first published any results on side-channel attacks — the first person to take an oscilloscope and put it to a computer while the computer runs a cryptographic decryption. Mike Hamburg, from our perspective, is the one who came up with the question. And Daniel Genkin is a brilliant young researcher: he's the person who decided to check what sounds computers make while they decrypt data, and managed to show that acoustic cryptanalysis works.

One of the questions we are always asked is: how come no one found these problems for the last decade or so, and suddenly five different groups find them? There are a lot of conspiracy theories running around on Twitter, if you look — Intel always knew about it, it's the NSA, we did not really discover it and the problem doesn't really exist. The answer is much simpler than that. We were all working on side-channel attacks; this was a natural extension of the last four years of work for most of us, and we were just lucky that Jann Horn gave Intel six months to disclose everything — otherwise his would have been the only name here.

OK, so what are we talking about? Basically we have two different attacks. One of them is Spectre, which exploits a design flaw in most modern processors — I will explain more about how the attack works later. What we do is look at speculative execution: when the processor runs code speculatively, believing that the program will go a certain way, it leaves traces, and we exploit these traces.

Meltdown is more specific. It is a bug — although Intel doesn't acknowledge it as a bug — in Intel's implementation of memory protection; it also exists in some ARM processors. The result is that even though memory is protected from reading, we can indirectly read the data from it. From the attacker's perspective, Meltdown is more potent: it's extremely easy to exploit, and we can read the whole memory of the computer — anything that is there, we can find. Spectre is a bit more complex to exploit, but it is probably much harder to fix and will remain with us for longer, and it allows us to read data from other processes and to bypass language-based security and the protection it provides.

The main thing that we use is microarchitectural channels. What these are, basically: when a program executes, it leaves some traces in the processor's state. These footprints are just a product of the fact that the processor tries to optimise performance and guess what the program is going to do, and that information stays there; as the program continues to execute, it leaves more footprints. We have known for probably 15 years now that if we analyse these traces, we can learn information about what's happening.

So we have a program running, and it sends us information. The technical name for this is a channel — an information channel or communication channel — and when we work with these, we distinguish between two types. Covert channels are intentional: we have a sender that affects the processor state and a receiver that reads the processor state. A side channel is inadvertent: we have a sender that doesn't want to send but does so by the way it works, and a receiver that receives the data. For many years, covert channels were interesting to military people who wanted to run multiple security levels on the same computer, but no one else was really interested in them. Side channels were more interesting, because we can use them to find cryptographic keys, to break ASLR, to get keystroke information — everything that we want, we can get with a side channel. So most of the focus was on side channels. Spectre and Meltdown, in my opinion, change this, because we really use covert channels here.

Before we get to those, just an example of what these channels are and how they work. The channel that we will use is a cache channel.

What the cache channel exploits is the difference in speed between memory and the processor. Over the years, processors have become really fast, and memory has become faster too, but not at the same rate. So when the processor reads data, it has to wait for that data to arrive, and if the processor needs the data again, it would have to wait for it again. To fix this, designers put a small bank of memory, called the cache, between the processor and the memory. Because the cache is smaller, and because it uses a different technology, it is much faster.

Now, when the processor reads data, a copy is left in the cache, and the next time the processor needs that data, it can get it much faster. But the fact that this data is in the cache is a trace — a footprint — of what happened earlier. If we can find what data is in the cache, we know what other programs accessed before. This is what we have been exploiting for the past four years: we exploited caches, we exploited contention, we attacked almost every aspect of the processor, and we designed attacks from those.

The other ingredient is speculative execution, and for that we need to understand a bit more about how processors work. We have machine code — a program that runs — and when we think about it, we think of the processor executing the instructions one by one: when one completes, the next one starts, and so forth. That's the way we like to think about processors, because it's easy to analyse and understand what's going on. However, each instruction itself consists of multiple steps — in modern processors, 30 to 40 steps; we'll look at five here. In this example I have five steps: first we fetch the instruction from memory, then we decode it to understand what the instruction is, then we get the arguments it processes, then we do whatever it does — add the arguments, multiply, whatever — and then we write the data back, and by this time the instruction completes.

If we look at that, it takes five cycles to execute an instruction, and that's not efficient. On the other hand, each step uses a different unit: the fetch unit that gets instructions from memory is different from the unit that does the computation, so we can use them in parallel — and pipelining is the solution that does that. Basically, we execute multiple instructions in parallel: we fetch the first instruction, then in the next cycle we decode the first instruction and fetch the second, then we get the arguments for the first, decode the second, fetch the third, and continue that way until the first instruction completes, then the second, and so forth. So we have a sort of diagonal front on the execution of the program as it progresses, and this allows us to complete one instruction per cycle on average: we have five instructions running in parallel, one pipeline stage each, and on average we get one instruction per cycle.

But this creates some problems — for example, dependencies between instructions. If the first instruction does a division, and the next instruction takes the result of that division and needs to calculate with it, then we can't start executing the second instruction before we have completed the first. What the processor does in that case is what we call out-of-order execution: it executes instructions whenever their data is available, rather than strictly in the order the program specifies. So in this example, we fetch the instructions and start executing; the second instruction needs to wait for the result of the first, so we block it, but everything else continues; we finish the first instruction and then continue moving forward.

Now, the important thing that happens here is that at this stage, the third instruction completed before the second one. So the processor, instead of really completing and storing the data, stores the result in a temporary buffer, and the instruction waits until the previous instructions have completed; then they all complete together. So again we have a front — multiple instructions running at different stages — and behind it, what has completed execution.

Now we have the question: why did the processor not finish the execution of the third instruction before the second one? The answer is that it would create a problem. Say we are at this stage — what happens if b is 0? If b is 0, the first instruction, the division, should fail, and that means that neither the second nor the third should execute. So basically we have to wait with all instructions until all preceding instructions have finished before we can accept the results. What that means, first of all, is that all out-of-order execution is speculative in nature: we start executing instructions before we really know that they will be needed. And to solve this problem, when the processor reaches this stage and finds out that b is 0 and the division should fail, it just abandons these instructions, throws away the temporary results it may have calculated, and continues with handling the exception. So the behaviour is as if the instructions were executed one by one, sequentially, but we get the performance benefit of out-of-order pipelining.

Now, this mechanism is also used for handling branches. The problem with branches is that the processor starts executing, reaches the branch instruction, and this branch depends on data that has not been calculated yet — the front, the green area, hasn't reached there yet. So what does the processor do? It asks a fortune teller what the branch will do. We have the fortune teller looking at their crystal ball, and it tells us the branch will go right or left — depending on which side I look at them from. The processor continues executing, and at some stage it reaches the branch with the data and knows what the result is. Now it can test whether the prediction was correct. If it was correct, everything is well and we can continue executing. If it wasn't correct, it continues execution from the other side of the branch, abandoning everything that was calculated along the way. The nice thing about this mechanism is that we couldn't decide which way the branch would go before we had the data; so with speculative execution, if we made a correct prediction we gain performance, and if we had a misprediction, then in the worst case we just don't gain anything — we execute as if we hadn't made the prediction. We lose a bit of power on the calculation, but performance remains high.

Technically, there are two important branch prediction mechanisms. One of them uses the branch history: it tracks, for conditional branches, whether they were taken or not, and the fortune teller tells us whether the next execution of this branch will be taken or not. Then we have the branch target buffer, which is used for indirect branches: if we have a branch based on a register, or even a return instruction, which goes somewhere the processor doesn't know a priori, this is a different predictor that just tells us that from here, the program will continue at some other location. So this is the background: we have speculative execution and how it works, and we have cache footprints.

Our main discovery, with both Spectre and Meltdown, is that when a calculation is abandoned, what is abandoned is the nominal result of the calculation; what is not abandoned is the footprint this calculation left in the processor. Spectre and Meltdown are techniques for extracting these footprints and finding out what's going on in the computer. The distinguishing feature between them is that Meltdown exploits instructions that should have faulted — that should fail, yet something still executes after them — while Spectre exploits branch prediction.

The attacks basically assume an attacker that runs unprivileged code, where "unprivileged" could be a user process trying to access the kernel, one process trying to access another process, or JavaScript code trying to access data that is in the browser. What the attacker does is find a piece of code that we call a gadget, such that the attacker can control the input to that gadget, and the gadget sends information over a channel. The attacker causes speculative execution of this gadget with input the attacker controls, and then learns what the gadget leaks through a covert channel.

How does this work in practice? First we'll look at Meltdown. The typical code for Meltdown is two lines. Usually you write it in machine code, but it is easier to see it in C: the first line dereferences a pointer, and the second takes the value read from that pointer, multiplies it by 256, and dereferences an array with that index. The gadget here is the second line.

Now let's look at how this works. We have the address space of the process, and before Meltdown, the address space was divided into two parts. The left part is the user space, where all the data the user process can access lives; the right side is the kernel space, which is protected — it is mapped into the same address space, but the user cannot access it; only code running at high privilege, within the operating system, can access that location.

Now add a cache to the system, and start with the beautiful case: we have a pointer that points somewhere in user memory, both of these locations are cached, and we have the array, which is not cached at the moment. Look at what the execution does: first we dereference the pointer and get some value from memory, then we use this value to point somewhere inside the array and access that data — and that means this data comes into the cache. This is the way this code works; no problem here. Beautiful.

Now we go to a different scenario: the pointer points to a location in the kernel's part of the address space — a location the user should not be able to read — and this location happens to be cached. We do the same thing: we dereference the pointer. At this stage, the processor does two things: it accesses the data, and in parallel, it checks whether the user can access this data. It takes it some time to find out that the user is not allowed to read this data, and while it's checking, we speculatively execute the next instruction, which uses the value we got from that memory to address a location inside our array and read it into y. This will happen, and at some stage the processor will realise that it has made a mistake — it shouldn't have allowed this access. It will abandon the result: the value of i will not change, the value of y will not change, and we get a trap.

And here comes the Meltdown attack. The Meltdown attacker now does a very simple thing: it reads one location from the array and measures how long it takes for the data to arrive — how long the read takes. In this case the data was not cached, so the read is slow. Then it tries another location, and again it is slow. Only when it tries the location that we brought into the cache earlier will the read be fast, and when the read is fast, we know the index with which we accessed the array, and therefore we can find the value of i — and this value of i came from kernel space. So we can read data from the kernel. This is the basic Meltdown attack.

There are some complications if you want to actually implement it. First, you need to handle the fact that the process faults, so you need some mechanism that keeps your process alive. There are several ways of doing that. One is to catch the signal — on UNIX, at least, you can do that easily. The other way is to use a beautiful Intel feature called transactional memory, TSX. What TSX does is try to execute some instructions, and if it fails, it restores the state and goes back. The beautiful thing about it is that if it gets a fault while executing, it just goes back and says, oops, I couldn't execute this transaction. So we start a transaction, run this code, it fails, and we know we can continue running.

The other thing is that Intel has some mechanism — we don't know why they do it — where at some stage, when the processor finds out that the instruction should be dropped, it also resets the value of i to zero. So in some cases, what we read will be zero, and the problem is that we never know whether a zero we read is the real value or not. The solution is to try to read it 2,000 times: if one of the reads is not zero, we have found the value; if all of them are zero, it is probably really zero. So this is Meltdown.

Now Spectre. Spectre has two variants, one for each branch predictor. For the first variant, this is the typical code; it is found either explicitly in C programs, or implicitly in languages that check array bounds on access. What this gadget does is very simple: it takes an index into an array, checks that the index is within the array bounds, and if it is, takes the value at that index and accesses a different array — again with the value times 256. And again, the gadget is that second access.

Now, how does the Spectre attack work? The first thing we do is call this code with a legal value of x, within the array. Let's look at our memory: we have all the elements of the array in memory, and we also have some secret data that we want to read. First we compare the value of x with the length of the array; it is smaller than the length, so we read the value from the array, this is used as the pointer into the second array, and everything works nicely. When we do that with a value within the range of the array, our branch predictor learns that the branch was not taken — we did not skip the code of the gadget.

OK, so now let's look at the same thing with a cache. We have the secret cached, we have x cached, and the rest is not — in particular, the array length is not cached. Now the attacker calls with a value that is much larger than the array. We start executing, testing the condition, but because the array length is not cached, it will take some time for this if statement to execute. While we bring this data from memory, the processor wants to know whether to continue into the branch or around it, so it consults our fortune teller, which tells it the branch is not taken. So it starts executing: it takes the value that x points to — and because x is large, we now get a value from the secret — and the rest continues the same way as before: the value goes into the cache. At some stage, the array length arrives at the processor, and the processor realises that it was wrong — it was a misprediction — and it takes everything back. But we still have our trace in the cache, and we exploit it the same way.

The beautiful thing about this attack is that we can do it from JavaScript: we just write similar code in JavaScript, test for the array length, and we can use it to read everything within the browser. That only worked for browsers before they were fixed — now it's a bit harder to do the same attack.

The second variant of Spectre is a bit different: it uses the branch target buffer. In this scenario we have the attacker, which runs Spectre, and we have a victim that runs whatever program it runs, and they are protected from one another — the attacker is not expected to be able to read anything from the victim. Now look at the address spaces of both of them: the attacker has its own address space, and the victim has its own. Because we use virtual addresses, these could be the same addresses on both sides. In the victim, we have some indirect branch — in this case, it takes the value of a register and branches to the location in the register — and we have some gadget that contains code that basically does a read from memory, times 256. The important thing is that we don't need the branch to ever be able to jump to this gadget; these can be totally unrelated pieces of code. (How many bugs are in this code? I just found one, at least. Anyway.) What we will exploit is that we teach the branch predictor that this branch is supposed to go to that gadget, and then have the victim execute it.

So how do we do it? In the attacker's own address space, at the same virtual address as the victim's jump instruction, we take the address of the gadget and do an indirect jump to it, and at the location of the gadget we just put a return, or whatever lets us recover and keep executing. Then we execute this piece of code. The attacker just runs a branch within its own address space, and it does so several times; after enough times, the branch predictor learns that this is the branch that happens from this location. Once that happens, we can actually execute the attack: we cause the victim to execute this branch, and because we have speculative execution, the branch target buffer — the same branch predictor — tells the processor that execution is supposed to continue from that gadget, and from there it follows the same way as the attack before. So these are the two attacks.

Now we'll look at the other side: how we protect our systems from these attacks. First, Meltdown. The solution for Meltdown is simple: use a separate address space for the kernel. If we look at the process as it was before, we split it into two different address spaces: one of them only contains the user-level mappings, and that's where the process runs; the other contains the kernel, plus a copy of the mapping of the user, because the kernel wants easy access to that data. The main problem that was initially seen with this is the overhead of crossing address spaces: whenever we make a system call, we need to move from user level to kernel level, and that now takes longer, because we need to switch address spaces. Performance results that were published range from almost no impact to a 30% performance hit, depending on the workload.

Another problem with this solution is that it's not easy to implement. Not long ago, Microsoft had to issue an emergency patch, because it turned out they had "solved" Meltdown by giving users full access to the kernel space. And the more important issue is that this does not really address the core problem. The problem we have with Meltdown is that page-table protection — the protection the processor provides — doesn't really work. The solution moves the kernel part of the address space into a different address space: it treats the symptom. It's like giving Panadol to a person to treat pain caused by a tumour — it relieves the symptom, but the problem remains, and any system that relies on page-level protection, on the processor providing protection, will fail to work.

The second mechanism, suggested for variant 2, is to add processor features that prevent prediction across security domains. Basically, that means that instead of one branch predictor, we have two: whatever we teach the first one, the other one doesn't learn, and vice versa. On a technical level, we have three layers of protection. The first, for processors with Intel hyper-threading, prevents learning across hyper-threads: whatever happens in one hyper-thread does not affect the other hyper-thread on the same core.

The second is that code running at a higher security level does not learn from code running at a lower security level: whatever the user space teaches the branch predictor does not affect the kernel when the user switches to the kernel. And to protect across contexts — when we run one program after another at the same protection level — there is another instruction that just kills the branch predictor: whatever was in the branch predictor is deleted, and we no longer rely on it.

Technically, this should solve all the issues. Practically, when Intel issued the first patch, it caused some machines to stop working — Microsoft, in this case, just disabled the fix, and others did similar things. I think it was two or three weeks ago that Intel released new patches, and they also told us that they are not going to patch every processor, because they don't know how to do that. AMD was much more careful, and their patch came out, I think, yesterday — so until now, AMD processors were vulnerable, and now we will learn whether AMD is doing a better job than Intel.

Last, Spectre variant 1. The most commonly suggested solution is compiler patches to block speculative execution. The idea is that if our footprints appear here, we don't let the code execute beyond the point where it starts leaving footprints: when we reach this position, speculative execution stops, and then if we learn that we went the wrong way, there are no footprints and nothing to exploit. Theoretically this works, and it's very easy to do: we just need to put in fence instructions that stop speculation at this point, on both sides of any conditional branch, and we are protected. The problem is that these barriers are expensive: in tests that we have done, SHA-256, when we add all these barriers into it, executes about 60% slower, which is usually not acceptable.

So instead of using that everywhere, the suggested solution is to use static analysis: our compilers look at the code, decide whether it may leak information or not, and put in a barrier where they see something that may leak. Microsoft has done that — there is a patch in MSVC — and there are other patches for LLVM, so we have some patches around. We tested one of them — or rather, Paul Kocher did — and the result is that, yes, it does block some patterns that leak information: exactly those that we mentioned in our paper. Paul tried 15 different patterns; two of them were blocked, the other 13 were not. The problem is that it is not really easy to analyse the code and see whether it leaks or not. They are going to keep trying, and it will probably get better, but the problem will still be there.

Another suggested solution is to limit the memory access — the use of the data. One example is to replace the piece of code we've seen earlier with something along these lines: instead of testing that we are within bounds, we force the value to be within bounds — we just reduce it modulo the size of the array. That ensures that everything is inside, and we cannot read things outside the array — and we can do it without the bounds test.

Technically this works, but it only works for the scenario of arrays — it only works against the sort of attacks that we have shown. It does not address the core problem: that speculative execution may leak information in other scenarios, where we don't test an array length. So we are left with a problem that we don't really have a good solution for. And this mirrors what some of us may remember from the 80s and 90s — the scenario we had with buffer overflows. When those started appearing, people suggested solutions that are still used today; non-executable stacks blocked the attacks that were popular in the early days, but as long as we don't address the core problem, the problem still remains.

So we believe that Spectre is going to haunt us for many years, even though Intel promises the next generation of processors will solve it.

To sum up: for the past decades, processor development focused on performance. There was some work on security, but not enough, and we got the result with Spectre and Meltdown — hopefully these are the last ones. This is not much different from what happens in software development: time to market and performance always beat security, and so we have bugs in software. But the problem here is that it's much harder to fix — it is very difficult to fix a processor once it's out in the market. The second thing is that we are going to live with this problem, so if anyone is interested in looking at it more, I think it will be a very good area to investigate for at least the next two or three years. And the third thing is that microarchitectural channels, particularly covert channels, now matter: being able to eliminate them, or to understand them and know what is happening there, is much more important than we thought it was until a few months ago. In that respect, I would like to recommend that everyone go to Qian's talk today, because she is going to speak about how we block — or how we don't block — these channels. And with that I will conclude, and I'll be happy to take questions.

[Applause]