
Securing Bare Metal Hardware at Scale

BSides PDX · 2018 · 52:23 · 2.1K views · Published 2019-02
Speakers: Matt King, Paul McMillan
Category: Technical
Style: Talk
About this talk
Matt King (@syncsrc) and Paul McMillan (@paulm)

Less than three years after it was discovered that the Equation Group was backdooring hard drive firmware, courses on how to create such backdoored firmware are available to the public. New exploits in BIOS/UEFI that enable bypassing OS and hypervisor protections have become commonplace. Once a device is compromised, remediation is virtually impossible; malicious firmware is perfectly positioned to block the very updates that would remove it. Truly defending against these threats requires a different approach: traditional vendor firmware signatures and secure boot implementations aren't good enough. Without mechanisms to detect and recover the firmware, a backdoor could be persistent and undetectable forever. Fortunately, nearly every device available has an existing mechanism to force it into a state that can be used to restore its writable firmware components. We'll describe how we've made use of such capabilities at scale, the challenges in doing so, and what the future holds for securing firmware.

Matt is a security geek responsible for ensuring platform and firmware trust at a cloud service provider. He has pen tested a broad range of systems, helped implement hardware implants, and has a history of rendering all manner of computing devices inoperable.

Paul McMillan enjoys drinking cocktails, breaking the internet, and doing the impossible. He also works on security at Netflix.
Transcript [en]

Hello, I'm Paul. I can barely see your laptop; your privacy screen works too well, Matt. Sorry. So, the typical disclaimer: I'm not speaking on behalf of my current employer, and Matt isn't either. We're definitely not making any announcements or talking about specific product features or vulnerabilities.

So let's start by talking about what we're not talking about. We are not going to be addressing virtual machine security here. We're not going to be talking about runtime security. We're not talking about securing laptops and desktops; those are out of scope and honestly very, very hard relative to the problem we're trying to solve here, which assumes there are guards at the door to the data center. We're also not talking about the TCG and TPMs and so on. And I know you are all here because of the Supermicro thing, but we're not talking about defending against hardware implants or vendors who are doing bad things.

What we are talking about is allowing customers to run on bare metal, which, as you will see, is a metal bear on a cloud. The particular thing we're concerned about in this context is the firmware on the devices the customer can actually access. In a VM context everything is abstracted away and customers can't get to the low level, but we want to secure things and still give people direct access to the hardware.

So, again, the firmware we're concerned about: if you look at a modern server in the cloud today, it will have between 50 and 100 different places that have mutable firmware or writable data that might be accessible to an attacker, individual chips scattered throughout the platform. We know that customers can run the firmware update utilities the manufacturers give them, and it's hard to just turn that capability off when you don't need it and turn it back on, because it's all wired directly into the hardware. Customers do that, and very often fleet management utilities, the automated tooling they run in their data center, will change the firmware on the hardware to a version they trust. They say, "We only qualified this old version, we know our workload works with that, so let's put it back to that while we're using this hardware."

That's very good for the customer; it lets them do what they want with the hardware. But when we get it back, we need to make sure that it is not malicious and that they haven't stuck an old, buggy version of the firmware on the hardware, and we need to do this consistently and reliably. When you rent a system, we let you do what you want, but when we get it back we need to be able to restore it to a pristine state. It has to be automatable, you have to be able to do it at scale, and it has to work every time. Unfortunately, update routines that depend on the firmware itself doing a good job of updating fail more often than you'd think.

The other thing we need is to be able to say: even if this device had a bug in it, and maybe we've discovered and fixed it, we can still assert that the version we're giving to the next customer not only doesn't have that bug, but that the bug hasn't been exploited to backdoor the next customer. So I'll let Matt talk to you about the background.

All right, so as Paul was mentioning, there's a lot of stuff in a modern server platform. Even a laptop, but servers are just more of the same. How does it get built? It obviously doesn't appear fully formed in a data center, ready for us to use and rent to customers. The manufacturing process that gets it there is actually pretty long, complicated, and intricate. I don't know if anybody's ever had to actually try to build a real physical server, but it's a lot more painful than you could imagine if you're used to dealing with software.

If you're familiar with the old PCI architecture diagrams of what a computer looks like, things haven't actually changed that much over the last 20 years. The I/O buses have gotten faster, and we've taken some things that used to be discrete parts and put them onto the same package, but we've still got memory controllers, application processors, I/O devices, some management stuff, and storage. The general layout, the way these components interact, hasn't changed a lot, even though the protocols may be a little different now.

One thing that has changed, though, is that literally everything has a microcontroller in it. It used to be all hardware: pre-programmed functionality, hardware state machines. That's kind of gone by the wayside. Everything uses microcontrollers, and they all have firmware of some kind. Up on this picture, I think the blue ones are the microcontrollers and the orange ones are firmware. Even for the DRAM, you've got some bit of data that tells you how big this DIMM is and what its timings are, because if you want the system to support a hundred different DIMMs from different manufacturers with different settings and sizes, you need something to tell the system what those are so it knows how to set up the memory.

This gives the system a lot of flexibility. It allows you to build things, it allows you to fix things after you've built them, and it allows you to plug different things together and hopefully have them all work, because they can sort of tell each other what they are and sync up appropriately. But it also means we now have a bunch of things to deal with. There are a whole lot more places things can go wrong, and we've now got a lot of people who are not software developers writing software to run in places that are not easily visible, accessible, or verifiable.

And this is still the simplified picture; it's worse, there's more stuff in there. Most modern high-performance devices are no longer single-core. They have multiple cores, and they might have multiple different firmware images running on them. The eMMC storage for your embedded controller probably has its own firmware in it. You've got other things on the board that may not be documented but clearly have some kind of non-volatile storage on them. Sometimes it's as simple as part identifiers, so that when you need to return something to the manufacturer they can identify which part it was and send you an appropriate replacement. Sometimes there are actual microcontrollers in your battery keeping it from exploding. Your hard drives all have them. Your USB Wi-Fi device probably has separate microcontrollers on both the USB side and the Wi-Fi side.

So if you want to persistently store malware, there's a whole bunch of places you can stick it. And the ability of the system to investigate what's in, say, the hard drive firmware is generally pretty limited. If you look at the protocol specifications for a SATA hard drive, there's no command that says "tell me exactly what your firmware is." There's an update routine, and if you go look at the Linux utility, it says very clearly: don't do this, you'll brick your hard drive. So not only can people not inspect it, they've been told explicitly not to even try updating the firmware on a bunch of these devices because it's dangerous. And it can be, if the manufacturer's done a bad job building the device, but that also means you're not getting security updates when there are problems with it.

So that's what's in a server. How does it come to be a server? There are tens of thousands of different components in a server, from passives, to small active components like muxes, to the big CPUs that we think of as the actual part of the server we're interested in. They're built all around the world; South Korea, Taiwan, and Israel build a lot of individual components. In most cases these are commodity things. So when you say "I want to build a server, and I need 10,000 of this resistor to put on the motherboard," somebody in your supply chain organization goes out and orders 10,000 of them from wherever they can find them, either available or cheapest. Then they show up at the contract manufacturer, and the contract manufacturer uses them. There may or may not be good tracking all along the way, from the place that built them to the place that packaged them to your contract manufacturer.

But even assuming you have that right, we've got components coming from one set of manufacturers and PCBs being built in entirely different places. There are plenty of places that do a lot of PCB manufacturing, but the components and the PCBs are not generally manufactured together: the components come from one vendor, the PCBs come from another, and then they all go to a third place where they get assembled. The assembly is usually where we talk about contract manufacturers, where all the pieces end up in the same place and somebody solders them together. That's great, but if you don't have tracking on what comes into the contract manufacturer, you're not really going to have a good idea of whether what you're actually getting is the thing you wanted.

I know there was a talk this morning about this part of the supply chain, where you've got a whole bunch of things coming in. I think it was Joe Fitz who made the comment about just replacing a reel. At the contract manufacturer, they're going to have buckets full of the specific components and instructions on where to solder them down on the PCBs, and you can inject new things into the supply chain at that point. That's sort of where they were saying it happened: somewhere in the process of taking all the individual parts, putting them in the same place, and soldering them together. Tracking all that is a really hard problem to solve, and we didn't really attempt to solve it.

We were worried more about just the firmware, because for all of these components, all of the ones that are programmable, at any point in this process somebody could be putting hands on them, programming them, testing them. In fact, hopefully they are putting hands on them and programming and testing them to make sure they work before they assemble everything and send it to you. So there's going to be firmware on these devices. Everything that's programmable is going to have some firmware on it by the time it leaves the contract manufacturer, because hopefully they've done some level of testing, and if they haven't, you get really upset with them and go yell at them and make them do it, because you want them to send you working devices.

And then they travel all over the world. I know there are the stories about customs interdiction of devices and reprogramming of them, because they have to cross three or four borders to get from point A to point B. As you've seen from the slides, probably every part in your server has been around the world two or three times by the time it lands in your data center. So there have been plenty of opportunities for people to touch it, test it, reprogram it, and do something that you might not have good record keeping of, and probably no good way to verify before it gets to you and you're ready to start running that server with whatever's on it.

So a big part of the problem we're trying to solve here is: once the system arrives, it's in my data center, it's now in a secured physical location; how do we know what's really in there? Obviously there's been some testing that verified the system functions, or we would have rejected it and sent it back. But in most cases the firmware on there is not going to be what we actually want. The process of building this server has taken, on a good cycle, three months, so whatever versions of firmware were installed to test it as it was getting built are almost certainly out of date by the time we get it and want to run it. Some of them might still be okay, but the odds are that everything on there is out of date and probably has known bugs that we want to fix and remediate. We want to give everybody a pristine system with a known set of firmware on it, so before we hand it to a customer, how do we verify that on the system when it's ready to deploy, when it's ready to use?

So before we go into what we did to solve this problem, I'm going to go through a few of the existing solutions that are in the wild today.

The first place you want to start is signed firmware. This means that the vendor signs the bits that they put into the device, and this should prevent unintended code from running, if they have done everything perfectly and have no bugs. Most devices are moving to signed firmware; the majority of manufacturers care about this because it keeps customers from bricking their devices. So that's generally a good thing, and NIST really wants you to do it. This works best when one vendor controls the system end to end.

The limitations, though: all the signature is saying is "these are the bits that I intended to put in." It doesn't say there are no bugs, it doesn't say there are no backdoors, and it doesn't even say that it can prevent unsigned code from running; it just says that, using the normal process, we try to run only signed code. The other problem is that runtime firmware could prevent the installation or update of the new firmware you want to put in. This is a fairly common feature of BIOS backdoors: the people developing them want to prevent you from taking their backdoor out, so accepting an update, saying "great, I've applied that," and then dropping it on the floor is a pretty easy thing for runtime firmware to do, and it's very hard to verify that that sort of thing hasn't happened.

The next technique people are using is what's called secure boot. This is definitely better. It generally consists of a very small ROM that's masked into the device, which takes a checksum of the firmware that is about to load and makes sure it's signed by the vendor before it actually runs it. This is great because it means your code is checked at every boot. The downside is that it doesn't give you any runtime protection, and it still has the issue with configuration, which is expected to be changed at runtime by the customer. The vendor can't sign that, because it doesn't know what the right values are, and oftentimes we'll see exploits where the exploit is written into the configuration area and then exploits a parser to get code execution, even though the device is running signed code from the vendor.

The other difficulty with this is that it's very hard to end up with a functional revocation mechanism. If you have a downgrade vulnerability, or the vendor has lost their keys, updating often requires actually replacing the hardware. That's hard to do at scale, and it's very expensive. By the way, the configuration vulnerability I discussed is how we got into the Intel ME the last time, about a year ago; the most recent one.

The other thing that we have in devices, or that we would like to have in devices, is a way to measure what firmware is running. This is rarer; we're working on making it more common. Usually this is a process where the device signs something using a private key stored in the device and gives you back that value, and when this is implemented correctly, it gives you assurance that the code that is running is the code you intended to load. TPMs are supposed to work this way, and Google's Titan chip works this way. This is also limited, though. The biggest limitation is that very few devices support it, and the measurements are often unstable as firmware changes and as pieces of configuration change. You can't hash over the configuration, for the same reason you can't sign it, and unless you're in a very, very carefully controlled environment (and even then), serial numbers, MAC addresses, and so on mean that you don't necessarily have matching hashes. So this is hard to do at scale. It's easier if you have very, very similar hardware, but it's difficult to do effectively.

So, our challenge: signing is insufficient as currently implemented, most devices don't give us measurement, and we need assurance about the running firmware. We need to know what is inside our devices.
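To make the measurement-instability point concrete, here is a minimal Python sketch of a TPM-style "extend" chain, where each measured component is folded into a running SHA-256 register as new = H(old || H(component)). It is not a TPM client: the signing (quote) step is left out, and the component names, image bytes, and serial/MAC values are hypothetical. It shows why hashing over per-device configuration breaks fleet-wide matching: two servers with byte-identical firmware agree on the code-only measurement but diverge once their serial numbers and MAC addresses are measured too.

```python
import hashlib

PCR_SIZE = 32  # SHA-256 digest length in bytes


def extend(pcr: bytes, data: bytes) -> bytes:
    """TPM-style extend: fold the digest of `data` into the running register."""
    return hashlib.sha256(pcr + hashlib.sha256(data).digest()).digest()


def measure(components: list[bytes]) -> str:
    """Start from an all-zero register and extend each component in boot order."""
    pcr = bytes(PCR_SIZE)
    for blob in components:
        pcr = extend(pcr, blob)
    return pcr.hex()


# Hypothetical firmware images, identical on every server in the fleet.
bootloader = b"bootloader v1.2"
bios = b"bios v4.7"

# Hypothetical per-device configuration regions.
config_a = b"serial=SN0001;mac=00:11:22:33:44:55"
config_b = b"serial=SN0002;mac=00:11:22:33:44:66"

# Measuring only code: identical firmware gives identical measurements.
code_only_a = measure([bootloader, bios])
code_only_b = measure([bootloader, bios])

# Measuring code plus configuration: per-device values make them diverge.
with_config_a = measure([bootloader, bios, config_a])
with_config_b = measure([bootloader, bios, config_b])

print(code_only_a == code_only_b)      # True
print(with_config_a == with_config_b)  # False
```

Because extend is a one-way chain, the configuration's contribution can't be subtracted out afterwards; a verifier would need an expected value for every distinct serial/MAC combination, which is the at-scale problem described above.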

As someone noticed from my water bottle here, I've not always done security. Before I did security, I did hardware engineering and validation. One of the things about pre-production hardware is that, just like your first attempt to compile software, it doesn't ever actually work. The first version of something new is going to have a lot of bugs. Things you expected to work, even very simple things, very critical things, are not going to work, and you're going to end up either with a very expensive doorstop that you thought was going to be a prototype you could use for development, or wasting a lot of money building more prototypes as you fix every little bug. That gets cost- and time-prohibitive real quickly if you have to do new production cycles every time you need a bug fix while you're trying to bring up a new piece of hardware. As I mentioned earlier, manufacturing is hard and takes a long time.

So hardware manufacturers do have capabilities to take something that is not working right, that doesn't have good firmware on it, that has bugs in it, and still do something with it. The process of taking sand and turning it into a computer involves a lot of work, a lot of debug, and a lot of validation, and the hardware engineering community has put a lot of effort into figuring out how to determine "does this transistor work, does this CPU work, does this laptop work?" without throwing it out and building a whole new thing.

So what happens when you're doing firmware development? You've got a platform hot off the presses, this brand new thing in front of you, you're real excited, and you want to boot it up. You just want to get the BIOS to run through so you can get to an EFI prompt and test some things. The first thing you find is that there's no firmware. Maybe you've done something in an emulator that you have to put on here, but the very first time, there's nothing on the system to even accept a firmware update; you can't run a capsule update before you've got the very first BIOS installed. The update routines are often not the first thing you develop. You're not worried about updating something if you can't even get it to power on and accept commands. Very often when you're doing firmware development you will make a mistake and the system will hang, and if it hangs before you can get back in to update it, it's real hard to get your new code in to see if your bug fix works.

Another big problem: if you're doing signed firmware and you've implemented it incorrectly, you now can't update, because your signature checking is broken in some way that prevents it from accepting an update. Very often hardware features don't work either. If your non-volatile storage controller has some bugs, you may not be able to do writes; there may be protocol issues, or other problems that just prevent the system from working as intended. So if you're used to dealing with production hardware that mostly works, you think "we'll just run the update," and that is just not something that's possible on pre-production hardware. You can't rely on most of the normal functionality you would use to do things, because it's either not there, not implemented, or just not working properly.

These problems have largely been solved. We are able to ship new systems; most of the people in this room can go out and buy a new computer or work on new systems. This obviously isn't blocking us from shipping things; it's not an unsolvable problem. What's happened is that most hardware, both components and higher-level platforms made of components put together, has recovery mechanisms that allow the developers to keep running tests and load new firmware into the device when it's not working. These mechanisms don't depend on certain levels of functionality of the software; they're explicitly designed to require as little of the system as possible to be in a functional state. JTAG, for example, is really designed to work with a very, very small number of transistors actually operating normally, and to let you find the other ones that aren't. So if there are problems with the system, as long as your JTAG controller is working, you should be able to get in there, start querying other parts, put instructions in and make sure they're executing properly, and do things to the system in a way where it doesn't really matter what state the rest of the system is in. It's designed to work regardless of what is currently happening on the system or what has happened in the past, because it was designed under the assumption that probably everything is still broken, and we need to be able to diagnose it, figure out what's going on, and make it work again.

Lots of systems have recovery mechanisms built into ROM. A real common implementation is that you put a jumper on a header on the system, and now it goes into some failsafe mode that allows you to recover the system into a known state. Lots of systems have debug serial ports that are active all the time, no matter what the actual firmware is doing. And there are way too many other proprietary mechanisms to attempt to go into all of them, but everybody who's done this has had to come up with something that lets them go back and recover the system when it's not working, because you can't brick all your prototypes if you ever want to ship a product.

So we asked ourselves: there are mechanisms in the server we can use to force it into a known state, and our firmware developers are using them in the course of their day jobs. Can we leverage those for security too? I don't think anybody got the joke: this is the band Yes. Somebody suggested we play the song there, and I'm like, I don't want to rely on audio working.

So yeah, the answer is yes. We can use these recovery mechanisms to get very high assurance of the firmware on our platform. We can go in and apply the updates to the mutable runtime firmware without relying on any code that's actually running on the system. This gives us assurance because the mechanism we're using doesn't depend on the runtime code; it doesn't depend on what's currently on there that we don't know. If we assume that very malicious things are on the platform, or it's just been completely erased and the thing is a brick and we can't run the normal update routines, this gives us a mechanism to go in, sort of regardless of where the system is now, and bring it back to where we want it to be. It doesn't really matter where it is now; we believe the mechanism works regardless of what's currently on there. And when we're done using this mechanism to push the firmware we want into the system, we have very high confidence that when we start executing that firmware, it's what's really running, because the mechanism we used to put it in was oblivious to whatever else was going on at the time.

There are lots of ways to do this. As I mentioned a minute ago, options range from booting the device into a specific recovery mode to JTAG, which often allows you to load known code onto a system and start executing it regardless of what's there. Most of this ends up being somewhat specific to the device, and the process is always a little different, but almost everything we've looked at has some mechanism to do this, because that's what the developers of the system have been using.

Oh, I just covered most of this. A real common one is the jumper thing: the device goes into what's usually called a recovery or anti-brick mode, and very often it'll just sit there and wait for some custom, undocumented vendor utility to supply a new firmware image, which it then writes to its storage, and when you reboot you come back up into a known state. If you can convince yourself that the mechanism will actually write the firmware you're giving it to storage regardless of what's on there, then you can be confident when you reboot that that's what's actually there. Sometimes it's a multi-stage process: you update just the bootloader, then reboot into that and do a normal update from the bootloader. Sometimes you can use JTAG to force a known thing in there, or provide an image over a serial debug port that the runtime firmware doesn't have access to interfere with. The mechanisms vary a lot; like I say, it's quite device-specific.

So (I thought this was in the next section) how do we use this? We've got tens of thousands of servers in data centers. Putting jumpers on things is great when they're sitting on your desk, but that doesn't work when you need to do it to 10,000 systems, potentially on a daily basis. So in order to operationalize this, we built custom hardware; we kind of had to. It gives us connectivity to whatever the interfaces are on the specific components we're trying to recover. It has things like UART and JTAG and other interfaces that we can plug into different components, and we intentionally tried to make it flexible enough to support lots of different devices from lots of different vendors. It's actually a pretty simple thing, and it gives us a path to go from our server management system to actually drive whatever interface on the specific device we need in order to put it into the recovery state.

And because we're relying on this, because it's now becoming the thing we trust in order to repave all the firmware on the rest of our server, we very explicitly made it not runtime-updatable. If we need to reprogram it, we do have to put hands on it to reprogram our recovery device. That was a very intentional choice, because it means there is no mechanism for software running on the system to modify our recovery device. There are just no wires there, there's no connectivity; it can't be done, because it's not connected that way. That means when we do use it, we've got very high assurance that it is what we programmed and expected, that the firmware running on our recovery device is what we expect, and because we have physical possession of the systems and control over who has access to do that programming, we have pretty high assurance that it's actually working as intended.
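The core argument, that a recovery path which never passes through the runtime firmware gives assurance the normal update path can't, can be sketched as a small Python simulation. Everything here is hypothetical and only illustrates the trust reasoning: the `Component` class, the `recovery_strap` flag standing in for a jumper driven by a GPIO on a recovery card, and the image contents are made up. A malicious runtime firmware acknowledges updates and drops them on the floor (the BIOS-backdoor behavior described earlier), while `recovery_write` models a ROM loader that writes flash directly whenever the strap was asserted at power-on.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical device: writable flash plus an immutable recovery ROM path."""

    flash: bytes
    recovery_strap: bool = False  # stands in for the jumper, driven by a GPIO
    in_recovery: bool = False

    def runtime_is_malicious(self) -> bool:
        # Toy stand-in for "the firmware currently in flash is a backdoor".
        return self.flash.startswith(b"EVIL")

    def normal_update(self, image: bytes) -> str:
        """Update request serviced by whatever firmware is currently running."""
        if self.runtime_is_malicious():
            return "OK"  # acknowledge the update, then drop it on the floor
        self.flash = image
        return "OK"

    def power_cycle(self) -> None:
        # The ROM samples the strap before executing anything from flash.
        self.in_recovery = self.recovery_strap

    def recovery_write(self, image: bytes) -> None:
        """ROM loader path: writes flash without running any code from it."""
        if not self.in_recovery:
            raise RuntimeError("device is not in recovery mode")
        self.flash = image


def repave(dev: Component, golden: bytes) -> None:
    """Force a known image in via the recovery path, then boot it normally."""
    dev.recovery_strap = True   # assert the strap (a GPIO instead of a jumper)
    dev.power_cycle()           # ROM comes up in recovery mode
    dev.recovery_write(golden)  # push the known-good image
    dev.recovery_strap = False  # release the strap
    dev.power_cycle()           # boot from the image that was just written


golden = b"GOOD firmware v2"
server = Component(flash=b"EVIL firmware v1")

print(server.normal_update(golden), server.flash)  # OK b'EVIL firmware v1'
repave(server, golden)
print(server.flash == golden)  # True
```

Nothing `repave` does goes through `normal_update`, so what ends up in flash doesn't depend on what was there before. The same reasoning is why the recovery card described in the talk is deliberately not updatable from the host: the thing doing the repaving has to sit outside what it repaves.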

Am I doing [inaudible]? I thought you were. Am I? Oh, I don't know. Okay, I think you may have done one of my sections. Oh, yeah. So, as I was saying, we did build custom hardware. It's actually not interesting: it's not much more than an Arduino on a custom PCI form-factor card. It turns out that if you want to stick something into a server and you need connectivity to something outside the server, you don't have a lot of options for what form factor that's going to be. There are PCIe edge fingers on there, but the only thing hooked up is power. We draw power from the server and that's it; it turns out we don't need to recover servers that don't have power, since that doesn't help much. So the form factor is kind of irrelevant. It's basically an Arduino on a custom PCB, and like I was saying, it's got connectivity to a bunch of things, so we can plug some wires in, make the operations people very unhappy about the extra wires (I see nodding in the back; very unhappy), and use those wires so that, instead of having to walk up and physically place a jumper onto a header, we can flip that line through a GPIO on this thing.

A question? Questions now are fine.

I'm curious [inaudible]

It's not much. Oh, the time wiring it in? I don't actually know that. I remember the cost per board was trivial, but it did take a while to get the people doing the wiring to do it correctly every time.

That is a feature, not a bug. Okay. I don't know how many firmware updates you get from your vendors; I get a lot, and some of them don't work. Sure. How many of your internal firmware updates have critical errors?

So, I mean, we have... sorry, I should be closer to the mic. Where I know this has worked, and where it has saved a lot of time and money, is that at one point we got a firmware update from a vendor to fix a bug, and it introduced a new bug that basically bricked a bunch of devices. We were able to use this to roll back to the previous version on a bunch of devices that were otherwise inoperable and could not have been rolled back automatically. So for all the pain it caused to install and use this thing, everybody was really happy we had it when we started needing it.

Sure. So let's talk a little bit about the limitations, including what you mentioned. We had to build this custom; if we gave you one of ours, it wouldn't be very useful for your servers. The other thing is that you really need to cooperate with the manufacturers whose devices you're doing this to. You need information from your server vendor, or you're going to spend a lot of time engineering stuff, which you may anyway, even if the vendors are being cooperative, because unfortunately they often don't know how their device works either. It's a bunch of work, especially if you're trying to do it without cooperation.

The other issue is that on many of these devices the flash chip is used for debugging and for runtime metrics logging. So when the data center reliability folks come by and say, "Hey, we need to figure out why this thing failed; can we look at the last 100 hours of logs?", you say, "Well, it went to a new customer and we blew all that away." Great: the previous customer can't backdoor the new customer with that data. Not so great: we don't have the data anymore.

As you mentioned, the cabling is also really annoying, because most devices are not designed with the idea that their debug ports are going to be wired up during production in the data center. As we move forward with this, work closely with the vendors, and redesign our boards, the cabling gets better, but it's still not perfect: cables falling out, headers that are bare pins rather than latching retention, all things that enterprise server people care very deeply about for reliability reasons. I already mentioned reverse engineering. The other thing is that this very often requires you to power-cycle the device to take up the new firmware you've installed. If you're in a shared, multi-tenant environment, that may not be tenable for you.

Depending on exactly which hardware you're taking this approach to, if you are choosing, say, a network card vendor, you've got to look and see who has this capability and how they've implemented it. Again, you need to sit down with the vendor and have some conversations, and they don't usually answer correctly the first time around, because it's hard to understand the question you're asking. So if your chosen vendor can't do this, you may have a problem: you may have to switch vendors or ask them to build the feature for you.

The other thing is that for some devices the flash chip is inexpensive and you only get 10,000 writes. If you do this all day every day, you're eventually going to wear those out, and then your device stops working, so make sure that's something you've taken into account before you do it. Also, part of the reason we ended up doing this was that we knew it was possible for someone else to do it. If we have this low-level functionality that's known to blow it all away at the hardware level, we don't need anti-rollback protections like you were mentioning, because we know that when we put it into that state, that's the state it's in, and it can't be backdoored then, not in terms of the firmware. The bugs that were in the previous firmware can't keep it bricked. It will only have the vendor-supplied backdoors at that point. Yeah, giving everyone the same backdoors. So, did you have a question before?

I actually was wondering if you only did this for your host CPU firmware, or did you actually do this for all the devices?

A great many of them. Yeah, the PCIe card was wired to all the other hardware. There's custom design beyond just the PCIe card; this is what he was saying, that we could give you one of these and it would be useless if we didn't also give you the server to put it in.

Okay, so talking about things going forward: one of the things we're working towards is figuring out how to do this in-band on the device. For some kinds of devices this is already possible. Going in over the PCI Express bus, rather than having to squish firmware down over serial cables dangling around the inside of a server, is definitely optimal.

The other thing we'd like to move forward with is getting to the point where you can do a full firmware reset on the devices without resetting the host. This has the nice property that if you're using them for multi-tenant, you can still get them back into a known-good state, and it also means the recovery can go much faster. One of the challenges is that servers are very slow to reboot; if you have to do this a bunch of times, you end up with an hour or two hours. Our first unoptimized variant took a very, very long time, and if you have that in the data center, folks are going to be complaining: "We could be renting these servers; why are they taking so long to recycle?" So getting that time down, especially if you can do it online without a full power cycle, is an ideal opportunity.

The other thing we're hoping we can move forward with is being able to detect whether firmware has been modified. Right now the best you can usually do is ask the device, "Hey, what's the version number of your firmware?" It tells you, and then you either believe it, or it was lying and you can't tell that it was lying. Some folks have done very clever things with looking at the time it takes to do that, to try to figure out whether you're actually running what it says it is, but at the end of the day that's an arms race I don't think is practical to play. The better solution is to ask the vendors to give us hardware-secured ways to figure out what the firmware is. When we talked about this last time, I said that a vendor was working on this for us but I couldn't tell you who. Now Intel has actually published the draft spec of their hardware-level firmware attestation. The way it works is that they have a ROM in the device that does a checksum of the... sorry, it does a hash. Yes, I want to say checksum; it does a hash of the firmware before it boots it, and then it puts that into a locked PCIe register, so you can get that value out of the device. Because this all happens at the hardware level, you have very strong attestation of exactly what booted in that device. What you do with that value is up to you, but it allows you to see in a way that you couldn't before.

So I strongly recommend taking a look at that and commenting on it with them. There are some things in there that also solve other problems, but the bits about getting the hash of the firmware I think are very interesting. The other limitation is that this doesn't do anything for runtime integrity. That's a very hard problem, and we'd like to hear your ideas, but the fundamental issue is that until we have boot-time integrity, trying to solve for runtime integrity is maybe not worth the effort; we need to get this out of the way first and understand which firmware we're running. But that's still an area of a lot of research.

So, conclusions: gaining assurance on your firmware is hard, but it is possible. Ask your vendors for these capabilities. The more we go to the people we're buying hardware from and say, "I would like you to do these things for me so that I can secure your hardware in my data center," the more likely they are to do it. If we present a unified front in asking for these things, it's much more likely that we'll actually get change. And with that, we'll take questions.

Magic. It's got its own interface to the network. Yeah, it's externally controlled; it's not something that comes in through the system. We issue commands to it from something else that's plugged into the network.

So essentially a specialized BMC?

It is like a BMC for a BMC in some ways, yes, and it's much less complicated and much easier to audit very carefully for how it functions. Yeah, several orders of magnitude less code. So, as I said, it's small enough in terms of functionality and code that an individual can very carefully validate what it does. It is not updatable. One of the concerns we have is that all enterprise hardware is built to be updatable in situ, because that's how you deal with bugs in it. We really don't want that for this piece of hardware, so we put the firmware in once. Because it's simple and only doing a few things, it doesn't have the ability to... it doesn't even have enough storage to write out a backdoored firmware, and it's doing things like toggling GPIO pins.
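As a rough illustration of how small such a controller's job is, here is a Python model of a device that accepts only a fixed set of commands for a recovery strap and a power line, and has no firmware-update command at all. The command names, pin names, and class are hypothetical; the real controller is an Arduino-class microcontroller whose firmware was not shown in the talk.

```python
from enum import Enum


class Command(Enum):
    # The complete command set. There is deliberately no firmware-update
    # command: the controller cannot be reprogrammed through this interface.
    ASSERT_RECOVERY = "assert_recovery"
    RELEASE_RECOVERY = "release_recovery"
    POWER_OFF = "power_off"
    POWER_ON = "power_on"


class RecoveryController:
    """Hypothetical model of a minimal out-of-band controller that only
    drives two GPIO lines: a recovery strap and a power control."""

    def __init__(self) -> None:
        self.pins = {"recovery_strap": False, "power": True}

    def handle(self, raw: str) -> dict:
        """Apply one command and return a snapshot of the pin states."""
        try:
            cmd = Command(raw)
        except ValueError:
            # Anything outside the fixed set is rejected, never interpreted.
            raise ValueError(f"unsupported command: {raw!r}") from None
        if cmd is Command.ASSERT_RECOVERY:
            self.pins["recovery_strap"] = True
        elif cmd is Command.RELEASE_RECOVERY:
            self.pins["recovery_strap"] = False
        elif cmd is Command.POWER_OFF:
            self.pins["power"] = False
        else:  # Command.POWER_ON
            self.pins["power"] = True
        return dict(self.pins)


if __name__ == "__main__":
    ctl = RecoveryController()
    for step in ["power_off", "assert_recovery", "power_on"]:
        state = ctl.handle(step)
    print(state)  # {'recovery_strap': True, 'power': True}
```

Keeping the command set closed and tiny is what makes the "an individual can carefully validate what it does" claim plausible: the whole attack surface is four strings.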

Yes. Yeah, so that's being handled basically the same way. On a whole bunch of devices we have to rewrite all of the configuration settings when we do this. Yeah, I mean, we have to do things like store the serial numbers and put them back in. [inaudible]

We store that data when we put the system into the data center, and we restore known-good values: not the ones supplied by the customer, the ones supplied by the manufacturer. If the manufacturer is screwing with us, that's a different set of problems than the one we're trying to solve. Yeah, we're trying to ensure you only get the vendor-supplied backdoors and not other customers' backdoors.

[inaudible question] There's nothing that prevents us from doing that. It doesn't usually happen in practice, just because of the rate at which machines come back to be recycled. As Paul was mentioning, the current processes involve a lot of reboots, so we only run them when the systems are not otherwise being used; we have to move stuff around to vacate systems in order to run the process. So we don't generally have a hundred of these going simultaneously. Some of the limitations on how long it takes come from steps that, for instance, involve squishing firmware down over a fairly slow serial link, so that can only go as fast per machine as it can. But there's nothing that prevents us from running it very parallel; as far as I'm aware, there's no limit on how many of these we can do simultaneously. The process is individualized, though: you have to run through the set of steps on each server in the right order, or it won't accomplish a whole lot, so you have to go through that on each server individually.
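The "parallel across servers, strictly sequential within a server" structure from that answer can be sketched with Python's `concurrent.futures`. The step names and server names are hypothetical placeholders; the actual sequence is device-specific and was not described in detail.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical, ordered recovery steps; the real sequence is device-specific.
STEPS = ["vacate", "power_off", "assert_recovery", "write_flash",
         "release_recovery", "power_on", "verify"]


def recover(server: str) -> list[str]:
    """Run every step for one server, strictly in order."""
    log = []
    for step in STEPS:
        # A real implementation would drive the recovery controller here and
        # stop on the first failure; this sketch only records the step.
        log.append(f"{server}:{step}")
    return log


def recover_fleet(servers: list[str], max_workers: int = 8) -> dict[str, list[str]]:
    """Recover many servers concurrently; each server's steps stay sequential."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(recover, servers)  # yields results in input order
        return dict(zip(servers, results))


if __name__ == "__main__":
    out = recover_fleet(["rack1-u01", "rack1-u02", "rack2-u07"])
    print(out["rack1-u02"][0], out["rack1-u02"][-1])  # rack1-u02:vacate rack1-u02:verify
```

Threads suit this shape because each worker would mostly wait on slow I/O (serial links, reboots), so the fleet-level limit is how many machines can be vacated at once, not CPU.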

I don't know that we needed anything but 3.3 V, honestly. How many different I/O voltages did we have to support? We had at least one device that we were worried was going to need 1.8 V, but I don't think we're currently doing anything other than 3.3 V. It ends up being a small number, right? Everything in the server is generally running on 12 V, 3.3 V, or 1.8 V, and there aren't that many different main voltages that you have to interface to.

We build it into all the servers; it's always there. Yeah, it wouldn't scale if we had to put it in every time you wanted a new server. I mean, we build clouds, right?

One of those, yes. The device itself is quite simple.

Yeah, that's what I was getting at.

I mean, at the end of the day it does need to be connected to the network to automate it, so we do have to control access to whatever is managing it, right? Yes, we do lots of other things to control access to this and make sure that it is not misused. At the end of the day, if the people walking up to the server want to do things to it, we have a different set of problems. You have to walk up to it to change the firmware.

Yeah, and that process is actually somewhat involved. We do have some mechanisms for verifying that this has happened correctly, that the device was programmed correctly, and that it hasn't been modified. If you walk up and just try to touch it and do something not terribly clever, we will notice. There are some bits involving crypto that they asked us not to talk about.

Do you do this every reboot, or only between customers?

Between customers; it takes long enough. [inaudible] No. You can't stop customers on bare metal from shooting themselves in the foot.

Yeah, I mean, it's a feature too, right? It is a very legitimate use case to say, "I know that I have qualified this network card firmware for my software, and I don't want whatever the newest stuff is." Enterprise does that all the time, and customers want to do that, so we let them.
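Reflashing only between customers also bounds the flash-wear concern raised earlier (some parts are rated for only about 10,000 writes). A back-of-the-envelope estimate, assuming every recovery rewrites the same flash region and no wear leveling spreads the writes out; the turnover and reboot rates below are made-up examples, not figures from the talk.

```python
def years_until_worn(endurance_cycles: int, writes_per_recovery: int,
                     recoveries_per_year: float) -> float:
    """Years before a flash part reaches its rated write endurance.

    Assumes each recovery rewrites the same region `writes_per_recovery`
    times and that nothing spreads the wear across other blocks.
    """
    writes_per_year = writes_per_recovery * recoveries_per_year
    return endurance_cycles / writes_per_year


if __name__ == "__main__":
    # Hypothetical: reflash once per tenant change, one change per week.
    print(round(years_until_worn(10_000, 1, 52), 1))    # 192.3
    # Hypothetical: reflash on every reboot, ten reboots a day.
    print(round(years_until_worn(10_000, 1, 3650), 1))  # 2.7
```

The two cases show why "only between customers" matters: the same part goes from effectively unlimited to a few years of life if recovery is run on every reboot.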

Yeah, from the device.

So this is talking directly to the SPI flash, as directly as we can. The original firmware may be backdoored, or it may have a bug that allows someone to get into it. For example, about every year there are vulnerabilities in SMM that allow an attacker to write to the firmware, or write to the BIOS, on a machine. Because we're not using the BIOS itself to do this kind of update, even in a world where we know there are bugs of that kind (either we've found them or we haven't, but we know they exist), we're writing directly to the SPI flash, so we're bypassing the possibility of those bugs exploiting the update.

No, we explicitly don't want to use the firmware update routine. We're trying to use some hardware mechanism that does not depend on whatever the firmware is at this time. Any other questions? All right, thank you very much.
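The "write directly to the SPI flash instead of trusting the firmware's own update routine" idea from the last answers can be modeled as erase, program, then read back and compare. This is a simulation under stated assumptions: `SpiFlash` is a hypothetical stand-in for an external programmer wired to the chip, modeling generic NOR flash behavior (erased bytes read 0xFF, programming can only clear bits), not a real driver or the speakers' tooling.

```python
import hashlib


class SpiFlash:
    """Hypothetical stand-in for an external programmer attached directly to
    a SPI flash chip, with the host firmware out of the loop."""

    def __init__(self, size: int) -> None:
        self.data = bytearray(b"\xff" * size)  # erased NOR flash reads 0xFF

    def erase(self) -> None:
        self.data[:] = b"\xff" * len(self.data)

    def program(self, offset: int, chunk: bytes) -> None:
        for i, b in enumerate(chunk):
            # NOR programming can only clear bits (1 -> 0), hence erase first.
            self.data[offset + i] &= b

    def read(self, offset: int, length: int) -> bytes:
        return bytes(self.data[offset:offset + length])


def restore_golden(flash: SpiFlash, golden: bytes, chunk_size: int = 256) -> bool:
    """Erase, write the golden image chunk by chunk, then read it back and
    compare digests. Never consults the firmware currently on the chip."""
    if len(golden) > len(flash.data):
        raise ValueError("image larger than flash")
    flash.erase()
    for off in range(0, len(golden), chunk_size):
        flash.program(off, golden[off:off + chunk_size])
    readback = flash.read(0, len(golden))
    return hashlib.sha256(readback).digest() == hashlib.sha256(golden).digest()


if __name__ == "__main__":
    flash = SpiFlash(4096)
    flash.program(0, b"\x00" * 16)       # simulate a modified image on the chip
    golden = bytes(range(256)) * 8      # illustrative 2 KiB "known-good" image
    print(restore_golden(flash, golden))  # True
```

The design point is that whatever was on the chip before, including a backdoor that would intercept an in-band update, has no say in the erase or the read-back check.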