
Hello, I'm doing very well. How are you? >> I'm very happy to introduce you to bis 2025. I'm very excited for this conference. We're talking about the inside story of Google's discovery and remediation of a critical CPU vulnerability. So please welcome Yousef.
Okay. Hi everyone. My name is Yousef Hussein. I'm going to do this presentation in English, and I'm a security engineer at Google. First, a little bit of an introduction. I joined Google about two and a half years ago, and I've done multiple things there. I'm a security engineer and I've worked on a bunch of security things. I've actually moved between teams; within those roughly two and a half years I've already worked with four teams. But this talk is more related to the first team I was part of at Google. Basically, I was in a team called the vulnerability
coordination center, which is the team responsible for the mitigation of large-scale critical vulnerabilities that affect Google products. When we have something pretty big, say somebody found a way to easily shut down Google Search, what do we do about that kind of problem? We escalate it to the vulnerability coordination center, which has the experts capable of addressing it in a reasonable time frame. Then there is the other team you might be familiar with: the bug hunter team, or the bug bounty program team. Sometimes security researchers find vulnerabilities in Google products and report them to Google. When you report a vulnerability in a Google product, it goes to the Vulnerability Rewards Program. I was also part of that team at the same time. Sometimes somebody would report a vulnerability to this team, and the person triaging it would discover that it's too big: we need to escalate this to somebody who can orchestrate a large-scale remediation effort. They would then escalate it to the vulnerability coordination center, and somebody from that team would handle it. Since I was on both teams, there were times I would triage a vulnerability, decide "this is big, we need to escalate it to the vulnerability coordination center team," and then be the same person on the other side who picks it up and manages the remediation effort.
So, also as part of the introduction: one of the things I worked on is development. I like to develop tools, so I'm also a software engineer. When I was in the vulnerability coordination center, there was a process we had to follow. One of our roles was making sure we were always aware of the current vulnerability landscape in the public domain. Somebody would analyze public-domain information and see: if somebody is talking about a vulnerability, is it really something we should care about or not? Sometimes we say okay, it's not important; sometimes it is important. That work was manual, so one of the things I did was work on the development of a system to help automate it. The manual version looked like this: go to public vulnerability intelligence sources, then look into Google infrastructure to see if this really affects us, then decide whether we need to fix it or can just ignore it. The automation basically replaced a lot of this with code doing the same kind of work, but more efficiently. It can look at vulnerability intelligence data a lot more effectively, because we can get that data through APIs from the intelligence sources, and then automatically correlate it with Google's infrastructure to find out if we have any service affected by a particular vulnerability. At times the code would automatically close the case and say this is not important; sometimes it would say no, this is important and we need to fix it, and then it would facilitate the workflow around fixing it. I already talked about the vulnerability coordination center, but there's another interesting part, which is how we actually do that.
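To make that automation concrete, here is a minimal sketch of what such a triage pipeline might look like. Everything in it is hypothetical: the feed format, field names, and matching logic are invented for illustration, and Google's actual system is not public.

```python
# Hypothetical sketch of automated vulnerability triage: take advisories
# from an intelligence feed, match them against an asset inventory, and
# auto-close the ones that do not apply to any asset.

def triage(advisories, inventory):
    """Split advisories into auto-closed and needs-fix buckets.

    advisories: list of dicts like {"id": ..., "affected_products": [...]}
    inventory:  list of dicts like {"name": ..., "product": ...}
    """
    auto_closed, needs_fix = [], []
    for adv in advisories:
        hits = [h["name"] for h in inventory
                if h["product"] in adv["affected_products"]]
        if hits:
            needs_fix.append({"id": adv["id"], "assets": hits})
        else:
            auto_closed.append(adv["id"])
    return auto_closed, needs_fix


advisories = [
    {"id": "CVE-0000-0001", "affected_products": ["acme-db"]},
    {"id": "CVE-0000-0002", "affected_products": ["acme-cache"]},
]
inventory = [{"name": "prod-1", "product": "acme-db"}]

closed, open_cases = triage(advisories, inventory)
print(closed)      # ['CVE-0000-0002']
print(open_cases)  # [{'id': 'CVE-0000-0001', 'assets': ['prod-1']}]
```

In a real pipeline, `advisories` would come from an intelligence API and `inventory` from an infrastructure database; the filtering idea is the same.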
Now, I could just tell you all about how we do it, but there's a better way: I'm going to go through an actual scenario so you get a feel for what the process looks like. First, before I get into the story itself, I need to build a little bit of context so we're all on the same page. Can I ask a question? Can anybody look at these bits and tell me what they are? I'm glad you said no, because I'm not expecting anybody to; this is not a trick question. As you know, in computers you write high-level code that ends up being bits when it's transmitted through the network or stored in a computer. This is what it looks like, and humans are not expected to read this and make any sense of it. But there is a way you can interpret it; you just have to speak a different language. If you're familiar with something that looks like this, can you raise your hand? Okay, that's good. I remember the first time I was introduced to this, it was in the context of somebody telling me: you don't have to read bits, you can just read this, because it represents what the bits actually do in the machine. And I was looking at it and thinking, wow, that didn't help; it didn't really get me much further. I know there are some smart people who love writing stuff in assembly language. I'm not one of them; I find it challenging, and I'm not really expecting a lot of people to need to do that. But this is an instruction that tells the computer to do something. When you write something in a high-level language, it ends up looking like this, and feel free to write the instructions or the program in this language directly. You can do that too.
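Since we can't run the slides here, here is a toy Python stand-in for the instructions covered next in the talk: the register-to-register MOV, the implicit-operand REP MOVSB, and the prefix-bit trick for reaching 16 registers. This is a deliberately simplified model with no flags, no direction flag, and no real instruction encoding; it exists only to make the semantics concrete.

```python
# Toy model of a few x86 ideas from the talk (heavily simplified).

def mov(registers, dst, src):
    """MOV dst, src: copy the value of one named register into another."""
    registers[dst] = registers[src]

def rep_movsb(memory, esi, edi, ecx):
    """REP MOVSB: copy ECX bytes from the implicit source [ESI] to the
    implicit destination [EDI], one byte per repetition."""
    mem = bytearray(memory)
    for _ in range(ecx):
        mem[edi] = mem[esi]
        esi += 1
        edi += 1
    return bytes(mem)

def register_number(prefix_bit, field):
    """Borrow one bit from a prefix to extend a 3-bit operand field,
    so 4 bits can address all 16 registers (0..15)."""
    return (prefix_bit << 3) | field

regs = {"eax": 7, "ebx": 42}
mov(regs, "eax", "ebx")          # mov eax, ebx
print(regs["eax"])               # 42

print(rep_movsb(b"abcde.....", esi=0, edi=5, ecx=5))  # b'abcdeabcde'
print(register_number(1, 0b011))                      # 11
```

The last call shows the 16-register trick: a 3-bit field can only name registers 0 through 7, but with one borrowed prefix bit the same field reaches registers 8 through 15.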
So what this instruction does is move something from a source to a destination. You can tell that, right? It says MOV. It's moving from this source to this destination. I don't know why they wrote it in this order, but it's just like that. The source and destination here are registers in the computer; these are the names of the registers. It's as simple as that: one register is EBX, one register is EAX. That's what it looks like: moving something from here to here. There's another instruction that's also a move, but this one is a little more interesting. This one doesn't have an explicit source and destination. The reason is that this particular instruction is special: it implicitly dictates what the source and destination are. The machine already knows them, so it just moves from a specific source to a specific destination; it's pre-programmed to do that. Now, there's something that comes before it: REP, which is short for repeat. It basically means: do this more than once. This REP is called a prefix. So we're telling the machine to do the move, and telling it to do it more than once. What happens if I repeat the prefix itself, like REP REP and then the instruction? As it turns out, nothing happens. It works just fine and the machine just ignores the extra prefix. Same thing here. Now, the other thing I want to introduce quickly is that CPU manufacturers try to make CPUs cheaper, and one of the major ways to do that is to make them more efficient. There was a feature developed for this called ERMS, short for Enhanced REP MOVSB, which is really just an optimization to make the CPU do these moves faster. But there was a catch with that feature: it only pays off when the data you're moving is more than 128 bytes. If it's more than that, it's fine; the optimization makes sense. However, if the data is less than that, the overhead isn't worth the investment, because the setup takes a bit of work from the CPU, so anything under 128 bytes shouldn't really use that optimization. That's where the other feature, FSRM, came about: to address that challenge. With FSRM, the CPU can still efficiently move data from a source to a destination even if the data is under 128 bytes. FSRM is short for Fast Short REP MOV. I'm sure the people who came up with these names are brilliant, but I really have no idea why they
came up with these names. They're just very difficult; every time I do this presentation I have to go and read them again. Really terrible choices. One more thing I want to mention: the source and the destination, which are called the operands of the assembly instruction, are each encoded in three bits. Three bits means you can address how many registers? Eight. But with 64-bit machines we have 16 registers. So what are we going to do? How are we going to address 16 registers? What they thought they could do is use a prefix that would normally be ignored by the CPU. But in this case, we're not going to ignore it; we're going to borrow one bit from the prefix. That way we have four bits for the source and the destination, and with that we can effectively address all 16 registers in the machine. Now, except for anybody who works at Google, because there are a bunch of them here, surprisingly: does anybody know what this logo represents? Okay, that's good. It's by design; I'll talk about it a little later. So now let's talk about the response story, now that we're all assembly experts. We have security researchers at Google, and all they do all day long is try to find problems in computers: find vulnerabilities, ways you can hack into computers. One of the techniques that
they follow is fuzzing. Fuzzing is really just introducing a lot of parameters, different sets of conditions, into the computer, and then monitoring it. Once it behaves in a way that's not expected, you capture that state along with all of the parameters, and then you investigate why the system had that problem. Especially if it crashes: for those of you familiar with exploit development, you know that when you crash a computer, that's a major milestone. It's a good indication you can find something very interesting. So they were doing fuzzing, and it turns out they introduced a certain set of conditions, including the use of the prefix that lets you address 16 registers, on CPUs that support this feature, which is a set of CPU families created by Intel. It basically leads to this. And what I mean by this: it leads to a system crash. And this was really how Reptar came about. Some of you might have heard of it; it's a vulnerability in Intel CPUs. So the researchers worked on this, found it, and were like: wow, this is big. We're going to have to report this and escalate it to the vulnerability coordination center I talked about earlier, because now we have a vulnerability that is potentially critical, and it's a problem. Generally, when there is a vulnerability like this one, we set something called the embargo period. It's about three months where we say: we need to work on this problem, address it, and make sure we're all good, because after the three months, we're supposed to tell the world about it so everybody else also gets a chance to fix the issue. So, I'm a Nintendo fan, so I had to do this; I couldn't help it. But at least now you get to see we're going from here to there, and there's a set of steps we're going to go
through in detail. We start with something called the impact assessment. We have a vulnerability; we want to understand how bad it is. How much of Google does it affect? How much of the world does it affect? So first of all, think about it: you've got a machine, and when you run the exploit that was developed by the researchers, it just crashes the machine. That's it. Does that sound like a problem to anyone? If you're crashing your own machine, is that a problem? Why is it a problem? You're nodding your head, and now you're regretting it. I'm just kidding. So you crash the machine, but a machine doesn't always have to be used by one person. Let's think about this other scenario. This person here, I don't like him because he seems to be making stupid statements, but ignore him. There's a saying that sharing is caring, right? When you're having food, they say share with everyone; it's always good. But actually, no, it's not always good: if I know you're sharing food with someone and I want to poison you, I don't have to put the poison in your food. I can put it in your friend's food. Which means the sharing here introduced an attack vector. So let's assume, in another scenario, we have a machine shared by my sister and my brother, and the machine has a hypervisor, so each one of us has a virtual machine. I could run the exploit and crash the entire machine, affecting everybody else's work on that machine. It's really the idea of sharing that introduces the problem. So I have a question: does anything come to mind in terms of what could be affected by this at a crazy large scale? What kind of environment would that be, or what technology? If a name of a product comes to mind, feel free to say that too. So where do we normally... oh, go ahead. Virtual machines. That's correct. But is there something else you want to say? Running what? Running on... okay. But nowadays, where do the majority of our services run? In the cloud. That's the key word. So the number one thing affected would be any cloud service, and Intel CPUs are pretty popular; they're used pretty much everywhere. In regards to Google, it's Google Cloud. If you know this logo, it's Google Cloud, the best cloud. Google Cloud was the most important part of this response, because it was the most critical product, and it was affected in a very big way.
So that was part of the impact assessment. We knew we had one big product affected by this. What we do next is look at other products. But before we get into that, this is what it looks like: on any cloud with shared resources on a machine, you run the exploit in the attacker VM and you crash all of the other VMs. As simple as that. Now, I showed this logo earlier, so let me give a little bit of history. Back around 2003, Google grew very big, and we were like: well, this is too expensive. Every team has their own virtual machine or their own server to run their services within Google; that isn't going to work anymore. We're going to have to be a little more efficient. So we created a system that would introduce sharing, which is the core problem we're talking about, so that teams can share resources, resulting in more efficient use of them and much cheaper servers. That system is called Borg. Borg is the system that manages the efficient distribution and management of software across all of Google's data centers. And now that you know what Borg is, can anybody tell me what else came into the public domain as a result of Borg? Something that does something similar to what I just described. Anybody familiar with the product that starts with the letter K? Yeah, Kubernetes, right? That's how Kubernetes came about; it actually resulted from Borg. It was born out of Borg within Google. So we were talking earlier about the problem with sharing, right? Sharing is not always good, and a fundamental principle of how Borg works is sharing of resources. So Borg inside Google was affected by this too, and pretty much all of the services that run within Google were affected in some way, because they run on Borg. And where do we use Borg? All of our data centers globally. Within Google we actually use Borg a lot more than we use Kubernetes. So is that all? I've been talking about Google Cloud, servers, and all that. Can anybody tell me where else we might be affected big time? Or maybe not just we, Google; generally, where else could this problem transpire? Maybe something a little bit more dangerous.
Any what? Any cloud provider. That's one, but I was trying to refer to something else. Let's not think about servers at all. I mean, in your laptops, do you use Intel? Yeah, right. Laptops are normal computers, and a lot of client devices also have Intel CPUs. When it comes to Google, we do have a product affected by this: Chrome OS devices. A lot of them run on Intel, and we needed to fix that. But everybody else also needs to fix their client devices running Intel CPUs. Here's the thing, though. With servers, we share them, so it's a problem to crash other people's machines. But on clients, is this even a problem? Why would crashing your own laptop be a security problem? It's your own laptop, and you're running the code yourself; you might as well just download malware right away and run it, right? Sometimes it would just slap you across the face: twenty seconds and you're going to get a reboot, and it just shuts down for you. Can anybody tell me why this could still be a problem on clients? What? Well, that could be the outcome, but we're talking about crashing. So here's the thing: in the past, some vulnerabilities that would crash a service could end up being something a little bit bigger. And like I said earlier, when you crash a machine, this can be a milestone toward doing very interesting things. I'll get into more details in a minute. So again, this guy is saying it's no big deal; it seems like nothing is a big deal for him. When we were evaluating the vulnerability within Google, the Intel CPU was a black box. We don't get to see how it functions internally in detail, because we're not Intel. Intel does know that. So when we reported it to Intel, they did their own analysis, and what happened was: they found out that you can do privilege escalation with the same vulnerability. And this is when things got more interesting. Now you can basically own the machine, and that includes servers and clients. And with clients it's a lot more interesting, because if you could get somebody to go to your website and, through the browsing session, hack their client device, that's a big thing, right? Privilege escalation, of course, is an awesome thing; as an attacker, I would love to find privilege escalation vulnerabilities. So now that we know the products that
are affected within Google, the next step is: how many servers, how many clients do we have? We know the products, but how many machines, how many devices? This is something we do at Google: we have systems where we can run direct SQL-like statements and quickly figure out how many machines are affected. For example: find me all the machines in Google Cloud that are running this particular Intel CPU family, and I get the response right away. So we got these numbers and put them in the impact assessment document. And here is the next challenge: how are we going to fix it? We know we're impacted, we have all of these products, so what are we going to do? There are multiple things we can do. Number one, we could just wait for the vendor. We could wait for Intel and just tell them: hey, can you please give us a fix? With CPUs, when the vendor fixes a vulnerability, they give you something called a microcode update. So that could be one solution. But as a lot of you probably know, the vendor doesn't always give you a fix quickly, so sometimes it's a good idea to find your own solution. So we thought, okay,
maybe we could also think of a way to address the problem ourselves. We wanted to investigate, within Google, how we could fix this problem while waiting for the vendor, and somebody came up with an idea. For those of you who are very familiar with CPUs, can anybody tell me what she's thinking? It's a solution, kind of. So what it is is something called the chicken bit. It turns out that when some CPU manufacturers introduce a new feature in the CPU, they include a bit you can flip to disable that feature. I think they call it the chicken bit because if you don't like the feature, you can chicken out of it, like you're scared. So that's what it is. Earlier I said that for this vulnerability to work, you need the FSRM feature to be enabled. If you need that enabled to exploit the machine, how about you just disable it? Then nobody can exploit the machine anymore. But there was a problem with this approach. Can anybody tell me what it is? The problem is that FSRM, as I said, is a performance feature, an optimization. If you disable it, the performance of all of the machines goes down by about 10%, and that's something we cannot accept. But we thought we could keep this solution in our back pocket, just in case: if the vulnerability gets leaked, we can use it, because if a hacker finds this, hacks into us, and uses the vulnerability to shut things down, losing 10% is better than losing 100%, right? So that was the idea. The good news is that the vendor did actually provide the microcode update. They were like: here is a microcode update you can apply to all your machines, and you'll be fine. So these are the solutions in general, on the server and on the client: the chicken bit works but hurts performance; the microcode update works and doesn't affect performance. So we're fine; we're going to go with the microcode update. Now there's the next problem we had: how are we going to apply the fix to the machines? This is a big, big number of machines. How are we going to do this? And we only have a short time until the embargo period finishes.
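On a stock Linux host, the two levers discussed above can be sketched roughly like this: the kernel reports whether a CPU advertises FSRM as a flag in /proc/cpuinfo, and a staged microcode update can be hot-loaded without a reboot by writing to a sysfs node. The parser below takes text as an argument so it can be exercised anywhere; the reload path is the standard Linux late-loading interface, but treat the snippet as a sketch (it needs root, an update staged under /lib/firmware, and kernel support), not a production tool, and nothing here reflects Google's internal tooling.

```python
# Sketch: check for the FSRM CPU flag, and trigger a Linux microcode
# hot-load (late loading) without rebooting the machine.

def has_fsrm(cpuinfo_text):
    """Return True if any 'flags' line in /proc/cpuinfo-style text
    lists the fsrm (Fast Short REP MOV) feature."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            if "fsrm" in line.split(":", 1)[1].split():
                return True
    return False

def hot_load_microcode(reload_path="/sys/devices/system/cpu/microcode/reload"):
    """Ask the kernel to apply the staged microcode update to all CPUs
    in place, no reboot (requires root and kernel support)."""
    with open(reload_path, "w") as f:
        f.write("1\n")

sample = "processor\t: 0\nflags\t\t: fpu sse2 erms fsrm\n"
print(has_fsrm(sample))                      # True
print(has_fsrm(sample.replace("fsrm", "")))  # False
```

Note that the flag check uses whole-token matching, so the `erms` flag on the same line is not mistaken for `fsrm`.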
Before I get to the rollout, there's a term: PV-tested microcode update. PV means production validated, which means we've tested it in production and it's fine. So, server microcode rollout: how are we going to do this? There's the typical approach, the one you follow on your laptop, or when you update your phone or your Nintendo Switch: you apply the update and then you have to reboot, right? So we could do a normal BIOS-style update where you just apply the firmware update and reboot the machine. The problem is that with something like Google Cloud, you can't just say "reboot the machines." We have a lot of machines used by customers for critical workloads, and we can't just reboot everything. So we thought about the alternatives, and it turns out there's a way we tested that works, called hot load. Hot load means you can apply the CPU firmware update in memory without rebooting the machine; you don't affect the machines at all, and the processes running on them keep going transparently. We tested that and it seemed to be very successful, so we used it and rolled out the microcode update with this approach. Just to give you a little more clarity on how this works: in a typical machine you have persistent storage and you have memory, and the firmware goes into memory when you turn on the machine, and then the machine functions normally. Now let's imagine we have a vulnerable firmware, the one in red. What we can do is fix persistent storage, like I said earlier, but then I have to reboot the machine: turn it off, turn it on, and now the fix is in memory. That's the rebooting approach. With the other approach, you just update memory directly and don't have to reboot at all. So that's the approach we deployed at large scale. When it comes to Chrome OS, it was fairly simple. With Chrome OS, we upload the update online, and the clients can download it and update the devices themselves, just like you update your Windows machine or Linux or whatever. We can also push it to the client devices transparently, and they're updated. It's the server side that gets a little complicated. So this whole thing happens in a short time frame, which, as you can imagine, can introduce a lot of
effort and a lot of people involved, and sometimes people can panic in a response like this. Now, at Google, when we discover a vulnerability that is critical, even though we didn't get hacked, we still declare an incident and deal with it like a hacking incident: we already have a problem, so let's follow the protocols. We have something at Google called IMAG, short for Incident Management at Google. It's a set of protocols we follow to respond to critical incidents like this one. When you join Google and you're part of an incident response team, you get trained on it so you can follow it. One of the most important things in the IMAG set of protocols is that you have a team with a clear set of communication instructions, and a good plan for how you're going to address the problem. So I'm going to go through some details of what the response team looked like. This might inspire some ideas about how to address vulnerabilities of this scale, and also give you an idea of how big this can be. So this is me: I was the incident commander of this vulnerability remediation. It would have been awesome if I were the only one on the team; you could just press a button and be like, we're done, right? But unfortunately that's not the way it works. So we have somebody called the operations lead, who leads all operations. Then we have the data center response lead; this is somebody who deals with data center mitigations. We've got the cloud response lead: somebody who leads the Google Cloud mitigations, and under that lead there's a team working on them. We really need dedicated people in all of these different parts of the response. We also have the Chrome OS response lead, to lead the Chrome OS part of it. And we have the corporate systems response lead; this is the team that leads internal systems remediation, which also includes a platform security lead who deals with security. The tech research lead: this is the lead, and their team, responsible for continuing research on this vulnerability and on mitigation development, making sure we understand it going forward if anything new happens, for example if somebody outside comes up with an exploit, so they can help with the research aspects. The vendor engagement and tech lead: this is somebody who leads a lot of the technical aspects of the response, but is also responsible for talking with Intel. Google had an NDA with Intel.
We cannot actually talk about this outside of Google, and even within Google, not everybody can know about the vulnerability until it's published; it's only a specific team. Then there's communication. Let's say we want to talk about this publicly; we want to publish it. I can't just, as an engineer, go on Reddit or a YouTube channel and say: hey, we fixed this vulnerability, everybody! We can't do that. There's a specific way to communicate this publicly, and that's why we really need experts in communication within Google to lead the communication aspect of it. And even there, there are internal comms and external comms. Then of course there's the legal lead; there are a lot of legal aspects that also need to be considered. The last one is the non-Google Alphabet response lead, somebody who deals with the non-Google Alphabet companies. Can anybody tell me what that means? Okay. So basically, Google is part of a company called Alphabet, and Alphabet doesn't only have Google; it has other companies. For example, there's a self-driving car company called Waymo: it's an Alphabet company, but it's not Google. The interesting part is that the NDA with Intel was directly with Google, so we could only fix the non-Google Alphabet companies along with everybody else when the vulnerability was published. But within Google, we fixed everything before publishing.
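As a side note, a checker like the icebreak tool that comes up in a moment boils down to something like the sketch below: look at the CPU's family and model and its currently loaded microcode revision, and compare against a table of fixed revisions. The table here is entirely made up for illustration; the real affected models and fixed microcode revisions are in Intel's advisory, so don't use these numbers for anything.

```python
# Sketch of an "am I still vulnerable?" check: compare the running
# microcode revision against the minimum fixed revision for this CPU
# model. FIXED_REVISIONS below is hypothetical, not Intel's real data.

FIXED_REVISIONS = {
    # (cpu family, model): minimum fixed microcode revision
    (6, 0x97): 0x2E,
    (6, 0x9A): 0x430,
}

def is_vulnerable(family, model, microcode_rev):
    """True if this (family, model) is in the affected table and its
    loaded microcode is older than the fixed revision."""
    fixed = FIXED_REVISIONS.get((family, model))
    if fixed is None:
        return False          # model not in the affected list
    return microcode_rev < fixed

print(is_vulnerable(6, 0x97, 0x2C))  # True: old microcode
print(is_vulnerable(6, 0x97, 0x2E))  # False: patched
print(is_vulnerable(6, 0x55, 0x01))  # False: model not affected
```

On Linux, the family, model, and microcode revision inputs would come from /proc/cpuinfo; a real checker may also probe the CPU's behavior directly rather than trust a table.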
So the last part is disclosure. So when we we at this point we basically fixed all of the our servers. We managed to do this on a schedule. The servers are in Google cloud are fixed. Servers in Borg are fixed and everything is tested and everything is fine. Now even before the embargo period ends. Now we need need to go through the process of disclosure which is basically telling the world hey world uh we discovered this vulnerability and here is the fix to it. You need to fix your devices that are running intell CPUs that are vulnerable. So, how do we do that? Well, luckily this isn't really the most difficult part. We fixed everything. So, we know
nobody's going to hack us. Now, it's just a matter of doing some paperwork and establishing the communication publicly. So, at this point, I was like, okay, good. We we've actually fixed everything and we're done. But now, let's talk about disclosure. Disclosure is is the form of communication that Google establishes tell the world what a vulnerability is and there is a specific document you might have come across which is security bulletin really just basically tells you the description of the vulnerability the impact of it and also how can you fix it if the customer needs to do anything or sometimes we just say in the security bulletin you don't need to do anything because we transparently fixed
everything for you now because Google cloud is used by customers customers, we have to tell the public that hey there was a vulnerability in Google cloud and we fixed it and you don't need to do anything. So at some point so after we finished the response and we did publish the vulnerability and the world knows about it we also published something called the icebreak tool. The icebreak tool is a program that you can run in your system and it will tell you if you're vulnerable. So what you can do is you can run it and then it will tell you your system is vulnerable because it's running the vulnerable Intel CPU. Then you can apply the batch and then
you can run it again and it will tell you: well, you're fine now. You're no longer vulnerable; you can move on. So it's called the icebreak tool, and we published it so that people can test their own systems. But somebody sent us a message afterwards. By the way, to give a little bit of context, Cloud Shell is a service that runs on Google Cloud. So if Cloud Shell is vulnerable, that means Google Cloud is vulnerable to this. Somebody sent us this message and they said: "Just heard about Reptar. I tried running the icebreak tool on Cloud Shell and it worked." At this point we kind of panicked, because we were like,
"What the heck?" Like, we thought we fixed everything and now the world knows about this. And you're telling me that Google Cloud is not fixed? What the heck? And we had a bit of a scare, a bit of a scary moment there. So, that's kind of how my face looked like at the time. But the good thing is that the good thing is that uh I was actually just about to go on holiday. I was like, "Okay, I'm done with this work. I can go on holiday." But then this happened. And the good thing is that I wasn't on a flight because otherwise I probably would have done something like this. But it's good. I was at home. I think I was in the
office actually, but it's fine. But here is the funny thing. We investigated this. We were like, "Okay, did the fix not work? Let's go and test these systems. What the heck is going on? Let's try to exploit them. Can we actually exploit them?" The exploit didn't work. And then we were like, but the icebreak tool says we're vulnerable. So we had some software developers looking into the code, investigating to find out what the problem was. And what they came up with was that there was a bug in the icebreak tool. So things were fine. It was a very big, fat, scary false positive, right? Out of all the false
positives I've had in my life, this was probably the worst one. We did test multiple types of systems and everything was okay. The icebreak tool was not fine, but the systems were fine. So we fixed the icebreak tool and then we told the public: hey, here is the icebreak tool, it's okay now and you can use it. So things are okay, and of course he's going to think this is funny. Now, what did we learn from this? One of the things that I personally learned is that it was very important to religiously follow the IMAG protocol that I talked about earlier, because if
you're in a panic situation, you really don't want to be panicking, and you don't want to be focusing on the logistical, unimportant things. You want to focus on the important analytical work that needs human analysis. And that's why the IMAG protocol was actually inspired by the protocols followed by firefighters and medics, because they have the same problem. You know, when there's a fire, you don't want to panic; you just want to follow a set of protocols to address the issue. Right? So that was one thing I thought was really important to learn: we really need to follow this, and even more rigorously if we can. The other thing was automation. There was a lot of manual
work that we had to do. It went well, but it would be better if humans could focus on the important things, not on manual work that could be automated by a computer. Is Google prepared for the next one? I believe we are prepared. It's just that whoever is going to be managing this work needs to follow the protocol and make sure they don't deviate from it. But in terms of tools and technologies, Google is very well prepared. What can we do better? I always say nothing is perfect, and I don't mean that in a negative way, to be clear. Things went very well, but there's
always a way to improve. When it comes to automation, for example, we could improve it. Communication went very well, but there's also some room for improvement there as well. One of the challenges that we had with a response like this is the size of the team we had to manage, but also the fact that the team was global. It's not like we were all sitting in the same room in an office saying, let's talk and address this. It's a global team. So how do we effectively run a response to a vulnerability like this with a team that is distributed globally?
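Coming back to the icebreak tool for a moment: a checker like that generally works by identifying the CPU model and its microcode revision and comparing them against a known-affected list. Here is a minimal sketch, assuming `/proc/cpuinfo`-style input; the affected-model set and the patched microcode revision below are invented placeholders for illustration, not the real Reptar criteria:

```python
# Hedged sketch of a CPU-vulnerability checker in the spirit of the
# icebreak tool. The model list and microcode threshold are ILLUSTRATIVE
# placeholders, not the actual Reptar detection logic.

AFFECTED_MODELS = {   # hypothetical (family, model) pairs
    (6, 0x8F),
    (6, 0x97),
}
FIXED_MICROCODE = 0x1000  # hypothetical minimum patched revision


def parse_cpuinfo(text):
    """Pull key/value fields out of /proc/cpuinfo-style text (first CPU only)."""
    info = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        info.setdefault(key.strip(), value.strip())
    return info


def is_vulnerable(cpuinfo_text):
    """True if the CPU matches an affected model and runs old microcode."""
    info = parse_cpuinfo(cpuinfo_text)
    if info.get("vendor_id") != "GenuineIntel":
        return False
    key = (int(info.get("cpu family", 0)), int(info.get("model", 0)))
    if key not in AFFECTED_MODELS:
        return False
    # An affected model is still fine once its microcode is new enough,
    # which is why running the checker again after patching reports "fixed".
    microcode = int(info.get("microcode", "0x0"), 16)
    return microcode < FIXED_MICROCODE


if __name__ == "__main__":
    sample = (
        "vendor_id\t: GenuineIntel\n"
        "cpu family\t: 6\n"
        "model\t\t: 143\n"
        "microcode\t: 0x2b000571\n"
    )
    print("vulnerable" if is_vulnerable(sample) else "patched or unaffected")
```

The check-patch-recheck workflow described in the talk falls out of the microcode comparison: the same machine flips from "vulnerable" to "fine" once the updated microcode revision is loaded.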
I mentioned earlier that I do a lot of development work, and one of the things I did after this response was to develop automation for Google that helps bootstrap responses like this, automates some of the manual steps that don't need to be done by hand, and makes the response a lot more effective. So I created a system that helps create the artifacts for the response and makes it more effective, easier to manage, and simpler. The automation automatically creates response artifacts, facilitates communication and the setting up of the teams, and also helps with the identification of a mitigation and rolling it out.
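To give a flavor of what response-bootstrapping automation like this might look like: a purely illustrative sketch, where the artifact names and structure are my assumptions, not Google's internal system:

```python
# Illustrative sketch of bootstrapping the standard artifacts a
# large-scale vulnerability response needs on day one. All names here
# are hypothetical, not real internal tooling.
from dataclasses import dataclass, field


@dataclass
class Response:
    vuln_id: str
    artifacts: list = field(default_factory=list)

    def add(self, name):
        self.artifacts.append(name)
        return name


def bootstrap_response(vuln_id):
    """Create the response scaffolding so humans can focus on analysis."""
    r = Response(vuln_id)
    r.add(f"tracker/{vuln_id}")              # central issue for the response
    r.add(f"chat/{vuln_id}-responders")      # comms channel for the global team
    r.add(f"doc/{vuln_id}-mitigation-plan")  # where the mitigation is identified
    r.add(f"rollout/{vuln_id}")              # tracks rolling the fix out
    return r
```

The design point is the one from the talk: the logistical setup is mechanical and identical every time, so automating it frees responders for the analytical work that actually needs humans.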
I really hope you enjoyed the talk. I would love to connect with people in the security community, especially those who are here. This is my email; it's pretty easy. It's blackoutg.com. I'm also on LinkedIn; if you'd like to connect with me, that would be my pleasure. And thanks a lot for attending the talk.
Any questions?
Hello. Is this vulnerability similar or related to the past vulnerabilities Meltdown and Spectre? >> Apologies, can you repeat that? >> Is this vulnerability similar or related to two past vulnerabilities called Meltdown and Spectre? Meltdown and Spectre are two other vulnerabilities in processors.
>> So the question was: is this similar to Spectre and Meltdown? Yes, it is similar; it has a lot of similarities, because they all affect CPUs. Any other questions? Okay, no. >> So, thank you so much for your presentation; that was very interesting and very informative. We need to take care with the cloud, and with containers and Kubernetes as well, as explained at this conference. So, thank you so much.