
Don't Panic! CrowdStrike, the Biggest PC Cyber Attack That Never Was & Its Lessons

BSides TLV 2024 · 27:39 · 30 views · Published 2026-03 · Watch on YouTube ↗
About this talk
On July 19, 2024, a faulty CrowdStrike Falcon sensor update crashed 8.5 million Windows PCs worldwide, disrupting hospitals, banks, airlines, and causing billions in damages. This talk dissects what actually happened—a simple out-of-bounds read in kernel code that bypassed testing—and extracts lessons for software architects, security engineers, and managers. The speaker also examines similar failures at Cloudflare and highlights how Intel's vPro technology enabled faster recovery for some organizations.
Original YouTube description
July 2024, a regular day, until all at once, around the world, 8.5 million Windows PCs crash, taking with them endless businesses and government organizations, disrupting, and even harming the lives of millions, and causing billions of dollars of damage. In this lecture I will take you into the technical details of what actually happened. I will also share the very valuable lessons, relevant for SW, Test and CyberSecurity engineers, architects and managers. Finally, I’ll shortly shine a light on the vital role Intel and its vPro (AMT) technology played in the recovery efforts.
Transcript [en]

for having you wait here a little bit. Now is actually the best time to be here, because I don't know if you looked outside, but it's really raining, so now you're stuck with us until the end of the day. After lunch we chose a very interesting subject for a talk, about something that touched almost everybody. It was a huge CrowdStrike event that gave the internet a blue screen, and that doesn't happen very often. Since then, of course, it has happened twice to other companies. So we have Gal to tell us about that. Gal is a senior cybersecurity and software architect at Intel with more than 25 years spanning kernel development, embedded systems, and

security across platforms and operating systems, bringing deep architectural expertise to dissect what really happened with CrowdStrike and what defenders can actually learn from it. Gal, welcome.

Thanks, Inbar. And thanks everyone for joining me after your lunch. I will try to give you a nice dessert with this talk about CrowdStrike and this cyber attack, the biggest one that never happened. The only problem with me giving you dessert here is that this talk is supposed to inspire a little panic in you. That's of course if you haven't brought your towel with you. Anyone brought a towel with them? If not, I can offer merchandise later. But if you want to enjoy this, then try not to panic. And if you're thinking that you've heard about Cloudflare and are wondering why he's talking about CrowdStrike, then don't worry,

we'll also talk about Cloudflare as part of this. Inbar already introduced me, I think. I've had a chance over the years, given my background, to do black hatting and white hatting and software product development. I have a passion for software architecture and its aesthetics, and I have a passion for baking cybersecurity in very early in the product development life cycle, and when I looked at this it kind of resonated with me. I hope it will resonate with you, and hopefully not create too much panic. One thing that's very important when we're thinking about this lecture: it's really not about CrowdStrike, and it's

really not about Cloudflare. They're just the ones providing us with case studies, call them cautionary tales, and they're giving us a very rare opportunity for a look into the postmortem processes within large-scale development organizations. The point, and that's really what I want you to take from this, is that the events that happened here and their conclusions are not specific to these companies. They could have happened in any large-scale software or hardware development company, or really any company that deploys software widely, and the conclusions are much the same. So let's see what we have here. What happened? Let's start with CrowdStrike. CrowdStrike,

on July 19th, 2024, a Friday mind you, decided it was a good idea to push a software update to a sensor software of theirs called Falcon, which resides on Windows-based end-user machines and tries to detect cyber threats and mitigate them. Well, they pushed the update, and then all of that happened. Basically, they crashed eight and a half million PCs into a blue screen. And not only a blue screen, but a persistent blue screen that continued forever. It started with billboards in Times Square. It continued into canceled operations in hospitals, banks, border crossings, courts, and thousands and thousands of canceled flights. The economic impact was huge. Just in

Fortune 500 companies, we're talking about roughly five billion dollars of damage. In the UK economy it was estimated at two billion pounds. So, kind of a small thing. If you look at the timeline of all of this, it started at roughly 4:00 a.m. Greenwich Mean Time, when CrowdStrike decided to release that file with that update to machines all over the place. Within a few minutes they got the update, they processed the update, they crashed, and they stayed crashed, as I mentioned. It needs to be said in their favor that within 90 minutes CrowdStrike already understood where the issue was. They reverted the

problematic file and released a fixed file, which didn't really help all those blue-screened machines much, right? In general the industry went into mayhem for several hours. It took roughly 12 hours for CISA, the US cyber organization, to officially declare that this was not a cybersecurity attack. It took CrowdStrike roughly 36 hours to actually publish a full manual of what to do, and we'll talk about that a little more in a second. Microsoft tried to help the next day with a Windows PE release that was tuned for the recovery effort. But in general, it took about 10 days before CrowdStrike could actually declare that 99% of their customer base

had actually come back online. So it was a big deal. Along the way, the US government, the US Department of Transportation, started an investigation over all the flights, and by September CrowdStrike was invited to the US Congress. So, kind of a big deal. It's a nice anecdote that, in terms of how the financial experts looked at this, in the first minutes there were a lot of predictions that this wouldn't have any effect on the global economy. The CEO of CrowdStrike knew differently already in those first minutes, because he had already started the public apologies; he kind of knew where his stock was going. But again, in their favor, it should be said that they were able to

resurrect their investor confidence, and that's not something trivial given what occurred. So it was a big deal, especially for those unfortunate IT guys who needed to handle this. They realized they were in trouble, kind of on fire, so they waited patiently for CrowdStrike to release the instructions, and they got kind of a book; these are just the names of the chapters in that book, with all the different permutations of how you have to deal with this issue. And they understood that they would have manual, tedious, on-prem activity to do, basically going to each and every machine and taking care of it. And I'll say this is

the only little point in the talk where I have this pride in my own organization, Intel. The small thing here is that people who bought Intel vPro platforms had it better, because if they had the Active Management Technology running on the little subprocessor that runs in those platforms beneath the OS, and works even if the OS does not, then all they had to do was remotely connect to those machines, reboot them, open a command line, change the file, and that's it. They were done within a few minutes, unlike all those other poor bastards. So that's a small point about Intel. And that, by the way, helped a lot of

organizations, and today it is helping Intel sell a lot of vPro platforms. But more importantly, let's get back to our thing here. So CrowdStrike, what happened there? The CrowdStrike Falcon sensor is based on an AI model that is used to detect threats, and CrowdStrike updates it through a collection of data from all their sensor network and through human intelligence. At that point they push updates from their cloud back to the sensor in the form of what are called channel files. Those channel files go into an interpreter that's based on regular-expression handling, and it uses a template file to define the legitimate structure of the data in

those files. So what happened, or more precisely, when did it happen? It actually didn't start in July. It started way back in February, with the release of a new version of the sensor that included a new template file that had 21 input parameters in it instead of the original 20, and that release actually had a wildcard that allowed the 21st parameter to be disregarded. All was fine until that Friday, when CrowdStrike released their new update. I'm sure you can guess by now what that little update was: they created a template that enforced the 21st parameter, and that's all fine. So the content interpreter went ahead and read the 21st

parameter. The only problem was that the code of the interpreter was not updated for 21 parameters, and it did an out-of-bounds read, the simplest cyber bug that we know, right? Only here it happens in the kernel. So in Windows you get a blue screen, and if it's a driver that comes up on each boot, that's a persistent blue screen. All fun for everyone. But how did this happen? How did it pass the integration tests and the release tests and the stress tests? How did it pass the initial deployments, which CrowdStrike have claimed they did? So again, CrowdStrike did a really deep dive into finding where the problems were.
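
To make the failure mode concrete before we get to their conclusions, here is a minimal, purely illustrative C++ sketch (not CrowdStrike's actual code; every name is invented) of an interpreter that was compiled for 20 input parameters and is then handed a template that references a 21st:

```cpp
#include <cstdio>
#include <cstddef>

// Illustrative only: the interpreter was built when channel files carried
// 20 input parameters, so its internal array has room for exactly 20.
constexpr std::size_t kCompiledParamCount = 20;

struct ChannelEntry {
    const char* params[kCompiledParamCount];  // filled from the channel file
};

// A "template" rule that now tells the interpreter to evaluate parameter
// index 20, i.e. the 21st parameter, which the array above does not have.
struct TemplateRule {
    std::size_t param_index;   // 20 in the faulty update
};

const char* evaluate(const ChannelEntry& entry, const TemplateRule& rule) {
    // BUG: no check that rule.param_index < kCompiledParamCount.
    // With param_index == 20 this reads one element past the end of the
    // array: undefined behavior, and in kernel mode a likely page fault,
    // a bug check, and a blue screen on every boot while the file persists.
    return entry.params[rule.param_index];
}

int main() {
    ChannelEntry entry{};            // 20 (null) parameters
    TemplateRule rule{20};           // the new template demands the 21st
    std::printf("%p\n", static_cast<const void*>(evaluate(entry, rule)));
}
```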

Their reports, and Microsoft's reports, because they pulled in Microsoft to help with this, are very interesting, and I encourage you to actually go and read the originals, but I'll digest things here for you. It started with understanding that they were missing some compile-time checkers. Basically, they generated the files without actually checking their assumptions about the consumer software and how it's written; those assumptions were not checked at the source. So they could have basically stopped the whole thing before deployment if they had had this, and they added those checkers. They also understood that something needs to be done on the consumer software side, so they added

input validation code to check the boundaries of that array they were reading.
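
A sketch of that kind of consumer-side bounds check, again with invented names, illustrating the pattern rather than CrowdStrike's code:

```cpp
#include <cstddef>
#include <optional>

constexpr std::size_t kCompiledParamCount = 20;

struct ChannelEntry {
    const char* params[kCompiledParamCount];
};

struct TemplateRule {
    std::size_t param_index;
};

// Bounds-checked version: an out-of-range index coming from a template or
// channel file is rejected instead of being dereferenced.
std::optional<const char*> evaluate_checked(const ChannelEntry& entry,
                                            const TemplateRule& rule) {
    if (rule.param_index >= kCompiledParamCount) {
        return std::nullopt;   // reject the input; log and fail gracefully
    }
    return entry.params[rule.param_index];
}

// The producer-side idea ("compile-time checkers") is the mirror image:
// refuse to generate a template that references more parameters than the
// deployed interpreter supports, checked at generation/build time.
```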

From that, they actually understood it's not a discrete event with that specific line of code, and they went into a more full-fledged secure code audit from that point on. Then they looked at the testing, and again they saw things that were familiar. All the testing was of positive flows and of hardcoded content: they had 12 test cases, and they used channel files and template files that were manually authored and were not updated when the actual code was updated. They also saw that there was no test actually checking the input validation code in any way. So they added negative testing, and they added fuzz testing, first for the template file and channel file parsing and later for more interfaces, because they created a roadmap of fuzzing.
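
For that fuzzing conclusion, a minimal libFuzzer-style harness might look like the sketch below; `parse_channel_file` is an invented stand-in for whatever entry point actually parses these files, not CrowdStrike's real API.

```cpp
#include <cstdint>
#include <cstddef>
#include <string_view>

// Stand-in for the real parsing entry point (invented for illustration).
// It should never crash or read out of bounds, whatever bytes it is given.
bool parse_channel_file(std::string_view data);

// libFuzzer entry point. Build, for example, with:
//   clang++ -g -fsanitize=fuzzer,address harness.cpp parser.cpp
// The fuzzer then feeds mutated inputs until it finds a crash or a
// sanitizer report, exactly the class of bug that slipped through here.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
    parse_channel_file(
        std::string_view(reinterpret_cast<const char*>(data), size));
    return 0;
}
```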

And those were not the only conclusions; they had a couple more. One was around deployment: they understood they needed a staged deployment with more control, more awareness for the customers, and even some control for the customers to allow them to time things. And they understood that there were some software architecture conclusions, again related to cybersecurity, the first of which was: let's move activity from kernel to user space. That should be very familiar here, right? Least privilege and all. So you can see the new partition they did for that specific software: of course they moved the parsing of files coming from the net to user space, and only the conclusions go down to the kernel. Microsoft used that opportunity to push their UMDF driver model, and that's a very good thing; Intel, by the way, is also working with them on that. They also pushed memory-safe languages, and not only them; since then CISA and others have continued to push this, and of course Google has been pushing it for a long time, and Rust is of course the classic candidate for the kernel. I was really hoping to tell you about Safe C++, which was an initiative that was

really picking up in the last year with some extensions, but unfortunately the C++ standards body rejected it in the last couple of months. So we are back to Rust as the main candidate, and hopefully something will come up for C++, but that doesn't mean you can't mitigate these issues in C++ at all.
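
As a small illustration of that last point, and only as a sketch rather than a claim about anyone's actual code: even staying in C++, a bounds-checked accessor makes this failure mode reject the bad index loudly instead of silently reading past the end.

```cpp
#include <array>
#include <cstdio>
#include <stdexcept>

int main() {
    std::array<const char*, 20> params{};   // 20 slots, as compiled

    // operator[] is unchecked: params[20] would be undefined behavior.
    // at() is checked: it throws std::out_of_range instead of silently
    // reading past the end. In kernel code you would map this to an error
    // return rather than an exception, but the principle is the same.
    try {
        const char* p = params.at(20);
        std::printf("%p\n", static_cast<const void*>(p));
    } catch (const std::out_of_range& e) {
        std::fprintf(stderr,
                     "rejected out-of-range parameter index: %s\n", e.what());
    }
}
```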

Okay. So I think you all saw, being cybersecurity people, that if we had only followed cybersecurity best practices, even one out of all that we've seen, we could have averted that catastrophe, right? So we want the cybersecurity best practices from the get-go. Okay, we saw it for CrowdStrike, but let's look at another example, a completely different example. Well, maybe not a completely different example. Cloudflare is a completely different company. It's a content delivery network, a CDN provider, and it basically spreads servers in the geos of the consumers so that when they ask for websites they get better performance, better resiliency, and some protection against distributed denial-of-service attacks and other security threats. So it's a very different company in what they do. But again, they needed to send out an update of their software. Fortunately they did it on a Tuesday this time, or not so fortunately. This time it was for something called the bot management system, which is something running on their servers; it's not going to the end

user machines. And they broke the internet. They broke the internet. Basically they blocked access to endless amounts of websites, big and small, starting from ChatGPT and Twitter and going to League of Legends and the New Jersey transit system. It was all fun, and they of course apologized a lot and again looked deeply, and they really did, into what happened there. You can see, by the way, how the internet was blocked for those several hours; it actually took about three hours and 20 minutes for this to be mitigated. Even those three hours came to around a quarter of a billion dollars in economic impact. The only problem was that it happened

again just last week, and this time on a Friday, and just a different server. This time it was the web firewall; they pushed an update to that one and again broke the internet, luckily this time for just 25 minutes. So they kind of improved in their speed. But the amount of apologies increased, and so did the discussion of the fragility of our internet infrastructure and the dependency on a few large infrastructure providers. I'm sure that discussion will continue to pick up and it will be a very interesting one, but let's focus for now on what happened in those last couple of incidents. So, the first time, we're talking about an update to

the bot management system. Just like in the CrowdStrike case, it's a cyber detection mechanism that's based on AI models; it tries to monitor bot activity on the net, spot malicious bot activity, and prevent those bots from accessing the websites served via the CDN. They continuously update that data, and they send out updates to the servers via a feature file, not channel files like in the case of CrowdStrike, completely different. But this time they do it very rapidly, every 5 minutes. These files are generated from queries into an internal database that Cloudflare has. And what actually happened here is a combination of two things. Their IT

guys did something very nice: they actually relaxed the permissions on that internal database for the sake of internal users and applications. But that coincided with the fact that the query for the bot system was permissive; it was not exact. So once the permissions were relaxed, it pulled in a lot more information than it needed. Luckily they didn't have any leak of security secrets or privacy items, but they still wound up with a feature file that was roughly double the size it should have been. And as their CEO put it, that larger-than-expected file immediately found its way to

all those servers. So when it did, luckily the software there on the servers actually had a check: it identified the fact that the file was the wrong size. However, it then just decided to panic and disable the service. So we didn't get code execution or another crash, but we still disabled the service.
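
The fail-safe alternative that the conclusions point to later would look something like this minimal sketch. Cloudflare's actual proxy is written in Rust, so this C++ with invented names (`FeatureSet`, `BotManager`, the size limit) only illustrates the pattern: a bad file is rejected and the last known-good configuration keeps serving, instead of the service disabling itself.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <utility>

constexpr std::size_t kMaxFeatureFileBytes = 1 << 20;  // illustrative limit

struct FeatureSet { std::string raw; };

// Invented helper: accept or reject a newly pushed feature file.
std::optional<FeatureSet> try_load(const std::string& bytes) {
    if (bytes.size() > kMaxFeatureFileBytes) {
        return std::nullopt;            // oversized: reject, do not panic
    }
    return FeatureSet{bytes};
}

class BotManager {
    FeatureSet active_;                 // last known-good configuration
public:
    explicit BotManager(FeatureSet initial) : active_(std::move(initial)) {}

    // Fail-safe update: a bad file is dropped (caller logs and alerts) and
    // the service keeps running on the previous configuration.
    bool update(const std::string& bytes) {
        if (auto next = try_load(bytes)) {
            active_ = std::move(*next);
            return true;
        }
        return false;                   // traffic still flows on old config
    }

    const FeatureSet& current() const { return active_; }
};
```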

Now let's look at the other example. This time they sent an update of a file with updates to the rule set of the firewall. They did a very small thing: they just disabled a single rule. But the code that parses the rules had a small line that operated on a rule object, and it kind of assumed that object would not be null. Nobody checked that it's not null, it was not written in any memory-safe language, and so that server crashed. As the CEO said, again, this was a straightforward bug that had been there for many years, nothing new. Cloudflare tried to build on the fact that they actually already had a new version of that server implemented in Rust; they just hadn't gotten to deploy it. So kudos to them for actually moving in that direction, though I'm not sure that means they should have had this bug originally, right? Because we can solve this without Rust, right?
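
The shape of the missing check is as simple as it sounds. A hedged sketch, with invented names, of the kind of one-line guard involved:

```cpp
#include <memory>
#include <vector>

struct Rule {
    bool enabled = true;
    void apply() const { /* apply the rule to the request */ }
};

// A disabled or missing rule shows up here as a null entry, the case the
// parsing code "kind of assumed" could never happen.
void apply_rules(const std::vector<std::shared_ptr<Rule>>& rules) {
    for (const auto& rule : rules) {
        if (!rule || !rule->enabled) {   // the one-line check that was missing
            continue;
        }
        rule->apply();
    }
}
```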

In the end you can see the summary of their conclusions. They talked about hardening the ingestion of their files, so not just of external files coming from untrusted sources but of their own internal files, and that basically makes sense when you're talking about mission-critical systems like that, right? They are very sensitive. They'll enhance their rollout and deployment mechanisms, again like CrowdStrike. In their case, they also had this thing about streamlining their break-glass capability; basically, they're talking about the methodologies and tools used in emergency recovery, actually adopting them, testing them, and seeing that they'll really work when you need them, and work fast. They

also looked at the design. They'll review the failure modes and the error conditions and the patterns that you all probably know well; for mission-critical systems they'll adopt a fail-safe, fail-open approach wherever possible and only leave the fail-secure approach to those places that really must have it, so those systems will default to operating rather than being disabled. And they found some additional issues on the periphery. For example, the systems for log compilation and for collecting core dumps were stressed and overwhelmed when the actual event happened; they were not originally stress-tested for something like this. And when they actually had to recover,

they didn't just need to recover the main system, they also had to recover these systems, so it prolonged the time of the full recovery. So they were looking into this as well. A couple more things should be said here. Just like in the CrowdStrike case, they could have checked the input-validation assumptions at the source, right, and not deployed the files; the CEO talked about it. And in terms of the memory-safe languages, okay, they are very important, but they don't eliminate the need for good design. I mean, in the second incident maybe Rust would have done it, if it had been deployed on time, but in the first

incident the code was actually in Rust, but it still panicked, it still disabled the service. So again, you need to be careful in your design; look at the business objectives and the security objectives when you do it. And again we come to the same basic conclusion: if we had followed cybersecurity best practices from the get-go, we would have been saved from that catastrophe, or couple of catastrophes. So, very different but very similar. If we try to wrap it up into takeaways, for a very wide audience of roles in the organization, the first thing is that you can use these examples as great training material within your

organizations, or if you're providing mentoring to other organizations, use these; they are great. From the cybersecurity perspective, they exemplify how malware that's been widely distributed can cause a lot of harm, and how effective supply chain attacks can be; it was seen here very nicely. But real malware won't necessarily stop with the OS. It can do much worse things, right? It can try to inject code, to leak information, and it can actually try to hamper any attempts at recovery. So there's a lot of work to do here, and of course the IT organizations really should adopt, as fast as possible, better recovery mechanisms and tools for their OSes, for platforms, for

servers. Product software organizations need to employ staged deployment; this is essential. And it still boils down to the fact that cybersecurity best practices are good. They're not just good for protecting us from cyber attacks; they actually work against ordinary bugs, they improve our resiliency, from threat modeling to fuzz testing. It's all critical. And if you sum it up with AI, of course: AI can help you very much with coding best practices, with test coverage, with better and smarter recovery mechanisms, and those are the good things about AI. The last thing is that AI can also do harm, right?

We're putting so much new AI-generated code into production-grade systems, into the product code, the DevOps code, the test code, that just like the regular cybersecurity best practices, we now actually need to adopt security-for-AI and AI-for-security mechanisms and methodologies to help us with this, and not just for cyber attack protection, but to make our software more resilient and to handle the ordinary bugs. So I hope it was interesting and I hope you enjoyed it. I'm leaving you with a list of links if you want to delve deeper into any of the subjects I've mentioned, and my contact details if anyone wants to ask more questions or

give feedback. And thank you.