Anyone reading this post has probably heard of the CrowdStrike outage, which shut down IT systems around the globe as a result of a faulty update to its Falcon sensor software. In this article, we’ll cover the possible reasons that led to this mass event, the countermeasures, and the lessons to be learned.
1. Cybersecurity Is Everywhere
Technology is all around us. Nearly every modern business process has one or more information technology systems behind it, and with great power comes great responsibility. This vast number of information systems is fertile ground for cyberattacks, which increases the need to integrate defensive cybersecurity systems into the infrastructure itself.
CrowdStrike is a cybersecurity company whose Endpoint Detection and Response (EDR) product is designed to identify and stop attacks on endpoints and servers. Due to its distributed nature, it is deployed on every machine it protects, making it a core part of the critical infrastructure of any IT system that uses it.
When a critical bug is found in a key component of critical infrastructure, the consequences can be devastating. In fact, even days after the outage, dozens of organizations had yet to recover from the incident, because the fix requires IT administrators to apply the mitigation station by station.
2. The Outage Might Have Been Prevented
So what was the root cause of such a huge event, and what could have prevented it? Naturally, CrowdStrike is not sharing much information, but it seems that a faulty channel configuration file was deployed as part of the latest update, causing a critical exception in CrowdStrike’s Windows driver, which in turn led to the Microsoft Windows BSoD (“Blue Screen of Death”). This was probably not the result of a cyber incident but of human error.
What went wrong?
From Microsoft’s perspective, the design and behavior of the Windows operating system is at fault: an unhandled exception in any kernel driver causes the whole system to shut down. Perhaps after this incident, Microsoft will change this behavior and allow a graceful shutdown of the faulty driver alone, without bringing down the whole system. On the other hand, if the faulty driver belongs to a critical component such as an EDR, shutting it down may leave the station completely unprotected. Still, that is better than entering a restart loop that cripples the availability of the entire system.
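Whatever Microsoft decides, the driver itself could have failed safe. Below is a minimal sketch of that principle in Python, not actual driver code (the Falcon sensor is a Windows kernel driver): an untrusted channel file is validated before use, and if anything looks wrong, the component rejects it and keeps running on its previous configuration instead of crashing the host. The file format, field names, and function names here are hypothetical.

```python
import json
import logging
from typing import Optional

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("sensor")

def load_channel_file(path: str) -> Optional[dict]:
    """Parse and validate an update file; return None if anything looks wrong."""
    try:
        with open(path, "rb") as f:
            raw = f.read()
        # Hypothetical file format; JSON is used here purely for illustration.
        data = json.loads(raw)
        # Validate before use: required fields must exist and have sane types.
        if not isinstance(data.get("rules"), list) or not data["rules"]:
            raise ValueError("missing or empty 'rules' section")
        return data
    except (OSError, ValueError) as exc:
        # Fail safe: log the problem and reject the update, rather than letting
        # the error propagate and take down the whole system.
        log.error("rejecting channel file %s: %s", path, exc)
        return None

new_config = load_channel_file("channel_update.bin")
if new_config is None:
    log.warning("update rejected; continuing with the last known-good configuration")
```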
3. Never Underestimate the Power of QA
The biggest surprise of this entire incident was that the bug was not caught before being deployed to millions of computers around the globe. It seems that every station that received the update immediately failed with Windows’ BSoD, without exception, so it is reasonable to assume that this bug could have been easily detected both by automated testing as part of the CI/CD (Continuous Integration/Continuous Delivery) pipeline and by manual QA (Quality Assurance) testing. An automated test could have deployed the latest update on a Windows machine and checked its health; a manual QA test would look the same, only performed by a human tester.
Were these steps absent from CrowdStrike’s driver development process?
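As a rough illustration of the automated check described above, here is a sketch of a pre-release smoke test, assuming a dedicated Windows test machine and some deployment tooling. The host name, the deploy-tool command, and the package name are hypothetical placeholders for whatever a real CI/CD pipeline would use; the point is simply that the pipeline deploys the update to one canary machine and refuses to ship it further if that machine stops responding.

```python
import subprocess
import time

TEST_HOST = "win-canary-01"  # hypothetical dedicated Windows test VM

def deploy_update(host: str, package: str) -> None:
    # Hypothetical deployment step: push the update package to the test machine
    # and trigger installation (replace "deploy-tool" with the real tooling).
    subprocess.run(["deploy-tool", "--host", host, "--package", package], check=True)

def host_is_healthy(host: str, timeout: int = 300) -> bool:
    """Poll the machine until it answers a ping, or give up after `timeout` seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        # "-n 1" is Windows ping syntax; use "-c 1" on a Linux CI runner.
        result = subprocess.run(["ping", "-n", "1", host], capture_output=True)
        if result.returncode == 0:
            return True
        time.sleep(10)
    return False

if __name__ == "__main__":
    deploy_update(TEST_HOST, "channel_update.bin")
    if not host_is_healthy(TEST_HOST):
        raise SystemExit("Smoke test failed: test machine did not recover; blocking the rollout")
    print("Smoke test passed; the update can proceed")
```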
4. Managers Must Invest in Quality and Security
A mature SDLC (Software Development Lifecycle), which includes mature automated and manual QA processes, usually goes hand in hand with a mature Secure SDLC (Secure Software Development Lifecycle), which includes all of the procedures, tools, and processes that ensure software security. Organizations that fail to invest enough resources in software QA often fail to invest in security as well. If there’s one takeaway from the CrowdStrike outage, it’s that managers in every organization should ask themselves whether they are investing the needed resources in quality and security.
5. The CrowdStrike Incident Is Somewhat Like SolarWinds
Are the CrowdStrike and SolarWinds incidents similar? Yes and no.
They are different because, first, the CrowdStrike incident was probably not the result of a cybersecurity attack but of a bug, while the SolarWinds incident was a cybersecurity attack in which malicious actors intentionally injected malicious code into an update of SolarWinds’ Orion IT platform, giving them a foothold in the networks of all of SolarWinds’ customers.
Second, while the CrowdStrike incident is painful, its impact is limited to system availability; in SolarWinds, the internal foothold led to the compromise of data confidentiality and integrity, and potentially availability as well, resulting in far greater impact, achieved silently over a much longer period of time.
That said, the two incidents are somewhat similar: both involved a faulty update, intentional or not, from a third-party provider on which organizations heavily relied as part of their critical infrastructure, and both caused massive business impact.
But here is a key difference: while there is nothing a CrowdStrike customer can do except apply the mitigation after the incident has happened, in cybersecurity incidents compensating controls can detect an attack in its early stages and limit the impact. This is how the cybersecurity firm FireEye eventually unearthed the SolarWinds attack.
In conclusion, the CrowdStrike incident should serve as a warning to every company to check whether it is investing the needed effort and resources in automated and manual quality assurance and in security.
How CYE Helps
We at CYE understand that resources are limited, and we help our customers focus their cybersecurity efforts and resources on what really matters. This is why we developed Hyver, our optimized cyber risk quantification platform.
CYE can help both find potential and exploitable vulnerabilities in an organization and assess the maturity of its compensating controls, by fully simulating malicious attackers who try to stay under the radar, much as the attackers did in the SolarWinds incident.
Want to learn how CYE can help you optimize your security? Contact us.