Opinion

Operational excellence in cloud security: Lessons from CrowdStrike outage and Amazon’s playbook

"At Amazon, the '5 Whys' was integral to the Correction of Errors process, addressing both immediate and deeper issues," writes Joshua Burgin, Chief Product Officer at Upwind Security.

Joshua Burgin | 09:08, 24.09.24

The recent CrowdStrike outage highlights the critical role of operational excellence in cybersecurity. In today’s hyper-connected world, even a brief service interruption can escalate into a major crisis. It’s often not just technical failures that matter but the operational practices behind them that determine whether an issue becomes a minor hiccup or a full-blown catastrophe.

This incident wasn’t merely a failure of technology but of deeply ingrained operational practices. Drawing on lessons from over a decade at Amazon and AWS, where I worked with some of the largest-scale live systems in existence, I want to explore how organizations can build a culture that not only prevents such failures but also stays resilient when the inevitable ones occur. At Upwind, we’ve embedded these principles into our daily operations so that our systems are resilient and our teams are prepared.

CrowdStrike’s incident had a clear technical root cause, but stopping there misses the broader point. The failure was due to a “mismatch in the number of input parameters provided to a content interpreter within their Falcon sensor software.” Their new IPC Template Type defined 21 input parameter fields, but the integration code supplied only 20 input values. This mismatch bypassed validation and testing, leading to a system crash when the 21st input parameter was expected but not provided.
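To make the failure mode concrete, here is a minimal Python sketch of an up-front count check that would reject such a mismatch before the interpreter runs. It is an illustration only, not CrowdStrike's code: the names `EXPECTED_FIELDS`, `InputCountMismatch`, `validate_inputs`, and `interpret` are hypothetical, and the real fault was an out-of-bounds memory read in the sensor, which Python would surface as an exception instead.

```python
EXPECTED_FIELDS = 21  # number of fields the template type defines (per the article)


class InputCountMismatch(ValueError):
    """Raised when the supplied inputs do not match the template definition."""


def validate_inputs(inputs, expected=EXPECTED_FIELDS):
    # Reject the payload up front instead of letting the interpreter
    # read past the end of the supplied values.
    if len(inputs) != expected:
        raise InputCountMismatch(
            f"template expects {expected} inputs, got {len(inputs)}"
        )
    return inputs


def interpret(inputs):
    # Hypothetical interpreter step: it reads the last defined field,
    # which is the access that fails when only 20 values arrive.
    checked = validate_inputs(inputs)
    return checked[EXPECTED_FIELDS - 1]
```

With 20 values, `interpret` raises `InputCountMismatch` instead of reading a field that was never supplied; with 21 values it proceeds normally.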

The deeper issue here was the operational processes that allowed this to slip through. It’s easy to say, “We found the bug, we’ll fix it, and we’ll do better testing next time.” But the real question is, why did this issue go unnoticed? Operational excellence is about creating systems that make such failures less likely to occur in the first place. At Amazon, especially at AWS, operational excellence was embedded in every decision. The key is to create an environment where failures are anticipated, prepared for, and quickly contained.

Understanding the deeper operational failures is only the first step, and the best way to go about it is the “5 Whys” methodology. It’s a simple yet powerful tool for root cause analysis: you ask "Why?" repeatedly until you reach the underlying issue. Originally developed at Toyota, it’s now widely used in tech and engineering.

Instead of stopping at the immediate issue, like a server failure, the "5 Whys" digs deeper to uncover the systemic problems (gaps in communication, flawed processes, or cultural issues) that allow such incidents to happen. At Amazon, the "5 Whys" was integral to the Correction of Errors process, addressing both immediate and deeper issues and leading to continuous improvement. Here is the method applied to the CrowdStrike incident:

Why did the system crash? The content interpreter accessed an out-of-bounds memory location.
Why did it access that location? It expected 21 inputs but received 20.
Why were there only 20 inputs? The integration code was set for 20.
Why was it set for 20? New templates weren’t fully tested.
Why wasn’t this tested? Previous cases didn’t trigger the issue.

By the fifth "Why," the root cause is a gap in testing processes, not just a technical mistake. This method helps teams to think critically and address both immediate and systemic issues, driving lasting improvements.

Now that we have a methodology for understanding why an incident happened, we must adopt key practices that embed operational excellence into daily operations, so that the same failure does not happen again. These include:

Appointing an incident manager, so there is always someone to turn to at the moment of truth and responsibility for the incident doesn’t get overlooked when the weekend arrives or when team members have a lot on their plate.
Communicating effectively during a crisis. Alternative channels keep everyone informed, and involving customer service and PR helps manage expectations and protect reputations.
Prioritizing incidents consistently using a clear system: critical, high, medium, or low. This ensures severe issues are addressed first.
Conducting thorough root cause analyses after an incident to prevent recurrence. Even a small bug fix can lead to significant system improvements.
Running regular exercises and simulations to identify weaknesses. It’s the best way to find a gap in your systems and handle it before it ever becomes a real issue.
Coordinating incident response across different teams in the company, which maintains trust and minimizes churn during incidents.
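The severity scheme in the list above (critical, high, medium, low) can be sketched as a simple triage order. This is a minimal illustration under stated assumptions, not a description of any particular tool: the `Severity`, `Incident`, and `triage` names, the `opened_at` field, and the oldest-first tie-break are all hypothetical choices.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Lower value means the incident is handled earlier.
    CRITICAL = 0
    HIGH = 1
    MEDIUM = 2
    LOW = 3


@dataclass
class Incident:
    title: str
    severity: Severity
    opened_at: int  # hypothetical timestamp, e.g. seconds since epoch


def triage(incidents):
    # Most severe first; among equal severities, the oldest incident first.
    return sorted(incidents, key=lambda i: (i.severity, i.opened_at))
```

Keeping the ordering rule in one small function makes the prioritization explicit and easy to review, rather than leaving it to whoever happens to be on call.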

Operational excellence is an ongoing journey, reflecting the strength of your culture. At Amazon, operational readiness reviews were essential before product launches. These rigorous checks ensured readiness; if a product didn’t pass, it didn’t launch – no matter how high-profile it was. Operational excellence was reinforced through cross-organization ops meetings involving all levels, from VPs to junior engineers, to review incidents, share lessons, and improve practices.

The CrowdStrike outage reminds us that even the best technology can fail without solid operational practices. Embracing the "5 Whys," thorough root cause analyses, and a culture of continuous improvement builds resilient systems. It’s about creating an environment where failures lead to meaningful improvements—a lesson I’ve carried from Amazon to every challenge, including my work at Upwind.

As I wrap up, I want to leave you with three simple steps that can help build a culture of operational excellence in your own organization:

Dig Deeper: Use the "5 Whys" to identify and fix the real problem, not just the symptom.
Be Ready: Have a solid incident management plan in place to respond quickly and effectively.
Keep Improving: Foster a culture of learning from mistakes and continuous improvement.

These aren’t just theoretical ideas—they’re practical steps you can take right away to strengthen your systems, protect your customers, and build a resilient organization.

The author is Joshua Burgin, Chief Product Officer at Upwind Security.
