On July 19, 2024, organizations across the globe found themselves staring at the Blue Screen of Death following an update pushed to CrowdStrike’s Falcon cybersecurity software. The update contained an error later determined to be a mismatch in which 21 input fields were expected but only 20 were provided. An internal content validator failed to detect the mismatch, allowing the flawed update to pass internal checks and causing Microsoft Windows systems to crash.
CrowdStrike’s response was quick: the root cause was identified and a fix pushed within two hours, but the damage was already done. Falcon’s high-level privileges and deep integration with the Windows OS exacerbated the problem, leading to widespread business disruptions that persisted for days. The outage cost CrowdStrike’s clientele, largely Fortune 500 companies, an estimated $5 billion, and businesses worldwide an estimated $10 billion.
Asking Tough Questions
To the casual observer, the CrowdStrike outage is ancient history. For IT leaders, however, the event looms ominously, raising questions about the reliability of the services on which they depend. The smart ones are asking themselves: Do I have any other potential single points of failure? Am I too dependent on an outsourced service? Are my patch and upgrade practices at risk? What can I do to avoid this scenario in the future?
IT services aren’t a set-and-forget proposition. They require constant attention, and patch management is crucial for maintaining security, scalability, and compliance, and for minimizing operational risk. Many patches address vulnerabilities that could be exploited by threat actors, and so applying critical updates quickly is essential to avoiding zero-day exploits. Other patches promote system stability by ensuring that processes run without unexpected errors or interruptions. Patching is also essential for maintaining compliance with government regulations, industry standards, and contractual obligations related to things like security, operational resiliency, and service level agreements (SLAs).
Simple Strategies
The CrowdStrike outage illustrates how a single flawed update can undermine the goals of patch management and disrupt digital operations. But there are strategies organizations can adopt to maintain service and application uptime while meeting stringent patch and update schedules. These include:
- Deploy patches in test or development environments before rolling them out to production. This helps to identify issues in a controlled setting, thus preventing them from affecting critical systems.
- Roll out updates on clustered instances, which let you continue running the application on a primary server while you apply the patch to a standby server. Let the patched standby run long enough to confirm the update doesn’t negatively affect the application before applying it to the primary and any remaining nodes.
- Deploy patches in stages to small groups of systems, catching any problems before they reach the entire environment rather than exposing all systems at once.
Also, before rolling out any patches or updates, make sure to have settings backed up so that you have the option of restoring systems to the last known correct state should something go catastrophically wrong. And consider using automated tools that support patch testing and rollback management to increase efficiency and reduce the potential for human error.
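To make the staged approach concrete, here is a minimal Python sketch of a ring-based rollout with automatic rollback. The `apply_patch`, `snapshot`, `rollback`, and `health_check` hooks are hypothetical stand-ins for whatever your patch-management and backup tooling actually provides.

```python
import logging
from typing import Callable, Iterable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("patch-rollout")

def staged_rollout(
    groups: Iterable[list[str]],          # rings of hosts, e.g. test ring first, production last
    apply_patch: Callable[[str], None],   # applies the patch to one host (assumed hook)
    snapshot: Callable[[str], str],       # captures a restore point, returns its ID (assumed hook)
    rollback: Callable[[str, str], None], # restores a host to a snapshot (assumed hook)
    health_check: Callable[[str], bool],  # True if the host is healthy after patching (assumed hook)
) -> bool:
    """Patch one ring of hosts at a time; halt and roll back the current ring on any failure."""
    for ring, hosts in enumerate(groups, start=1):
        snapshots: dict[str, str] = {}
        try:
            for host in hosts:
                snapshots[host] = snapshot(host)   # back up settings before touching the host
                apply_patch(host)
                if not health_check(host):
                    raise RuntimeError(f"{host} failed post-patch health check")
            log.info("Ring %d patched successfully (%d hosts)", ring, len(hosts))
        except Exception as exc:
            log.error("Ring %d halted: %s - rolling back this ring", ring, exc)
            for host, snap_id in snapshots.items():
                rollback(host, snap_id)            # restore last known good state
            return False                           # never proceed to later rings after a failure
    return True

# Example: dry run with stub hooks standing in for real tooling
ok = staged_rollout(
    groups=[["test-01"], ["prod-01", "prod-02"]],
    apply_patch=lambda h: print(f"patching {h}"),
    snapshot=lambda h: f"snap-{h}",
    rollback=lambda h, s: print(f"restoring {h} from {s}"),
    health_check=lambda h: True,
)
print("rollout succeeded:", ok)
```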
Cloud-Dependent Challenges
Cloud-dependent organizations are not immune to the risks of downtime. Human error continues to be the leading cause of catastrophic downtime incidents, and every cloud service provider has suffered incidents resulting in outages that affected its customers. For cloud-dependent organizations, the best way to minimize the impact of an outage is to take steps in advance to achieve high availability and reduce the risk of disruption should a cloud vendor or cloud-based service fail.
One strategy is to leverage multi-cloud or hybrid cloud architectures to increase service redundancy and resiliency with a clustered environment capable of failing over not only across availability zones and regions, but from one cloud to another, or from on-premises systems to the cloud. To accomplish this, you need a SANless cluster. Unlike traditional clustering, which requires shared storage (typically a SAN), SANless clusters are built on local storage, allowing you to configure cluster nodes in any combination of cloud, on-premises, or remote (disaster recovery) sites. Advanced SANless clustering solutions enable seamless failover for critical applications across clouds and synchronize local storage using real-time, block-level replication (synchronous or asynchronous). When used with clustering software such as Windows Server Failover Clustering, SANless clusters appear as, and function like, traditional SAN-based storage hardware, but without the resource drain.
SANless clusters also afford the flexibility to configure nodes within geographically distributed data centers or in cloud availability zones, allowing you to protect single-site, multi-site, cloud, or mixed environments. In the event of an outage, the standby server has immediate access to the most current data during failover.
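As a rough illustration of the failover behavior described above – not the internals of any particular SANless clustering product – the following Python sketch probes a primary node and promotes a standby after repeated failures. The addresses, port, thresholds, and `promote_standby` logic are all placeholder assumptions.

```python
import socket
import time

PRIMARY = ("10.0.1.10", 1433)    # hypothetical primary node (e.g., database server in cloud A)
STANDBY = ("172.16.1.10", 1433)  # hypothetical standby node (replica in another cloud or on-premises)
FAILURE_THRESHOLD = 3            # consecutive failed probes before triggering failover
PROBE_INTERVAL = 5               # seconds between probes

def is_reachable(node: tuple, timeout: float = 2.0) -> bool:
    """Basic TCP reachability probe; production cluster software uses much richer health checks."""
    try:
        with socket.create_connection(node, timeout=timeout):
            return True
    except OSError:
        return False

def promote_standby() -> None:
    """Placeholder: in practice, the clustering software brings the standby online
    against its locally replicated, up-to-date copy of the data."""
    print(f"Promoting standby {STANDBY[0]} to primary")

def monitor() -> None:
    failures = 0
    while True:
        if is_reachable(PRIMARY):
            failures = 0                 # primary healthy; reset the failure counter
        else:
            failures += 1
            if failures >= FAILURE_THRESHOLD:
                promote_standby()        # repeated failures: fail over to the standby
                break
        time.sleep(PROBE_INTERVAL)

if __name__ == "__main__":
    monitor()
```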
Best Practices for Achieving High Availability
Establishing a failover protocol with SANless clustering ensures that, in the event of an outage at one cloud provider, services can continue uninterrupted through an alternate cloud provider. Distributing workloads across multiple providers creates redundancy, supports high availability, and promotes operational stability by eliminating single points of failure.
Adopting a multi-cloud approach and using SANless clustering to achieve high availability does more than reduce the chance that an outage will affect the performance and accessibility of your mission-critical systems. Evaluating cloud service providers’ unique strengths – whether performance, cost, or specialized services – allows you to align your organization’s operational priorities with what a provider does best. As a bonus, this approach also avoids cloud lock-in, which can pay dividends when negotiating new service contracts.
This multi-cloud failover strategy is not for everyone. Because no two clouds are built or operated alike, the approach comes with challenges. These can be overcome by following several best practices:
- Calculate and consider egress fees charged by cloud providers for data moving out of their cloud in the event of a failover/failback or a manual switchover/switchback for testing (a rough estimating sketch follows this list).
- Take an inventory of things like patch schedules, SLAs, architecture, and networking differences before choosing an additional cloud provider.
- Invest in staff training to ensure they become as familiar with your new cloud service provider(s) as they are with your incumbent.
- Identify and address how changes to data management and communication policies affect security and compliance before implementation.
- Test new configurations (active-passive or active-active) and policies to ensure failover setups operate as intended to provide cross-cloud redundancy and smooth transitions.
- Evaluate synchronous vs. asynchronous replication to determine which is better in the context of region, availability zones, disaster recovery, etc.
- Adopt intelligent tools to detect failures and automate seamless failover policy enforcement.
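For the egress-fee item above, a back-of-the-envelope estimate is often enough to decide whether regular cross-cloud failover testing fits the budget. The figures in the Python sketch below are placeholders, not any provider's published rates.

```python
# Rough egress-cost estimate for failover/switchover testing.
# All figures are illustrative placeholders; substitute your provider's actual rates and data volumes.

DATA_TO_MOVE_GB = 2_000          # data copied out of the primary cloud during a switchover
EGRESS_RATE_PER_GB = 0.09        # placeholder per-GB egress rate in USD
TESTS_PER_YEAR = 4               # planned switchover/switchback drills per year

per_event = DATA_TO_MOVE_GB * EGRESS_RATE_PER_GB
annual = per_event * TESTS_PER_YEAR * 2   # x2: data moves out again on switchback

print(f"Estimated egress cost per switchover event: ${per_event:,.2f}")
print(f"Estimated annual egress cost for testing:   ${annual:,.2f}")
```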
Conclusion
It has been said that hope is not a plan, and so hoping that the cloud services and other third-party providers on which you depend are never going to fail is not a good approach for maintaining service availability. Even highly respected organizations have bad days, but you can take steps to mitigate the risks of experiencing a catastrophic outage by adopting SANless clustering to achieve high service availability in a multi-cloud environment.