Global Outage: CrowdStrike’s Flashing Red Light

For a few years, experts in the field of artificial intelligence have been expressing concerns about the potential of their work to go disastrously wrong, leading to catastrophic events reminiscent of a blockbuster film. A stark reminder came to light last Friday, highlighting that disasters can happen subtly, possibly even stemming from an inconspicuous piece of technology that the majority are entirely unaware of.

Our existence is underpinned by a multitude of interconnected systems, which we often overlook. We typically take for granted the systems that allow us to travel by airplane, traverse bridges, settle our bills, fetch software updates, monitor our kids at summer camp, and smoothly navigate everyday life, until they break down.

A failure on a global scale was recorded last week. What was labelled the largest software outage in history wasn’t down to a terror attack, rogue AI or ransom-seeking hackers, contrary to the scenarios often shown in Hollywood blockbusters. It wasn’t even a prank by a prodigiously intelligent adolescent. Rather, it was a routine update that went wrong.

The Texas-based company CrowdStrike, known for defending businesses against digital threats, ironically became a threat itself and proved ill-equipped to deal with the problem. The trouble originated in a minor update for Windows machines that the company pushed out to its customers one recent Thursday night. Somehow, each computer that received the update crashed. Alerts such as “Your PC ran into a problem” or “Windows didn’t load correctly” were displayed against a serene sky-blue backdrop, ominously termed the “blue screen of death.”

Every system carries the potential for failure, often in unforeseen ways. The truth of that statement was cemented by the Great Blackout of 1965, regarded as one of the largest technological failures ever. The event cut power to 30 million residents across the US eastern seaboard. A fault at a Canadian power station triggered a domino effect of problems that broke the system, a reminder that technological mishaps can be as ordinary as they are unexpected.

Living in today’s advanced society requires a level of trust. Usually it’s not something we think about until we encounter moments of fear, such as unsettling turbulence during a flight, or stories about plane accidents or doors being blown off mid-flight. There are times when getting on the plane itself becomes a challenge, as passengers on thousands of flights discovered last Friday. Delta grounded flights, delaying many for hours, while Ryanair’s online check-in system went down, creating long lines at Irish airports. Technical glitches also left appointments unavailable at NCT centres, adding to the widespread chaos.

The breakdown of aviation technology is a major cause of anxiety, for obvious reasons. However, even non-travellers found last Friday distressing. It was troubling that the computers could neither correctly identify the cause of the crash nor rectify it, and that human intervention didn’t seem to provide an immediate remedy.

Describing the situation on the social media platform X, CrowdStrike executive Brody Nisbet said, “It’s a mess”. Offering no further assistance, he expressed disappointment but soon deleted the message. Critics argued that CrowdStrike could have shown better due diligence before releasing the patch: trial runs on a variety of Windows machines could have caught the problem earlier.

Matt Mitchell, a hacker and founding member of CryptoHarlem, a cybersecurity teaching and advocacy institution, stated, “Hypothetically, they should have tested on a machine similar to their clients’ older systems to foresee such errors”.

CrowdStrike, a leading cybersecurity firm, established in 2011 and home to 8,000 employees, faced serious criticism and loss of investor confidence after the incident, witnessing an 11% drop in shares last Friday. The crisis was monumental, impacting approximately 8.5 million machines, as per Microsoft.

Highlighting the gravity of the situation, Puneet Kukreja, cybersecurity head for EY UK and Ireland, said, “If your software or system serve such a massive population, it becomes an essential service. Like water being available in your taps, failure of such software is not expected.”

Although the firm may not possess the same name recognition held by larger technology companies, it does have plenty of audacity. Taking potshots at its competitors, the company uses part of its website to belittle them. The firm questions Microsoft’s security capabilities by stating, “Microsoft’s safety products can’t even defend Microsoft. How will they safeguard you?” The company also urges users to stay clear of Palo Alto Networks, condemning it as a “high-cost, difficult to use, deploy and manage” platform.

CrowdStrike CEO George Kurtz made an attempt last Friday to downplay a recent service interruption, describing it as “a flaw in a single content update for Windows hosts”. Accusations that Kurtz was slow to apologise arose when an apology came hours after the incident. (“I want to truly apologise to all of you for today’s interruption”, he later stated). The company had no comment when asked for additional details.

After the interruption, IT staff at affected businesses had to make a decision: manually remove the faulty code from each offline computer, or wait patiently for a resolution from CrowdStrike. Explaining the situation, Mikko Hypponen, a security expert and chief research officer at cybersecurity company WithSecure, stated, “You can work around the issue if you can walk to each laptop, manually type and reinitiate it”. He noted that the real challenge arises with big corporations, CrowdStrike’s clientele, which usually manage their fleets of computers through central controls.

In summary, the age-old solution for a troublesome computer – a simple reboot – remains the only effective fix. This is true even as computers are becoming more integrated into global networks. But those stranded travellers at the airport were unable to restart the screens that barred them from their flights.

Describing the situation as “a flaw in a single content update” is emblematic of a contemporary risk. Until recently, software updates were a more complex and laborious task. Not every computer system was connected to every other, so failures tended to stay localised.

Cybersecurity expert Mitchell explains, “When we discuss cybersecurity, we argue about having multiple layers of defence – a castle with a moat, archers, and a gateway. We advocate for systems where there isn’t a single point of failure. But unfortunately, we’re creating circumstances where there’s a single point of failure”. The latest outage illustrates the delicate position companies find themselves in and emphasises the need for contingency plans to help mitigate disruptions.

The potential for disruption looms perpetually in our contemporary digital society, which is heavily entwined with and dependent on various IT systems and networks, explains Kukreja. The ripple effect of these disruptions can have widespread implications, affecting not only a specific company but also the partners and customers within its supply chain. It underscores the vital role IT resilience plays in sustaining business operations, providing swift recovery and continuity during these disturbances.

Moreover, the approach to security, including the consolidation of technologies and vendors, is another area that needs consideration, according to Richard Ford, the CTO of Irish cybersecurity firm Integrity 360. He noted that this incident could prompt more cautious corporations to spread their security controls and risk among several vendors, and to segment business areas. While that is only a possibility, it is certain that everyone will now recognise how much faith we place in vendors as well as in our recovery procedures.

Looking back to the 1965 blackout, it is apparent that disruptions can cascade without inevitably proving fatal, and the CrowdStrike outage is a recent example. The disturbance has not so far been linked to any casualties, and disrupted journeys were able to finish over the weekend. If CrowdStrike proves lucky, the mishap will fade and be forgotten in no time.

Yet the ever-looming possibility is that somewhere down the line we may not be so fortunate, and some overlooked or carelessly configured piece of unexciting technology could cause a genuine catastrophe. Renowned security expert Brian Honan described the chaos that can ensue when entities within these sectors face an incident, whether an IT issue or the outcome of a malicious attack. As 100 per cent security does not exist, it comes down to prudent risk management: identifying risks and managing them to lessen their impact on the bottom line should an event occur.

As the world grows ever more interconnected, the stakes rise. A software breakdown that triggers a societal breakdown seems more probable than AI ushering in global peace. As poets predicted long ago, this would be an absurd end. T.S. Eliot, in today’s context, would likely add a thumbs-down emoji to his famous lines about the world ending not with a bang but a whimper.
