What lessons can be learned from the most catastrophic IT failure in history?

Critical business operations, global communications, and health and transport services were brought to a standstill on July 19th by what has been labelled ‘the world’s most disastrous IT breakdown’. Flights were grounded, and airports resorted to manual ticketing and handwritten notices. Even broadcasters such as Sky News were temporarily taken off air, while core services, including healthcare, emergency facilities and public transport apps such as Transport for Ireland’s, were disrupted.

The mammoth global disruption was the result of a minor software update, and a full fix is expected to take ‘some time’. Are such mishaps an exception, or do they reflect the ‘new normal’ of the cloud computing era? And could older communication systems teach us how to prevent such chaos in future?

The outage was prompted by two major, and possibly linked, IT failures. On July 18th and 19th, Microsoft’s cloud computing platform, Azure, suffered disruption, and a defective update from cybersecurity firm CrowdStrike set off a ripple effect through global IT systems, causing significant failures in IT infrastructure and temporarily incapacitating individual Windows computers.

Although the exact timings and sequence of events have yet to be confirmed, a defective update to CrowdStrike’s Falcon Sensor software, installed on Windows servers and computers, appears to be the primary cause.

Falcon Sensor combines firewall, antivirus and broader cybersecurity functions in a single product, widely used across cloud computing services and designed to detect, monitor and block cybersecurity threats in real time. The update that triggered the failure was classed as a minor content update and so bypassed the robust testing normally required before deployment to many of the world’s Windows servers and computers.

The aftermath? Countless ‘blue screens of death’ appeared as the effects were felt across the globe. Numerous transport services and flights were cancelled or delayed. In Britain, Sky News temporarily went off air, the National Health Service and several GP booking systems stopped working, and news updates from the London Stock Exchange were disrupted.

Ireland felt the impact too, with Dublin and Cork airports, Transport for Ireland and NCT test centres all affected.

While this outage was not caused by a deliberate or malicious cyberattack, it highlights the potentially devastating consequences that can follow from weaknesses in cybersecurity, including phishing and malware attacks.

One notable recent IT disruption was the ‘WannaCry’ ransomware attack of 2017, which incapacitated more than 200,000 computers in over 100 countries, including medical equipment and computers within Britain’s NHS, with consequences for healthcare delivery and individual lives. The attack was contained quickly after cybersecurity researcher Marcus Hutchins identified a ‘kill switch’ that slowed the malware’s spread.

From their inception with the electrical telegraph network in the mid-nineteenth century through to the birth of the internet in the nuclear age, the foundation of our telecommunications systems has always been security and reliability.

As the telegraph system grew into an extensive global network spanning land and sea in the late 1800s, Britain secured its messages by building the “All-Red Line”: a web of electric telegraph cables encircling the globe that passed only through territories and colonies under British control.

Importantly, the network’s original builders included fail-safes, ensuring that if one cable went down, messages could still be routed along alternative paths. In 1911, with the threat of a European war looming, the British Committee on Imperial Defence concluded that this built-in redundancy made it near impossible to cut Britain off from the telegraph network: complete isolation would have required severing 49 cables, compared with 15 for Canada and five for South Africa.

Between the 1950s and the 1970s, academics and the military pondered what a worldwide resource-sharing network might look like. Their vision was a decentralised system that was secure, durable and capable of surviving numerous failures, even those caused by a nuclear attack. Their creation, which now underpins the internet and the subsequent world wide web, holds critical lessons we seem to have forgotten in today’s cloud computing age.

Encouragingly, major IT crises have been quite rare until now, one of the most recent being the ‘WannaCry’ ransomware attack of 2017. But as more businesses move to cloud computing, the likelihood and impact of such incidents have only grown. So what can be done to minimise the consequences of future outages, hopefully a rarity, whether accidental or deliberate?

The lessons of the Victorian electrical telegraph network and the internet’s origins in the 1960s are that robust, reliable systems are built on redundancy and decentralisation. Today, that principle translates into thorough testing and validation of third-party updates, spreading risk across multiple platforms and cloud services, and something that has been with us since the dawn of electronic computing in the 1940s: reliable backups. A staged rollout, sketched below, is one simple way to put the first of these into practice.
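
To make the idea concrete, here is a minimal sketch, in Python, of a staged (‘canary’) rollout: an update goes to a small sample of a machine fleet first, and the wider deployment halts if too many canaries fail. None of this reflects CrowdStrike’s or Microsoft’s actual tooling; the fleet size, thresholds and function names are all illustrative assumptions.

```python
# A minimal sketch of a staged ("canary") rollout gate for third-party
# updates. Entirely hypothetical: the fleet size, thresholds and the
# apply_update() stand-in are illustrative, not any vendor's real process.
import random

FLEET_SIZE = 10_000       # hypothetical number of machines in the fleet
CANARY_FRACTION = 0.01    # update only 1% of machines first
FAILURE_THRESHOLD = 0.02  # abort if more than 2% of canaries fail

def apply_update(machine_id: int) -> bool:
    """Stand-in for installing the update on one machine.

    Simulates a defective update that fails roughly 30% of the time."""
    return random.random() > 0.30

def staged_rollout() -> bool:
    machines = list(range(FLEET_SIZE))
    random.shuffle(machines)
    canaries = machines[: int(FLEET_SIZE * CANARY_FRACTION)]

    failures = sum(1 for m in canaries if not apply_update(m))
    failure_rate = failures / len(canaries)

    if failure_rate > FAILURE_THRESHOLD:
        # A defect is caught on 100 machines rather than 10,000.
        print(f"Canary failure rate {failure_rate:.1%} - halting rollout.")
        return False

    print(f"Canary failure rate {failure_rate:.1%} - proceeding fleet-wide.")
    # ...continue deploying to the rest of the fleet in widening rings...
    return True

if __name__ == "__main__":
    staged_rollout()
```

Had the July 19th update passed through a gate of this kind rather than shipping everywhere at once, a defect of its type would, in principle, have surfaced on a small fraction of machines before it could reach the rest.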

Elizabeth Bruton is a historian of technology, a former curator at the Science Museum in London, and an honorary research fellow at the History of Science Museum at the University of Oxford.
