TAPAS.network | 1 August 2024 | Editorial Opinion | Peter Stonham

The Machine Stops

THERE ARE certain industries that, due to their time-critical nature, service delivery structure, and user characteristics and expectations, are particularly susceptible to any system downtime or unpredictable interruptions to service.

Transport and logistics have become a prime example in our modern digital world, meaning everything from passenger transport services to traffic control and freight distribution is in the front line for any IT system failure. This is ever more so with the growth of digital dependency and reliance on the internet for communication, messaging and data transfer, both to manage service provision and to support interaction with customers.

Last week a global IT outage brought chaos across airlines, airports, some train operators, and retail, financial and healthcare services. The cybersecurity company CrowdStrike has admitted responsibility for a faulty software update for machines running Microsoft Windows, which left 8.5 million computers displaying ‘blue screens of death’.

Shortly after, incidents of a different kind, but similar consequence, occurred on the TGV rail network in France when arson attacks were made on lineside communications infrastructure bringing services to a halt. This was a physical intervention, but no less related to the dependence on digital capacity for systems to operate.

In our digital-dependent and highly connected society, such incidents are increasingly likely. One failure begets others and a break in connectivity disrupts and devalues the whole network. Last week we found out what happens if systems go offline, with the highly publicised incident demonstrating how a major system failure can cripple operations, frustrate passengers and tarnish reputations.

Whilst the cause in this case was specific, it was hardly unique. Under the right (or rather wrong) conditions, similar scenarios could unfold at weak points anywhere in the digital chain or, more worryingly, arise from an organised contamination across key system providers.

Consequences can be more than just inconvenience, extending to safety, protection of transactions, critical security data, and the contamination and decay of core support systems, not to mention public confidence in future operations. All those involved incur reputational risk.

A core truth is that Information Technology has radically changed the nature of transport and travel. Driverless cars and trains, automated traffic controls, and connected highways are just some examples of how the transport sector has adopted, at an astonishing rate, the Internet of Things, AI systems in place of human ones, and customer self-managed bookings, payments and journey planning. However, with each advancement and investment comes increased reliance on the Internet, which, as those familiar with its inner workings can affirm, introduces another level of unpredictability that needs proactive management.

Transport companies have learned that they must have diagnostic indicators to predict failures before they happen, substitute systems in place to kick in when the main ones stop working, and insurance both to spread risk and to offer redress when customers are left out in the cold. One key issue is liability, and with it the loophole that allows responsibility to be denied by treating cyber disruption as an ‘Act of God’, outwith human control.

If a supplier, whether commercial or a public body, chooses to dispense with human actors and implement technological and artificial intelligence mechanisms, that is a decision with consequences, but not a removal of responsibility. A digital disruption has another significant impact in disabling the ability to monitor the performance and positioning of transport assets (planes, trains, cars and trucks) and their payloads. These are moving assets which in normal times are increasingly tracked and visible on display systems, but are suddenly ‘lost’ to view. This lack of connectivity makes it very difficult to determine whether there is an operational issue with one or more vehicles, and to take appropriate action.
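As a rough illustration of how an operator might at least notice that assets have dropped out of view, the sketch below flags vehicles whose last position report is older than a chosen threshold. The function name `find_silent_vehicles`, the vehicle identifiers and the two-minute threshold are all hypothetical, chosen only for the example; a real fleet system would have its own data model.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def find_silent_vehicles(
    last_seen: Dict[str, datetime],
    now: datetime,
    max_silence: timedelta,
) -> List[str]:
    """Return IDs of vehicles with no position report within max_silence."""
    return sorted(
        vehicle_id
        for vehicle_id, seen in last_seen.items()
        if now - seen > max_silence
    )


# Hypothetical snapshot of the most recent report from each vehicle.
now = datetime(2024, 7, 19, 9, 0, tzinfo=timezone.utc)
last_seen = {
    "train-101": now - timedelta(seconds=20),
    "truck-7": now - timedelta(minutes=12),
    "bus-42": now - timedelta(minutes=3),
}

silent = find_silent_vehicles(last_seen, now, timedelta(minutes=2))
print(silent)
```

A check of this kind is only useful if it runs somewhere that does not share the failure of the tracking feed it is watching.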

The transport sector is already susceptible to shocks and stresses (e.g. extreme weather, protests, and pandemics) that make resilience fundamental to managing risk in operations. On top of that, any unreliability of the internet or of communications-carrying systems means potentially severe disruption.

But it’s not just disconnection and suspension of operations and the corruption of internal systems that can be costly. When Google Maps went down in August 2022, it took out several apps that rely on its API to deliver directions. For example, Uber and Lyft rely on Google Maps data to provide real-time information about traffic conditions and other factors impacting drivers’ ability to pick up riders. With these apps affected by the outage, their drivers could not pick up customers.
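One common way to soften a single-supplier dependency of this kind is to try an alternative source when the primary one fails, and to fall back on the last known answer if nothing else responds. The sketch below is a minimal illustration under stated assumptions: the provider functions, the `ProviderUnavailable` exception and the route strings are hypothetical stand-ins, not the API of any real mapping service or ride-hailing app.

```python
from typing import Callable, Optional, Sequence, Tuple


class ProviderUnavailable(Exception):
    """Raised by a (hypothetical) routing provider that cannot respond."""


Provider = Callable[[str, str], str]


def route_with_fallback(
    origin: str,
    destination: str,
    providers: Sequence[Provider],
    cached_route: Optional[str] = None,
) -> Tuple[str, str]:
    """Try each provider in order; fall back to a cached route if all fail.

    Returns the route and the name of the source that supplied it.
    """
    for provider in providers:
        try:
            return provider(origin, destination), provider.__name__
        except ProviderUnavailable:
            continue  # this source is down, try the next one
    if cached_route is not None:
        return cached_route, "cache"
    raise ProviderUnavailable("no routing source available")


def primary_maps(origin: str, destination: str) -> str:
    # Simulates the outage of the main mapping service.
    raise ProviderUnavailable("primary mapping API is down")


def secondary_maps(origin: str, destination: str) -> str:
    return f"{origin} -> {destination} via secondary provider"


route, source = route_with_fallback(
    "Depot", "Station", [primary_maps, secondary_maps]
)
print(source, route)
```

The trade-off is cost and complexity: a second provider or a cache has to be paid for, tested and kept current before the outage, not during it.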

One problem clearly demonstrated by last week’s outage is establishing the cause of the disruption. On this occasion, it was relatively easy to point the finger at CrowdStrike for its faulty update. But it might not always be so simple. The problem is compounded when it is not clear whether the issue lies in communicating data, a contamination of the operating system, a network connectivity fault, or some other mysterious or malignant cause.

In a detailed review of the incident, CrowdStrike said there was a “bug” in a system designed to ensure software updates work properly. CrowdStrike said the glitch meant “problematic content data” in a file went undetected. The company said it could prevent the incident from happening again with better software testing and checks, including more scrutiny from developers.

This comes as affected businesses and customers are asking what financial compensation those impacted by the outage will be able to claim. According to insurance firm Parametrix, the top 500 US companies by revenue, excluding Microsoft, faced some $5.4bn (£4.1bn) in financial losses from the outage. It said that only $540m (£418m) to $1.08bn (£840m) of these losses, or 10 to 20 per cent, were insured.

“This incident must serve as a broader warning about the national security risks associated with network dependency,” wrote the US House Committee on Homeland Security in a letter quickly sent to CrowdStrike, which it has called to a hearing.

Many IT experts have been drawing some obvious conclusions. Professor Omer Rana, of the Cardiff University Academic Centre of Excellence in Cyber Security Research & Education, said the outage had “clearly indicated that we need to consider the impact of wider ‘cyber disturbances’ – rather than just cyber attacks”. It is the impact on systems that is important, not just what has caused it, he said. “This shows how vulnerable we are to cloud-hosted services that we all rely on every day. This reliance has increased even more significantly since the Covid pandemic, when many workers were connected on-line and cloud-hosted services played a key role.” The cyber-disturbances now occurring come in the context of ‘edge computing systems’, such as internet of things devices, as our reliance on these continues to increase.

Much other valuable insight has been offered by similar IT and cyber systems experts in the wake of the outage, as we report in this issue, but who will really sit up and listen?

Steve Sands, Chair of the BCS Information Security Specialist Group, said: “Working IT systems are a prerequisite for almost every aspect of modern life and indeed the global economy. We have made a number of key recommendations to improve service and software resilience to government. I sincerely hope that this CrowdStrike issue raises awareness and creates some much-needed urgency to continue this vital conversation.”

Dr Inah Omoronyia, based in the Bristol Cybersecurity Research Group at the University of Bristol’s School of Computer Science, said: “This outage points to the need to be constantly vigilant of the cloud infrastructures and other critical systems that we now depend on daily. Today’s infrastructures are a lot more complex, with extensive dependencies. Currently, our risk mitigation approaches are too reactive and therefore unsustainable for the current pace of technological innovation. Unless precautions are pro-actively taken to detect and mitigate risks throughout the whole software and systems supply chain our best effort may remain a security theatre.”

Beyond these experts, those involved in the transport sector in more traditional roles, or with responsibilities in planning and policy rather than operations, might be acknowledging that there are indeed some challenging and concerning matters of digital and cyber vulnerability and resilience, and even have worries about the implications. But perhaps they believe that ‘someone’ is looking after those problems on behalf of their organisations, and the country/society at large. That’s what they probably also thought about the preparation for a pandemic, and plans for responding to major weather events like flooding or extreme heatwaves.

Events have demonstrated that such confidence in other people and systems is often misplaced. It is in human nature to be most concerned about things that are immediately apparent, or deeply seated inside human experience — not to imagine and plan for the arrival of new threats and consequences that come unannounced alongside what look like unlimited beneficial advances in human ingenuity and technological invention.

Digital dependence, and the possibilities of system corruption or collapse, may be an existential threat, or at least a trigger towards a set of circumstances beyond anyone’s perception or control, bringing serious immediate and long-term damage to society. Coupled with the unknowns brought by climate change, the coming widespread substitution of Artificial Intelligence for core human functions, and management systems that remove discretion from individuals in the field (if indeed there are any left), it raises the prospect of more dramatic and chaotic chain reactions whose potentially very unpleasant consequences will not have been thought through.

We have changed our way of life unrecognisably in just a few decades. In many ways for the better, but in other ways worse. Our ‘new order’ is potentially constructed upon expectations and beliefs of predictability and reliability of underpinning systems, which are, to a greater or lesser extent, misplaced.

The endemic weaknesses are not very visible, and are hard to identify. Once upon a time, the nuclear threat brought massive attention to civil defence, preparations for the collapse of government, and measures to deal with panic and injury that many ridiculed as unlikely to even scratch the surface of the feared catastrophe. It now seems that no current ‘external’ threat to our way of life, of which digital dependence is just one, is having the same chilling effect. Or perhaps such threats are, but we prefer to embrace the comforts of believing that ‘somebody’ will be looking after things, or that it’s logical to simply ‘hope for the best’, as matters are now beyond anyone’s real control.

Peter Stonham is the Editorial Director of TAPAS Network

This article was first published in LTT magazine, LTT897, 1 August 2024.
