30 August 2024
Duncan Swan, chief operating officer, British APCO
The Microsoft/CrowdStrike issue of 19th July made headlines across most of the world because of the impact that the loss of IT systems and services had on many people’s day-to-day lives. Businesses came to a standstill; transport was disrupted; and healthcare services reverted to pen and paper, supporting only the most urgent cases. Away from the UK, it also impacted delivery of critical emergency calls.
IT & comms failures impacting delivery of emergency calls, from those in their time of greatest need to an emergency agency, are becoming ever more prevalent. And the underlying reason for these failures is often software failure or upgrade/configuration errors. There is an air of inevitability that systems and services will, at some point, fail. And often not in the way you might expect. Critical service providers must, therefore, take business continuity more seriously than ever before. It’s not just about knowing there is built-in resilience in the core infrastructure; it’s about having robust alternatives that minimise the impact on critical service delivery, with clear communication plans that let those needing emergency assistance know precisely what to do and expect.
The Collaborative Coalition for International Public Safety – of which British APCO is a founding member – has recently published a Best Practice Guide for emergency agencies to consider when three-digit emergency calls can’t be received. Most organisations have emergency plans in case they need to evacuate their premises, deal with a major incident, or move to a backup system. But many have yet to establish and rehearse plans for how the public can make contact when the primary emergency communication lines are down – and how this will be effectively communicated.
Back in November 2023, the Optus network in Australia suffered a national outage of all its internet, cellular and fixed-line services. Emergency services were compromised. Hospitals were hampered in their critical work. Businesses lost the ability to trade. And nearly 2,500 Optus customers were unable to get through to emergency services during the 16-hour blackout. The outage occurred when many Optus routers automatically self-isolated to protect themselves from an overload of IP routing information – all resulting from a software upgrade, after which the network received changes in routing information from an alternate Singtel peering router located in Singapore.
Some 10 million Optus customers had no way to get through to Triple Zero emergency services. “We didn’t have a plan in place for that specific scale of outage. I think it was unexpected,” the Optus MD told the Australian Senate not long afterwards.
Two recent network outages in Canada also have their root cause in software upgrades. In April 2021, Rogers, a Canadian wireless provider, had an 18-hour outage affecting ~11 million people; wireless and landline internet access was affected; and it was not possible to use 911, the emergency communication number. And again, in July 2022, the same provider suffered an outage of similar duration affecting ~14 million subscribers, roughly a third of the population of Canada. This latter network failure was almost identical to that suffered by Optus in Australia a year later where, due to incorrect configuration, a flood of IP routing data from the distribution routers into the core routers exceeded their capacity to process the information.
In Europe, both Ireland and France have seen prolonged carrier network outages – each of which impacted critical communications – due to a software upgrade or configuration error. In the UK, the 999/112 system went down for some 10 hours in June 2023 (the first time since its inception in 1937!), for which the public emergency service communications provider BT was recently fined £17.5 million. In its report, Ofcom, the UK telecoms regulator, said the emergency call handling outage was caused by an error in a file on a BT server, which meant systems restarted as soon as call handlers received a call. This left staff logged out and caused calls to be disconnected or dropped as they were transferred to the emergency services. The level of preparedness of BT, all the other UK network operators, and the emergency agencies themselves has come under scrutiny, with work to resolve this situation put in place soon after.
And in the US, the three major network carriers have all suffered recent outages. In February 2024, AT&T suffered a nationwide outage that disrupted wireless services for many customers. The issue was eventually attributed to an incorrect process during network expansion. T-Mobile experienced a major outage in June 2023, primarily affecting voice and text services, due to a routing issue. This disruption also had knock-on effects on other carriers, leading to widespread connectivity problems. And on multiple occasions, Verizon has faced outages, often linked to network changes or issues with inter-carrier connections.
So where critical communications are concerned, it is critical that citizens facing a critical situation at a critical moment in their life can get through to the emergency agencies. Emergency agencies and network providers need to be critical as to how they manage risk and ensure business continuity in situations where the normal, trusted method of emergency communication becomes unavailable. There is also a growing body of evidence that no level of investment or planning in technology resilience can survive the human element of poor software upgrades or configuration errors. Neither scale nor ignorance can be an excuse.