Ensuring business continuity when disaster strikes

27 August 2024

Within the last month, two IT catastrophes have rocked the world. On 19 July, global TV channels, transport networks and banks were knocked offline by a massive outage that caused Windows computers to shut down suddenly; a CrowdStrike update was ultimately identified as the source of the error. Then on 30 July, another widespread outage affecting Microsoft 365 and Azure services was reported; this was initially blamed on a VMware update, but later attributed to complications from a DDoS attack.

The impact on organisations was significant. OAG, a provider of digital flight information, reported that the world’s 20 largest airlines cancelled nearly 10,000 flights between 19 and 21 July. Digital banks and financial companies also struggled to serve their customers during the outage; Visa received more than 64,000 user reports on 19 July compared to its typical daily average of just 1,500. The NHS’s healthcare platform EMIS was adversely affected, leaving many GPs unable to book appointments, issue prescriptions, or receive test results.

The outages showed just how vulnerable today’s networks are to software glitches and faulty updates.

“The CrowdStrike situation is a reminder that delivering software quality at scale is incredibly difficult,” says Greg Notch, chief information security officer at Expel. “While it’s easy to pile on the criticism, the security industry and its customers should take this opportunity to reflect on our own practices and review our threat models to ensure that when things like this happen in the future – and they will – we have prevention and resilience strategies in place to mitigate the impact.”

Jack Porter, public sector specialist at Logpoint, warns of the risk associated with relying on single providers and complex cyber ecosystems: “long term, this has the potential to see such software dependencies regarded as an additional risk. Large cybersecurity vendors may now be included with the likes of digital service providers such as AWS, Microsoft and Google services as key suppliers by insurance companies as this has illustrated the devastating impact a security software failure can have.”

Ultimately, the outages demonstrate the need for more robust and resilient solutions, so that issues can be resolved quickly without causing such widespread chaos.

“Preparedness is key – every IT and security vendor must have a robust system in place across its software development lifecycle to test upgrades before they are rolled out to ensure that there are no flaws within the updates,” asserts Mark Jow, security evangelist at Gigamon.

According to Notch, one way for companies to help themselves avoid these situations is to diversify their security technologies: “adopting best-of-breed solutions for each organisation’s specific needs and ensuring they integrate with each other is a huge first step in achieving that diversity and avoiding unnecessary risk. And if a company already has a comprehensive security platform in place, it would be in its security team’s best interest to look at ways of reinforcing redundancy plans for when a software issue impacts their security capabilities. Resilience is a critical outcome security teams should be delivering and testing.”

“Service Disruption Management (SDM) has emerged as a crucial tool for addressing these challenges, and these internal systems can be enhanced by integrating them with crowdsourced service disruption management (CSDM) solutions that can assess the scale of an outage and provide real-time information to affected users,” adds Mark Giles, lead industry analyst, Ookla. “By integrating CSDM with existing network management systems, service providers can gain a more comprehensive view of their performance and take swift action to mitigate the impact of service disruptions on end users. Identifying priority areas allows for a more coordinated response, minimising impact and protecting the company’s reputation.”