Building resilient IT operations: lessons to learn

05 February 2025

Melissa Bischoping, Senior Director, Security & Product Design Research, Tanium

Last summer’s IT glitch that led to unprecedented levels of digital chaos was a wake-up call on a global scale. But unlike other incidents, the outages were not the result of a security incident or malicious cyber activity.

In this case, the cause appears to have been a routine software update that didn’t quite go to plan. This latest headline-grabbing event showed how easily things can grind to a halt when a spanner is thrown into the works. In a shift from the usual storyline, many endpoints were patched and protected, yet still fell victim to a risk inherent in our current IT systems. Modern organisations may have as-yet-unknown single points of failure that can lead to widespread disruption. For many businesses, the outage resulted in significant operational delays, lost revenue, and a breach of customer trust.

The incident underscored a harsh reality: no IT system, no matter how advanced, is completely immune to failure.

As Gartner so eloquently put it in How to Prepare for Cloud Outages: “all systems are subject to failure. We cannot purchase hardware that never breaks, we cannot build software that is entirely bug-free – and, most importantly, we must always live with human error. It is impossible not to make errors that can potentially cause downtime, degrade service or result in data loss. However, we can try to reduce the impact of failures.”

The risks of relying on a single IT platform
One of the most glaring lessons from the CrowdStrike outage is the risk of not having robust recovery plans and diversified infrastructure to support business continuity. As our world becomes more interconnected, the failure of one system can have a domino effect, impacting a wide range of services.

The events of last summer have, once again, made it clear that organisations need to diversify their IT infrastructures. By integrating multiple systems that can operate independently, yet support each other in times of need, businesses can build a more resilient framework. This approach not only mitigates the risk of a total system collapse but also ensures that services can continue operating even when one part of the system fails.

Data confidence is key
For me, effective management of IT systems, particularly during crises, hinges on one crucial factor: data confidence. If organisations want to respond swiftly and effectively to IT failures, then they must have complete visibility over their systems and immediate access to accurate data. Without these, the ability to diagnose and rectify issues promptly is severely compromised.

On the evidence of recent weeks, many organisations lack the necessary infrastructure to gather and analyse data in real time. This gap in capability often results in delayed responses, finger-pointing and prolonged downtime, exacerbating the impact of IT failures.

If any good is to come out of the CrowdStrike glitch, it is the reminder that organisations wanting to build more resilient IT operations must invest in systems that provide comprehensive visibility and allow for real-time data access.

In this regard, there are no shortcuts. This capability is essential not just for responding to incidents, but also for preventing them. By continuously monitoring their systems, insisting on controlled deployments of updates, and having real-time data at their fingertips, organisations can identify potential issues before they escalate into full-blown outages.

The same is true of a strategic approach to IT infrastructure. By planning ahead, thinking the unthinkable, and accepting that digital disasters are a matter of ‘when’, not ‘if’, organisations can be much better prepared to prevent, respond to, and recover from incidents with minimal disruption.

But that means adopting a much more proactive stance towards IT management. It means focusing not only on preventing failures but also on ensuring that systems are in place to quickly address issues when they do occur. That’s why ‘defence in depth’, where multiple layers of security and operational controls are deployed, is a key strategy. Often, organisations think of defence in depth as something to apply to adversarial behaviour, but the same approach is equally beneficial in addressing operational business risks and critical system availability.

By diversifying IT platforms, ensuring data confidence through complete visibility and immediacy, and adopting a multi-layered approach to security and operations, businesses can build a more resilient IT infrastructure. For business leaders, it’s about putting in place a strategy to mitigate risk while building resilience to ensure services have the very best chance of remaining operational – even in the face of unexpected events.