How resilience keeps you in control, even when outages aren’t your fault

30 March 2026

By Ramtin Rampour, Principal Solutions Architect, Opengear.

Organisations used to treat resilience as an engineering preference: worthy of attention, but rarely a business differentiator. Today, it is an increasingly competitive capability.

When customers rely on always-available digital services, and when operations stretch across shared cloud platforms, dense data centre fabrics, and remote edge sites, the ability to stay in control during disruption becomes a mark of maturity. Given the complexity of modern network connectivity, large businesses inevitably struggle to avoid downtime altogether.

The Cloudflare outage on 18 November 2025 was a sharp reminder of how quickly dependencies can amplify an internal fault into widespread impact. It also highlighted an uncomfortable truth for enterprise teams: you can run a disciplined environment and still be affected by failures you didn’t cause and can’t directly fix.

The practical response is not to chase perfect uptime. It is to engineer networks so that when the unexpected happens, recovery is fast and repeatable.

That shift in thinking matters because the cost of disruption is changing. Networks are no longer only about transporting packets of data. They are the control plane for digital business, and increasingly the backbone for AI estates that don’t pause between business hours and overnight maintenance windows.

AI raises the costs of downtime

AI changes what an outage means. When systems fail, the impact is more immediate, and even brief delays can carry real consequences. They affect both main phases of the AI lifecycle: training, where models are built or fine-tuned, and inference, where they generate outputs in production. In training environments, disruption does not just slow progress. It leaves expensive compute capacity sitting idle, interrupts work already in motion, and forces teams to spend time repeating jobs just to restore trust in the outcome.

Inference changes the risk profile once again. Decisions are increasingly being made where data is generated, in environments where speed, certainty, and continuity matter. In that context, degraded performance can be nearly as damaging as a system that is fully down. When response times slip or confidence falls, operators may have no good option other than moving ahead with incomplete insight or pausing operations entirely.

This is one reason network operations are shifting toward greater automation. In a recent Opengear survey, 94% of CIOs and CSOs reported implementing AI for network management. That is not a claim that tools alone prevent outages. It is an acknowledgement that estates are growing beyond what manual monitoring and ad hoc response can reliably govern. As a result, resilience is increasingly judged in more practical terms. The real test is how quickly control is restored and how confidently teams can say services are up and running again.

The control-path problem

Many organisations still try to meet these expectations with operating models built for more predictable network environments. Redundancy is in place, but resilience is typically assumed. The gap shows up most clearly in the moments that matter: when the production network is impaired and the team needs access to diagnose and recover.

In-band management remains common, which means the tools used to reach network devices depend on the same network path that may have failed or become unstable. When that happens, visibility drops and recovery slows. Engineers can be left waiting for partial restoration, relying on indirect signals or escalating to on-site intervention to regain access to equipment.

Distributed estates make this harder. An unmanned edge site turns a technical fault into a logistics problem, and preventative maintenance becomes inconsistent when every improvement depends on travel, access windows, and local hands. Over time, that creates the conditions for small degradations to accumulate until they become disruptive.

This is where out-of-band management becomes central to modern resilience. An independent management plane keeps a control path available even when the production network is down, misconfigured, or congested.

Instead of being cut off from the devices needed to fix the outage, teams can continue to access infrastructure, diagnose faults, roll back changes, and restore service without relying on the failing production path.
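That fallback logic can be sketched in a few lines. The example below checks whether a device is reachable over the usual in-band path and, if not, selects an independent out-of-band path instead. The hostnames, ports, and path names are hypothetical; a real deployment would reach the device through a console server and authenticated sessions rather than bare TCP probes.

```python
import socket

def reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def pick_management_path(in_band: tuple, out_of_band: tuple) -> str:
    """Prefer the production (in-band) path; fall back to the independent OOB path."""
    if reachable(*in_band):
        return "in-band"
    if reachable(*out_of_band):
        return "out-of-band"
    return "unreachable"

# Hypothetical endpoints: production SSH vs. a console-server port.
path = pick_management_path(("core-switch.example", 22), ("console-server.example", 2222))
```

The point of the sketch is that the decision is automatic: operators do not have to notice that the in-band path is gone before the tooling routes them around it.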

Because incidents are often stressful, that control path must also be designed to be safe. Strong authentication, least-privilege access, and audit-ready logging are what allow teams to move quickly without creating a secondary security incident in the middle of recovery.

Designing resilience as an operating capability

If resilience is now a differentiator, it must be built into daily operations, not treated as an emergency-only feature. The goal is to reduce the number of incidents that become outages, and to shorten the time from disruption to confidence when outages do occur.

The first step is to pinpoint signs of instability early, before they end up causing an outage. Most network failures do not begin with a single event. They build quietly as performance degrades and the network moves outside its normal operating margin. Analytics can help teams identify that shift earlier across large estates. That matters because early visibility gives operators time to intervene directly, make informed changes, and restore stability before the issue spreads.
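One minimal way to express that idea is a rolling baseline: keep a window of recent measurements and flag any new sample that falls outside the normal operating margin. The window size and the three-sigma threshold below are arbitrary illustrative choices, not a recommendation.

```python
from collections import deque
from statistics import mean, stdev

class DriftDetector:
    """Flags samples that fall outside the recent operating margin."""

    def __init__(self, window: int = 50, n_sigma: float = 3.0):
        self.samples = deque(maxlen=window)  # rolling baseline of recent values
        self.n_sigma = n_sigma

    def observe(self, value: float) -> bool:
        """Record a sample; return True if it deviates from the rolling baseline."""
        if len(self.samples) >= 10:  # need enough history before judging
            mu, sigma = mean(self.samples), stdev(self.samples)
            out_of_margin = sigma > 0 and abs(value - mu) > self.n_sigma * sigma
        else:
            out_of_margin = False
        self.samples.append(value)
        return out_of_margin
```

Real network analytics are far richer than this, but the principle is the same: the alert fires on deviation from recent behaviour, not on a fixed threshold, so slow degradation is caught before it becomes an outage.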

Automation then turns intent into consistency. Repeatable provisioning reduces configuration drift. Standardised recovery workflows reduce variance in how incidents are handled. Post-recovery checks reduce the period of uncertainty that often follows an incident when services may be reachable but not healthy. In AI environments, where partial failure can silently distort outcomes or degrade performance, verification matters as much as restoration.
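The distinction between “reachable” and “healthy” can be made concrete with a post-recovery checklist that only declares success when every check passes. The check names below are illustrative placeholders for whatever probes an estate actually runs.

```python
from typing import Callable, Dict, List, Tuple

def verify_recovery(checks: Dict[str, Callable[[], bool]]) -> Tuple[bool, List[str]]:
    """Run every post-recovery check; return overall result and the names that failed."""
    failed = [name for name, check in checks.items() if not check()]
    return (len(failed) == 0, failed)

# Hypothetical checks: each would wrap a real probe in practice.
ok, failed = verify_recovery({
    "device_reachable": lambda: True,
    "routing_sessions_established": lambda: True,
    "latency_within_margin": lambda: True,
})
```

Because the workflow reports which checks failed rather than a bare pass/fail, the period of uncertainty after an incident shrinks: teams know exactly what is still unhealthy, not just that something is.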

Out-of-band management supports all these steps because it keeps the mechanism for action available. Without it, even strong analytics and automation can be undermined by a simple problem: the team cannot reach the systems that need attention. With it, resilience becomes a repeatable operating model that scales from core data centres to remote sites.

Why resilience matters

Resilience is about staying operationally in control during disruption. In the AI era, where training and inference raise the cost of outages and where estates span cloud, core, and edge, that capability becomes a practical advantage.

Resilience is also not just about recovery; it begins on the first day your infrastructure is stood up. Out-of-band access means you can build, deploy, and automate on day zero: remotely, securely, and at scale.

The organisations that invest in independent access through out-of-band management, pair it with earlier detection and automation, and make recovery verifiable rather than assumed will not only reduce downtime but also recover faster when it matters most.