Roundtable: Resilience by design

05 November 2024

Achieving network resilience can take a lot out of an enterprise, placing pressure on known and unknown points. We chat with Chris McKie at Datto, a Kaseya company; Alex Grant at 24 Seven Cloud; and Rene Neumann at ZPE, for their insights on creating a truly resilient network.

When it comes to network resilience, what is the end goal for IT teams?

Chris McKie, VP, Product Marketing, Security and Networking Solutions at Datto, a Kaseya company: The end goal is reliable, secure and available connectivity that gives workers, guests and apps that use the network maximum uptime, even if networking gear fails or other outages occur.

Alex Grant, Director at 24 Seven Cloud: To create a network that can self-heal. A resilient network is designed with a set-up and configuration that mitigates problems on its own, meaning it can rectify issues autonomously. Designing a network right the first time means it can cope with future issues, and the added benefit is that it can resolve those issues itself, without human intervention.

Rene Neumann, Director of Solutions at ZPE: The real question is whether resilience is high enough on their priorities list, especially given the fact that networks can fail due to configuration errors, cyberattacks, and even natural disasters. The answer largely depends on who you ask within an organization and their understanding of what resilience means.

How does resilience differ from redundancy?

Neumann: Whether critical systems are in the cloud, on-prem, or in a hybrid environment, when customers lose access to services, the business’s bottom line is at stake. In such cases, the network team is often the first to receive a call, with an urgent request to fix the issue immediately and ensure it never happens again. This is where the distinction between redundant and resilient systems becomes crucial. Redundancy involves adding backup components, like load balancers, to increase system availability. But redundancy simply reduces the risk of failure in specific scenarios, and in many cases it can add complexity to the network and complicate future changes.

McKie: Resiliency and redundancy are often conflated to mean the same thing, which is not true. Redundancy incorporates duplicative technologies to address the risk of failure. For example, duplicate power supplies are often seen as redundant components of a switch. Resiliency, on the other hand, addresses not just network components, but includes network management, network security, operations, back-up and recovery. Resiliency is much more than failover; it’s about hardening a network so that services, even if impacted, are restored quickly and reliably. A resilient network is easily managed and includes templates to automate deployment, as well as to track network configurations.

Grant: A business could have just one computer that does payroll; if this device fails then the company loses its entire payroll function. However, if the firm had two computers that were able to do the payroll, that is what is known as redundancy, which is essentially back-up, or in technical terms, N+1. In terms of a network, if you have a back-up switch or battery back-up, then you have redundancy. Another example of this in action could be a firm going to a data centre and choosing where to host its servers; this would mean it then has one-plus-two redundancy. It’s important for any company to have extra hardware or software, should there be a fault with its existing network. In short, redundancy can give you resilience, but resilience can also exist on its own, because resilience is about how a network functions.

What characteristics are intrinsic to resilient networks?

Grant: A resilient network should be able to cope with issues by itself. Whether that be a power failure at a data centre or a contractor cutting a fibre cable in the ground, if it has been designed to mitigate these types of issues without human intervention, then it will be able to endure any future problems that arise.

McKie: Intrinsic characteristics will include cloud-based management for quick recovery and ease of use. A resilient network will have a minimal learning curve for management. This eliminates dependency on the few experts who know the arcane, command-line-only keystrokes needed to keep the network up and running.

Neumann: Resilient networks are designed to mitigate disruptions by having the ability to adapt to changing environments and recover before a major prolonged outage can occur. A good analogy is the evolution of airplane wings. Early wings were rigid and broke, prompting engineers to design stronger ones. Eventually, they realized that flexibility was key to resilience. Similarly, resilient networks must be flexible and capable of adapting to changing conditions.

How important are periodic status audits?

Grant: Status audits are extremely important. When audits don’t happen, this is where glitches can creep in. You are more than likely to have an outage if you don’t audit your hardware. Planning also becomes a waste of time if a firm doesn’t keep up with its auditing process. Depending on the company size, disaster recovery testing should happen every three to 12 months, where a company’s servers are switched off. A firm can plan for the worst either in a test lab or using its actual live equipment – although the latter is dangerous and not recommended. Ideally, a company would perform disaster recovery testing in a test environment, by replicating its live network.

McKie: Real-time monitoring, not periodic audits, is a must. Service providers like MSPs deliver networking as a service, which requires networks that are always available and secure. Real-time auditing can’t be isolated and relegated to network management tools; it must be integrated into RMM (remote monitoring and management) and PSA (professional services automation) tools, because that’s where technicians spend all their time. When something happens, the system must react instantly and provide visibility within technicians’ tools. This automates the process of ticketing, tracking and troubleshooting.
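
By way of illustration, a minimal sketch of the alert-to-ticket flow McKie describes might look like the following. The endpoint URL, payload fields and response shape are hypothetical placeholders rather than any specific RMM or PSA vendor’s API.

```python
# Minimal sketch: push a network monitoring alert straight into the PSA ticket
# queue so a technician sees it in the tool they already work in.
# The endpoint and payload fields below are hypothetical placeholders.
import requests

PSA_TICKET_ENDPOINT = "https://psa.example.com/api/tickets"  # hypothetical endpoint

def alert_to_ticket(alert: dict, api_token: str) -> str:
    """Turn a monitoring alert into a PSA ticket and return the new ticket ID."""
    ticket = {
        "summary": f"[{alert['severity'].upper()}] {alert['device']} - {alert['message']}",
        "device_id": alert["device"],
        "priority": "P1" if alert["severity"] == "critical" else "P3",
        "source": "network-monitoring",
    }
    resp = requests.post(
        PSA_TICKET_ENDPOINT,
        json=ticket,
        headers={"Authorization": f"Bearer {api_token}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["id"]  # hypothetical response shape

# Example: a core switch going offline raises a critical ticket immediately.
# alert_to_ticket({"device": "core-sw-01", "severity": "critical",
#                  "message": "Device unreachable"}, api_token="...")
```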

Why is building resilience in from the network foundations superior to adding it on later?

Neumann: There are two challenges to achieving network resilience. The first is a lack of organizational mindset. Most believe that resilience can come from deploying more cybersecurity products or disaster recovery solutions, when the reality is incidents continue to increase despite these markets being worth hundreds of billions of dollars.

The other challenge has to do with the diversity of network environments themselves, as most are unique ‘snowflake’ architectures that require their own specific tools and implementations. This makes them slow to deploy, extremely complex, and resistant to changes. Addressing this challenge involves rethinking how networks are built. The single most important component of achieving resilience is the underlying network management infrastructure and the capabilities it provides.

Grant: If you want something done right, do it from the start. If you think about building a house, you want to do it right the first time. This creates a more stable platform to work from; the same goes for networks. The danger in adding add-ons at a later date is that they aren’t properly tested or designed to work together, which can cause issues. Introducing more variables can produce unexpected outcomes and unintended consequences. Designing from the ground up is the best way to build resilience.

McKie: The issue with add-ons is that integrations are not always maintained by the various vendors. If one party stops support, then over time the integrations break or lose their functionality. Because of this, integrations from a single source tend to work better over time.

Does any single technology stand out for resilience?

Grant: Containerised software is the most advanced in terms of resilience and fit-for-future. A container is an independent software programme that lives in the cloud. These ‘containers’ can be created or destroyed on demand, which ultimately means a new network can be created instantly. When compared to physical hardware, containerised software is the way forward for the industry.

It all comes back to redundancy vs resilience; you could have N plus a million just by pressing a button – instant, on-demand resiliency and redundancy. As we continue to use more data, using containerised software means the scale of a network can be increased on demand. It can also be switched on and off, meaning firms can take a ‘spend what you need’ or ‘burstable spend’ approach rather than having to pay a lot of money up front.
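
One way to picture the on-demand scale Grant describes is the sketch below, which uses the Docker SDK for Python to add or remove container replicas of a service. The image name and label are illustrative, and this is a deliberate simplification of what a real orchestrator would do.

```python
# Minimal sketch of "burstable" capacity: create or destroy container replicas
# on demand instead of buying fixed hardware up front.
# Assumes the Docker SDK for Python (pip install docker) and a local Docker
# daemon; the image and label are illustrative placeholders.
import docker

client = docker.from_env()
LABEL = "demo.service=web"  # illustrative label used to group the replicas

def scale_to(desired: int, image: str = "nginx:alpine") -> None:
    """Add or remove replicas until the running count matches `desired`."""
    running = client.containers.list(filters={"label": LABEL})
    if len(running) < desired:
        # Burst up: start extra replicas of the same image.
        for _ in range(desired - len(running)):
            client.containers.run(image, detach=True,
                                  labels=dict([LABEL.split("=")]))
    else:
        # Scale back down: stop and remove the surplus replicas.
        for container in running[desired:]:
            container.stop()
            container.remove()

# Burst up for peak load, then drop back when demand (and spend) should fall.
# scale_to(10)
# scale_to(2)
```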

Neumann: While it will take time for organizations to educate staff with a resilience-first perspective, implementing Isolated Managed Infrastructure (IMI) is a major step that any company can take right now. IMI is a separate network dedicated entirely to management tasks. It is not just a lifeline for recovery in case of outages or failures; it serves as the platform for organizations to re-tool, rebuild, and adapt their networks without having to change their physical infrastructure.

To build a resilient network, the IMI must have:

Security: The solution must address all supply chain vulnerabilities at the hardware and software levels and allow zero-trust enforcement on the management layer.

Isolation and Remote Access: The solution must use dedicated out-of-band management over multiple independent WAN links (MPLS, fibre, 5G, Starlink); see the sketch after this list. As seen in the recent CrowdStrike incident, companies with resilient remote access recovered faster than those that required manual, on-site intervention.

Automation and Openness: The solution must be able to run automation tooling for routine operations and disaster recovery, as well as host VMs, Docker containers, apps, and services that can be spun up/down to adapt to changing environments.
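
To illustrate the multi-link idea in the Isolation and Remote Access point, here is a minimal sketch that probes several out-of-band paths and falls back to the first one that still answers. The link names, gateway addresses and port are placeholders for illustration, not ZPE’s implementation.

```python
# Minimal sketch: keep the management plane reachable by probing each
# independent out-of-band WAN link and using the first one that responds.
# Gateways, link names and the SSH port below are illustrative placeholders.
import socket

OOB_LINKS = [
    ("mpls",     "10.255.0.1"),   # hypothetical per-link management gateway
    ("fibre",    "10.255.1.1"),
    ("5g",       "10.255.2.1"),
    ("starlink", "10.255.3.1"),
]

def reachable(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection (e.g. SSH to a console server) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def pick_oob_path():
    """Return the name of the first out-of-band link that is still usable."""
    for name, gateway in OOB_LINKS:
        if reachable(gateway):
            return name
    return None  # nothing reachable: on-site intervention is the last resort

# print(pick_oob_path())
```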

McKie: Software Defined Networking is a front-runner, but it isn’t a panacea. Advanced network resiliency requires multiple touch points, which eliminates single points of failure and dependency on any single product or technology. No single technology stands out as the one that makes a network resilient.

How can an IT team know they’ve ‘made it’ on the path to resilience?

McKie: The short answer is that you’ve never made it. Resiliency is an ongoing effort that changes over time. Because it’s a moving target, you’re always getting closer, but no one should ever be satisfied that they’ve achieved 100% resiliency.

Grant: The ultimate goal is to have everything working correctly. Some would say they’ve ‘made it’ when they have nothing left to do. A car manufacturer could, for example, build a car that lasts for a million miles, but this may not be the most logical or sustainable option in terms of value for money for the end user.

It comes down to cost vs benefit. If you have unlimited money, you could have the ultimate resilient network with thousands of replicated cables and data centres. This could also depend on the industry and the severity of the situation, should a network go down. For example, healthcare pendants need to be active at all times; however, if the phone line to a car dealership goes down and customers can’t get through, the consequences aren’t as severe. It’s relative to the industry and the product that is being delivered to the end user. Some networks need to be more resilient than others. When deciding how resilient a network should be, it’s important to keep the end user in mind.