Shaping the future of cloud and HPC networking with optical circuit switches

30 April 2025

Michael Enrico, POLATIS Network Solutions Architect, HUBER+SUHNER

According to McKinsey, AI-ready data centre capacity is set to rise at an average rate of 33% a year until 2030, in response to client demand for AI applications like virtual assistants, autonomous vehicles, and facial recognition security systems. Hyperscalers in cloud computing and high-performance computing (HPC) providers are under pressure to scale their computing platforms while minimising capital expenditure (CAPEX) and reducing energy consumption.

The processing power required for AI workloads, particularly for training and inference, has skyrocketed, leading to mass deployment of GPUs and other accelerators — often numbering tens of thousands per installation.

To meet these challenges, network architects are adopting disaggregated architectures to build scalable, efficient platforms that support both commodity cloud compute and more demanding AI/ML workloads.

Disaggregation for cloud compute

For commodity cloud compute services, traditional platforms often rely on monolithic hosting structures, such as standard server chassis, which can be rigid and inflexible. By ‘disaggregating’ the underlying resources, operators can achieve greater efficiency, reducing both resource underutilisation and, crucially, power consumption.

In a disaggregated architecture, resources such as CPU, memory, storage, and acceleration hardware are interconnected through high-speed digital transceivers and a dedicated interconnect fabric, leveraging advanced transport media and switching technologies. The design allows resources to be flexibly combined and scaled independently, regardless of where they are hosted within individual racks or across the broader data centre, ensuring they meet the demands of expected workloads.

Several levels of disaggregation can be defined, related to the granularity at which the resource blocks can be accessed and consumed and the manner in which they are hosted. In the most granular form of disaggregation, atomic resource blocks (e.g., a bank of DRAM, a CPU, an accelerator) are combined. Other approaches use technologies such as PCI Express over fibre to ‘export’ resources within a server chassis for combination with resources hosted elsewhere in the same installation.
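The composition model described above can be sketched in a few lines of code. This is a minimal illustration only: the class names, fields, and units are hypothetical, not any vendor's API.

```python
# A minimal sketch of composing disaggregated resource blocks into a
# logical host. All names and units here are hypothetical illustrations.
from dataclasses import dataclass, field


@dataclass
class ResourceBlock:
    kind: str      # e.g. "cpu", "dram", "accelerator"
    capacity: int  # abstract units (cores, GiB, devices)
    rack: str      # physical location does not constrain composition


@dataclass
class ComposedHost:
    """A logical server assembled from blocks anywhere in the facility."""
    blocks: list = field(default_factory=list)

    def attach(self, block: ResourceBlock) -> None:
        self.blocks.append(block)

    def capacity(self, kind: str) -> int:
        return sum(b.capacity for b in self.blocks if b.kind == kind)


host = ComposedHost()
host.attach(ResourceBlock("cpu", 32, rack="A1"))
host.attach(ResourceBlock("dram", 512, rack="B7"))        # memory from another rack
host.attach(ResourceBlock("accelerator", 4, rack="C3"))
print(host.capacity("dram"))  # 512
```

The point of the sketch is that each resource type scales independently: attaching another DRAM block from any rack raises memory capacity without touching CPU or accelerator counts.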

Building and operating ever larger AI clusters

The training of large language models, which form the foundation of generative AI, demands the rapid processing of vast input datasets. Additionally, these models are built on neural networks that are growing increasingly sophisticated, with numerous nodes, layers, and parameters. To meet these stringent requirements efficiently, the workload is distributed across multiple specialised processing elements (e.g., GPUs) employing advanced strategies for effective parallelisation.

During a training job, the data traffic between the multiple processing elements typically follows relatively static and predictable patterns. However, dynamically rearranging the connections between processing elements between jobs — or even between significant phases of the same job — can significantly enhance execution efficiency, because the structure of the required neural network can be mapped more efficiently onto the underlying processing elements. Furthermore, such reconfiguration can be used to bypass failed elements so they do not hold up the progress of the job and can be replaced without impeding the operation of the rest of the cluster. This is where a switching interconnect fabric comes into its own.
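The bypass idea can be made concrete with a short sketch: given a set of processing elements and a failure list, recompute the cross-connects so the survivors still form a usable topology (a ring, in this illustrative example — element names and the ring choice are assumptions, not a description of any specific scheduler).

```python
# Hedged sketch: remap a job's logical ring topology onto the healthy
# processing elements, bypassing a failed one. Names are illustrative.
def build_ring(elements, failed):
    """Return circuit-switch cross-connects forming a ring over healthy elements."""
    healthy = [e for e in elements if e not in failed]
    return [(healthy[i], healthy[(i + 1) % len(healthy)])
            for i in range(len(healthy))]


gpus = ["gpu0", "gpu1", "gpu2", "gpu3"]
print(build_ring(gpus, failed={"gpu2"}))
# [('gpu0', 'gpu1'), ('gpu1', 'gpu3'), ('gpu3', 'gpu0')]
```

With an OCS fabric, applying the new cross-connect list is a physical reconfiguration of light paths; the failed element can then be swapped out while the rest of the cluster keeps working.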

The interconnect fabric

An optical interconnect fabric with transparent optical circuit switching (OCS) provides deterministic, circuit-switched, fixed-bandwidth data paths, making it an ideal solution for connecting the aforementioned resource elements.

Compared to an electrical, packet-switching fabric, it promises significant reductions in the power consumption of the fabric itself, much lower latencies along the data paths through it, and a greater ability to scale the fabric up and out. It is also significantly better future-proofed, because the fabric is inherently transparent to the formats and line rates of the serialised data traffic between the optical transceivers associated with the disaggregated resource elements.

The lowest-loss optical switches allow fabrics to be constructed with unprecedented scalability, with up to four or more stages of switching, while keeping within the optical loss budgets of the transceivers typically used in the disaggregated resource or clustered processing elements.
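The loss-budget constraint is simple arithmetic: each switching stage and each fibre connection consumes part of the end-to-end optical budget of the transceivers. The figures below are assumed round numbers for illustration, not vendor specifications.

```python
# Illustrative loss-budget arithmetic for a multi-stage OCS fabric.
# All decibel figures are assumptions chosen for illustration.
per_switch_loss_db = 1.5      # assumed insertion loss per OCS stage
connector_loss_db = 0.5       # assumed loss per fibre connection
transceiver_budget_db = 10.0  # assumed end-to-end optical budget


def stages_supported(budget_db, switch_loss_db, conn_loss_db):
    """Largest n where n switches plus n+1 fibre connections fit the budget."""
    n = 0
    while (n + 1) * switch_loss_db + (n + 2) * conn_loss_db <= budget_db:
        n += 1
    return n


print(stages_supported(transceiver_budget_db, per_switch_loss_db, connector_loss_db))
# 4 — consistent with fabrics of four or more stages when losses are low
```

The calculation shows why low per-stage insertion loss matters: it is the multiplier on the number of stages, and therefore on how large the fabric can grow before the transceiver budget is exhausted.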

The benefits of disaggregating data centre infrastructure with optical circuit switching

Underlying hardware platforms for cloud computing customers can be dynamically composed and scaled in real time to match the specific size and resource ratio required for the workloads they support. Disaggregation provides additional flexibility to reconfigure or resize the platform during a workload’s lifecycle as its resource consumption evolves. Moreover, unused resources can be temporarily powered down when not needed, enabling operational expenditure (OPEX) savings.

Some optical circuit switches also contribute to network resilience by supporting ‘dark fibre’ switching. These are able to maintain connections without light being present on the fibre, which allows for faster network reconfiguration and fault tolerance. For example, when a component in a disaggregated system or AI cluster fails, OCS enables swift rerouting through pre-selected optical paths, minimising downtime and ensuring that resource utilisation remains optimal.
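Protection switching over pre-provisioned dark paths amounts to holding a ready alternative route and selecting it on failure. The sketch below illustrates the idea; the port identifiers and data structure are hypothetical, not a real switch's control interface.

```python
# Sketch of failover to a pre-selected dark-fibre path. The switch/port
# identifiers below are hypothetical illustrations.
paths = {
    "primary": ["ocs1:p3", "ocs2:p7"],  # in-service route
    "backup":  ["ocs1:p4", "ocs3:p2"],  # dark path held ready (no light yet)
}


def select_route(primary_ok: bool) -> list:
    """Fail over to the pre-selected dark path when the primary degrades."""
    return paths["primary"] if primary_ok else paths["backup"]


print(select_route(primary_ok=False))  # ['ocs1:p4', 'ocs3:p2']
```

Because the backup cross-connects are established in advance on dark fibre, failover is a selection rather than a recomputation, which is what keeps reconfiguration fast.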

Finally, disaggregation helps to reduce the growing CAPEX burden by enabling operators to select and combine best-of-breed vendors for various component building blocks. It allows them to purchase resources in proportions that support the specific functions they need and incrementally upgrade different types and/or blocks of resource elements as required. This staged approach minimises disruption to the wider infrastructure. Additionally, since optical circuit switches are transparent to signal type and data rates, they do not need to be upgraded as network speeds increase.

Shaping the future

It is not surprising that optical circuit switches have emerged as a cornerstone of disaggregated network design in AI-ready data centres. Their unique combination of low optical loss, minimal latency, transparency to signal types and transmission speeds, and extremely low power consumption compared to packet switches makes them indispensable in this new era.