How data centres will more swiftly and affordably use AI

31 January 2024

Earl Parsons, director data centre/intelligent building architecture evolution, CommScope


The engagement of millions of people with transformative tools such as ChatGPT rests on a reality that is easy to overlook: an intricate interplay between curious minds and massive data centres distributed across the globe. When popular science fiction depicts the ‘rise of machine intelligence,’ it usually comes with lasers, explosions or, in gentler examples, mild philosophical dread. But there can be no doubt that interest in artificial intelligence (AI) and machine learning (ML) is on the rise, and new applications are appearing daily.

Establishing AI clusters within data centres, with dedicated space for building, training, and refining AI models as business goals evolve, is now imperative. At the heart of these clusters lie racks upon racks of Graphics Processing Units (GPUs), which supply the processing power demanded by exhaustive algorithm training. Therefore, as users indulge in seemingly casual exchanges with AI, the underlying infrastructure must be a key consideration for forward-thinking enterprises.

It all starts with effective training

Beneath the veneer of seamless AI interactions lies a challenge: creating a genuinely useful AI model demands enormous volumes of training data, rendering the process not only costly but power-intensive. Efficiency hinges on the training process, meaning data centres must meticulously manage their AI and compute clusters.

When training a large-scale AI model, about 30% of the required time is consumed by network latency, and the remaining 70% is spent on compute. Any opportunity to cut this latency, even the roughly 50ns gained by shortening a fibre run by just 10m, produces significant savings in time and cost. Considering that training such a large AI model can easily cost $10 million or more, the price of latency becomes very clear.
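
To make that arithmetic concrete, the sketch below works through an illustrative example. The roughly 5ns of delay per metre of fibre, the 30% latency share, and the $10 million training cost come from the figures above; the 1µs end-to-end latency per transfer and the linear scaling are hypothetical simplifications, so treat the output as an order-of-magnitude estimate rather than a definitive model.

```python
# Rough, illustrative estimate of what shorter fibre runs are worth over a
# long AI training run. Assumptions for this sketch:
#   - ~5 ns of propagation delay per metre of fibre (from the 50ns / 10m above)
#   - 30% of wall-clock training time attributed to network latency
#   - a $10M total cost for the training run
#   - a hypothetical 1 microsecond end-to-end latency per network transfer

NS_PER_METRE = 5.0            # approximate propagation delay in glass fibre
LATENCY_SHARE = 0.30          # fraction of training time spent on network latency
TRAINING_COST_USD = 10_000_000

def latency_saving(metres_removed: float, total_latency_ns: float) -> dict:
    """Estimate time and cost saved by removing `metres_removed` of fibre,
    assuming savings scale linearly with the per-transfer latency."""
    ns_saved = metres_removed * NS_PER_METRE
    fraction_of_latency = ns_saved / total_latency_ns
    fraction_of_job = fraction_of_latency * LATENCY_SHARE
    return {
        "ns_saved_per_transfer": ns_saved,
        "share_of_training_time": fraction_of_job,
        "cost_saved_usd": fraction_of_job * TRAINING_COST_USD,
    }

# Example: trimming 10m from a path with 1 microsecond of end-to-end latency.
print(latency_saving(metres_removed=10, total_latency_ns=1_000))
```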

Fibre is an amazingly efficient, low-loss, low-latency medium. AI clusters operate at 100G or 400G speeds, but considering the sheer volume of data being moved across the cluster, every extra metre of fibre cabling adds latency and power consumption, and therefore cost.

Watts, nanoseconds, and metres

Efficiency, therefore, becomes the cornerstone of progress. Data centres must grapple with the delicate balance between optimising physical layout, mitigating heat generation, and minimising fibre-optic cable lengths. Because fibre runs must be as short as possible, the cost of each optical link is dominated by the transceiver rather than the cabling. Transceivers that use parallel fibre have an advantage here: they do not require the optical multiplexers and demultiplexers used for wavelength division multiplexing (WDM), which gives them both lower cost and lower power consumption. This meticulous balance, aimed at trimming metres, nanoseconds, and watts, necessitates careful consideration of optical transceivers and fibre cables.
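
As a rough illustration of that point, the sketch below tallies hypothetical per-link power and cost for a parallel-fibre transceiver versus a WDM transceiver. Every watt and dollar figure is an invented placeholder; the only claim carried over from the text is that the WDM path adds mux/demux optics that the parallel-fibre path does not need.

```python
# Hypothetical per-link comparison of parallel-fibre versus WDM optics for a
# 400G link. All watt and dollar figures below are invented placeholders; the
# structural point is that the WDM path carries mux/demux optics that the
# parallel-fibre path does not need.

def link_totals(transceiver_w, transceiver_usd, mux_demux_w=0.0, mux_demux_usd=0.0):
    """Return (watts, dollars) for one link with a transceiver at each end."""
    per_end_w = transceiver_w + mux_demux_w
    per_end_usd = transceiver_usd + mux_demux_usd
    return 2 * per_end_w, 2 * per_end_usd

# Parallel fibre: no mux/demux stage at either end.
parallel_w, parallel_usd = link_totals(transceiver_w=8.0, transceiver_usd=500.0)
# WDM duplex: add a placeholder mux/demux power and cost at each end.
wdm_w, wdm_usd = link_totals(transceiver_w=9.0, transceiver_usd=800.0,
                             mux_demux_w=1.0, mux_demux_usd=200.0)

print(f"parallel fibre: {parallel_w:.0f} W, ${parallel_usd:.0f} per link")
print(f"WDM duplex:     {wdm_w:.0f} W, ${wdm_usd:.0f} per link")
```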

The debate between singlemode and multimode fibre further underscores the need for strategic decision-making. Singlemode fibre supports much longer links, but multimode fibre comfortably covers the link lengths of up to 100m typical of AI clusters, and does so at lower cost and with lower power consumption, making it a preferred choice.

CommScope’s ongoing market pricing research indicates that, for high-speed transceivers (400G+), a singlemode transceiver costs roughly double an equivalent multimode transceiver. While multimode fibre itself costs slightly more than singlemode fibre, the difference in cable cost is far smaller than the difference in transceiver cost, since multifibre cable costs are dominated by the MPO connectors.

High-speed multimode transceivers use one to two watts less power than their singlemode counterparts. With up to 768 transceivers in a single AI cluster, a setup using multimode fibre will save up to 1.5kW. This may seem small compared to the 10kW that each GPU server consumes, but for AI clusters any opportunity to save power can deliver significant savings over the course of AI training and operation.
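
Pulling the figures in the last two paragraphs together, the sketch below scales the per-transceiver differences up to a full cluster. The 2x cost ratio, the 1-2W power delta, and the 768-transceiver cluster size come from the text above; the absolute multimode transceiver price and the electricity rate are hypothetical placeholders.

```python
# Cluster-level view of the multimode versus singlemode comparison above.
# From the text: singlemode transceivers cost roughly 2x their multimode
# equivalents, draw 1-2 W more each, and a single AI cluster can hold up to
# 768 transceivers. The absolute multimode price and the electricity rate
# below are illustrative assumptions, not quoted figures.

TRANSCEIVERS = 768
MM_COST_USD = 500.0             # assumed multimode transceiver price
SM_COST_USD = 2 * MM_COST_USD   # ~2x multimode, per the pricing research
POWER_DELTA_W = 2.0             # upper end of the 1-2 W per-transceiver delta
ENERGY_USD_PER_KWH = 0.10       # assumed electricity price

capex_saving_usd = TRANSCEIVERS * (SM_COST_USD - MM_COST_USD)
power_saving_kw = TRANSCEIVERS * POWER_DELTA_W / 1000   # ~1.5 kW

hours_per_year = 24 * 365
energy_saving_usd = power_saving_kw * hours_per_year * ENERGY_USD_PER_KWH

print(f"transceiver capex saving: ${capex_saving_usd:,.0f}")
print(f"power saving:             {power_saving_kw:.2f} kW")
print(f"energy saving per year:   ${energy_saving_usd:,.0f}")
```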

AOCs versus transceivers

Many AI/ML and high-performance computing (HPC) clusters use active optical cables (AOCs) to interconnect GPUs and switches. The downside of AOCs is that they lack the flexibility offered by transceivers. Installing AOCs is time-consuming, as the cable must be routed with the transceivers attached, and correctly installing AOCs with breakouts is especially challenging. The failure rate for AOCs is double that of equivalent transceivers, and when an AOC fails, a new one must be routed through the network, taking away from compute time. Finally, when it is time to upgrade the network links, the AOCs must be removed and replaced with new AOCs. With transceivers, the fibre cabling is part of the infrastructure and may remain in place for several generations of data rates.
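
To put a rough number on that reliability gap, the sketch below compares expected annual failures and repair downtime for AOCs versus pluggable transceivers over structured cabling. Only the 2x relative failure rate comes from the text above; the baseline failure rate, link count, and repair times are hypothetical assumptions chosen purely for illustration.

```python
# Illustrative comparison of expected failures and repair downtime for AOCs
# versus transceivers over structured cabling. Only the 2x AOC failure rate
# comes from the text; the baseline failure rate, link count, and repair
# times are hypothetical assumptions.

LINKS = 768
TRANSCEIVER_ANNUAL_FAILURE_RATE = 0.01                          # assumed 1% per link per year
AOC_ANNUAL_FAILURE_RATE = 2 * TRANSCEIVER_ANNUAL_FAILURE_RATE   # 2x, per the text

# Assumed repair effort: swapping a pluggable transceiver is quick; replacing
# an AOC means re-routing the whole cable with the ends attached.
TRANSCEIVER_REPAIR_HOURS = 0.5
AOC_REPAIR_HOURS = 4.0

def expected_downtime(links, failure_rate, repair_hours):
    """Expected failures per year and the repair hours they consume."""
    failures = links * failure_rate
    return failures, failures * repair_hours

for name, rate, hours in (
    ("transceivers", TRANSCEIVER_ANNUAL_FAILURE_RATE, TRANSCEIVER_REPAIR_HOURS),
    ("AOCs", AOC_ANNUAL_FAILURE_RATE, AOC_REPAIR_HOURS),
):
    failures, downtime = expected_downtime(LINKS, rate, hours)
    print(f"{name}: ~{failures:.1f} failures/year, ~{downtime:.0f} repair hours/year")
```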

Most AOCs are used for short reaches and typically use multimode fibre and vertical-cavity surface-emitting lasers (VCSELs). High-speed (>40G) AOCs use the same OM3 or OM4 fibre as the cables that connect optical transceivers. The transmitters and receivers in an AOC may be the same as those in analogous transceivers, but they are often the castoffs: each transmitter and receiver does not need to meet rigorous interoperability specs; it simply needs to operate with the specific unit attached to the other end of the cable. And since no optical connectors are accessible to the installer, the skills required to clean and inspect fibre connectors are unnecessary.

To conclude

The era of AI/ML has ushered in a transformative age. While interacting with AI services may feel effortless to users, the foundation supporting this revolution lies within the expansive landscape of large-scale data centre infrastructure. Forward-thinking enterprises, aware of how important advanced fibre infrastructure is in driving efficient AI training and operation, are poised to see incredible results in our rapidly evolving, interconnected world. As the technology continues to grow and change, investing in these advancements today promises success tomorrow.