Can DCIM cope with the reality of GenAI workloads?

09 February 2024

Dean Boyle, CEO, EkkoSense

With data centre management busy coming to terms with the realities of hosting high-density AI systems, it’s clear there will be increased pressure on data centre optimisation as teams work to make their operations as lean as is practical.

Generative AI applications present teams with some very practical engineering challenges. How do you continue to balance risk, capacity and cooling when you’ll be running racks at 60kW – and potentially up to 100kW? That’s a huge difference for halls that were originally designed to host traditional 3-5kW racks. What are you going to do about cooling? And how can you be sure you have the right solutions in place when you may be running multiple data centres worldwide?
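To put those densities in perspective, here is a minimal back-of-envelope sketch in Python. The 1,200kW hall capacity and the 4kW legacy rack load (the midpoint of the 3-5kW range) are illustrative assumptions rather than figures from any real site, and `racks_supported` is a hypothetical helper. It looks at the power envelope only and ignores cooling, floor space and redundancy.

```python
def racks_supported(hall_capacity_kw: float, rack_load_kw: float) -> int:
    """Return how many whole racks fit within a hall's power envelope."""
    if rack_load_kw <= 0:
        raise ValueError("rack_load_kw must be positive")
    return int(hall_capacity_kw // rack_load_kw)


# Assumption: a hypothetical hall with 1,200kW of usable IT power.
HALL_CAPACITY_KW = 1200.0

# Assumption: 4kW stands in for the article's "traditional 3-5kW racks".
legacy_racks = racks_supported(HALL_CAPACITY_KW, 4.0)

# Rack densities quoted in the article for GenAI workloads.
genai_racks_60kw = racks_supported(HALL_CAPACITY_KW, 60.0)
genai_racks_100kw = racks_supported(HALL_CAPACITY_KW, 100.0)

# How many times denser a 60kW rack is than the assumed 4kW legacy rack.
density_ratio = 60.0 / 4.0

print(f"Legacy racks at 4kW:  {legacy_racks}")
print(f"GenAI racks at 60kW:  {genai_racks_60kw}")
print(f"GenAI racks at 100kW: {genai_racks_100kw}")
print(f"Density ratio (60kW vs 4kW): {density_ratio:.0f}x")
```

Under these assumptions, the same power that once fed hundreds of racks spread across a hall ends up concentrated in a few dozen, which is why cooling and capacity planning have to be revisited rather than simply scaled.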

Getting ready for GenAI workloads at 10x the power
Data centres are already looking to support workload increases of more than 20%, even before GenAI. With this in mind, it’s clear that managers need a comprehensive and sustained commitment to data centre performance optimisation. With AI compute workloads estimated to consume around 10x the power of standard deployments, the ability to either save power or release stranded capacity to increase IT loads will be critical.

Operations teams will need to unlock every possible area of improvement across their own data centres, those of colocation service partners, and edge facilities. Achieving this will require new levels of insight into existing thermal performance, power provision and capacity management – levels of insight that simply cannot be achieved by relying on legacy Data Centre Infrastructure Management (DCIM) and Building Management System (BMS) tools.

What action should managers be taking?
So what needs to change? Unfortunately, many data centres aren’t starting from a good place, and that’s hardly surprising when so many legacy facilities are over a decade old and often still operating to their original design parameters. At EkkoSense we believe it’s difficult to unlock the kind of cooling, power and capacity performance improvements that are needed to handle greater workloads and secure energy savings unless you know exactly what’s happening in your data centre in real time. We also believe that this won’t be achievable unless data centre management commits seriously to bridging the gap between its IT and M&E functions.

While the latest digital services and core business applications may run on leading-edge platforms, it’s still the traditional facilities management teams that manage and maintain the building and the critical supporting infrastructure within it. However, most IT teams have little interest in the underlying mechanical and electrical (M&E) infrastructure that provides the power and cooling their services depend on. Because of this, it’s not unusual to see expensive power and cooling resources being used inefficiently. Excess energy usage not only gets in the way of corporate net-zero initiatives, but also potentially places organisations at risk when IT loads increase or critical resources suddenly become depleted or unavailable.

Need to make the invisible visible
This is perhaps why legacy DCIM tools, which largely come from the IT side, often fail to address the very real M&E needs of data centre operators – especially in terms of capacity management and overall energy efficiency. With rack load densities increasing, sites expanding, and expert resources stretched, the reality is that data centre teams with traditional DCIM systems in place may have very little time to react to problems that could be escalating quickly.

That’s why EkkoSense is focused on complementing the capabilities of traditional DCIM-style data centre optimisation with a distinctive AI-powered approach that enables true real-time visibility of cooling, power and capacity performance. We believe that the only truly reliable way for data centre teams to troubleshoot and optimise performance is to gather massive amounts of data from right across their estates – and then leverage the power of machine learning and AI to help teams understand not just what’s happening but also why.

The good news for data centre managers is that this latest generation of AI-powered data centre optimisation software can extend the potential of their existing BMS and DCIM investments – helping their operations teams to stay on top of their escalating workloads and ESG reporting requirements.