Specialized IT hardware designed to perform generative AI training and inference is much more efficient at the job than generic IT systems. However, it is becoming clear that GPU-heavy servers are subject to many of the same problems that negatively impact the efficiency of traditional enterprise IT: low infrastructure utilization, high idle power and wasteful peak performance states.
The data center industry is on the precipice of an escalating power problem. Infrastructure for generative AI consumes a substantial amount of power and accounts for a significant portion of new data center capacity. Training of large language models (LLMs) currently dominates AI infrastructure needs, and the objective is to squeeze the most performance out of the hardware to shorten training times. Although total energy is often a secondary consideration for training runs, system utilization levels are high, which translates into relatively high (if suboptimal) power efficiency.
But the efficiency problem becomes greater with inference workloads, which will be responsible for a growing, and ultimately much larger, share of data center power consumption. In inference, AI-based services respond to users’ queries, and workload behavior is less predictable. The priority is cost-efficient token generation, which is often tied directly to revenue or business functions. Businesses will want to optimize inference hardware for power efficiency (while meeting quality of service) to suppress costs and avoid wasting energy on a colossal scale globally.
Better GPU power management mechanisms can significantly reduce energy spent per token, capital costs and carbon emissions in AI data centers. However, the control mechanisms are not currently well developed or understood.
Generic server processor (CPU) power management is a well-established discipline in traditional IT. The theoretical goal is simple: to ensure the server only uses the necessary amount of energy for the job. The underlying mechanisms are complex, involving software components in the operating system and hardware features in the silicon, such as voltage-frequency controls. A key aspect of power management is addressing idle power: CPU power can be cut considerably during periods of low workload activity — even reduced to near-zero for large parts of the chip.
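As a rough illustration of how an operating system exposes these controls, the sketch below assumes a Linux host with the cpufreq and cpuidle subsystems available under sysfs (paths vary by kernel and platform); it reads the active frequency governor and the idle-state residency counters that underpin CPU power management.

```python
# Minimal sketch: inspect CPU frequency scaling and idle (C-state) residency
# on a Linux host via sysfs. Assumes the cpufreq and cpuidle subsystems are
# enabled; exact paths and file names vary by kernel and platform.
from pathlib import Path

CPU0 = Path("/sys/devices/system/cpu/cpu0")

def read(p: Path) -> str:
    return p.read_text().strip()

# Voltage-frequency control: the governor decides how aggressively the OS
# requests lower frequencies (and voltages) when load drops.
print("governor:", read(CPU0 / "cpufreq/scaling_governor"))
print("current kHz:", read(CPU0 / "cpufreq/scaling_cur_freq"))
print("min/max kHz:", read(CPU0 / "cpufreq/scaling_min_freq"),
      read(CPU0 / "cpufreq/scaling_max_freq"))

# Idle power: C-states switch off parts of the core; deeper states save more
# energy but take longer to exit, which is the source of the latency trade-off.
for state in sorted((CPU0 / "cpuidle").glob("state*")):
    name = read(state / "name")
    usage = read(state / "usage")    # number of entries into this state
    time_us = read(state / "time")   # total microseconds spent in this state
    print(f"{name}: entered {usage} times, {time_us} us total")
```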
This technology is mature, well understood and clearly explained by vendors. It is known to reduce IT energy consumption effectively at certain points of load (see Understanding how server power management works). Still, it is not always used, because it involves a trade-off between performance and energy consumption: power management introduces latency into the chip's response to changes in load.
The picture is more complex for GPUs. A current-generation high-performance data center GPU is not a single chip but a complex multi-chip package combining several compute and memory dies, which means at least two, and likely more, clock domains whose operating frequencies affect power draw. Default power management tools offer only the most basic functionality: Nvidia's System Management Interface (nvidia-smi) supports power capping and offers three performance modes that appear to work by limiting clock frequencies. Little detail is available about how these settings affect overall efficiency.
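For reference, the same power-capping control that nvidia-smi exposes is also available programmatically through NVML. The sketch below uses the pynvml bindings (for example, via the nvidia-ml-py package) to read the current draw and enforced limit and to set a cap; the 300 W budget is an arbitrary illustration, the call requires administrative privileges, and supported ranges vary by GPU model and driver.

```python
# Minimal sketch: query and cap GPU board power through NVML (the library
# behind nvidia-smi), using the pynvml bindings (pip install nvidia-ml-py).
# Assumes an NVIDIA GPU, a recent driver, and root privileges to set the cap.
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    # Current draw and enforced limit, both reported in milliwatts.
    draw_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
    limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000.0
    min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
    print(f"draw: {draw_w:.0f} W, limit: {limit_w:.0f} W, "
          f"allowed range: {min_mw / 1000:.0f}-{max_mw / 1000:.0f} W")

    # Power capping: equivalent to `nvidia-smi -pl <watts>`. The GPU firmware
    # then throttles clocks to keep the board within this budget.
    target_w = 300  # hypothetical budget; must fall inside the allowed range
    pynvml.nvmlDeviceSetPowerManagementLimit(handle, target_w * 1000)
finally:
    pynvml.nvmlShutdown()
```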
Today, GPU vendors such as Nvidia and AMD are more interested in improving the maximum performance of their products than in promoting efficient AI compute. This has left the door open for smaller AI accelerator vendors, such as Qualcomm, SambaNova, Groq, Blaize, Tenstorrent and others, to specialize in efficient inference. Several cloud vendors have also developed their own inference chips, promising power-efficient operation. It is important to note that these companies do not make GPUs: their designs are simpler and cheaper, and lack some of the features required to train large models. These hardware platforms might represent a more economical choice for inference workloads, at least in the near term.
On the software front, a wave of new products aimed at fixing the shortcomings of GPU power management is coming. American startup Neuralwatt is developing machine learning-powered software that optimizes GPU power consumption. Firmware specialist AMI has extended the DCM software platform (originally developed by Intel) to provide insights into GPU health, performance and power consumption, facilitating better resource utilization. In time, the integration of accurate GPU power monitoring capabilities into IT management tools will enable users to investigate the energy-related aspects of AI compute in more detail.
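To illustrate the kind of visibility such integration could provide, the sketch below estimates energy per generated token by periodically sampling NVML power readings around an inference call. The run_inference callable, the sampling interval and the integration method are illustrative assumptions, not a description of any vendor's product.

```python
# Illustrative sketch: estimate joules per generated token by sampling NVML
# power readings around an inference call. run_inference() is a hypothetical
# placeholder for whatever serving stack is in use, assumed to return the
# number of tokens generated; production tools would use finer-grained or
# driver-level energy counters where available.
import threading
import time

import pynvml

def joules_per_token(run_inference, handle, interval_s: float = 0.1) -> float:
    samples: list[tuple[float, float]] = []   # (timestamp, watts) pairs
    stop = threading.Event()

    def sampler() -> None:
        while not stop.is_set():
            watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
            samples.append((time.time(), watts))
            time.sleep(interval_s)

    thread = threading.Thread(target=sampler)
    thread.start()
    tokens = run_inference()                  # hypothetical serving call
    stop.set()
    thread.join()

    # Trapezoidal integration of the power samples gives energy in joules.
    joules = sum((t1 - t0) * (p0 + p1) / 2.0
                 for (t0, p0), (t1, p1) in zip(samples, samples[1:]))
    return joules / max(tokens, 1)

# Usage (assumes pynvml.nvmlInit() has been called and a device handle obtained):
# print(joules_per_token(my_generate_fn, handle))
```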
What the industry needs right now is clarity from the GPU vendors: descriptions of the trade-offs between power and performance, efficiency benchmarking, and a simple admission that driving these servers at 100% load is not always desirable.
Tools such as the SERT database enable easy comparisons of products from different CPU vendors, helping customers make more informed choices and pick the right chip for the job. If GPUs are to find a home in enterprise data centers, they need to offer the same opportunity to compare hardware on efficiency, not just performance.
AI will be responsible for a rapidly increasing share of data center capacity, and much of this capacity will rely on GPUs from Nvidia and AMD. Like other types of IT resources, GPUs need to be managed effectively to achieve the most efficient operation, but there is little information available on the features designed to conserve GPU power.
CPU power management features, and CPU benchmarking via tools such as the SERT database, provide a blueprint for what is required of enterprise-grade silicon. GPU vendors will have to follow this blueprint or risk losing customers to alternative hardware platforms that have made efficient inference their sole mission.
Thumbnail photograph courtesy of Fritzchens Fritz / https://betterimagesofai.org / https://creativecommons.org/licenses/by/4.0/