The high capital and operating costs of infrastructure for AI mean an outage can have a significant financial impact due to lost training hours.
Compared with most traditional data centers, those hosting large AI training workloads require closer attention to dynamic thermal management, including the ability to handle sudden and substantial load variations.
AI is not a uniform workload: the infrastructure requirements for a particular model depend on a multitude of factors. Systems and silicon designers envision at least three approaches to developing and delivering AI.
Operators and investors are planning to spend hundreds of billions of dollars on supersized sites and vast supporting infrastructures. However, increasing constraints and uncertainties will limit the scale of these build-outs.
AI infrastructure increases rack power, requiring operators to upgrade IT cooling. While some (typically with rack power up to 50 kW) rely on close-coupled air cooling, others with more demanding AI workloads are adopting hybrid air and DLC.
A new wave of GPU-focused cloud providers is offering high-end hardware at prices lower than those charged by hyperscalers. Dedicated infrastructure needs to be highly utilized to outperform these neoclouds on cost.
The US government is applying a new set of rules to control the building of large AI clusters around the world. The application of these rules will be complex.
Hyperscalers design their own servers and silicon to scale colossal server estates effectively. AWS uses a system called Nitro to offload virtualization, networking and storage management from the server processor onto a custom chip.
This summary of the 2025 predictions highlights the growing concerns and opportunities around AI for data centers.
Power and cooling requirements for generative AI training are upending data center design and accelerating liquid cooling adoption. Mainstream business IT will not follow until resiliency and operational concerns are addressed.
Dedicated GPU infrastructure can beat the public cloud on cost. Companies considering purchasing an AI cluster need to consider utilization as the key variable in their calculations.
Uptime Intelligence looks beyond the more obvious trends of 2025 and examines some of the latest developments and challenges shaping the data center industry.
Supersized generative AI models are placing onerous demands on both IT and facilities infrastructure. The challenge for next-generation AI infrastructure will be power, forcing operators to explore new electrification architectures.
Nvidia's dominant position in the AI hardware market may be steering data center design in the wrong direction. This dominance will be harder to sustain as enterprises begin to understand AI and opt for cheaper, simpler hardware.
The cost and complexity of deploying large-scale GPU clusters for generative AI training will drive many enterprises to the public cloud. Most enterprises will use pre-trained foundation models to reduce computational overheads.