UII UPDATE 309 | DECEMBER 2024

Intelligence Update

Most AI models will be trained in the cloud

The rapid rise of generative AI has changed the landscape of AI infrastructure requirements. Training generative AI models, particularly large language models (LLMs), requires massive processing power, typically delivered by GPU server clusters. GPUs are essential to this task because they accelerate the matrix multiplication calculations that underpin the neural network architectures behind generative AI (see How generative AI learns and creates using GPUs).
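
As a rough illustration of that workload (not taken from the report), the sketch below uses PyTorch to run the kind of dense matrix multiplication that dominates transformer training, offloading it to a GPU when one is available. The tensor shapes are arbitrary examples loosely modelled on a single transformer layer.

```python
# Illustrative sketch only: the dense matrix multiplication at the heart of
# transformer training, offloaded to a GPU when one is available.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Example shapes loosely modelled on one transformer feed-forward layer
# (batch of 4 sequences, 1,024 tokens, 4,096-dimensional hidden state).
activations = torch.randn(4, 1024, 4096, device=device)
weights = torch.randn(4096, 11008, device=device)

# A layer's forward pass is essentially a large matrix multiplication;
# training repeats billions of these, which is why GPU throughput matters.
output = activations @ weights
print(output.shape, "computed on", device)
```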

GPU clusters can be difficult to procure, expensive to purchase and complex to implement. Cloud providers offer access to GPU resources and AI development platforms on a pay-as-you-go basis.

The cost and complexity of deploying large-scale GPU clusters for generative AI training will drive many enterprises to the cloud. Most enterprises will use foundation models, pre-trained by third parties, to reduce computational overheads. Cloud services will be used for short-term and infrequent fine-tuning and customization tasks.

Creating and managing large-scale GPU clusters on-premises presents enormous challenges for enterprises. The financial burden alone is substantial: a single Nvidia H100 server can cost hundreds of thousands of dollars, and a functional AI cluster of even a few servers can cost millions. Other factors, such as storage, networking, power, cooling and labor, add significantly to the overall expense. Beyond cost, there are operational complexities: AI clusters require specialized data center infrastructure and teams of highly skilled engineers for maintenance, management and troubleshooting. Additionally, supply chain issues continue to affect the availability of AI hardware, making it difficult for enterprises to acquire and deploy clusters quickly.
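
To make the arithmetic concrete, the back-of-the-envelope sketch below estimates the upfront outlay for a modest 64-GPU cluster. Every figure is an assumption chosen for illustration, not a vendor quote or an Uptime Institute estimate.

```python
# Back-of-the-envelope capital cost of a small on-premises GPU cluster.
# All figures are illustrative assumptions, not vendor quotes.
server_price = 300_000               # assumed cost of one 8-GPU H100-class server (USD)
num_servers = 8                      # a modest 64-GPU training cluster
servers = server_price * num_servers

network_and_storage = 0.30 * servers   # assumed share for fabric, switches and storage
power_and_cooling = 0.20 * servers     # assumed share for facility upgrades

capex = servers + network_and_storage + power_and_cooling
print(f"Estimated upfront cost: ${capex:,.0f}")   # about $3.6M under these assumptions
```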

These challenges make on-premises training of large generative AI models feasible only for the few organizations that can justify the high initial investment and ongoing operational costs. Consequently, many enterprises seek more accessible, scalable and cost-effective ways of supporting their AI training needs. Cloud providers such as Amazon Web Services, Google Cloud and Microsoft Azure offer infrastructure as a service (IaaS) options that enable enterprises to access high-powered GPUs and other advanced AI infrastructure on a pay-as-you-go basis. A new breed of cloud provider, such as CoreWeave, has emerged to deliver large-scale GPU clusters as a service.

Hyperscalers also offer platform as a service (PaaS) options that provide access to AI capabilities without the responsibility of managing the model or the underlying infrastructure. They also offer pre-trained foundation models, reducing the training burden on enterprises.

Balancing cost, flexibility and customization

Given the prohibitive costs and technical complexities of on-premises GPU infrastructure, most generative AI training will be forced to take place in the cloud for the foreseeable future. With their massive, centralized data centers and extensive GPU resources, cloud providers are ideally positioned to fulfill the rising demand for AI training infrastructure. These hyperscalers have made substantial investments in high-performance computing hardware, providing access to cutting-edge GPUs and new AI frameworks without capital expenditure.

By using these cloud services, companies can develop their own AI models from scratch without purchasing, installing or managing hardware themselves. They can also utilize other companies' models for AI capabilities. These cloud services and foundation models will not necessarily be cheap, but — for most buyers — they will be cheaper than dedicated equipment.
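
As a simple illustration of the trade-off (not an Uptime calculation), the sketch below compares pay-as-you-go GPU rental against owned hardware at modest utilization. The hourly rate, utilization and depreciation period are assumptions chosen for illustration only.

```python
# Illustrative break-even sketch: renting cloud GPUs vs owning a server.
# All prices and utilization figures are assumptions, not quoted rates.
owned_server_cost = 300_000      # assumed upfront cost of an 8-GPU server (USD)
owned_opex_per_year = 60_000     # assumed power, cooling and staff cost per year
cloud_rate_per_hour = 70         # assumed on-demand rate for an 8-GPU instance

hours_needed_per_year = 1_500    # periodic fine-tuning runs rather than 24x7 training

cloud_cost_per_year = cloud_rate_per_hour * hours_needed_per_year
owned_cost_per_year = owned_server_cost / 3 + owned_opex_per_year   # three-year depreciation

print(f"Cloud: ${cloud_cost_per_year:,.0f} per year")   # ~$105,000
print(f"Owned: ${owned_cost_per_year:,.0f} per year")   # ~$160,000
# At low utilization the cloud is cheaper; near-continuous use shifts the balance.
```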

Foundation models, which have been pre-trained by software vendors or cloud providers, can further reduce computational overhead. Companies will use these models as the basis for small-scale customization or fine-tuning, ideally suited for the cloud, where training infrastructure can be consumed for short periods without upfront purchase.
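
A minimal sketch of what such small-scale customization can look like in practice is shown below, using the open-source Hugging Face transformers and peft libraries on a rented cloud GPU instance. The model name and hyperparameters are placeholders rather than recommendations.

```python
# Minimal fine-tuning sketch using parameter-efficient LoRA adapters: the kind of
# short, rentable workload well suited to pay-as-you-go cloud GPU instances.
# Model name and hyperparameters are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"   # placeholder foundation model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

# LoRA trains a small set of adapter weights instead of the full model,
# cutting the GPU memory and time needed for the fine-tuning job.
config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, config)
model.print_trainable_parameters()   # typically well under 1% of the base model

# ...train with a standard Trainer loop on domain data, then save only the
# small adapter weights for deployment alongside the unchanged base model.
```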

Inevitably, some organizations will need to compromise on their AI ambitions if using a cloud service. Dedicated infrastructure provides full customization across all levels of the stack, while cloud services and foundation models provide general-purpose capabilities.

Some niche cases, such as government and health care organizations that prioritize security and compliance, or companies that rely heavily on proprietary AI capabilities, will require full customization. In these cases, the perceived risks of using shared infrastructure may outweigh the benefits of the cloud, prompting some organizations to maintain their own GPU clusters. But these cases will be the minority. For most, cloud-based models (whether fine-tuned foundation models or PaaS) will offer a “good enough” solution, balancing capability and cost without requiring massive investments in dedicated hardware.

Hyperscalers lead the way

Most investments in AI infrastructure to support large-scale training will be made by hyperscalers and cloud providers rather than by enterprises.

Training is a non-interactive batch job that demands large-scale, cutting-edge infrastructure. It can be performed without regard to proximity to model end-users. However, inference — the process of using the model in production — needs to be integrated with applications and provide a quick response to end-users. As such, inference will occur near where end-user applications are hosted, whether in the cloud, on-premises data centers, consumer devices or at the edge. Enterprises will continue to require infrastructure to support inference.

Over time, the cost of GPU clusters will likely fall due to more efficient hardware and improved supply. This reduction will change the dynamic slightly, in that more enterprises might consider training models on their own infrastructure. However, cloud providers and hyperscalers will also benefit from these cost reductions, offsetting any cost advantage of dedicated infrastructure.

Hyperscalers will likely continue investing in advanced AI infrastructure for training and inference, such as their own AI application-specific integrated circuits (ASICs), potentially lowering prices over time and improving service offerings. Cloud providers may also expand their portfolio of foundation models and pre-built AI services, making it easier for enterprises to integrate AI capabilities with minimal customization.

The GPU cloud market could consolidate as hyperscalers acquire specialist GPU cloud providers to meet enterprise demands (see What is the outlook for GPU cloud providers?).

Easier access to AI can enable even small and mid-sized enterprises to leverage powerful AI capabilities without needing extensive in-house resources. However, reliance on the cloud also introduces some challenges, particularly around data sovereignty and regulatory compliance. For companies managing highly sensitive information, cloud-based training might require strategies such as data anonymization, which could reduce the quality or specificity of model outcomes.

The Uptime Intelligence View

Few enterprises have the data center infrastructure, server hardware and skills to manage AI training effectively. Furthermore, only some enterprises have the internal demand (or clear return on investment) to justify installing a large-scale GPU cluster. Foundation models and cloud services are not ideal in all situations. However, considering the cost implications of self-build, many enterprises will compromise to deliver a “good enough” capability at a reasonable cost.


About the Author

Dr. Owen Rogers

Dr. Owen Rogers is Uptime Institute’s Sr. Research Director of Cloud Computing. Dr. Rogers has been analyzing the economics of cloud for over a decade as a product manager, a PhD candidate and an industry analyst. Rogers covers all areas of cloud, including economics, sustainability, hybrid infrastructure, quantum computing and edge.