UII UPDATE 494 | MAY 2026
In the early days of generative AI, much attention was focused on training new models from scratch. As the technology matures, organizations are increasingly looking for ways to adapt existing large language models (LLMs) to their specific requirements. Customizing existing models avoids the need to source training data, employ machine learning experts and endure long, expensive training runs.
A recent Uptime Intelligence report explored the options available to organizations that want to either customize existing models or train their own (see How AI training choices affect infrastructure costs). It demonstrates that modern fine-tuning techniques, which further train an LLM on a smaller, specialized dataset to adapt it for specific tasks, consume orders of magnitude fewer IT and facility resources than training a model from scratch, resulting in much lower costs.
The report also highlights that smaller models (approximately 3 billion parameters) can be trained in a few days using a single 10 kW system that fits comfortably into almost any data center configuration. Based on the findings, we can make three further observations.
1. The business case for training your own models is becoming weaker
Organizations that decide to train their own generative AI models need to be confident that the results outperform a growing number of open-source and open-weights models. These models can be deployed on-premises and fine-tuned for specific tasks and environments; some, such as those from Meta and DeepSeek, compete with proprietary models on performance and accuracy benchmarks.
Building brand-new LLMs requires not just infrastructure but also highly sought-after developer skills. An enterprise-grade LLM requires machine learning engineers, data engineers, and specialists in security, compliance and quality assurance. Developer salaries were not included in the report's analysis and would further increase the cost of training a full model.
Fine-tuning open, freely distributed models is the most cost-effective way to arrive at a specialized LLM. This is reflected in data center software: over the past three months, Uptime Intelligence has spoken to four vendors developing LLMs for data center operations. All four chose to build on top of existing open-weights models rather than develop their own (see AI in facility operations: three applications to watch).
2. Maintaining high utilization levels for inference hardware is hard. Maintaining high utilization levels for training hardware is harder
When training, a huge block of capacity is used as a single system and the number of GPUs is often fixed. If infrastructure is used exclusively for training, it is either fully engaged or completely idle. Demand is "lumpy", with developers seeking all-or-nothing capacity. As a result, any interruption in the flow of work drives average utilization down and unit costs up significantly (see The operational cost of AI training failures).
Inference demand is generally steadier and more continuous. Utilization may vary over time depending on the input queries, output responses and concurrency levels, but in most cases the hardware is performing at least some valuable work at any given time. As a result, with effective peak capacity planning, the average utilization of the hardware can be pushed higher, thereby reducing unit costs.
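The inverse relationship between utilization and unit cost is easy to make concrete. The sketch below uses purely hypothetical figures (the hourly cost and utilization levels are illustrative assumptions, not drawn from the report) and assumes the all-in cost of owning and powering a GPU system is broadly fixed regardless of load.

```python
# Back-of-the-envelope: unit cost of useful GPU work vs. average utilization.
# All figures are hypothetical and for illustration only.

HOURLY_COST = 40.0  # assumed all-in cost per system-hour (amortized capex + opex)

def unit_cost(avg_utilization: float) -> float:
    """Cost per hour of useful work: fixed hourly cost spread over busy hours only."""
    return HOURLY_COST / avg_utilization

# "Lumpy" training demand vs. steadier inference demand (assumed levels)
for label, util in [("training (bursty)", 0.40), ("inference (steady)", 0.85)]:
    print(f"{label}: {util:.0%} utilization -> ${unit_cost(util):.2f} per useful hour")
```

At these assumed levels, the same hardware delivers useful work at roughly half the unit cost when it is kept steadily busy.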
Using the same infrastructure for training and inference can help increase overall utilization, but it increases management and capacity planning complexity considerably. Mixing training and inference workloads on the same systems is justified in two cases: when the operator has sufficient scale to absorb the shifts between training and inference workloads (this includes hyperscalers, neoclouds and the world's largest enterprises); or when the requirements of the models are so modest that an organization can afford to overprovision capacity.
3. Parameter-efficient fine-tuning is entering the spotlight
Technologies that allow organizations to customize models at a low cost are seeing broad and rapid adoption. Retrieval-augmented generation (RAG), a technique that enables LLMs to incorporate information that was not available during training into their output, was introduced in 2020 and has become standard across enterprise deployments.
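A minimal sketch of the retrieval step behind RAG, assuming a toy document store and a bag-of-words similarity in place of a real embedding model (the documents, query and helper names are all hypothetical):

```python
from collections import Counter
import math

# Hypothetical document store of facility records not seen during training
documents = [
    "Chiller 3 was serviced in April and its setpoint raised to 19C.",
    "UPS B feeds rack rows 10-14 in the east data hall.",
    "The facility design PUE is 1.35 at full IT load.",
]

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words term count."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

query = "What is the design PUE of the facility?"
context = "\n".join(retrieve(query))
# The retrieved context is prepended to the prompt sent to the LLM,
# letting the model ground its answer in information it was never trained on.
prompt = f"Context:\n{context}\n\nQuestion: {query}"
print(prompt)
```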
The same is likely to eventually happen to a family of parameter-efficient fine-tuning (PEFT) techniques, such as low-rank adaptation (LoRA). Rather than modifying the billions of existing weights in the model, LoRA freezes them and inserts small, trainable low-rank matrices into certain layers of the neural network to adjust the model's behavior. Only a fraction (typically 0.1% to 1%) of the parameters are updated. This means lower computational cost, memory usage and training time compared with conventional fine-tuning, and only a fraction of the resources needed to train a new model from scratch.
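The mechanism can be shown in a few lines. The sketch below applies a LoRA-style update to a single weight matrix; the dimensions, rank and scaling factor are illustrative assumptions, not taken from any particular model:

```python
import numpy as np

d_out, d_in, rank = 4096, 4096, 8   # illustrative shapes; rank r is much smaller than d

W = np.random.randn(d_out, d_in)          # pretrained weight: frozen, never updated
A = np.random.randn(rank, d_in) * 0.01    # trainable low-rank factor
B = np.zeros((d_out, rank))               # trainable; zero-init so behavior is unchanged at start
alpha = 16.0                              # scaling hyperparameter (assumed value)

def forward(x: np.ndarray) -> np.ndarray:
    # Effective weight is W + (alpha/rank) * B @ A, but applying the
    # low-rank update as two small matrix products is far cheaper.
    return W @ x + (alpha / rank) * (B @ (A @ x))

# Only A and B receive gradient updates during fine-tuning.
trainable = A.size + B.size
total = W.size + trainable
print(f"trainable parameters: {trainable:,} of {total:,} ({trainable / total:.2%})")
```

With these assumed shapes, the trainable adapter amounts to roughly 0.4% of the total parameter count, in line with the 0.1% to 1% range cited above.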
This type of re-training will be a common application for cloud and neocloud training infrastructure: it tasks GPU-based systems to work at full capacity, but infrequently and for relatively short periods. Expect LoRA and other PEFT techniques to gain more attention in the coming years as the cost and energy efficiency of AI become more pressing topics.