UII UPDATE 370 | MAY 2024

Intelligence Update

Cloud AI needs cost discipline now

The ability to scale on demand is a significant advantage of the public cloud over on-premises infrastructure. Cloud-native applications can grow or shrink in line with changing requirements, ensuring better service for end users. To enable this scalability, customers are billed in arrears based on the resources used. In contrast, on-premises infrastructure is typically purchased upfront, and applications hosted on-premises usually remain static, unable to consume more resources than the underlying servers provide.

The ability to access massive scale at small incremental cost is one of the reasons why Uptime Intelligence expects cloud infrastructure to be the main venue for AI training workloads (see Most AI models will be trained in the cloud).

The pay-as-you-go model brings both flexibility and risk. Customers are charged for all resources used during a given period, including those used accidentally. It is the customer’s responsibility to identify and terminate resources that are no longer needed. The term bill shock — originally used to describe a higher-than-expected cell phone bill — now also applies to monthly cloud computing invoices.

Organizations increasingly use frameworks such as technology business management (TBM, see Why technology business management does more than FinOps) and FinOps (see FinOps gives hope to those struggling with cloud costs) to control and manage cloud spend, aiming to balance scalability against cost control. With AI projects, the need to control costs is more critical than ever. If a cloud estate is not managed effectively, expenses can quickly exceed those of on-premises infrastructure — often while delivering the same services. More worryingly, runaway costs can strain budgets and undermine the profitability of AI projects, eroding their commercial viability.

AI is all about scale

AI workloads have three characteristics that make unexpected costs more likely and more impactful than for non-AI workloads:

1. GPU infrastructure is significantly more expensive than traditional infrastructure

Consider a scenario where a developer using Google Cloud accidentally leaves a virtual machine (instance) running for a week, something that can easily happen if they forget that the instance was started.

In a non-AI application designed to scale in small increments using shared-core instances, the wasted cost might be as low as $1.40, an amount unlikely to affect the business or its objectives. However, if the application requires an Nvidia H100 GPU for AI training or inference, the same week of unused capacity costs $1,848, even when the smallest available H100 instance is selected.
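
The arithmetic behind these figures is straightforward, as the short calculation below illustrates. The hourly rates used are assumptions broadly in line with on-demand list prices at the time of writing; actual prices vary by region and change frequently.

```python
# Illustrative calculation only: hourly rates are assumptions broadly in line
# with on-demand list prices; actual prices vary by region and change often.
HOURS_PER_WEEK = 24 * 7  # 168 hours left running by mistake

shared_core_rate = 0.0084  # assumed $/hour for a small shared-core instance
h100_rate = 11.00          # assumed $/hour for the smallest instance with one H100 GPU

print(f"Shared-core instance, one week: ${shared_core_rate * HOURS_PER_WEEK:.2f}")  # roughly $1.40
print(f"H100 instance, one week:        ${h100_rate * HOURS_PER_WEEK:,.2f}")        # $1,848.00
```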

Because the costs of AI infrastructure are so much higher, the financial impact of a provisioning error is correspondingly much greater. There is a marked difference between the prices of AI and non-AI compute resources across all cloud providers.

2. AI workloads require substantial scalability

Virtual machines are relatively easy to control: an organization must provision them explicitly before they can be used. However, some costs scale automatically in line with user or application demand, such as data transfer or storage platform capacity.

Imagine an AI training workload that uses several instances. It can only consume the compute resources available in those provisioned instances. The same rules do not apply to storage: with object storage, employees can upload as much data as they need without having to provision capacity first.

Some models require huge datasets for training. Teams training such models may upload large volumes of data to the cloud without fully considering the repercussions. The customer is still responsible for the resulting storage bill, even though employees have not explicitly created any resources or services; they have simply added more data. Without proper controls, this data may be forgotten once uploaded and left unused, continuing to incur costs over time.

Storing a terabyte of data to train an AI model on Microsoft Azure’s premium blob storage costs around $150 per month, a figure comparable with its competitors’ prices. This is not a significant outlay on its own, but the compounding effect of forgotten datasets adds up rapidly. For example, if a terabyte of training data is added each month and never deleted once used, the cumulative cost over the year approaches $12,000.
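
The compounding effect is easy to model. Below is a minimal sketch, assuming a flat $150 per terabyte per month and one new terabyte of forgotten training data added each month.

```python
# Sketch of the compounding cost of forgotten training data, assuming a flat
# $150 per terabyte per month and one new terabyte uploaded each month.
PRICE_PER_TB_MONTH = 150

stored_tb = 0
cumulative_cost = 0
for month in range(1, 13):
    stored_tb += 1                                 # another terabyte is uploaded and never deleted
    cumulative_cost += stored_tb * PRICE_PER_TB_MONTH

print(f"Data stored by December: {stored_tb} TB")
print(f"December bill: ${stored_tb * PRICE_PER_TB_MONTH:,}")  # $1,800 per month and still rising
print(f"Cumulative cost for the year: ${cumulative_cost:,}")  # $11,700
```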

Non-AI applications may also require large datasets. The difference with AI is that these datasets often need long-term storage for retraining and fine-tuning, and there is an ongoing trend toward larger models that require more data.

3. AI workloads use a broad range of capabilities beyond GPUs

Cloud providers meter and charge for many elements of a cloud service. With AI workloads, a vast range of non-obvious pricing items needs to be considered during training and inference, alongside virtual machines and GPUs.

For example, AWS SageMaker is an AI development platform. As well as the costs of the virtual machines used for inference and training, there are charges for the development platform itself, including the virtual machine that hosts the development environment, tools for data management and debugging, and use of Jupyter notebooks (a popular tool for computing experiments). Cloud customers may also use services such as storage, data transfer, databases and serverless computing for their AI workloads, each with its own set of billing metrics. In a single cloud region, AWS SageMaker alone has a catalog of 1,800 line items (or SKUs, stock-keeping units) for sale. These items are not always obvious when using the platform, not because they are intentionally obscured, but because a developer may not consider them in day-to-day use.
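
One way to appreciate the scale of this catalog is to query the AWS Price List API directly. The sketch below counts the SageMaker line items published for a single region; it assumes boto3 credentials are configured and that the regionCode filter field is available for this service (the human-readable location field is an alternative).

```python
import boto3

# Sketch: count the SageMaker pricing line items (SKUs) published for one region
# via the AWS Price List API. Assumes boto3 credentials are configured; the
# Price List API is served from a limited set of endpoints, us-east-1 among them.
pricing = boto3.client("pricing", region_name="us-east-1")

sku_count = 0
for page in pricing.get_paginator("get_products").paginate(
    ServiceCode="AmazonSageMaker",
    Filters=[{"Type": "TERM_MATCH", "Field": "regionCode", "Value": "us-east-1"}],
):
    sku_count += len(page["PriceList"])

print(f"SageMaker line items in us-east-1: {sku_count}")
```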

Many cloud applications use a wide range of cloud services, but AI applications tend to draw on an even broader range, leading to larger bills, more metrics to track and greater complexity.

Act now, and plan ahead

Most cloud providers offer tools to help manage, optimize and monitor spend, typically for free. These tools can detect usage anomalies, identify unused resources and send alerts when thresholds or forecasts are breached. However, they need to be configured before AI expenditure gets out of hand: retrospectively identifying resources and data that were never tagged or monitored is difficult, so setting up these tools in advance is essential.
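
As an example of the kind of guardrail that can be put in place ahead of time, the sketch below creates a monthly cost budget with a forecast-based alert using the AWS Budgets API via boto3. The account ID, budget amount, threshold and e-mail address are placeholders; equivalent budget and alerting features exist on the other major clouds.

```python
import boto3

# Sketch: a monthly cost budget that alerts when forecasted spend exceeds
# 80% of the limit. Account ID, amount, threshold and address are placeholders.
budgets = boto3.client("budgets", region_name="us-east-1")

budgets.create_budget(
    AccountId="123456789012",
    Budget={
        "BudgetName": "ai-training-monthly",
        "BudgetLimit": {"Amount": "50000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "FORECASTED",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "finops@example.com"}
            ],
        }
    ],
)
```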

Organizations concerned about overspending should consider using these free tools to monitor usage, relate costs to departments, and plan ahead by purchasing cheaper resources, such as reserved or committed-use capacity, where possible.
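
Relating costs to departments typically relies on consistent tagging. The sketch below shows how spend might be broken down by a hypothetical department tag using the AWS Cost Explorer API; the tag must already be activated as a cost allocation tag, and the tag key and dates are placeholders.

```python
import boto3

# Sketch: break one month's spend down by a hypothetical "department" cost
# allocation tag via the Cost Explorer API. The tag must already be activated
# for cost allocation; the tag key and dates are placeholders.
ce = boto3.client("ce", region_name="us-east-1")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-04-01", "End": "2024-05-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "department"}],
)

for group in response["ResultsByTime"][0]["Groups"]:
    tag_value = group["Keys"][0]                                # e.g. "department$research"
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{tag_value}: ${amount:,.2f}")
```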

Third-party tools — such as Apptio (IBM Cloudability), VMware (Tanzu CloudHealth), Infracost and New Relic — can provide an independent view of cloud cost management. FinOps and TBM frameworks can help manage and optimize costs continuously and better relate those costs to revenue.

In the long term, organizations building AI models must grapple with a simple question: what is the return on investment in training? If more training results in more revenue, then greater expenditure is unlikely to be an issue. What is unclear today is whether investing in more training will help build a more desirable — and monetizable — product.

The Uptime Intelligence View

The risk of runaway costs has always been inherent in the public cloud, due to its pay-as-you-go pricing model. Over the past decade, a wide range of tools and frameworks have emerged to help prevent overspending and avoid bill shock. Organizations aiming to control costs should leverage these capabilities to manage cloud expenditure — regardless of how extensively AI is being adopted. However, the highly scalable nature of AI means that failing to control cloud costs now carries far greater financial consequences than with traditional workloads.

About the Author

Owen Rogers

Dr. Owen Rogers is Uptime Institute’s Senior Research Director of Cloud Computing. Dr. Rogers has been analyzing the economics of cloud for over a decade as a chartered engineer, product manager and industry analyst. Rogers covers all areas of cloud, including AI, FinOps, sustainability, hybrid infrastructure and quantum computing.
