The case for resiliency in AI inferencing is clear. Inferencing, in which an end user queries an AI model to obtain predictions or classifications, is typically embedded in an interactive application. End users expect a quick response to their queries, and any delay caused by poor performance or an outage can degrade the user experience and, ultimately, productivity, revenue and reputation.
However, AI training, which teaches the model to perform inference, is a batch process involving little or no interaction with end users. If it fails, end users are unlikely to notice. So, what is the case for building resiliency into AI training applications? This update explains how AI training failures can drive up operational costs, and how architecting resiliency into data centers, servers and software can help prevent these failures.