UII UPDATE 430 | NOVEMBER 2025

Intelligence Update

What the Azure outage revealed about internet fragility

On October 29, 2025, Microsoft Azure experienced a significant global service disruption triggered by a configuration change in its edge-delivery network fabric. The incident began at approximately 16:00 UTC, when the company's global content delivery network (CDN) service, Azure Front Door (AFD), began rejecting or timing out requests across its international fleet.

A modern CDN is a globally distributed network layer positioned between users and application backends. Instead of sending traffic across multiple independent internet networks, it terminates user connections at geographically dispersed points of presence and forwards them across a single private backbone under unified control. This enables consistent routing behavior and can improve performance, availability and security. Static and semi-static content may be cached at the edge to reduce latency and offload origin infrastructure. Advanced CDNs also typically incorporate distributed denial of service (DDoS) attack protection, application-layer firewalling and other security controls.
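
To make this model concrete, the sketch below shows, in simplified and purely illustrative Python (not Microsoft's implementation), how an edge point of presence terminates a client request, serves cacheable content from a local cache and forwards cache misses to the origin. The PoP name and origin function are hypothetical.

```python
# Minimal sketch of the edge-delivery model described above (illustrative only,
# not Microsoft's implementation). An edge point of presence (PoP) terminates
# the client connection, serves cacheable content locally and forwards cache
# misses across the (simulated) private backbone to the origin.
import time


class EdgePop:
    def __init__(self, name, origin, ttl_seconds=60):
        self.name = name
        self.origin = origin          # callable standing in for the backbone hop to the origin
        self.ttl = ttl_seconds
        self.cache = {}               # path -> (response, expiry timestamp)

    def handle(self, path):
        cached = self.cache.get(path)
        if cached and cached[1] > time.time():
            return f"{self.name}: cache hit -> {cached[0]}"
        # Cache miss: fetch from the origin and store the response at the edge.
        response = self.origin(path)
        self.cache[path] = (response, time.time() + self.ttl)
        return f"{self.name}: forwarded to origin -> {response}"


def origin_app(path):
    # Stand-in for the customer's backend application.
    return f"200 OK for {path}"


pop = EdgePop("edge-ams01", origin_app)    # hypothetical PoP name
print(pop.handle("/static/logo.png"))      # miss: forwarded to the origin
print(pop.handle("/static/logo.png"))      # hit: served from the edge cache
```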

AFD implements this edge-delivery model using a network of approximately 190 edge points of presence across more than 100 metropolitan locations, which serve as the first termination point for client sessions. Requests are processed at these edge sites before being forwarded across Microsoft's private network backbone to the configured origin. AFD’s control plane distributes configuration and health-check policies to the edge fleet, enabling traffic steering and failover decisions. The service depends on consistent global configuration propagation and coordinated routing behavior.

It was the change in this behavior that caused the Azure outage.

The front door shut

Official status updates from Microsoft indicate an “inadvertent configuration change” as the root cause. The deployment of a flawed configuration to AFD's control plane caused individual edge nodes worldwide to fail their health checks, refuse connections or return errors, despite the infrastructure's geographic redundancy. As these servers went offline, those remaining in service had to absorb the increased load, exacerbating the issue.
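
The failure mode can be illustrated with a highly simplified model. The sketch below is an assumption-laden illustration rather than Microsoft's actual mechanics: any edge node that cannot apply the new configuration fails its health check and is withdrawn from service, so the surviving nodes must absorb the entire load, or, in the worst case, no capacity remains at all.

```python
# Highly simplified model of the failure mode (assumptions only, not Microsoft's
# actual mechanics): edge nodes that cannot apply a configuration fail their
# health checks and are withdrawn, so surviving nodes absorb the entire load.

def healthy(node_config):
    # A node passes its health check only if it can apply its configuration.
    return node_config.get("valid", False)


def distribute(nodes, total_requests):
    in_service = [name for name, cfg in nodes.items() if healthy(cfg)]
    if not in_service:
        return "no edge capacity remaining: all requests fail"
    per_node = total_requests // len(in_service)
    return {name: per_node for name in in_service}


edge_nodes = {f"edge-{i:03d}": {"valid": True} for i in range(10)}
print("Before the bad deployment:", distribute(edge_nodes, 1_000_000))

# A flawed configuration propagates to most of the fleet...
for name in list(edge_nodes)[:9]:
    edge_nodes[name] = {"valid": False}

# ...and the one remaining node must now carry ten times its usual load.
print("After the bad deployment: ", distribute(edge_nodes, 1_000_000))
```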

Because many of Microsoft's end-user and enterprise services relied on AFD for ingress and routing, the failure propagated rapidly: products such as Microsoft 365, Minecraft and Xbox Live — as well as large corporate customers such as Starbucks, Alaska Airlines and Vodafone — reported errors, outages or degraded performance. In response, Microsoft deployed a rollback to the “last known good configuration”, blocked all further updates to AFD during remediation and gradually re-routed traffic away from the affected nodes.
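
The "last known good configuration" rollback Microsoft describes is a common control-plane pattern. The sketch below illustrates the general idea under assumed names and a stand-in validation rule (it is not AFD's actual schema or process): keep the previously validated configuration, validate a candidate before activating it and revert if problems appear.

```python
# Sketch of a "last known good" configuration guard (an illustrative pattern;
# the field names and validation rule are assumptions, not AFD's schema).

class ConfigStore:
    def __init__(self, initial):
        self.active = initial
        self.last_known_good = initial

    def validate(self, candidate):
        # Stand-in check: a real control plane would validate the schema,
        # referenced origins, certificates, routing rules and more.
        return isinstance(candidate, dict) and bool(candidate.get("routes"))

    def deploy(self, candidate):
        if not self.validate(candidate):
            # Reject before propagation and keep serving the current config.
            return f"rejected candidate; still serving {self.active['version']}"
        self.last_known_good = self.active
        self.active = candidate
        return f"activated {candidate['version']}"

    def rollback(self):
        # Revert to the previously validated configuration.
        self.active = self.last_known_good
        return f"rolled back to {self.active['version']}"


store = ConfigStore({"version": "v41", "routes": ["/api", "/static"]})
print(store.deploy({"version": "v42"}))                      # no routes: rejected
print(store.deploy({"version": "v43", "routes": ["/api"]}))  # passes validation
print(store.rollback())                                      # back to v41
```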

The outage affected only AFD, with other cloud services remaining operational. Unfortunately, AFD was the entry point to those services. The applications were running, but users whose connections were routed through AFD could not reach them.

In practice, only a small subset of Azure's total customers was likely to be affected — only those using AFD. Most customers that use AFD also consume Azure’s broader public cloud services. However, the outage demonstrated how networks and supporting edge services are often a single point of failure. Organizations using cloud are exposed to this risk, but so are those with on-premises IT deployments that rely on third-party network operators, CDNs and Domain Name System (DNS) providers to connect users to applications.

Who is responsible?

A week before Azure’s incident, AWS suffered an outage across multiple services in the us-east-1 (North Virginia, US) region, due to an internal DNS error (see AWS outage: what are the lessons for enterprises?). Organizations that had deployed their applications across multiple availability zones within us-east-1 were not protected; only those deployed across multiple regions were unaffected by the outage.

In that case, AWS failed to meet its availability service level agreement (SLA), frustrating its customers. However, enterprises shared responsibility for their own downtime — they had chosen not to implement dual-region architectures, despite cloud providers being open about the fact that outages will occur from time to time.

In the case of the recent Azure outage, Microsoft bears the brunt of the blame for two reasons:

  • In PaaS, the provider wholly manages resiliency. In the case of AFD, Azure is explicitly responsible for resiliency — AFD is categorized as platform as a service (PaaS), not infrastructure as a service (IaaS). As a result, the customer has no control of or visibility into AFD's underlying infrastructure. The customer outsources management of the hardware and software, as well as resiliency, to Microsoft as part of the service.
    This platform model contrasts with IaaS, where the customer pays for access to infrastructure resources: they choose how to use that infrastructure to build resilient applications. Table 1 shows the differences in responsibilities between IaaS and PaaS models.

 

Table 1 Resiliency responsibilities: IaaS and PaaS


  • Network control planes must span the globe. The second difference is that an edge-gateway service is, by definition, global. An IaaS cloud compute service is typically isolated within a specific region. If a region goes down, other regions should not be affected — in principle, at least. In normal circumstances, an application could operate effectively in a single availability zone.
    For an edge-gateway service to operate correctly, however, it must work across regions. The service is embedded in the global network itself. As a global service, the control plane must coordinate routing across all nodes, regardless of location. As a result, a control plane issue is likely to affect all customers using that service, in all parts of the world. Given the global blast radius of a control plane failure, Microsoft should have implemented additional safeguards to reduce the likelihood and impact of such an outage. Table 2 shows the differences between global and region-based services, and the sketch after the table illustrates one such safeguard.

 

Table 2 Global and regional service comparison

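One generic safeguard, sketched below in illustrative Python (the wave order, health gate and configuration fields are assumptions, not Microsoft's process), is a staged rollout: a configuration change propagates wave by wave, starting with a small canary slice, and a health gate between waves halts propagation before the change reaches the whole fleet.

```python
# Generic sketch of a staged, region-by-region rollout with a health gate
# between waves, one way to bound the blast radius of a global control plane.
# The wave order, gate and config fields are assumptions, not Microsoft's process.

WAVES = ["canary", "europe", "asia", "americas"]   # illustrative rollout order


def apply_config(region, config):
    # Stand-in for pushing the configuration to that region's edge nodes.
    print(f"applying {config['version']} to {region}")


def region_healthy(region, config):
    # Stand-in for post-deployment probes (error rates, health checks, latency).
    return config.get("valid", True)


def staged_rollout(config):
    for region in WAVES:
        apply_config(region, config)
        if not region_healthy(region, config):
            # A real system would also roll the affected region back here.
            print(f"health gate failed in {region}; halting rollout")
            return False
    print("rollout complete in all regions")
    return True


staged_rollout({"version": "v44", "valid": True})    # progresses through every wave
staged_rollout({"version": "v45", "valid": False})   # stops at the canary wave
```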

Could it have been mitigated?

Organizations could not easily have mitigated this outage while using a single cloud provider. If AFD is the single point of entry for all users and AFD goes down, no user can reach the resources behind it.

The most robust way to mitigate this risk is to use multiple gateway services, for example, AFD alongside competing products from Akamai or Cloudflare. In the event of a control-plane failure at one provider, the others should remain operational. A failover option could work but may take time to activate. An active-active option would require significant design effort to ensure services from different providers can interoperate and exchange status and data. Regardless of the method, a multi-gateway approach adds complexity and cost.
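
As an illustration of the failover option, the sketch below probes each gateway's health endpoint in priority order and repoints public DNS at the first healthy one. The endpoints are hypothetical placeholders and the DNS update is a stub; a real deployment would call the authoritative DNS provider's API and keep the record TTL short so the change takes effect quickly.

```python
# Simplified failover between two edge-gateway providers (hypothetical
# endpoints; the DNS update is a stub rather than a real provider API call).
import urllib.request

GATEWAYS = [
    {"name": "primary-afd",       "probe": "https://primary.example.com/healthz"},
    {"name": "secondary-gateway", "probe": "https://secondary.example.com/healthz"},
]


def is_healthy(url, timeout=3):
    # Probe the gateway's health endpoint; any error counts as unhealthy.
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False


def update_dns(gateway_name):
    # Placeholder: a real deployment would call the authoritative DNS
    # provider's API, keeping the record TTL short so failover takes effect quickly.
    print(f"routing user traffic to {gateway_name}")


def failover_check():
    for gateway in GATEWAYS:
        if is_healthy(gateway["probe"]):
            update_dns(gateway["name"])
            return gateway["name"]
    print("no healthy gateway found")
    return None


failover_check()
```

An active-active variant would keep both gateways in service at all times, avoiding the failover delay but requiring the interoperability work described above.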

Implementing a multi-gateway approach is pointless unless other shared network services are duplicated too. For instance, DNS, which maps domain names to IP addresses, needs to be able to redirect traffic during an outage. If DNS fails, users may not be routed to the alternative gateway. Again, this increases cost and complexity.

One option is to avoid the cloud entirely, including cloud-based network services such as gateways and DNS, but this is extremely costly and complex. To deliver similar performance, the organization would have to install infrastructure close to every market it serves. It would need to manage this infrastructure across many jurisdictions, plan capacity to meet spikes in demand and maintain redundant network paths between sites. Even with just a few gateway points of presence and redundant connections, this would be a costly undertaking.

Few organizations would build their own network infrastructure between locations. Instead, the vast majority would opt to use capacity on third-party networks, but reliance on these networks introduces new risks. Third-party networks may also have single points of failure that are not obvious to the organizations using them. The network infrastructure may be resilient, but the network control plane, infrastructure firmware updates and routing dependencies may all be centralized and, therefore, vulnerable to failure.

Some organizations with global mission-critical requirements may choose to architect their own global network with supporting services, albeit limited in scope compared with a hyperscaler's. It would be impossible for any non-hyperscaler to deliver a network as broad and with as many points of presence as Microsoft's.

Many organizations may decide that these shared network services are an acceptable risk considering the cost of the alternative, even if they cannot tolerate using a public cloud for hosting their applications.

Concentration risk extends beyond cloud

Relying on hyperscalers presents a substantial concentration risk. One 2024 estimate has Amazon Web Services (AWS), Microsoft Azure and Google Cloud collectively holding 68% of the global cloud infrastructure services market. An outage of a public cloud provider’s entire infrastructure would have extensive repercussions globally for many organizations, industries and governments. So far, there has never been such a complete outage, although some services have experienced global issues (as in this case).

Avoiding the cloud clearly avoids the risk associated with a public cloud outage, but there are still risks in using shared infrastructure — even when most of the application is hosted in private facilities. Avoiding hyperscaler cloud infrastructure for application development does not negate reliance on hyperscalers, telecoms and other providers for network-based services.

Market share information on CDN services is hard to find, especially since CDN revenue is likely to be buried within hyperscalers' overall revenue. However, one estimate attributes 71% of the CDN market to six providers, including AWS, Akamai and Cloudflare. Google’s DNS resolver handles an estimated 15-30% of all the internet’s DNS requests. A global outage of any of these entities would have huge repercussions — perhaps greater than those of a hyperscaler cloud outage, since both public cloud and on-premises deployments may rely on such network capabilities.

Uptime will continue to investigate these risks in greater depth in future reports.

The Uptime Intelligence View

Beyond cloud and on-premises infrastructure lies a vast, complex fabric of network infrastructure and services. Many organizations operate their IT estates without involving public cloud providers — but far fewer can reach their users without relying on these shared networks. Hyperscaler networks, CDNs and DNS present a concentration risk. As Azure’s recent outage demonstrates, a single failure of a global network service can affect a significant number of users worldwide, resulting in substantial financial repercussions. Organizations need to look beyond their data centers and cloud estates and ensure they have plans in place for the loss of network infrastructure that is often out of sight and out of mind.

About the Author

Dr. Owen Rogers

Dr. Owen Rogers is Uptime Institute’s Senior Research Director of Cloud Computing. Dr. Rogers has been analyzing the economics of cloud for over a decade as a chartered engineer, product manager and industry analyst. Rogers covers all areas of cloud, including AI, FinOps, sustainability, hybrid infrastructure and quantum computing.
