Since we’re talking about Azure Front Door (AFD), it’s worth addressing the outage on October 29.

For those unfamiliar with AFD: it is a global Layer 7 (HTTP/HTTPS) load balancer that runs on points of presence (PoPs) around the world.

Here’s how it works:

  • Clients connect to the nearest PoP.

  • Split TCP terminates the client’s TCP and TLS connection at that PoP, so the handshake round trips stay short and responses start sooner.

  • AFD then fetches content from a healthy backend.

  • It also supports caching and Web Application Firewall (WAF) integration.
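
The backend-selection step above can be sketched in a few lines of Python. This is a simplified model rather than AFD's actual routing logic: it assumes each backend exposes a single health flag and a measured latency, and it ignores priority and weight settings. The `Backend` class, the `pick_backend` function, and the region names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Backend:
    """Hypothetical view of one backend as seen from a PoP."""
    name: str
    latency_ms: float
    healthy: bool


def pick_backend(backends: List[Backend]) -> Optional[Backend]:
    """Return the lowest-latency healthy backend, or None if none is healthy."""
    candidates = [b for b in backends if b.healthy]
    return min(candidates, key=lambda b: b.latency_ms, default=None)


# Illustrative data: the closest backend is unhealthy, so it is skipped.
backends = [
    Backend("westeurope", 12.0, False),
    Backend("northeurope", 18.0, True),
    Backend("eastus", 85.0, True),
]

chosen = pick_backend(backends)
print(chosen.name)  # northeurope
```

Note that the unhealthy backend is skipped even though it has the lowest latency, which is the behavior the list describes.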

It’s used widely by Microsoft first-party services (like Office 365, Xbox, Entra) and third-party services.

Root Cause and Resolution

A tenant configuration change inadvertently introduced an invalid configuration state, impacting a significant number of nodes and making them unhealthy.

This reduced capacity to serve requests. The remediation involved rolling back to the last known good configuration, which took several hours to complete.

Microsoft has since identified and remediated the software defect that allowed the invalid configuration to bypass safety checks, which should prevent this failure mode from recurring.
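
To make the "last known good" idea concrete, here is a minimal sketch of a configuration store that validates a change before activating it and falls back when a post-deployment health check fails. It is not Microsoft's deployment pipeline. The `ConfigStore` class, the `validate` rule (a non-empty list of paths starting with "/"), and the `is_healthy` callback are all assumptions made for illustration.

```python
from typing import Callable, Dict


def validate(config: Dict) -> bool:
    """Hypothetical safety check: require a non-empty list of '/'-prefixed routes."""
    routes = config.get("routes")
    return isinstance(routes, list) and len(routes) > 0 and all(
        isinstance(r, str) and r.startswith("/") for r in routes
    )


class ConfigStore:
    """Tracks the active configuration and the last one known to be good."""

    def __init__(self, initial: Dict):
        if not validate(initial):
            raise ValueError("initial configuration is invalid")
        self.active = initial
        self.last_known_good = initial

    def deploy(self, candidate: Dict, is_healthy: Callable[[Dict], bool]) -> str:
        # Reject invalid configurations before they reach any node.
        if not validate(candidate):
            return "rejected"
        self.active = candidate
        # If nodes turn unhealthy after activation, restore the previous state.
        if not is_healthy(candidate):
            self.active = self.last_known_good
            return "rolled_back"
        self.last_known_good = candidate
        return "promoted"


store = ConfigStore({"routes": ["/"]})
print(store.deploy({"routes": []}, lambda c: True))            # rejected
print(store.deploy({"routes": ["/api"]}, lambda c: False))     # rolled_back
print(store.deploy({"routes": ["/", "/api"]}, lambda c: True)) # promoted
```

The two failure paths correspond to the two layers in the incident description: a validation gate that should stop a bad change, and a rollback that limits the damage when the gate misses it.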

Customer Perspective: Mitigation Options

From an architecture perspective, what can you do to self-mitigate?

  • Azure Traffic Manager, which also balances traffic globally, is a possible backup for Azure Front Door.
    However, note:

    • Traffic Manager is DNS-based.

    • It lacks AFD’s WAF, caching, and split TCP features.

    • It can only direct clients to publicly reachable endpoints, so it cannot serve private endpoints or non-public services.

So, while not equivalent, Traffic Manager could act as a “break glass” backup option in emergencies.
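
A "break glass" switch needs a clear rule for when to flip. The sketch below assumes you probe the AFD endpoint yourself and fail over only after several consecutive failed probes, so that a single blip does not trigger a DNS change. The `choose_dns_target` function, the threshold of three, and both hostnames are hypothetical; the actual DNS update (for example, repointing a CNAME) is deliberately left out.

```python
from typing import Iterable

# Hypothetical hostnames, used only for illustration.
PRIMARY = "contoso.azurefd.net"
BACKUP = "contoso.trafficmanager.net"


def choose_dns_target(
    probe_results: Iterable[bool],
    primary: str = PRIMARY,
    backup: str = BACKUP,
    failure_threshold: int = 3,
) -> str:
    """Return the hostname the public CNAME should point to.

    probe_results is the probe history, oldest first; True means the
    primary answered successfully. The backup is chosen only when the most
    recent failure_threshold probes all failed.
    """
    consecutive_failures = 0
    for ok in probe_results:
        consecutive_failures = 0 if ok else consecutive_failures + 1
    return backup if consecutive_failures >= failure_threshold else primary


print(choose_dns_target([True, False, False, False]))  # contoso.trafficmanager.net
print(choose_dns_target([False, False, True]))         # contoso.azurefd.net
```

Because Traffic Manager is DNS-based, clients only follow the switch once their cached records expire, so a short TTL on the record matters as much as the threshold.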

Microsoft is deeply invested in ensuring this type of outage doesn’t happen again, since it affects not only customers but also core Microsoft services like M365, Minecraft, and more.