Facebook BGP outage 2021
How a single misconfigured BGP update command took Facebook, Instagram, and WhatsApp offline for 6+ hours and locked employees out of the buildings needed to fix it.
TL;DR
- On October 4, 2021, a routine maintenance command accidentally withdrew every BGP route for Facebook's Autonomous System (AS32934), making Facebook, Instagram, and WhatsApp unreachable from every ISP on the planet.
- The outage lasted approximately 6 hours, affecting 3.5 billion users and costing an estimated $100 million in revenue.
- DNS cascaded next: Facebook's authoritative nameservers lived in the same withdrawn IP ranges, so DNS resolution itself stopped working within minutes.
- The self-locking problem made recovery brutal: SSH, monitoring dashboards, badge-access systems, and internal coordination tools all ran on Facebook's own infrastructure.
- Engineers had to physically travel to data centers and manually re-enable BGP peering from the backbone routers.
- Transferable lesson: out-of-band management paths and blast-radius limits on network changes are not optional for any system at scale.
What Happened
On the morning of October 4, 2021, Facebook's backbone engineering team executed a routine maintenance command intended to assess the capacity of their global backbone network. The command contained a bug. Instead of evaluating backbone capacity, it withdrew all BGP route advertisements for AS32934 (Facebook's Autonomous System) from every peering router simultaneously.
Within 90 seconds, every ISP on the internet dropped its routes to Facebook. The IP ranges 157.240.0.0/16, 185.89.218.0/23, and dozens of other Facebook-owned prefixes vanished from the global routing table. Traffic destined for Facebook had nowhere to go.
The outage hit all Facebook properties: Facebook, Instagram, WhatsApp, Messenger, Oculus VR services, and all internal tools. It was not a partial degradation. It was a complete disappearance from the internet.
Timeline
| Time (US Eastern) | Event |
|---|---|
| ~11:39 AM | Maintenance command issued to backbone routers |
| ~11:40 AM | BGP routes for AS32934 begin withdrawing globally |
| 11:41 AM | Facebook.com and all services start failing for external users |
| ~11:50 AM | DNS resolution for *.facebook.com begins failing as cached records expire |
| ~12:00 PM | Facebook engineers realize the scope; remote access (SSH, VPN) is down |
| ~12:15 PM | External monitoring confirms: zero Facebook prefixes in global BGP table |
| ~1:00 PM | Teams dispatched to primary data centers in Prineville, OR and other locations |
| ~2:30 PM | First engineers gain physical access to backbone routers |
| ~5:00 PM | BGP routes begin re-advertising from restored peering sessions |
| ~5:28 PM | DNS resolution starts recovering as BGP routes propagate |
| ~6:05 PM | Facebook.com returns for most users; full recovery follows over next hour |
I've seen plenty of outage timelines where the "time to recovery" looks embarrassingly long. Almost every time, the real bottleneck is not the fix itself but getting access to the systems that need fixing. This incident is the textbook example.
How the System Worked Before
To understand why a single command caused total failure, you need to understand how the internet finds Facebook.
BGP Fundamentals
The internet is not one network. It is roughly 70,000 independent networks called Autonomous Systems (ASes), each identified by a unique AS number. Facebook operates AS32934. Your home ISP, your mobile carrier, and every cloud provider each operate their own AS.
Border Gateway Protocol (BGP) is how these autonomous systems tell each other which IP addresses they can reach. When Facebook's routers announce "I can reach 157.240.0.0/16" to their upstream peers (Cogent, Telia, NTT, Hurricane Electric), those peers propagate the announcement to their peers. Within seconds, every router on the internet knows: "To reach 157.240.x.x, send traffic toward AS32934."
BGP operates on trust. If AS32934 says "I can reach these IP ranges," peers believe it. If AS32934 says "I can no longer reach these IP ranges" (a route withdrawal), peers believe that too and drop the routes.
There is no built-in verification mechanism in BGP. A route withdrawal is treated as authoritative, and peers propagate it within seconds. This trust model is what makes BGP fast and efficient, but it also means a single misconfigured announcement can have global impact. Resource Public Key Infrastructure (RPKI) adds cryptographic validation of route origins, but adoption remains partial, and it would not have prevented this specific failure because Facebook genuinely did withdraw its own routes.
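To make the trust model concrete, here is a minimal sketch of one router's view of the world. It is a toy, not real BGP: the `ToyRoutingTable` class and its method names are invented for illustration, and it ignores AS paths, peering sessions, and route selection policy. It shows only that an announcement and a withdrawal are both accepted without question.

```python
import ipaddress


class ToyRoutingTable:
    """Toy model of one router's table: prefix -> origin AS. Hypothetical, not real BGP."""

    def __init__(self):
        self.routes = {}  # ip_network -> origin AS number

    def announce(self, prefix, origin_as):
        # Accepted on trust: nothing checks that origin_as is entitled to the prefix.
        self.routes[ipaddress.ip_network(prefix)] = origin_as

    def withdraw(self, prefix):
        # Also accepted on trust: the route is simply dropped.
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        # Longest-prefix match; None means there is no route to the address.
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = ToyRoutingTable()
table.announce("157.240.0.0/16", 32934)
before = table.lookup("157.240.1.35")  # route exists, origin AS32934
table.withdraw("157.240.0.0/16")
after = table.lookup("157.240.1.35")   # no route left
```

After the withdrawal the lookup has nothing to match, which is the single-router version of "traffic destined for Facebook had nowhere to go."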
Facebook's Network Architecture
Facebook's network had three layers relevant to this incident:
- Backbone network: High-capacity fiber links connecting Facebook's dozens of data centers globally. Internal traffic (replication, inter-DC communication) travels here.
- Peering routers: Edge routers at Points of Presence (PoPs) that establish BGP sessions with upstream ISPs and Internet Exchange Points (IXPs). These routers advertise Facebook's IP prefixes to the world.
- DNS infrastructure: Facebook's authoritative nameservers (a.ns.facebook.com through d.ns.facebook.com) that resolve facebook.com to specific IP addresses. Critically, these servers were hosted on Facebook's own IP ranges.
The critical detail: Facebook's DNS servers lived inside the same IP address space that BGP advertised. This created a circular dependency. If BGP stopped advertising those IP ranges, the DNS servers became unreachable, and nobody could even look up where Facebook was supposed to be.
For your interview prep: this is the canonical example of a circular dependency in infrastructure. The system that tells the internet where to find Facebook (DNS) depends on the system that makes Facebook reachable (BGP). Neither can function without the other.
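A dependency like this can be caught with a very small audit. The sketch below checks whether each authoritative nameserver's address sits inside a prefix the same organization announces. The function name `stranded_nameservers` is made up for this example, the `129.134.30.0/24` prefix is an assumption chosen to contain the nameserver address quoted later in this article, and `oob-ns.example.net` is a hypothetical out-of-band nameserver on a documentation address.

```python
import ipaddress


def stranded_nameservers(nameserver_ips, own_prefixes):
    """Return the nameservers whose address lies inside a prefix we announce ourselves."""
    nets = [ipaddress.ip_network(p) for p in own_prefixes]
    return {
        name: ip
        for name, ip in nameserver_ips.items()
        if any(ipaddress.ip_address(ip) in net for net in nets)
    }


# Prefixes announced by the organization (the last one is an assumption).
own_prefixes = ["157.240.0.0/16", "185.89.218.0/23", "129.134.30.0/24"]

nameservers = {
    "a.ns.facebook.com": "129.134.30.12",
    "oob-ns.example.net": "192.0.2.53",  # hypothetical nameserver hosted elsewhere
}

stranded = stranded_nameservers(nameservers, own_prefixes)
```

Any nameserver in `stranded` disappears together with the prefixes it is supposed to help people find. A nameserver hosted in independently routed address space would not appear in the result.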
The Failure Cascade
The cascade unfolded in four distinct phases, each amplifying the one before it.
Phase 1: BGP Route Withdrawal (T+0 to T+2 minutes)
The maintenance command ran on Facebook's backbone routers and contained a bug in its capacity-assessment logic. Instead of evaluating whether the backbone could handle a configuration change, it issued the change directly: withdraw all BGP route advertisements from every peering router simultaneously.
Facebook's peering routers sent BGP UPDATE messages withdrawing Facebook's prefixes to every upstream ISP and IXP they were connected to. Within about 90 seconds, the withdrawal propagated globally. AS32934 effectively vanished from the internet's routing table.
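The blast-radius limit mentioned in the TL;DR can be pictured as a pre-flight check that compares what is advertised now with what would be advertised after a change. This is a hedged sketch, not Facebook's actual audit tooling: `check_blast_radius` and its `max_withdrawn_fraction` threshold are hypothetical names and values chosen for illustration.

```python
def check_blast_radius(current_prefixes, proposed_prefixes, max_withdrawn_fraction=0.5):
    """Raise ValueError if a change would withdraw more than the allowed share of prefixes."""
    current = set(current_prefixes)
    withdrawn = current - set(proposed_prefixes)
    if not current:
        return withdrawn
    fraction = len(withdrawn) / len(current)
    if fraction > max_withdrawn_fraction:
        raise ValueError(
            f"change withdraws {len(withdrawn)} of {len(current)} prefixes "
            f"({fraction:.0%}); limit is {max_withdrawn_fraction:.0%}"
        )
    return withdrawn


advertised = ["157.240.0.0/16", "185.89.218.0/23"]

# A change that drops one prefix stays within the (illustrative) 50% limit.
small_change = check_blast_radius(advertised, ["157.240.0.0/16"])

# A change that drops everything is refused before it reaches any router.
try:
    check_blast_radius(advertised, [])
    blocked = False
except ValueError:
    blocked = True
```

The point of the design is that the check runs outside the command being checked, so a bug in the command cannot also disable the guard.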
Phase 2: DNS Collapse (T+2 to T+15 minutes)
DNS resolvers worldwide had cached records for facebook.com pointing to IPs like 157.240.1.35. Those cached records were still valid, but the IP addresses were now unreachable (no BGP route to get there). Users saw connection timeouts.
As DNS TTLs expired (typically 5-15 minutes for Facebook's records), resolvers tried to refresh by querying Facebook's authoritative nameservers. But a.ns.facebook.com resolved to 129.134.30.12, an IP in a withdrawn range. The authoritative servers were gone too. DNS resolution for all Facebook properties began returning SERVFAIL.
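The delay between the route withdrawal and the first SERVFAIL responses falls out of how caching resolvers behave. The toy resolver below is an illustration only: `ToyResolver` is a hypothetical class, time is a plain counter in seconds, no real DNS traffic is involved, and the 300-second TTL is an assumed value at the low end of the range quoted above.

```python
class ToyResolver:
    """Caching resolver sketch. Hypothetical; times are plain seconds."""

    def __init__(self, authoritative_records):
        self.authoritative_records = authoritative_records  # name -> (ip, ttl)
        self.authoritative_reachable = True
        self.cache = {}  # name -> (ip, expires_at)

    def resolve(self, name, now):
        cached = self.cache.get(name)
        if cached and now < cached[1]:
            # Still within TTL: answer from cache, even if the IP is unreachable.
            return cached[0]
        if not self.authoritative_reachable:
            # Cache expired and the authoritative servers cannot be reached.
            return "SERVFAIL"
        ip, ttl = self.authoritative_records[name]
        self.cache[name] = (ip, now + ttl)
        return ip


resolver = ToyResolver({"facebook.com": ("157.240.1.35", 300)})
at_start = resolver.resolve("facebook.com", now=0)      # fills the cache
resolver.authoritative_reachable = False                # routes withdrawn
during_ttl = resolver.resolve("facebook.com", now=120)  # cached answer, dead IP
after_ttl = resolver.resolve("facebook.com", now=400)   # cache expired
```

During the TTL window users get a valid-looking answer that points at an unreachable address, which shows up as a connection timeout. Once the entry expires, the failure changes character and becomes a DNS error.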