Facebook BGP outage 2021
How a single misconfigured BGP update command took Facebook, Instagram, and WhatsApp offline for roughly six hours and locked employees out of the very buildings where the fix had to happen.
TL;DR
- On October 4, 2021, a routine maintenance command accidentally withdrew every BGP route for Facebook's Autonomous System (AS32934), making Facebook, Instagram, and WhatsApp unreachable from every ISP on the planet.
- The outage lasted approximately 6 hours, affecting 3.5 billion users and costing an estimated $100 million in revenue.
- DNS cascaded next: Facebook's authoritative nameservers lived in the same withdrawn IP ranges, so DNS resolution itself stopped working within minutes.
- The self-locking problem made recovery brutal: SSH, monitoring dashboards, badge-access systems, and internal coordination tools all ran on Facebook's own infrastructure.
- Engineers had to physically travel to data centers and manually re-enable BGP peering from the backbone routers.
- Transferable lesson: out-of-band management paths and blast-radius limits on network changes are not optional for any system at scale.
What Happened
On the morning of October 4, 2021, Facebook's backbone engineering team executed a routine maintenance command intended to assess the capacity of their global backbone network. The command contained a bug. Instead of evaluating backbone capacity, it withdrew all BGP route advertisements for AS32934 (Facebook's Autonomous System) from every peering router simultaneously.
Within 90 seconds, every ISP on the internet dropped its routes to Facebook. The IP ranges 157.240.0.0/16, 185.89.218.0/23, and dozens of other Facebook-owned prefixes vanished from the global routing table. Traffic destined for Facebook had nowhere to go.
The outage hit all Facebook properties: Facebook, Instagram, WhatsApp, Messenger, Oculus VR services, and all internal tools. It was not a partial degradation. It was a complete disappearance from the internet.
Timeline
| Time (Eastern) | Event |
|---|---|
| ~11:39 AM | Maintenance command issued to backbone routers |
| ~11:40 AM | BGP routes for AS32934 begin withdrawing globally |
| 11:41 AM | Facebook.com and all services start failing for external users |
| ~11:50 AM | DNS resolution for *.facebook.com begins failing as cached records expire |
| ~12:00 PM | Facebook engineers realize the scope; remote access (SSH, VPN) is down |
| ~12:15 PM | External monitoring confirms: zero Facebook prefixes in global BGP table |
| ~1:00 PM | Teams dispatched to primary data centers in Prineville, OR and other locations |
| ~2:30 PM | First engineers gain physical access to backbone routers |
| ~5:00 PM | BGP routes begin re-advertising from restored peering sessions |
| ~5:28 PM | DNS resolution starts recovering as BGP routes propagate |
| ~6:05 PM | Facebook.com returns for most users; full recovery follows over next hour |
I've seen plenty of outage timelines where the "time to recovery" looks embarrassingly long. Almost every time, the real bottleneck is not the fix itself but getting access to the systems that need fixing. This incident is the textbook example.
How the System Worked Before
To understand why a single command caused total failure, you need to understand how the internet finds Facebook.
BGP Fundamentals
The internet is not one network. It is roughly 70,000 independent networks called Autonomous Systems (ASes), each identified by a unique AS number. Facebook operates AS32934. Your home ISP, your mobile carrier, and every cloud provider each operate their own AS.
Border Gateway Protocol (BGP) is how these autonomous systems tell each other which IP addresses they can reach. When Facebook's routers announce "I can reach 157.240.0.0/16" to their upstream peers (Cogent, Telia, NTT, Hurricane Electric), those peers propagate the announcement to their peers. Within seconds, every router on the internet knows: "To reach 157.240.x.x, send traffic toward AS32934."
BGP operates on trust. If AS32934 says "I own these IP ranges," peers believe it. If AS32934 says "I no longer own these IP ranges" (a route withdrawal), peers believe that too and immediately drop the routes.
There is no built-in verification mechanism in BGP. A route withdrawal is treated as authoritative, and peers propagate it within seconds. This trust model is what makes BGP fast and efficient, but it also means a single misconfigured announcement can have global impact. Resource Public Key Infrastructure (RPKI) adds cryptographic validation of route origins, but adoption remains partial, and it would not have prevented this specific failure because Facebook genuinely did withdraw its own routes.
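The announce/withdraw mechanics can be sketched as a toy routing table, as seen from a single peer. This is a simplified model of the trust relationship, not a real BGP implementation; the prefixes mirror the incident but the logic is illustrative:

```python
# Toy model of how a BGP peer accepts announcements and withdrawals on trust.
# Prefixes and AS numbers mirror the incident, but the logic is illustrative.

routing_table = {}  # prefix -> origin AS, as seen by one peer router

def receive_announce(prefix: str, origin_as: int) -> None:
    """A peer announced reachability; BGP accepts it without verification."""
    routing_table[prefix] = origin_as

def receive_withdraw(prefix: str) -> None:
    """A withdrawal is equally authoritative: the route is dropped at once."""
    routing_table.pop(prefix, None)

# Normal operation: Facebook (AS32934) announces its prefixes.
receive_announce("157.240.0.0/16", 32934)
receive_announce("185.89.218.0/23", 32934)
assert "157.240.0.0/16" in routing_table

# The buggy maintenance command's effect: withdraw everything at once.
for prefix in list(routing_table):
    receive_withdraw(prefix)

# The peer now has no path to AS32934 at all.
assert routing_table == {}
```

The point of the sketch: nothing in the protocol distinguishes a legitimate withdrawal from a catastrophic mistake, so the peer's table empties just as fast either way.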
Facebook's Network Architecture
Facebook's network had three layers relevant to this incident:
- Backbone network: High-capacity fiber links connecting Facebook's dozens of data centers globally. Internal traffic (replication, inter-DC communication) travels here.
- Peering routers: Edge routers at Points of Presence (PoPs) that establish BGP sessions with upstream ISPs and Internet Exchange Points (IXPs). These routers advertise Facebook's IP prefixes to the world.
- DNS infrastructure: Facebook's authoritative nameservers (a.ns.facebook.com through d.ns.facebook.com) that resolve facebook.com to specific IP addresses. Critically, these servers were hosted on Facebook's own IP ranges.
The critical detail: Facebook's DNS servers lived inside the same IP address space that BGP advertised. This created a circular dependency. If BGP stopped advertising those IP ranges, the DNS servers became unreachable, and nobody could even look up where Facebook was supposed to be.
For your interview prep: this is the canonical example of a circular dependency in infrastructure. The system that tells the internet where to find Facebook (DNS) depends on the system that makes Facebook reachable (BGP). Neither can function without the other.
The Failure Cascade
The cascade unfolded in four distinct phases, each amplifying the one before it.
Phase 1: BGP Route Withdrawal (T+0 to T+2 minutes)
The maintenance command ran on Facebook's backbone routers and contained a bug in its capacity-assessment logic. Instead of evaluating whether the backbone could handle a configuration change, it issued the change directly: withdraw all BGP route advertisements from every peering router simultaneously.
Facebook's peering routers sent BGP WITHDRAW messages to every upstream ISP and IXP they were connected to. Within about 90 seconds, the withdrawal propagated globally. AS32934 effectively vanished from the internet's routing table.
Phase 2: DNS Collapse (T+2 to T+15 minutes)
DNS resolvers worldwide had cached records for facebook.com pointing to IPs like 157.240.1.35. Those cached records were still valid, but the IP addresses were now unreachable (no BGP route to get there). Users saw connection timeouts.
As DNS TTLs expired (typically 5-15 minutes for Facebook's records), resolvers tried to refresh by querying Facebook's authoritative nameservers. But a.ns.facebook.com resolved to 129.134.30.12, an IP in a withdrawn range. The authoritative servers were gone too. DNS resolution for all Facebook properties began returning SERVFAIL.
Phase 3: The Self-Locking Problem (T+5 to T+180 minutes)
This is where the outage became truly exceptional.
Facebook's internal tools for network management, monitoring, and incident response were hosted on Facebook's own infrastructure. When BGP routes were withdrawn, every remote recovery path broke simultaneously:
- SSH access: Engineers could not SSH into backbone routers. The management IPs were in the withdrawn ranges.
- Monitoring dashboards: Internal monitoring tools were unreachable. Engineers were flying blind.
- Out-of-band console servers: Even some out-of-band management paths depended on DNS resolution that was now broken.
- Internal communication: Workplace (Facebook's internal Slack equivalent) was down. Engineers resorted to personal cell phones and text messages.
- Physical access: Badge readers at Facebook data centers used a networked system that relied on Facebook's infrastructure. Some engineers could not badge into the buildings that held the routers they needed to fix.
I've worked in environments where the "break glass" procedure assumed network connectivity. This incident is the reason every organization should ask: "What happens if the network itself is the thing that's broken?"
The Self-Locking Problem in Detail
The self-locking dynamic deserves separate attention because it is the most transferable lesson from this incident.

Every dependency chain leads to the same conclusion: the system that needed fixing controlled access to the systems required to fix it. This is the definition of a self-locking failure.
Phase 4: Internet-Wide Collateral Damage (T+0 to T+360 minutes)
The outage stressed the global internet infrastructure, not just Facebook's users.
Approximately 3.5 billion devices had Facebook, Instagram, or WhatsApp installed. When these apps lost connectivity, they began retrying aggressively. Mobile apps typically use exponential backoff, but with short initial intervals and billions of devices, the aggregate retry volume was enormous.
DNS resolver operators reported 10-30x normal query volumes for Facebook-related domains. Some recursive resolvers became overloaded, causing slower DNS resolution for completely unrelated domains. Root nameservers and TLD servers saw elevated load as resolvers walked up the DNS hierarchy looking for Facebook's NS records.
Retry storms at scale are a second outage
When 3.5 billion devices retry simultaneously, the retry traffic itself becomes a denial-of-service attack on shared infrastructure. DNS resolvers, root nameservers, and ISP networks all absorbed load from a service they did not operate. Design retry strategies with global impact in mind, not just your own backend capacity.
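A retry policy that avoids synchronized storms is exponential backoff with full jitter plus a retry cap. A minimal sketch, with illustrative parameter values:

```python
# Exponential backoff with full jitter: each client waits a random time in
# [0, min(cap, base * 2**attempt)), so retries from many clients spread out
# instead of arriving in synchronized waves. Parameter values are illustrative.
import random

def retry_delays(base=1.0, cap=300.0, max_retries=8):
    """Yield one randomized delay per retry attempt, then give up."""
    for attempt in range(max_retries):
        window = min(cap, base * 2 ** attempt)   # 1, 2, 4, ... capped at 300s
        yield random.uniform(0, window)

delays = list(retry_delays())
# Without jitter, every client would retry at exactly 1, 2, 4, ... seconds,
# turning each interval boundary into a synchronized spike on shared
# infrastructure such as DNS resolvers.
```

The cap and the finite retry count matter as much as the jitter: together they bound how much aggregate load a failed dependency can attract indefinitely.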
Why It Wasn't Caught
Facebook had safeguards. They failed for specific, instructive reasons.
The Audit Tool Bug
The maintenance command was supposed to be gated by an audit tool that verified whether the backbone could handle the configuration change before applying it. The audit tool had a bug: it did not correctly evaluate the command's impact and approved the change.
The audit tool was testing the wrong condition. It checked whether the backbone had sufficient capacity (it did), not whether the command would withdraw routes (it would). A pre-flight check that validates the wrong property is worse than no check at all, because it creates false confidence.
No Blast-Radius Limit
The command affected every peering router globally in a single operation. There was no staged rollout, no canary region, no circuit breaker that would halt propagation if a certain percentage of routes disappeared.
In my experience, the most dangerous systems are the ones that let you affect everything at once. Every configuration management system should enforce "you can change at most N% of infrastructure in one operation." Facebook learned this at 100% blast radius.
In-Band Management
The primary management path for Facebook's network was in-band: management traffic traveled over the same network it managed. When the network went down, so did the ability to manage it. Out-of-band management paths existed but were incomplete, with some still depending on DNS resolution through the same broken infrastructure.
Slow Physical Access Procedures
The "send someone to the data center" fallback existed as a procedure, but execution was slow. Security clearances for the most sensitive network equipment rooms required specific personnel. Those people were not always near the data centers. Travel time from the Bay Area to Prineville, Oregon alone is roughly 8 hours by car.
The Fix
Recovery required physical human presence at multiple data centers simultaneously.
Step 1: Establish Communication
With Workplace down, engineers coordinated via personal cell phones, text messages, and eventually a hastily arranged conference bridge on a non-Facebook system. Some engineers reportedly drove to the Menlo Park campus in person to coordinate face-to-face.
Step 2: Gain Physical Access
Teams were dispatched to Facebook's primary data centers. At some facilities, badge access was degraded because the badge systems depended on the network. Engineers worked with on-site security to gain manual access to the network equipment rooms. This step alone consumed hours.
Step 3: Console Into Backbone Routers
Once physically present, engineers connected to backbone routers via serial console cables. Serial consoles are truly out-of-band: they use a direct physical connection (typically RS-232) that does not depend on any network. Engineers verified the router state and confirmed that all BGP peering sessions were down.
Step 4: Re-enable BGP Carefully
Engineers manually reconfigured the backbone routers to re-advertise Facebook's IP prefixes via BGP. This had to be done carefully, not all at once. Simply restoring all routes simultaneously would cause a traffic stampede as billions of cached-out clients reconnected within seconds.
The re-advertisement was staged: routes were brought back gradually, and engineers monitored backbone capacity at each step to avoid overloading the network with the flood of returning traffic.
Step 5: DNS and Service Recovery
As BGP routes propagated globally (1-5 minutes for full convergence), Facebook's authoritative DNS servers became reachable again. DNS resolvers refreshed their caches. Services came back in waves as DNS propagated and application-layer health checks passed.
Full recovery took approximately 45 minutes after BGP routes were restored. The long tail of recovery was driven by DNS cache behavior at thousands of ISP resolvers worldwide. Some users in regions with aggressive DNS caching saw Facebook return within minutes. Others, whose resolvers had briefly cached the failed lookups (resolvers may hold SERVFAIL results for short periods), waited longer.
The staged BGP restoration was essential. If all routes had been restored simultaneously, billions of devices would have reconnected in a thundering herd pattern, potentially overloading the backbone and web servers. By bringing routes back gradually and monitoring at each step, engineers ensured the network could absorb the traffic surge.
The Root Cause
The proximate cause was a bug in the audit tool that approved an unsafe backbone configuration change. But the root cause was architectural.
Circular dependency: Facebook's DNS infrastructure, management tools, monitoring systems, and physical access controls all depended on the same network (and the same BGP-advertised IP ranges) they were supposed to monitor and manage. There was no truly independent control plane.
Unlimited blast radius for network changes: The maintenance system allowed a single command to affect all peering routers globally with no staged rollout, no automatic rollback trigger, and no circuit breaker.
Insufficient out-of-band management: While serial consoles existed at the routers, the end-to-end recovery procedure (communication, physical access, coordination) was slow and not regularly tested at full scale.
The root cause was not "someone ran a bad command." It was that the architecture allowed a single bad command to have unlimited, irrecoverable impact with no independent path to fix it.
Proximate cause vs. root cause
In post-mortems, always distinguish between the trigger (the buggy command) and the root cause (the architecture that allowed the trigger to have catastrophic impact). Blaming the trigger leads to "be more careful next time." Analyzing the root cause leads to architectural improvements that prevent entire classes of failure.
Architectural Changes After
Facebook published details about their post-incident improvements. The changes fell into three categories.
1. Blast-Radius Controls for Network Changes
Facebook implemented a staged rollout system for backbone configuration changes. Instead of applying changes to all peering routers at once, changes now roll out to a small percentage of routers first. The system monitors BGP route counts, reachability metrics, and traffic volumes after each stage. If any metric degrades beyond a threshold, the change automatically rolls back.
This is conceptually identical to how software deployments use canary releases. The same principle applies to network configuration: treat every config change like a code deploy with automated rollback.
2. Independent Out-of-Band Management
Facebook invested in a management network that is fully independent of the production network and its BGP advertisements. This out-of-band network has:
- Separate IP address space: Management IPs that are not in the same ranges as production services. These IPs are advertised via a different AS or through a dedicated management transit provider.
- Independent DNS resolution: Management DNS that does not depend on Facebook's production authoritative nameservers.
- Dedicated connectivity: Management traffic routes through separate physical links and separate transit providers, with no shared fate with the production backbone.
- Independent authentication: Access credentials and badge systems that function without the production network.
The goal: if production BGP withdraws every route again, the management network still works. Think of it as a separate nervous system for your infrastructure. The production network carries user traffic. The management network carries operator traffic. They share no common dependencies.
This pattern is standard in telecommunications networks, where out-of-band management has been a requirement for decades. Internet-scale companies are newer to this discipline because they grew up treating "the network" as a single abstraction.
3. Improved Physical Access Procedures
Facebook overhauled their "break glass" procedures for physical data center access during network outages:
- Pre-authorized personnel lists at every data center, with security teams briefed on emergency access protocols.
- Emergency access mechanisms that do not depend on networked badge systems (physical keys, pre-arranged security codes).
- Regular drills simulating total network loss, including the physical access and serial console steps.
- Pre-positioned equipment and documentation at each facility so that any authorized engineer can perform recovery without needing to bring specialized tools.
The key insight: disaster recovery procedures are only as fast as their slowest step. For Facebook, that slowest step was physical human presence at a data center. Every optimization in their post-incident plan targets reducing that time: pre-authorized access, pre-positioned tools, pre-documented procedures, and regular drills to keep the muscle memory fresh.
Interview anchor: out-of-band management
When an interviewer asks about reliability or disaster recovery, mention out-of-band management. The sentence: "Management and monitoring paths must be independent of the production network they manage, so you can still diagnose and fix the system when the system itself is down." Cite the Facebook BGP outage as the canonical example.
Architecture Decision Guide
Use this decision flowchart to determine when your system needs the same mitigations Facebook implemented after this outage.
Transferable Lessons
1. Out-of-band management is not optional for critical infrastructure
Management traffic (SSH, monitoring, incident response tools) must travel on a network that does not depend on the system it manages. This applies at every scale: if your monitoring runs on the same Kubernetes cluster it monitors, you lose visibility when you need it most. Facebook's 6-hour outage would have been a 30-minute outage if engineers could have SSH'd into the routers remotely.
2. Blast radius must be mechanically enforced, not just policy
"Don't run commands that affect everything at once" is a policy. Policies fail when humans are tired, distracted, or when tools have bugs. The system itself must enforce maximum blast radius: staged rollouts with automatic rollback on metric degradation. Facebook's audit tool was supposed to prevent this. It had a bug. The blast-radius limit should have been a harder constraint, not a soft check.
3. Circular dependencies are invisible until they activate
Facebook's DNS depended on BGP. Their management tools depended on their DNS. Their physical access depended on their management tools. Under normal operation, this circular dependency was invisible. It only manifested when the bottom layer (BGP) failed. Audit your infrastructure for circular dependencies before an incident reveals them for you.
4. Physical access is a recovery mechanism that must be rehearsed
"We'll just go to the data center" sounds simple until you factor in travel time, security clearances, finding the right person who knows the right procedure for the specific router in the specific rack. Facebook's data center procedure existed but was unpracticed for a total-network-loss scenario. Rehearse your worst-case recovery path at least quarterly.
5. Retry storms from client devices are a system design problem
When billions of devices retry aggressively, the retry traffic itself becomes a distributed denial-of-service attack on shared internet infrastructure. Client-side retry behavior (backoff intervals, jitter, maximum retry limits) is not just a client concern. It is a system design decision with global consequences. Facebook's outage caused collateral damage to DNS infrastructure worldwide because mobile apps retried too aggressively.
How This Shows Up in Interviews
When to cite this case study
Bring up the Facebook BGP outage when an interviewer asks about reliability, disaster recovery, network design, or blast-radius management. The sentence: "A single misconfigured BGP command took Facebook offline for 6 hours because their management tools, DNS, and physical access all depended on the same network, creating a self-locking failure."
This case study is also useful when discussing DNS architecture, retry strategies, or why configuration changes need staged rollouts. It is one of the few incidents that affected not just the company in question but global internet infrastructure.
Key technical terms to use
Drop these terms naturally and you signal depth:
- Autonomous System (AS): An independently operated network with its own routing policy.
- BGP route withdrawal: The mechanism by which a network tells its peers "I no longer own these IPs."
- In-band vs. out-of-band management: Whether management traffic shares fate with production traffic.
- Self-locking failure: When the system that needs fixing controls the tools required to fix it.
- Blast radius: The percentage of infrastructure affected by a single change or failure.
Interviewer Q&A
| Interviewer asks | Strong answer citing this case study |
|---|---|
| "How do you prevent a single config change from causing a global outage?" | "Enforce blast-radius limits mechanically. Staged rollout to N% of infrastructure, monitor health metrics after each stage, auto-rollback on degradation. Facebook's 2021 BGP outage happened because one command could affect 100% of peering routers at once." |
| "What is out-of-band management and why does it matter?" | "Management traffic (SSH, monitoring, dashboards) on a network independent of production. Facebook's engineers couldn't SSH into routers to fix the BGP withdrawal because the management path used the same now-dead network. Out-of-band cuts recovery time from hours to minutes." |
| "How do you design a system to survive a total network failure?" | "Three layers: (1) out-of-band management network with separate IP ranges, DNS, and transit providers, (2) pre-authorized physical access procedures that don't depend on networked badge systems, (3) staged rollout with auto-rollback so the failure can't happen in the first place." |
| "What are circular dependencies in infrastructure?" | "When system A depends on system B, which depends on system A. Facebook's DNS servers were hosted on the same IPs that BGP advertised. When BGP went down, DNS went down. When DNS was down, nobody could find the DNS servers to check if they were up. The fix: host DNS on infrastructure independent of your production network." |
| "How do you handle retry storms from client devices?" | "Three controls: exponential backoff with jitter on the client side, circuit breakers that stop retries after repeated failures, and server-side load shedding that returns 503 with Retry-After headers. Facebook's 3.5 billion retrying devices created a DNS tsunami that hurt the entire internet, not just Facebook." |
Quick Recap
- On October 4, 2021, a buggy maintenance command withdrew all BGP routes for Facebook's AS32934, making Facebook, Instagram, and WhatsApp completely unreachable from the internet for approximately 6 hours.
- Facebook's authoritative DNS servers lived in the same IP address ranges, creating a circular dependency that cascaded BGP failure into total DNS failure.
- Remote recovery was impossible because SSH, monitoring, and internal communication tools all depended on the now-unreachable network.
- Physical data center access was the only recovery path, but badge systems were networked, security clearances were required, and travel time added hours.
- Billions of retrying mobile devices created a DNS query tsunami that stressed global internet infrastructure, causing collateral damage to unrelated services.
- Post-incident, Facebook implemented blast-radius limits for network changes, independent out-of-band management, and rehearsed physical access procedures.
- The transferable principle: management and recovery paths must be independent of the system they manage, and configuration changes must have mechanically enforced blast-radius limits.
Related Concepts
- Networking fundamentals - BGP, DNS, and how traffic routing works across autonomous systems, which is the foundational knowledge behind this entire incident.
- DNS internals - How DNS resolution works, TTL caching, authoritative vs recursive resolvers, and why DNS failure cascaded so quickly when Facebook's nameservers became unreachable.
- Observability - Why monitoring systems must have independent infrastructure, and what happens to incident response when your dashboards are down alongside production.