Cloudflare BGP leak 2019
How a configuration error at a small ISP created a BGP route leak that caused Cloudflare, Amazon, and hundreds of other services to become unreachable for 6 hours in 2019.
TL;DR
- On June 24, 2019, a small Pennsylvania ISP (Allegheny Technologies, AS396531) misconfigured a BGP route optimizer and re-advertised ~20,000 Cloudflare routes through Verizon's backbone to the global internet.
- BGP's shortest-path algorithm preferred the leaked 3-hop path over the legitimate 7-hop path, funneling hundreds of gigabits of traffic through a link with ~100 Mbps capacity.
- Cloudflare's DNS resolver (1.1.1.1), Amazon, Google, and hundreds of smaller services became unreachable or severely degraded for approximately 6 hours.
- Three defense layers failed simultaneously: no export filtering at the leaking ISP, no prefix limit filters at Verizon's edge, and no RPKI route origin validation.
- The fix was entirely manual: Verizon removed the invalid routes after ~6 hours of escalation and verification.
- The transferable lesson: any system built on trust-based protocols needs defense-in-depth, because a single misconfiguration by a third party can take you offline.
What Happened
On the morning of June 24, 2019, a routine day on the internet turned into a 6-hour crisis that affected millions of users.
Allegheny Technologies (AS396531), a small network in Pennsylvania connected to both DQE Communications and Verizon, made a BGP configuration error. Its route optimizer software leaked approximately 20,000 prefixes covering Cloudflare's network (AS13335) that it had learned over its transit connections, re-advertising them as customer routes to its upstream peers.
To put the scale in context: Allegheny was a tiny ISP, the kind that serves a local region with modest bandwidth. Cloudflare is one of the largest network operators in the world, with a global anycast network spanning 200+ data centers on six continents. The idea that a configuration change at Allegheny could affect Cloudflare's global reachability sounds absurd. But BGP doesn't care about your network's size. It cares about your AS path length.
Those routes propagated through Verizon's backbone (AS701) and out to the global internet. Because the leaked AS path was shorter than the legitimate path, BGP routers worldwide selected the invalid route. Traffic destined for Cloudflare, Amazon, and hundreds of other major services got funneled through Allegheny's tiny network link.
The speed of propagation is worth emphasizing. BGP updates propagate across the internet in seconds to minutes. By the time Cloudflare's monitoring detected the anomaly (roughly 10 minutes), the leaked routes had already been accepted by thousands of autonomous systems worldwide. There's no "recall" mechanism for a bad route. Once it's propagated, it stays until someone explicitly withdraws it.
I've dealt with BGP incidents in production, and the scariest part is always the same: the problem originates in a network you don't control, and there's nothing you can do except wait for someone else to fix it.
| Timestamp (UTC) | Event |
|---|---|
| ~09:00 | Allegheny Technologies' BGP optimizer leaks ~20,000 prefixes as customer routes |
| ~09:02 | Leaked routes propagate through Verizon (AS701) to global routing tables |
| ~09:05 | Cloudflare, Amazon, Google begin experiencing traffic blackholing and severe congestion |
| ~09:10 | Cloudflare detects the anomaly via external monitoring and begins investigation |
| ~09:15 | Internet routing community begins observing anomalous AS paths via looking glasses |
| ~09:30 | Cloudflare identifies the route leak source (AS396531) and contacts Verizon |
| ~10:00+ | Multiple NOCs (Network Operations Centers) begin filing abuse reports with Verizon |
| ~11:00 | Verizon acknowledges the issue internally and begins investigating |
| ~13:00 | Verizon identifies the specific peering session and misconfigured routes |
| ~15:00 | Verizon manually removes the invalid routes from their backbone |
| ~15:30 | Global routing tables converge back to legitimate paths; services recover |
The total duration was approximately 6 hours. Six hours of degraded or unreachable service for some of the largest properties on the internet, caused by a misconfiguration at a tiny ISP most people had never heard of.
The most frustrating part for every network engineer watching this unfold in real time: there was nothing Cloudflare could do. The misconfiguration wasn't in their network. The invalid routes weren't in their routing tables. They had to wait for Verizon to act.
What is a BGP route leak?
A BGP route leak occurs when an AS advertises routing information to a neighbor that violates the intended routing policy. The most common type: re-advertising routes learned from a provider to another provider, effectively becoming an unauthorized transit point. RFC 7908 formally defines six types of route leaks.
How the System Worked Before
To understand why this happened, you need to understand how BGP (Border Gateway Protocol) actually works. BGP is the routing protocol that holds the internet together. Every large network on the internet operates as an Autonomous System (AS), identified by a unique AS number. As of 2019, there were over 65,000 active ASes on the internet.
When Cloudflare wants the world to reach its network, it advertises its IP prefixes to its transit providers. Those providers propagate the advertisements to their peers and customers. The result is a distributed routing table where every AS on the internet knows how to reach every other AS.
An IP prefix is a range of IP addresses expressed in CIDR notation (like 104.16.0.0/12). Cloudflare owns thousands of these prefixes. When Cloudflare "advertises" a prefix, it sends a BGP UPDATE message to its transit providers saying: "I can deliver traffic for these IP addresses." The transit provider then sends its own UPDATE to its peers, prepending its own AS number to the path.
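To make the notation concrete, here's a quick check using Python's ipaddress module (the specific addresses are illustrative):
# CIDR containment: 104.16.0.0/12 covers every address whose first 12 bits match
import ipaddress
prefix = ipaddress.ip_network("104.16.0.0/12")
print(prefix.num_addresses)                          # 1,048,576 addresses in this single prefix
print(ipaddress.ip_address("104.17.2.3") in prefix)  # True: covered by the advertisement
print(ipaddress.ip_address("8.8.8.8") in prefix)     # False: another network's address space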
The AS path is the sequence of AS numbers that a route has traversed. Each time a route passes through an AS, that AS's number gets prepended. The result: every router on the internet can see the complete chain of networks a packet would traverse to reach a destination.
Here's what a normal BGP route looks like in a routing table:
// BGP routing table entry for Cloudflare prefix
Prefix: 104.16.0.0/12
Next Hop: 192.0.2.1 (transit provider router)
AS Path: [701, 3356, 13335]
Origin: IGP
Local Pref: 100
MED: 0
Valid: yes
Best: yes
The AS path [701, 3356, 13335] means: this route was originated by AS13335 (Cloudflare), passed through AS3356 (Lumen), and arrived via AS701 (Verizon). Three hops. A router comparing this against an alternative path of [174, 1299, 3356, 13335] (four hops) would prefer the shorter one.
BGP routing relies on three relationship types between autonomous systems:
| Relationship | Description | Route propagation rule |
|---|---|---|
| Customer to Provider | Customer pays provider for transit | Provider re-advertises customer routes to all peers and other customers |
| Peer to Peer | Two networks exchange traffic for free | Peer routes only advertised to customers, not to other peers or providers |
| Provider to Customer | Provider sends full routing table to customer | Customer should only re-advertise its own prefixes and its customers' prefixes upstream |
The critical rule: a customer should never re-advertise routes learned from one provider to another provider. That would make the customer a free transit point, funneling traffic it cannot handle. This is called the "valley-free" routing principle: traffic should flow up to a common transit provider, across at a peering point, or down to a customer. It should never go up, down, and up again.
In the Cloudflare leak, Allegheny violated this principle. They learned Cloudflare's routes from their transit relationship (downward flow) and then re-advertised them upward to Verizon as if they were Allegheny's own customer routes. Verizon had no way to know this was wrong because BGP carries no metadata about the relationship under which a route was learned.
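The valley-free rule is easy to state in code, but it has to be enforced locally by every AS; the protocol itself won't do it. Here's a minimal Python sketch, assuming each BGP session is tagged with its relationship type (that tagging is local configuration, not something BGP carries):
# Valley-free export rule: routes learned from customers may be re-advertised to anyone;
# routes learned from peers or providers may only be re-advertised to customers.
def should_export(learned_from: str, advertise_to: str) -> bool:
    # learned_from / advertise_to: "customer", "peer", or "provider"
    if learned_from == "customer":
        return True                        # customer routes go everywhere
    return advertise_to == "customer"      # peer/provider routes go only to customers

# The leak: routes learned from a provider, re-advertised to another provider.
assert should_export("customer", "provider") is True
assert should_export("provider", "provider") is False  # what the leaking session should have blocked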
BGP uses a simple algorithm to pick the best path: when multiple routes exist for the same prefix, prefer the route with the shortest AS path length. This is the specific property that made the leak so damaging.
BGP's best-path selection actually considers several attributes in order: local preference, AS path length, origin type, MED (Multi-Exit Discriminator), and several others. But in practice, AS path length is the attribute that matters most in leak scenarios because leaked routes often have artificially short paths.
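To make the selection order concrete, here's a simplified Python sketch of the comparison (real implementations also consider weight, IGP cost to the next hop, and tie-breakers like router ID; the routes below reuse the example ASNs from this section):
# Simplified BGP best-path selection: higher local-pref wins, then shorter AS path,
# then better origin (IGP < EGP < incomplete), then lower MED.
from dataclasses import dataclass

ORIGIN_RANK = {"IGP": 0, "EGP": 1, "incomplete": 2}

@dataclass
class Route:
    as_path: list
    local_pref: int = 100
    origin: str = "IGP"
    med: int = 0

def preference_key(route: Route):
    # Sorted ascending, the best route comes first.
    return (-route.local_pref, len(route.as_path), ORIGIN_RANK[route.origin], route.med)

legit  = Route(as_path=[174, 1299, 3356, 13335])   # legitimate path, 4 hops
leaked = Route(as_path=[701, 396531, 13335])       # leaked path, 3 hops

best = min([legit, leaked], key=preference_key)
print(best.as_path)   # [701, 396531, 13335] -- the leaked route wins on AS path length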
Think of it like airport routing. Normally, to fly from New York to Sydney, you might go through Los Angeles or San Francisco (long but correct). A route leak is like a new airline suddenly advertising a "direct" flight from New York to Sydney that actually stops at a tiny airstrip with one runway. Every booking system prefers the "shorter" route, and the airstrip collapses under the traffic.
BGP predates modern security
BGP was designed in 1989 (RFC 1105) and standardized in 1995 (BGP-4, RFC 1771). The protocol assumes that every AS operator is competent and trustworthy. There are no cryptographic signatures on route announcements. Any AS can advertise any prefix, and neighbors will accept it unless they've explicitly configured filters.
The Failure Cascade
Now let's trace exactly what went wrong, step by step. This is where the incident gets technically interesting.
Step 1: The misconfiguration. Allegheny Technologies ran a BGP route optimizer, a software tool designed to influence how traffic enters and exits their network by selectively advertising routes. The optimizer was supposed to manage Allegheny's own small set of prefixes (probably fewer than 100). Instead, a configuration error caused it to grab all ~20,000 prefixes it had learned from Cloudflare via its legitimate transit relationship.
BGP route optimizers are common in the ISP world. They work by selectively announcing routes to influence inbound traffic patterns, for example, making certain routes look less attractive via AS path prepending to shift traffic to cheaper links. The tool itself isn't the problem. The problem was that it was configured without proper guardrails to prevent it from announcing routes the ISP didn't own.
Step 2: The re-advertisement. The optimizer re-advertised these 20,000 Cloudflare prefixes as customer routes to Verizon. In BGP terms, this means Allegheny was telling Verizon: "I am the authorized path to reach all of these Cloudflare addresses." This was false, but BGP has no mechanism to verify the claim.
Step 3: Verizon propagates. Verizon's routers accepted these routes without question. No prefix limit filters. No RPKI validation. No IRR checks. Because Allegheny was a legitimate customer, Verizon treated these as valid customer routes and propagated them globally to all peers and other customers.
Step 4: Global adoption. As the leaked routes spread across the internet's default-free zone (the set of routers that carry a full routing table), routers everywhere compared the leaked path against the legitimate path and selected the shorter one. Within minutes, a significant portion of internet traffic destined for Cloudflare's prefixes was being routed through Allegheny.
Step 5: Capacity collapse. Allegheny's ~100 Mbps link was instantly saturated. Packets backed up in router buffers and were dropped. From the perspective of end users, Cloudflare's services simply stopped responding. DNS queries timed out. HTTPS connections failed. CDN content was unreachable.
The cascade amplification is worth quantifying. Not all internet traffic to Cloudflare was affected, only traffic that traversed networks which accepted the leaked routes. But Verizon is one of the largest backbone operators in the world, so a substantial portion of global internet traffic passes through their network at some point. The blast radius extended far beyond Verizon's direct customers.
The key amplification factor was the AS path length difference. Normal traffic to Cloudflare traversed 5-7 AS hops. The leaked path was only 3 hops: Verizon, then Allegheny, then Cloudflare's origin AS. BGP's shortest-path preference made the broken route look better than the working one.
Allegheny's network had roughly 100 Mbps of capacity. Cloudflare handles hundreds of gigabits per second of traffic across its global anycast network. The mismatch was catastrophic. Traffic didn't just slow down; it effectively vanished into a black hole.
The blast radius was enormous:
- Cloudflare DNS (1.1.1.1): One of the world's largest public DNS resolvers, serving billions of queries per day
- Amazon and AWS: Multiple Amazon services experienced degraded connectivity
- Google: Portions of Google's infrastructure saw routing anomalies
- Hundreds of smaller services: Any service whose routes passed through Cloudflare's advertised prefixes
The downstream effects were hard to quantify precisely because many services don't publicly disclose BGP-related outages. But monitoring services like Downdetector showed spikes in reported issues for dozens of major websites and services during the incident window.
My mental model for BGP leaks: imagine a city's GPS system suddenly showing that all highways route through a single-lane dirt road. Every car follows the GPS. The dirt road is instantly gridlocked, and nobody can get anywhere.
Let's do the math on the capacity mismatch. Cloudflare's global anycast network handles an estimated 25+ Tbps of traffic across 200+ cities. Even if only a fraction of that traffic was affected by the leak, we're talking about potentially hundreds of gigabits per second being funneled through a ~100 Mbps link. That's a 1,000x to 10,000x oversubscription. Packets didn't queue; they were dropped immediately.
Why It Wasn't Caught
This is the part that should make every engineer uncomfortable. Three layers of defense should have stopped this. All three failed simultaneously.
Layer 1: Allegheny's own configuration. The BGP optimizer software should never have re-advertised learned routes as customer routes. This was the root misconfiguration. BGP software does exactly what you tell it to, and there was no validation layer between the optimizer and the actual route advertisements.
A simple safeguard would have been an export policy that only permits the advertisement of Allegheny's own prefixes to upstream providers. This is standard practice at well-run ISPs: maintain an explicit prefix list of your own address space, and filter everything else on export. Allegheny either didn't have this filter or the optimizer bypassed it.
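Here's a minimal Python sketch of that export check, with a placeholder documentation block standing in for the ISP's own address space:
# Export filter: advertise to upstreams only prefixes inside your own address space.
import ipaddress

OWN_SPACE = [ipaddress.ip_network("198.51.100.0/24")]   # placeholder for the ISP's own blocks

def allowed_to_export(prefix: str) -> bool:
    net = ipaddress.ip_network(prefix)
    return any(net.subnet_of(own) for own in OWN_SPACE)

print(allowed_to_export("198.51.100.0/25"))   # True: our own space
print(allowed_to_export("104.16.0.0/12"))     # False: a learned Cloudflare prefix, never exported upstream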
Layer 2: Verizon's ingress filtering. Verizon should have had prefix limit filters on the peering session with Allegheny. A reasonable limit might be 100-500 prefixes for a small ISP. When 20,000 prefixes suddenly appeared, the session should have been torn down automatically. Verizon had no such filter.
This is the equivalent of an API gateway accepting unbounded input without rate limiting or schema validation. Allegheny's BGP session suddenly went from advertising a handful of routes to advertising 20,000, and Verizon's routers processed every one.
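The missing guard is easy to express. A Python sketch of the prefix-limit logic (warn at 80% of the limit, tear the session down above it):
# Maximum-prefix guard: warn early, tear the session down when the limit is exceeded.
def check_prefix_count(received: int, limit: int = 500, warn_fraction: float = 0.8) -> str:
    if received > limit:
        return "TEARDOWN"   # drop the BGP session and stop accepting routes
    if received >= limit * warn_fraction:
        return "WARN"       # log for operators: the peer is announcing unusually many routes
    return "OK"

print(check_prefix_count(42))       # OK: a small ISP announcing its own prefixes
print(check_prefix_count(420))      # WARN: early signal that something changed
print(check_prefix_count(20_000))   # TEARDOWN: the 2019 leak never propagates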
Layer 3: RPKI validation. If Verizon's routers had validated routes against RPKI (Resource Public Key Infrastructure), they would have seen that Allegheny was not authorized to originate Cloudflare's prefixes. In 2019, RPKI adoption was roughly 20% across the internet. Verizon was not validating.
The trust chain has no enforcement
BGP's relationship rules (customer, provider, peer) are conventions, not protocol-level enforcement. Nothing in the BGP protocol prevents a customer from re-advertising provider routes. The only thing stopping a leak is correct configuration at every hop. One mistake, at one ISP, breaks the chain.
I've seen teams assume that because their own network is correctly configured, they're safe. That assumption ignores the fundamental architecture of BGP: your reachability depends on every AS between you and your users making correct decisions. You control maybe 2% of that path.
There was also no real-time external monitoring that could have triggered an automatic response. Cloudflare detected the anomaly through their own monitoring, but they couldn't fix it. The invalid routes existed in Verizon's network. Only Verizon could remove them.
The gap between "we know exactly what's wrong" (about 30 minutes to detect, diagnose, and notify) and "the problem is fixed" (6 hours) tells you everything about the current state of BGP incident response. Detection is fast. Remediation depends on the cooperation of third parties who may not share your urgency.
There's a useful framework here for thinking about any dependency failure:
| Metric | This incident | Your system's equivalent |
|---|---|---|
| Time to detect | ~10 minutes (Cloudflare's own monitoring) | How fast does your monitoring catch upstream failures? |
| Time to diagnose | ~20 minutes (BGP looking glass analysis) | Can you distinguish "our bug" from "their bug" quickly? |
| Time to notify | ~30 minutes (contacted Verizon directly) | Do you have escalation contacts for critical dependencies? |
| Time to fix | ~5.5 hours (Verizon manual intervention) | How long does your worst-case dependency fix take? |
| Time to recover | ~30 minutes (BGP convergence) | How fast does your system recover once the dependency is fixed? |
The Fix
The immediate fix was entirely manual and entirely dependent on Verizon.
Cloudflare's NOC identified the leak source within approximately 30 minutes of onset. They contacted Verizon directly, along with multiple other network operators who were also being affected. The internet routing community mobilized through mailing lists and direct NOC-to-NOC communication.
The detection itself relied on BGP monitoring tools. When Cloudflare's traffic dropped unexpectedly, their engineers checked BGP looking glasses (public tools that show the BGP routing table from various vantage points on the internet). What they saw: their prefixes appearing with a new AS path that included AS396531, an AS number they had no relationship with. That's the smoking gun for a route leak.
Here's what the diagnostic process looked like:
// BGP looking glass output during the leak
// Normal path to Cloudflare:
BGP.next_hop: 192.0.2.1
BGP.as_path: [174, 1299, 3356, 13335]
BGP.origin: IGP
// Leaked path (selected as best due to shorter AS path):
BGP.next_hop: 10.0.0.1
BGP.as_path: [701, 396531, 13335] // <-- AS396531 should NOT be here
BGP.origin: IGP
BGP.community: 701:100 // Marked as customer route to Verizon
The anomaly is obvious to a trained network engineer: AS396531 appearing in the path to Cloudflare's prefixes is wrong. Allegheny has no business being in that path. But identifying the problem and fixing it are two very different things.
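The check those engineers performed by eye can be automated. A Python sketch, assuming you know your own origin AS and the set of neighbors you actually peer with (the neighbor list here is illustrative):
# Flag observed AS paths where the AS adjacent to your origin is not a known neighbor,
# or where someone else claims to originate your prefix.
MY_ORIGIN = 13335
MY_NEIGHBORS = {174, 701, 1299, 3356}    # illustrative set of transits and peers

def classify_path(as_path: list) -> str:
    origin = as_path[-1]
    if origin != MY_ORIGIN:
        return "POSSIBLE HIJACK"          # someone else originates our prefix
    if len(as_path) >= 2 and as_path[-2] not in MY_NEIGHBORS:
        return "POSSIBLE LEAK"            # an AS we have no session with sits right next to us
    return "OK"

print(classify_path([3356, 13335]))          # OK
print(classify_path([701, 396531, 13335]))   # POSSIBLE LEAK: AS396531 is not a neighbor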
Verizon took approximately 5-6 hours from initial reports to fully remove the invalid routes from their backbone. This delay was partly operational (identifying and verifying the problem across a massive global network) and partly organizational (escalation through Verizon's internal processes).
The communication challenge was significant. Cloudflare is not Verizon's direct customer in this context. They had to escalate through Verizon's peering and NOC teams, explain the technical issue, wait for Verizon to verify it independently, and then wait for the fix to be deployed across Verizon's global network.
Why 6 hours? Verizon operates one of the largest backbone networks in the world. Identifying which routes were invalid across their global routing table, coordinating with Allegheny to confirm the misconfiguration, and then safely withdrawing the routes without causing additional disruption all take time. Every change to a backbone router carries risk of further outages.
Once Verizon withdrew the leaked routes, BGP convergence happened relatively quickly. Most affected services recovered within 30 minutes of the route withdrawal as global routing tables updated to prefer the legitimate paths again. BGP convergence is typically fast for route withdrawals because routers immediately fall back to the next-best path in their routing table.
The fix exposed a brutal truth about BGP incidents: the affected party (Cloudflare) had zero ability to fix the problem. Their network was correctly configured. Their routes were correctly advertised. They had to wait for someone else to clean up a mess they didn't make. In software engineering terms, this is the ultimate "not my bug, but my problem" scenario.
There's a pattern here that shows up in distributed systems beyond networking. When your system's correctness depends on another system's correctness, and you have no enforcement mechanism, your incident response time is bounded by their incident response time. This is why circuit breakers, fallbacks, and redundant paths exist at the application layer too.
For your interview: this is a textbook example of why external health monitoring and multi-homing matter. If your service's reachability depends on a single transit path, a BGP leak on that path takes you completely offline.
The Root Cause
The trigger was a BGP optimizer misconfiguration at Allegheny Technologies. But the root cause runs much deeper than one operator's mistake.
BGP is a trust-based protocol operating in an environment where trust is assumed but never verified. The protocol has no mechanism to answer a basic question: "Is this AS authorized to advertise this prefix?" Every AS simply trusts that its neighbors are telling the truth.
This is the BGP design tradeoff: operational simplicity and flexibility over security. Adding mandatory verification to every route announcement would require coordination across all 70,000+ ASes. The protocol's designers chose to make BGP easy to deploy and flexible to operate, with security as an optional add-on rather than a mandatory feature.
This design made sense in the early internet, when BGP connected a few hundred networks operated by engineers who knew each other personally. It does not make sense in a modern internet with over 70,000 autonomous systems, many operated by organizations with minimal networking expertise.
The fundamental mismatch: BGP was designed for a small, high-trust network of cooperating institutions. It now operates in a massive, low-trust environment where any participant can accidentally (or maliciously) affect every other participant. This is the same tension that appears in any protocol that scales beyond its original design assumptions.
The specific technical failures that enabled this incident:
| Failure | What should have existed | Why it didn't |
|---|---|---|
| No prefix limits on Allegheny's session | maximum-prefix 500 on Verizon's router | Configuration oversight; not enforced by policy |
| No RPKI validation | ROV (Route Origin Validation) checking ROAs | ~20% global adoption in 2019; Verizon hadn't deployed |
| No IRR filtering | AS-path and prefix filters based on IRR databases | Manual process; not consistently maintained |
| No automatic leak detection | Real-time BGP anomaly detection systems | Limited tooling; most ISPs monitor reactively |
The deeper architectural issue is that BGP treats all route advertisements as equally trustworthy. There's no distinction between "I am the origin of this prefix" and "I learned this from someone who claims to be the origin." Every re-advertisement looks the same to downstream routers.
There's an important nuance about malicious vs. accidental leaks. The Cloudflare 2019 incident was accidental: a misconfigured optimizer. But the same vulnerability enables intentional BGP hijacking, where an attacker deliberately announces someone else's prefixes to intercept or blackhole traffic. Nation-state actors have used BGP hijacking to redirect traffic through their infrastructure for surveillance purposes. The defense mechanisms are identical for both cases. This is why the internet security community treats accidental leaks and intentional hijacks with equal urgency.
Compare this to the Facebook BGP outage of October 2021. Same underlying protocol vulnerability, completely different mechanism. Facebook accidentally withdrew all their own BGP advertisements (self-inflicted), while this incident was a third-party leak. Both exploited BGP's fundamental design: no verification, no enforcement, just trust.
The Facebook outage also lasted about 6 hours, for the same fundamental reason: BGP incidents require manual intervention by the network operator who caused the problem. There's no automated "undo" in global routing.
The pattern behind both BGP outages
Cloudflare 2019: a third party announced routes it shouldn't have (route leak). Facebook 2021: Facebook withdrew routes it needed (route withdrawal). Both caused multi-hour outages. Both required manual intervention. The common thread: BGP has no automated recovery mechanism for configuration errors. Every BGP incident is a "call the NOC and wait" situation.
Architectural Changes After
This incident (and similar BGP leaks) accelerated several industry-wide changes. The Cloudflare leak became a rallying point for the MANRS (Mutually Agreed Norms for Routing Security) initiative, which promotes four actions: filtering, anti-spoofing, coordination, and global validation. Before this incident, MANRS had modest adoption. After, major ISPs began signing on in greater numbers.
RPKI Adoption Acceleration
RPKI lets IP address holders create Route Origin Authorizations (ROAs), which are cryptographic certificates binding a prefix to the authorized origin AS. A router performing Route Origin Validation (ROV) can check whether a received route matches a valid ROA.
Before this incident, RPKI adoption sat at roughly 20%. The leak became a catalyst for major networks to prioritize deployment. By 2024, adoption had reached approximately 50%, with Cloudflare, Google, AT&T, and many others now validating.
The RPKI infrastructure works through a hierarchy. Regional Internet Registries (ARIN, RIPE, APNIC, AFRINIC, LACNIC) act as trust anchors. IP address holders create ROAs in their RIR's portal, specifying which AS numbers are authorized to announce their prefixes. Validating routers download the ROA database and check incoming routes against it.
# Simplified RPKI route origin validation (ROV) logic
def validate_route(prefix, origin_as, roa_table):
    roa = roa_table.get(prefix)        # look up a ROA covering this prefix
    if roa is None:
        return "UNKNOWN"               # no ROA exists, accept by default
    if roa["authorized_as"] == origin_as:
        return "VALID"                 # origin matches, accept
    return "INVALID"                   # wrong origin, reject route
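For illustration, here's how that function behaves against a hypothetical ROA table (AS64511 is a documentation ASN, not a real network):
# Hypothetical ROA table: prefix -> authorized origin AS
roa_table = {"104.16.0.0/12": {"authorized_as": 13335}}
print(validate_route("104.16.0.0/12", 13335, roa_table))    # VALID: Cloudflare originating its own prefix
print(validate_route("104.16.0.0/12", 64511, roa_table))    # INVALID: wrong origin, rejected
print(validate_route("198.51.100.0/24", 64511, roa_table))  # UNKNOWN: no ROA, accepted by default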
The catch: RPKI only validates the origin AS (the first hop), not the full path. A route leak where the origin AS is correct but the intermediate path is wrong can still slip through. Full path validation requires BGPsec, which has seen almost zero deployment due to performance overhead.
My recommendation for any team operating internet-facing infrastructure: publish ROAs for all your prefixes today. It's free through your Regional Internet Registry (ARIN, RIPE, APNIC) and takes about 30 minutes. It won't prevent every attack, but it raises the bar significantly.
Prefix Limit Standards
Major transit providers began enforcing maximum-prefix limits on all peering sessions. The configuration is straightforward:
// Cisco-style prefix limit configuration
neighbor 192.0.2.1 maximum-prefix 500 80
// Tear down BGP session if peer sends > 500 prefixes
// Log warning at 80% (400 prefixes)
If Verizon had configured maximum-prefix 500 on Allegheny's session, the session would have been torn down the moment the 501st route appeared. The leak would have stopped at Verizon's edge.
The warning threshold is equally important. A well-configured prefix limit triggers a log message at 80% capacity (400 prefixes in this example), giving operators advance notice that something unusual is happening. This early warning can catch gradual misconfigurations before they escalate into full leaks.
Prefix limits are the simplest defense to deploy and arguably the most effective for this specific type of leak. The configuration is a single line per peering session. The reason it wasn't deployed at Verizon in 2019 was not technical difficulty but operational complacency: large networks had thousands of peering sessions, and auditing each one for appropriate limits was a low-priority task until an incident like this made it urgent.
IRR Hygiene
Internet Routing Registry (IRR) databases (RADB, RIPE DB, ARIN) contain records of which ASes are authorized to announce which prefixes. More operators began generating prefix filters directly from IRR data, automating what had been a manual and inconsistent process.
Tools like bgpq3 and bgpq4 automate the generation of prefix lists from IRR data. An operator can run a command like bgpq4 -Jl cloudflare-in AS13335 to generate a Juniper-format prefix list of all prefixes that AS13335 (Cloudflare) is authorized to announce. Apply this as an inbound filter on every peering session.
The challenge with IRR: the data is only as good as the operators who maintain it. Many IRR entries are stale, incomplete, or outright incorrect. This is why RPKI is preferred over IRR for origin validation, but IRR filtering still catches a significant class of leaks that RPKI misses (particularly path-based leaks).
Multi-Homing and Anycast Resilience
Cloudflare already operated an anycast network, but the incident reinforced the importance of multi-homing for any service that depends on internet reachability. Multi-homing means connecting to multiple independent transit providers. If one path is corrupted by a leak, traffic can still reach you via other providers.
The key word is "independent." Having two transit connections through the same upstream provider (like two circuits through Verizon) doesn't help when Verizon is the one propagating the bad routes. True multi-homing requires transit through networks with different upstream paths, ideally through different tier-1 providers.
Cloudflare's response to this incident included strengthening their already-extensive multi-homing. They also became one of the most vocal advocates for RPKI adoption, publishing their ROAs and encouraging all their peers to validate. Cloudflare's "Is BGP safe yet?" tool (isbgpsafeyet.com) lets anyone check whether their ISP validates RPKI, creating public pressure on ISPs to deploy validation.
For most services, multi-homing at the network layer means:
- At least 2 transit providers from different tier-1 networks
- BGP sessions with each provider advertising your prefixes
- Monitoring to detect when one path becomes unavailable or suspicious
- The ability to deprioritize or withdraw routes through a compromised path
Defense layers for BGP incidents
No single mechanism prevents all BGP leaks. The defense stack is: RPKI for origin validation, prefix limits for volumetric anomalies, IRR filtering for path authorization, multi-homing for path redundancy, and external monitoring for detection. Deploy all five.
External Monitoring Services
After this incident, services like Cloudflare Radar, RIPE RIS, and BGPStream saw increased adoption. These tools monitor global BGP routing tables in real time and can alert operators when their prefixes appear with unexpected origin ASes or unusual AS paths.
The monitoring approach works like this: you register your expected prefixes and origin AS, and the service continuously compares what it sees in BGP feeds against your expected state. If a new origin AS starts advertising your prefix, or if your routes appear with an unexpected AS path, you get an alert within minutes.
| Monitoring service | What it detects | Response time |
|---|---|---|
| RIPE RIS | New origin AS for your prefix, AS path changes | Near real-time (2-5 min) |
| BGPStream | Route leaks, hijacks, outages across 100+ collectors | Near real-time feeds |
| Cloudflare Radar | Routing anomalies, internet outages, traffic shifts | Minutes |
| ThousandEyes | End-to-end path changes, BGP + traceroute correlation | 1-2 minutes |
The bottom line: no single change would have prevented this incident. Each defense layer catches a different class of failure. The industry's response was to stack multiple layers, accepting that any one of them might fail.
Summary of Post-Incident Changes
| Defense layer | Pre-incident state (2019) | Post-incident state (2024) |
|---|---|---|
| RPKI ROV | ~20% adoption | ~50% adoption; Cloudflare, Google, AT&T validating |
| Prefix limits | Inconsistently applied | Standard practice at most tier-1 providers |
| IRR filtering | Manual, stale data | Automated generation tools (bgpq4) widely adopted |
| BGP monitoring | Reactive, manual checking | Real-time alerting via RIPE RIS, Cloudflare Radar |
| Multi-homing | Best practice but not universal | Emphasized in every high-availability design guide |
The internet still isn't "safe" from BGP leaks. But the defense stack is significantly stronger than it was in June 2019. The key gap remaining: RPKI adoption still isn't universal, and there's no mechanism to validate the full AS path (not just the origin). BGPsec exists on paper but has near-zero real-world deployment.
Architecture Decision Guide
So when should you worry about BGP-layer resilience? If your service is accessible from the public internet and your SLA requires better than 99.9% availability, you need to think about this. A 99.9% annual budget allows roughly 8.8 hours of downtime, so a single 6-hour BGP incident consumes most of it, and a monthly 99.9% budget (about 43 minutes) is blown many times over.
For most application engineers, the actionable takeaway is simpler than the protocol complexity suggests: use multiple transit providers, publish RPKI ROAs, monitor your routes externally, and have a runbook for "our prefixes are being announced by someone else." You won't fix BGP, but you can survive its failures.
Transferable Lessons
Five principles emerge from this incident that apply far beyond internet routing.
1. Trust-based protocols require defense-in-depth.
BGP trusts every route announcement by default. The internet works despite this because most operators configure their networks correctly most of the time. But "most of the time" is not "all of the time," and a single misconfiguration at one AS can affect millions of users.
Any system that relies on trust between independent parties needs multiple independent verification layers. This applies far beyond networking: API gateways that trust upstream headers, microservices that trust internal callers, databases that trust application-layer access control. Verify at every layer.
2. Your availability depends on systems you don't control.
Cloudflare did everything right. Their BGP configuration was correct. Their network was healthy. They still went down for 6 hours because a third party made a mistake.
When you design for high availability, you must account for failures in systems you cannot monitor, cannot configure, and cannot fix. Multi-homing, anycast, and external monitoring are not optional for critical services. I've seen teams achieve five nines within their own infrastructure and still suffer outages from upstream dependencies they never considered.
3. The absence of validation is itself a vulnerability.
Verizon's routers accepted 20,000 routes from a small ISP without checking whether those routes were legitimate. The lack of a filter was the vulnerability, not the presence of malicious code.
In any system, ask: "What happens if a component sends data it shouldn't?" If the answer is "we accept it and propagate it," you have a route-leak-shaped hole in your architecture. Input validation isn't just for web forms. It applies to every trust boundary in your system.
4. Shortest-path algorithms amplify misconfigurations.
BGP's preference for shorter AS paths turned a local misconfiguration into a global outage. The same pattern appears in application-level routing: if your load balancer routes to the "closest" backend without verifying health, a misconfigured instance can attract and drop all traffic.
Always verify, not just route. Health checks, circuit breakers, and anomaly detection exist for exactly this reason.
5. Manual remediation doesn't scale.
It took 6 hours to resolve this incident because a human at Verizon had to identify, verify, and manually remove the invalid routes. Automated RPKI validation would have rejected the routes in milliseconds.
Wherever your incident response depends on a human receiving a phone call and taking action, your MTTR is measured in hours, not seconds. Automate validation at ingestion time, not at incident response time.
Every lesson here maps to application-layer architecture too. Replace "BGP" with "API gateway," "AS" with "microservice," and "prefix" with "route" and the principles are identical. Trust without verification, single paths of failure, and manual-only recovery are architectural weaknesses regardless of the layer they appear in.
How This Shows Up in Interviews
When a system design question involves global availability, CDN design, DNS architecture, or multi-region deployment, this case study is directly relevant. Mention it when discussing why services need multi-homing or why you can't assume network paths are correct.
I've found this case study particularly useful when interviewers ask "what can go wrong that's outside your control?" or "how do you design for total failure of a dependency?" The Cloudflare BGP leak is concrete, recent, and demonstrates a failure mode that most candidates don't consider.
The sentence to use: "A single BGP misconfiguration at a small ISP took Cloudflare offline for 6 hours in 2019 because BGP has no built-in route verification, which is why defense-in-depth and multi-homing are essential for any internet-facing service."
Here's how to weave this into different question types:
- "Design a CDN": When discussing availability, mention that CDN reachability depends on BGP. Anycast distributes traffic but doesn't protect against route leaks. Add RPKI and multi-homing to your design.
- "Design a DNS service": DNS is the internet's most critical dependency. A BGP leak can make your DNS unreachable, which cascades to every service that depends on it. This is exactly what happened to 1.1.1.1.
- "How do you achieve 99.99% availability?": Point out that network-layer failures (BGP leaks, cable cuts, peering disputes) are often the hardest to mitigate because they're outside your control. Multi-homing is the primary defense.
- "What's a failure mode most people don't consider?": BGP route leaks. Most engineers think about server failures, database crashes, and deployment bugs. Network-layer routing corruption is invisible until it happens.
| Interviewer asks | Strong answer citing this case study |
|---|---|
| "How do you ensure global availability?" | "Multi-home with at least 2 independent transit providers. The 2019 Cloudflare BGP leak showed that a single corrupted transit path can blackhole all traffic, even when your own network is perfectly configured." |
| "What are the risks of depending on DNS?" | "DNS itself depends on BGP reachability. In 2019, Cloudflare's 1.1.1.1 became unreachable because a BGP leak redirected traffic through a 100 Mbps link. Publish RPKI ROAs, monitor routes externally, and multi-home your DNS infrastructure." |
| "How would you design a CDN?" | "Anycast for global distribution, but anycast alone doesn't protect against BGP leaks. Layer in RPKI ROAs, use multiple transit providers per PoP, and run external route monitoring. The Cloudflare 2019 incident proved that a CDN's reachability depends on correct BGP beyond your own network." |
| "What's the hardest part of operating at internet scale?" | "Dependency on third-party networks you can't control. Cloudflare's 6-hour outage in 2019 was caused by a tiny ISP's misconfiguration. Your MTTR for BGP incidents depends on someone else's NOC answering the phone." |
| "How do you handle a dependency that's outside your control?" | "Redundancy and monitoring. For network-layer dependencies, multi-home across independent transit providers. For application-layer dependencies, circuit breakers and fallback paths. The Cloudflare BGP leak is the canonical example: they needed an alternate path when their primary transit was poisoned." |
The 30-second interview version
If the interviewer hasn't heard of this incident, here's the ultra-short version: "In 2019, a tiny ISP misconfigured BGP and accidentally advertised a shortcut to Cloudflare's network. BGP routers globally preferred the shorter path, which funneled traffic through a 100 Mbps link. Cloudflare was offline for 6 hours because only Verizon could remove the bad routes. This is why we need RPKI validation, prefix limits, and multi-homing."
Quick Recap
- On June 24, 2019, Allegheny Technologies (AS396531) misconfigured a BGP optimizer and leaked ~20,000 Cloudflare prefixes as customer routes through Verizon's backbone.
- BGP's shortest-path preference directed global traffic through Allegheny's ~100 Mbps link instead of Cloudflare's multi-hundred-Gbps anycast network, a 1,000x+ capacity mismatch.
- Cloudflare DNS (1.1.1.1), Amazon, Google, and hundreds of services were unreachable or degraded for approximately 6 hours.
- Three defense layers failed: no export filtering at the leaking ISP, no prefix limits at Verizon, and no RPKI validation.
- Cloudflare detected the leak within minutes and identified the source within ~30 minutes, but had zero ability to fix it; only Verizon could remove the invalid routes.
- The fix was manual: Verizon removed the routes after approximately 6 hours of escalation and verification.
- The incident accelerated RPKI adoption from ~20% to ~50% by 2024 and pushed major ISPs to enforce prefix limits on all peering sessions.
- The transferable principle: systems built on trust-based protocols require defense-in-depth, because you cannot control every participant's configuration.
Related Concepts
- Networking fundamentals: The foundational concepts behind BGP, autonomous systems, IP addressing, and internet routing that explain why this vulnerability exists and how the internet's routing architecture works.
- Facebook BGP outage 2021: A complementary BGP failure where Facebook withdrew its own routes (self-inflicted) rather than a third-party leak. Same protocol vulnerability, different failure mode, equally devastating. Together these two case studies cover the two most common categories of BGP incidents.
This case study was sourced from Cloudflare's public blog post "How Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Today" (June 24, 2019), RIPE NCC routing data, NANOG mailing list discussions, and MANRS (Mutually Agreed Norms for Routing Security) documentation.