AWS US-East-1 cascade failure patterns
How Amazon's largest AWS region has experienced multiple large-scale cascading failures, and what the architectural patterns behind each reveal about distributed system failure modes.
TL;DR
- US-East-1 (N. Virginia) is AWS's oldest and most feature-rich region, meaning it ships new services first and encounters novel failure modes before anyone else.
- Three major outages (2011 EBS, 2017 S3, 2020 Kinesis) all followed the same cascade pattern: trigger, retry storm, 10x load amplification, dependent service failures.
- The 2020 Kinesis outage took down Cognito (auth) and CloudWatch (monitoring) because both depend on Kinesis internally, leaving teams blind and locked out during the incident.
- AWS's own status dashboard went dark during the 2017 S3 outage because the dashboard itself was hosted on S3.
- Building for "US-East-1 is down" requires multi-region active-active with pre-provisioned capacity and locally cached control plane data. Cold failover takes 30+ minutes, and that is 30 minutes too many.
What Happened
US-East-1 is not just another AWS region. It was the first region Amazon launched, it hosts more services than any other region, and it receives new feature launches before the rest of the fleet. That "first and biggest" status also means US-East-1 carries unique operational risk.
Between 2011 and 2020, US-East-1 suffered three high-profile cascading failures. Each one followed a recognizable pattern, yet each one exploited a different dependency chain that nobody had fully mapped.
I've worked with teams that picked US-East-1 because "it has everything" without considering that "everything" includes a higher concentration of correlated failure risk. The incidents below explain why that matters.
Each outage was triggered by something different: a network change, a typo, a capacity addition. But the amplification mechanism was always the same: retries without backpressure, recovery mechanisms without resource budgets, and dependencies that nobody had mapped until they all failed at once.
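To make "retries without backpressure" concrete, here is a minimal sketch of the opposite: a client-side retry budget combined with jittered exponential backoff. The names `RetryBudget` and `call_with_budget`, the 10% ratio, and the delay values are illustrative assumptions, not anything AWS has published.

```python
import random
import time


class RetryBudget:
    """Allow retries only while they stay under a fixed fraction of first attempts."""

    def __init__(self, ratio=0.1, min_retries=1):
        self.ratio = ratio              # assumed cap: retries <= 10% of requests
        self.min_retries = min_retries  # small allowance so low traffic can still retry
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_spend(self):
        allowed = self.min_retries + self.ratio * self.requests
        if self.retries + 1 > allowed:
            return False  # budget exhausted: fail fast instead of adding load
        self.retries += 1
        return True


def call_with_budget(op, budget, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call op(); retry with full-jitter backoff only while the budget allows it."""
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not budget.try_spend():
                raise
            # Full jitter: sleep a random amount up to the exponential cap.
            sleep(random.uniform(0, base_delay * (2 ** attempt)))
```

With a shared budget, a dependency that fails every call sees roughly 1.1x its normal request rate rather than 3x, because most callers give up after their first attempt once the budget is spent.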
2011: EBS Storage Network Collapse
| Timestamp (EDT) | Event |
|---|---|
| ~00:47 Apr 21 | Network configuration change triggers connectivity issues in a single Availability Zone |
| ~01:00 | EBS volumes in the affected AZ lose connectivity to their mirrors |
| ~01:15 | EBS re-mirroring kicks in at scale, consuming remaining network and I/O bandwidth |
| ~02:00 | Re-mirroring traffic amplifies congestion; EBS API queue depth grows rapidly |
| ~04:00 | EBS control plane becomes overloaded processing "stuck" volume requests |
| ~06:00 | RDS instances on affected EBS volumes become unreachable |
| ~08:00 | AWS begins manual intervention, throttling re-mirroring |
| ~12:40 | Full service restoration confirmed |
A routine network change in one Availability Zone caused EBS volumes to lose contact with their replica mirrors. EBS's automatic safety mechanism (re-mirroring) tried to rebuild those replicas immediately. The problem: re-mirroring consumed the exact same network and I/O resources that were already degraded.
The more volumes that lost their mirrors, the more re-mirroring traffic flooded the network, which caused more volumes to lose their mirrors. This self-amplifying loop ran for roughly seven hours before AWS began throttling re-mirroring by hand, and full restoration took about 12 hours from the initial trigger.
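One way to give a recovery mechanism a resource budget is a token bucket that caps how much bandwidth rebuild traffic may consume. The sketch below is a generic illustration, not a description of how EBS throttles re-mirroring; `RecoveryThrottle`, the rates, and the explicit `now` parameter are assumptions chosen to keep the example deterministic.

```python
class RecoveryThrottle:
    """Token bucket that caps recovery traffic (for example, replica rebuilds) in MB/s."""

    def __init__(self, rate_mb_per_s, burst_mb):
        self.rate = rate_mb_per_s
        self.capacity = burst_mb
        self.tokens = burst_mb
        self.last = 0.0

    def try_acquire(self, size_mb, now):
        # Refill based on elapsed time, never above the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if size_mb > self.tokens:
            return False  # defer this rebuild so foreground I/O keeps its share
        self.tokens -= size_mb
        return True


throttle = RecoveryThrottle(rate_mb_per_s=100, burst_mb=200)
first = throttle.try_acquire(150, now=0.0)   # fits inside the initial burst
second = throttle.try_acquire(150, now=0.0)  # only 50 MB of budget left: deferred
third = throttle.try_acquire(150, now=1.0)   # one second of refill restores enough
```

The point of the design is that a deferred rebuild stays queued instead of competing with the traffic that is already struggling, which breaks the feedback loop at the cost of slower recovery.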
2017: S3 Increased Error Rates
| Timestamp (PST) | Event |
|---|---|
| ~09:37 Feb 28 | Engineer runs playbook command to remove a small number of S3 billing subsystem servers |
| ~09:37 | Command removes far more servers than intended (no bounds check) |
| ~09:45 | S3 index subsystem and placement subsystem go offline |
| ~10:00 | S3 GET and PUT requests begin failing across US-East-1 |
| ~10:15 | Dependent services (Lambda, ECS, EC2 API, CloudFormation) start failing |
| ~10:30 | AWS status dashboard (hosted on S3) stops updating |
| ~11:00 | AWS switches to Twitter for status updates |
| ~13:30 | S3 index subsystem fully restored; error rates drop |
| ~13:54 | Full service restoration confirmed |
An engineer was debugging S3's billing system and executed a playbook command to remove a small set of servers. The command's input parameter had no upper bounds validation. The engineer accidentally specified a value that removed a critical mass of servers from two S3 subsystems.
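A bounds check of the kind that was missing can be sketched in a few lines. This is a hypothetical guard, not AWS's actual tooling: `plan_removal`, `min_healthy`, and the 5% per-step cap are assumed names and values.

```python
def plan_removal(fleet_size, requested, min_healthy, max_percent=5):
    """Return how many servers may be removed in this step, or raise if unsafe."""
    if requested < 0:
        raise ValueError("requested must be non-negative")
    remaining = fleet_size - requested
    if remaining < min_healthy:
        raise ValueError(
            f"removing {requested} would leave {remaining} servers, "
            f"below the minimum of {min_healthy}"
        )
    # Rate limit: never remove more than max_percent of the fleet in one step.
    step_cap = max(1, fleet_size * max_percent // 100)
    return min(requested, step_cap)
```

For a 1,000-server fleet with a floor of 800, a request for 10 goes through as is, a request for 100 is trimmed to 50 for this step, and a request for 300 is rejected outright.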
S3 is so deeply embedded in the AWS ecosystem that its failure cascaded to Lambda, ECS, EC2 instance launches, and even AWS's own health dashboard. AWS could not tell customers what was happening because the dashboard it uses to communicate was itself down, so it resorted to posting updates on Twitter.
2020: Kinesis Thread Limit Cascade
| Timestamp (PST) | Event |
|---|---|
| ~05:15 Nov 25 | Kinesis capacity addition begins in US-East-1 |
| ~05:30 | New Kinesis front-end servers exceed OS thread count limits during boot |
| ~06:00 | Kinesis front-end fleet partially down; error rates spike |
| ~06:15 | CloudWatch (depends on Kinesis for event streaming) begins failing |
| ~06:30 | Cognito (depends on Kinesis for event logging) begins failing |
| ~07:00 | Applications using Cognito for auth start returning 5xx errors |
| ~07:00 | CloudWatch metrics and alarms stop updating (teams lose observability) |
| ~09:00 | AWS identifies thread count limit as root cause, begins manual remediation |
| ~13:30 | Kinesis fully restored; CloudWatch and Cognito recover |
A routine capacity addition to Kinesis Data Streams in US-East-1 triggered an unexpected failure. The newly added front-end servers exceeded operating system thread count limits during their initialization. This caused a partial failure of the Kinesis front-end fleet.
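A preflight check before a capacity addition could catch this class of limit. The sketch below assumes, purely for illustration, that each front-end server needs one thread per peer in the fleet plus a fixed baseline; the function names, the thread model, and the 80% headroom are all assumptions rather than details taken from the incident.

```python
def projected_threads(fleet_size, base_threads):
    # Assumed model: one thread per peer server plus a fixed baseline.
    return base_threads + (fleet_size - 1)


def max_safe_fleet_size(thread_limit, base_threads, headroom_percent=80):
    """Largest fleet whose per-server thread count stays within the headroom."""
    usable = thread_limit * headroom_percent // 100
    return max(0, usable - base_threads + 1)


def capacity_add_is_safe(current, added, thread_limit, base_threads, headroom_percent=80):
    """True if the fleet after the addition still fits under the thread headroom."""
    limit = max_safe_fleet_size(thread_limit, base_threads, headroom_percent)
    return current + added <= limit
```

Under this model, a 10,000-thread OS limit with a 500-thread baseline supports at most 7,501 servers at 80% headroom, so growing a 7,000-server fleet by 400 passes the check and growing it by 600 does not.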
Here is where the cascade gets interesting. Kinesis is not just a customer-facing service. Internally, AWS uses Kinesis as the event streaming backbone for CloudWatch (metrics, logs, alarms) and Cognito (authentication events, user pool operations). When Kinesis went down, it dragged both of these services with it.
I still remember the chaos reports from that morning. Teams could not authenticate users (Cognito down), could not see their dashboards (CloudWatch down), and could not figure out why because the monitoring tool was the thing that was broken.
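Hidden dependencies like these are easier to reason about once they are written down as a graph. The following sketch computes the transitive blast radius of a failed service with a breadth-first search; the `DEPENDS_ON` map is a hypothetical example (only the CloudWatch and Cognito edges to Kinesis come from the incident above), and `blast_radius` is an illustrative helper.

```python
from collections import deque

# Hypothetical dependency map: service -> services it depends on.
DEPENDS_ON = {
    "cloudwatch": ["kinesis"],
    "cognito": ["kinesis"],
    "checkout-api": ["cognito", "dynamodb"],
    "alerting": ["cloudwatch"],
    "dynamodb": [],
    "kinesis": [],
}


def blast_radius(failed, depends_on):
    """Return every service that transitively depends on `failed`."""
    # Invert the map so we can walk from a dependency to its dependents.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    seen, queue = set(), deque([failed])
    while queue:
        for service in dependents.get(queue.popleft(), ()):
            if service not in seen:
                seen.add(service)
                queue.append(service)
    return seen


impacted = blast_radius("kinesis", DEPENDS_ON)
```

Here `impacted` contains `cloudwatch`, `cognito`, `alerting`, and `checkout-api`, which shows how a failure two hops away can still take out a customer-facing API and the alerting that would have reported it.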
How the System Worked Before
AWS regions are divided into Availability Zones (AZs), each with independent power, cooling, and networking. The design promise: a failure in AZ-a should not take down AZ-b.
AZ isolation works well for compute and storage faults. But the control plane services (S3, Kinesis, CloudWatch, Cognito, IAM) operate at the regional level, not the AZ level. A control plane failure affects every AZ in the region simultaneously.
This is the architectural gap that each US-East-1 outage exploited. The data plane was designed for AZ isolation. The control plane was not.
Why US-East-1 Specifically?
US-East-1 is not inherently less reliable than other regions. It fails more visibly for three reasons:
- Scale. More customers, more traffic, more services. The same bug that goes unnoticed in ap-southeast-2 triggers a cascade in US-East-1 because the fleet is 5-10x larger.
- New service launches. US-East-1 gets features first. New code means new bugs. Services running their first month in production are more likely to hit edge cases.
- Customer concentration. Many companies default to US-East-1 because it was the only option when they started on AWS. They never migrated. This means a US-East-1 outage affects a disproportionate share of the internet.
The lesson is not "avoid US-East-1." The lesson is: build your architecture assuming any single region can have a bad day, and US-East-1 has the most data points proving that it will.
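On the read path, "assume any region can have a bad day" can be sketched as a client that tries regions in order and falls back to a locally cached last-known-good value. `RegionalReader` and its fetcher callables are hypothetical; a real active-active design also has to handle writes, replication lag, and pre-provisioned capacity, none of which this sketch covers.

```python
class RegionalReader:
    """Read from the first healthy region, falling back to a local cache."""

    def __init__(self, fetchers):
        self.fetchers = fetchers  # ordered mapping: region name -> callable(key)
        self.cache = {}           # last known good values, kept in process memory

    def get(self, key):
        for region, fetch in self.fetchers.items():
            try:
                value = fetch(key)
            except Exception:
                continue  # region unhealthy: try the next one
            self.cache[key] = value
            return value, region
        if key in self.cache:
            return self.cache[key], "cache"  # stale but usable
        raise RuntimeError(f"no region could serve {key!r} and nothing is cached")
```

Returning the source alongside the value lets callers decide whether a stale answer is acceptable, for example serving cached feature flags but refusing to act on cached credentials.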
The Failure Cascade
Every major US-East-1 outage followed the same five-stage cascade pattern. The trigger differs, but the amplification mechanism is identical.