AWS US-East-1 cascade failure patterns
How Amazon's largest AWS region has experienced multiple large-scale cascading failures, and what the architectural patterns behind each reveal about distributed system failure modes.
TL;DR
- US-East-1 (N. Virginia) is AWS's oldest and most feature-rich region, meaning it ships new services first and encounters novel failure modes before anyone else.
- Three major outages (2011 EBS, 2017 S3, 2020 Kinesis) all followed the same five-stage cascade pattern: trigger, elevated errors, retry storm, 10x load amplification, dependent service failures.
- The 2020 Kinesis outage took down Cognito (auth) and CloudWatch (monitoring) because both depend on Kinesis internally, leaving teams blind and locked out during the incident.
- AWS's own status dashboard went dark during the 2017 S3 outage because the dashboard itself was hosted on S3.
- Building for "US-East-1 is down" requires multi-region active-active with pre-provisioned capacity and locally cached control plane data. Cold failover takes 30+ minutes, and that is 30 minutes too many.
What Happened
US-East-1 is not just another AWS region. It was the first region Amazon launched, it hosts more services than any other region, and it receives new feature launches before the rest of the fleet. That "first and biggest" status also means US-East-1 carries unique operational risk.
Between 2011 and 2020, US-East-1 suffered three high-profile cascading failures. Each one followed a recognizable pattern, yet each one exploited a different dependency chain that nobody had fully mapped.
I've worked with teams that picked US-East-1 because "it has everything" without considering that "everything" includes a higher concentration of correlated failure risk. The incidents below explain why that matters.
Each outage was triggered by something different: a network change, a typo, a capacity addition. But the amplification mechanism was always the same. Retries without backpressure, recovery mechanisms without resource budgets, and dependencies that nobody had mapped until they all failed at once.
2011: EBS Storage Network Collapse
| Timestamp (EDT) | Event |
|---|---|
| ~00:47 Apr 21 | Network configuration change triggers connectivity issues in a single Availability Zone |
| ~01:00 | EBS volumes in the affected AZ lose connectivity to their mirrors |
| ~01:15 | EBS re-mirroring kicks in at scale, consuming remaining network and I/O bandwidth |
| ~02:00 | Re-mirroring traffic amplifies congestion; EBS API queue depth grows rapidly |
| ~04:00 | EBS control plane becomes overloaded processing "stuck" volume requests |
| ~06:00 | RDS instances on affected EBS volumes become unreachable |
| ~08:00 | AWS begins manual intervention, throttling re-mirroring |
| ~12:40 | Full service restoration confirmed |
A routine network change in one Availability Zone caused EBS volumes to lose contact with their replica mirrors. EBS's automatic safety mechanism (re-mirroring) tried to rebuild those replicas immediately. The problem: re-mirroring consumed the exact same network and I/O resources that were already degraded.
The more volumes that lost their mirrors, the more re-mirroring traffic flooded the network, which caused still more volumes to lose their mirrors. The result was a self-amplifying loop that ran for roughly 12 hours before manual intervention broke the cycle.
2017: S3 Increased Error Rates
| Timestamp (PST) | Event |
|---|---|
| ~09:37 Feb 28 | Engineer runs playbook command to remove a small number of S3 billing subsystem servers |
| ~09:37 | Command removes far more servers than intended (no bounds check) |
| ~09:45 | S3 index subsystem and placement subsystem go offline |
| ~10:00 | S3 GET and PUT requests begin failing across US-East-1 |
| ~10:15 | Dependent services (Lambda, ECS, EC2 API, CloudFormation) start failing |
| ~10:30 | AWS status dashboard (hosted on S3) stops updating |
| ~11:00 | AWS switches to Twitter for status updates |
| ~13:30 | S3 index subsystem fully restored; error rates drop |
| ~13:54 | Full service restoration confirmed |
An engineer was debugging S3's billing system and executed a playbook command to remove a small set of servers. The command's input parameter had no upper bounds validation. The engineer accidentally specified a value that removed a critical mass of servers from two S3 subsystems.
S3 is so deeply embedded in the AWS ecosystem that its failure cascaded to Lambda, ECS, EC2 instance launches, and even AWS's own health dashboard. AWS literally could not tell customers what was happening because the tool they use to communicate was down. They resorted to posting updates on Twitter.
2020: Kinesis Thread Limit Cascade
| Timestamp (PST) | Event |
|---|---|
| ~05:15 Nov 25 | Kinesis capacity addition begins in US-East-1 |
| ~05:30 | New Kinesis front-end servers exceed OS thread count limits during boot |
| ~06:00 | Kinesis front-end fleet partially down; error rates spike |
| ~06:15 | CloudWatch (depends on Kinesis for event streaming) begins failing |
| ~06:30 | Cognito (depends on Kinesis for event logging) begins failing |
| ~07:00 | Applications using Cognito for auth start returning 5xx errors |
| ~07:00 | CloudWatch metrics and alarms stop updating (teams lose observability) |
| ~09:00 | AWS identifies thread count limit as root cause, begins manual remediation |
| ~13:30 | Kinesis fully restored; CloudWatch and Cognito recover |
A routine capacity addition to Kinesis Data Streams in US-East-1 triggered an unexpected failure. The newly added front-end servers exceeded operating system thread count limits during their initialization. This caused a partial failure of the Kinesis front-end fleet.
Here is where the cascade gets interesting. Kinesis is not just a customer-facing service. Internally, AWS uses Kinesis as the event streaming backbone for CloudWatch (metrics, logs, alarms) and Cognito (authentication events, user pool operations). When Kinesis went down, it dragged both of these services with it.
I still remember the chaos reports from that morning. Teams could not authenticate users (Cognito down), could not see their dashboards (CloudWatch down), and could not figure out why because the monitoring tool was the thing that was broken.
How the System Worked Before
AWS regions are divided into Availability Zones (AZs), each with independent power, cooling, and networking. The design promise: a failure in AZ-a should not take down AZ-b.
AZ isolation works well for compute and storage faults. But the control plane services (S3, Kinesis, CloudWatch, Cognito, IAM) operate at the regional level, not the AZ level. A control plane failure affects every AZ in the region simultaneously.
This is the architectural gap that each US-East-1 outage exploited. The data plane was designed for AZ isolation. The control plane was not.
Why US-East-1 Specifically?
US-East-1 is not inherently less reliable than other regions. It fails more visibly for three reasons:
- Scale. More customers, more traffic, more services. The same bug that goes unnoticed in ap-southeast-2 triggers a cascade in US-East-1 because the fleet is 5-10x larger.
- New service launches. US-East-1 gets features first. New code means new bugs. Services running their first month in production are more likely to hit edge cases.
- Customer concentration. Many companies default to US-East-1 because it was the only option when they started on AWS. They never migrated. This means a US-East-1 outage affects a disproportionate share of the internet.
The lesson is not "avoid US-East-1." The lesson is: build your architecture assuming any single region can have a bad day, and US-East-1 has the most data points proving that it will.
The Failure Cascade
Every major US-East-1 outage followed the same five-stage cascade pattern: (1) trigger, (2) elevated error rates, (3) retry storm, (4) load amplification, (5) dependent service failures. The trigger differs, but the amplification mechanism is identical.
The math behind stage 3 is what makes this lethal. Suppose Service A normally handles 10,000 requests per second. When it starts returning errors, each client retries failed requests twice with no backoff, for three attempts total. That is 30,000 requests per second. If clients have a 5-second timeout and retry immediately on timeout, you can hit 50,000+ requests per second within 30 seconds.
Service A was already struggling at 10,000. It has zero chance at 50,000. The retries are not helping recovery. They are preventing it.
For your interview: whenever you mention retries in a system design, immediately follow up with "exponential backoff with jitter and a circuit breaker." That one phrase shows you understand cascade prevention.
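That phrase can be made concrete. Here is a minimal sketch of a retry loop with exponential backoff and full jitter; the `call` parameter and the specific base/cap values are illustrative, not a prescribed API:

```python
import random
import time

def call_with_backoff(call, max_attempts=4, base=0.1, cap=5.0):
    """Retry `call` with exponential backoff and full jitter.

    Full jitter (sleep a random amount up to an exponentially growing
    ceiling) de-synchronizes clients, so a fleet of retriers does not
    hammer a degraded service in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted; surface the error
            # exponential ceiling: base * 2^attempt, capped at `cap`
            ceiling = min(cap, base * (2 ** attempt))
            time.sleep(random.uniform(0, ceiling))
```

The fixed `max_attempts` is itself a form of backpressure: once the budget is spent, the error propagates instead of adding more load. A circuit breaker (sketched later in this article) sits one level above this loop and stops calls entirely after sustained failures.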
The 2011 EBS Self-Amplifying Loop
The 2011 outage had an additional twist beyond the standard cascade. The recovery mechanism itself consumed the resources it needed to succeed.
The re-mirroring process was doing exactly what it was designed to do: detect lost replicas and rebuild them. Under normal conditions (a single volume losing its mirror), this works perfectly. Under correlated failure (thousands of volumes losing mirrors simultaneously), the re-mirroring traffic consumed the very network bandwidth it needed.
I think of this as the "fire truck traffic jam" problem. One fire truck gets through easily. A thousand fire trucks dispatched simultaneously create gridlock on the roads they need to reach the fires.
The fix required manual intervention: AWS engineers had to throttle the re-mirroring rate, allowing the network to recover gradually rather than trying to rebuild all replicas at once.
The 2020 Kinesis Transitive Dependency Chain
The 2020 outage is the most instructive for system designers because it reveals hidden transitive dependencies.
Your application does not call Kinesis directly. You use Cognito for authentication and CloudWatch for monitoring. But Cognito depends on Kinesis, and CloudWatch depends on Kinesis. So your application has a transitive dependency on Kinesis that does not appear anywhere in your code or infrastructure configuration.
When Kinesis failed, your auth broke and your monitoring went dark at the same time. You could not log in users, and you could not see why.
This is the dependency chain most teams never map.
Your dependency graph says you depend on Cognito and CloudWatch. The real dependency graph says you depend on Kinesis too. If Kinesis has a bad day, so do you.
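One way to surface these hidden edges is to compute the transitive closure of your dependency graph. A small sketch; the graph below is illustrative, and a real one would come from your infrastructure inventory plus AWS's documented internal dependencies:

```python
def transitive_deps(graph, service):
    """Return every service reachable from `service` in a dependency graph."""
    seen = set()
    stack = [service]
    while stack:
        node = stack.pop()
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

# Illustrative graph: your app never names Kinesis, but it inherits it.
deps = {
    "your-app": ["cognito", "cloudwatch"],
    "cognito": ["kinesis"],
    "cloudwatch": ["kinesis"],
}
```

Running `transitive_deps(deps, "your-app")` returns `{"cognito", "cloudwatch", "kinesis"}`: Kinesis appears in the closure even though no line of your configuration mentions it.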
Why It Wasn't Caught
Each outage exposed a different gap in AWS's detection and prevention systems.
2011: Re-mirroring had no rate limiter. EBS's re-mirroring was designed to be aggressive because fast replica rebuilding minimizes data risk. Nobody had tested what happens when thousands of volumes need re-mirroring simultaneously in the same network segment. The safety mechanism had no circuit breaker on its own resource consumption.
2017: Playbook commands had no bounds checking. The command to remove S3 billing servers accepted any integer as input. There was no validation like "refuse to remove more than 5% of capacity in a single operation." A simple input validation check would have caught the error before it executed.
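The missing validation is only a few lines of code. A hedged sketch of the kind of guard that would have helped; the 5% threshold and the function name are illustrative, not AWS's actual tooling:

```python
def remove_servers(fleet_size, requested, max_fraction=0.05):
    """Refuse any single removal above a fixed fraction of the fleet.

    A sanity check like this turns a fat-fingered parameter into a
    rejected command instead of a regional outage.
    """
    limit = max(1, int(fleet_size * max_fraction))
    if requested > limit:
        raise ValueError(
            f"refusing to remove {requested} of {fleet_size} servers; "
            f"limit is {limit} ({max_fraction:.0%}) per operation"
        )
    return requested  # within bounds: safe to proceed
```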
2017: The status dashboard depended on the service it monitors. AWS's health dashboard was hosted on S3. When S3 went down, the dashboard that was supposed to tell customers "S3 is down" also went down. This is a circular dependency in the monitoring stack, and it is more common than you would think. I've seen the same anti-pattern in internal monitoring systems at three different companies.
2020: Thread limits were not tested at the new fleet size. The Kinesis front-end servers had been running below OS thread limits for years. The capacity addition pushed the fleet past a threshold that had never been hit in production. Load testing at the new scale would have caught this, but the capacity addition was treated as a routine operation.
2020: Transitive dependencies were not documented. Most AWS customers (and apparently some AWS internal teams) did not have a complete map of which internal services depend on which other internal services. The Kinesis to CloudWatch to Cognito chain was not obvious from any customer-facing documentation.
Your monitoring cannot depend on the thing it monitors
If your alerting pipeline uses CloudWatch, and CloudWatch depends on Kinesis, then a Kinesis failure makes you blind at the exact moment you need visibility most. Always have an independent monitoring channel (a separate provider, a simple health-check endpoint polled from outside AWS) that does not share dependencies with your primary stack.
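That independent channel can be as simple as a cron job running on a VPS outside AWS. A minimal sketch using only the Python standard library; the URL and the alerting hook are placeholders you would swap for your own:

```python
import urllib.request

def check_health(url, timeout=5):
    """Poll a health endpoint from outside the primary stack.

    Because this runs on infrastructure that shares nothing with AWS,
    it keeps reporting even when CloudWatch (and the Kinesis pipeline
    behind it) is down.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False  # unreachable or erroring counts as unhealthy

# From cron, e.g.:
#   if not check_health("https://example.com/healthz"): page_oncall()
```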
The Fix
Each incident required different immediate actions, but the pattern of "manual intervention to break the amplification loop" was consistent.
2011 EBS fix: AWS engineers manually throttled the EBS re-mirroring rate. Instead of allowing all affected volumes to rebuild replicas simultaneously, they queued re-mirroring operations and processed them in controlled batches. This freed up network bandwidth for normal I/O, which allowed healthy volumes to resume serving traffic. Full recovery took approximately 12 hours from the initial trigger.
2017 S3 fix: The removed servers needed to be brought back online, but S3's index and placement subsystems required a full restart sequence. The index subsystem (which tracks metadata for all S3 objects) took the longest to recover because it needed to rebuild its in-memory state. AWS engineers worked through a careful restart sequence over roughly 4 hours. Meanwhile, they switched to Twitter for status communication since their dashboard was down.
2020 Kinesis fix: AWS engineers identified the thread count limit as the root cause and began manually remediating affected front-end servers. They rolled back the capacity addition, restarted servers with corrected thread limits, and gradually restored the Kinesis fleet. CloudWatch and Cognito recovered automatically once Kinesis stabilized. Total time from trigger to full restoration: approximately 8 hours.
The common thread: automated recovery failed in all three cases. Humans had to step in, diagnose the amplification loop, and manually break it. Automation is excellent for known failure modes. Novel cascades require human judgment.
The 30-minute rule for cascading failures
If your automated recovery has not resolved the issue within 30 minutes, the automation is probably part of the problem. Escalate to human operators who can assess whether the recovery mechanism itself is consuming the resources it needs. This pattern (automation amplifying the failure it is trying to fix) appeared in all three US-East-1 outages.
The Root Cause
The trigger for each outage was different (network misconfiguration, human error, thread limit). But the root cause, the reason a small trigger became a multi-hour regional outage, was the same in all three cases: lack of backpressure in recovery and retry paths.
Recovery systems with no resource limits. The 2011 EBS re-mirroring had no cap on how much network bandwidth it could consume. The system was designed to rebuild replicas as fast as possible, which is the right goal in isolation. But "as fast as possible" without a resource ceiling means "consume everything available," which under correlated failure means "make the problem worse."
Retry behavior with no circuit breakers. In all three outages, client retries amplified the load on already-degraded services. AWS's internal service-to-service calls had retry logic but lacked adaptive circuit breakers that would stop retrying after sustained failures. The retry storms turned partial failures into complete outages.
Shared control plane with no blast radius limits. S3, Kinesis, CloudWatch, and Cognito all operate as regional singletons. A failure in one propagates to everything that depends on it within the same region. There is no AZ-level isolation for control plane services. This is a fundamental architectural choice (regional consistency is simpler), but it means a control plane failure has region-wide blast radius.
The deeper lesson: every safety mechanism (re-mirroring, retries, health checks) needs a resource budget. Unbounded safety mechanisms become the amplification vector during cascading failures.
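A resource budget can be as simple as a token bucket in front of the recovery path. The sketch below is a generic rate limiter, not EBS's actual implementation; the rates are illustrative:

```python
import time

class TokenBucket:
    """Caps how fast a recovery process may consume a shared resource."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens replenished per second
        self.capacity = capacity  # burst ceiling
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, amount=1):
        # Refill based on elapsed time, then spend if the budget allows.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= amount:
            self.tokens -= amount
            return True
        return False  # budget exhausted: defer this rebuild, do not force it
```

Gating each re-mirror (or cache rebuild, or failover promotion) on `try_acquire` means that under correlated failure the excess work queues up instead of consuming the bandwidth the service needs to stay alive.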
This is why I always ask teams: "What does your system do when this dependency times out?" If the answer is "retry," the follow-up is "and what happens when 10,000 instances retry simultaneously?" If they have not thought about that scenario, they have not thought about cascade failure.
| Outage | Trigger | Amplification Mechanism | Duration |
|---|---|---|---|
| 2011 EBS | Network misconfiguration | Re-mirroring consumed resources it needed | ~12 hours |
| 2017 S3 | Operator command error | No bounds check on capacity removal | ~4 hours |
| 2020 Kinesis | Capacity addition | Thread limit violation, transitive deps to CloudWatch/Cognito | ~8 hours |
Architectural Changes After
AWS made specific architectural changes after each outage. These changes are public and instructive.
Post-2011: EBS Re-mirroring Rate Limits
AWS added rate limiting to the EBS re-mirroring process. Instead of allowing unbounded re-mirroring, the system now caps the number of simultaneous re-mirror operations per network segment. This prevents the self-amplifying loop where re-mirroring traffic causes more volumes to need re-mirroring.
They also improved AZ-level isolation for EBS storage networks, reducing the blast radius of network configuration changes.
Post-2017: S3 Operational Safeguards
AWS added bounds checking to all capacity-removal commands. No single command can remove more than a defined percentage of a subsystem's capacity. The change also introduced a staged removal process: remove a small batch, verify health, then proceed.
They moved the AWS health dashboard off S3 to eliminate the circular dependency. The status page now runs on independent infrastructure that does not share dependencies with the services it monitors.
S3 also added staged restarts
After the 2017 outage, AWS re-engineered S3's subsystems to support faster cold restarts. The index subsystem, which took the longest to recover, was redesigned to checkpoint its state more frequently, reducing restart time from hours to minutes.
Post-2020: Kinesis Fleet Management
AWS increased operating system thread limits for Kinesis front-end servers and added monitoring for thread count relative to limits. Capacity additions now go through a staged rollout process rather than fleet-wide simultaneous deployment.
AWS also improved internal documentation of transitive dependencies between services. The goal: any team operating a service should know the full dependency chain, not just direct dependencies.
What You Should Change in Your Own Architecture
These AWS-side fixes are great, but you do not control AWS's internals. Here is what teams should do on their side after studying these incidents.
Cache control plane data locally. AWS credentials from STS, feature flags from AppConfig, DNS resolution results, and configuration from Parameter Store should all be cached locally with a TTL. If the control plane goes down, your application should keep running with stale but functional data for at least 30 minutes. The alternative is hard-failing the moment a control plane API returns a 5xx.
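A cache that serves stale data when the control plane errors might look like this sketch. `fetch` stands in for your STS/AppConfig/Parameter Store call, and the TTL and staleness limits are illustrative:

```python
import time

class StaleOkCache:
    """Serve fresh data when possible, stale data when the source is down."""

    def __init__(self, fetch, ttl=300, max_stale=1800):
        self.fetch = fetch          # callable that hits the control plane
        self.ttl = ttl              # seconds before we try to refresh
        self.max_stale = max_stale  # hard limit on serving stale data
        self.value = None
        self.fetched_at = 0.0

    def get(self):
        age = time.monotonic() - self.fetched_at
        if self.value is not None and age < self.ttl:
            return self.value  # fresh enough
        try:
            self.value = self.fetch()
            self.fetched_at = time.monotonic()
        except Exception:
            # Control plane is down: keep running on stale data for up
            # to max_stale seconds instead of hard-failing on a 5xx.
            if self.value is None or age > self.max_stale:
                raise
        return self.value
```

The `max_stale` ceiling is the judgment call: credentials and feature flags can usually tolerate 30 minutes of staleness, which is exactly the window you need to ride out a control plane incident.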
Consider Instance Store for boot-critical data. EBS volumes depend on the EBS control plane for attachment and I/O. Instance Store (local NVMe drives on the physical host) has no such dependency. For boot-critical data, AMI artifacts, and local caches, Instance Store gives you independence from the EBS control plane. The tradeoff: Instance Store data does not survive instance termination, so it is only suitable for ephemeral or reproducible data.
Pre-provision capacity in secondary regions. Cold failover to a region where you have no running instances takes 30+ minutes at minimum: spinning up instances, warming caches, propagating DNS changes. That is 30 minutes of downtime on top of however long it takes to detect the primary region failure. Pre-provisioned capacity (even at reduced scale) in us-west-2 or eu-west-1 cuts failover to under 5 minutes.
My recommendation: if you run anything customer-facing with an SLA above 99.9%, treat multi-region active-active as a requirement, not an optimization.
Cross-Cutting Changes
After all three outages, AWS invested in:
- Cell-based architecture for control plane services, breaking regional singletons into smaller isolated cells with independent failure domains.
- Shuffle sharding to reduce correlation between customers sharing infrastructure components.
- Static stability principles: services should continue operating with cached/stale data when a dependency is unavailable, rather than failing immediately.
Static stability is the key principle
A statically stable system continues operating correctly even when a dependency is unavailable. It does not need to call a control plane to serve traffic. Cache credentials, pre-resolve DNS, and keep local copies of configuration. Design for the dependency being gone, not just slow.
Architecture Decision Guide
Use this decision guide when choosing how to protect your system against AWS cascade failures.
The rule of thumb: if your SLA requires 99.99% availability, you need multi-region active-active, cached control plane data, independent monitoring, and circuit breakers on every retry path. There is no shortcut.
Transferable Lessons
1. Every retry loop needs a circuit breaker. In all three US-East-1 outages, retry storms turned partial failures into complete outages. Retries without circuit breakers and exponential backoff are an amplification vector, not a reliability mechanism. If a service is returning errors, hammering it harder does not help it recover. Apply this to every service-to-service call in your architecture.
2. Safety mechanisms need resource budgets. EBS re-mirroring was a well-designed safety feature that became the primary failure vector under correlated load. Any automated recovery process (replica rebuilding, cache warming, failover promotion) must have a cap on the resources it can consume. Without a ceiling, the recovery process competes with the service it is trying to save. I've seen this same pattern in cache stampede scenarios: every server trying to rebuild the cache simultaneously overwhelms the database.
3. Map your transitive dependencies before the outage. Your application does not call Kinesis. But if you use Cognito and CloudWatch, you depend on Kinesis. This hidden dependency chain is invisible until the moment it fails, and by then you have no auth and no monitoring simultaneously. Sit down with your team and draw the full dependency graph, including AWS internal dependencies. Do this exercise before an incident forces you to.
4. Your monitoring must not share dependencies with your application. If your alerting runs on CloudWatch, and CloudWatch depends on the same infrastructure as your app, you lose visibility at the exact moment you need it most. Run a secondary monitoring channel (external health checks, a different cloud provider's monitoring, even a simple cron job hitting your health endpoint from a VPS) that has zero shared dependencies with your primary stack.
5. Inputs to operational commands need bounds checking. The 2017 S3 outage happened because a command that removes servers had no upper limit on its input parameter. Every operational command that modifies infrastructure should validate: "Is this request reasonable given the current state of the system?" A simple check like "refuse to remove more than 10% of capacity in one operation" would have prevented a 4-hour outage for millions of customers.
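Lesson 1 above can be sketched as a minimal circuit breaker. This is one common shape (closed, open, half-open), not a prescribed library API; thresholds are illustrative:

```python
import time

class CircuitBreaker:
    """Stop calling a dependency after sustained failures.

    Closed: calls flow normally. Open: calls fail fast for `cooldown`
    seconds, shedding load off the degraded dependency. Half-open:
    after the cooldown, one trial call decides whether to close again.
    """

    def __init__(self, failure_threshold=5, cooldown=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: half-open, let one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0      # success resets the failure count
        self.opened_at = None  # and closes the circuit
        return result
```

Wrap this around the backoff loop from earlier in the article and the retry storm disappears: once the breaker opens, the fleet stops hammering the degraded service and gives it room to recover.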
How This Shows Up in Interviews
Cascade failures in AWS are relevant whenever an interviewer asks about high availability, multi-region architecture, or failure handling. Citing specific US-East-1 incidents shows you understand real-world distributed systems, not just textbook theory.
The sentence to drop: "I'd add circuit breakers on every retry path because the 2020 Kinesis outage showed that retry storms turn partial failures into regional cascades."
When to bring it up proactively:
- Any system design that uses AWS services (which is most of them)
- When the interviewer asks about failure modes or availability targets
- When you are explaining why multi-region matters
- When justifying circuit breakers or backoff strategies
Depth expected at senior/staff level:
- Name at least one specific AWS outage and explain the cascade mechanism
- Explain why multi-AZ is insufficient for control plane failures
- Articulate the transitive dependency problem with a concrete example
- Propose specific mitigations: circuit breakers, cached control plane data, independent monitoring
| Interviewer Asks | Strong Answer Citing This Case Study |
|---|---|
| "How do you handle a dependency being down?" | "Circuit breaker with exponential backoff and jitter. Without it, retries become a DDoS on the degraded service, exactly what happened during the 2020 Kinesis cascade in US-East-1." |
| "Why multi-region?" | "US-East-1 has had three 4-12 hour outages where regional control plane services (S3, Kinesis, CloudWatch) failed simultaneously. Multi-AZ does not protect against regional control plane failures." |
| "What's a hidden risk in cloud architectures?" | "Transitive dependencies. In 2020, Kinesis failed and took down Cognito and CloudWatch because they depend on Kinesis internally. Your app never calls Kinesis, but it fails when Kinesis fails." |
| "How do you monitor during an outage?" | "Independent monitoring outside the blast radius. AWS's own status dashboard was hosted on S3 and went dark during the 2017 S3 outage. Your monitoring cannot share dependencies with the thing it monitors." |
| "How do you design operational tooling safely?" | "Bounds checking on every input. The 2017 S3 outage started because a capacity-removal command had no upper limit. A simple validation, 'refuse to remove more than 5% at once,' would have prevented it." |
Quick Recap
- US-East-1 is AWS's oldest and largest region, and its regional control plane services (S3, Kinesis, CloudWatch, Cognito) have been the source of three major cascading failures.
- The universal cascade pattern is: trigger, elevated errors, retry storm, 10x load amplification, dependent service failures.
- The 2011 EBS outage demonstrated that safety mechanisms (re-mirroring) without resource limits become the amplification vector under correlated failures.
- The 2017 S3 outage demonstrated that operational commands without bounds checking can take down global infrastructure, and that monitoring should never depend on the service it monitors.
- The 2020 Kinesis outage demonstrated that transitive dependencies (Kinesis powering CloudWatch and Cognito internally) create invisible blast radius that no customer-side architecture diagram reveals.
- Multi-region active-active with pre-provisioned capacity, locally cached control plane data, and independent monitoring is the only reliable defense against regional cascade failures.
- Every retry loop in your system needs a circuit breaker with exponential backoff and jitter, because retries without backpressure are the primary amplification mechanism in every cascade failure.
Related Concepts
- Circuit breaker pattern: The primary defense against retry storms that amplify cascading failures. Every service-to-service call in a cascade-prone architecture needs one. The 2020 Kinesis outage is a textbook case where circuit breakers would have limited the blast radius.
- Bulkhead pattern: Isolates failure domains so that a degraded dependency does not consume all resources. Relevant to preventing a single AWS service failure from taking down your entire application. Think of it as AZ isolation applied to your own service's thread pools and connection pools.
- Missing backpressure anti-pattern: The 2011 EBS re-mirroring and 2020 Kinesis retry storms are textbook examples of what happens when systems lack backpressure mechanisms under load. Every recovery path and retry loop needs a resource budget.