Chaos engineering
How chaos engineering builds resilience through controlled failure injection, steady-state hypothesis testing, blast radius control, GameDays, and the chaos maturity model from manual to continuous.
TL;DR
- Chaos engineering intentionally injects controlled failures into production systems to discover weaknesses before real outages do. It applies the scientific method to reliability.
- Every experiment starts with a steady-state hypothesis: measurable conditions that define "the system is healthy." Without this, you're just breaking things.
- Blast radius control (canary chaos, automated halt conditions, incremental scope) makes production experiments safe, not reckless.
- The failure injection taxonomy spans six layers: infrastructure, network, application, dependency, resource, and data. Start small and work up.
- Organizations progress through a chaos maturity model: manual one-off experiments, then automated scheduled runs, then continuous chaos integrated into CI/CD pipelines.
The Problem It Solves
Your microservices platform runs 40 services. Every service passes CI/CD. Staging is green. Code coverage is 85%+. Your SRE team is confident.
Then a Redis cluster node gets terminated during routine maintenance. Failover takes 8 seconds. During those 8 seconds, 12 services that cache session data in Redis start throwing connection errors. Five have circuit breakers, but two have timeouts set to 30 seconds (above the 8-second window), so the breakers never trip. Three services have no circuit breakers at all.
Those three services retry aggressively: 5 retries, 100ms backoff, 200 RPS each. That's 3,000 extra Redis connections per second during recovery. Redis recovers at second 8, but the retry storms keep it saturated until second 22.
Meanwhile, two session-dependent services fall back to the primary PostgreSQL database. The database absorbs an extra 4,000 reads/second for 6 seconds, then starts timing out under connection pool exhaustion. Now all 40 services lose their primary data store.
An 8-second Redis failover becomes a 47-minute cascading outage.
The postmortem identifies the root cause: untested failure modes. Every component was tested individually. Nobody tested what happens when Redis is unavailable for 8 seconds under production load with the actual retry configurations deployed. Unit tests don't cover cross-service failure propagation. Integration tests run against mocked dependencies. Staging has 1% of production traffic.
I've seen this exact pattern at three different companies. The failures that cause real outages are almost never the ones you tested for. They're the interactions between components when one degrades unexpectedly. Traditional testing covers the happy path and the predicted failure path, but not the emergent failure path.
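The retry math in that outage generalizes into a quick back-of-the-envelope check. A minimal sketch, using the hypothetical numbers from the scenario above (the function itself is illustrative, not from any real tool):

```python
def retry_amplification(rps, max_retries, services):
    """Worst-case extra requests/sec when every call fails and every retry fires."""
    return services * rps * max_retries

# Numbers from the scenario: 3 services with no circuit breaker,
# 200 RPS each, 5 retries per failed call.
extra = retry_amplification(rps=200, max_retries=5, services=3)
print(extra)  # 3000 extra connection attempts per second during recovery
```

Exponential backoff with jitter spreads that burst out over time instead of concentrating it in the recovery window, which is exactly what the misconfigured services above lacked.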
What Is It?
Chaos engineering is the discipline of experimenting on a distributed system to build confidence that it can withstand turbulent conditions in production. It applies the scientific method to reliability: form a hypothesis, run an experiment, observe the results, learn from the outcome.
Think of it like a fire drill. A fire drill doesn't prevent fires. It verifies that when a fire happens, people know where the exits are, the alarms actually work, and the sprinkler system activates. Nobody questions the value of fire drills even though they disrupt the building for 20 minutes. Chaos engineering runs fire drills for your infrastructure.
Netflix pioneered the practice in 2011 with Chaos Monkey, which randomly terminated production EC2 instances. The hypothesis was simple: if a service can't survive random instance termination, your architecture has a reliability problem. If it can, you've built real resilience, not just the illusion of it.
The lifecycle is iterative. Each experiment generates findings that inform the next hypothesis. Over time, the system's resilience grows not because of any single experiment, but because the practice continuously surfaces and eliminates hidden failure modes.
For your interview: if someone asks about chaos engineering, say "controlled failure injection with a steady-state hypothesis." That single phrase separates you from candidates who only know "Netflix kills random servers."
How It Works
Running a chaos experiment follows a strict protocol. Here's the walkthrough for a single experiment: testing whether your checkout service survives a payment provider outage.
1. Define the steady-state hypothesis. Before touching anything, define what "healthy" looks like in measurable terms: "When the payment provider is unreachable, the system queues payments for retry, returns a 202 Accepted to the user within 500ms, and processes the queued payment within 5 minutes of provider recovery."
2. Capture the baseline. Record current SLIs: error rate, p99 latency, queue depth, successful payment rate. This is your control group.
3. Configure the fault injection. Specify the fault type (dependency failure), target (payment provider API), scope (5% of traffic), duration (60 seconds), and halt conditions (abort if error rate exceeds 10%).
4. Inject the failure. The chaos tool blocks outbound requests to the payment provider for the targeted traffic cohort. Only 5% of checkout requests are affected.
5. Observe. Compare real-time SLIs against the baseline. Does the error rate stay under 5%? Does the payment queue grow as expected? Is p99 latency within bounds?
6. Halt or conclude. If halt conditions trigger, the experiment stops automatically and the team investigates. If the experiment runs its full duration, compare final metrics against the hypothesis.
7. Learn and share. Document what happened. Did the system meet the hypothesis? If not, what failed? File tickets for discovered weaknesses. Share findings in a team retro.
```yaml
# Chaos experiment specification (Gremlin / Litmus style)
experiment:
  name: "payment-provider-outage"
  hypothesis:
    description: "Checkout survives payment provider unavailability"
    conditions:
      - metric: "checkout.error_rate"
        operator: "<"
        threshold: 5.0
      - metric: "checkout.p99_latency_ms"
        operator: "<"
        threshold: 500
      - metric: "payment.queue_depth"
        operator: ">"
        threshold: 0              # Queue should be growing
  fault:
    type: "dependency-failure"
    target: "payment-provider.external.svc"
    method: "block-outbound"
    scope:
      percentage: 5               # 5% of traffic
    duration: "60s"
  halt_conditions:
    - metric: "checkout.error_rate"
      operator: ">"
      threshold: 10.0             # Abort if error rate exceeds 10%
    - metric: "system.p99_latency_ms"
      operator: ">"
      threshold: 2000             # Abort if p99 exceeds 2 seconds
  schedule:
    window: "Tuesday 14:00-16:00 UTC"   # Off-peak
    notify: ["#sre-chaos", "@oncall"]
```
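A spec like this is declarative; the chaos engine enforces it. As a rough sketch of how halt conditions get evaluated (the `Condition` class and the metric-reading callback are illustrative assumptions, not Gremlin's or Litmus's actual API):

```python
import operator
from dataclasses import dataclass

OPS = {"<": operator.lt, ">": operator.gt}

@dataclass
class Condition:
    metric: str
    op: str            # "<" or ">", matching the spec's `operator` field
    threshold: float

    def fires(self, value):
        return OPS[self.op](value, self.threshold)

def should_halt(halt_conditions, read_metric):
    """Abort the experiment the moment any halt condition fires."""
    return any(c.fires(read_metric(c.metric)) for c in halt_conditions)

# Halt conditions mirroring the spec above.
halts = [
    Condition("checkout.error_rate", ">", 10.0),
    Condition("system.p99_latency_ms", ">", 2000),
]

# Simulated metric snapshot: the error rate has blown past its threshold.
snapshot = {"checkout.error_rate": 12.5, "system.p99_latency_ms": 850}
print(should_halt(halts, snapshot.get))  # True -> stop injection, roll back
```

The important property is that this loop runs inside the tooling, polling continuously: the experiment stops itself without waiting for a human watching a dashboard.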
Without a steady-state hypothesis before the experiment, you're just breaking things. The hypothesis forces you to define "recovery" and "acceptable degradation" in measurable terms before you inject any fault. I often see teams skip this step and run vague experiments like "let's kill a pod and see what happens." The result is an hour of staring at dashboards with no baseline to compare against.
Chaos experiments without observability are just outages
If you can't measure the metrics in your steady-state hypothesis, you can't run the experiment. Before adopting chaos engineering, verify you have: (1) request-level error rates by service, (2) latency percentiles (p50/p99), (3) dependency health checks, and (4) alerting on SLI degradation. Skipping observability is the number one reason chaos programs fail.
Key Components
| Component | Role |
|---|---|
| Steady-state hypothesis | Measurable conditions defining system health before and after the experiment. Without this, there are no success criteria and no learning. |
| Fault injection engine | The tool that introduces failures: killing processes, adding latency, blocking network calls. Chaos Monkey, Gremlin, Litmus, AWS FIS. |
| Blast radius controls | Mechanisms that limit experiment scope: traffic percentage, instance count, region targeting. Prevents a test from becoming a real outage. |
| Halt conditions | Automated stop triggers that terminate the experiment when metrics cross safety thresholds. The experiment must stop itself without human intervention. |
| Experiment scheduler | Coordinates when experiments run: off-peak windows, post-deploy gates, continuous random scheduling. Prevents experiments from overlapping or running during incidents. |
| Observability integration | Connects to metrics, logs, and traces to measure impact in real time. Grafana dashboards, Prometheus alerts, distributed tracing correlation. |
| Findings registry | A searchable database of past experiments: what was tested, what broke, what was fixed. Tracks resilience improvement over time. |
| GameDay framework | Structured team exercises practicing incident response against pre-planned chaos scenarios. Validates runbooks and cross-team coordination. |
Types / Variations
Failure Injection Taxonomy
The six layers of failure injection, ordered from lowest blast radius to highest:
| Layer | Example Experiments | Blast Radius | Prerequisites |
|---|---|---|---|
| Application | Kill a service instance, inject slow responses, return error codes | Single service | Basic health checks |
| Data | Corrupt a queue message, send malformed input, trigger DLQ replay | Single consumer | Dead-letter queue configured |
| Resource | Fill disk to 95%, OOM-kill a process, max out CPU on one host | Single host | Host-level monitoring |
| External dependency | Block third-party API calls, return malformed responses, add 5s latency | Services using that dependency | Dependency timeout configuration |
| Network | Add 200ms latency between services, drop 10% of packets, DNS failure | Service-to-service communication | Network-level observability |
| Infrastructure | Terminate an availability zone, fail a disk array, kill an entire cluster | Multiple services, region-wide | Mature multi-AZ architecture |
Start at the application layer and work your way up. Master instance-level faults before you graduate to AZ-level failures. A team that can't handle a single pod termination gracefully has no business simulating a region outage.
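Application-layer faults are also the easiest to prototype before adopting a dedicated tool. A minimal in-process sketch (this decorator and its parameters are illustrative assumptions, not a real chaos library's API):

```python
import random
import time
from functools import wraps

def inject_fault(rate, latency_s=0.0, error=None):
    """Wrap a function so a fraction of calls degrade.

    rate: probability in [0, 1] that a given call is affected.
    latency_s: artificial delay added to affected calls.
    error: optional exception to raise on affected calls.
    """
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < rate:
                time.sleep(latency_s)   # simulate a slow dependency
                if error is not None:
                    raise error         # simulate a failing dependency
            return fn(*args, **kwargs)
        return wrapper
    return decorator

# Example: 5% of payment calls see a 2-second delay, mimicking a slow provider.
@inject_fault(rate=0.05, latency_s=2.0)
def charge_card(amount_cents):
    return f"charged {amount_cents} cents"
```

In practice you'd gate this behind a configuration flag so it's inert outside the experiment window, and drive `rate` from the same blast radius controls you'd use at any other layer.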
Chaos Maturity Model
Organizations don't adopt chaos engineering overnight. The progression follows a predictable maturity curve:
- Level 1 (manual): one-off experiments, run by hand and scheduled quarterly at most.
- Level 2 (automated): scheduled runs (weekly or monthly) driven by tooling, with halt conditions and a findings registry.
- Level 3 (continuous): chaos integrated into CI/CD pipelines, running always-on in production.
Most organizations are at Level 1. Getting to Level 2 requires tooling investment and management buy-in. Level 3 requires a culture where experiments that break things are celebrated (because they found a weakness) rather than punished.
Blast Radius Control
Testing in production sounds reckless. Chaos engineering with blast radius control is deliberate and safe:
- Canary chaos: Run the experiment against a 1-5% traffic cohort. If metrics degrade beyond acceptable thresholds, automatically stop.
- Halt conditions: Define automated "stop" triggers: if the error rate rises above 2%, terminate the experiment and page on-call. No human intervention required.
- Off-peak timing: Run experiments during low-traffic windows (Tuesday afternoon, not Black Friday morning) when the business impact of a mistake is smallest.
- Incremental scope: Start with a single unhealthy instance. Graduate to a pod, then a service, then an AZ.
The goal is not to cause an outage. The goal is to build confidence that the system handles known failure modes gracefully, or to discover that it doesn't while the blast radius is still small.
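One detail worth making explicit: the canary cohort should be deterministic, so the same users stay inside (or outside) the experiment for its whole duration. A common sketch uses stable hashing (the salt and bucket granularity here are assumptions, not any specific tool's scheme):

```python
import hashlib

def in_chaos_cohort(request_key, percentage, salt="payment-outage-01"):
    """Deterministically place ~percentage% of keys into the experiment cohort.

    Salting per experiment keeps cohorts independent, so the same user
    isn't the unlucky one in every experiment.
    """
    digest = hashlib.sha256(f"{salt}:{request_key}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000   # 10,000 buckets = 0.01% granularity
    return bucket < percentage * 100        # e.g. 5% -> buckets 0..499

# Same key, same answer, every time: the cohort is stable for the experiment.
assert in_chaos_cohort("user-42", 5.0) == in_chaos_cohort("user-42", 5.0)
```

Deterministic assignment also makes the observation step cleaner: you can compare the chaos cohort's SLIs against the untouched 95% as a live control group.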
GameDays
A GameDay is a planned, team-wide chaos exercise. Engineering, SRE, and sometimes product or leadership participate. A scenario is defined in advance (often based on past incidents) and the team practices detection, escalation, and recovery in real time.
GameDay value:
- Validates that runbooks actually work under realistic conditions
- Surfaces gaps in observability (can you even tell when the experiment starts?)
- Trains engineers on recovery procedures before a real incident forces it
- Builds the muscle memory that turns a 90-minute outage into a 15-minute one
My recommendation: run a GameDay once or twice a year even if your system is otherwise well-instrumented. Real chaos is unplanned. GameDays are controlled practice.
Tool Comparison
| Tool | Type | Scope | Key Strength |
|---|---|---|---|
| Chaos Monkey (Netflix) | Open-source | VM/container termination | Simple, battle-tested, Netflix pedigree |
| Gremlin | SaaS | Full taxonomy (network, resource, state) | Enterprise controls, blast radius limits, audit trails |
| Litmus (CNCF) | Open-source | Kubernetes-native | ChaosHub experiment library, GitOps workflows, native CRDs |
| AWS FIS | Managed service | AWS resources (EC2, ECS, RDS, VPC) | Deep AWS integration, IAM-scoped, no agent required |
| Chaos Mesh (CNCF) | Open-source | Kubernetes-native | Time-chaos (clock skew), IO fault injection, rich dashboard |
If you're on Kubernetes, start with Litmus (free, large experiment library). If you need enterprise-grade controls and audit trails, Gremlin. If you're all-in on AWS, FIS gives you fault injection without deploying agents.
Trade-offs
| Pros | Cons |
|---|---|
| Finds failure modes that no other testing catches: the emergent interactions between components under real production load | Risk of customer impact if blast radius controls fail or are configured too loosely |
| Builds genuine team confidence in system resilience, not just checkbox compliance | Requires mature observability before you can even start (metrics, traces, alerts) |
| Validates recovery mechanisms (circuit breakers, retries, failovers) actually work with real configurations | Cultural resistance: engineers and managers uncomfortable with intentional production failures |
| Reduces MTTR by training teams on incident response before real incidents | Time investment: designing experiments, analyzing results, fixing findings, re-testing |
| Catches configuration drift: a circuit breaker timeout changed from 5s to 30s slips through code review but fails chaos testing | Experiment maintenance: as the system evolves, experiments need updating |
The fundamental tension: the confidence you gain from chaos experiments is proportional to how closely they mimic production conditions, which is proportional to the risk of customer impact. Blast radius controls and incremental scope are how you manage that tension.
When to Use It / When to Avoid
Use chaos engineering when:
- You run a distributed system with 5+ services where failure modes are hard to reason about manually
- You have production observability in place (metrics, logs, traces, alerting) and can measure the steady-state hypothesis
- Your team has incident response maturity: runbooks, on-call rotations, postmortem culture
- You're preparing for a major launch, migration, or traffic event and need confidence in your failover mechanisms
- Configuration drift is a concern (timeouts, retry policies, circuit breaker settings drifting from intended values)
- You want to validate that the circuit breakers, retries, and fallbacks you designed actually work under real conditions
Avoid chaos engineering when:
- You don't have observability: if you can't tell when an experiment breaks something, you'll cause an outage, not learn from one
- Leadership hasn't bought in: chaos experiments that trigger customer-visible degradation without organizational alignment will end your chaos program permanently
- You're running a monolith: chaos engineering shines in distributed systems where failure modes are emergent. A monolith has fewer independent failure units
- You lack automated halt conditions: manual observation is not sufficient. If the experiment can't stop itself, don't run it
- The system handles regulated data and you haven't established compliance controls for intentional failure injection
If you're unsure whether you're ready, start with a tabletop GameDay: walk through a failure scenario on a whiteboard without injecting any faults. If the team struggles to describe detection and recovery steps, you need runbooks before you need chaos experiments.
Real-World Examples
Netflix pioneered chaos engineering with Chaos Monkey in 2011, which randomly terminated production EC2 instances. They expanded it into the Simian Army: Latency Monkey (artificial delays), Conformity Monkey (architecture compliance), and Chaos Kong (full-region evacuation). The result: every Netflix service is designed to survive instance termination, and region-level failover completes in under 7 minutes. Their culture treats chaos experiment failures as positive signals (a weakness was found) rather than mistakes.
Gremlin built chaos-as-a-service for enterprises, now used by 100+ companies including JP Morgan Chase, Expedia, and Target. Their platform provides hosted fault injection with built-in authorization and audit trails. Gremlin's data shows that organizations running weekly chaos experiments reduce their MTTR by 40-60% within 6 months, primarily because teams develop muscle memory for incident response through repeated practice.
AWS runs internal Game Days before every major service launch. Their public documentation describes using AWS Fault Injection Simulator (FIS) to test multi-AZ failover, ECS task termination, and RDS failover timing. AWS FIS integrates with IAM for scoped permissions (preventing an experiment from affecting resources outside its designated blast radius) and with CloudWatch for automated halt conditions.
How This Shows Up in Interviews
When to bring it up
Mention chaos engineering when the interviewer asks about reliability validation, testing strategies for distributed systems, or how you'd gain confidence in a failover mechanism. It pairs naturally with circuit breaker discussions ("how do you know the circuit breaker actually works?") and SLO conversations ("how do you validate the system meets its error budget before a real incident tests it?").
Depth expected at senior / staff level
- Explain the steady-state hypothesis and why it turns chaos from "breaking things" into science
- Describe blast radius control: canary chaos, halt conditions, incremental scope
- Know the failure injection taxonomy (six layers) and why you start at the application layer
- Articulate the chaos maturity model and where most organizations sit (Level 1: manual)
- Name specific tools (Gremlin, Litmus, AWS FIS) and their trade-offs
- Explain why observability is a prerequisite, not an optional companion
Interview shortcut: chaos as a validation step
When designing a system with circuit breakers or failover mechanisms, say: "I'd validate this with a chaos experiment: inject a dependency failure, measure whether the circuit breaker trips within the configured timeout, and verify the fallback path returns acceptable latency." This shows you think about verification, not just design.
Common follow-up questions
| Interviewer asks | Strong answer |
|---|---|
| "Isn't testing in production dangerous?" | "With blast radius controls, it's controlled risk. Canary chaos (1-5% traffic), automated halt conditions, and off-peak scheduling make the risk far smaller than the risk of discovering untested failure modes during a real outage." |
| "How do you get organizational buy-in?" | "Start with a GameDay: a tabletop exercise where the team walks through a past incident scenario. No production impact, but it surfaces gaps in runbooks and observability. Once leadership sees the gaps, funding for tooling follows." |
| "What's the difference between chaos engineering and fault injection testing?" | "Fault injection testing targets a specific known failure. Chaos engineering tests the system's general resilience through a scientific hypothesis. The key difference is the steady-state hypothesis: you're measuring whether the system maintains health, not whether a specific code path handles an error." |
| "How often should you run chaos experiments?" | "Start monthly, move to weekly as maturity grows. Mature organizations run continuous experiments in production. The cadence matters less than the feedback loop: experiment, find weakness, fix, re-test." |
| "What's the first experiment you'd run on a new system?" | "Kill a single application instance during off-peak hours with a 1% traffic scope. This is the lowest blast radius experiment possible. If the system can't survive gracefully, there's no point testing AZ failures." |
Test Your Understanding
Quick Recap
- Chaos engineering injects controlled failures to discover weaknesses before real outages do. It's the scientific method applied to reliability.
- Every experiment requires a steady-state hypothesis with measurable thresholds. Without it, you're causing outages, not running experiments.
- Blast radius controls (canary chaos, automated halt conditions, incremental scope) make production experiments safe. The experiment must be able to stop itself.
- The failure injection taxonomy spans six layers from application (lowest risk) to infrastructure (highest risk). Start at the bottom and work up.
- GameDays are planned team exercises that validate runbooks, surface observability gaps, and build incident response muscle memory.
- The chaos maturity model progresses from manual experiments (quarterly) through automated runs (weekly) to continuous chaos in CI/CD (always-on). Most organizations are at Level 1.
- Observability is a prerequisite, not a companion. If you can't measure the steady-state hypothesis, you can't run the experiment.
Related Concepts
- Circuit breaker: Chaos experiments validate that your circuit breakers actually trip at the configured thresholds under real conditions.
- Observability: The foundation that makes chaos experiments possible. You can't measure a steady-state hypothesis without metrics, logs, and traces.
- SLOs, SLIs, and SLAs: Steady-state hypotheses are essentially SLI thresholds. Chaos experiments validate that the system meets its SLOs under failure conditions.
- Bulkhead pattern: Chaos experiments that overwhelm a resource pool reveal whether your bulkheads actually isolate the blast radius.