Bulkhead pattern
Learn how the bulkhead pattern isolates resource pools to contain failures—so one slow dependency can never exhaust your thread pool and take down every unrelated feature.
TL;DR
- The bulkhead pattern partitions your shared resources (thread pools, connection pools, semaphore permits) into isolated compartments — one per dependency or feature — so exhaustion in one compartment cannot spread to another.
- Without it, a single slow downstream service can consume every thread in your process. Your search, checkout, auth, and home feed all return 503 — not because they are broken, but because the broken payment service ate all 200 threads.
- Three implementation flavours: thread pool isolation (separate executor per call type), semaphore isolation (permit cap per feature), and connection pool isolation (separate pools per workload).
- Pair with a circuit breaker: bulkheads contain blast radius inside your process; circuit breakers contain blast radius across the network. You need both.
- The fundamental tension: resource efficiency vs. isolation. Bulkheads pre-allocate resources that sit idle when their workload is light — the price you pay for guaranteed protection.
The Problem
It's 11 p.m. Your on-call phone lights up. Every page on your e-commerce platform is timing out — product listing, checkout, user profile, search. The CEO's evening browse session is dead. You pull up dashboards and see something strange: CPU is fine, memory is fine, network is fine. But the Payment Service is throwing 500s due to a database failover.
Wait. Why is payment taking down search?
Your Order Service calls three downstream services on every request: Inventory, Recommendations, and Payment. They all share one thread pool — 200 threads total. Payment's database is in failover; queries hang for 30 seconds. With 200 concurrent checkout requests, all 200 threads are occupied waiting on Payment. The thread pool is exhausted. A new request for the innocuous homepage arrives — it needs Inventory data, has nothing to do with Payment — but there are no threads to serve it. It times out too.
The fix isn't more threads — it's isolation. The ship didn't sink because of a single hull breach. It sank because the water could flow freely between compartments.
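To see why adding threads only delays the outage, here is a rough back-of-the-envelope sketch. The 200-thread pool and the 30-second hang come from the scenario above; the arrival rate of 50 checkout requests per second is an assumed figure, and `secondsUntilExhausted` is a hypothetical helper, not something from a library.

```typescript
// Rough model: every request that touches the hung dependency pins one
// thread for the full hang duration, so the shared pool fills linearly.
function secondsUntilExhausted(
  poolSize: number,
  stuckRequestsPerSecond: number, // assumed arrival rate of requests hitting the hung dependency
  hangSeconds: number
): number {
  // Time needed for stuck requests to occupy every thread in the pool.
  const fillTime = poolSize / stuckRequestsPerSecond;
  // If the pool fills before the first stuck thread is released, it is exhausted.
  return fillTime < hangSeconds ? fillTime : Infinity;
}

console.log(secondsUntilExhausted(200, 50, 30));  // 4
console.log(secondsUntilExhausted(1000, 50, 30)); // 20
```

Under these assumptions, quintupling the pool buys 16 extra seconds before every unrelated feature stalls. Only a compartment wall changes the outcome.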
One-Line Definition
The bulkhead pattern partitions shared resources into isolated pools so that exhaustion in one pool is physically contained and cannot cascade to other pools.
Analogy
A ship's hull has compartments separated by watertight bulkheads. If one compartment floods — say, from a torpedo hit — the water cannot flow to adjacent compartments. The ship stays afloat with partial functionality — the flooded compartment is lost, but the rest of the vessel continues operating.
Without bulkheads, water entering anywhere flows everywhere. One breach sinks the whole ship.
Your application's thread pool is the hull. Each downstream dependency is a potential flood point. If one of them starts hanging, threads accumulate waiting for its response — and without compartment walls, they drain the entire pool until there's nothing left for any other compartment to float on.
Solution Walkthrough
There are three mechanisms to implement this isolation. Which one you reach for depends on your runtime and what kind of resource you're protecting.
Thread Pool Isolation
Assign a dedicated, fixed-size thread pool (executor) to each downstream service you call. Requests for Payment go to the Payment executor. Requests for Inventory go to the Inventory executor. If the Payment executor's 20 threads are all stuck waiting for a slow DB, that's the payment executor's problem — the Inventory executor's 20 threads are untouched.
// thread-pool-bulkhead.ts: SKETCH using async concurrency control
// Important: Node.js is single-threaded. p-queue limits *concurrent* async operations
// on the event loop, so it behaves like a semaphore, not a true thread pool.
// For genuine thread pool isolation in Node.js, use Piscina (worker_threads pool).
// In JVM, use Resilience4j's ThreadPoolBulkhead or Hystrix command groups.
import PQueue from 'p-queue'; // npm install p-queue

// Shared error class: define once, use across all bulkheads
class BulkheadFullError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'BulkheadFullError';
  }
}

const paymentQueue = new PQueue({ concurrency: 20 }); // max 20 concurrent
const inventoryQueue = new PQueue({ concurrency: 30 });
const notifQueue = new PQueue({ concurrency: 10 });

async function callPayment(orderId: string): Promise<PaymentResult> {
  if (paymentQueue.size >= 50) {
    // Queue depth guard: fail-fast if backlog is already deep
    // p-queue: .size = tasks WAITING in queue; .pending = currently running (bounded by concurrency)
    throw new BulkheadFullError('Payment bulkhead queue full');
  }
  return paymentQueue.add(() => paymentClient.charge(orderId));
}

async function callInventory(productId: string): Promise<InventoryResult> {
  // Inventory pool unaffected even if payment pool is saturated
  return inventoryQueue.add(() => inventoryClient.checkStock(productId));
}
Thread pool isolation has real overhead — don't apply it to everything
Each call is handed off to a worker thread. In JVM runtimes (Hystrix, Resilience4j ThreadPoolBulkhead), that handoff can cost up to roughly 1 ms per call under load. At 50K req/s with 5 downstream calls per request, you're adding 250K thread handoffs per second. Thread pool isolation is for calls to slow, unreliable dependencies, not for fast in-process calls or calls with sub-millisecond round-trip times. Apply it surgically.
Semaphore Isolation
A semaphore is a permit counter. You pre-allocate N permits for a feature. Each incoming request acquires one permit before proceeding. When all N permits are in use, the next request is rejected immediately: no waiting, no thread spin. When a request completes, it releases its permit back to the pool.
// semaphore-bulkhead.ts: lightweight permit-based concurrency limiter
class SemaphoreBulkhead {
  private inFlight = 0;

  constructor(
    private readonly name: string,
    private readonly maxConcurrent: number
  ) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.inFlight >= this.maxConcurrent) {
      // Fail-fast: no blocking, no waiting. < 0.1ms.
      throw new BulkheadFullError(
        `${this.name} semaphore full (${this.inFlight}/${this.maxConcurrent} in-flight)`
      );
    }
    this.inFlight++;
    try {
      return await fn();
    } finally {
      this.inFlight--; // always release, even on error
    }
  }

  get utilisation(): number {
    return this.inFlight / this.maxConcurrent;
  }
}

// Usage
const paymentBulkhead = new SemaphoreBulkhead('payment', 10);
const searchBulkhead = new SemaphoreBulkhead('search', 30);

async function chargeOrder(orderId: string) {
  return paymentBulkhead.execute(() => paymentClient.charge(orderId));
}
The critical difference from thread pool isolation: semaphores don't move work to a different thread. The caller's thread does the work and holds the permit. This means a semaphore cannot enforce an independent timeout on the downstream call: if the downstream hangs, the caller's thread hangs too, and the permit cap only limits how many callers can be stuck at the same time. Use semaphores for fast-fail concurrency capping; use thread pools when you need genuine thread-level timeout enforcement.
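One way to soften this limitation in an async runtime is to race the guarded call against a timer and cancel it through an `AbortSignal`. The sketch below is an illustration under assumptions, not part of the pattern as described above: `withTimeout` and `TimeoutError` are hypothetical names, and the approach only helps if the downstream client actually honours the signal.

```typescript
// Hypothetical error type raised when the deadline passes.
class TimeoutError extends Error {
  constructor(ms: number) {
    super(`timed out after ${ms}ms`);
    this.name = 'TimeoutError';
  }
}

// Hypothetical helper: rejects after `ms` and asks the callee to abort.
async function withTimeout<T>(
  fn: (signal: AbortSignal) => Promise<T>,
  ms: number
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      // Reject first so the caller sees TimeoutError, then cancel the call.
      reject(new TimeoutError(ms));
      controller.abort();
    }, ms);
  });

  try {
    return await Promise.race([fn(controller.signal), timeout]);
  } finally {
    clearTimeout(timer); // avoid a dangling timer on success or failure
  }
}
```

Wrapping the call as `paymentBulkhead.execute(() => withTimeout((signal) => paymentClient.charge(orderId, signal), 3000))` would release the permit when the deadline passes; the `signal` parameter on `charge` is assumed here. If the client ignores the signal, the real downstream work keeps running after the permit is returned, so the number of calls actually in flight can exceed the cap.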
Connection Pool Bulkhead
This is the most frequently overlooked form — and often the one that bites production systems hardest. You almost certainly already have database connection pools. The question is whether they're segmented.
# HikariCP configuration: separate pools per workload type
# application.yml (Spring Boot)
spring:
  datasource:
    # OLTP writes: latency-sensitive, must never be starved
    primary:
      jdbc-url: jdbc:postgresql://primary.db:5432/app
      hikari:
        pool-name: oltp-write-pool
        maximum-pool-size: 50
        minimum-idle: 10
        connection-timeout: 3000   # fail fast: 3s max wait for connection
        idle-timeout: 600000
    # User-facing reads: medium priority
    replica:
      jdbc-url: jdbc:postgresql://replica.db:5432/app
      hikari:
        pool-name: read-replica-pool
        maximum-pool-size: 100
        minimum-idle: 20
        connection-timeout: 5000
    # Analytics / reporting: low priority, can wait
    analytics:
      jdbc-url: jdbc:postgresql://analytics.db:5432/app
      hikari:
        pool-name: analytics-pool
        maximum-pool-size: 5       # hard cap: analytics never gets more than 5
        minimum-idle: 0
        connection-timeout: 30000  # analytics can wait longer
        idle-timeout: 60000
A single analytics query that runs a 30-second GROUP BY across 500M rows holds one connection for 30 seconds. With a 5-connection analytics pool, at most 5 such queries run concurrently; the 6th waits for a free connection and, if none frees up within the connection timeout, gets a pool timeout rather than causing a service outage. Without the partition, those long-running queries draw from the same pool as the OLTP writes, and write transactions start waiting.
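The difference between a shared and a partitioned pool can be sketched with a toy model. This is a simplified illustration, not HikariCP's behaviour: `FixedPool` is a hypothetical stand-in that rejects immediately instead of queueing callers for the connection timeout, and the burst of 60 analytics queries is an assumed figure. The pool sizes of 50 and 5 match the write and analytics pools in the configuration.

```typescript
// Minimal stand-in for a connection pool: a fixed number of slots, no queueing.
class FixedPool {
  private inUse = 0;
  constructor(readonly name: string, readonly size: number) {}

  tryAcquire(): boolean {
    if (this.inUse >= this.size) return false; // pool exhausted
    this.inUse++;
    return true;
  }

  release(): void {
    if (this.inUse > 0) this.inUse--;
  }

  get free(): number {
    return this.size - this.inUse;
  }
}

// Scenario A: one shared pool holding all 55 connections.
const shared = new FixedPool('shared', 55);
// Scenario B: the same 55 connections split into two compartments.
const oltp = new FixedPool('oltp-write-pool', 50);
const analytics = new FixedPool('analytics-pool', 5);

// An assumed burst of 60 long-running analytics queries hits both setups.
let rejectedAnalytics = 0;
for (let i = 0; i < 60; i++) {
  shared.tryAcquire();
  if (!analytics.tryAcquire()) rejectedAnalytics++;
}

console.log(shared.free);       // 0: writes have nothing left in the shared setup
console.log(oltp.free);         // 50: writes are untouched in the partitioned setup
console.log(rejectedAnalytics); // 55: the overflow fails inside its own compartment
```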
For your interview: naming the connection pool as a bulkhead boundary is the move most candidates miss. When you draw a "database pool" in your architecture, always note it's separated: write pool, read pool, analytics pool. That specificity signals you've operated systems at scale.
Container and Kubernetes Bulkheads
At the infrastructure level, bulkheads manifest as resource limits on pods and namespaces. This is how you prevent one team's batch job from starving another team's user-facing API — even when they share the same Kubernetes cluster.
# k8s-bulkhead.yaml: namespace-level ResourceQuota as a bulkhead
apiVersion: v1
kind: ResourceQuota
metadata:
  name: batch-analytics-quota
  namespace: batch-analytics
spec:
  hard:
    requests.cpu: "10"       # namespace can request at most 10 CPUs total
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
    count/pods: "10"         # max 10 pods: prevents runaway horizontal scaling
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-api-priority
value: 1000000               # can preempt lower-priority pods when resources are short
globalDefault: false
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-priority
value: 100                   # batch pods are preempted or evicted first
globalDefault: false
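The PriorityClass combination is the part most Kubernetes articles skip. Pods opt in by setting priorityClassName in their spec. When a node runs short of memory, the kubelet factors pod priority into its eviction order, and when a critical pod cannot be scheduled, the scheduler can preempt lower-priority pods to make room. With this setup, a batch job ballooning at midnight is far more likely to cost you the batch pod than the order-api pod running next to it.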
Multi-Tenant Bulkhead
Multi-tenant SaaS is where bulkheads become a product guarantee, not just an operational nicety. Without tenant-level isolation, one tenant running a script can degrade service for all other tenants.
// tenant-bulkhead-middleware.ts
// Each tenant gets their own semaphore based on their tier.
// SemaphoreBulkhead and BulkheadFullError are the classes from the earlier snippets.
import type { Request, Response, NextFunction } from 'express';

type TenantTier = 'enterprise' | 'professional' | 'starter' | 'free';

const tierLimits: Record<TenantTier, number> = {
  enterprise: 200,
  professional: 50,
  starter: 30,
  free: 10,
};

const tenantBulkheads = new Map<string, SemaphoreBulkhead>();

function getTenantBulkhead(tenantId: string, tier: TenantTier): SemaphoreBulkhead {
  if (!tenantBulkheads.has(tenantId)) {
    tenantBulkheads.set(
      tenantId,
      new SemaphoreBulkhead(`tenant:${tenantId}`, tierLimits[tier])
    );
  }
  return tenantBulkheads.get(tenantId)!;
}

// Express middleware
export function tenantBulkheadMiddleware(req: Request, res: Response, next: NextFunction) {
  const { tenantId, tier } = req.tenant; // set by auth middleware
  const bulkhead = getTenantBulkhead(tenantId, tier);

  bulkhead.execute(async () => {
    await new Promise<void>((resolve) => {
      // Hold the permit until the response finishes or the connection drops.
      // The response's 'close' event is used because, on newer Node.js versions,
      // the request's 'close' event can fire as soon as the request has been read.
      res.on('close', resolve);
      next();
    });
  }).catch((err) => {
    if (err instanceof BulkheadFullError) {
      res.status(429).json({
        error: 'too_many_requests',
        message: 'You have reached your concurrent request limit. Upgrade your plan for higher concurrency.'
      });
    } else {
      next(err);
    }
  });
}

The teardown trick is the elegant part: the semaphore's permit is held for the duration of the entire request (from middleware entry until the response's 'close' event fires), not just the DB query portion. This counts concurrent HTTP requests per tenant, which is exactly the right unit of isolation for a noisy-neighbour problem.
API Criticality Tier Bulkhead
Not all endpoints are equal. A recommendation engine failure and a payment processor failure have very different revenue implications. Tiered bulkheads let you allocate disproportionately more resources to revenue-critical paths.
I'll always sketch this exact tiering at the staff level when designing an e-commerce, streaming, or fintech system. The interviewer hears "I'm protecting revenue paths first and letting non-critical features degrade gracefully" — which is exactly the right prioritisation.
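As a rough sketch of what such tiering could look like in code, the snippet below splits a total concurrency budget across tiers by weight. Everything in it is an assumption for illustration: the tier names, the 6/3/1 weights, the route mapping, and the 200-permit budget are hypothetical and would need tuning against real traffic.

```typescript
type Criticality = 'critical' | 'important' | 'bestEffort';

// Assumed weights: revenue-critical paths get a disproportionate share.
const tierWeights: Record<Criticality, number> = {
  critical: 6,
  important: 3,
  bestEffort: 1,
};

// Split a total concurrency budget across tiers in proportion to their weight.
function allocatePermits(totalPermits: number): Record<Criticality, number> {
  const totalWeight = Object.values(tierWeights).reduce((a, b) => a + b, 0);
  const result = {} as Record<Criticality, number>;
  for (const tier of Object.keys(tierWeights) as Criticality[]) {
    // Floor avoids over-allocating; max(1, ...) keeps a tiny budget from zeroing a tier.
    result[tier] = Math.max(1, Math.floor((totalPermits * tierWeights[tier]) / totalWeight));
  }
  return result;
}

// Hypothetical route-to-tier mapping for an e-commerce API.
const routeTier: Record<string, Criticality> = {
  '/checkout': 'critical',
  '/search': 'important',
  '/recommendations': 'bestEffort',
};

const permits = allocatePermits(200);
console.log(permits);                         // { critical: 120, important: 60, bestEffort: 20 }
console.log(permits[routeTier['/checkout']]); // 120
```

Each tier's permit count could then size one bulkhead per tier, so that a flood of recommendation traffic exhausts only its own 20 permits.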
Implementation with Resilience4j (Production Library)
In practice, you shouldn't hand-roll bulkhead logic. Resilience4j is the standard JVM library; for Node.js, cockatiel is the production-grade choice. Here's both: