Scatter-gather
How scatter-gather fans out a request to multiple services in parallel, collects results, and merges them. Covers partial response handling, timeout strategies, and when to gate on all vs. some responses.
TL;DR -- Scatter-gather fans one request to N workers in parallel, collects their responses, and merges them into a single result. Total latency equals the slowest shard, not the sum of all shards. The hard part is handling stragglers, partial failures, and merging heterogeneous results under a deadline.
The Problem
You have data spread across multiple shards, services, or nodes. A single request needs information from all of them. The naive approach is to query each one sequentially.
Sequential fan-out is a latency disaster. If each of your 10 shards takes 50ms on average, sequential queries cost 500ms total. Worse, variance compounds: the total is the sum of ten variable latencies, so tail latency degrades far faster than any single shard's p99 suggests.
The problem gets worse as you add shards. Every new shard adds its full latency to the total. You need a way to query all shards simultaneously and pay only the cost of the slowest one.
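The latency arithmetic can be checked with a toy model; the per-shard numbers here are hypothetical:

```typescript
// Latency model: sequential fan-out pays the sum of shard latencies,
// parallel fan-out pays only the maximum.
const shardLatenciesMs = [50, 45, 60, 52, 48, 55, 50, 47, 58, 51];

const sequentialMs = shardLatenciesMs.reduce((a, b) => a + b, 0); // sum: 516
const parallelMs = Math.max(...shardLatenciesMs);                 // max: 60
```

Adding an eleventh shard raises the sequential total by that shard's full latency, but changes the parallel total only if the new shard is the slowest.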
One-Line Definition
Scatter-gather fans one request to N downstream services in parallel and merges their responses into a single result, paying the latency cost of only the slowest responder.
Analogy
Think of a teacher collecting homework. The slow approach: walk to each student's desk one by one, wait for them to hand it over, then move to the next desk. The scatter-gather approach: announce "everyone put your homework on my desk." You wait only for the slowest student, not the sum of all students.
Now imagine one student is absent. Do you wait indefinitely, or do you grade what you have and mark one submission as missing? That is the timeout decision every scatter-gather system faces.
Solution Walkthrough
The Two Phases
Scatter-gather has exactly two phases.
Scatter phase: The coordinator receives a request and immediately dispatches sub-requests to all N downstream services. These calls happen concurrently. The coordinator does not wait for any response before sending the next request.
Gather phase: The coordinator collects responses as they arrive. Once it has enough responses (all N, or a quorum, or a timeout fires), it merges the results and returns a single response to the caller.
Timeout Strategies
The most critical design decision is what happens when a shard is slow or unresponsive. You have three options.
Wait-for-all gates on every shard completing. This guarantees complete results but your latency matches the slowest shard. Your p99 latency becomes the p99 of the worst-performing shard, which means one slow shard tanks your entire query.
Timeout with partial results sets a deadline. When the deadline expires, the coordinator returns whatever it has collected so far and marks the response as partial. The caller must handle incomplete data gracefully.
Hedged requests are the most sophisticated approach. After a threshold (for example, the p95 latency of a typical shard), the coordinator sends a duplicate request to an additional replica. Whichever copy responds first wins. This trades a small amount of extra load for significant tail-latency reduction.
Google's "The Tail at Scale" paper showed that hedging after the expected 95th-percentile latency caps the extra load at roughly 5% while substantially shrinking the latency tail. The key insight: most stragglers are caused by transient issues (GC pauses, disk seeks), so a retry to a different replica almost always returns faster.
Straggler Mitigation
Stragglers are the enemy of scatter-gather. With N shards, the probability that at least one is slow grows quickly. If each shard has a 1% chance of being a straggler, then with 100 shards roughly 63% of requests (1 - 0.99^100) hit at least one straggler.
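The growth is easy to compute for independent shards:

```typescript
// P(at least one straggler) = 1 - (1 - p)^N, assuming shards straggle independently.
function stragglerProbability(perShardP: number, shards: number): number {
  return 1 - Math.pow(1 - perShardP, shards);
}

stragglerProbability(0.01, 10);  // ≈ 0.096 — about 1 in 10 requests
stragglerProbability(0.01, 100); // ≈ 0.634 — most requests hit a straggler
```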
Straggler causes include garbage collection pauses, disk I/O contention, network congestion, hot partitions with uneven data distribution, and noisy neighbors in shared infrastructure. Hedged requests handle transient stragglers well but do not fix structural imbalances like hot shards.
Result Merging Strategies
The merge step depends entirely on what you are aggregating.
Top-K merge (search): Each shard returns its local top-K results sorted by score. The coordinator merges the N sorted lists: a min-heap over the N list heads produces the global top K in O(N + K log N), while a size-K heap scanning all N*K candidates costs O(N*K*log K). This is how Elasticsearch and Solr merge distributed search results.
Aggregation merge (analytics): Each shard computes partial aggregates (sum, count, min, max, distinct count). The coordinator combines them. Sums and counts combine trivially. Distinct counts require probabilistic structures like HyperLogLog to merge without overcounting.
Set union (collect all): Gather all items from all shards, deduplicate by ID. Used when you need the complete dataset, not a ranked subset.
Intersection (all must agree): Only return items that appear in every shard's response. Used in multi-index search where results must match all criteria.
Top-K merge has a subtle correctness issue. If you ask each shard for top-3 and merge, you might miss a globally-top-3 document that ranked 4th on its local shard. The safe approach: ask each shard for top-K where K is the final result size, then re-rank globally. For deep pagination (page 100 of results), each shard must return K*page items, which gets expensive.
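The aggregation merge described above can be sketched with a minimal partial-aggregate shape (the `PartialAgg` interface is an assumption for illustration):

```typescript
interface PartialAgg {
  sum: number;
  count: number;
  min: number;
  max: number;
}

// Combine per-shard partial aggregates. The mean is derived only at the end:
// averaging shard-level means would weight small shards the same as large ones.
function combineAggs(parts: PartialAgg[]): PartialAgg & { mean: number } {
  const merged = parts.reduce((a, b) => ({
    sum: a.sum + b.sum,
    count: a.count + b.count,
    min: Math.min(a.min, b.min),
    max: Math.max(a.max, b.max),
  }));
  return { ...merged, mean: merged.sum / merged.count };
}
```

Distinct counts do not fit this shape; as noted, they need mergeable sketches like HyperLogLog.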
Implementation Sketch
A scatter-gather coordinator in TypeScript with timeout and hedged requests:
interface ShardResponse<T> {
shardId: string;
data: T;
latencyMs: number;
}
// Assumes queryShard(shard, query, signal), getReplicaFor(shard), and
// sleep(ms) helpers are defined elsewhere.
async function scatterGather<T>(
  shards: string[],
  query: string,
  opts: { timeoutMs: number; hedgeAfterMs: number }
): Promise<{ results: ShardResponse<T>[]; partial: boolean }> {
  const controller = new AbortController();
  const results: ShardResponse<T>[] = [];
  const pending = new Set(shards);
  let resolveDone!: () => void;
  const allDone = new Promise<void>((r) => (resolveDone = r));

  // First response for a shard wins; later duplicates (hedges) are ignored.
  const record = (shardId: string, data: T, start: number) => {
    if (!pending.has(shardId)) return;
    results.push({ shardId, data, latencyMs: Date.now() - start });
    pending.delete(shardId);
    if (pending.size === 0) resolveDone();
  };

  // Scatter: dispatch all requests in parallel.
  for (const shardId of shards) {
    const start = Date.now();
    queryShard<T>(shardId, query, controller.signal)
      .then((res) => record(shardId, res, start))
      .catch(() => {}); // failed shard stays pending; the deadline handles it
  }

  // Hedge: after a threshold, retry still-pending shards on replicas.
  const hedgeTimer = setTimeout(() => {
    for (const shardId of pending) {
      const start = Date.now();
      queryShard<T>(getReplicaFor(shardId), query, controller.signal)
        .then((res) => record(shardId, res, start))
        .catch(() => {});
    }
  }, opts.hedgeAfterMs);

  // Gather: wait until every shard has answered or the deadline expires.
  await Promise.race([sleep(opts.timeoutMs), allDone]);
  clearTimeout(hedgeTimer);
  controller.abort(); // cancel remaining in-flight requests
  return { results, partial: pending.size > 0 };
}
The critical details: the abort controller cancels in-flight requests after timeout, the hedge timer only fires for shards that have not responded yet, and the return value explicitly signals whether results are partial.
Top-K Merge Implementation
function mergeTopK<T extends { score: number }>(
shardResults: T[][],
k: number
): T[] {
// Flatten and sort -- simple for moderate K
// For large K, use a min-heap for O(N*K*log K)
const all = shardResults.flat();
all.sort((a, b) => b.score - a.score);
return all.slice(0, k);
}
In production, replace the sort with a min-heap when K is large or the number of shards is high. The heap approach avoids sorting the entire combined result set.
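A heap-based variant, as a sketch of that production approach, keeps only the K best items seen so far in a size-K min-heap instead of sorting everything:

```typescript
// Size-K min-heap over all N*K candidates: O(N*K*log K) time, O(K) space.
// The heap root is always the weakest item currently kept.
function mergeTopKHeap<T extends { score: number }>(
  shardResults: T[][],
  k: number
): T[] {
  const heap: T[] = [];
  const siftUp = (i: number) => {
    while (i > 0) {
      const parent = (i - 1) >> 1;
      if (heap[parent].score <= heap[i].score) break;
      [heap[parent], heap[i]] = [heap[i], heap[parent]];
      i = parent;
    }
  };
  const siftDown = (i: number) => {
    for (;;) {
      let min = i;
      const l = 2 * i + 1;
      const r = 2 * i + 2;
      if (l < heap.length && heap[l].score < heap[min].score) min = l;
      if (r < heap.length && heap[r].score < heap[min].score) min = r;
      if (min === i) break;
      [heap[min], heap[i]] = [heap[i], heap[min]];
      i = min;
    }
  };
  for (const shard of shardResults) {
    for (const item of shard) {
      if (heap.length < k) {
        heap.push(item);
        siftUp(heap.length - 1);
      } else if (item.score > heap[0].score) {
        heap[0] = item; // evict the weakest kept item
        siftDown(0);
      }
    }
  }
  return heap.sort((a, b) => b.score - a.score); // final ordering of K items
}
```

The final sort touches only K items, so it does not change the asymptotic cost.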
When It Shines
Scatter-gather is the right pattern when:
- Data is partitioned across shards and a query must touch all partitions. Distributed search engines, sharded databases, and multi-region data stores all fit this model.
- Sub-requests are independent. Each shard can answer its portion of the query without needing data from other shards. If shards need to coordinate, you have a distributed join problem, not a scatter-gather problem.
- Partial results are acceptable or completeness can be recovered asynchronously. Search is the canonical example: returning 9 of 10 shards' results is almost always good enough.
- Latency matters more than throughput. Scatter-gather trades extra parallel work for lower wall-clock time. If your bottleneck is CPU or network bandwidth rather than latency, batching might serve you better.
Failure Modes & Pitfalls
1. Tail Latency Amplification
With N shards, your effective p99 latency approaches the worst single-shard outcome across all N shards. For 20 shards, a 50ms single-shard p99 can become a 200ms+ system p99. Many teams underestimate this effect during capacity planning.
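The effect can be quantified for wait-for-all gathering over independent shards: a request is fast only if every shard is fast, so hitting a system-level p99 demands much better than p99 behavior from each shard.

```typescript
// For a system-level quantile q over N independent shards (wait-for-all),
// each shard must be fast at the q^(1/N) quantile of its own distribution.
function requiredPerShardQuantile(systemQuantile: number, shards: number): number {
  return Math.pow(systemQuantile, 1 / shards);
}

requiredPerShardQuantile(0.99, 20); // ≈ 0.9995 — each shard must hit its p99.95
```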
2. Shard Hotspots
Uneven data distribution causes some shards to take much longer than others. If your search index has one shard with 10x the documents, that shard becomes a permanent straggler. Scatter-gather does not fix data imbalance; you need to rebalance shards separately.
3. Fan-out Amplification
Each scatter-gather request multiplies into N downstream requests. If your coordinator handles 1,000 QPS and scatters to 20 shards, your shard layer sees 20,000 QPS. Teams routinely forget to account for this amplification in capacity planning, load testing, and rate limiting.
4. Timeout Cascades
If the coordinator's timeout is longer than the caller's timeout, the caller gives up while the coordinator is still waiting. The coordinator finishes the gather, merges results, and sends a response that nobody is listening for. Always propagate deadlines from caller to coordinator to shards.
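One way to propagate deadlines is to pass an absolute deadline down the call chain and derive each hop's timeout from the remaining budget; `remainingBudgetMs` here is a hypothetical helper, not a standard API:

```typescript
// Derive a per-hop timeout from the caller's absolute deadline, minus a small
// safety margin for merge work and network overhead, never going negative.
function remainingBudgetMs(deadlineEpochMs: number, safetyMarginMs = 10): number {
  return Math.max(0, deadlineEpochMs - Date.now() - safetyMarginMs);
}

// Usage sketch: clamp the shard timeout to whatever budget the caller left us.
// const shardTimeoutMs = Math.min(defaultShardTimeoutMs, remainingBudgetMs(deadline));
```

A budget of zero means the caller has already given up, so the coordinator should fail fast rather than scatter at all.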
5. Incomplete Results Without Signaling
Returning partial results without telling the caller which shards are missing is dangerous. The caller cannot distinguish "no results from shard 3" from "shard 3 has no matching data." Always include metadata about which shards responded and which timed out.
6. Hedged Request Overload
Hedged requests can amplify load during incidents. If shards are slow due to overload and you hedge all slow requests, you double the load on already-struggling infrastructure. Add circuit breakers to disable hedging when error rates exceed a threshold.
Back-pressure is essential. If your gather phase detects that most shards are slow or failing, it should shed load early rather than sending even more hedged requests. A circuit breaker on the scatter phase prevents hedge storms from making outages worse.
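A minimal sketch of such a gate, assuming outcomes are recorded per shard call, tracks a sliding window of successes and disables hedging when the error rate crosses a threshold:

```typescript
// Circuit breaker for hedging: stop sending hedged requests when the recent
// error rate exceeds a threshold, so hedges cannot amplify an outage.
class HedgeGate {
  private outcomes: boolean[] = []; // true = success, false = error/timeout

  constructor(
    private windowSize = 100,
    private maxErrorRate = 0.1
  ) {}

  record(success: boolean): void {
    this.outcomes.push(success);
    if (this.outcomes.length > this.windowSize) this.outcomes.shift();
  }

  hedgingAllowed(): boolean {
    if (this.outcomes.length === 0) return true; // no data yet: allow
    const errors = this.outcomes.filter((ok) => !ok).length;
    return errors / this.outcomes.length <= this.maxErrorRate;
  }
}
```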
Trade-offs
| Dimension | Upside | Downside |
|---|---|---|
| Latency | max(shards) instead of sum(shards) | Tail latency grows with shard count |
| Throughput | Parallel utilization of all shards | N-fold fan-out amplifies backend load |
| Completeness | Can return partial results fast | Partial results may confuse downstream consumers |
| Complexity | Conceptually simple two-phase pattern | Timeout tuning, merge logic, and hedging add real complexity |
| Resource cost | Better utilization of distributed resources | Hedged requests burn extra CPU and network capacity |
| Failure handling | Graceful degradation with partial results | Must define "good enough" threshold per use case |
The Deep Pagination Problem
Top-K scatter-gather works well for the first few pages of results. But deep pagination (page 1000) requires each shard to return K*1000 results to guarantee correctness. This makes deep pages vastly more expensive than shallow pages.
Production systems handle this with three approaches. They cap pagination depth (Elasticsearch defaults to 10,000 results). They use cursor-based pagination with scroll contexts. Or they switch to a different retrieval strategy for deep result sets entirely.
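The cost growth is simple arithmetic, which makes the cap easy to justify:

```typescript
// Offset-based top-K over N shards: to guarantee correctness for a given page,
// each shard must return k * page items, and the coordinator merges all of them.
function perShardFetch(k: number, page: number): number {
  return k * page;
}

function coordinatorMergeSize(k: number, page: number, shards: number): number {
  return perShardFetch(k, page) * shards;
}

coordinatorMergeSize(10, 1, 20);   // page 1: merge 200 items
coordinatorMergeSize(10, 100, 20); // page 100: merge 20,000 items
```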
Real-World Usage
Elasticsearch Distributed Search
Elasticsearch is the textbook scatter-gather implementation. Every search query goes through two phases: a scatter phase (query phase) where the coordinator sends the query to all relevant shards, and a gather phase (fetch phase) where the coordinator merges shard-local top-K results into a global ranking.
The coordinator keeps only document IDs and scores during the query phase. In the fetch phase, it requests full documents only for the final top-K results. This two-phase approach minimizes network transfer by avoiding full document payload from every shard.
Google Web Search
Google's search infrastructure scatters queries across thousands of index shards simultaneously. Jeff Dean's "Achieving Rapid Response Times in Large Online Services" describes the strategy: set aggressive timeouts, return partial results, and use hedged requests for stragglers. In one BigTable benchmark, hedging after a short delay cut the 99.9th-percentile latency from 1,800ms to 74ms while sending only about 2% additional requests.
Apache Cassandra Reads (Quorum)
When Cassandra performs a read at consistency level QUORUM, it scatters read requests to a majority of replicas for the requested partition. The coordinator gathers responses, compares them for consistency, and returns the most recent value. If replicas disagree, read repair kicks in to reconcile the divergence.
MapReduce
MapReduce is a large-scale, batch-oriented scatter-gather. The map phase scatters work to mappers across the cluster. The shuffle redistributes intermediate results by key. The reduce phase gathers and merges results per key. The difference from online scatter-gather: MapReduce tolerates much higher latency because it processes data in batch.
How This Shows Up in Interviews
Scatter-gather appears in almost every sharded system design question. Designing a search engine, a distributed analytics platform, or a recommendation aggregator all require it. Interviewers expect you to identify when scatter-gather applies and discuss its latency implications.
What strong candidates cover: tail latency math (how p99 degrades with shard count), hedged request strategy with load overhead, the fan-out amplification problem, and how to handle partial results gracefully.
What weak candidates miss: they describe scatter-gather as "just parallel calls" without addressing timeouts, stragglers, merge complexity, or the trade-off between completeness and latency. They also forget fan-out amplification when sizing the shard layer.
Common follow-ups: "What happens when one shard is consistently slow?" (Rebalance data, not just hedge.) "How do you handle deep pagination?" (Cursor-based scrolling or capped offsets.) "How does scatter-gather interact with rate limiting?" (Account for fan-out multiplication at the shard layer.)
Test Your Understanding
Quick Recap
- Scatter-gather issues N parallel sub-requests and merges their responses, paying the latency cost of only the slowest responder.
- Tail latency amplification means your system p99 is much worse than any single shard's p99, especially at high shard counts.
- Hedged requests counter stragglers by sending duplicate requests to replicas after a latency threshold, adding minimal extra load.
- The merge strategy depends on the data: top-K heap merge for search, sum/count for aggregations, deduplication for set union.
- Always propagate deadlines from caller through coordinator to shards to prevent timeout cascades.
- Fan-out amplification means shard-layer QPS equals coordinator QPS multiplied by shard count. Size and rate-limit accordingly.
- Deep pagination is uniquely expensive in scatter-gather because each shard must return proportionally more data per page.
- Circuit breakers should disable hedging during overload to prevent hedge storms from worsening outages.
Related Patterns
- Fan-out patterns: The broader family of patterns for distributing work. Scatter-gather is a specific fan-out-then-aggregate variant.
- Circuit breaker: Protects scatter-gather from cascade failures when shards become unhealthy. Essential for disabling hedged requests during overload.
- Message queues: An alternative to synchronous scatter-gather when latency tolerance is higher and reliability matters more.
- CQRS: Separates read and write models, pairing well with scatter-gather reads against pre-built read-optimized projections.
- Bulkhead: Isolates scatter-gather fan-out from other traffic to prevent a slow gather phase from consuming all connection pool resources.