MapReduce internals: shuffle sort, combiners, and speculative execution
How MapReduce splits data, moves it between map and reduce phases via the shuffle and sort stage, how combiners reduce network I/O, and how speculative execution handles stragglers in large clusters.
The problem
You have 10 TB of web server logs. You need to count how many times each URL was accessed. On a single machine, this takes 28 hours. Your deadline is 30 minutes.
You could split the data across 100 machines, have each one count its portion, and merge the results. But now you have a distributed coordination problem: which machine gets which data? How do you handle a machine that crashes mid-count? How do you merge partial results when the same URL appears on multiple machines? Who detects which machine is slow and reassigns its work?
Writing that coordination code by hand takes months and is different for every job. MapReduce solves this by providing a fixed execution framework: you write just the counting logic (a map function and a reduce function), and the framework handles splitting, distributing, shuffling, sorting, fault tolerance, and result merging automatically. This is the problem MapReduce solves.
What it is
MapReduce is a programming model and execution framework for processing large datasets in parallel across a cluster. You define two functions: Map (which transforms each input record into intermediate key-value pairs) and Reduce (which aggregates all values for a given key into a final result). The framework handles everything else.
Think of it as a census. The government does not send one person to count every citizen. It divides the country into districts (splits), assigns a census worker to each district (map), each worker tallies their district's population by category (intermediate output), then a central office merges the tallies from all districts (reduce). If a worker gets sick, the office reassigns that district to a replacement.
How it works
A MapReduce job flows through five stages: split, map, combine (optional), shuffle and sort, and reduce.
Here is the word count example that illustrates the model:
function map(docId, content):
for word in content.split():
emit(word, 1)
function reduce(word, counts):
emit(word, sum(counts))
Step-by-step execution:
- Split: The framework divides the input dataset (stored in HDFS) into splits of ~128 MB each. One map task is assigned per split.
- Map: Each mapper reads its split record by record, applies the user's map function, and writes intermediate key-value pairs to local disk (not HDFS). Output is partitioned into R buckets (one per reducer) using
hash(key) % R. - Combine (optional): If a combiner is configured, it pre-aggregates the mapper's output locally before the shuffle. This reduces network traffic.
- Shuffle and sort: The framework transfers each mapper's partitioned output to the appropriate reducer node. On the reducer side, all incoming data for that partition is merge-sorted by key.
- Reduce: The reducer iterates through sorted key-value groups, calling the user's reduce function on each group. Output is written to HDFS.
The key insight is the separation of user logic from execution machinery. The user writes only the map and reduce functions. The framework manages all the hard distributed systems problems: task scheduling, data transfer, failure detection, retry logic, and resource allocation. This is why MapReduce was transformative: it let engineers without distributed systems expertise process petabytes of data across thousands of machines.
Interview tip: why intermediate data goes to local disk
Map output is written to local disk, not HDFS. This is intentional: intermediate data is temporary and does not need replication. If a mapper fails, the framework re-runs the map task (the input split is still in HDFS). Writing intermediate data to HDFS would triple the I/O cost for data that is thrown away after the job completes.
The shuffle and sort phase
The shuffle is the most expensive stage in a MapReduce job. I think of it as the "hidden cost" that most people underestimate when they first learn MapReduce.
Inside the mapper: Before the shuffle starts, each mapper sorts its output by key within each partition. This sort happens in memory (with spills to disk if the buffer fills). The mapper runs a multi-pass external merge sort: it fills an in-memory buffer (io.sort.mb, default 100 MB), sorts it, spills to disk, then merge-sorts all spill files into a single sorted partition file.
function mapperSpillAndSort(buffer, partitionCount):
// buffer is full of (key, value) pairs
sortedBuffer = sort(buffer, by: (partitionId, key))
spillFile = writeToDisk(sortedBuffer)
spillFiles.append(spillFile)
function mapperFinalize(spillFiles):
// merge all spill files into one sorted output per partition
for partition in 0..R-1:
mergedOutput[partition] = mergeSortStreams(
spillFiles.map(f -> f.partitionSlice(partition))
)
writePartitionedOutput(mergedOutput)
Inside the reducer: Each reducer fetches its partition from every mapper via HTTP. It then merge-sorts all fetched chunks into a single sorted stream. Because each chunk is already sorted (by the mapper), the merge is efficient: it uses a priority queue over the chunk heads.
The sorted stream is critical: it lets the reducer process one key group at a time with O(1) memory per key. Without sorting, the reducer would need to buffer all values for all keys in memory, which is impossible for large datasets.
| Shuffle parameter | Default | Effect |
|---|---|---|
io.sort.mb | 100 MB | In-memory sort buffer per mapper |
io.sort.factor | 10 | Max merge streams during external sort |
mapreduce.reduce.shuffle.parallelcopies | 5 | Concurrent fetch threads per reducer |
mapreduce.reduce.shuffle.input.buffer.percent | 0.70 | Fraction of reducer heap for shuffle data |
Combiner optimization
The combiner runs on the mapper node after sorting but before network transfer. It applies the reduce function locally, collapsing duplicate keys into fewer records.
Mapper emits for document "War and Peace":
(the, 1), (the, 1), (the, 1), ... (the, 1) β 34,000 occurrences of "the"
(war, 1), (war, 1), ... (war, 1) β 300 occurrences of "war"
Without combiner: 34,300 pairs cross the network for just these two words.
With combiner:
(the, 34000), (war, 300) β 2 pairs cross the network
The combiner is not always applicable. It works only when the reduce function is commutative and associative: the order and grouping of inputs do not affect the result.
| Operation | Combiner safe? | Why |
|---|---|---|
| Sum / Count | Yes | sum(a, b, c) == sum(sum(a, b), c) |
| Max / Min | Yes | max(a, b, c) == max(max(a, b), c) |
| Average | No | avg(avg(1,2), avg(3,4)) != avg(1,2,3,4). Use sum + count, then divide in reduce. |
| Distinct count | No | Merging partial distinct sets requires the full set, not a count |
| Median | No | Cannot compute global median from partial medians |
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.