How MongoDB works internally
How MongoDB stores documents as BSON, uses WiredTiger's B-tree storage engine, replicates via Raft-based elections, and distributes data through chunk-based sharding.
The Interview Question
Interviewer: "Your system uses MongoDB to store user profiles and activity feeds. A user writes a profile update, and another service reads it 50ms later but gets stale data. Walk me through what actually happens inside MongoDB from the moment a write hits the server to when it becomes visible on a secondary replica. And how would you fix the stale read?"
This question tests whether you understand MongoDB beyond db.collection.find(). The interviewer wants to hear about the WiredTiger storage engine, the oplog replication mechanism, write concern vs read concern semantics, and how replica sets achieve consensus. Answering with "MongoDB is a document database that stores JSON" earns zero credit. Explaining the journal, the checkpoint cycle, oplog tailing, and the read/write concern matrix is what demonstrates real depth.
What to Clarify Before Answering
You: "Good question. Let me clarify the setup before I go deep..."
- "Are we running a replica set or a standalone instance? Stale reads only happen with replication."
- "What write concern is the application using? w:1 behaves very differently from w:majority."
- "Is the reading service hitting a secondary replica directly, or going through the primary?"
- "Should I cover the sharded case too, or focus on replica set internals first?"
Why this matters: These questions are not stalling. They reveal that MongoDB's consistency behavior changes dramatically based on configuration. A single-node MongoDB always returns the latest write. A replica set with w:1 and secondary reads can absolutely return stale data, and the interviewer's scenario is almost certainly testing whether you know why.
The 30-Second Answer
MongoDB stores every document as BSON (Binary JSON) in a WiredTiger B-tree, which uses MVCC (multiversion concurrency control) to let readers and writers operate concurrently without locking each other. When you write a document, WiredTiger first writes to an in-memory buffer and the journal (a write-ahead log), then periodically flushes to data files during checkpoints every 60 seconds. For replication, the primary records every write in the oplog (a capped collection), and secondaries tail this oplog to apply changes. The writeConcern controls how many replicas must acknowledge a write before the client gets a response, and the readConcern controls what snapshot a read sees. Together, these two knobs define MongoDB's consistency model, from fast-and-possibly-stale (w:1, secondary reads) to fully linearizable (w:majority, readConcern: linearizable).
The Architecture Overview
Looking at the diagram above, every write flows through the primary node's WiredTiger engine, gets journaled for crash safety, and is recorded in the oplog. Secondaries continuously pull from the oplog and replay operations against their own WiredTiger instances. The gap between when the primary records an operation and when a secondary applies it is the replication lag, which is the root cause of the stale read in the interview question.
The replica set uses a Raft-inspired consensus protocol for leader election. If the primary goes down, the remaining nodes hold an election and one of the secondaries becomes the new primary, typically within 10-12 seconds. During this window, writes are unavailable but data is safe on the surviving nodes.
BSON: The Document Storage Format
MongoDB does not store JSON. It stores BSON (Binary JSON), a binary-encoded serialization format that extends JSON with additional data types and is optimized for fast traversal and in-place updates.
Why Not Just Use JSON?
JSON has three problems for a database engine:
- Parsing cost: JSON is text. Every read requires string parsing, number conversion, and key lookup by string comparison. In a database handling millions of reads per second, this overhead is unacceptable.
- No native types: JSON has no date type, no binary type, no 64-bit integer, no decimal128. MongoDB needs all of these.
- No length prefix: To find a field in a JSON document, you must parse from the beginning. There is no way to jump to the 50th field without reading the first 49.
BSON solves all three. Each element carries a type byte, and variable-length values (strings, embedded documents, binary) carry a byte-length prefix, so the engine can skip over a field's value without parsing its contents.
BSON Binary Layout
// BSON document layout (simplified)
document ::= int32 (total byte length)
             element*
             0x00 (terminator)
element  ::= type_byte (1 byte: 0x01=double, 0x02=string, 0x03=embedded doc, ...)
             field_name (C string, null-terminated)
             value (type-dependent encoding with length prefix)
A 1 KB JSON document typically becomes 1.1-1.3 KB as BSON (the type bytes and length prefixes add overhead), but the performance gain from cheap field skipping and zero string parsing far outweighs the size increase.
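To make the length-prefix idea concrete, here is a hand-rolled sketch in plain Node.js (an illustration only; real applications should use an official BSON library). It encodes a two-field document with just two BSON types and then reads the second field by skipping the first field's value via its length prefix, without parsing the string's contents:

```javascript
// Minimal encoder for {name: <string>, age: <int32>} using two BSON types:
// 0x02 = UTF-8 string (int32 length prefix + bytes + NUL), 0x10 = int32.
function encodeNameAge(name, age) {
  const nameBytes = Buffer.from(name, "utf8");
  const parts = [];
  // element 1: string field "name"
  parts.push(Buffer.from([0x02]), Buffer.from("name\0", "ascii"));
  const lenBuf = Buffer.alloc(4);
  lenBuf.writeInt32LE(nameBytes.length + 1); // value length includes trailing NUL
  parts.push(lenBuf, nameBytes, Buffer.from([0x00]));
  // element 2: int32 field "age"
  parts.push(Buffer.from([0x10]), Buffer.from("age\0", "ascii"));
  const ageBuf = Buffer.alloc(4);
  ageBuf.writeInt32LE(age);
  parts.push(ageBuf);
  parts.push(Buffer.from([0x00])); // document terminator
  const body = Buffer.concat(parts);
  const total = Buffer.alloc(4);
  total.writeInt32LE(body.length + 4); // total length includes its own 4 bytes
  return Buffer.concat([total, body]);
}

// Walk the document; skip values we don't need using their length prefixes.
function readInt32Field(doc, wanted) {
  let pos = 4; // skip the total-length prefix
  while (doc[pos] !== 0x00) {
    const type = doc[pos++];
    const nameEnd = doc.indexOf(0x00, pos);
    const fieldName = doc.toString("ascii", pos, nameEnd);
    pos = nameEnd + 1;
    if (type === 0x02) {
      const strLen = doc.readInt32LE(pos); // prefix lets us skip without parsing
      pos += 4 + strLen;
    } else if (type === 0x10) {
      if (fieldName === wanted) return doc.readInt32LE(pos);
      pos += 4;
    } else {
      throw new Error("type not handled in this sketch");
    }
  }
  return undefined;
}

const doc = encodeNameAge("Alice", 42);
const age = readInt32Field(doc, "age"); // reads "age" without decoding "Alice"
```

The string "Alice" is never decoded on the way to `age`; the reader jumps over it in one addition.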
Why this matters in production
When you see MongoDB memory usage that seems too high for your dataset, remember that BSON documents stored in WiredTiger are compressed on disk but expanded in the WiredTiger cache. A 100 GB collection compressed to 30 GB on disk still needs significant cache space for active working set pages in their uncompressed BSON form.
Key BSON Types
| Type | BSON ID | Size | Notes |
|---|---|---|---|
| Double | 0x01 | 8 bytes | IEEE 754 |
| String | 0x02 | 4 + len + 1 bytes | Length-prefixed UTF-8 |
| Embedded document | 0x03 | Variable | Nested BSON |
| Array | 0x04 | Variable | BSON with "0", "1", "2" as keys |
| Binary | 0x05 | 4 + 1 + len bytes | For raw bytes, UUIDs |
| ObjectId | 0x07 | 12 bytes | Timestamp + random + counter |
| Date | 0x09 | 8 bytes | Milliseconds since Unix epoch |
| 64-bit integer | 0x12 | 8 bytes | Signed |
| Decimal128 | 0x13 | 16 bytes | IEEE 754 decimal (financial math) |
The ObjectId deserves special attention. It is 12 bytes: 4 bytes of Unix timestamp, 5 bytes of random value (unique per process), and 3 bytes of incrementing counter. This design means ObjectIds are roughly time-ordered, which keeps B-tree inserts mostly sequential and avoids random page splits. This is not an accident. It is a deliberate performance optimization.
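Because the timestamp occupies the first 4 bytes, you can recover a document's approximate creation time from its _id alone, with no extra field. A minimal sketch (the hex string below is a made-up example ObjectId):

```javascript
// Extract the embedded Unix timestamp from a 24-hex-char ObjectId string.
// Bytes 0-3 are big-endian seconds since the epoch, so the first 8 hex
// characters decode directly to a time.
function objectIdToDate(hexId) {
  const seconds = parseInt(hexId.slice(0, 8), 16);
  return new Date(seconds * 1000);
}

const created = objectIdToDate("65920c80aabbccddeeff0011");
// created -> a Date shortly after midnight UTC on 2024-01-01
```

This is why sorting on _id gives a rough insertion-time ordering for free.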
WiredTiger: The Storage Engine
WiredTiger is the default storage engine since MongoDB 3.2, and it is the only production-supported engine since 4.2. Understanding WiredTiger is essential because it controls how every read and write physically happens on disk.
B-Tree Structure
WiredTiger stores each collection and index as a B-tree. (The WiredTiger library also implements an LSM tree, but MongoDB uses the B-tree layout.) Each leaf page in the B-tree contains multiple BSON documents. Internal pages contain key ranges and pointers to child pages.
Each leaf page is typically 4 KB to 32 KB and is compressed on disk using either snappy (default), zlib, or zstd compression. When MongoDB reads a document, WiredTiger locates the correct leaf page through the B-tree, decompresses it into the in-memory cache, and returns the requested document.
The B-tree structure means:
- Point lookups (find({_id: X})) are O(log n) in the number of documents
- Range scans (find({age: {$gte: 25, $lte: 35}})) follow the leaf page chain sequentially
- Inserts at the end of the key space (like ObjectId-based _id) are efficient because they always go to the rightmost leaf page
MVCC: How Reads and Writes Coexist
WiredTiger uses multiversion concurrency control (MVCC). Every write creates a new version of the document rather than overwriting the old version in place. Readers see a consistent snapshot of the data as of the time their operation started, without blocking writers.
// Simplified MVCC timeline
//
// Time T1: Document {_id: 1, name: "Alice", version: 5} exists
// Time T2: Writer updates name to "Bob", creating version 6
// Time T3: Reader with snapshot at T1 still sees "Alice" (version 5)
// Time T4: New reader sees "Bob" (version 6)
// Time T5: WiredTiger evicts version 5 (no more readers need it)
This is critical for performance. In older storage engines (like the deprecated MMAPv1), a write lock on a collection blocked all readers. With WiredTiger's MVCC, a long-running aggregation pipeline reading millions of documents never blocks incoming writes, and writes never block reads.
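The timeline above can be modeled as a version chain keyed by commit timestamp. This toy sketch (my own simplification, not WiredTiger's actual data structures) shows why a reader holding an old snapshot keeps seeing the old value while new readers see the new one:

```javascript
// Toy MVCC store: each key maps to a list of {ts, value} versions.
class MvccStore {
  constructor() { this.versions = new Map(); this.clock = 0; }
  write(key, value) {
    const ts = ++this.clock;           // each write commits at a new timestamp
    if (!this.versions.has(key)) this.versions.set(key, []);
    this.versions.get(key).push({ ts, value });
    return ts;
  }
  snapshot() { return this.clock; }    // a reader pins the current timestamp
  read(key, snapTs) {
    const chain = this.versions.get(key) || [];
    // Visible version = latest one committed at or before the snapshot
    const visible = chain.filter(v => v.ts <= snapTs);
    return visible.length ? visible[visible.length - 1].value : undefined;
  }
}

const store = new MvccStore();
store.write("user:1", "Alice");     // version committed at ts=1
const oldSnap = store.snapshot();   // a long-running read pins ts=1
store.write("user:1", "Bob");       // writer is never blocked by the reader
const newSnap = store.snapshot();

const staleView = store.read("user:1", oldSnap); // still "Alice"
const freshView = store.read("user:1", newSnap); // "Bob"
```

The old version cannot be garbage-collected until `oldSnap` is released, which is exactly the cache-pressure problem described below.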
What most people get wrong
MVCC means MongoDB keeps multiple versions of documents in memory simultaneously. Under heavy write load with long-running reads, the WiredTiger cache can fill up with old versions that cannot be evicted because an open cursor still references them. This is the "cache pressure" problem. If you see WiredTiger eviction warnings in the logs, check for long-running queries holding old snapshots open.
The Journal: Crash Safety
The journal is WiredTiger's write-ahead log (WAL). Every write goes to the journal before it is considered durable. The journal is fsynced to disk every 100ms by default (configurable via storage.journal.commitIntervalMs), and immediately for writes that request j: true.
Here is what happens when MongoDB crashes and restarts:
- WiredTiger loads the last checkpoint (a consistent snapshot of all B-trees, taken every 60 seconds)
- It replays all journal entries written after that checkpoint
- The database is now in a consistent state with no data loss (assuming the journal was fsynced)
The maximum data loss window on a standalone node is roughly the last 100ms of writes (between the last journal fsync and the crash). In practice, most deployments use w:majority with journaling, which means the write is not acknowledged until it is journaled on a majority of replica set members.
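The recovery steps above reduce to "load checkpoint, replay newer journal entries." A sketch under the simplifying assumption that journal entries are timestamped key-value sets:

```javascript
// Crash recovery: recovered state = last checkpoint + journal entries
// written after the checkpoint's timestamp.
function recover(checkpoint, journal) {
  const state = { ...checkpoint.data };
  for (const entry of journal) {
    if (entry.ts > checkpoint.ts) {   // older entries are already in the checkpoint
      state[entry.key] = entry.value; // replaying a set is idempotent
    }
  }
  return state;
}

const checkpoint = { ts: 100, data: { a: 1, b: 2 } };
const journal = [
  { ts: 90,  key: "a", value: 0 }, // already reflected in the checkpoint; skipped
  { ts: 110, key: "b", value: 3 }, // not yet in data files; recovered from journal
  { ts: 120, key: "c", value: 4 },
];
const recovered = recover(checkpoint, journal);
```

Any write journaled after the last fsync and before the crash is the (bounded) loss window.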
Checkpoints: The 60-Second Flush
Every 60 seconds (or when the journal reaches 2 GB), WiredTiger writes a checkpoint. A checkpoint is a consistent snapshot of all B-tree pages to disk. This is a heavy I/O operation because it flushes all dirty pages.
// Checkpoint flow (simplified)
function writeCheckpoint():
    // 1. Mark current state as checkpoint boundary
    snapshot = getCurrentMVCCSnapshot()
    // 2. Write all dirty pages to disk (background I/O)
    for each dirtyPage in cache:
        compressedData = compress(dirtyPage, algorithm="snappy")
        writeToDataFile(compressedData)
    // 3. Update the checkpoint metadata
    updateCheckpointTimestamp(snapshot.timestamp)
    // 4. Old journal entries before this checkpoint can be recycled
    recycleJournalBefore(snapshot.timestamp)
Why this matters in production
The 60-second checkpoint cycle is why MongoDB can show periodic I/O spikes. If your disk throughput is barely sufficient for normal operations, checkpoints will cause latency spikes every minute. I always recommend provisioned IOPS (on cloud) or NVMe SSDs (on-premise) for MongoDB. Spinning disks are not viable for any production MongoDB workload.
Compression
WiredTiger compresses data at the page level before writing to disk. The compression options are:
| Algorithm | Compression Ratio | CPU Cost | Best For |
|---|---|---|---|
| snappy (default) | ~2-4x | Very low | General purpose, balanced |
| zstd | ~3-6x | Low-medium | Large datasets where storage cost matters |
| zlib | ~3-5x | Medium-high | Legacy, prefer zstd instead |
| none | 1x | Zero | Very low latency requirements |
I recommend zstd for most production workloads. It provides significantly better compression than snappy with only marginally higher CPU cost. The storage savings (and reduced I/O) typically more than offset the CPU overhead.
Index pages use prefix compression by default, which is extremely efficient because consecutive index entries often share long key prefixes. A collection of 100 million documents with a compound index on {country: 1, city: 1} compresses dramatically because thousands of consecutive entries share the same country value.
The Write Path: From Driver to Disk
Let me walk through exactly what happens when your application calls insertOne() or updateOne().
The important detail is that the document is written to memory (the WiredTiger cache) and the journal, but not immediately to the data files. The data files are updated during the next checkpoint (every 60 seconds). This is why the journal exists: if MongoDB crashes between checkpoints, the journal entries replay the missing writes.
For updates, WiredTiger does not modify the document in place. It creates a new version of the B-tree page containing the updated document. The old page version remains available for any concurrent readers using MVCC snapshots.
Memory Management and Cache Eviction
WiredTiger manages its own memory through the WiredTiger cache, which is separate from the operating system's filesystem cache. By default, the cache size is the larger of 256 MB or 50% of (total RAM minus 1 GB).
When the cache fills up, WiredTiger evicts pages using an LRU-based eviction policy:
- Clean pages (unmodified since last checkpoint) are simply discarded. They can be re-read from the data files.
- Dirty pages (modified since last checkpoint) must be written to disk before eviction. This is called "eviction I/O" and appears as increased disk writes in monitoring.
- Pages with old MVCC versions cannot be evicted until all readers referencing those versions close their cursors.
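These three rules can be sketched as a single eviction decision (a toy model; WiredTiger's real eviction server is far more elaborate):

```javascript
// Toy eviction decision: clean pages are discarded for free, dirty pages must
// be flushed first ("eviction I/O"), pinned pages cannot be evicted at all.
function evictPage(page, diskWrites) {
  if (page.pinnedByReaders > 0) {
    return false;               // old MVCC versions still referenced by a cursor
  }
  if (page.dirty) {
    diskWrites.push(page.id);   // write back before discarding
  }
  return true;                  // page removed from cache either way
}

const diskWrites = [];
const cleanOk = evictPage({ id: "p1", dirty: false, pinnedByReaders: 0 }, diskWrites);
const dirtyOk = evictPage({ id: "p2", dirty: true,  pinnedByReaders: 0 }, diskWrites);
const blocked = evictPage({ id: "p3", dirty: true,  pinnedByReaders: 1 }, diskWrites);
// Only p2 generated a disk write; p3 stays in cache until its reader closes.
```

The `blocked` case is the cache-pressure scenario from the MVCC section: a single long-running cursor can pin pages no amount of eviction effort can free.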
The WiredTiger cache hit ratio is the single most important metric for MongoDB performance. If it drops below 95%, the system is spending too much time reading from disk instead of serving from memory.
// Key WiredTiger cache metrics (from db.serverStatus().wiredTiger.cache)
{
"bytes currently in the cache": 6442450944, // Current cache usage
"maximum bytes configured": 8589934592, // Cache size limit
"pages read into cache": 1542367, // Cache misses
"pages written from cache": 834291, // Dirty evictions
"tracked dirty bytes in the cache": 268435456, // Dirty data
"unmodified pages evicted": 2103847, // Clean evictions
"modified pages evicted": 98234, // Forced dirty evictions
"percentage overhead": 8 // Internal metadata overhead
}
Why this matters in production
I always set up alerts on two WiredTiger metrics: cache utilization (should stay below 80%) and dirty bytes percentage (should stay below 20% of cache size). When dirty bytes exceed 20%, WiredTiger triggers aggressive eviction, which competes with application I/O and causes latency spikes. If you see this pattern, your write throughput is exceeding the checkpoint's ability to flush dirty pages.
Replica Sets: How MongoDB Replicates Data
A replica set is a group of mongod processes that maintain the same data set. One node is the primary (accepts writes), and the rest are secondaries (replicate from the primary and can serve reads).
Oplog: The Replication Stream
Every write to the primary is recorded in the oplog (operations log), a special capped collection in the local database. The oplog stores idempotent operations, meaning each entry can be applied multiple times with the same result.
// Example oplog entry
{
"ts": Timestamp(1704067200, 1), // Operation timestamp (seconds + ordinal)
"t": NumberLong(5), // Election term
"h": NumberLong(0), // Legacy hash (unused since 4.0)
"v": 2, // Oplog version
"op": "u", // Operation type: i=insert, u=update, d=delete
"ns": "mydb.users", // Namespace (database.collection)
"o2": {"_id": ObjectId("...")}, // Query selector (for updates)
"o": { // Operation payload
"$v": 2,
"diff": {"u": {"name": "Bob"}} // Update diff (not full replacement)
}
}
Secondaries tail the primary's oplog, pulling new entries and applying them to their own WiredTiger instances. This is similar to how MySQL's binlog replication works, but with two key differences:
- The oplog is idempotent by design. If a secondary applies the same entry twice (due to a network retry), the result is identical. MySQL's statement-based replication does not have this guarantee.
- The oplog uses timestamps with ordinals, not sequential IDs. This allows the replication protocol to handle clock skew and out-of-order delivery gracefully.
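The idempotency point is worth seeing concretely. Conceptually, the primary may execute an operator like $inc, but the oplog records the resulting field values as a diff, so re-applying the same entry is harmless. A sketch using a simplified entry shape:

```javascript
// Apply a (simplified) v2-style oplog update entry: merge the diff's updated
// fields into the document. Because the diff stores the *result* (score: 15),
// not the operation ($inc: 5), applying it twice yields the same document.
function applyOplogEntry(doc, entry) {
  return { ...doc, ...entry.o.diff.u };
}

const entry = {
  op: "u",
  ns: "mydb.users",
  o2: { _id: 1 },
  o: { diff: { u: { score: 15 } } }, // primary ran {$inc: {score: 5}} on score: 10
};

const once  = applyOplogEntry({ _id: 1, score: 10 }, entry);
const twice = applyOplogEntry(once, entry); // network retry re-delivers the entry
// once and twice are identical -- the retry did no damage
```

Had the oplog stored `$inc: 5` verbatim, the retry would have produced `score: 20` and silently corrupted the secondary.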
Oplog Sizing and the Replication Window
The oplog is a capped collection, meaning it has a fixed maximum size. When it fills up, the oldest entries are overwritten. The replication window is the time span between the oldest and newest oplog entries. If a secondary falls behind more than the replication window, it can no longer catch up by tailing the oplog and must perform a full resync (copying the entire dataset from the primary).
// Check oplog size and replication window
rs.printReplicationInfo()
// Output:
// configured oplog size: 2048 MB
// log length start to end: 172800 secs (48 hrs)
// oplog first event time: Mon Jan 01 2024 00:00:00
// oplog last event time: Wed Jan 03 2024 00:00:00
I recommend an oplog size that provides at least 48-72 hours of replication window. This gives you time to recover a downed secondary without triggering a full resync. For write-heavy workloads, you may need to increase the oplog to 5-10% of your data size.
Replication Lag and Monitoring
Replication lag is the delay between when a write is recorded in the primary's oplog and when it is applied on a secondary. In a healthy replica set with low write throughput, replication lag is under 100ms. Under heavy write load or when a secondary has slower hardware, lag can grow to seconds or even minutes.
// Check replication lag
rs.printSecondaryReplicationInfo()
// Output:
// source: secondary1:27017
// syncedTo: Mon Jan 03 2024 00:00:00
// 0 secs (0 hrs) behind the primary
// source: secondary2:27017
// syncedTo: Mon Jan 02 2024 23:59:55
// 5 secs (0 hrs) behind the primary
When replication lag exceeds a few seconds, reads with readPreference: secondaryPreferred return increasingly stale data. This is exactly the scenario in the opening interview question: the application reads from a secondary that has not yet applied the recent write.
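The opening interview scenario reduces to arithmetic on timestamps: a secondary read is stale exactly when replication lag exceeds the gap between write and read. A toy model:

```javascript
// What a secondary read observes, given when the write committed on the
// primary, the secondary's replication lag, and when the read happens (ms).
function secondaryRead(writeTimeMs, lagMs, readTimeMs, oldValue, newValue) {
  const appliedOnSecondaryAt = writeTimeMs + lagMs;
  return readTimeMs >= appliedOnSecondaryAt ? newValue : oldValue;
}

// Write at t=0, secondary lags 200ms, read arrives at t=50 -> stale
const readAt50  = secondaryRead(0, 200, 50,  "old profile", "new profile");
// Same read retried at t=250 -> fresh
const readAt250 = secondaryRead(0, 200, 250, "old profile", "new profile");
```

The fixes discussed below (write concern, read concern, causal consistency) all work by either shrinking the lag term or making the read wait it out.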
Raft-Inspired Election Protocol
MongoDB's replica set election protocol is based on the Raft consensus algorithm (with modifications). When the primary becomes unreachable, the following sequence occurs:
- Secondaries detect the primary is unreachable after the electionTimeoutMillis period (default 10 seconds)
- An eligible secondary calls an election by incrementing its term and requesting votes from all replica set members
- Each member votes for the first candidate whose oplog is at least as up-to-date as its own (the "log completeness" check)
- If a candidate receives votes from a majority of members, it becomes the new primary
- The new primary starts accepting writes in the new term
The election typically completes in 10-12 seconds: 10 seconds for the timeout detection plus 1-2 seconds for the election itself. During this window, the application cannot write to the replica set. The MongoDB drivers detect the topology change and automatically redirect writes to the new primary once the election completes.
What most people get wrong
Setting electionTimeoutMillis too low (like 2 seconds) does not make failover faster in a good way. It causes spurious elections during network blips, which means your primary flips back and forth, causing w:majority writes to fail during each transition. I recommend keeping the default 10 seconds unless you have extremely stable networking and have tested thoroughly.
Write Concern: Controlling Durability
The writeConcern parameter on each write operation controls when the server acknowledges the write:
| Write Concern | Meaning | Durability | Latency |
|---|---|---|---|
| w: 0 | Fire and forget, no acknowledgment | None | Lowest (~0.5ms) |
| w: 1 | Primary acknowledges after applying the write in memory (journal fsync not required unless j: true) | Single node | Low (~1-2ms) |
| w: "majority" | Majority of replicas acknowledge | Survives primary failure | Medium (~5-15ms) |
| w: 3 | Exactly 3 nodes acknowledge | Explicit count | Higher |
| j: true | Wait for journal fsync (combinable with w) | Journal-durable | Adds fsync latency (bounded by commitIntervalMs) |
The critical insight is that w:1 (the server default before MongoDB 5.0 changed it to w: "majority", and still common in older deployments) means the write is only durable on the primary. If the primary crashes before the secondaries replicate the write, that write is lost. This is not a bug. It is the expected behavior of w:1. For any data you cannot afford to lose, use w: "majority".
Read Concern: Controlling Consistency
The readConcern parameter controls what data a read operation can see:
| Read Concern | What You See | Use Case |
|---|---|---|
| "local" (default) | Latest data on the queried node (may be rolled back) | Low-latency reads where occasional stale data is acceptable |
| "available" | Same as local but faster on sharded clusters (skips orphan filtering) | Analytics queries that tolerate stale data |
| "majority" | Only data committed to a majority of replicas | Reads that must not see data that could be rolled back |
| "linearizable" | Latest majority-committed data plus a confirmation that the primary is still the leader | Critical reads that need true linearizability (e.g., leader elections, locks) |
| "snapshot" | A consistent snapshot across shards (used in multi-document transactions) | Transactions |
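The standard fix for the opening question's stale read, beyond forcing primary reads, is a causally consistent session: the driver attaches the cluster time of the acknowledged write (afterClusterTime) to the follow-up read, and the secondary delays the read until it has replicated past that point. Real drivers manage the tokens automatically on a session object; this toy model (my own names) shows only the gating logic:

```javascript
// Toy secondary that tracks the highest oplog timestamp it has applied.
class ToySecondary {
  constructor() { this.appliedOpTime = 0; this.data = new Map(); }
  replicate(opTime, key, value) {
    this.data.set(key, value);
    this.appliedOpTime = opTime;
  }
  // Causally consistent read: refuse to answer from a point in time
  // earlier than the client's afterClusterTime token.
  read(key, afterClusterTime) {
    if (this.appliedOpTime < afterClusterTime) {
      return { ok: false, reason: "waiting for replication" };
    }
    return { ok: true, value: this.data.get(key) };
  }
}

const secondary = new ToySecondary();
const writeOpTime = 7; // cluster time the primary returned with the write ack

const before = secondary.read("profile:1", writeOpTime); // lagging: must wait
secondary.replicate(writeOpTime, "profile:1", "new profile");
const after = secondary.read("profile:1", writeOpTime);  // now guaranteed fresh
```

The read either blocks briefly (paying the lag once) or succeeds with data at least as new as the client's own write, which is exactly "read your own writes."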
Sharding: How MongoDB Distributes Data
When a single replica set cannot handle the data volume or throughput, MongoDB uses sharding to distribute data across multiple replica sets (called shards).
Sharded Cluster Architecture
The three key components:
- mongos (router): A stateless process that routes queries to the correct shard(s). Applications connect to mongos, not directly to shards. You can run as many mongos instances as you need behind a load balancer.
- Config servers: A replica set that stores the chunk-to-shard mapping. The mongos caches this mapping locally and refreshes it when it detects a stale route (a chunk has moved).
- Shards: Each shard is a replica set holding a subset of the data. Data is divided into contiguous ranges called chunks (default 128 MB each), and each chunk lives on exactly one shard.
Shard Key Selection
The shard key is the most important decision in a sharded MongoDB deployment. It determines how data is distributed, how queries are routed, and ultimately the performance ceiling of your cluster.
Chunk Splitting and Balancing
When a chunk grows beyond the configured maximum size (default 128 MB), MongoDB automatically splits it into two smaller chunks at the midpoint of the key range. Splitting is a metadata-only operation on the config server; it does not move any data.
The balancer is a background process that runs on the primary of the config server replica set (since MongoDB 3.4; earlier versions ran it on a mongos). It monitors the chunk count per shard and migrates chunks from overloaded shards to underloaded shards to maintain an even distribution. A migration involves:
- The destination shard copies all documents in the chunk from the source shard
- During the copy, any new writes to the chunk are forwarded to both shards
- Once the copy is complete, the config server updates the chunk map
- The source shard deletes its copy of the chunk data
Balancer impact on production
Chunk migrations are expensive I/O operations. Each migration reads the entire chunk from the source shard and writes it to the destination shard. On a busy cluster, I recommend scheduling the balancer to run only during off-peak hours using sh.setBalancerState() and the balancer window configuration.
Targeted vs Scatter-Gather Queries
When a query includes the shard key, mongos can route it to the exact shard that holds the data. This is a targeted query and has the same performance as querying an unsharded collection.
When a query does not include the shard key, mongos must send it to all shards and merge the results. This is a scatter-gather query and its latency is dominated by the slowest shard. For a cluster with 10 shards, a scatter-gather query does 10x the work of a targeted query.
// Targeted query (shard key included)
db.events.find({ tenantId: "acme", timestamp: { $gte: ISODate("2024-01-01") } })
// mongos routes to the shard that owns tenantId "acme"
// Scatter-gather query (no shard key)
db.events.find({ eventType: "login" })
// mongos sends to ALL shards, merges results
// Latency = max(shard1_time, shard2_time, ..., shardN_time)
This is why shard key selection is the most critical decision. If 80% of your queries do not include the shard key, you are doing scatter-gather on 80% of traffic, and each query generates N network hops instead of 1.
Jumbo Chunks and the Balancer's Limits
A jumbo chunk is a chunk that exceeds the maximum size (128 MB) but cannot be split further because all documents in the chunk share the same shard key value. This happens when the shard key has low cardinality (e.g., sharding on {country: 1} when 60% of data is from one country).
Jumbo chunks cannot be split or migrated, which leads to an unbalanced cluster. The shard holding the jumbo chunk receives disproportionate traffic and storage. This is one of the most common MongoDB sharding mistakes and is very difficult to fix after the fact (it often requires resharding the entire collection).
To avoid jumbo chunks:
- Choose a shard key with high cardinality (millions of distinct values)
- Use a compound shard key where the combination has high cardinality
- Never shard on a field with fewer than 1,000 distinct values
Query Execution and Indexes
MongoDB's query system converts a find() or aggregate() call into an execution plan, similar to how a SQL database processes a SELECT statement.
The Query Planner
When a query arrives, the query planner:
- Identifies all candidate indexes that could satisfy the query
- For each candidate, generates an execution plan
- Runs the candidate plans for a short trial period, bounded by a fixed work budget (on the order of 10,000 index keys or documents examined)
- Picks the plan that returned results with the least work (fewest index keys examined)
- Caches the winning plan for the query shape
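The selection step amounts to keeping the most productive plan from the trial. A sketch with made-up trial statistics (the real scoring formula weighs more factors, such as whether a plan finished and whether it needs a blocking sort):

```javascript
// Pick the candidate plan that produced the most results per unit of work
// during its trial run -- a simplification of the planner's real score.
function pickWinningPlan(trialResults) {
  return trialResults.reduce((best, plan) =>
    plan.nReturned / plan.keysExamined > best.nReturned / best.keysExamined
      ? plan
      : best);
}

const trials = [
  { indexName: "age_1",        keysExamined: 9400, nReturned: 101 }, // wasteful
  { indexName: "age_1_city_1", keysExamined: 120,  nReturned: 101 }, // selective
];
const winner = pickWinningPlan(trials); // the compound index wins
```

The winner is then cached per query shape, which is why a plan chosen under unrepresentative data can linger until the cache is invalidated.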
// Example: explain output showing query plan
db.users.find({age: {$gte: 25}, city: "NYC"}).explain("executionStats")
// Output (simplified):
{
"winningPlan": {
"stage": "FETCH", // Fetch full documents
"inputStage": {
"stage": "IXSCAN", // Index scan
"indexName": "age_1_city_1", // Compound index used
"keyPattern": {"age": 1, "city": 1},
"direction": "forward",
"indexBounds": {
"age": ["[25, Infinity]"],
"city": ["[\"NYC\", \"NYC\"]"]
}
}
},
"executionStats": {
"totalKeysExamined": 1542, // Index entries scanned
"totalDocsExamined": 1542, // Documents fetched
"nReturned": 1542 // Results returned
}
}
The ideal query has totalKeysExamined == totalDocsExamined == nReturned. If totalDocsExamined is much higher than nReturned, you are scanning documents that do not match the query, which means your index is not selective enough.
Index Types
| Index Type | Syntax | Use Case |
|---|---|---|
| Single field | {age: 1} | Equality and range queries on one field |
| Compound | {country: 1, city: 1, age: 1} | Multi-field queries following ESR rule |
| Multikey | {tags: 1} | Indexing array fields |
| Text | {content: "text"} | Full-text search (limited, prefer Atlas Search) |
| Geospatial | {location: "2dsphere"} | Location-based queries |
| Hashed | {userId: "hashed"} | Even distribution for shard keys |
| Wildcard | {"$**": 1} | Indexing unknown/dynamic field names |
| TTL | {createdAt: 1}, {expireAfterSeconds: 86400} | Auto-delete expired documents |
The ESR Rule for Compound Indexes
The order of fields in a compound index matters enormously. The ESR rule (Equality, Sort, Range) defines the optimal field ordering:
- Equality fields first: fields queried with exact match ({status: "active"})
- Sort fields next: fields used in .sort() ({createdAt: -1})
- Range fields last: fields queried with ranges ({age: {$gte: 25}})
// Query: active users in NYC, sorted by newest, aged 25+
db.users.find({
status: "active", // Equality
city: "NYC", // Equality
age: {$gte: 25} // Range
}).sort({createdAt: -1}) // Sort
// Optimal compound index following ESR:
db.users.createIndex({
status: 1, // E: equality
city: 1, // E: equality
createdAt: -1, // S: sort
age: 1 // R: range
})
Covered Queries
A covered query is one where all fields in the query filter, sort, and projection exist in the index. MongoDB can answer the query entirely from the index without fetching the full document from the B-tree. This eliminates the FETCH stage and can improve performance by 2-5x.
// Covered query: all fields in the index
db.users.find(
{status: "active", city: "NYC"},
{_id: 0, status: 1, city: 1, createdAt: 1} // Projection matches index
).sort({createdAt: -1})
// With index: {status: 1, city: 1, createdAt: -1}
// explain() shows: "stage": "IXSCAN" with NO "FETCH" stage
The Aggregation Pipeline
MongoDB's aggregation framework processes documents through a pipeline of stages. Each stage transforms the documents and passes them to the next stage.
db.orders.aggregate([
{ $match: { status: "completed", date: { $gte: ISODate("2024-01-01") } } },
{ $group: { _id: "$customerId", total: { $sum: "$amount" }, count: { $sum: 1 } } },
{ $sort: { total: -1 } },
{ $limit: 10 }
])
The query optimizer pushes $match stages as early as possible in the pipeline and coalesces adjacent $sort and $limit stages. The $match at the beginning can use indexes, but once documents pass through $group, they are in-memory objects that cannot use indexes.
For sharded collections, the aggregation pipeline splits into two phases:
- Shard phase: $match and $group run on each shard in parallel
- Merge phase: Results from all shards are merged on the mongos (or a designated shard) for the final $sort and $limit
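The two-phase split can be sketched as partial aggregation followed by a merge of partial sums, the same shape as a map-reduce combine step (toy model, in-memory data):

```javascript
// Shard phase: each shard groups its *own* documents by customerId.
function shardPhase(docs) {
  const groups = new Map();
  for (const d of docs) {
    const g = groups.get(d.customerId) || { total: 0, count: 0 };
    g.total += d.amount;
    g.count += 1;
    groups.set(d.customerId, g);
  }
  return groups;
}

// Merge phase (on mongos): combine the partial results from every shard.
function mergePhase(partials) {
  const merged = new Map();
  for (const part of partials) {
    for (const [id, g] of part) {
      const m = merged.get(id) || { total: 0, count: 0 };
      m.total += g.total;
      m.count += g.count;
      merged.set(id, m);
    }
  }
  return merged;
}

const shard1 = shardPhase([{ customerId: "c1", amount: 10 },
                           { customerId: "c2", amount: 5 }]);
const shard2 = shardPhase([{ customerId: "c1", amount: 20 }]);
const result = mergePhase([shard1, shard2]); // c1 totals 30 across shards
```

This works because $sum is decomposable into partial sums; an operator like $stdDevPop needs more bookkeeping in the shard phase for the same reason.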
Change Streams: Real-Time Data Notifications
Change streams allow applications to subscribe to real-time data changes on a collection, database, or entire deployment. Under the hood, change streams are built on top of the oplog. The driver opens a tailable cursor on the oplog and filters for events matching the specified collection.
// Watch for changes on the orders collection
const changeStream = db.collection("orders").watch(
  [ { $match: { "fullDocument.status": "completed" } } ],
  { fullDocument: "updateLookup" } // without this, update events omit fullDocument
);
changeStream.on("change", (event) => {
  // event.operationType: "insert", "update", "replace", "delete"
  // event.fullDocument: the complete document (always for insert/replace;
  //   for updates only with the updateLookup option above)
  // event.updateDescription: { updatedFields, removedFields } (for updates)
  console.log("Order completed:", event.fullDocument._id);
});
Change streams are resumable. Each event includes a _id field (the resume token). If the client disconnects, it can reconnect with the resume token and pick up where it left off without missing events. This is significantly more robust than polling for changes and is the recommended approach for event-driven architectures with MongoDB.
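Resumability is a cursor-position contract: the consumer durably stores the last token it processed and, on reconnect, replays strictly after that token. A sketch over an in-memory event log standing in for the oplog:

```javascript
// Toy change stream: events carry monotonically increasing resume tokens (_id).
const eventLog = [
  { _id: 1, operationType: "insert" },
  { _id: 2, operationType: "update" },
  { _id: 3, operationType: "delete" },
];

// Reconnect with the last token we durably processed; replay everything after.
function resumeAfter(log, token) {
  return log.filter(e => e._id > token);
}

const missedEvents = resumeAfter(eventLog, 1); // client crashed after _id 1
// missedEvents contains the update and delete; nothing is skipped or repeated
```

The guarantee holds only while the underlying oplog still contains the token's position, which is one more reason to size the oplog window generously.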
Why this matters in production
Change streams replace the old (and fragile) practice of tailing the oplog directly. Direct oplog tailing required knowledge of internal oplog format and broke across MongoDB version upgrades. Change streams provide a stable, documented API with the same real-time capabilities. I use them for cache invalidation, search index updates, and event sourcing patterns.
Transactions: Multi-Document ACID
Since MongoDB 4.0 (replica sets) and 4.2 (sharded clusters), MongoDB supports multi-document ACID transactions. This was the most requested feature in MongoDB's history, because without it, operations spanning multiple documents had no atomicity guarantee.
How Transactions Work Internally
A transaction in MongoDB uses WiredTiger's MVCC to create a consistent snapshot at transaction start. All reads within the transaction see this snapshot, and all writes are buffered until commit.
Key implementation details:
- Snapshot isolation: The transaction reads from the snapshot taken at transaction start. If another transaction modifies the same documents in the meantime, the first transaction gets a WriteConflict error and the application must retry.
- 60-second limit: Transactions that run longer than transactionLifetimeLimitSeconds (default 60 seconds) are automatically aborted. This is intentional: long transactions hold MVCC snapshots open, preventing WiredTiger from reclaiming old document versions.
- Oplog entry: The entire transaction is recorded as a single applyOps oplog entry, ensuring secondaries replay the transaction atomically.
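Snapshot isolation plus conflict detection can be modeled with per-document version numbers: a transaction's write is accepted only if the document has not changed since the transaction's snapshot. A toy model (my own structure, not WiredTiger's):

```javascript
// Each document carries a version; a transaction records the versions it read
// at snapshot time, and its commit is rejected if any of them have moved on.
function tryCommit(store, txn) {
  for (const w of txn.writes) {
    if (store[w.key].version !== txn.snapshotVersions[w.key]) {
      return { ok: false, error: "WriteConflict" }; // caller retries the txn
    }
  }
  for (const w of txn.writes) {
    store[w.key] = { value: w.value, version: store[w.key].version + 1 };
  }
  return { ok: true };
}

const store = { doc1: { value: "a", version: 1 } };
const snapshot = { doc1: 1 }; // both transactions started from version 1
const txnA = { snapshotVersions: snapshot, writes: [{ key: "doc1", value: "fromA" }] };
const txnB = { snapshotVersions: snapshot, writes: [{ key: "doc1", value: "fromB" }] };

const first  = tryCommit(store, txnA); // succeeds, bumps doc1 to version 2
const second = tryCommit(store, txnB); // version moved -> WriteConflict
```

This is optimistic concurrency: neither transaction takes a lock, so the loser pays with a retry instead of everyone paying with blocking, which is the right trade when conflicts are rare.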
What most people get wrong
Multi-document transactions in MongoDB are not free. They have overhead from snapshot management, write conflict detection, and the 60-second time limit. Do not treat MongoDB transactions like SQL transactions where you wrap everything in BEGIN/COMMIT. Instead, design your schema to keep related data in a single document whenever possible (embedded documents and arrays). Use transactions only when you genuinely need atomicity across multiple documents or collections.
Write Conflicts and Retries
When two transactions modify the same document, MongoDB uses optimistic concurrency control. Both transactions proceed without locking, but at commit time, the second transaction to commit detects the conflict.
// Transaction retry pattern (based on MongoDB's recommended approach)
async function runTransactionWithRetry(session, txnFunc) {
  while (true) {
    try {
      session.startTransaction({
        readConcern: { level: "snapshot" },
        writeConcern: { w: "majority" }
      });
      await txnFunc(session);
      await session.commitTransaction();
      break; // Success
    } catch (error) {
      // End the open transaction before retrying; ignore the error if it
      // already ended (e.g., the commit itself failed)
      await session.abortTransaction().catch(() => {});
      if (error.hasErrorLabel("TransientTransactionError")) {
        // Write conflict or transient error: retry the whole transaction
        continue;
      }
      throw error; // Non-retryable error
    }
  }
}
What Happens When Things Break
MongoDB is designed for production failures. Here are the key failure scenarios and how the system responds.
| Failure | What Happens | How MongoDB Responds | Impact on Clients |
|---|---|---|---|
| Primary crashes | Writes unavailable | Election in ~10-12s, new primary elected | 10-12s write downtime; reads from secondaries continue |
| Secondary crashes | One fewer replica | Replica set continues with reduced redundancy | No impact if majority of nodes survive |
| Network partition (minority side) | Partitioned nodes cannot reach majority | Partitioned nodes step down to secondary | Clients connected to minority partition get errors |
| Disk full on primary | Writes fail | MongoDB rejects writes; reads continue | Write errors until space freed |
| Oplog falls behind | Secondary too far behind primary | Secondary enters RECOVERING state, needs resync | That secondary unavailable for reads; may affect w:majority |
| WiredTiger cache exhaustion | Memory pressure | Aggressive eviction, slower operations | Increased latency, possible timeouts |
| Config server unavailable (sharded) | Chunk metadata inaccessible | Reads/writes to existing chunks continue; splits and migrations stop | No immediate data loss; cluster "freezes" in current state |
The Split-Brain Problem and How MongoDB Avoids It
In distributed systems, a network partition can create a "split-brain" scenario where two groups of nodes each believe they are the active cluster. MongoDB avoids split-brain through the majority requirement for elections.
In a 3-node replica set, a node needs 2 votes (a majority) to become primary. If the network partitions into a 2-node group and a 1-node group, only the 2-node group can elect a primary. The isolated node cannot get a majority and remains a read-only secondary. This guarantees that at most one primary exists at any time.
This is also why you should always use an odd number of replica set members (3, 5, or 7). With an even number (say 4), a network partition could create two 2-node groups, and neither group can achieve a majority. The entire cluster becomes read-only until the partition heals. An arbiter node (a lightweight member that votes but does not hold data) can be added to avoid this problem, though I prefer running 3 full data nodes for maximum redundancy.
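The majority arithmetic behind this rule is simple to state precisely. A sketch (`electionMajority` is an illustrative name, not a MongoDB API):

```javascript
// Votes needed to win an election among N voting members
function electionMajority(votingMembers) {
  return Math.floor(votingMembers / 2) + 1;
}

// 3 members: majority is 2, so a 2-node partition can still elect a primary.
// 4 members: majority is 3, so two 2-node partitions both go read-only.
// An even member count adds a failure mode without adding fault tolerance.
```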
Retryable Writes and Reads
Since MongoDB 3.6 (writes) and 4.2 (reads), the drivers support retryable operations. When a write fails due to a network error or a failover event, the driver automatically retries the operation once. This is safe because each write is tagged with a unique transaction ID, and the server deduplicates retried writes.
// Retryable writes are enabled by default
const client = new MongoClient("mongodb://...", {
retryWrites: true, // Default: true since 3.6
retryReads: true // Default: true since 4.2
});
// If this write fails due to network error or failover:
// 1. Driver detects the error
// 2. Driver discovers the new primary via SDAM (see below)
// 3. Driver retries the same write with the same txnNumber
// 4. New primary deduplicates if the write was already applied
await db.collection("users").updateOne(
{ _id: userId },
{ $set: { lastLogin: new Date() } }
);
Why this matters in production
Before retryable writes, every application had to implement its own retry logic with deduplication. This was error-prone and inconsistent across services. With retryable writes enabled (the default), most transient failures during elections are handled transparently by the driver. I have seen this reduce error rates during planned maintenance (rolling restarts) from 2-5% to near zero.
Server Discovery and Monitoring (SDAM)
The MongoDB driver continuously monitors the topology of the replica set through a protocol called SDAM (Server Discovery and Monitoring). The driver maintains a background thread that pings each node every 10 seconds (configurable via heartbeatFrequencyMS) to detect:
- Which node is the primary
- Which nodes are secondaries
- Which nodes are unresponsive
- The replication lag on each secondary
When a failover occurs, the driver detects the topology change through SDAM and automatically redirects operations to the new primary. The application does not need any custom failover logic. This entire process is invisible to the application code beyond a brief latency spike during the election window.
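The relevant knobs can be sketched as Node.js driver client options (the option names are real driver options; the values shown are the documented defaults):

```javascript
// How often the monitor pings each node, and how long an operation will
// wait for a suitable server (e.g., the new primary) during a failover
const sdamOptions = {
  heartbeatFrequencyMS: 10000,      // topology ping interval (default: 10s)
  serverSelectionTimeoutMS: 30000,  // max wait for a usable server (default: 30s)
};
// new MongoClient(uri, sdamOptions)
```

Lowering `serverSelectionTimeoutMS` fails fast during extended outages; raising it rides out longer elections at the cost of slower error reporting.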
Rollback Scenario
When a primary crashes and a secondary is elected, the new primary might be missing some writes that the old primary had accepted with w:1 (not replicated to majority). When the old primary comes back online, it discovers that its oplog diverges from the current primary. It must roll back those unreplicated writes.
Rolled-back documents are saved to a rollback/ directory as BSON files. You can recover them manually, but this is a last resort. The correct fix is to use w: "majority" for all writes you cannot afford to lose.
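As a sketch, the durable-write options look like this (Node.js driver option shape; the `wtimeout` value is illustrative):

```javascript
// A majority-acknowledged write survives failover: it is already present
// on most voting members by the time the client sees success, so a new
// primary is guaranteed to have it and it can never be rolled back
const durableWrite = {
  writeConcern: { w: "majority", wtimeout: 5000 }, // fail rather than wait forever
};
// coll.updateOne({ _id: userId }, { $set: { plan: "pro" } }, durableWrite)
```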
Performance Characteristics
| Operation | Latency (typical) | Throughput | Notes |
|---|---|---|---|
| Point read by _id (primary) | 0.5-2ms | 50K-100K ops/s per node | Single B-tree traversal |
| Point read (secondary) | 0.5-2ms + replication lag | 50K-100K ops/s | May return stale data |
| Indexed query | 1-10ms | 10K-50K ops/s | Depends on selectivity and result size |
| Collection scan | 100ms-minutes | Varies | Full B-tree traversal, avoid in production |
| Write (w:1, journaled) | 1-2ms | 20K-50K ops/s per node | Single-node durability only |
| Write (w:majority) | 5-15ms | 5K-20K ops/s per node | Waits for replication |
| Transaction (2-3 docs) | 10-30ms | 1K-5K txn/s | MVCC snapshot + conflict detection overhead |
| Aggregation pipeline | 10ms-seconds | Depends on pipeline | Indexes help mainly the initial $match/$sort stages |
What Affects Performance
- Working set size vs RAM: If your frequently accessed data fits in the WiredTiger cache (typically 50% of RAM), most reads come from memory. If it does not fit, every cache miss means a disk read, and latency jumps from microseconds to milliseconds.
- Index coverage: Queries without index support trigger collection scans. On a collection with 100 million documents, a collection scan can take minutes. Use `explain()` on every production query.
- Document size: WiredTiger compresses pages, but large documents (>100 KB) reduce the number of documents per leaf page, increasing the number of pages that must be read for range queries. If a document contains a large array that grows unboundedly, you will hit the 16 MB BSON document size limit, and performance degrades well before that.
- Write contention: Under heavy write load, WiredTiger's checkpoint cycle (every 60 seconds) creates I/O spikes. Combined with replication overhead for `w:majority`, sustained write throughput is lower than read throughput.
- Connection pooling: Each MongoDB driver maintains a connection pool to the server. The default pool size is typically 100 connections. If your application creates too many connections (e.g., in a serverless environment where each function instance creates its own pool), the mongod process can run out of file descriptors and start rejecting connections. Use connection pooling libraries and keep the total connection count under `net.maxIncomingConnections` (default 65,536).
Performance Optimization Strategies
Read-heavy workloads:
- Use `readPreference: secondaryPreferred` to distribute reads across replicas
- Ensure every query has an appropriate index (use `explain()`)
- Use projection to return only needed fields, reducing network and cache overhead
- Consider covered queries (all fields in the index) for high-throughput reads
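A minimal sketch of a covered query; the collection and field names are hypothetical:

```javascript
// The index contains every field the query filters on AND returns, so the
// server answers from the index alone and skips the FETCH stage entirely
const indexSpec  = { userId: 1, status: 1 };
const filter     = { userId: 42, status: "active" };
const projection = { _id: 0, userId: 1, status: 1 }; // exclude _id: it is not in the index
// coll.createIndex(indexSpec)
// coll.find(filter).project(projection)  -- explain() should show no FETCH stage
```

If the projection includes any field outside the index (including the default `_id`), the query is no longer covered and each match costs a document fetch.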
Write-heavy workloads:
- Use `w:1` for non-critical writes where occasional data loss is acceptable
- Batch writes using `insertMany()` or `bulkWrite()` to reduce round-trips
- Use `ordered: false` in bulk operations so individual failures do not block the batch
- Monitor checkpoint I/O and consider reducing `storage.syncPeriodSecs` for more frequent, smaller checkpoints
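A sketch of an unordered bulk write; the collection and fields are hypothetical:

```javascript
// One round-trip carries many operations; ordered:false lets the server
// continue past individual failures and report all errors at the end,
// instead of aborting the batch at the first failed operation
const ops = [
  { insertOne: { document: { sku: "a1", qty: 3 } } },
  { updateOne: { filter: { sku: "b2" }, update: { $inc: { qty: 1 } }, upsert: true } },
  { deleteOne: { filter: { sku: "c3" } } },
];
// await coll.bulkWrite(ops, { ordered: false, writeConcern: { w: 1 } });
```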
Mixed workloads:
- Separate read and write traffic using read preference routing
- Tune WiredTiger eviction behavior through `storage.wiredTiger.engineConfig.configString` if cache pressure is throttling writes
- Monitor cache dirty bytes and ensure checkpoint throughput keeps pace with write rate
How This Compares to Alternatives
| Feature | MongoDB | PostgreSQL | DynamoDB | Cassandra |
|---|---|---|---|---|
| Data model | Documents (BSON) | Relational (rows) | Key-value / document | Wide-column |
| Schema | Flexible (schema-on-read) | Strict (schema-on-write) | Flexible per item | Flexible per row |
| Query language | MongoDB Query Language (MQL) | SQL | PartiQL / API | CQL |
| Joins | $lookup (limited) | Full JOIN support | None | None |
| Transactions | Multi-document ACID (since 4.0) | Full ACID | Single-item or TransactWriteItems | Lightweight transactions |
| Replication | Replica set (Raft-based) | Streaming replication | Built-in (managed) | Masterless (Dynamo-style) |
| Sharding | Native range/hash sharding | Manual (Citus) or partitioning | Automatic (partition key) | Automatic (consistent hash) |
| Consistency | Tunable (per-operation) | Strong (serializable available) | Tunable (eventual or strong) | Tunable (per-query) |
| Managed option | Atlas | RDS / Aurora | Native AWS | Managed Cassandra (Keyspaces) |
A few notes on this comparison:
- MongoDB's `$lookup` is significantly slower than SQL JOINs for large datasets: it has historically used a nested-loop strategy rather than hash or merge joins (it can use an index on the foreign collection for equality matches, but there is no full join optimizer).
- DynamoDB has no equivalent to MongoDB's aggregation pipeline. Complex analytics require exporting to a data lake.
- Cassandra's lightweight transactions (LWT) are much more limited than MongoDB's multi-document transactions and carry a significant performance penalty.
When to Choose MongoDB
I reach for MongoDB when the data model is naturally document-shaped (nested objects, variable schemas across records), when I need flexible indexing across arbitrary fields, and when the team values developer productivity with a schemaless data model. MongoDB's aggregation pipeline is also surprisingly powerful for analytics that do not require multi-table joins.
I switch to PostgreSQL when I need complex joins across multiple entities, strong transactional guarantees across many tables, or when the data is fundamentally relational. I switch to DynamoDB when I need truly massive scale with single-digit millisecond latency and can design a strict access pattern around partition keys.
Schema Design: Embedding vs Referencing
MongoDB's document model gives you a choice that relational databases do not: embed related data in a single document or reference it from a separate collection (like a foreign key).
The embedding decision should follow these rules:
- Embed when the related data is always accessed together and has bounded growth
- Reference when the related data grows unboundedly, is frequently updated independently, or is accessed separately
- Hybrid (embed the common case, reference the edge case) when one access pattern dominates
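As a sketch of the two shapes (all field names hypothetical):

```javascript
// Embed: bounded data that is always read with its parent
const userDoc = {
  _id: 1,
  name: "Ada",
  address: { city: "London", zip: "EC1A" }, // one address, read with the profile
};

// Reference: unbounded growth (an activity feed) lives in its own collection,
// keyed back to the user -- one document per event, so the user document
// never grows toward the 16 MB BSON limit
const activityDoc = { userId: 1, type: "login", at: "2024-01-01T00:00:00Z" };
```

This split also matches the interview scenario in this article: profiles (embedded, single document) and activity feeds (referenced, one document per event).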
Read Preference: Routing Reads in a Replica Set
Read preference controls which replica set member a read operation targets:
| Read Preference | Behavior | Use Case |
|---|---|---|
| primary (default) | Always read from primary | Strong consistency required |
| primaryPreferred | Primary if available, secondary if not | Consistency preferred, availability fallback |
| secondary | Always read from a secondary | Analytics/reporting (stale data acceptable) |
| secondaryPreferred | Secondary if available, primary if not | Read scale-out with best-effort consistency |
| nearest | Read from the node with lowest network latency | Geo-distributed reads, latency-sensitive |
Hidden danger with secondaryPreferred
secondaryPreferred sounds safe ("prefer secondary, but use primary if needed"), but it creates a subtle problem: if all secondaries are slightly behind, your application consistently reads stale data. Worse, during an election when secondaries are catching up, data can appear to "go backward" (a read returns newer data, then the next read returns older data). For any data where consistency matters, use primary read preference with readConcern: "majority".
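The consistent-read configuration can be sketched as driver options (Node.js driver option shape):

```javascript
// Pin to the primary and only read majority-committed data; this trades
// read scale-out for monotonic, non-stale reads
const consistentRead = {
  readPreference: "primary",
  readConcern: { level: "majority" },
};
// coll.find({ _id: userId }, consistentRead)
```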
Interview Cheat Sheet
- When asked about storage format: "MongoDB stores BSON, not JSON. BSON is binary-encoded with type bytes and length prefixes, enabling O(1) field access without parsing the entire document. An ObjectId is 12 bytes: timestamp + random + counter, designed for roughly time-ordered B-tree inserts."
- When asked about the storage engine: "WiredTiger uses B-trees with MVCC. Readers see consistent snapshots without blocking writers. Data is compressed on disk (snappy/zstd) and written to a journal (WAL) that syncs every 100ms by default, with checkpoints every 60 seconds."
- When asked about replication: "Replica sets use a Raft-inspired protocol. The primary records writes in the oplog (a capped collection of idempotent operations). Secondaries tail the oplog. Elections take 10-12 seconds when a primary fails."
- When asked about consistency: "MongoDB's consistency is tunable per-operation. writeConcern controls how many replicas acknowledge a write. readConcern controls what snapshot a read sees. For true consistency, use w:majority + readConcern:majority."
- When asked about sharding: "Data is split into 128 MB chunks by shard key. mongos routes queries to the correct shard(s). A compound shard key with high-cardinality prefix provides both even distribution and targeted queries."
- When asked about transactions: "Since 4.0, MongoDB supports multi-document ACID transactions using MVCC snapshots. Transactions have a 60-second time limit and use optimistic concurrency control with automatic write conflict detection."
- When asked about indexes: "Follow the ESR rule: Equality fields first, then Sort fields, then Range fields. Use explain() to verify covered queries where all fields exist in the index, eliminating the FETCH stage."
- When asked about failure handling: "If primary fails, a new primary is elected in 10-12 seconds. Writes with w:1 on the old primary that were not replicated are rolled back (saved to rollback/ directory). Use w:majority to prevent data loss."
- When asked about performance: "Point reads by _id take 0.5-2ms. The key performance factor is whether the working set fits in WiredTiger cache (50% of RAM). If it does not, every cache miss hits disk."
- When asked about schema design: "Embed related data in a single document when possible. This avoids joins and enables atomic single-document operations. Use references (like foreign keys) only when documents would grow unboundedly or when many-to-many relationships require it."
Test Your Understanding
Quick Recap
- MongoDB stores documents as BSON (Binary JSON), a binary format with type bytes and length prefixes that enables O(1) field access without parsing the full document.
- WiredTiger stores collections as compressed B-trees with MVCC, allowing concurrent readers and writers without blocking each other.
- The journal (WAL) fsyncs every 100ms by default (storage.journal.commitIntervalMs) and checkpoints flush all dirty pages every 60 seconds, creating the crash recovery mechanism.
- Replica sets use a Raft-inspired protocol with oplog tailing for replication, with elections completing in 10-12 seconds when the primary fails.
- Write concern (`w`) controls how many replicas acknowledge a write; read concern controls what snapshot a read sees. Together they define the consistency model.
- Sharding distributes data across replica sets using a shard key. Chunk splitting (metadata-only) and balancing (data migration) maintain even distribution.
- The ESR rule (Equality, Sort, Range) defines optimal compound index field ordering for queries with mixed predicates and sorts.
- Multi-document transactions use MVCC snapshots with a 60-second time limit and optimistic concurrency control for write conflict detection.
- The biggest performance factor is whether the working set fits in the WiredTiger cache (typically 50% of RAM). Cache misses hit disk and increase latency by 10-100x.
- MongoDB is best suited for document-shaped data with flexible schemas. For heavy join workloads or strict relational integrity, PostgreSQL is a better fit.
- Change streams provide real-time, resumable notifications of data changes by tailing the oplog through a stable API, enabling event-driven architectures without polling.
- The SDAM protocol in MongoDB drivers continuously monitors replica set topology and automatically redirects operations to the correct primary after failover, making the application resilient to elections without custom retry logic.
Related Concepts
- How DynamoDB works internally: DynamoDB's partition-key routing and write-ahead log are similar to MongoDB's shard key routing and journal, but DynamoDB is fully managed with automatic scaling.
- How Cassandra works internally: Cassandra's masterless architecture and tunable consistency provide an interesting contrast to MongoDB's primary-based replication and write/read concern model.
- How PostgreSQL executes a query: PostgreSQL's query planner and B-tree indexes operate on similar principles to MongoDB's, but with the added complexity of table joins and SQL optimization.
- How Kafka works internally: Kafka's offset-based log consumption is conceptually similar to MongoDB's oplog tailing, and both systems use leader-based replication with follower consumers.
- Consistency models in distributed systems: MongoDB's tunable consistency (from eventual to linearizable) is a practical implementation of the theoretical consistency spectrum discussed in CAP theorem and PACELC framework.
- How Redis works internally: Redis's in-memory data structures and single-threaded event loop provide an interesting contrast to MongoDB's disk-based B-trees and multi-threaded WiredTiger engine. Many architectures use both: Redis for hot data caching, MongoDB for persistent document storage.