How MongoDB works internally
How MongoDB stores documents as BSON, uses WiredTiger's B-tree storage engine, replicates via Raft-based elections, and distributes data through chunk-based sharding.
The Interview Question
Interviewer: "Your system uses MongoDB to store user profiles and activity feeds. A user writes a profile update, and another service reads it 50ms later but gets stale data. Walk me through what actually happens inside MongoDB from the moment a write hits the server to when it becomes visible on a secondary replica. And how would you fix the stale read?"
This question tests whether you understand MongoDB beyond db.collection.find(). The interviewer wants to hear about the WiredTiger storage engine, the oplog replication mechanism, write concern vs read concern semantics, and how replica sets achieve consensus. Answering with "MongoDB is a document database that stores JSON" earns zero credit. Explaining the journal, the checkpoint cycle, oplog tailing, and the read/write concern matrix is what demonstrates real depth.
What to Clarify Before Answering
You: "Good question. Let me clarify the setup before I go deep..."
- "Are we running a replica set or a standalone instance? Stale reads only happen with replication."
- "What write concern is the application using? w:1 behaves very differently from w:majority."
- "Is the reading service hitting a secondary replica directly, or going through the primary?"
- "Should I cover the sharded case too, or focus on replica set internals first?"
Why this matters: These questions are not stalling. They reveal that MongoDB's consistency behavior changes dramatically based on configuration. A single-node MongoDB always returns the latest write. A replica set with w:1 and secondary reads can absolutely return stale data, and the interviewer's scenario is almost certainly testing whether you know why.
The 30-Second Answer
MongoDB stores every document as BSON (Binary JSON) in a WiredTiger B-tree, which uses MVCC (multiversion concurrency control) to let readers and writers operate concurrently without locking each other. When you write a document, WiredTiger first writes to an in-memory buffer and the journal (a write-ahead log), then periodically flushes to data files during checkpoints every 60 seconds. For replication, the primary records every write in the oplog (a capped collection), and secondaries tail this oplog to apply changes. The writeConcern controls how many replicas must acknowledge a write before the client gets a response, and the readConcern controls what snapshot a read sees. Together, these two knobs define MongoDB's consistency model, from fast-and-possibly-stale (w:1, secondary reads) to fully linearizable (w:majority, readConcern: linearizable).
The Architecture Overview
Looking at the diagram above, every write flows through the primary node's WiredTiger engine, gets journaled for crash safety, and is recorded in the oplog. Secondaries continuously pull from the oplog and replay operations against their own WiredTiger instances. The gap between when the primary records an operation and when a secondary applies it is the replication lag, which is the root cause of the stale read in the interview question.
The replica set uses a Raft-inspired consensus protocol for leader election. If the primary goes down, the remaining nodes hold an election and one of the secondaries becomes the new primary, typically within 10-12 seconds. During this window, writes are unavailable but data is safe on the surviving nodes.
BSON: The Document Storage Format
MongoDB does not store JSON. It stores BSON (Binary JSON), a binary-encoded serialization format that extends JSON with additional data types and is optimized for fast traversal and in-place updates.
Why Not Just Use JSON?
JSON has three problems for a database engine:
- Parsing cost: JSON is text. Every read requires string parsing, number conversion, and key lookup by string comparison. In a database handling millions of reads per second, this overhead is unacceptable.
- No native types: JSON has no date type, no binary type, no 64-bit integer, no decimal128. MongoDB needs all of these.
- No length prefix: To find a field in a JSON document, you must parse from the beginning. There is no way to jump to the 50th field without reading the first 49.
BSON solves all three. Each element carries a type byte, and variable-length values (strings, embedded documents, binary) carry a byte-length prefix, so the engine can skip over a field's value without parsing its contents.
BSON Binary Layout
// BSON document layout (simplified)
document ::= int32 (total byte length)
             element*
             0x00 (terminator)
element  ::= type_byte (1 byte: 0x01=double, 0x02=string, 0x03=embedded doc, ...)
             field_name (C string, null-terminated)
             value (type-dependent encoding with length prefix)
A 1 KB JSON document typically becomes 1.1-1.3 KB as BSON (the type bytes and length prefixes add overhead), but the performance gain from cheap field skipping and zero string parsing far outweighs the size increase.
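To make the length-prefix idea concrete, here is a hand-rolled sketch in plain Node.js (an illustration only; real applications should use an official BSON library). It encodes a two-field document with just two BSON types and then reads the second field by skipping the first field's value via its length prefix, without parsing the string's contents:

```javascript
// Minimal encoder for {name: <string>, age: <int32>} using two BSON types:
// 0x02 = UTF-8 string (int32 length prefix + bytes + NUL), 0x10 = int32.
function encodeNameAge(name, age) {
  const nameBytes = Buffer.from(name, "utf8");
  const parts = [];
  // element 1: string field "name"
  parts.push(Buffer.from([0x02]), Buffer.from("name\0", "ascii"));
  const lenBuf = Buffer.alloc(4);
  lenBuf.writeInt32LE(nameBytes.length + 1); // value length includes trailing NUL
  parts.push(lenBuf, nameBytes, Buffer.from([0x00]));
  // element 2: int32 field "age"
  parts.push(Buffer.from([0x10]), Buffer.from("age\0", "ascii"));
  const ageBuf = Buffer.alloc(4);
  ageBuf.writeInt32LE(age);
  parts.push(ageBuf);
  parts.push(Buffer.from([0x00])); // document terminator
  const body = Buffer.concat(parts);
  const total = Buffer.alloc(4);
  total.writeInt32LE(body.length + 4); // total length includes its own 4 bytes
  return Buffer.concat([total, body]);
}

// Walk the document; skip values we don't need using their length prefixes.
function readInt32Field(doc, wanted) {
  let pos = 4; // skip the total-length prefix
  while (doc[pos] !== 0x00) {
    const type = doc[pos++];
    const nameEnd = doc.indexOf(0x00, pos);
    const fieldName = doc.toString("ascii", pos, nameEnd);
    pos = nameEnd + 1;
    if (type === 0x02) {
      const strLen = doc.readInt32LE(pos); // prefix lets us skip without parsing
      pos += 4 + strLen;
    } else if (type === 0x10) {
      if (fieldName === wanted) return doc.readInt32LE(pos);
      pos += 4;
    } else {
      throw new Error("type not handled in this sketch");
    }
  }
  return undefined;
}

const doc = encodeNameAge("Alice", 42);
const age = readInt32Field(doc, "age"); // reads "age" without decoding "Alice"
```

The string "Alice" is never decoded on the way to `age`; the reader jumps over it in one addition.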
Why this matters in production
When you see MongoDB memory usage that seems too high for your dataset, remember that BSON documents stored in WiredTiger are compressed on disk but expanded in the WiredTiger cache. A 100 GB collection compressed to 30 GB on disk still needs significant cache space for active working set pages in their uncompressed BSON form.
Key BSON Types
| Type | BSON ID | Size | Notes |
|---|---|---|---|
| Double | 0x01 | 8 bytes | IEEE 754 |
| String | 0x02 | 4 + len + 1 bytes | Length-prefixed UTF-8 |
| Embedded document | 0x03 | Variable | Nested BSON |
| Array | 0x04 | Variable | BSON with "0", "1", "2" as keys |
| Binary | 0x05 | 4 + 1 + len bytes | For raw bytes, UUIDs |
| ObjectId | 0x07 | 12 bytes | Timestamp + random + counter |
| Date | 0x09 | 8 bytes | Milliseconds since Unix epoch |
| 64-bit integer | 0x12 | 8 bytes | Signed |
| Decimal128 | 0x13 | 16 bytes | IEEE 754 decimal (financial math) |
The ObjectId deserves special attention. It is 12 bytes: 4 bytes of Unix timestamp, 5 bytes of random value (unique per process), and 3 bytes of incrementing counter. This design means ObjectIds are roughly time-ordered, which keeps B-tree inserts mostly sequential and avoids random page splits. This is not an accident. It is a deliberate performance optimization.
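Because the timestamp occupies the first 4 bytes, you can recover a document's approximate creation time from its _id alone, with no extra field. A minimal sketch (the hex string below is a made-up example ObjectId):

```javascript
// Extract the embedded Unix timestamp from a 24-hex-char ObjectId string.
// Bytes 0-3 are big-endian seconds since the epoch, so the first 8 hex
// characters decode directly to a time.
function objectIdToDate(hexId) {
  const seconds = parseInt(hexId.slice(0, 8), 16);
  return new Date(seconds * 1000);
}

const created = objectIdToDate("65920c80aabbccddeeff0011");
// created -> a Date shortly after midnight UTC on 2024-01-01
```

This is why sorting on _id gives a rough insertion-time ordering for free.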
WiredTiger: The Storage Engine
WiredTiger is the default storage engine since MongoDB 3.2, and it is the only production-supported engine since 4.2. Understanding WiredTiger is essential because it controls how every read and write physically happens on disk.
B-Tree Structure
WiredTiger stores each collection and index as a B-tree. (The WiredTiger library also implements an LSM tree, but MongoDB uses the B-tree layout.) Each leaf page in the B-tree contains multiple BSON documents. Internal pages contain key ranges and pointers to child pages.
Each leaf page is typically 4 KB to 32 KB and is compressed on disk using either snappy (default), zlib, or zstd compression. When MongoDB reads a document, WiredTiger locates the correct leaf page through the B-tree, decompresses it into the in-memory cache, and returns the requested document.
The B-tree structure means:
- Point lookups (find({_id: X})) are O(log n) in the number of documents
- Range scans (find({age: {$gte: 25, $lte: 35}})) follow the leaf page chain sequentially
- Inserts at the end of the key space (like ObjectId-based _id) are efficient because they always go to the rightmost leaf page
MVCC: How Reads and Writes Coexist
WiredTiger uses multiversion concurrency control (MVCC). Every write creates a new version of the document rather than overwriting the old version in place. Readers see a consistent snapshot of the data as of the time their operation started, without blocking writers.
// Simplified MVCC timeline
//
// Time T1: Document {_id: 1, name: "Alice", version: 5} exists
// Time T2: Writer updates name to "Bob", creating version 6
// Time T3: Reader with snapshot at T1 still sees "Alice" (version 5)
// Time T4: New reader sees "Bob" (version 6)
// Time T5: WiredTiger evicts version 5 (no more readers need it)
This is critical for performance. In older storage engines (like the deprecated MMAPv1), a write lock on a collection blocked all readers. With WiredTiger's MVCC, a long-running aggregation pipeline reading millions of documents never blocks incoming writes, and writes never block reads.
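The timeline above can be modeled as a version chain keyed by commit timestamp. This toy sketch (my own simplification, not WiredTiger's actual data structures) shows why a reader holding an old snapshot keeps seeing the old value while new readers see the new one:

```javascript
// Toy MVCC store: each key maps to a list of {ts, value} versions.
class MvccStore {
  constructor() { this.versions = new Map(); this.clock = 0; }
  write(key, value) {
    const ts = ++this.clock;           // each write commits at a new timestamp
    if (!this.versions.has(key)) this.versions.set(key, []);
    this.versions.get(key).push({ ts, value });
    return ts;
  }
  snapshot() { return this.clock; }    // a reader pins the current timestamp
  read(key, snapTs) {
    const chain = this.versions.get(key) || [];
    // Visible version = latest one committed at or before the snapshot
    const visible = chain.filter(v => v.ts <= snapTs);
    return visible.length ? visible[visible.length - 1].value : undefined;
  }
}

const store = new MvccStore();
store.write("user:1", "Alice");     // version committed at ts=1
const oldSnap = store.snapshot();   // a long-running read pins ts=1
store.write("user:1", "Bob");       // writer is never blocked by the reader
const newSnap = store.snapshot();

const staleView = store.read("user:1", oldSnap); // still "Alice"
const freshView = store.read("user:1", newSnap); // "Bob"
```

The old version cannot be garbage-collected until `oldSnap` is released, which is exactly the cache-pressure problem described below.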
What most people get wrong
MVCC means MongoDB keeps multiple versions of documents in memory simultaneously. Under heavy write load with long-running reads, the WiredTiger cache can fill up with old versions that cannot be evicted because an open cursor still references them. This is the "cache pressure" problem. If you see WiredTiger eviction warnings in the logs, check for long-running queries holding old snapshots open.
The Journal: Crash Safety
The journal is WiredTiger's write-ahead log (WAL). Every write goes to the journal before it is considered durable. The journal is fsynced to disk every 100ms by default (configurable via storage.journal.commitIntervalMs), and immediately for writes that request j: true.
Here is what happens when MongoDB crashes and restarts:
- WiredTiger loads the last checkpoint (a consistent snapshot of all B-trees, taken every 60 seconds)
- It replays all journal entries written after that checkpoint
- The database is now in a consistent state with no data loss (assuming the journal was fsynced)
The maximum data loss window on a standalone node is roughly the last 100ms of writes (between the last journal fsync and the crash). In practice, most deployments use w:majority with journaling, which means the write is not acknowledged until it is journaled on a majority of replica set members.
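The recovery steps above reduce to "load checkpoint, replay newer journal entries." A sketch under the simplifying assumption that journal entries are timestamped key-value sets:

```javascript
// Crash recovery: recovered state = last checkpoint + journal entries
// written after the checkpoint's timestamp.
function recover(checkpoint, journal) {
  const state = { ...checkpoint.data };
  for (const entry of journal) {
    if (entry.ts > checkpoint.ts) {   // older entries are already in the checkpoint
      state[entry.key] = entry.value; // replaying a set is idempotent
    }
  }
  return state;
}

const checkpoint = { ts: 100, data: { a: 1, b: 2 } };
const journal = [
  { ts: 90,  key: "a", value: 0 }, // already reflected in the checkpoint; skipped
  { ts: 110, key: "b", value: 3 }, // not yet in data files; recovered from journal
  { ts: 120, key: "c", value: 4 },
];
const recovered = recover(checkpoint, journal);
```

Any write journaled after the last fsync and before the crash is the (bounded) loss window.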
Checkpoints: The 60-Second Flush
Every 60 seconds (or when the journal reaches 2 GB), WiredTiger writes a checkpoint. A checkpoint is a consistent snapshot of all B-tree pages to disk. This is a heavy I/O operation because it flushes all dirty pages.
// Checkpoint flow (simplified)
function writeCheckpoint():
    // 1. Mark current state as checkpoint boundary
    snapshot = getCurrentMVCCSnapshot()
    // 2. Write all dirty pages to disk (background I/O)
    for each dirtyPage in cache:
        compressedData = compress(dirtyPage, algorithm="snappy")
        writeToDataFile(compressedData)
    // 3. Update the checkpoint metadata
    updateCheckpointTimestamp(snapshot.timestamp)
    // 4. Old journal entries before this checkpoint can be recycled
    recycleJournalBefore(snapshot.timestamp)
Why this matters in production
The 60-second checkpoint cycle is why MongoDB can show periodic I/O spikes. If your disk throughput is barely sufficient for normal operations, checkpoints will cause latency spikes every minute. I always recommend provisioned IOPS (on cloud) or NVMe SSDs (on-premise) for MongoDB. Spinning disks are not viable for any production MongoDB workload.
Compression
WiredTiger compresses data at the page level before writing to disk. The compression options are:
| Algorithm | Compression Ratio | CPU Cost | Best For |
|---|---|---|---|
| snappy (default) | ~2-4x | Very low | General purpose, balanced |
| zstd | ~3-6x | Low-medium | Large datasets where storage cost matters |
| zlib | ~3-5x | Medium-high | Legacy, prefer zstd instead |
| none | 1x | Zero | Very low latency requirements |
I recommend zstd for most production workloads. It provides significantly better compression than snappy with only marginally higher CPU cost. The storage savings (and reduced I/O) typically more than offset the CPU overhead.
Index pages use prefix compression by default, which is extremely efficient because consecutive index entries often share long key prefixes. A collection of 100 million documents with a compound index on {country: 1, city: 1} compresses dramatically because thousands of consecutive entries share the same country value.
The Write Path: From Driver to Disk
Let me walk through exactly what happens when your application calls insertOne() or updateOne().
The important detail is that the document is written to memory (the WiredTiger cache) and the journal, but not immediately to the data files. The data files are updated during the next checkpoint (every 60 seconds). This is why the journal exists: if MongoDB crashes between checkpoints, the journal entries replay the missing writes.
For updates, WiredTiger does not modify the document in place. It creates a new version of the B-tree page containing the updated document. The old page version remains available for any concurrent readers using MVCC snapshots.
Memory Management and Cache Eviction
WiredTiger manages its own memory through the WiredTiger cache, which is separate from the operating system's filesystem cache. By default, the cache size is the larger of 256 MB or 50% of (total RAM minus 1 GB).
When the cache fills up, WiredTiger evicts pages using an LRU-based eviction policy:
- Clean pages (unmodified since last checkpoint) are simply discarded. They can be re-read from the data files.
- Dirty pages (modified since last checkpoint) must be written to disk before eviction. This is called "eviction I/O" and appears as increased disk writes in monitoring.
- Pages with old MVCC versions cannot be evicted until all readers referencing those versions close their cursors.
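These three rules can be sketched as a single eviction decision (a toy model; WiredTiger's real eviction server is far more elaborate):

```javascript
// Toy eviction decision: clean pages are discarded for free, dirty pages must
// be flushed first ("eviction I/O"), pinned pages cannot be evicted at all.
function evictPage(page, diskWrites) {
  if (page.pinnedByReaders > 0) {
    return false;               // old MVCC versions still referenced by a cursor
  }
  if (page.dirty) {
    diskWrites.push(page.id);   // write back before discarding
  }
  return true;                  // page removed from cache either way
}

const diskWrites = [];
const cleanOk = evictPage({ id: "p1", dirty: false, pinnedByReaders: 0 }, diskWrites);
const dirtyOk = evictPage({ id: "p2", dirty: true,  pinnedByReaders: 0 }, diskWrites);
const blocked = evictPage({ id: "p3", dirty: true,  pinnedByReaders: 1 }, diskWrites);
// Only p2 generated a disk write; p3 stays in cache until its reader closes.
```

The `blocked` case is the cache-pressure scenario from the MVCC section: a single long-running cursor can pin pages no amount of eviction effort can free.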
The WiredTiger cache hit ratio is the single most important metric for MongoDB performance. If it drops below 95%, the system is spending too much time reading from disk instead of serving from memory.
// Key WiredTiger cache metrics (from db.serverStatus().wiredTiger.cache)
{
"bytes currently in the cache": 6442450944, // Current cache usage
"maximum bytes configured": 8589934592, // Cache size limit
"pages read into cache": 1542367, // Cache misses
"pages written from cache": 834291, // Dirty evictions
"tracked dirty bytes in the cache": 268435456, // Dirty data
"unmodified pages evicted": 2103847, // Clean evictions
"modified pages evicted": 98234, // Forced dirty evictions
"percentage overhead": 8 // Internal metadata overhead
}
Why this matters in production
I always set up alerts on two WiredTiger metrics: cache utilization (should stay below 80%) and dirty bytes percentage (should stay below 20% of cache size). When dirty bytes exceed 20%, WiredTiger triggers aggressive eviction, which competes with application I/O and causes latency spikes. If you see this pattern, your write throughput is exceeding the checkpoint's ability to flush dirty pages.
Replica Sets: How MongoDB Replicates Data
A replica set is a group of mongod processes that maintain the same data set. One node is the primary (accepts writes), and the rest are secondaries (replicate from the primary and can serve reads).
Oplog: The Replication Stream
Every write to the primary is recorded in the oplog (operations log), a special capped collection in the local database. The oplog stores idempotent operations, meaning each entry can be applied multiple times with the same result.
// Example oplog entry
{
"ts": Timestamp(1704067200, 1), // Operation timestamp (seconds + ordinal)
"t": NumberLong(5), // Election term
"h": NumberLong(0), // Legacy hash (unused since 4.0)
"v": 2, // Oplog version
"op": "u", // Operation type: i=insert, u=update, d=delete
"ns": "mydb.users", // Namespace (database.collection)
"o2": {"_id": ObjectId("...")}, // Query selector (for updates)
"o": { // Operation payload
"$v": 2,
"diff": {"u": {"name": "Bob"}} // Update diff (not full replacement)
}
}
Secondaries tail the primary's oplog, pulling new entries and applying them to their own WiredTiger instances. This is similar to how MySQL's binlog replication works, but with two key differences:
- The oplog is idempotent by design. If a secondary applies the same entry twice (due to a network retry), the result is identical. MySQL's statement-based replication does not have this guarantee.
- The oplog uses timestamps with ordinals, not sequential IDs. This allows the replication protocol to handle clock skew and out-of-order delivery gracefully.
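The idempotency point is worth seeing concretely. Conceptually, the primary may execute an operator like $inc, but the oplog records the resulting field values as a diff, so re-applying the same entry is harmless. A sketch using a simplified entry shape:

```javascript
// Apply a (simplified) v2-style oplog update entry: merge the diff's updated
// fields into the document. Because the diff stores the *result* (score: 15),
// not the operation ($inc: 5), applying it twice yields the same document.
function applyOplogEntry(doc, entry) {
  return { ...doc, ...entry.o.diff.u };
}

const entry = {
  op: "u",
  ns: "mydb.users",
  o2: { _id: 1 },
  o: { diff: { u: { score: 15 } } }, // primary ran {$inc: {score: 5}} on score: 10
};

const once  = applyOplogEntry({ _id: 1, score: 10 }, entry);
const twice = applyOplogEntry(once, entry); // network retry re-delivers the entry
// once and twice are identical -- the retry did no damage
```

Had the oplog stored `$inc: 5` verbatim, the retry would have produced `score: 20` and silently corrupted the secondary.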
Oplog Sizing and the Replication Window
The oplog is a capped collection, meaning it has a fixed maximum size. When it fills up, the oldest entries are overwritten. The replication window is the time span between the oldest and newest oplog entries. If a secondary falls behind more than the replication window, it can no longer catch up by tailing the oplog and must perform a full resync (copying the entire dataset from the primary).
// Check oplog size and replication window
rs.printReplicationInfo()
// Output:
// configured oplog size: 2048 MB
// log length start to end: 172800 secs (48 hrs)
// oplog first event time: Mon Jan 01 2024 00:00:00
// oplog last event time: Wed Jan 03 2024 00:00:00
I recommend an oplog size that provides at least 48-72 hours of replication window. This gives you time to recover a downed secondary without triggering a full resync. For write-heavy workloads, you may need to increase the oplog to 5-10% of your data size.
Replication Lag and Monitoring
Replication lag is the delay between when a write is recorded in the primary's oplog and when it is applied on a secondary. In a healthy replica set with low write throughput, replication lag is under 100ms. Under heavy write load or when a secondary has slower hardware, lag can grow to seconds or even minutes.
// Check replication lag
rs.printSecondaryReplicationInfo()
// Output:
// source: secondary1:27017
// syncedTo: Mon Jan 03 2024 00:00:00
// 0 secs (0 hrs) behind the primary
// source: secondary2:27017
// syncedTo: Mon Jan 02 2024 23:59:55
// 5 secs (0 hrs) behind the primary
When replication lag exceeds a few seconds, reads with readPreference: secondaryPreferred return increasingly stale data. This is exactly the scenario in the opening interview question: the application reads from a secondary that has not yet applied the recent write.
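The opening interview scenario reduces to arithmetic on timestamps: a secondary read is stale exactly when replication lag exceeds the gap between write and read. A toy model:

```javascript
// What a secondary read observes, given when the write committed on the
// primary, the secondary's replication lag, and when the read happens (ms).
function secondaryRead(writeTimeMs, lagMs, readTimeMs, oldValue, newValue) {
  const appliedOnSecondaryAt = writeTimeMs + lagMs;
  return readTimeMs >= appliedOnSecondaryAt ? newValue : oldValue;
}

// Write at t=0, secondary lags 200ms, read arrives at t=50 -> stale
const readAt50  = secondaryRead(0, 200, 50,  "old profile", "new profile");
// Same read retried at t=250 -> fresh
const readAt250 = secondaryRead(0, 200, 250, "old profile", "new profile");
```

The fixes discussed below (write concern, read concern, causal consistency) all work by either shrinking the lag term or making the read wait it out.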
Raft-Inspired Election Protocol
MongoDB's replica set election protocol is based on the Raft consensus algorithm (with modifications). When the primary becomes unreachable, the following sequence occurs:
- Secondaries detect the primary is unreachable after the electionTimeoutMillis period (default 10 seconds)
- An eligible secondary calls an election by incrementing its term and requesting votes from all replica set members
- Each member votes for the first candidate whose oplog is at least as up-to-date as its own (the "log completeness" check)
- If a candidate receives votes from a majority of members, it becomes the new primary
- The new primary starts accepting writes in the new term
The election typically completes in 10-12 seconds: 10 seconds for the timeout detection plus 1-2 seconds for the election itself. During this window, the application cannot write to the replica set. The MongoDB drivers detect the topology change and automatically redirect writes to the new primary once the election completes.
What most people get wrong
Setting electionTimeoutMillis too low (like 2 seconds) does not make failover faster in a good way. It causes spurious elections during network blips, which means your primary flips back and forth, causing w:majority writes to fail during each transition. I recommend keeping the default 10 seconds unless you have extremely stable networking and have tested thoroughly.
Write Concern: Controlling Durability
The writeConcern parameter on each write operation controls when the server acknowledges the write:
| Write Concern | Meaning | Durability | Latency |
|---|---|---|---|
| w: 0 | Fire and forget, no acknowledgment | None | Lowest (~0.5ms) |
| w: 1 | Primary acknowledges after applying the write in memory (journal fsync not required unless j: true) | Single node | Low (~1-2ms) |
| w: "majority" | Majority of replicas acknowledge | Survives primary failure | Medium (~5-15ms) |
| w: 3 | Exactly 3 nodes acknowledge | Explicit count | Higher |
| j: true | Wait for journal fsync (combinable with w) | Journal-durable | Adds fsync latency (bounded by commitIntervalMs) |
The critical insight is that w:1 (the server default before MongoDB 5.0 changed it to w: "majority", and still common in older deployments) means the write is only durable on the primary. If the primary crashes before the secondaries replicate the write, that write is lost. This is not a bug. It is the expected behavior of w:1. For any data you cannot afford to lose, use w: "majority".
Read Concern: Controlling Consistency
The readConcern parameter controls what data a read operation can see:
| Read Concern | What You See | Use Case |
|---|---|---|
| "local" (default) | Latest data on the queried node (may be rolled back) | Low-latency reads where occasional stale data is acceptable |
| "available" | Same as local but faster on sharded clusters (skips orphan filtering) | Analytics queries that tolerate stale data |
| "majority" | Only data committed to a majority of replicas | Reads that must not see data that could be rolled back |
| "linearizable" | Latest majority-committed data plus a confirmation that the primary is still the leader | Critical reads that need true linearizability (e.g., leader elections, locks) |
| "snapshot" | A consistent snapshot across shards (used in multi-document transactions) | Transactions |
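The standard fix for the opening question's stale read, beyond forcing primary reads, is a causally consistent session: the driver attaches the cluster time of the acknowledged write (afterClusterTime) to the follow-up read, and the secondary delays the read until it has replicated past that point. Real drivers manage the tokens automatically on a session object; this toy model (my own names) shows only the gating logic:

```javascript
// Toy secondary that tracks the highest oplog timestamp it has applied.
class ToySecondary {
  constructor() { this.appliedOpTime = 0; this.data = new Map(); }
  replicate(opTime, key, value) {
    this.data.set(key, value);
    this.appliedOpTime = opTime;
  }
  // Causally consistent read: refuse to answer from a point in time
  // earlier than the client's afterClusterTime token.
  read(key, afterClusterTime) {
    if (this.appliedOpTime < afterClusterTime) {
      return { ok: false, reason: "waiting for replication" };
    }
    return { ok: true, value: this.data.get(key) };
  }
}

const secondary = new ToySecondary();
const writeOpTime = 7; // cluster time the primary returned with the write ack

const before = secondary.read("profile:1", writeOpTime); // lagging: must wait
secondary.replicate(writeOpTime, "profile:1", "new profile");
const after = secondary.read("profile:1", writeOpTime);  // now guaranteed fresh
```

The read either blocks briefly (paying the lag once) or succeeds with data at least as new as the client's own write, which is exactly "read your own writes."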
Sharding: How MongoDB Distributes Data
When a single replica set cannot handle the data volume or throughput, MongoDB uses sharding to distribute data across multiple replica sets (called shards).
Sharded Cluster Architecture
The three key components:
- mongos (router): A stateless process that routes queries to the correct shard(s). Applications connect to mongos, not directly to shards. You can run as many mongos instances as you need behind a load balancer.
- Config servers: A replica set that stores the chunk-to-shard mapping. The mongos caches this mapping locally and refreshes it when it detects a stale route (a chunk has moved).
- Shards: Each shard is a replica set holding a subset of the data. Data is divided into contiguous ranges called chunks (default 128 MB each), and each chunk lives on exactly one shard.
Shard Key Selection
The shard key is the most important decision in a sharded MongoDB deployment. It determines how data is distributed, how queries are routed, and ultimately the performance ceiling of your cluster.
Chunk Splitting and Balancing
When a chunk grows beyond the configured maximum size (default 128 MB), MongoDB automatically splits it into two smaller chunks at the midpoint of the key range. Splitting is a metadata-only operation on the config server; it does not move any data.
The balancer is a background process that runs on the primary of the config server replica set (since MongoDB 3.4; earlier versions ran it on a mongos). It monitors the chunk count per shard and migrates chunks from overloaded shards to underloaded shards to maintain an even distribution. A migration involves:
- The destination shard copies all documents in the chunk from the source shard
- During the copy, any new writes to the chunk are forwarded to both shards
- Once the copy is complete, the config server updates the chunk map
- The source shard deletes its copy of the chunk data
Balancer impact on production
Chunk migrations are expensive I/O operations. Each migration reads the entire chunk from the source shard and writes it to the destination shard. On a busy cluster, I recommend scheduling the balancer to run only during off-peak hours using sh.setBalancerState() and the balancer window configuration.
Targeted vs Scatter-Gather Queries
When a query includes the shard key, mongos can route it to the exact shard that holds the data. This is a targeted query and has the same performance as querying an unsharded collection.
When a query does not include the shard key, mongos must send it to all shards and merge the results. This is a scatter-gather query and its latency is dominated by the slowest shard. For a cluster with 10 shards, a scatter-gather query does 10x the work of a targeted query.
// Targeted query (shard key included)
db.events.find({ tenantId: "acme", timestamp: { $gte: ISODate("2024-01-01") } })
// mongos routes to the shard that owns tenantId "acme"
// Scatter-gather query (no shard key)
db.events.find({ eventType: "login" })
// mongos sends to ALL shards, merges results
// Latency = max(shard1_time, shard2_time, ..., shardN_time)
This is why shard key selection is the most critical decision. If 80% of your queries do not include the shard key, you are doing scatter-gather on 80% of traffic, and each query generates N network hops instead of 1.
Jumbo Chunks and the Balancer's Limits
A jumbo chunk is a chunk that exceeds the maximum size (128 MB) but cannot be split further because all documents in the chunk share the same shard key value. This happens when the shard key has low cardinality (e.g., sharding on {country: 1} when 60% of data is from one country).
Jumbo chunks cannot be split or migrated, which leads to an unbalanced cluster. The shard holding the jumbo chunk receives disproportionate traffic and storage. This is one of the most common MongoDB sharding mistakes and is very difficult to fix after the fact (it often requires resharding the entire collection).
To avoid jumbo chunks:
- Choose a shard key with high cardinality (millions of distinct values)
- Use a compound shard key where the combination has high cardinality
- Never shard on a field with fewer than 1,000 distinct values
Query Execution and Indexes
MongoDB's query system converts a find() or aggregate() call into an execution plan, similar to how a SQL database processes a SELECT statement.
The Query Planner
When a query arrives, the query planner:
- Identifies all candidate indexes that could satisfy the query
- For each candidate, generates an execution plan
- Runs the candidate plans for a short trial period, bounded by a fixed work budget (on the order of 10,000 index keys or documents examined)
- Picks the plan that returned results with the least work (fewest index keys examined)
- Caches the winning plan for the query shape
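The selection step amounts to keeping the most productive plan from the trial. A sketch with made-up trial statistics (the real scoring formula weighs more factors, such as whether a plan finished and whether it needs a blocking sort):

```javascript
// Pick the candidate plan that produced the most results per unit of work
// during its trial run -- a simplification of the planner's real score.
function pickWinningPlan(trialResults) {
  return trialResults.reduce((best, plan) =>
    plan.nReturned / plan.keysExamined > best.nReturned / best.keysExamined
      ? plan
      : best);
}

const trials = [
  { indexName: "age_1",        keysExamined: 9400, nReturned: 101 }, // wasteful
  { indexName: "age_1_city_1", keysExamined: 120,  nReturned: 101 }, // selective
];
const winner = pickWinningPlan(trials); // the compound index wins
```

The winner is then cached per query shape, which is why a plan chosen under unrepresentative data can linger until the cache is invalidated.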
// Example: explain output showing query plan
db.users.find({age: {$gte: 25}, city: "NYC"}).explain("executionStats")
// Output (simplified):
{
"winningPlan": {
"stage": "FETCH", // Fetch full documents
"inputStage": {
"stage": "IXSCAN", // Index scan
"indexName": "age_1_city_1", // Compound index used
"keyPattern": {"age": 1, "city": 1},
"direction": "forward",
"indexBounds": {
"age": ["[25, Infinity]"],
"city": ["[\"NYC\", \"NYC\"]"]
}
}
},
"executionStats": {
"totalKeysExamined": 1542, // Index entries scanned
"totalDocsExamined": 1542, // Documents fetched
"nReturned": 1542 // Results returned
}
}
The ideal query has totalKeysExamined == totalDocsExamined == nReturned. If totalDocsExamined is much higher than nReturned, you are scanning documents that do not match the query, which means your index is not selective enough.
Index Types
| Index Type | Syntax | Use Case |
|---|---|---|
| Single field | {age: 1} | Equality and range queries on one field |
| Compound | {country: 1, city: 1, age: 1} | Multi-field queries following ESR rule |
| Multikey | {tags: 1} | Indexing array fields |
| Text | {content: "text"} | Full-text search (limited, prefer Atlas Search) |
| Geospatial | {location: "2dsphere"} | Location-based queries |
| Hashed | {userId: "hashed"} | Even distribution for shard keys |
| Wildcard | {"$**": 1} | Indexing unknown/dynamic field names |
| TTL | {createdAt: 1}, {expireAfterSeconds: 86400} | Auto-delete expired documents |
The ESR Rule for Compound Indexes
The order of fields in a compound index matters enormously. The ESR rule (Equality, Sort, Range) defines the optimal field ordering:
- Equality fields first: fields queried with exact match ({status: "active"})
- Sort fields next: fields used in .sort() ({createdAt: -1})
- Range fields last: fields queried with ranges ({age: {$gte: 25}})
// Query: active users in NYC, sorted by newest, aged 25+
db.users.find({
status: "active", // Equality
city: "NYC", // Equality
age: {$gte: 25} // Range
}).sort({createdAt: -1}) // Sort
// Optimal compound index following ESR:
db.users.createIndex({
status: 1, // E: equality
city: 1, // E: equality
createdAt: -1, // S: sort
age: 1 // R: range
})
Covered Queries
A covered query is one where all fields in the query filter, sort, and projection exist in the index. MongoDB can answer the query entirely from the index without fetching the full document from the B-tree. This eliminates the FETCH stage and can improve performance by 2-5x.
// Covered query: all fields in the index
db.users.find(
{status: "active", city: "NYC"},
{_id: 0, status: 1, city: 1, createdAt: 1} // Projection matches index
).sort({createdAt: -1})
// With index: {status: 1, city: 1, createdAt: -1}
// explain() shows: "stage": "IXSCAN" with NO "FETCH" stage
The Aggregation Pipeline
MongoDB's aggregation framework processes documents through a pipeline of stages. Each stage transforms the documents and passes them to the next stage.
db.orders.aggregate([
{ $match: { status: "completed", date: { $gte: ISODate("2024-01-01") } } },
{ $group: { _id: "$customerId", total: { $sum: "$amount" }, count: { $sum: 1 } } },
{ $sort: { total: -1 } },
{ $limit: 10 }
])
The query optimizer pushes $match stages as early as possible in the pipeline and coalesces adjacent $sort and $limit stages. The $match at the beginning can use indexes, but once documents pass through $group, they are in-memory objects that cannot use indexes.
For sharded collections, the aggregation pipeline splits into two phases:
- Shard phase: $match and $group run on each shard in parallel
- Merge phase: Results from all shards are merged on the mongos (or a designated shard) for the final $sort and $limit
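The two-phase split can be sketched as partial aggregation followed by a merge of partial sums, the same shape as a map-reduce combine step (toy model, in-memory data):

```javascript
// Shard phase: each shard groups its *own* documents by customerId.
function shardPhase(docs) {
  const groups = new Map();
  for (const d of docs) {
    const g = groups.get(d.customerId) || { total: 0, count: 0 };
    g.total += d.amount;
    g.count += 1;
    groups.set(d.customerId, g);
  }
  return groups;
}

// Merge phase (on mongos): combine the partial results from every shard.
function mergePhase(partials) {
  const merged = new Map();
  for (const part of partials) {
    for (const [id, g] of part) {
      const m = merged.get(id) || { total: 0, count: 0 };
      m.total += g.total;
      m.count += g.count;
      merged.set(id, m);
    }
  }
  return merged;
}

const shard1 = shardPhase([{ customerId: "c1", amount: 10 },
                           { customerId: "c2", amount: 5 }]);
const shard2 = shardPhase([{ customerId: "c1", amount: 20 }]);
const result = mergePhase([shard1, shard2]); // c1 totals 30 across shards
```

This works because $sum is decomposable into partial sums; an operator like $stdDevPop needs more bookkeeping in the shard phase for the same reason.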
Change Streams: Real-Time Data Notifications
Change streams allow applications to subscribe to real-time data changes on a collection, database, or entire deployment. Under the hood, change streams are built on top of the oplog. The driver opens a tailable cursor on the oplog and filters for events matching the specified collection.
// Watch for changes on the orders collection
const changeStream = db.collection("orders").watch(
  [ { $match: { "fullDocument.status": "completed" } } ],
  { fullDocument: "updateLookup" } // without this, update events omit fullDocument
);
changeStream.on("change", (event) => {
  // event.operationType: "insert", "update", "replace", "delete"
  // event.fullDocument: the complete document (always for insert/replace;
  //   for updates only with the updateLookup option above)
  // event.updateDescription: { updatedFields, removedFields } (for updates)
  console.log("Order completed:", event.fullDocument._id);
});
Change streams are resumable. Each event includes a _id field (the resume token). If the client disconnects, it can reconnect with the resume token and pick up where it left off without missing events. This is significantly more robust than polling for changes and is the recommended approach for event-driven architectures with MongoDB.
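Resumability is a cursor-position contract: the consumer durably stores the last token it processed and, on reconnect, replays strictly after that token. A sketch over an in-memory event log standing in for the oplog:

```javascript
// Toy change stream: events carry monotonically increasing resume tokens (_id).
const eventLog = [
  { _id: 1, operationType: "insert" },
  { _id: 2, operationType: "update" },
  { _id: 3, operationType: "delete" },
];

// Reconnect with the last token we durably processed; replay everything after.
function resumeAfter(log, token) {
  return log.filter(e => e._id > token);
}

const missedEvents = resumeAfter(eventLog, 1); // client crashed after _id 1
// missedEvents contains the update and delete; nothing is skipped or repeated
```

The guarantee holds only while the underlying oplog still contains the token's position, which is one more reason to size the oplog window generously.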
Why this matters in production
Change streams replace the old (and fragile) practice of tailing the oplog directly. Direct oplog tailing required knowledge of internal oplog format and broke across MongoDB version upgrades. Change streams provide a stable, documented API with the same real-time capabilities. I use them for cache invalidation, search index updates, and event sourcing patterns.
Transactions: Multi-Document ACID
Since MongoDB 4.0 (replica sets) and 4.2 (sharded clusters), MongoDB supports multi-document ACID transactions. This was the most requested feature in MongoDB's history, because without it, operations spanning multiple documents had no atomicity guarantee.
How Transactions Work Internally
A transaction in MongoDB uses WiredTiger's MVCC to create a consistent snapshot at transaction start. All reads within the transaction see this snapshot, and all writes are buffered until commit.
Key implementation details:
- Snapshot isolation: The transaction reads from the snapshot taken at transaction start. If another transaction modifies the same documents in the meantime, the first transaction gets a WriteConflict error and the application must retry.
- 60-second limit: Transactions that run longer than transactionLifetimeLimitSeconds (default 60 seconds) are automatically aborted. This is intentional: long transactions hold MVCC snapshots open, preventing WiredTiger from reclaiming old document versions.
- Oplog entry: The entire transaction is recorded as a single applyOps oplog entry, ensuring secondaries replay the transaction atomically.
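Snapshot isolation plus conflict detection can be modeled with per-document version numbers: a transaction's write is accepted only if the document has not changed since the transaction's snapshot. A toy model (my own structure, not WiredTiger's):

```javascript
// Each document carries a version; a transaction records the versions it read
// at snapshot time, and its commit is rejected if any of them have moved on.
function tryCommit(store, txn) {
  for (const w of txn.writes) {
    if (store[w.key].version !== txn.snapshotVersions[w.key]) {
      return { ok: false, error: "WriteConflict" }; // caller retries the txn
    }
  }
  for (const w of txn.writes) {
    store[w.key] = { value: w.value, version: store[w.key].version + 1 };
  }
  return { ok: true };
}

const store = { doc1: { value: "a", version: 1 } };
const snapshot = { doc1: 1 }; // both transactions started from version 1
const txnA = { snapshotVersions: snapshot, writes: [{ key: "doc1", value: "fromA" }] };
const txnB = { snapshotVersions: snapshot, writes: [{ key: "doc1", value: "fromB" }] };

const first  = tryCommit(store, txnA); // succeeds, bumps doc1 to version 2
const second = tryCommit(store, txnB); // version moved -> WriteConflict
```

This is optimistic concurrency: neither transaction takes a lock, so the loser pays with a retry instead of everyone paying with blocking, which is the right trade when conflicts are rare.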
What most people get wrong
Multi-document transactions in MongoDB are not free. They have overhead from snapshot management, write conflict detection, and the 60-second time limit. Do not treat MongoDB transactions like SQL transactions where you wrap everything in BEGIN/COMMIT. Instead, design your schema to keep related data in a single document whenever possible (embedded documents and arrays). Use transactions only when you genuinely need atomicity across multiple documents or collections.
Write Conflicts and Retries
When two transactions modify the same document, MongoDB uses optimistic concurrency control. Both transactions proceed without locking, but at commit time, the second transaction to commit detects the conflict.
// Transaction retry pattern (based on MongoDB's recommended approach)
async function runTransactionWithRetry(session, txnFunc) {
  while (true) {
    try {
      session.startTransaction({
        readConcern: { level: "snapshot" },
        writeConcern: { w: "majority" }
      });
      await txnFunc(session);
      await session.commitTransaction();
      break; // Success
    } catch (error) {
      // End the open transaction before retrying; ignore the error if it
      // already ended (e.g., the commit itself failed)
      await session.abortTransaction().catch(() => {});
      if (error.hasErrorLabel("TransientTransactionError")) {
        // Write conflict or transient error: retry the whole transaction
        continue;
      }
      throw error; // Non-retryable error
    }
  }
}
What Happens When Things Break
MongoDB is designed for production failures. Here are the key failure scenarios and how the system responds.
| Failure | What Happens | How MongoDB Responds | Impact on Clients |
|---|---|---|---|
| Primary crashes | Writes unavailable | Election in ~10-12s, new primary elected | 10-12s write downtime; reads from secondaries continue |
| Secondary crashes | One fewer replica | Replica set continues with reduced redundancy | No impact if majority of nodes survive |
| Network partition (minority side) | Partitioned nodes cannot reach majority | Partitioned nodes step down to secondary | Clients connected to minority partition get errors |
| Disk full on primary | Writes fail | MongoDB rejects writes; reads continue | Write errors until space freed |
| Oplog falls behind | Secondary too far behind primary | Secondary enters RECOVERING state, needs resync | That secondary unavailable for reads; may affect w:majority |
| WiredTiger cache exhaustion | Memory pressure | Aggressive eviction, slower operations | Increased latency, possible timeouts |
| Config server unavailable (sharded) | Chunk metadata inaccessible | Reads/writes to existing chunks continue; splits and migrations stop | No immediate data loss; cluster "freezes" in current state |
The Split-Brain Problem and How MongoDB Avoids It
In distributed systems, a network partition can create a "split-brain" scenario where two groups of nodes each believe they are the active cluster. MongoDB avoids split-brain through the majority requirement for elections.
In a 3-node replica set, a node needs 2 votes (a majority) to become primary. If the network partitions into a 2-node group and a 1-node group, only the 2-node group can elect a primary. The isolated node cannot get a majority and remains a read-only secondary. This guarantees that at most one primary exists at any time.
This is also why you should always use an odd number of replica set members (3, 5, or 7). With an even number (say 4), a network partition could create two 2-node groups, and neither group can achieve a majority. The entire cluster becomes read-only until the partition heals. An arbiter node (a lightweight member that votes but does not hold data) can be added to avoid this problem, though I prefer running 3 full data nodes for maximum redundancy.
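The majority arithmetic behind this rule is simple to state precisely. A sketch (`electionMajority` is an illustrative name, not a MongoDB API):

```javascript
// Votes needed to win an election among N voting members
function electionMajority(votingMembers) {
  return Math.floor(votingMembers / 2) + 1;
}

// 3 members: majority is 2, so a 2-node partition can still elect a primary.
// 4 members: majority is 3, so two 2-node partitions both go read-only.
// An even member count adds a failure mode without adding fault tolerance.
```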
Retryable Writes and Reads
Since MongoDB 3.6 (writes) and 4.2 (reads), the drivers support retryable operations. When a write fails due to a network error or a failover event, the driver automatically retries the operation once. This is safe because each write is tagged with a unique transaction ID, and the server deduplicates retried writes.
// Retryable writes are enabled by default
const client = new MongoClient("mongodb://...", {
retryWrites: true, // Default: true since 3.6
retryReads: true // Default: true since 4.2
});
// If this write fails due to network error or failover:
// 1. Driver detects the error
// 2. Driver discovers the new primary via SDAM (see below)
// 3. Driver retries the same write with the same txnNumber
// 4. New primary deduplicates if the write was already applied
await db.collection("users").updateOne(
{ _id: userId },
{ $set: { lastLogin: new Date() } }
);
Why this matters in production
Before retryable writes, every application had to implement its own retry logic with deduplication. This was error-prone and inconsistent across services. With retryable writes enabled (the default), most transient failures during elections are handled transparently by the driver. I have seen this reduce error rates during planned maintenance (rolling restarts) from 2-5% to near zero.
Server Discovery and Monitoring (SDAM)
The MongoDB driver continuously monitors the topology of the replica set through a protocol called SDAM (Server Discovery and Monitoring). The driver maintains a background thread that pings each node every 10 seconds (configurable via heartbeatFrequencyMS) to detect:
- Which node is the primary
- Which nodes are secondaries
- Which nodes are unresponsive
- The replication lag on each secondary
When a failover occurs, the driver detects the topology change through SDAM and automatically redirects operations to the new primary. The application does not need any custom failover logic. This entire process is invisible to the application code beyond a brief latency spike during the election window.
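The relevant knobs can be sketched as Node.js driver client options (the option names are real driver options; the values shown are the documented defaults):

```javascript
// How often the monitor pings each node, and how long an operation will
// wait for a suitable server (e.g., the new primary) during a failover
const sdamOptions = {
  heartbeatFrequencyMS: 10000,      // topology ping interval (default: 10s)
  serverSelectionTimeoutMS: 30000,  // max wait for a usable server (default: 30s)
};
// new MongoClient(uri, sdamOptions)
```

Lowering `serverSelectionTimeoutMS` fails fast during extended outages; raising it rides out longer elections at the cost of slower error reporting.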
Rollback Scenario
When a primary crashes and a secondary is elected, the new primary might be missing some writes that the old primary had accepted with w:1 (not replicated to majority). When the old primary comes back online, it discovers that its oplog diverges from the current primary. It must roll back those unreplicated writes.
Rolled-back documents are saved to a rollback/ directory as BSON files. You can recover them manually, but this is a last resort. The correct fix is to use w: "majority" for all writes you cannot afford to lose.
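As a sketch, the durable-write options look like this (Node.js driver option shape; the `wtimeout` value is illustrative):

```javascript
// A majority-acknowledged write survives failover: it is already present
// on most voting members by the time the client sees success, so a new
// primary is guaranteed to have it and it can never be rolled back
const durableWrite = {
  writeConcern: { w: "majority", wtimeout: 5000 }, // fail rather than wait forever
};
// coll.updateOne({ _id: userId }, { $set: { plan: "pro" } }, durableWrite)
```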
Performance Characteristics
| Operation | Latency (typical) | Throughput | Notes |
|---|---|---|---|
| Point read by _id (primary) | 0.5-2ms | 50K-100K ops/s per node | Single B-tree traversal |
| Point read (secondary) | 0.5-2ms + replication lag | 50K-100K ops/s | May return stale data |
| Indexed query | 1-10ms | 10K-50K ops/s | Depends on selectivity and result size |
| Collection scan | 100ms-minutes | Varies | Full B-tree traversal, avoid in production |
| Write (w:1, journaled) | 1-2ms | 20K-50K ops/s per node | Single-node durability only |
| Write (w:majority) | 5-15ms | 5K-20K ops/s per node | Waits for replication |
| Transaction (2-3 docs) | 10-30ms | 1K-5K txn/s | MVCC snapshot + conflict detection overhead |
| Aggregation pipeline | 10ms-seconds | Depends on pipeline | Indexes help mainly the initial $match/$sort stages |
What Affects Performance
- Working set size vs RAM: If your frequently accessed data fits in the WiredTiger cache (typically 50% of RAM), most reads come from memory. If it does not fit, every cache miss means a disk read, and latency jumps from microseconds to milliseconds.
- Index coverage: Queries without index support trigger collection scans. On a collection with 100 million documents, a collection scan can take minutes. Use `explain()` on every production query.
- Document size: WiredTiger compresses pages, but large documents (>100 KB) reduce the number of documents per leaf page, increasing the number of pages that must be read for range queries. If a document contains a large array that grows unboundedly, you will hit the 16 MB BSON document size limit, and performance degrades well before that.
- Write contention: Under heavy write load, WiredTiger's checkpoint cycle (every 60 seconds) creates I/O spikes. Combined with replication overhead for `w:majority`, sustained write throughput is lower than read throughput.
- Connection pooling: Each MongoDB driver maintains a connection pool to the server. The default pool size is typically 100 connections. If your application creates too many connections (e.g., in a serverless environment where each function instance creates its own pool), the mongod process can run out of file descriptors and start rejecting connections. Use connection pooling libraries and keep the total connection count under `net.maxIncomingConnections` (default 65,536).
Performance Optimization Strategies
Read-heavy workloads:
- Use `readPreference: secondaryPreferred` to distribute reads across replicas
- Ensure every query has an appropriate index (use `explain()`)
- Use projection to return only needed fields, reducing network and cache overhead
- Consider covered queries (all fields in the index) for high-throughput reads
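A minimal sketch of a covered query; the collection and field names are hypothetical:

```javascript
// The index contains every field the query filters on AND returns, so the
// server answers from the index alone and skips the FETCH stage entirely
const indexSpec  = { userId: 1, status: 1 };
const filter     = { userId: 42, status: "active" };
const projection = { _id: 0, userId: 1, status: 1 }; // exclude _id: it is not in the index
// coll.createIndex(indexSpec)
// coll.find(filter).project(projection)  -- explain() should show no FETCH stage
```

If the projection includes any field outside the index (including the default `_id`), the query is no longer covered and each match costs a document fetch.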
Write-heavy workloads:
- Use `w:1` for non-critical writes where occasional data loss is acceptable
- Batch writes using `insertMany()` or `bulkWrite()` to reduce round-trips
- Use `ordered: false` in bulk operations so individual failures do not block the batch
- Monitor checkpoint I/O and consider reducing `storage.syncPeriodSecs` for more frequent, smaller checkpoints
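A sketch of an unordered bulk write; the collection and fields are hypothetical:

```javascript
// One round-trip carries many operations; ordered:false lets the server
// continue past individual failures and report all errors at the end,
// instead of aborting the batch at the first failed operation
const ops = [
  { insertOne: { document: { sku: "a1", qty: 3 } } },
  { updateOne: { filter: { sku: "b2" }, update: { $inc: { qty: 1 } }, upsert: true } },
  { deleteOne: { filter: { sku: "c3" } } },
];
// await coll.bulkWrite(ops, { ordered: false, writeConcern: { w: 1 } });
```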
Mixed workloads:
- Separate read and write traffic using read preference routing
- Tune WiredTiger eviction behavior through `storage.wiredTiger.engineConfig.configString` if cache pressure is throttling writes
- Monitor cache dirty bytes and ensure checkpoint throughput keeps pace with write rate
How This Compares to Alternatives
| Feature | MongoDB | PostgreSQL | DynamoDB | Cassandra |
|---|---|---|---|---|
| Data model | Documents (BSON) | Relational (rows) | Key-value / document | Wide-column |
| Schema | Flexible (schema-on-read) | Strict (schema-on-write) | Flexible per item | Flexible per row |
| Query language | MongoDB Query Language (MQL) | SQL | PartiQL / API | CQL |
| Joins | $lookup (limited) | Full JOIN support | None | None |
| Transactions | Multi-document ACID (since 4.0) | Full ACID | Single-item or TransactWriteItems | Lightweight transactions |
| Replication | Replica set (Raft-based) | Streaming replication | Built-in (managed) | Masterless (Dynamo-style) |
| Sharding | Native range/hash sharding | Manual (Citus) or partitioning | Automatic (partition key) | Automatic (consistent hash) |
| Consistency | Tunable (per-operation) | Strong (serializable available) | Tunable (eventual or strong) | Tunable (per-query) |
| Managed option | Atlas | RDS / Aurora | Native AWS | Managed Cassandra (Keyspaces) |
A few notes on this comparison:
- MongoDB's `$lookup` is significantly slower than SQL JOINs for large datasets: it has historically used a nested-loop strategy rather than hash or merge joins (it can use an index on the foreign collection for equality matches, but there is no full join optimizer).
- DynamoDB has no equivalent to MongoDB's aggregation pipeline. Complex analytics require exporting to a data lake.
- Cassandra's lightweight transactions (LWT) are much more limited than MongoDB's multi-document transactions and carry a significant performance penalty.
When to Choose MongoDB
I reach for MongoDB when the data model is naturally document-shaped (nested objects, variable schemas across records), when I need flexible indexing across arbitrary fields, and when the team values developer productivity with a schemaless data model. MongoDB's aggregation pipeline is also surprisingly powerful for analytics that do not require multi-table joins.
I switch to PostgreSQL when I need complex joins across multiple entities, strong transactional guarantees across many tables, or when the data is fundamentally relational. I switch to DynamoDB when I need truly massive scale with single-digit millisecond latency and can design a strict access pattern around partition keys.
Schema Design: Embedding vs Referencing
MongoDB's document model gives you a choice that relational databases do not: embed related data in a single document or reference it from a separate collection (like a foreign key).
The embedding decision should follow these rules:
- Embed when the related data is always accessed together and has bounded growth
- Reference when the related data grows unboundedly, is frequently updated independently, or is accessed separately
- Hybrid (embed the common case, reference the edge case) when one access pattern dominates
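As a sketch of the two shapes (all field names hypothetical):

```javascript
// Embed: bounded data that is always read with its parent
const userDoc = {
  _id: 1,
  name: "Ada",
  address: { city: "London", zip: "EC1A" }, // one address, read with the profile
};

// Reference: unbounded growth (an activity feed) lives in its own collection,
// keyed back to the user -- one document per event, so the user document
// never grows toward the 16 MB BSON limit
const activityDoc = { userId: 1, type: "login", at: "2024-01-01T00:00:00Z" };
```

This split also matches the interview scenario in this article: profiles (embedded, single document) and activity feeds (referenced, one document per event).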
Read Preference: Routing Reads in a Replica Set
Read preference controls which replica set member a read operation targets:
| Read Preference | Behavior | Use Case |
|---|---|---|
| primary (default) | Always read from primary | Strong consistency required |
| primaryPreferred | Primary if available, secondary if not | Consistency preferred, availability fallback |
| secondary | Always read from a secondary | Analytics/reporting (stale data acceptable) |
| secondaryPreferred | Secondary if available, primary if not | Read scale-out with best-effort consistency |
| nearest | Read from the node with lowest network latency | Geo-distributed reads, latency-sensitive |
Hidden danger with secondaryPreferred
secondaryPreferred sounds safe ("prefer secondary, but use primary if needed"), but it creates a subtle problem: if all secondaries are slightly behind, your application consistently reads stale data. Worse, during an election when secondaries are catching up, data can appear to "go backward" (a read returns newer data, then the next read returns older data). For any data where consistency matters, use primary read preference with readConcern: "majority".
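The consistent-read configuration can be sketched as driver options (Node.js driver option shape):

```javascript
// Pin to the primary and only read majority-committed data; this trades
// read scale-out for monotonic, non-stale reads
const consistentRead = {
  readPreference: "primary",
  readConcern: { level: "majority" },
};
// coll.find({ _id: userId }, consistentRead)
```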
Interview Cheat Sheet
- When asked about storage format: "MongoDB stores BSON, not JSON. BSON is binary-encoded with type bytes and length prefixes, enabling O(1) field access without parsing the entire document. An ObjectId is 12 bytes: timestamp + random + counter, designed for roughly time-ordered B-tree inserts."
- When asked about the storage engine: "WiredTiger uses B-trees with MVCC. Readers see consistent snapshots without blocking writers. Data is compressed on disk (snappy/zstd) and written to a journal (WAL) that syncs every 100ms by default, with checkpoints every 60 seconds."
- When asked about replication: "Replica sets use a Raft-inspired protocol. The primary records writes in the oplog (a capped collection of idempotent operations). Secondaries tail the oplog. Elections take 10-12 seconds when a primary fails."
- When asked about consistency: "MongoDB's consistency is tunable per-operation. writeConcern controls how many replicas acknowledge a write. readConcern controls what snapshot a read sees. For true consistency, use w:majority + readConcern:majority."
- When asked about sharding: "Data is split into 128 MB chunks by shard key. mongos routes queries to the correct shard(s). A compound shard key with high-cardinality prefix provides both even distribution and targeted queries."
- When asked about transactions: "Since 4.0, MongoDB supports multi-document ACID transactions using MVCC snapshots. Transactions have a 60-second time limit and use optimistic concurrency control with automatic write conflict detection."
- When asked about indexes: "Follow the ESR rule: Equality fields first, then Sort fields, then Range fields. Use explain() to verify covered queries where all fields exist in the index, eliminating the FETCH stage."
- When asked about failure handling: "If primary fails, a new primary is elected in 10-12 seconds. Writes with w:1 on the old primary that were not replicated are rolled back (saved to rollback/ directory). Use w:majority to prevent data loss."
- When asked about performance: "Point reads by _id take 0.5-2ms. The key performance factor is whether the working set fits in WiredTiger cache (50% of RAM). If it does not, every cache miss hits disk."
- When asked about schema design: "Embed related data in a single document when possible. This avoids joins and enables atomic single-document operations. Use references (like foreign keys) only when documents would grow unboundedly or when many-to-many relationships require it."
Test Your Understanding
Quick Recap
- MongoDB stores documents as BSON (Binary JSON), a binary format with type bytes and length prefixes that enables O(1) field access without parsing the full document.
- WiredTiger stores collections as compressed B-trees with MVCC, allowing concurrent readers and writers without blocking each other.
- The journal (WAL) fsyncs every 100ms by default (storage.journal.commitIntervalMs) and checkpoints flush all dirty pages every 60 seconds, creating the crash recovery mechanism.
- Replica sets use a Raft-inspired protocol with oplog tailing for replication, with elections completing in 10-12 seconds when the primary fails.
- Write concern (`w`) controls how many replicas acknowledge a write; read concern controls what snapshot a read sees. Together they define the consistency model.
- Sharding distributes data across replica sets using a shard key. Chunk splitting (metadata-only) and balancing (data migration) maintain even distribution.
- The ESR rule (Equality, Sort, Range) defines optimal compound index field ordering for queries with mixed predicates and sorts.
- Multi-document transactions use MVCC snapshots with a 60-second time limit and optimistic concurrency control for write conflict detection.
- The biggest performance factor is whether the working set fits in the WiredTiger cache (typically 50% of RAM). Cache misses hit disk and increase latency by 10-100x.
- MongoDB is best suited for document-shaped data with flexible schemas. For heavy join workloads or strict relational integrity, PostgreSQL is a better fit.
- Change streams provide real-time, resumable notifications of data changes by tailing the oplog through a stable API, enabling event-driven architectures without polling.
- The SDAM protocol in MongoDB drivers continuously monitors replica set topology and automatically redirects operations to the correct primary after failover, making the application resilient to elections without custom retry logic.
Related Concepts
- How DynamoDB works internally: DynamoDB's partition-key routing and write-ahead log are similar to MongoDB's shard key routing and journal, but DynamoDB is fully managed with automatic scaling.
- How Cassandra works internally: Cassandra's masterless architecture and tunable consistency provide an interesting contrast to MongoDB's primary-based replication and write/read concern model.
- How PostgreSQL executes a query: PostgreSQL's query planner and B-tree indexes operate on similar principles to MongoDB's, but with the added complexity of table joins and SQL optimization.
- How Kafka works internally: Kafka's offset-based log consumption is conceptually similar to MongoDB's oplog tailing, and both systems use leader-based replication with follower consumers.
- Consistency models in distributed systems: MongoDB's tunable consistency (from eventual to linearizable) is a practical implementation of the theoretical consistency spectrum discussed in CAP theorem and PACELC framework.
- How Redis works internally: Redis's in-memory data structures and single-threaded event loop provide an interesting contrast to MongoDB's disk-based B-trees and multi-threaded WiredTiger engine. Many architectures use both: Redis for hot data caching, MongoDB for persistent document storage.