๐Ÿ“HowToHLD
Vote for New Content
Vote for New Content
Home/High Level Design/Concepts

Databases

Learn how databases organize data for fast retrieval, which storage engine to choose for your workload, and how ACID transactions keep concurrent writes correct at scale.

53 min read · 2026-03-23 · medium · Tags: databases, sql, nosql, storage-engines, acid, hld

TL;DR

  • A database is not just a place to store data – it is a query engine, a concurrency manager, and a durability guarantee packaged together. Understanding all three layers is what separates "I added PostgreSQL" from a defensible system design.
  • The storage engine (B-Tree vs. LSM-Tree) determines your read/write trade-off before you write a single query. B-Tree for balanced OLTP; LSM-Tree for write-heavy workloads where sequential I/O matters.
  • Indexes are the single highest-leverage performance tool in any database. A missing index turns a 1ms primary-key lookup into a 30-second full-table scan at 100M rows. A composite index in the wrong column order is as useless as no index.
  • ACID (Atomicity, Consistency, Isolation, Durability) solves four distinct concurrency hazards. Each property has a specific mechanism: the transaction log enforces atomicity; fsync enforces durability; isolation levels control read/write anomalies; constraints enforce consistency.
  • The right database for a job is determined by your query shape – specifically, whether you need JOINs, which indexes you'll build, write vs. read ratio, and consistency requirements – not by your data format or team preference.

The Problem It Solves

It is Q4 earnings. Your fintech platform has processed 2 billion transactions this year. A risk analyst needs every transaction from account ACC-8841 in the last 90 days, grouped by merchant, filtered to amounts over $500.

Simple enough – except your transactions live in a distributed file system as individual JSON files, one file per transaction. Finding account ACC-8841's transactions means opening all 2 billion files sequentially. At 0.1ms per file open, that is 55 hours of I/O.
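A quick back-of-envelope check of that figure (the 0.1ms per-file cost is the optimistic assumption from above):

```typescript
// Sequential scan cost over 2B flat files, one open per transaction.
const files = 2_000_000_000;  // one JSON file per transaction
const openMs = 0.1;           // optimistic per-file open + read latency
const totalHours = (files * openMs) / 1000 / 3600;
console.log(totalHours.toFixed(1)); // ≈ 55.6 hours of pure I/O
```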

Meanwhile, your reconciliation service is reading existing transaction files while the payments service is writing new ones. No locking. A payment file half-written by the payments service is read mid-write by the reconciliation service.

The reconciliation total is wrong – but by how much? There is no way to know. The file system does not track partial writes.

And at 2:47 a.m., your transaction processing server crashes between writing the debit record and the credit record. Alice's $200 was debited. Bob never received it.

I've seen teams discover this failure mode six months into production – usually when a customer calls finance about a missing transfer. You have no transaction log, no rollback, no way to know which transfers completed and which are half-done.

The 'we can figure out schema later' trap

Raw files and key-value stores feel fast to build with early on. They collapse under three specific pressures: multi-column query patterns (you can only index what you anticipated), concurrent writes without coordination (race conditions that only appear under load), and crash recovery (the question becomes "how much data did we lose?" not "did we lose any?"). Every database exists to solve these three problems simultaneously.

flowchart TD
  subgraph NoDB["💥 No Database – The Three Failure Modes"]
    direction TB

    Files(["📁 2B JSON files\nFlat filesystem storage"])

    subgraph Query["❌ Query Problem"]
      QP["Full file scan for every query\n2B files × 0.1ms = 55 hrs\nNo index possible"]
    end

    subgraph Concurrency["❌ Concurrency Problem"]
      CP["Payments writes file mid-read\nReconciliation reads corrupt half-file\nBalance total is wrong – silently"]
    end

    subgraph Crash["❌ Crash Problem"]
      CR["Server crashes mid-transfer\nDebit written · Credit missing\nNo rollback · No audit trail"]
    end
  end

  Files -->|"Query for ACC-8841 in Q4"| QP
  Files -->|"Concurrent read+write"| CP
  Files -->|"Crash at 2:47 AM"| CR

A database solves all three failure modes in a single system: it gives you indexes for fast queries, transactions for concurrency safety, and a write-ahead log for crash recovery. These aren't nice-to-haves – they're the reason databases exist.


What Is It?

A database is a structured data store with three integrated subsystems: a query interface that translates your intent into an execution plan, a storage engine that physically organises data for the access patterns you need, and a transaction manager that serialises concurrent operations and ensures committed changes survive failures. Most candidates treat it as just a place to store data. Understanding all three subsystems is what makes your storage trade-off reasoning actually defensible.

Analogy: Think of a large reference library with millions of books. If books were stored at random, finding all books published in 1987 on quantum physics requires reading every title. A library solves this with a card catalogue (B-Tree index), books shelved by subject and author (storage model), a sign-out ledger (transaction log), and a fireproof copy of the catalogue offsite (durability).

The catalogue is not the library – it is the structure that makes the library useful at scale. Your schema, indexes, and access patterns are the catalogue.

flowchart TD
  subgraph AppTier["⚙️ Application Tier"]
    App(["👤 Application\nread/write requests"])
  end

  subgraph DBTier["🗄️ Database System"]
    CP["🔀 Connection Pool\nmax 100–500 connections\nrequest queuing"]
    QP["⚙️ Query Parser & Planner\nparse SQL → AST → execution plan\nchoose index or full scan"]
    EE["⚙️ Execution Engine\nrun plan · join tables\neval predicates"]
    SE["🗄️ Storage Engine\nB-Tree or LSM-Tree\nmanage pages on disk"]
    BP["⚡ Buffer Pool\nin-memory page cache\n~90%+ cache of hot pages"]
    WAL[("📝 Write-Ahead Log\nappend-only crash recovery\nfsync on commit")]
    Disk[("💾 Data Files\n8KB pages on disk\nprimary storage")]
  end

  App -->|"SQL / API query"| CP
  CP -->|"1 connection per request"| QP
  QP -->|"execution plan"| EE
  EE -->|"page request"| BP
  BP -->|"Buffer miss → disk read"| Disk
  EE -->|"Write → WAL first"| WAL
  WAL -.->|"Async flush to data files"| Disk
  BP -.->|"Dirty page eviction"| Disk

With a database, the risk analyst's query becomes:

SELECT merchant_id, SUM(amount), COUNT(*)
FROM transactions
WHERE account_id = 'ACC-8841'
  AND created_at >= NOW() - INTERVAL '90 days'
  AND amount > 500
GROUP BY merchant_id
ORDER BY SUM(amount) DESC;

With a composite index on (account_id, created_at, amount), this executes in under 5ms regardless of how many total transactions exist – because the storage engine traverses the B-Tree to account ACC-8841, range-scans the last 90 days, and filters the rest. It never touches rows for other accounts. That 5ms-vs-55-hours gap is only possible because all three subsystems – query, storage, and transaction – work together.


How It Works

Here is what happens end-to-end when your application issues SELECT * FROM products WHERE id = 7429:

  1. Connection pool assigns a connection – Your application maintains a pool of persistent database connections. A new request borrows one instead of negotiating a full TCP + auth handshake (5–50ms overhead per request without a pool).
  2. Parser validates syntax – The SQL string is tokenised and parsed into an Abstract Syntax Tree. Syntax errors are caught here before touching any data.
  3. Query planner chooses an execution plan – The planner checks pg_statistic for row count estimates, index availability, and selectivity. If id is a primary key (clustered B-Tree index), the plan is: Index Scan using products_pkey on products WHERE id = 7429. If no index exists: Seq Scan on products – read every row.
  4. Execution engine runs the plan – For an index scan, it traverses the B-Tree from the root to the leaf page containing id = 7429.
  5. Buffer pool check – Before hitting disk, the execution engine checks the in-memory buffer pool. If the page is already cached (~95% hit rate on hot data), it returns immediately (< 0.5ms). On a miss, it fetches the 8KB page from disk (~1–5ms SSD, ~5–15ms HDD) and warms the buffer pool for future access.
  6. Result returned – The deserialized row data is returned to the connection and ultimately to your application.
// Application layer – always use parameterised queries (prevents SQL injection)
async function getProduct(productId: number): Promise<Product | null> {
  // Connection pool handles connection acquisition; do NOT create one per request
  const result = await db.query<Product>(
    'SELECT id, name, price, stock_qty FROM products WHERE id = $1',
    [productId]  // Parameter, not string interpolation – prevents SQL injection
  );
  return result.rows[0] ?? null;
}
-- Debug slow queries with EXPLAIN ANALYZE (PostgreSQL)
EXPLAIN ANALYZE
SELECT id, name, price FROM products WHERE id = 7429;

-- Without index (catastrophic at 100M rows):
-- Seq Scan on products  (cost=0.00..2843291.00 rows=1 width=40)
--   (actual time=28943.871..28943.872 rows=1 loops=1)
-- Planning: 3.2ms · Execution: 28943.9ms (29 seconds!)

-- With primary key index (default B-Tree):
-- Index Scan using products_pkey  (cost=0.56..8.58 rows=1 width=40)
--   (actual time=0.042..0.043 rows=1 loops=1)
-- Planning: 0.1ms · Execution: 0.043ms (instant!)

Interview tip: always EXPLAIN ANALYZE before adding infrastructure

When an interviewer asks how you'd handle slow queries, say: "First, I'd run EXPLAIN ANALYZE to confirm the execution plan – 80% of database performance problems are resolved with the right index, not a new database or caching layer." This signals you know to measure before architecting.

The sequence from query to data on every non-trivial request:

sequenceDiagram
  participant App as ⚙️ Application
  participant Pool as 🔀 Connection Pool
  participant Planner as ⚙️ Query Planner
  participant Buffer as ⚡ Buffer Pool
  participant Disk as 💾 Disk

  App->>Pool: borrow connection
  Pool-->>App: connection ready (~0ms)
  App->>Planner: SELECT * FROM products WHERE id=7429
  Planner->>Planner: parse SQL → check statistics<br/>→ choose index scan plan
  Planner->>Buffer: fetch page for id=7429

  alt Buffer HIT (~95% of reads)
    Buffer-->>Planner: page data · < 0.5ms
  else Buffer MISS
    Buffer->>Disk: read 8KB page from SSD
    Disk-->>Buffer: page data · 1–5ms
    Buffer-->>Planner: page data (now cached)
  end

  Planner-->>App: result row
  App->>Pool: return connection

Key Components

| Component | Role |
| --- | --- |
| Query parser | Converts the SQL string into an Abstract Syntax Tree. Rejects malformed queries immediately. |
| Query planner / optimiser | Chooses the execution plan: which indexes to use, join ordering, predicate pushdown. Wrong plans are the leading cause of unexplained slowdowns. |
| Execution engine | Runs the plan: iterates rows, evaluates predicates, aggregates results. For complex queries, this stage does the most CPU work. |
| Buffer pool | In-memory cache of disk pages (typically 25–80% of RAM in PostgreSQL). The goal is a 95%+ hit rate so most reads never touch disk. The database equivalent of an application-level cache, but automatic. |
| Storage engine | Manages the on-disk layout: B-Tree for PostgreSQL/MySQL, LSM-Tree for Cassandra/RocksDB. Determines read/write performance characteristics. |
| Write-Ahead Log (WAL) | Every write is appended to the WAL before touching data files. On crash, the DB replays the WAL to restore the committed state. This is how durability (the D in ACID) is implemented. |
| Transaction manager | Assigns transaction IDs, tracks in-flight transactions, enforces isolation levels, coordinates commits and rollbacks. Maintains MVCC snapshots. |
| Connection pool | Manages a pool of persistent DB connections. Creating a new connection costs 5–50ms of authentication and session setup; without a pool, every request pays that overhead. |
| Replication slot | Tracks how far each replica has consumed the WAL stream. A lagging replica with a stale slot prevents the primary from cleaning old WAL segments – a disk-space trap. |
| MVCC (Multi-Version Concurrency Control) | Maintains multiple row versions so readers never block writers. Readers see a consistent snapshot; they do not acquire locks. This is why PostgreSQL reads scale independently of writes. |
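Why pooling wins can be shown with a toy pool that reuses connections instead of paying the connect-plus-auth handshake per request. The API and sizes here are illustrative; real poolers (PgBouncer, node-postgres `Pool`) add queuing, timeouts, and health checks.

```typescript
// Toy connection pool: reuse persistent connections instead of paying
// the 5–50ms handshake on every request.
type Conn = { id: number };

class ConnectionPool {
  private idle: Conn[] = [];
  created = 0;
  constructor(private max: number) {}

  acquire(): Conn {
    const conn = this.idle.pop();   // reuse path: effectively free
    if (conn) return conn;
    if (this.created >= this.max) {
      throw new Error("pool exhausted – a real pool would queue the request");
    }
    this.created++;                 // cold path: real pools pay TCP + auth here
    return { id: this.created };
  }
  release(conn: Conn): void {
    this.idle.push(conn);           // connection stays open for the next request
  }
}

const pool = new ConnectionPool(2);
const a = pool.acquire();           // creates physical connection 1
pool.release(a);
const b = pool.acquire();           // reuses connection 1 – no new handshake
console.log(pool.created);          // 1
```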

Storage Engines: B-Tree vs. LSM-Tree

The storage engine is the component that physically organises your data on disk. Choosing the wrong one for your workload is the most permanent performance mistake you can make – it is baked into your schema design, your operational runbooks, and your hardware bill. I see this decision made on vibes more often than any other in system design – "Cassandra scales better" is not an analysis.

[Figure: side-by-side comparison of B-Tree and LSM-Tree storage engines. Left: B-Tree with root, internal, and leaf nodes showing tree traversal. Right: LSM-Tree with MemTable in RAM, L0 unsorted SSTables, L1 sorted SSTables, and L2 fully merged SSTable.]
B-Tree and LSM-Tree solve opposite problems. B-Tree optimises for reads with O(log N) traversal but suffers write amplification from in-place updates. LSM-Tree absorbs writes in RAM first, flushing to immutable sorted files – eliminating random I/O at the cost of read amplification.

B-Tree

B-Trees store data as a balanced tree of fixed-size pages (typically 8KB). Every read and write traverses from the root to a leaf page.

The leaf pages contain the actual row data (or row pointers in a secondary index). Leaves are linked for efficient range scans.

Why it dominates OLTP: B-Tree provides O(log N) for both reads and writes, predictable latency, and efficient range queries (BETWEEN, ORDER BY, >, <). The standard page size means the OS page cache and the DB buffer pool interact cleanly, and SSD random read latency (0.1ms) is fast enough for typical OLTP query depths of 3–5 levels.
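The 3–5 level figure follows directly from page fanout: depth grows as log base fanout of N. The ~300 keys per 8KB page used below is a ballpark assumption, not a fixed constant.

```typescript
// B-Tree depth stays tiny even at huge row counts because fanout is large.
function btreeDepth(rows: number, fanout: number): number {
  return Math.ceil(Math.log(rows) / Math.log(fanout));
}

console.log(btreeDepth(1e8, 300));  // 100M rows -> 4 levels
console.log(btreeDepth(1e10, 300)); // 10B rows  -> 5 levels
```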

The write amplification problem: A single 100-byte row write may trigger an 8KB page rewrite. More critically, as the tree rebalances, a single insert near a full page triggers a page split: the existing page is copied, split into two new pages, and the parent page is updated. In heavy insert workloads, write amplification factors of 5–50× are common – one logical write becomes 5–50 physical disk writes.

LSM-Tree

LSM-Trees (Log-Structured Merge Trees) never update data in-place. All writes go to an in-memory buffer (MemTable) first. When the MemTable fills (~64MB), it is flushed as an immutable sorted file (SSTable) to disk.

Background compaction merges SSTables from Level 0 into larger sorted runs at Level 1, then Level 2, reducing the number of files reads must check.

Why it dominates write-heavy workloads: Writes are always sequential appends (O(1), no random I/O). On modern SSDs, sequential write throughput is 5–10× faster than random writes. This makes LSM-Trees ideal for time-series, IoT telemetry, event logs, and write-heavy analytics.

The read amplification problem: To find a key, the DB must check: MemTable → L0 (every file, since L0 files overlap in key range) → L1 → L2 → … each additional level adds disk accesses. LSM-Trees use per-SSTable Bloom filters to skip files that definitely do not contain a key, but reads still hit multiple files vs. a single B-Tree path.
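A minimal sketch of that Bloom-filter check: "definitely not here" vs. "maybe here". The hash scheme and sizes are illustrative, not what RocksDB or Cassandra actually use.

```typescript
// Per-SSTable Bloom filter: set k bit positions per key; a key is
// definitely absent if any of its positions is unset.
class BloomFilter {
  private bits: Uint8Array;
  constructor(private m: number, private k: number) {
    this.bits = new Uint8Array(m);
  }
  private hash(key: string, seed: number): number {
    let h = seed;
    for (const ch of key) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
    return h % this.m;
  }
  add(key: string): void {
    for (let i = 0; i < this.k; i++) this.bits[this.hash(key, i + 1)] = 1;
  }
  mightContain(key: string): boolean {
    for (let i = 0; i < this.k; i++)
      if (!this.bits[this.hash(key, i + 1)]) return false; // definite miss: skip this SSTable
    return true; // maybe present: must actually read the SSTable (false positives possible)
  }
}

const filter = new BloomFilter(1024, 3);
filter.add("ACC-8841");
console.log(filter.mightContain("ACC-8841")); // true: added keys are never false negatives
console.log(filter.mightContain("ACC-0001")); // almost certainly false -> SSTable skipped
```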

| Dimension | B-Tree | LSM-Tree |
| --- | --- | --- |
| Write pattern | Random I/O – in-place update | Sequential I/O – always append |
| Write amplification | 5–50× (page splits, rebalancing) | 10–30× (compaction rewrites) |
| Read performance | O(log N) – single tree path | O(log N) with amplification – multi-level check |
| Range scans | Excellent – linked leaf pages | Moderate – must merge-read SSTables |
| Space efficiency | ~50–60% page utilisation | Higher – but compaction reclaims space |
| Compaction cost | None (tree maintains itself) | Background CPU/IO – can spike under heavy write load |
| Best for | OLTP · mixed read/write · ACID transactions | Write-heavy ingestion · time-series · analytics |
| Engines | PostgreSQL, MySQL InnoDB, SQLite, Oracle | Cassandra, LevelDB, RocksDB, ClickHouse (partial) |

The fundamental tension: read amplification vs. write amplification. B-Tree minimises read amplification (single path from root to leaf) at the cost of write amplification. LSM-Tree minimises write amplification (append-only) at the cost of read amplification (multiple levels to check).

Default to B-Tree unless your write throughput or append-only access pattern specifically justifies the LSM-Tree trade-off.


Indexing

An index is a separate data structure (usually a B-Tree) that stores a sorted subset of your columns with pointers back to the full rows. The query planner uses it to jump directly to relevant rows instead of scanning every row in the table. I'd argue this is the single highest-leverage database skill – and the one most candidates gloss over when designing a system under time pressure.

Without an index on email: Find the user with email = 'alice@example.com' from 100M users → Full Table Scan → read every row → ~30 seconds.

With a B-Tree index on email: The planner traverses the B-Tree (5 levels deep at 100M rows) → jumps to the leaf page → fetches the row pointer → one disk I/O → < 1ms.

The 30,000× latency difference is not an exaggeration. It is the norm whenever a production query runs a full table scan against a large table.

-- See what the planner is doing
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'alice@example.com';

-- After adding index:
CREATE INDEX idx_users_email ON users(email);
-- Index Scan using idx_users_email on users
-- (actual time=0.041..0.042 rows=1 loops=1)  -- 0.042ms

-- Composite index: column order matters enormously
-- Query: WHERE account_id = 'ACC-8841' AND created_at >= '2026-01-01' AND amount > 500
CREATE INDEX idx_txns_account_date_amount
  ON transactions(account_id, created_at, amount);
-- The planner uses this index for the above query.
-- An index on (created_at, account_id, amount) would NOT serve this query well –
-- the range predicate on the leading column forces a wide scan before account_id can narrow it.

Index types

| Type | Data structure | Query support | Notes |
| --- | --- | --- | --- |
| B-Tree (default) | Balanced tree | Equality, range, LIKE 'prefix%' | 99% of indexes you will ever create |
| Hash | Hash map | Equality only (=) | Faster than B-Tree for pure equality at the cost of range support |
| GIN/GiST | Inverted/spatial | Full-text search, arrays, JSON containment, geographic | PostgreSQL-specific; essential for @>, @@, && operators |
| Partial | B-Tree on filtered subset | Equality + range on the subset | CREATE INDEX ... WHERE status = 'active' – smaller, faster |
| Covering | B-Tree storing non-key columns | Index-only scan – zero table access | CREATE INDEX ... INCLUDE (price, stock_qty) |
| Composite | B-Tree on N columns | Left-prefix rule – first N columns must match | Column order matters: high-cardinality, equality-first columns first |

The left-prefix rule and when composite indexes fail

A composite index on (account_id, created_at, amount) is used by queries that filter on account_id, or account_id + created_at, or all three. It is NOT used by queries that filter only on created_at or amount – the first column must be constrained. Build composite indexes starting with the highest-cardinality equality column (usually an ID), followed by range columns, followed by low-cardinality filters.
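The rule can be sketched as a small predicate. This is a simplification: real planners can still filter on later columns during the scan, but the leading prefix determines what the B-Tree can actually seek on.

```typescript
// Left-prefix rule sketch: walk the index columns in order and stop at
// the first column the query does not constrain.
function usableIndexColumns(indexCols: string[], filterCols: Set<string>): string[] {
  const usable: string[] = [];
  for (const col of indexCols) {
    if (!filterCols.has(col)) break; // first unconstrained column ends the usable prefix
    usable.push(col);
  }
  return usable;
}

const idx = ["account_id", "created_at", "amount"];
console.log(usableIndexColumns(idx, new Set(["account_id", "created_at"]))); // both columns used
console.log(usableIndexColumns(idx, new Set(["created_at", "amount"])));     // [] – index unusable
```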

Covering indexes – eliminating table access entirely

A covering index stores non-key columns directly in the index leaf pages. If all columns your query touches are in the index, the DB never fetches the underlying table pages – the query executes entirely from the index, which is typically much smaller and more cache-friendly.

-- Without covering index: index scan + heap fetch (table access)
CREATE INDEX idx_products_status ON products(status);
SELECT id, name, price FROM products WHERE status = 'active';
-- Plan: Index Scan → for each matching id, fetch the heap page for name + price

-- With covering index: index-only scan
CREATE INDEX idx_products_status_covering ON products(status) INCLUDE (name, price);
SELECT id, name, price FROM products WHERE status = 'active';
-- Plan: Index Only Scan – name and price come from the index, no table fetch
-- Typically 3–5× faster for wide tables with narrow query projections

The N+1 query problem

The most common application-layer database performance issue. Your ORM executes one query to fetch a list, then one query per item to fetch a related record.

// N+1 – catastrophic at scale
const orders = (await db.query('SELECT * FROM orders WHERE status = $1', ['pending'])).rows;
// Returns 1,000 orders – then fires 1,000 individual queries:
for (const order of orders) {
  order.customer = (await db.query(
    'SELECT * FROM customers WHERE id = $1', [order.customer_id]
  )).rows[0];
}
// Total: 1,001 queries · 1,000 round-trips · ~5 seconds for what should be 5ms

// Fix: JOIN or batched IN clause
const orders = (await db.query(`
  SELECT o.*, c.name, c.email
  FROM orders o
  JOIN customers c ON c.id = o.customer_id
  WHERE o.status = $1
`, ['pending'])).rows;
// Total: 1 query · 1 round-trip · ~5ms ✅
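The batched-IN alternative mentioned above, shown as a pure query-builder helper. `buildCustomerBatchQuery` is a hypothetical name; the `= ANY($1)` array-parameter pattern is standard PostgreSQL, and the row shapes follow the earlier examples.

```typescript
// Build one parameterised query for all customers instead of one per order.
function buildCustomerBatchQuery(customerIds: number[]): { text: string; values: number[][] } {
  const unique = [...new Set(customerIds)];  // dedupe: many orders share a customer
  return {
    text: "SELECT id, name, email FROM customers WHERE id = ANY($1)",
    values: [unique],                        // single array parameter – still injection-safe
  };
}

const q = buildCustomerBatchQuery([7, 7, 12, 7, 12, 31]);
console.log(q.values[0]); // [ 7, 12, 31 ]
// Usage sketch: const customers = (await db.query(q.text, q.values)).rows;
// Two queries total (orders + customers), not 1,001.
```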

Every slow query I've seen at scale traces to one of three causes: a missing index, the wrong composite index column order, or N+1 in the ORM layer – check those before adding caching or a new database tier.


Transactions and ACID

A transaction is a unit of work that either completes entirely or not at all. ACID guarantees what that means under failure and concurrency. My recommendation: learn each of the four properties through its implementation mechanism – the WAL enforces atomicity, fsync enforces durability, MVCC enables isolation, constraints enforce consistency – rather than just memorising the acronym.

The canonical bank transfer example – without transactions:

-- Session 1 executes these two statements independently (no transaction):
UPDATE accounts SET balance = balance - 100 WHERE id = 'alice';
-- ↑ Server crashes here (power failure, OOM, reboot)
UPDATE accounts SET balance = balance + 100 WHERE id = 'bob';
-- This line never executes. alice's $100 has vanished from the system.

With a transaction (ACID):

BEGIN;
  UPDATE accounts SET balance = balance - 100 WHERE id = 'alice';
  -- If server crashes here, WAL records the in-progress transaction.
  -- On restart, PostgreSQL sees an uncommitted transaction and rolls it back.
  -- alice's balance is restored. No money is lost.
  UPDATE accounts SET balance = balance + 100 WHERE id = 'bob';
COMMIT;
-- Only COMMIT makes the change permanent and visible to other sessions.

A – Atomicity

All operations in a transaction succeed, or none of them take effect. Implemented via the Write-Ahead Log (WAL): every change is logged before being applied. On crash, uncommitted changes are rolled back by replaying the WAL in reverse.
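A toy model of that mechanism (record shapes are illustrative, not PostgreSQL's WAL format): recovery replays only the transactions that reached a COMMIT record, so a half-done transfer simply disappears.

```typescript
// Toy WAL recovery: committed writes are replayed, uncommitted ones discarded.
type WalRecord =
  | { kind: "write"; txid: number; key: string; value: number }
  | { kind: "commit"; txid: number };

function recover(wal: WalRecord[]): Map<string, number> {
  const committed = new Set<number>();
  for (const rec of wal) if (rec.kind === "commit") committed.add(rec.txid);
  const state = new Map<string, number>();
  for (const rec of wal)
    if (rec.kind === "write" && committed.has(rec.txid))
      state.set(rec.key, rec.value); // replay committed writes only
  return state;
}

// Crash between alice's debit and bob's credit: tx 2 never committed.
const wal: WalRecord[] = [
  { kind: "write", txid: 1, key: "alice", value: 1000 },
  { kind: "write", txid: 1, key: "bob", value: 500 },
  { kind: "commit", txid: 1 },
  { kind: "write", txid: 2, key: "alice", value: 900 }, // debit applied... then crash
];
console.log(recover(wal).get("alice")); // 1000 – the half-done transfer is rolled back
```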

C – Consistency

A transaction can only bring the database from one valid state to another. Constraints, foreign keys, and check constraints enforce this: if any constraint is violated mid-transaction, the entire transaction is rolled back. Consistency is enforced by your schema design – the DB provides the mechanism.

I – Isolation

Concurrent transactions do not observe each other's intermediate state. Isolation levels control the trade-off between correctness guarantees and throughput:

| Isolation Level | Dirty Read | Non-Repeatable Read | Phantom Read | Performance |
| --- | --- | --- | --- | --- |
| Read Uncommitted | ❌ Possible | ❌ Possible | ❌ Possible | ⚡⚡⚡ Highest |
| Read Committed (PostgreSQL default) | ✅ Prevented | ❌ Possible | ❌ Possible | ⚡⚡ High |
| Repeatable Read | ✅ Prevented | ✅ Prevented | ✅ Prevented (PostgreSQL MVCC) | ⚡ Moderate |
| Serializable | ✅ Prevented | ✅ Prevented | ✅ Prevented | 🐢 Lowest |
  • Dirty read: Reading a row modified by an uncommitted transaction. Prevented by any level above Read Uncommitted.
  • Non-repeatable read: Re-reading the same row within a transaction and seeing a different value (another transaction committed a change between your two reads).
  • Phantom read: Re-running a range query and seeing new rows (another transaction inserted matching rows between your two queries).
sequenceDiagram
  participant T1 as 🔵 Transaction 1
  participant DB as 🗄️ Database
  participant T2 as 🔴 Transaction 2

  Note over T1,T2: Read Committed isolation (PostgreSQL default)<br/>Non-repeatable read is POSSIBLE here

  T1->>DB: BEGIN
  T2->>DB: BEGIN

  T1->>DB: SELECT balance FROM accounts WHERE id='alice'
  DB-->>T1: balance = 1000

  T2->>DB: UPDATE accounts SET balance = 500 WHERE id='alice'
  T2->>DB: COMMIT  ← committed before T1 re-reads

  T1->>DB: SELECT balance FROM accounts WHERE id='alice'
  DB-->>T1: balance = 500  ← different value! (non-repeatable read)

  Note over T1: T1 sees two different values for<br/>the same row within one transaction.<br/>Use REPEATABLE READ to prevent this.
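The same anomaly can be reproduced in a toy MVCC model, assuming a simplified versioned store rather than PostgreSQL's engine: Read Committed takes a fresh snapshot per statement, while Repeatable Read pins the snapshot taken at BEGIN.

```typescript
// Toy MVCC: each key keeps a list of committed versions; a read sees the
// newest version committed at or before its snapshot.
type Version = { value: number; committedAt: number };

class MvccStore {
  private clock = 0;
  private versions = new Map<string, Version[]>();

  commitWrite(key: string, value: number): void {
    this.clock++;
    const list = this.versions.get(key) ?? [];
    list.push({ value, committedAt: this.clock });
    this.versions.set(key, list);
  }
  snapshot(): number { return this.clock; }
  readAsOf(key: string, asOf: number): number | undefined {
    const list = this.versions.get(key) ?? [];
    for (let i = list.length - 1; i >= 0; i--)
      if (list[i].committedAt <= asOf) return list[i].value; // readers never block writers
    return undefined;
  }
}

const db = new MvccStore();
db.commitWrite("alice", 1000);
const pinned = db.snapshot();   // REPEATABLE READ: snapshot taken at BEGIN
db.commitWrite("alice", 500);   // T2 commits mid-transaction

// READ COMMITTED re-snapshots per statement, so the re-read sees the update:
console.log(db.readAsOf("alice", db.snapshot())); // 500
// REPEATABLE READ keeps the pinned snapshot, so both reads agree:
console.log(db.readAsOf("alice", pinned));        // 1000
```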

Interview tip: default to Read Committed, justify Serializable explicitly

PostgreSQL defaults to Read Committed for a reason: it prevents dirty reads (the most dangerous anomaly) while keeping throughput high. Escalate to Serializable only when you can justify it: financial double-spend prevention, inventory allocation, or any scenario where "SELECT + conditional UPDATE" must be atomic. Saying "I'd use Serializable transactions for all checkout operations to prevent overselling" is more precise than "I'd add a distributed lock."

D – Durability

A committed transaction survives server crashes, power failures, and hardware faults. Implemented by fsync: PostgreSQL calls fsync() on the WAL file before acknowledging COMMIT to the client. This forces the OS to flush the WAL to physical disk rather than retaining it in the OS page cache (which would be lost on a crash).

-- PostgreSQL synchronous commit settings (trade durability for throughput)
-- synchronous_commit = on   → fsync on every COMMIT (default, fully durable)
-- synchronous_commit = off  → async commit; a short window of the most recent commits can be lost on crash
-- synchronous_commit = remote_write → commit when at least one standby has received the WAL

Get the WAL and MVCC mental models solid – every question about concurrent writes, replica lag, or crash recovery traces back to one of them.


Database Types

| Type | Data Model | Primary Access Pattern | Horizontal Scale | ACID | Examples |
| --- | --- | --- | --- | --- | --- |
| Relational | Tables with rows + fixed schema | Multi-column queries, JOINs, aggregations | Vertical + read replicas (sharding is manual) | Full | PostgreSQL, MySQL, SQLite |
| Document | Semi-structured JSON documents | Fetch document by ID or field; no JOINs | Native horizontal sharding | Per-document | MongoDB, CouchDB, Firestore |
| Key-Value | Opaque value per key | Single-key lookup only (no secondary index natively) | Horizontal sharding by key | Limited | Redis, DynamoDB, etcd |
| Wide-Column | Rows with sparse, dynamic columns | Key + column family lookup; append-optimised | Native horizontal sharding across nodes | Eventual/configurable | Cassandra, HBase, ScyllaDB |
| Time-Series | Timestamped metrics + tags | Range queries over time, downsampling, retention | Horizontal by time range | Limited | InfluxDB, TimescaleDB, Prometheus |
| Graph | Nodes + directed/labelled relationships | Relationship traversal (depth-first, shortest path; each hop is O(1), no JOINs) | Limited | Full (Neo4j) | Neo4j, Amazon Neptune, DGraph |

Which to choose and when

Here's the honest answer on database choice: start with Relational, and don't leave until a specific query shape or throughput requirement forces you out.

Relational is the correct default for any new service until proven otherwise. Every data model can be represented in a relational schema. The question is never "can I use Postgres?" but "what query patterns does Postgres make expensive?"

Document is the right upgrade when your data is naturally hierarchical (blog posts with nested comments are one document, not three JOIN tables) and you never query across document boundaries. The moment you need JOIN-equivalent queries, you have made the wrong choice.

Wide-Column (Cassandra) is correct when: write volume exceeds what a primary DB can absorb, data is append-only, and read patterns are known and fixed to partition key + clustering column. Its power is 1M+ writes/second at linear scale; its trap is the total inability to query columns outside the primary/clustering key without a secondary index (which is expensive at scale).

Key-Value is the right choice when the access pattern is exclusively get(key) and set(key, value) – sessions, leaderboards, feature flags, distributed locks. DynamoDB adds secondary indexes and range queries on the sort key; Redis adds data structures (sorted sets, streams, pub/sub) on top of the basic K-V model.

Time-Series is correct when your schema is (timestamp, metric_name, value, tags[]) and your queries are "average CPU over last 5 minutes, grouped by host". Time-series databases optimise for exactly this: they compress timestamp data aggressively, automatically downsample old data (replace per-minute points with per-hour averages after 30 days), and enforce retention policies that would require manual partitioning in PostgreSQL.

Polyglot persistence – using multiple databases in one system

Large-scale systems routinely combine database types for different workloads within the same product. Stripe uses PostgreSQL for transactional ledger entries, Redis for rate limiting and distributed locks, and a column store for analytics. This is called polyglot persistence, and it is the correct architectural pattern – each database is used for what it does best, not for everything.

The companies doing polyglot persistence earned it by starting simple and migrating to each database only when a specific pain point demanded it. Nobody wins by starting with four databases.


Trade-offs

| Pros | Cons |
| --- | --- |
| Indexes reduce query time by 100–100,000× for selective lookups vs. full scans | Indexes add write overhead on every INSERT/UPDATE/DELETE – each index is a separate B-Tree that must be maintained |
| ACID transactions prevent data corruption under concurrent writes and crashes | Serializable isolation reduces throughput; every additional isolation guarantee costs concurrency |
| Relational schemas enforce data integrity at the DB layer – impossible to write partial or malformed records | Schema migrations on large tables are expensive operations; ALTER TABLE ADD COLUMN on a 100M-row table can lock the table for minutes |
| Buffer pool caches hot pages automatically – most reads never hit disk | Buffer pool pressure from infrequent scans (OLAP queries) evicts hot pages needed by OLTP queries – mixed workloads conflict |
| Replication provides read scaling and high availability at low latency overhead | Replica lag introduces read-after-write inconsistency – a user writes a row, reads it back immediately from a replica, and it isn't there yet |
| PostgreSQL handles 10K–50K transactional queries/second on modern hardware | Horizontal write scaling requires manual sharding in relational databases – not a knob you turn, a re-architecture project |
| Mature tooling: pgvector, PostGIS, full-text search, JSONB, partitioning – all without leaving the relational model | Auto-vacuuming and dead tuple bloat in MVCC databases require operational discipline; a neglected table grows unboundedly until vacuum reclaims space |

The fundamental tension here is consistency vs. throughput. Each ACID guarantee restricts the concurrency model: serializable isolation is provably correct but reduces the number of concurrent transactions the database can handle.

Every time you relax an isolation level or use eventual consistency, you trade a specific correctness guarantee for a specific throughput gain. The engineering discipline is knowing which anomalies are acceptable for each data type in your system.


When to Use It / When to Avoid It

So when does the database choice actually matter in an interview? Every time – because every system design answer involves persisting state, and the database type determines your query flexibility, consistency guarantees, and scaling ceiling.

Use a relational database (PostgreSQL) as your starting point when:

  • Your data has relationships between entities that you'll JOIN โ€” orders with customers, users with roles, payments with accounts.
  • You need ACID transactions โ€” financial operations, inventory management, any workflow where partial completion is incorrect.
  • Your query patterns are varied and not yet fixed — relational indexes can be added after launch; a key-value store can only answer the lookups its keys were designed for.
  • You are at < 10K writes/second on a single-tenant service. PostgreSQL handles this comfortably without sharding.
  • Your team knows SQL. The cost of operating an unfamiliar database under production pressure is often higher than the cost of vertically scaling a familiar one.

Upgrade to a different database type when (and only when):

  • Write throughput exceeds what a primary + N replicas can sustain โ†’ Wide-Column (Cassandra) or a queue-backed writes pattern.
  • You are storing 100B+ time-stamped metrics with downsampling and retention requirements โ†’ TimescaleDB or InfluxDB.
  • Your primary query is relationship traversal ("all friends-of-friends within 3 hops") โ†’ Graph database; this becomes exponentially expensive in SQL JOINs.
  • Your schema is genuinely unknown and changes per customer (SaaS with per-customer custom fields at scale) โ†’ Document database, but validate that you truly never need cross-document queries before committing.
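To see why hop-limited traversal strains SQL, here is a sketch of "everyone within 3 hops" written as a recursive CTE (SQLite, with a toy friends table assumed). Each additional hop re-joins the frontier against the full edge table, which is exactly the work a graph database avoids by storing adjacency directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE friends (a INTEGER, b INTEGER);  -- directed edge a -> b
    INSERT INTO friends VALUES (1, 2), (2, 3), (3, 4), (4, 5);
""")

rows = conn.execute("""
    WITH RECURSIVE reach(person, depth) AS (
        SELECT 1, 0                                -- start at user 1
        UNION
        SELECT f.b, r.depth + 1
        FROM reach r JOIN friends f ON f.a = r.person
        WHERE r.depth < 3                          -- stop after 3 hops
    )
    SELECT DISTINCT person FROM reach WHERE person <> 1 ORDER BY person
""").fetchall()
print([p for (p,) in rows])  # [2, 3, 4] -- user 5 is 4 hops away and excluded
```

On a real social graph the frontier grows with the branching factor at every hop, so each level of the recursion joins an exponentially larger row set.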

Avoid over-engineering by:

  • Not choosing a NoSQL database because it "scales better" โ€” define what scale you need and verify your chosen database cannot reach it before switching.
  • Not sharding before you've exhausted read replicas, connection pooling, and query optimisation โ€” sharding is a complexity multiplier that makes every future feature harder.
  • Not using multiple database types before exhausting what your single database type can do with proper schema design (JSONB in PostgreSQL handles many document use cases with full SQL queryability).
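The JSONB point can be illustrated with SQLite's JSON functions (a stand-in here; PostgreSQL would spell the lookup body->>'plan' and index it with a GIN or expression index — table and fields are hypothetical): document-shaped data stays queryable, and indexable, without leaving SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")  # JSON kept in a text column
conn.execute("INSERT INTO docs (body) VALUES (?)", ('{"plan": "pro", "seats": 12}',))
conn.execute("INSERT INTO docs (body) VALUES (?)", ('{"plan": "free", "seats": 1}',))

# Expression index so the JSON path lookup doesn't scan every document.
conn.execute("CREATE INDEX idx_plan ON docs (json_extract(body, '$.plan'))")

rows = conn.execute(
    "SELECT json_extract(body, '$.seats') FROM docs "
    "WHERE json_extract(body, '$.plan') = ?", ("pro",)
).fetchall()
print(rows)
```

The query returns the seat count for the "pro" document only, served through the expression index rather than a full scan.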

Real-World Examples

The most instructive examples aren't the "we switched to NoSQL" stories โ€” they're the "we stayed on PostgreSQL and here's exactly how" ones. The three below show what database discipline actually looks like at scale.

Stripe โ€” PostgreSQL at financial scale

Stripe runs one of the largest PostgreSQL deployments in the world, processing hundreds of billions of dollars annually. Their key architectural decisions: every service owns its own isolated PostgreSQL instance (not a shared mega-DB), tables are partitioned by created_at and id so old partitions become append-only and can be archived to cold storage, and all financial mutations use serializable isolation to prevent double-processing.

Stripe does not use NoSQL for their core ledger โ€” the correctness guarantees of relational ACID transactions are non-negotiable for payment data. Their scaling technique is not sharding; it is aggressive partitioning, careful index design, and splitting read-heavy operations onto read replicas.

Discord โ€” 1 trillion messages: Cassandra โ†’ ScyllaDB migration

Discord stored their message history in Apache Cassandra. At 1 trillion messages, read latency became unpredictable: Cassandra's garbage collection pauses caused p99 read latency to spike from 5ms to 500ms under load. In 2023, Discord migrated to ScyllaDB (a Cassandra-compatible, C++-rewritten engine with lower GC overhead).

The migration dropped 99th-percentile read latency from ~500ms to ~15ms on the same data and access patterns. The lesson: the data model (wide-column, partition by channel_id + bucket_id) was correct. The JVM garbage collector was the bottleneck, not the data model.

Notion โ€” PostgreSQL at document-scale

Notion stores all workspace content in PostgreSQL. At 100M+ blocks across millions of workspaces, they use a physically partitioned schema: each customer workspace's data lives in separate PostgreSQL table partitions.

This keeps operational data access patterns local (no cross-workspace queries) and enables per-workspace backup, restore, and isolation. Notion deliberately chose PostgreSQL over a document database (despite storing JSON-like block structures) because of ACID transaction guarantees when a user edits a page simultaneously from multiple devices โ€” the relational transaction model handles concurrent edits correctly; eventual-consistency document databases would require application-layer conflict resolution.

The pattern across all three: the bottleneck was operational or implementation-level, not a fundamental flaw in the core data model choice.


How This Shows Up in Interviews

Every system design question involves a database โ€” name yours in the first three minutes with a concrete rationale. The mistake I see most often is candidates who spend ten minutes describing features and never commit to a database choice. State it, justify it with a query pattern and a throughput number, and move on.

When to bring it up proactively

Within the first 3 minutes, name your primary database with a rationale: "I'd use PostgreSQL for the user and payment data — ACID transactions are non-negotiable for financial state, and Postgres handles 20K writes/second on a single primary which satisfies our throughput estimate." Then add read replicas for read scaling. Only introduce additional databases when you can show the specific query shape that your primary database cannot serve efficiently.

Don't say 'we can always switch databases later'

Migrating a production database is one of the most expensive engineering operations that exists. Schema and access pattern decisions made at system design time are extremely sticky. An interviewer will push back hard if you suggest adding a Cassandra cluster without specifying the write throughput that justifies it, or choosing MongoDB without demonstrating a schema that genuinely cannot be served by a relational model.

Here's what separates a strong answer from a weak one at the senior level:

Depth expected at senior/staff level:

  • Know your numbers: a single PostgreSQL primary handles 10K–50K OLTP queries/second; a connection pooler (PgBouncer) lets you serve tens of thousands of client connections over a few hundred real ones; each read replica adds ~50K read queries/second.
  • Explain index selection by workload: "I'd add a composite index on (user_id, created_at DESC) for the feed query โ€” user_id is the equality predicate (high cardinality), created_at DESC is the range that drives the sort, and I'd consider INCLUDE columns to make it a covering index."
  • Articulate the ACID trade-off concretely: "Read Committed prevents dirty reads at the cost of non-repeatable reads. For payment operations, I'd use Serializable to prevent the phantom read anomaly that allows double-spend. For read-heavy reporting queries, Read Committed is fine."
  • Address sharding as a last resort with specific triggers: "I'd shard only after read replicas + PgBouncer + index optimisation are exhausted. Sharding by user_id is the natural partition key here โ€” it keeps a user's data co-located and avoids cross-shard JOINs on the most common query."
  • Know operational concerns: vacuum in PostgreSQL, connection pool sizing (max_connections = RAM_for_connections / 10MB_per_connection), WAL management, and replica lag monitoring.
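The index-selection point can be checked against the planner. Here is a sketch in SQLite (table and columns hypothetical; PostgreSQL would spell the covering part as INCLUDE (title)): with every selected column present in the index, the plan reports a covering-index search and never touches the table at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, "
    "created_at TEXT, title TEXT)"
)
# Equality column first, then the range/sort column; title makes it covering.
conn.execute("CREATE INDEX idx_feed ON posts (user_id, created_at DESC, title)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT created_at, title FROM posts "
    "WHERE user_id = ? ORDER BY created_at DESC",
    (42,),
).fetchall()
detail = " ".join(str(row) for row in plan)
print("COVERING INDEX" in detail)  # True -- the query is answered from the index alone
```

Drop title from the index and the same query falls back to a plain index search plus a table lookup per row, which is exactly the extra I/O a covering index eliminates.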

Common follow-up questions and strong answers:

  • "How do you handle 300K writes/second?" → "That's beyond a single PostgreSQL primary (~50K w/s). I'd evaluate: (1) can writes be made async via a queue? (2) can the write table be sharded by user_id? (3) is the write pattern time-series (→ Cassandra/LSM-Tree)? Before choosing, I need to know if writes are idempotent and whether ORDER BY queries across all writers are required."
  • "What happens to reads when you add a read replica?" → "Reads on the replica see replica lag — typically 10–200ms behind the primary. For read-after-write (user posts a comment and immediately sees it), route the read back to the primary for that session. For analytics and reporting queries, replica lag is acceptable. Alert if replica lag exceeds your SLA threshold — aggressive vacuum or large transactions on the primary cause it."
  • "How does PostgreSQL handle concurrent updates to the same row?" → "MVCC: readers never block writers, and writers never block readers. Writers to the same row queue behind each other via row-level locks (explicitly with SELECT FOR UPDATE). Each transaction reads from a snapshot; under Serializable, if two transactions conflict, one is aborted with a serialization_failure and must retry. This is why application retry logic on serialization_failure errors is required."
  • "When would you choose MongoDB over PostgreSQL?" → "When the data is genuinely document-shaped, schema evolution is so rapid that migrations are a bottleneck, documents are the primary access unit (no cross-document JOINs), and horizontal scaling is needed after PostgreSQL tuning is exhausted. The risk: MongoDB's eventual consistency model makes any multi-document operation a coordination problem the application must solve. PostgreSQL with JSONB handles most 'document-like' schemas with full SQL queryability — I'd start there."
  • "How do you prevent an expensive OLAP query from killing your OLTP latency?" → "Query isolation: route OLAP queries to a dedicated read replica or a Redshift/BigQuery analytical store. On the OLAP replica, configure statement_timeout to kill runaway queries before they consume all connections. Physically separate OLAP and OLTP workloads at the hardware level — they have opposite profiles (OLAP: sequential I/O, many rows, low concurrency; OLTP: random I/O, few rows, high concurrency)."

The candidates who stand out can trace every trade-off to a specific mechanism — not "databases are slow" but "row-level locks serialize concurrent writes to the same row, and Serializable isolation aborts conflicting transactions, both of which cap throughput."
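The serialization_failure retry mentioned above can be sketched generically. In this sketch, SerializationFailure and run_with_retry are illustrative names, standing in for the driver's actual error (SQLSTATE 40001 in PostgreSQL) and whatever retry helper your application provides.

```python
import random
import time

class SerializationFailure(Exception):
    """Stand-in for the driver error raised when the database aborts a
    serializable transaction after a write conflict (SQLSTATE 40001)."""

def run_with_retry(txn, max_attempts=5, base_delay=0.001):
    """Run a transaction callable, retrying on serialization conflicts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except SerializationFailure:
            if attempt == max_attempts:
                raise  # give up and surface the conflict to the caller
            # Exponential backoff with jitter so retrying writers spread out.
            time.sleep(base_delay * (2 ** attempt) * random.random())

# Simulated transaction that conflicts twice, then commits.
attempts = {"n": 0}
def transfer():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise SerializationFailure
    return "committed"

print(run_with_retry(transfer), attempts["n"])  # committed 3
```

The transaction callable must be safe to re-run from the top (re-read, re-compute, re-write), since the database discards all of the aborted attempt's work.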


Quick Recap

  1. A database solves three problems simultaneously: it indexes data for fast queries, coordinates concurrent writes with transactions to prevent data corruption, and uses the Write-Ahead Log to survive crashes without data loss.
  2. The storage engine (B-Tree vs LSM-Tree) determines your read/write performance profile before any query runs โ€” B-Tree for predictable O(log N) OLTP reads and writes, LSM-Tree for write-heavy workloads that need sequential I/O; choose based on your write-to-read ratio, not the database brand.
  3. Indexes are the highest-leverage performance tool: without an index, a query on 100M rows takes 30+ seconds; with the right composite index, the same query runs in under 1ms โ€” the most important interview answer you can give to any "how do you scale reads?" question is "I'd check the query plan before adding infrastructure."
  4. ACID's I (Isolation) is the trickiest property โ€” PostgreSQL's default Read Committed prevents dirty reads but not non-repeatable reads; use SELECT FOR UPDATE or optimistic conditional UPDATE to make check-and-modify operations atomic; use Serializable for financial operations where phantom reads allow double-spend.
  5. Connection pools (PgBouncer) are non-optional at scale โ€” PostgreSQL allocates 5โ€“10MB per connection, and 10,000 direct connections consume 70GB before executing a single query; PgBouncer multiplexes 100,000 app-layer connections over 300โ€“500 real DB connections.
  6. The Saga pattern replaces distributed transactions for cross-service workflows โ€” each service commits locally, publishes a domain event, and compensating transactions roll back on failure โ€” avoiding 2PC's coordinator-failure-induced deadlocks at the cost of brief cross-service inconsistency windows.
  7. In every interview, state your database choice with a query analysis: "PostgreSQL handles our 5K writes/second and JOIN-heavy query pattern โ€” I'd add a composite index on (user_id, created_at) for the feed query, use read replicas for analytics, and only shard if write throughput exceeds what a single primary can sustain."

Related Concepts

  • Caching โ€” Caching exists to protect databases from read fan-out; understanding which queries are worth caching (high hit rate, acceptable staleness) requires understanding database query cost and connection pool limits. Cache hit rate and DB connection utilisation are the two numbers that determine whether your read tier is healthy.
  • Replication โ€” Database replication is covered in depth in the Replication article: primary-replica lag, synchronous vs asynchronous WAL streaming, and the trade-off between read scale and durability guarantees on commit.
  • Sharding โ€” When vertical scaling and read replicas are exhausted, sharding partitions data across multiple database primaries. The shard key choice determines whether queries become cross-shard fan-outs (expensive) or stay local (fast). Never shard before exhausting what a single primary can do.
  • SQL vs. NoSQL โ€” The tradeoffs article frames the relational vs. document/wide-column decision more completely, covering schema evolution speed, horizontal scaling flexibility, and the specific workload patterns where each model wins.
  • Scalability โ€” Databases are typically the first bottleneck on the scaling path. The scalability article covers the full vertical โ†’ read replicas โ†’ sharding โ†’ CDN progression with concrete traffic thresholds at each step.
