Databases
Learn how databases organize data for fast retrieval, which storage engine to choose for your workload, and how ACID transactions keep concurrent writes correct at scale.
TL;DR
- A database is not just a place to store data - it is a query engine, a concurrency manager, and a durability guarantee packaged together. Understanding all three layers is what separates "I added PostgreSQL" from a defensible system design.
- The storage engine (B-Tree vs. LSM-Tree) determines your read/write trade-off before you write a single query. B-Tree for balanced OLTP; LSM-Tree for write-heavy workloads where sequential I/O matters.
- Indexes are the single highest-leverage performance tool in any database. A missing index turns a 1ms primary-key lookup into a 30-second full-table scan at 100M rows. A composite index in the wrong column order is as useless as no index.
- ACID (Atomicity, Consistency, Isolation, Durability) solves four distinct concurrency hazards. Each property has a specific mechanism: the transaction log enforces atomicity; fsync enforces durability; isolation levels control read/write anomalies; constraints enforce consistency.
- The right database for a job is determined by your query shape - specifically, whether you need JOINs, which indexes you'll build, write vs. read ratio, and consistency requirements - not by your data format or team preference.
The Problem It Solves
It is Q4 earnings. Your fintech platform has processed 2 billion transactions this year. A risk analyst needs every transaction from account ACC-8841 in the last 90 days, grouped by merchant, filtered to amounts over $500.
Simple enough - except your transactions live in a distributed file system as individual JSON files, one file per transaction. Finding account ACC-8841's transactions means opening all 2 billion files sequentially. At 0.1ms per file open, that is 55 hours of I/O.
Meanwhile, your reconciliation service is reading existing transaction files while the payments service is writing new ones. No locking. A payment file half-written by the payments service is read mid-write by the reconciliation service.
The reconciliation total is wrong - but by how much? There is no way to know. The file system does not track partial writes.
And at 2:47 a.m., your transaction processing server crashes between writing the debit record and the credit record. Alice's $200 was debited. Bob never received it.
I've seen teams discover this failure mode six months into production - usually when a customer calls finance about a missing transfer. You have no transaction log, no rollback, no way to know which transfers completed and which are half-done.
The 'we can figure out schema later' trap
Raw files and key-value stores feel fast to build with early on. They collapse under three specific pressures: multi-column query patterns (you can only index what you anticipated), concurrent writes without coordination (race conditions that only appear under load), and crash recovery (the question becomes "how much data did we lose?" not "did we lose any?"). Every database exists to solve these three problems simultaneously.
flowchart TD
subgraph NoDB["No Database - The Three Failure Modes"]
direction TB
Files(["2B JSON files\nFlat filesystem storage"])
subgraph Query["Query Problem"]
QP["Full file scan for every query\n2B files × 0.1ms = 55 hrs\nNo index possible"]
end
subgraph Concurrency["Concurrency Problem"]
CP["Payments writes file mid-read\nReconciliation reads corrupt half-file\nBalance total is wrong - silently"]
end
subgraph Crash["Crash Problem"]
CR["Server crashes mid-transfer\nDebit written · Credit missing\nNo rollback · No audit trail"]
end
end
Files -->|"Query for ACC-8841 in Q4"| QP
Files -->|"Concurrent read+write"| CP
Files -->|"Crash at 2:47 AM"| CR
A database solves all three failure modes in a single system: it gives you indexes for fast queries, transactions for concurrency safety, and a write-ahead log for crash recovery. These aren't nice-to-haves - they're the reason databases exist.
What Is It?
A database is a structured data store with three integrated subsystems: a query interface that translates your intent into an execution plan, a storage engine that physically organises data for the access patterns you need, and a transaction manager that serialises concurrent operations and ensures committed changes survive failures. Most candidates treat it as just a place to store data. Understanding all three subsystems is what makes your storage trade-off reasoning actually defensible.
Analogy: Think of a large reference library with millions of books. If books were stored at random, finding all books published in 1987 on quantum physics requires reading every title. A library solves this with a card catalogue (B-Tree index), books shelved by subject and author (storage model), a sign-out ledger (transaction log), and a fireproof copy of the catalogue offsite (durability).
The catalogue is not the library โ it is the structure that makes the library useful at scale. Your schema, indexes, and access patterns are the catalogue.
flowchart TD
subgraph AppTier["Application Tier"]
App(["Application\nread/write requests"])
end
subgraph DBTier["Database System"]
CP["Connection Pool\nmax 100-500 connections\nrequest queuing"]
QP["Query Parser & Planner\nparse SQL → AST → execution plan\nchoose index or full scan"]
EE["Execution Engine\nrun plan · join tables\neval predicates"]
SE["Storage Engine\nB-Tree or LSM-Tree\nmanage pages on disk"]
BP["Buffer Pool\nin-memory page cache\n~90%+ cache of hot pages"]
WAL[("Write-Ahead Log\nappend-only crash recovery\nfsync on commit")]
Disk[("Data Files\n8KB pages on disk\nprimary storage")]
end
App -->|"SQL / API query"| CP
CP -->|"1 connection per request"| QP
QP -->|"execution plan"| EE
EE -->|"page request"| BP
BP -->|"Buffer miss → disk read"| Disk
EE -->|"Write → WAL first"| WAL
WAL -.->|"Async flush to data files"| Disk
BP -.->|"Dirty page eviction"| Disk
With a database, the risk analyst's query becomes:
SELECT merchant_id, SUM(amount), COUNT(*)
FROM transactions
WHERE account_id = 'ACC-8841'
AND created_at >= NOW() - INTERVAL '90 days'
AND amount > 500
GROUP BY merchant_id
ORDER BY SUM(amount) DESC;
With a composite index on (account_id, created_at, amount), this executes in under 5ms regardless of how many total transactions exist - because the storage engine traverses the B-Tree to account ACC-8841, range-scans the last 90 days, and filters the rest. It never touches rows for other accounts. That 5ms-vs-55-hours gap is only possible because all three subsystems - query, storage, and transaction - work together.
How It Works
Here is what happens end-to-end when your application issues SELECT * FROM products WHERE id = 7429:
- Connection pool assigns a connection - Your application maintains a pool of persistent database connections. A new request borrows one instead of negotiating a full TCP + auth handshake (5-50ms overhead per request without a pool).
- Parser validates syntax - The SQL string is tokenised and parsed into an Abstract Syntax Tree. Syntax errors are caught here before touching any data.
- Query planner chooses an execution plan - The planner checks pg_statistics for row-count estimates, index availability, and selectivity. If id is a primary key (clustered B-Tree index), the plan is: Index Scan using products_pkey on products WHERE id = 7429. If no index exists: Seq Scan on products - read every row.
- Execution engine runs the plan - For an index scan, it traverses the B-Tree from the root to the leaf page containing id = 7429.
- Buffer pool check - Before hitting disk, the execution engine checks the in-memory buffer pool. If the page is already cached (~95% hit rate on hot data), it returns immediately (< 0.5ms). On a miss, it fetches the 8KB page from disk (~1-5ms SSD, ~5-15ms HDD) and warms the buffer pool for future access.
- Result returned - The deserialised row data is returned over the connection and ultimately to your application.
// Application layer - always use parameterised queries (prevents SQL injection)
async function getProduct(productId: number): Promise<Product | null> {
// Connection pool handles connection acquisition; do NOT create one per request
const result = await db.query<Product>(
'SELECT id, name, price, stock_qty FROM products WHERE id = $1',
[productId] // Parameter, not string interpolation - prevents SQL injection
);
return result.rows[0] ?? null;
}
-- Debug slow queries with EXPLAIN ANALYZE (PostgreSQL)
EXPLAIN ANALYZE
SELECT id, name, price FROM products WHERE id = 7429;
-- Without index (catastrophic at 100M rows):
-- Seq Scan on products (cost=0.00..2843291.00 rows=1 width=40)
-- (actual time=28943.871..28943.872 rows=1 loops=1)
-- Planning: 3.2ms · Execution: 28943.9ms (29 seconds!)
-- With primary key index (default B-Tree):
-- Index Scan using products_pkey (cost=0.56..8.58 rows=1 width=40)
-- (actual time=0.042..0.043 rows=1 loops=1)
-- Planning: 0.1ms · Execution: 0.043ms (instant!)
Interview tip: always EXPLAIN ANALYZE before adding infrastructure
When an interviewer asks how you'd handle slow queries, say: "First, I'd run EXPLAIN ANALYZE to confirm the execution plan - 80% of database performance problems are resolved with the right index, not a new database or caching layer." This signals you know to measure before architecting.
The sequence from query to data on every non-trivial request:
sequenceDiagram
participant App as Application
participant Pool as Connection Pool
participant Planner as Query Planner
participant Buffer as Buffer Pool
participant Disk as Disk
App->>Pool: borrow connection
Pool-->>App: connection ready (~0ms)
App->>Planner: SELECT * FROM products WHERE id=7429
Planner->>Planner: parse SQL → check statistics<br/>→ choose index scan plan
Planner->>Buffer: fetch page for id=7429
alt Buffer HIT (~95% of reads)
Buffer-->>Planner: page data · < 0.5ms
else Buffer MISS
Buffer->>Disk: read 8KB page from SSD
Disk-->>Buffer: page data · 1-5ms
Buffer-->>Planner: page data (now cached)
end
Planner-->>App: result row
App->>Pool: return connection
Key Components
| Component | Role |
|---|---|
| Query parser | Converts SQL string into an Abstract Syntax Tree. Rejects malformed queries immediately. |
| Query planner / optimiser | Chooses the execution plan: which indexes to use, join ordering, predicate pushdown. Wrong plans are the leading cause of unexplained slowdowns. |
| Execution engine | Runs the plan: iterates rows, evaluates predicates, aggregates results. For complex queries, this stage does the most CPU work. |
| Buffer pool | In-memory cache of disk pages (typically 25-80% of RAM in PostgreSQL). The goal is a 95%+ hit rate so most reads never touch disk. The database equivalent of an application-level cache - but automatic. |
| Storage engine | Manages the on-disk layout: B-Tree for PostgreSQL/MySQL, LSM-Tree for Cassandra/RocksDB. Determines read/write performance characteristics. |
| Write-Ahead Log (WAL) | Every write is appended to the WAL before touching data files. On crash, the DB replays the WAL to restore the committed state. This is how durability (the D in ACID) is implemented. |
| Transaction manager | Assigns transaction IDs, tracks in-flight transactions, enforces isolation levels, coordinates commits and rollbacks. Maintains MVCC snapshots. |
| Connection pool | Manages a pool of persistent DB connections. Creating a new connection costs 5-50ms and requires authentication + session setup. At 10K req/s, new-connection-per-request adds 50-500ms of pure overhead. |
| Replication slot | Tracks how far each replica has consumed the WAL stream. A lagging replica with a stale slot will prevent the primary from cleaning old WAL segments - a disk-space trap. |
| MVCC (Multi-Version Concurrency Control) | Maintains multiple row versions so readers never block writers. Readers see a consistent snapshot; they do not acquire locks. This is why PostgreSQL reads scale independently of writes. |
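The MVCC row in the table above rewards a concrete mental model. The sketch below is illustrative only - the `xmin`/`xmax` names mirror PostgreSQL's tuple header fields, but the visibility logic here is deliberately simplified: each row version records which transaction created it and which (if any) deleted it, and a reader's snapshot of committed transactions decides which version it sees, with no locks taken.

```typescript
// Toy MVCC visibility check (simplified; not PostgreSQL's actual algorithm).
type RowVersion = { value: string; xmin: number; xmax: number | null };

// `committed` is the reader's snapshot: transactions committed when it began.
function visible(v: RowVersion, committed: Set<number>): boolean {
  const creatorCommitted = committed.has(v.xmin);
  const deleterCommitted = v.xmax !== null && committed.has(v.xmax);
  return creatorCommitted && !deleterCommitted;
}

// Two versions of one logical row: tx 5 rewrote what tx 1 created.
const versions: RowVersion[] = [
  { value: "balance=1000", xmin: 1, xmax: 5 },
  { value: "balance=500",  xmin: 5, xmax: null },
];

// A reader whose snapshot predates tx 5's commit still sees the old version -
// the writer never blocked it, which is why readers never block writers.
const snapshot = new Set([1]);
console.log(versions.filter(v => visible(v, snapshot)).map(v => v.value)); // [ 'balance=1000' ]
```

A reader that starts after tx 5 commits gets a snapshot containing {1, 5} and sees only `balance=500` - the same rows, a different consistent view.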
Storage Engines: B-Tree vs. LSM-Tree
The storage engine is the component that physically organises your data on disk. Choosing the wrong one for your workload is the most permanent performance mistake you can make - it is baked into your schema design, your operational runbooks, and your hardware bill. I see this decision made on vibes more often than any other in system design - "Cassandra scales better" is not an analysis.
B-Tree
B-Trees store data as a balanced tree of fixed-size pages (typically 8KB). Every read and write traverses from the root to a leaf page.
The leaf pages contain the actual row data (or row pointers in a secondary index). Leaves are linked for efficient range scans.
Why it dominates OLTP: B-Tree provides O(log N) for both reads and writes, predictable latency, and efficient range queries (BETWEEN, ORDER BY, >, <). The standard page size means the OS page cache and the DB buffer pool interact cleanly, and SSD random read latency (0.1ms) is fast enough for typical OLTP tree depths of 3-5 levels.
The write amplification problem: A single 100-byte row write may trigger an 8KB page rewrite. More critically, as the tree rebalances, a single insert into a full page triggers a page split: the existing page is copied, split into two new pages, and the parent page is updated. In heavy insert workloads, write amplification factors of 5-50× are common - one logical write becomes 5-50 physical disk writes.
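To see why B-Tree reads stay cheap even at enormous row counts, work out the depth: each level multiplies the number of reachable keys by the page fanout, so depth grows as log_fanout(N). A quick sketch - the fanout of 500 keys per 8KB page is an illustrative assumption, not a PostgreSQL constant:

```typescript
// Depth of a B-Tree holding `rows` keys with `fanout` keys per page.
// Depth = ceil(log_fanout(rows)), since each level multiplies reach by fanout.
function btreeDepth(rows: number, fanout: number = 500): number {
  return Math.ceil(Math.log(rows) / Math.log(fanout));
}

console.log(btreeDepth(100_000_000));    // 3 - 100M rows is a handful of page reads
console.log(btreeDepth(10_000_000_000)); // 4 - 100x more data adds a single level
```

This logarithmic growth is why the article can claim predictable latency: going from millions to billions of rows adds one page read, not thousands.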
LSM-Tree
LSM-Trees (Log-Structured Merge Trees) never update data in-place. All writes go to an in-memory buffer (MemTable) first. When the MemTable fills (~64MB), it is flushed as an immutable sorted file (SSTable) to disk.
Background compaction merges SSTables from Level 0 into larger sorted runs at Level 1, then Level 2, reducing the number of files reads must check.
Why it dominates write-heavy workloads: Writes are always sequential appends (O(1), no random I/O). On modern SSDs, sequential write throughput is 5-10× faster than random writes. This makes LSM-Trees ideal for time-series, IoT telemetry, event logs, and write-heavy analytics.
The read amplification problem: To find a key, the DB must check the MemTable, then L0 (every file, since L0 SSTables have overlapping key ranges), then L1, L2, and so on - each additional level adds disk accesses. LSM-Trees use per-SSTable Bloom filters to skip files that definitely do not contain a key, but reads can still hit multiple files vs. a single B-Tree path.
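Both paths can be sketched in a few lines. This toy model (class and field names invented; real engines add a WAL, levelled compaction, and probabilistic Bloom filters rather than the exact key set used here) shows the core shape: writes land in an in-memory table, a full MemTable flushes into an immutable sorted run, and reads check newest-first, skipping files via a membership filter.

```typescript
// Toy LSM-Tree: MemTable + immutable sorted SSTables. Illustrative only.
type SSTable = { keys: Set<string>; rows: [string, string][] }; // rows sorted by key

class ToyLSM {
  private memtable = new Map<string, string>();
  private sstables: SSTable[] = []; // newest first
  constructor(private flushAt: number = 2) {} // real engines flush at ~64MB

  put(key: string, value: string): void {
    this.memtable.set(key, value); // O(1) in-memory write, no random disk I/O
    if (this.memtable.size >= this.flushAt) this.flush();
  }

  private flush(): void {
    // Sort once, write an immutable run - this is the sequential append.
    const rows = [...this.memtable.entries()].sort(([a], [b]) => a.localeCompare(b));
    this.sstables.unshift({ keys: new Set(this.memtable.keys()), rows });
    this.memtable.clear();
  }

  get(key: string): string | undefined {
    if (this.memtable.has(key)) return this.memtable.get(key);
    for (const t of this.sstables) {           // read amplification: many files
      if (!t.keys.has(key)) continue;          // Bloom-filter stand-in: skip file
      const hit = t.rows.find(([k]) => k === key);
      if (hit) return hit[1];
    }
    return undefined;
  }
}
```

Note how `get` embodies read amplification (newest-to-oldest scan across runs) while `put` never touches existing data - the trade the comparison table below makes precise.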
| Dimension | B-Tree | LSM-Tree |
|---|---|---|
| Write pattern | Random I/O - in-place update | Sequential I/O - always append |
| Write amplification | 5-50× (page splits, rebalancing) | 10-30× (compaction rewrites) |
| Read performance | O(log N) - single tree path | O(log N) with amplification - multi-level check |
| Range scans | Excellent - linked leaf pages | Moderate - must merge-read SSTables |
| Space efficiency | ~50-60% page utilisation | Higher - but compaction reclaims space |
| Compaction cost | None (tree maintains itself) | Background CPU/IO - can spike under heavy write load |
| Best for | OLTP · mixed read/write · ACID transactions | Write-heavy ingestion · time-series · analytics |
| Engines | PostgreSQL, MySQL InnoDB, SQLite, Oracle | Cassandra, LevelDB, RocksDB, ClickHouse (partial) |
The fundamental tension: read amplification vs. write amplification. B-Tree minimises read amplification (single path from root to leaf) at the cost of write amplification. LSM-Tree minimises write amplification (append-only) at the cost of read amplification (multiple levels to check).
Default to B-Tree unless your write throughput or append-only access pattern specifically justifies the LSM-Tree trade-off.
Indexing
An index is a separate data structure (usually a B-Tree) that stores a sorted subset of your columns with pointers back to the full rows. The query planner uses it to jump directly to relevant rows instead of scanning every row in the table. I'd argue this is the single highest-leverage database skill - and the one most candidates gloss over when designing a system under time pressure.
Without an index on email: Find the user with email = 'alice@example.com' from 100M users → full table scan → read every row → ~30 seconds.
With a B-Tree index on email: The planner traverses the B-Tree (5 levels deep at 100M rows) → jumps to the leaf page → fetches the row pointer → one disk I/O → < 1ms.
The 30,000× latency difference is not an exaggeration. It is the norm whenever a production query runs a full table scan against a large table.
-- See what the planner is doing
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'alice@example.com';
-- After adding index:
CREATE INDEX idx_users_email ON users(email);
-- Index Scan using idx_users_email on users
-- (actual time=0.041..0.042 rows=1 loops=1) -- 0.042ms
-- Composite index: column order matters enormously
-- Query: WHERE account_id = 'ACC-8841' AND created_at >= '2026-01-01' AND amount > 500
CREATE INDEX idx_txns_account_date_amount
ON transactions(account_id, created_at, amount);
-- The planner uses this index for the above query.
-- An index on (created_at, account_id, amount) would serve this query poorly -
-- the leading range scan on created_at means the account_id equality cannot narrow the scan.
Index types
| Type | Data structure | Query support | Notes |
|---|---|---|---|
| B-Tree (default) | Balanced tree | Equality, range, LIKE 'prefix%' | 99% of indexes you will ever create |
| Hash | Hash map | Equality only (=) | Faster than B-Tree for pure equality at the cost of range support |
| GIN/GiST | Inverted/spatial | Full-text search, arrays, JSON containment, geographic | PostgreSQL-specific; essential for @>, @@, && operators |
| Partial | B-Tree on filtered subset | Equality + range on the subset | CREATE INDEX ... WHERE status = 'active' - smaller, faster |
| Covering | B-Tree storing non-key columns | Index-only scan - zero table access | CREATE INDEX ... INCLUDE (price, stock_qty) |
| Composite | B-Tree on N columns | Left-prefix rule - leading columns must be constrained | Column order matters: high-cardinality, equality-first columns first |
The left-prefix rule and when composite indexes fail
A composite index on (account_id, created_at, amount) is used by queries that filter on account_id, or account_id + created_at, or all three. It is NOT used by queries that only filter on created_at or amount alone - the first column must be constrained. Build composite indexes starting with the highest-cardinality equality column (usually an ID), followed by range columns, followed by low-cardinality filters.
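The rule is mechanical enough to express as a helper. This hypothetical function (not a real planner API, and simplified: it treats every filter alike, whereas a real planner also distinguishes equality from range conditions) returns the portion of a composite index a query can actually use, given the columns it filters on:

```typescript
// Left-prefix rule, sketched: an index serves a query only via a contiguous
// leading prefix of its columns. Helper name and shape are invented.
function usableIndexPrefix(indexCols: string[], filtered: Set<string>): string[] {
  const prefix: string[] = [];
  for (const col of indexCols) {
    if (!filtered.has(col)) break; // first unconstrained column ends the prefix
    prefix.push(col);
  }
  return prefix;
}

const idx = ["account_id", "created_at", "amount"];
console.log(usableIndexPrefix(idx, new Set(["account_id", "created_at"]))); // [ 'account_id', 'created_at' ]
console.log(usableIndexPrefix(idx, new Set(["created_at"])));               // [] - index unusable
```

A query filtering on account_id and amount but not created_at gets only the one-column prefix - the gap in the middle cuts off everything after it.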
Covering indexes โ eliminating table access entirely
A covering index stores non-key columns directly in the index leaf pages. If all columns your query touches are in the index, the DB never fetches the underlying table pages - the query executes entirely from the index, which is typically much smaller and more cache-friendly.
-- Without covering index: index scan + heap fetch (table access)
CREATE INDEX idx_products_status ON products(status);
SELECT id, name, price FROM products WHERE status = 'active';
-- Plan: Index Scan → for each matching id, fetch the heap page for name + price
-- With covering index: index-only scan
CREATE INDEX idx_products_status_covering ON products(status) INCLUDE (name, price);
SELECT id, name, price FROM products WHERE status = 'active';
-- Plan: Index Only Scan → name and price come from the index, no table fetch
-- Typically 3-5× faster for wide tables with narrow query projections
The N+1 query problem
The most common application-layer database performance issue. Your ORM executes one query to fetch a list, then one query per item to fetch a related record.
// N+1 - catastrophic at scale
const orders = await db.query('SELECT * FROM orders WHERE status = $1', ['pending']);
// Returns 1,000 orders - then fires 1,000 individual queries:
for (const order of orders) {
order.customer = await db.query(
'SELECT * FROM customers WHERE id = $1', [order.customer_id]
);
}
// Total: 1,001 queries · 1,000 round-trips · ~5 seconds for what should be 5ms
// Fix: JOIN or batched IN clause
const orders = await db.query(`
SELECT o.*, c.name, c.email
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = $1
`, ['pending']);
// Total: 1 query · 1 round-trip · ~5ms
Every slow query I've seen at scale traces to one of three causes: a missing index, the wrong composite index column order, or N+1 in the ORM layer - check those before adding caching or a new database tier.
Transactions and ACID
A transaction is a unit of work that either completes entirely or not at all. ACID guarantees what that means under failure and concurrency. My recommendation: learn each of the four properties through its implementation mechanism - the WAL enforces atomicity, fsync enforces durability, MVCC enables isolation, constraints enforce consistency - rather than just memorising the acronym.
The canonical bank transfer example - without transactions:
-- Session 1 executes these two statements independently (no transaction):
UPDATE accounts SET balance = balance - 100 WHERE id = 'alice';
-- !! Server crashes here (power failure, OOM, reboot)
UPDATE accounts SET balance = balance + 100 WHERE id = 'bob';
-- This line never executes. alice's $100 has vanished from the system.
With a transaction (ACID):
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 'alice';
-- If server crashes here, WAL records the in-progress transaction.
-- On restart, PostgreSQL sees an uncommitted transaction and rolls it back.
-- alice's balance is restored. No money is lost.
UPDATE accounts SET balance = balance + 100 WHERE id = 'bob';
COMMIT;
-- Only COMMIT makes the change permanent and visible to other sessions.
A - Atomicity
All operations in a transaction succeed, or none of them take effect. Implemented via the Write-Ahead Log (WAL): every change is logged before being applied. On crash recovery, changes belonging to transactions that never reached COMMIT are discarded, so uncommitted work leaves no trace.
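The mechanism is easy to sketch. In this illustrative model (record shapes and names invented; real WALs log physical page changes, not key-value sets), recovery scans the log, keeps only changes from transactions that reached COMMIT, and discards the rest - which is exactly how a half-finished transfer gets undone:

```typescript
// Toy WAL replay: only committed transactions survive a crash. Illustrative only.
type Rec =
  | { kind: "begin"; tx: number }
  | { kind: "set"; tx: number; key: string; value: number }
  | { kind: "commit"; tx: number };

function recover(wal: Rec[]): Map<string, number> {
  // Pass 1: find every transaction that reached COMMIT before the crash.
  const committed = new Set(wal.filter(r => r.kind === "commit").map(r => r.tx));
  // Pass 2: apply only their changes; uncommitted work is effectively rolled back.
  const state = new Map<string, number>();
  for (const r of wal) {
    if (r.kind === "set" && committed.has(r.tx)) state.set(r.key, r.value);
  }
  return state;
}

// Crash after tx 2's debit but before its commit: tx 2 leaves no trace.
const state = recover([
  { kind: "begin", tx: 1 }, { kind: "set", tx: 1, key: "alice", value: 900 }, { kind: "commit", tx: 1 },
  { kind: "begin", tx: 2 }, { kind: "set", tx: 2, key: "alice", value: 800 }, // no commit record
]);
console.log(state.get("alice")); // 900 - the uncommitted debit is discarded
```

The COMMIT record is the atomic switch: until it is durably in the log, the whole transaction is invisible to recovery; once it is, every change in the transaction is replayed.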
C - Consistency
A transaction can only bring the database from one valid state to another. Constraints, foreign keys, and check constraints enforce this: if any constraint is violated mid-transaction, the entire transaction is rolled back. Consistency is enforced by your schema design - the DB provides the mechanism.
I - Isolation
Concurrent transactions do not observe each other's intermediate state. Isolation levels control the trade-off between correctness guarantees and throughput:
| Isolation Level | Dirty Read | Non-Repeatable Read | Phantom Read | Performance |
|---|---|---|---|---|
| Read Uncommitted | Possible | Possible | Possible | Highest |
| Read Committed (PostgreSQL default) | Prevented | Possible | Possible | High |
| Repeatable Read | Prevented | Prevented | Prevented (PostgreSQL MVCC) | Moderate |
| Serializable | Prevented | Prevented | Prevented | Lowest |
- Dirty read: Reading a row modified by an uncommitted transaction. Prevented by any level above Read Uncommitted.
- Non-repeatable read: Re-reading the same row within a transaction and seeing a different value (another transaction committed a change between your two reads).
- Phantom read: Re-running a range query and seeing new rows (another transaction inserted matching rows between your two queries).
sequenceDiagram
participant T1 as Transaction 1
participant DB as Database
participant T2 as Transaction 2
Note over T1,T2: Read Committed isolation (PostgreSQL default)<br/>Non-repeatable read is POSSIBLE here
T1->>DB: BEGIN
T2->>DB: BEGIN
T1->>DB: SELECT balance FROM accounts WHERE id='alice'
DB-->>T1: balance = 1000
T2->>DB: UPDATE accounts SET balance = 500 WHERE id='alice'
T2->>DB: COMMIT - committed before T1 re-reads
T1->>DB: SELECT balance FROM accounts WHERE id='alice'
DB-->>T1: balance = 500 - different value! (non-repeatable read)
Note over T1: T1 sees two different values for<br/>the same row within one transaction.<br/>Use REPEATABLE READ to prevent this.
Interview tip: default to Read Committed, justify Serializable explicitly
PostgreSQL defaults to Read Committed for a reason: it prevents dirty reads (the most dangerous anomaly) while keeping throughput high. Escalate to Serializable only when you can justify it: financial double-spend prevention, inventory allocation, or any scenario where "SELECT + conditional UPDATE" must be atomic. Saying "I'd use Serializable transactions for all checkout operations to prevent overselling" is more precise than "I'd add a distributed lock."
D - Durability
A committed transaction survives server crashes, power failures, and hardware faults. Implemented by fsync: PostgreSQL calls fsync() on the WAL file before acknowledging COMMIT to the client. This forces the OS to flush the WAL to physical disk rather than retaining it in the OS page cache (which would be lost on a crash).
-- PostgreSQL synchronous commit settings (trade durability for throughput)
-- synchronous_commit = on → fsync on every COMMIT (default, fully durable)
-- synchronous_commit = off → async commit, ~10ms durability window (risk: last ~10ms of commits lost on crash)
-- synchronous_commit = remote_write → commit when at least one standby has received the WAL
Get the WAL and MVCC mental models solid - every question about concurrent writes, replica lag, or crash recovery traces back to one of them.
Database Types
| Type | Data Model | Primary Access Pattern | Horizontal Scale | ACID | Examples |
|---|---|---|---|---|---|
| Relational | Tables with rows + fixed schema | Multi-column queries, JOINs, aggregations | Vertical + read replicas (sharding is manual) | Full | PostgreSQL, MySQL, SQLite |
| Document | Semi-structured JSON documents | Fetch document by ID or field; no JOINs | Native horizontal sharding | Per-document | MongoDB, CouchDB, Firestore |
| Key-Value | Opaque value per key | Single-key lookup only (no secondary index natively) | Horizontal sharding by key | Limited | Redis, DynamoDB, etcd |
| Wide-Column | Rows with sparse, dynamic columns | Key + column family lookup; append-optimised | Native horizontal sharding across nodes | Eventual/configurable | Cassandra, HBase, ScyllaDB |
| Time-Series | Timestamped metrics + tags | Range queries over time, downsampling, retention | Horizontal by time range | Limited | InfluxDB, TimescaleDB, Prometheus |
| Graph | Nodes + directed/labelled relationships | Relationship traversal (depth-first, shortest path) | Limited - traversals are O(1) per hop, but partitioning a graph across nodes is hard | Full (Neo4j) | Neo4j, Amazon Neptune, DGraph |
Which to choose and when
Here's the honest answer on database choice: start with Relational, and don't leave until a specific query shape or throughput requirement forces you out.
Relational is the correct default for any new service until proven otherwise. Every data model can be represented in a relational schema. The question is never "can I use Postgres?" but "what query patterns does Postgres make expensive?"
Document is the right upgrade when your data is naturally hierarchical (blog posts with nested comments are one document, not three JOIN tables) and you never query across document boundaries. The moment you need JOIN-equivalent queries, you have made the wrong choice.
Wide-Column (Cassandra) is correct when: write volume exceeds what a primary DB can absorb, data is append-only, and read patterns are known and fixed to partition key + clustering column. Its power is 1M+ writes/second at linear scale; its trap is the total inability to query columns outside the primary/clustering key without a secondary index (which is expensive at scale).
Key-Value is the right choice when the access pattern is exclusively get(key) and set(key, value) โ sessions, leaderboards, feature flags, distributed locks. DynamoDB adds secondary indexes and range queries on the sort key; Redis adds data structures (sorted sets, streams, pub/sub) on top of the basic K-V model.
Time-Series is correct when your schema is (timestamp, metric_name, value, tags[]) and your queries are "average CPU over last 5 minutes, grouped by host". Time-series databases optimise for exactly this: they compress timestamp data aggressively, automatically downsample old data (replace per-minute points with per-hour averages after 30 days), and enforce retention policies that would require manual partitioning in PostgreSQL.
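Downsampling is the distinctive operation here. A minimal sketch of what a time-series engine automates (bucket width and function name are illustrative - real engines do this continuously and per retention policy): replace raw points with per-bucket averages once data ages out of the raw-retention window.

```typescript
// Downsample (timestampMs, value) points into fixed-width bucket averages -
// e.g. per-minute CPU samples into per-hour averages. Illustrative only.
type Point = { ts: number; value: number };

function downsample(points: Point[], bucketMs: number): Point[] {
  const buckets = new Map<number, { sum: number; n: number }>();
  for (const p of points) {
    const bucketStart = Math.floor(p.ts / bucketMs) * bucketMs;
    const agg = buckets.get(bucketStart) ?? { sum: 0, n: 0 };
    agg.sum += p.value;
    agg.n += 1;
    buckets.set(bucketStart, agg);
  }
  // One averaged point per bucket, in time order.
  return [...buckets.entries()]
    .sort(([a], [b]) => a - b)
    .map(([ts, { sum, n }]) => ({ ts, value: sum / n }));
}

// Four per-minute samples collapse into two per-hour averages.
const raw: Point[] = [
  { ts: 0, value: 10 }, { ts: 60_000, value: 20 },            // hour 0
  { ts: 3_600_000, value: 30 }, { ts: 3_660_000, value: 50 }, // hour 1
];
console.log(downsample(raw, 3_600_000)); // hour 0 → avg 15, hour 1 → avg 40
```

In PostgreSQL you would schedule this yourself over manually managed partitions; a time-series database runs the equivalent automatically as data crosses each retention boundary.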
Polyglot persistence โ using multiple databases in one system
Large-scale systems routinely combine database types for different workloads within the same product. Stripe uses PostgreSQL for transactional ledger entries, Redis for rate limiting and distributed locks, and a column store for analytics. This is called polyglot persistence and it is the correct architectural pattern โ each database is used for what it does best, not everything.
The companies doing polyglot persistence earned it by starting simple and migrating to each database only when a specific pain point demanded it. Nobody wins by starting with four databases.
Trade-offs
| Pros | Cons |
|---|---|
| Indexes reduce query time by 100-100,000× for selective lookups vs. full scans | Indexes add write overhead on every INSERT/UPDATE/DELETE - each index is a separate B-Tree that must be maintained |
| ACID transactions prevent data corruption under concurrent writes and crashes | Serializable isolation reduces throughput; every additional isolation guarantee costs concurrency |
| Relational schemas enforce data integrity at the DB layer - impossible to write partial or malformed records | Schema migrations on large tables are expensive operations; ALTER TABLE ADD COLUMN on a 100M-row table can lock the table for minutes |
| Buffer pool caches hot pages automatically - most reads never hit disk | Buffer pool pressure from infrequent scans (OLAP queries) evicts hot pages needed by OLTP queries - mixed workloads conflict |
| Replication provides read scaling and high availability at low latency overhead | Replica lag introduces read-after-write inconsistency - a user writes a row, reads it back immediately from a replica, and it isn't there yet |
| PostgreSQL handles 10K-50K transactional queries/second on modern hardware | Horizontal write scaling requires manual sharding in relational databases - not a knob you turn, a re-architecture project |
| Mature tooling: pgvector, PostGIS, full-text search, JSONB, partitioning - all without leaving the relational model | Auto-vacuuming and dead tuple bloat in MVCC databases require operational discipline; a neglected table grows unboundedly until vacuum reclaims space |
The fundamental tension here is consistency vs. throughput. Each ACID guarantee restricts the concurrency model: serializable isolation is provably correct but reduces the number of concurrent transactions the database can handle.
Every time you relax an isolation level or use eventual consistency, you trade a specific correctness guarantee for a specific throughput gain. The engineering discipline is knowing which anomalies are acceptable for each data type in your system.
When to Use It / When to Avoid It
So when does the database choice actually matter in an interview? Every time - because every system design answer involves persisting state, and the database type determines your query flexibility, consistency guarantees, and scaling ceiling.
Use a relational database (PostgreSQL) as your starting point when:
- Your data has relationships between entities that you'll JOIN - orders with customers, users with roles, payments with accounts.
- You need ACID transactions - financial operations, inventory management, any workflow where partial completion is incorrect.
- Your query patterns are varied and not yet fixed - relational indexes can be added after launch; a key-value schema cannot be retroactively queried.
- You are at < 10K writes/second on a single-tenant service. PostgreSQL handles this comfortably without sharding.
- Your team knows SQL. The cost of operating an unfamiliar database under production pressure is often higher than the cost of vertically scaling a familiar one.
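The ACID bullet above is easy to demonstrate end-to-end. The sketch below uses Python's stdlib sqlite3 as a stand-in for PostgreSQL (the schema and amounts are invented): a transfer whose consistency check fails is rolled back as a unit, leaving no partial state.

```python
import sqlite3

# SQLite ships with Python and is ACID on a single node, so it serves as a
# runnable stand-in for the relational model discussed above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('A', 100), ('B', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` between accounts atomically: both legs or neither."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src))
            # Consistency check: raising inside the block aborts both UPDATEs
            (bal,) = conn.execute(
                "SELECT balance FROM accounts WHERE id = ?", (src,)).fetchone()
            if bal < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst))
            return True
    except ValueError:
        return False

transfer(conn, "A", "B", 60)   # succeeds: A=40, B=60
transfer(conn, "A", "B", 60)   # fails the check: rolled back, balances unchanged
print(conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall())
# [('A', 40), ('B', 60)]: the failed transfer left no partial state
```

This is the atomicity guarantee in miniature: without the transaction, a crash or error between the two UPDATEs would silently destroy money.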
Upgrade to a different database type when (and only when):
- Write throughput exceeds what a primary + N replicas can sustain → Wide-Column (Cassandra) or a queue-backed writes pattern.
- You are storing 100B+ time-stamped metrics with downsampling and retention requirements → TimescaleDB or InfluxDB.
- Your primary query is relationship traversal ("all friends-of-friends within 3 hops") → a graph database; this becomes exponentially expensive in SQL JOINs.
- Your schema is genuinely unknown and changes per customer (SaaS with per-customer custom fields at scale) → a document database, but validate that you truly never need cross-document queries before committing.
Avoid over-engineering by:
- Not choosing a NoSQL database because it "scales better": define the scale you actually need and verify that your current database cannot reach it before switching.
- Not sharding before you've exhausted read replicas, connection pooling, and query optimisation: sharding is a complexity multiplier that makes every future feature harder.
- Not using multiple database types before exhausting what your single database type can do with proper schema design (JSONB in PostgreSQL handles many document use cases with full SQL queryability).
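The JSONB point in the last bullet generalises: a relational engine with JSON functions covers many "document" workloads without a second database. The sketch below shows the idea with SQLite's json_extract as a lightweight stand-in for PostgreSQL's JSONB operators (table and payloads are invented; it assumes a SQLite build with JSON support, which is standard in recent builds).

```python
import sqlite3

# Document-style payloads stored in an ordinary relational table, queried
# with plain SQL via SQLite's built-in JSON functions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO events (payload) VALUES (?)",
    [('{"type": "signup", "plan": "pro"}',),
     ('{"type": "signup", "plan": "free"}',),
     ('{"type": "login"}',)])

# Filter on fields inside the JSON -- the "document DB" query shape,
# answered by the relational engine
rows = conn.execute(
    "SELECT id FROM events "
    "WHERE json_extract(payload, '$.type') = 'signup' "
    "AND json_extract(payload, '$.plan') = 'pro'").fetchall()
print(rows)  # only the first row matches
```

In PostgreSQL the equivalent would use JSONB columns with `->>` operators and, crucially, a GIN index over the JSONB column so these filters stay fast at scale.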
Real-World Examples
The most instructive examples aren't the "we switched to NoSQL" stories; they're the "we stayed on PostgreSQL and here's exactly how" ones. The three below show what database discipline actually looks like at scale.
Stripe: PostgreSQL at financial scale
Stripe runs one of the largest PostgreSQL deployments in the world, processing hundreds of billions of dollars annually. Their key architectural decisions: every service owns its own isolated PostgreSQL instance (not a shared mega-DB), tables are partitioned by created_at and id so old partitions become append-only and can be archived to cold storage, and all financial mutations use serializable isolation to prevent double-processing.
Stripe does not use NoSQL for their core ledger: the correctness guarantees of relational ACID transactions are non-negotiable for payment data. Their scaling technique is not sharding; it is aggressive partitioning, careful index design, and splitting read-heavy operations onto read replicas.
Discord: 1 trillion messages and the Cassandra → ScyllaDB migration
Discord stored their message history in Apache Cassandra. At 1 trillion messages, read latency became unpredictable: Cassandra's garbage collection pauses caused p99 read latency to spike from 5ms to 500ms under load. In 2023, Discord migrated to ScyllaDB (a Cassandra-compatible engine rewritten in C++, eliminating JVM garbage-collection pauses).
The migration cut 99th-percentile read latency from ~500ms to ~15ms on the same data and access patterns. The lesson: the data model (wide-column, partitioned by channel_id + bucket_id) was correct. The JVM garbage collector was the bottleneck, not the storage model.
Notion: PostgreSQL at document scale
Notion stores all workspace content in PostgreSQL. At 100M+ blocks across millions of workspaces, they use a physically partitioned schema: each customer workspace's data lives in separate PostgreSQL table partitions.
This keeps operational data access patterns local (no cross-workspace queries) and enables per-workspace backup, restore, and isolation. Notion deliberately chose PostgreSQL over a document database (despite storing JSON-like block structures) because of ACID transaction guarantees when a user edits a page simultaneously from multiple devices: the relational transaction model handles concurrent edits correctly, while eventually consistent document databases would push conflict resolution into the application layer.
The pattern across all three: the bottleneck was operational or implementation-level, not a fundamental flaw in the core data model choice.
How This Shows Up in Interviews
Every system design question involves a database, so name yours in the first three minutes with a concrete rationale. The mistake I see most often is candidates who spend ten minutes describing features and never commit to a database choice. State it, justify it with a query pattern and a throughput number, and move on.
When to bring it up proactively
Within the first three minutes, name your primary database with a rationale: "I'd use PostgreSQL for the user and payment data: ACID transactions are non-negotiable for financial state, and Postgres handles 20K writes/second on a single primary, which satisfies our throughput estimate." Then add read replicas for read scaling. Only introduce additional databases when you can show the specific query shape that your primary database cannot serve efficiently.
Don't say 'we can always switch databases later'
Migrating a production database is one of the most expensive engineering operations that exists. Schema and access pattern decisions made at system design time are extremely sticky. An interviewer will push back hard if you suggest adding a Cassandra cluster without specifying the write throughput that justifies it, or choosing MongoDB without demonstrating a schema that genuinely cannot be served by a relational model.
Here's what separates a strong answer from a weak one at the senior level:
Depth expected at senior/staff level:
- Know your numbers: a single PostgreSQL primary handles 10K–50K OLTP queries/second; adding a connection pooler (PgBouncer) handles C10K-scale connection counts; each read replica adds ~50K read queries/second.
- Explain index selection by workload: "I'd add a composite index on `(user_id, created_at DESC)` for the feed query: `user_id` is the equality predicate (high cardinality), `created_at DESC` is the range that drives the sort, and I'd consider INCLUDE columns to make it a covering index."
- Articulate the ACID trade-off concretely: "Read Committed prevents dirty reads at the cost of non-repeatable reads. For payment operations, I'd use Serializable to prevent the phantom read anomaly that allows double-spend. For read-heavy reporting queries, Read Committed is fine."
- Address sharding as a last resort with specific triggers: "I'd shard only after read replicas + PgBouncer + index optimisation are exhausted. Sharding by `user_id` is the natural partition key here: it keeps a user's data co-located and avoids cross-shard JOINs on the most common query."
- Know operational concerns: vacuum in PostgreSQL, connection pool sizing (`max_connections = RAM_for_connections / 10MB_per_connection`), WAL management, and replica lag monitoring.
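The composite-index guidance above can be checked directly against a query planner. This sketch uses SQLite's EXPLAIN QUERY PLAN as a lightweight stand-in for PostgreSQL's EXPLAIN (table, index, and query are invented); the point it demonstrates is that equality-first column order lets one index serve both the filter and the sort, with no separate sort step.

```python
import sqlite3

# Equality column (user_id) first, sort/range column (created_at DESC) second,
# exactly as in the feed-query bullet above.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE posts (
    id INTEGER PRIMARY KEY, user_id INTEGER, created_at TEXT, body TEXT)""")
conn.execute("CREATE INDEX idx_feed ON posts (user_id, created_at DESC)")

plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT id FROM posts
    WHERE user_id = ? ORDER BY created_at DESC LIMIT 50""", (42,)).fetchall()
detail = " ".join(row[-1] for row in plan)
print(detail)
# The plan searches via idx_feed and needs no "USE TEMP B-TREE FOR ORDER BY":
# the index already returns rows in the requested order.
```

Reversing the column order to `(created_at DESC, user_id)` would force the planner to scan a range and sort, which is the "composite index in the wrong column order" failure mode from the TL;DR.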
Common follow-up questions and strong answers:
| Interviewer asks | Strong answer |
|---|---|
| "How do you handle 300K writes/second?" | "That's beyond a single PostgreSQL primary (~50K writes/second). I'd evaluate: (1) can writes be made async via a queue? (2) can the write table be sharded by user_id? (3) is the write pattern time-series (→ Cassandra/LSM-Tree)? Before choosing, I need to know whether writes are idempotent and whether ORDER BY queries across all writers are required." |
| "What happens to reads when you add a read replica?" | "Reads on the replica see replica lag, typically 10–200ms behind the primary. For read-after-write (a user posts a comment and immediately sees it), route the read back to the primary for that session. For analytics and reporting queries, replica lag is acceptable. Alert if replica lag exceeds your SLA threshold; aggressive vacuum or large transactions on the primary are common causes." |
| "How does PostgreSQL handle concurrent updates to the same row?" | "MVCC: readers never block writers. Writers on the same row queue behind row-level locks (SELECT FOR UPDATE). Each transaction reads from a snapshot; under Repeatable Read or Serializable, when two transactions' writes conflict, the later one is aborted with a serialization_failure and must retry, which is why application-level retry logic is required." |
| "When would you choose MongoDB over PostgreSQL?" | "When the data is genuinely document-shaped, schema evolution is so rapid that migrations are a bottleneck, documents are the primary access unit (no cross-document JOINs), and horizontal scaling is needed beyond what a tuned PostgreSQL can deliver. The risk: MongoDB's weaker cross-document guarantees make multi-document operations a coordination problem the application must manage. PostgreSQL with JSONB handles most 'document-like' schemas with full SQL queryability; I'd start there." |
| "How do you prevent an expensive OLAP query from killing your OLTP latency?" | "Query isolation: route OLAP queries to a dedicated read replica or a Redshift/BigQuery analytical store. On the OLAP replica, configure statement_timeout to kill runaway queries before they consume all connections. Physically separate OLAP and OLTP workloads at the hardware level; they have opposite profiles (OLAP: sequential I/O, many rows, low concurrency; OLTP: random I/O, few rows, high concurrency)." |
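The MVCC row above notes that applications must retry on serialization_failure. A generic retry wrapper might look like the following sketch; the exception class and the flaky transaction are stand-ins for a real driver's error type, not an actual PostgreSQL client API.

```python
import random
import time

class SerializationFailure(Exception):
    """Stand-in for the driver error raised when a serializable transaction
    aborts on a write conflict (SQLSTATE 40001 in PostgreSQL)."""

def run_with_retry(txn, max_attempts=5, base_delay_s=0.01, sleep=time.sleep):
    """Re-run an aborted serializable transaction with jittered backoff.

    `txn` performs one attempt of the whole transaction and must be safe to
    re-run from the start: the database rolled back the aborted attempt, so
    no partial state survives between attempts.
    """
    for attempt in range(max_attempts):
        try:
            return txn()
        except SerializationFailure:
            if attempt == max_attempts - 1:
                raise
            # Jitter so colliding transactions don't retry in lockstep
            # and immediately conflict again
            sleep(base_delay_s * (2 ** attempt) * random.random())

# Usage sketch: a transaction that conflicts twice, then commits
attempts = {"n": 0}
def flaky_transfer():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise SerializationFailure()
    return "committed"

print(run_with_retry(flaky_transfer, sleep=lambda s: None))  # "committed"
```

The key design constraint is idempotency of the callable: everything the transaction does, including any values it read, must be recomputed inside each attempt.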
The candidates who stand out can trace every trade-off to a specific mechanism: not "databases are slow" but "row-level locks serialize writes, which limits throughput at Serializable isolation."
Test Your Understanding
Quick Recap
- A database solves three problems simultaneously: it indexes data for fast queries, coordinates concurrent writes with transactions to prevent data corruption, and uses the Write-Ahead Log to survive crashes without data loss.
- The storage engine (B-Tree vs LSM-Tree) determines your read/write performance profile before any query runs: B-Tree for predictable O(log N) OLTP reads and writes, LSM-Tree for write-heavy workloads that need sequential I/O. Choose based on your write-to-read ratio, not the database brand.
- Indexes are the highest-leverage performance tool: without an index, a query on 100M rows takes 30+ seconds; with the right composite index, the same query runs in under 1ms. The most important answer you can give to any "how do you scale reads?" question is "I'd check the query plan before adding infrastructure."
- ACID's I (Isolation) is the trickiest property: PostgreSQL's default Read Committed prevents dirty reads but not non-repeatable reads. Use `SELECT FOR UPDATE` or an optimistic conditional UPDATE to make check-and-modify operations atomic, and use Serializable for financial operations where phantom reads allow double-spend.
- Connection pools (PgBouncer) are non-optional at scale: PostgreSQL allocates 5–10MB per connection, and 10,000 direct connections consume 70GB before executing a single query; PgBouncer multiplexes 100,000 app-layer connections over 300–500 real DB connections.
- The Saga pattern replaces distributed transactions for cross-service workflows: each service commits locally, publishes a domain event, and compensating transactions roll back on failure. This avoids the blocking that 2PC suffers when its coordinator fails, at the cost of brief cross-service inconsistency windows.
- In every interview, state your database choice with a query analysis: "PostgreSQL handles our 5K writes/second and JOIN-heavy query pattern. I'd add a composite index on `(user_id, created_at)` for the feed query, use read replicas for analytics, and only shard if write throughput exceeds what a single primary can sustain."
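The recap's "optimistic conditional UPDATE" can be shown concretely. The sketch below (invented inventory schema, stdlib SQLite standing in for PostgreSQL) folds the check into the UPDATE's WHERE clause and inspects the affected row count, so check-and-modify is atomic without explicit locks.

```python
import sqlite3

# A version column implements optimistic concurrency: the UPDATE only
# applies if the row is unchanged since we read it AND stock remains.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (sku TEXT PRIMARY KEY, stock INTEGER, version INTEGER)")
conn.execute("INSERT INTO inventory VALUES ('SKU-1', 1, 0)")
conn.commit()

def reserve(conn, sku, expected_version):
    """Decrement stock only if nobody changed the row since we read it."""
    cur = conn.execute(
        "UPDATE inventory SET stock = stock - 1, version = version + 1 "
        "WHERE sku = ? AND version = ? AND stock > 0",
        (sku, expected_version))
    conn.commit()
    # 0 affected rows means a concurrent writer won (or stock ran out):
    # the caller re-reads the row and retries or gives up
    return cur.rowcount == 1

print(reserve(conn, "SKU-1", expected_version=0))  # True: we got the last unit
print(reserve(conn, "SKU-1", expected_version=0))  # False: stale version, no stock
```

Compared with a separate SELECT-then-UPDATE, there is no window in which two requests can both observe `stock = 1` and both decrement; the database evaluates the condition and the write as one atomic statement.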
Related Concepts
- Caching: caches exist to protect databases from read fan-out; understanding which queries are worth caching (high hit rate, acceptable staleness) requires understanding database query cost and connection pool limits. Cache hit rate and DB connection utilisation are the two numbers that determine whether your read tier is healthy.
- Replication: database replication is covered in depth in the Replication article: primary-replica lag, synchronous vs. asynchronous WAL streaming, and the trade-off between read scale and durability guarantees on commit.
- Sharding: when vertical scaling and read replicas are exhausted, sharding partitions data across multiple database primaries. The shard key choice determines whether queries become cross-shard fan-outs (expensive) or stay local (fast). Never shard before exhausting what a single primary can do.
- SQL vs. NoSQL: the trade-offs article frames the relational vs. document/wide-column decision more completely, covering schema evolution speed, horizontal scaling flexibility, and the specific workload patterns where each model wins.
- Scalability: databases are typically the first bottleneck on the scaling path. The scalability article covers the full vertical → read replicas → sharding → CDN progression with concrete traffic thresholds at each step.