📐HowToHLD

Sharding

Learn how database sharding partitions rows across multiple nodes to break the single-machine write ceiling, which strategy to pick for your workload, and how to avoid the hotspot traps that kill real-world shard deployments.

42 min read · 2026-03-23 · hard · sharding · databases · scalability · distributed-systems · hld

TL;DR

  • A shard is a horizontal slice of your database: same schema, different row subset. Sharding lets you scale writes linearly by adding machines — the one thing replication, caching, and vertical scaling cannot do.
  • The shard key is the most consequential schema decision in a sharded system. It determines write distribution, whether your hottest queries are cheap or catastrophically expensive, and is practically impossible to change once data is at scale.
  • Hash sharding distributes data evenly but breaks range queries. Range sharding preserves query locality but risks hot shards. Consistent hashing solves resharding by only redistributing 1/N keys when topology changes.
  • The hardest problems in sharding are operational, not conceptual: cross-shard joins require scatter-gather, schema migrations must propagate across all shards in parallel, and transactions spanning two shards require distributed 2PC or redesign.
  • Reach for sharding only after exhausting all simpler options: caching (reduces read load), read replicas (spreads read throughput), and vertical scaling (more RAM, faster disk). Sharding is irreversible architectural complexity.

The Problem It Solves

It's 2023. Your social platform just hit 100 million users. Your founding engineers made a sensible call: a single PostgreSQL database on the best server money can buy.

That machine now runs at 98% CPU, disk I/O is at the ceiling, and your on-call channel has a standing item: "DB brownout 7–9 PM UTC." You've added read replicas — four of them. SELECT performance is fine.

The problem is writes.

Every INSERT, UPDATE, and DELETE must go to the primary. At 100K writes per second, that's 100,000 disk operations per second, 100,000 write-ahead log entries, 100,000 row-level locks — all on one machine.

PostgreSQL sustains 20–50K writes/second on premium hardware. You are 2–5× over that ceiling.

You've tried vertical scaling: you're on the largest RDS instance available (db.r6g.16xlarge, 64 vCPUs, 512 GB RAM). AWS will not sell you a bigger one. You've tried connection pooling: PgBouncer reduces connection overhead but doesn't increase write throughput.

You've tried batching writes: some things can be batched, but not user sessions, not payment events, not real-time presence updates. I've watched teams spend months cycling through every one of these optimizations — each one buys time, none of them buys a way out.

The ceiling is fundamental. A single machine has a finite disk, finite memory bus, finite WAL throughput. No tuning, no money, no cloud provider buys past that.

You need to spread the write load across multiple machines. That requires splitting your data.

Three app servers each sending SQL queries to a single overloaded PostgreSQL node. The database CPU is at 98%, disk I/O is saturated, and query times have degraded from 5ms to 4,000ms.
Adding more app servers only worsens the write bottleneck — more concurrent connections hammer the same disk. The limit is single-node write throughput, not the app tier.

Replication solves the read problem, not the write problem

A read replica duplicates all writes from the primary and distributes read traffic. But every write still hits the primary — the primary is the bottleneck. Five read replicas do not reduce write load by a single byte. If your database is write-saturated, replication is not the solution. Sharding is.


What Is It?

Database sharding (horizontal partitioning) is splitting a single logical table across multiple physical database instances, each called a shard. Every shard carries the same schema but owns a mutually exclusive, collectively exhaustive subset of the rows.

Analogy: Imagine a library that has outgrown its single building. The head librarian splits the collection across five branches, organized alphabetically by author last name: Branch 1 handles A–D, Branch 2 handles E–H, and so on. Each branch has the same shelving system (schema) but different books (rows).

A reader looking for Dostoevsky goes directly to Branch 1. The library now serves 5× more simultaneous visitors — but finding all books on a theme that spans multiple letters requires visiting multiple branches. That trade-off is exactly the one sharding makes.

When you propose sharding in an interview, always follow it with the shard key you'd choose — naming the mechanism without naming the key is half an answer.

Users flow through three app servers into a shard router that hashes user_id to one of four shards. Each shard stores a quarter of the user base and handles 25K writes per second rather than the 100K that overwhelmed the single primary.
The shard router translates the partition key into the correct target shard. Each shard owns 25% of the data and handles only 25% of the write load — the single-node ceiling no longer applies.

With 4 shards, each serving 25M of the 100M users:

  • Write throughput per shard: 25K writes/s — well within PostgreSQL's sustained limit.
  • Data per shard: 125 GB — fits with comfortable headroom.
  • Failure blast radius: a single shard failure affects only 25% of users, not 100%.

The write ceiling that brought the system to its knees now belongs to four independent machines — and each one can grow.
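The per-shard arithmetic above generalizes into a quick capacity check. A minimal sketch (the function and field names are illustrative, not from any library):

```typescript
// Quick capacity check for a proposed shard count (illustrative numbers).
interface ShardPlan {
  writesPerShard: number;  // writes/s each shard must sustain
  usersPerShard: number;
  dataPerShardGb: number;
}

function planShards(
  totalWritesPerSec: number,
  totalUsers: number,
  totalDataGb: number,
  numShards: number
): ShardPlan {
  return {
    writesPerShard: totalWritesPerSec / numShards,
    usersPerShard: totalUsers / numShards,
    dataPerShardGb: totalDataGb / numShards,
  };
}

// The article's scenario: 100K writes/s, 100M users, ~500 GB total, 4 shards
const plan = planShards(100_000, 100_000_000, 500, 4);
console.log(plan);
// → { writesPerShard: 25000, usersPerShard: 25000000, dataPerShardGb: 125 }
```

Running the same check with 8 shards instead of 4 shows each doubling of shard count halves the per-shard load, which is what "linear write scaling" means in practice.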


How It Works

A request arrives to update user 47,201,389's profile. Here's what happens across the shard stack:

  1. Application code issues the query with the shard key: UPDATE users SET bio = ? WHERE user_id = 47201389.
  2. Shard router computes the target shard: shard_index = hash(user_id) % 4 = 1. (For a uniformly distributed numeric key, the simplest hash is the identity: 47201389 % 4 = 1, which is what the router below uses.)
  3. Request is routed to Shard 1: the app opens a connection from its Shard 1 pool — not the pools for Shards 0, 2, or 3.
  4. Shard 1 executes the write: only Shard 1's disk is touched. Shards 0, 2, 3 are completely uninvolved. That isolation is where the throughput gain comes from.
  5. Reading back the row follows the same path: hash the user_id, land on Shard 1, read from Shard 1 (or its replicas).
// ShardRouter sketch — DatabasePool and User are the application's own
// connection-pool and model types, not shown here.
class ShardRouter {
  private readonly numShards: number;
  private readonly shardPools: DatabasePool[];

  constructor(numShards: number, connectionStrings: string[]) {
    this.numShards = numShards;
    this.shardPools = connectionStrings.map(cs => new DatabasePool(cs));
  }

  // Deterministic: user_id=47201389 always → Shard 1
  getShardForUser(userId: number): DatabasePool {
    const shardIndex = userId % this.numShards;
    return this.shardPools[shardIndex];
  }

  async updateUser(userId: number, fields: Partial<User>): Promise<void> {
    const shard = this.getShardForUser(userId);
    await shard.query(
      'UPDATE users SET bio = $1, updated_at = $2 WHERE user_id = $3',
      [fields.bio, new Date(), userId]
    );
    // Only one shard's disk was written. The other 3 had zero involvement.
  }

  async getUserById(userId: number): Promise<User | null> {
    const shard = this.getShardForUser(userId);
    return shard.queryOne<User>(
      'SELECT id, name, bio, created_at FROM users WHERE user_id = $1',
      [userId]
    );
  }
}

When introducing sharding in a design, trace a specific key through the routing path: "User 47,201,389 hashes to shard index 1. Only that shard's connection pool is touched — the other 3 shards have zero contention." That concrete trace demonstrates you understand the mechanism, not just the concept.

The mistake I see most often is candidates who explain sharding in the abstract — hash function, multiple nodes, fan out the writes — but never walk through a specific key hitting a specific shard. Abstract explanations sound rehearsed. Tracing a key sounds like you've built this before.


Key Components

| Component | Role |
| --- | --- |
| Shard key (partition key) | The column(s) used to determine which shard a row belongs to. The most consequential schema decision in the system. Changing it at scale requires migrating every row. |
| Shard | A standalone database instance (PostgreSQL, MySQL, or a Cassandra node) holding a horizontal row subset. Owns its own connection pool, disk, WAL, and replica set. |
| Shard router | Application code or middleware (e.g., Vitess, Citus, ProxySQL) that maps a shard key to a shard connection. Can live in the application layer or as a dedicated proxy. |
| Shard directory | A lookup table (often in Redis) mapping entity_id → shard_id. Used in directory-based sharding. Requires high availability — if it's down, no writes land. |
| Consistent hash ring | A circular hash space where both nodes and keys are positioned. A key routes to the first node clockwise. Adding/removing a node redistributes only 1/N of keys. |
| Virtual node (vnode) | Each physical shard owns multiple positions on the hash ring. Improves load balance when shards are added or removed — without vnodes, a departed node's range falls disproportionately on its one clockwise neighbor. |
| Connection pool | Each shard requires its own pool. Total connections = app_servers × pool_size × num_shards. At 50 app servers × 20 connections × 8 shards = 8,000 DB connections — must be sized carefully. |
| Cross-shard query | A query whose WHERE clause doesn't include the shard key. Requires scatter-gather: fan out to N shards in parallel, collect results, merge in the application layer. Expensive — minimize by design. |

Sharding Strategies

The most important design decision after "should we shard?" is "how do we pick which shard a row goes to?"

Hash Sharding

shard_index = murmur3(shard_key) % num_shards

A hash function distributes rows across shards. For uniformly distributed keys (user IDs, UUIDs), this guarantees roughly equal row counts and write throughput per shard.

Pros: Uniform distribution, no hot shards from data skew, simple to reason about.

Cons: Range queries require scatter-gather — WHERE user_id BETWEEN 1M AND 2M must fan out to every shard because adjacent IDs hash to unrelated shards. Adding or removing shards remaps nearly all keys (see Consistent Hashing below for the fix).

Range Sharding

Each shard owns a contiguous range of key values: Shard 1 handles user_id 0–24M, Shard 2 handles 25M–49M, etc. The range approach looks attractive because range queries become cheap — but I've seen this choice cause more production incidents than any other, usually because the team picked a monotonically increasing key without realizing it.

Pros: Range queries are cheap — WHERE user_id BETWEEN 1M AND 2M routes to at most one shard. Access locality is preserved.

Cons: Hot shards if data is non-uniformly distributed. If most writes are for recently created users (with the highest IDs), the "latest" shard bears all insertions. This write hot spot can saturate one shard while others sit mostly idle.
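A minimal range router illustrating this strategy, assuming fixed, pre-chosen boundaries (all names are illustrative):

```typescript
// Minimal range-sharding router: each shard owns a contiguous user_id range.
class RangeShardRouter {
  // Upper bounds (exclusive) per shard, sorted ascending; shard i owns
  // [upperBounds[i-1], upperBounds[i]).
  constructor(private readonly upperBounds: number[]) {}

  getShardIndex(userId: number): number {
    // First shard whose (exclusive) upper bound is greater than the key.
    for (let i = 0; i < this.upperBounds.length; i++) {
      if (userId < this.upperBounds[i]) return i;
    }
    throw new Error(`user_id ${userId} outside all shard ranges`);
  }
}

// Shard 0: 0–24,999,999 · Shard 1: 25M–49,999,999 · and so on
const router = new RangeShardRouter([25_000_000, 50_000_000, 75_000_000, 100_000_000]);
console.log(router.getShardIndex(47_201_389)); // → 1
// Nearby IDs land on the same shard, so BETWEEN queries stay single-shard.
// But if new users always get the highest IDs, the last shard absorbs every insert.
```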

Directory-Based Sharding

A lookup service maintains a mapping table: {entity_id → shard_id}. The application consults this directory on every operation.

Pros: Maximum flexibility — individual rows can be moved between shards for manual rebalancing without any algorithmic change.

Cons: The directory itself is a SPOF and adds a latency hop on every write. Must be cached in Redis to keep it off the hot path. An outage stops all writes.
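A sketch of directory-based routing, with an in-process Map standing in for the Redis-cached directory (all names are assumptions):

```typescript
// Directory-based routing sketch. In production the entity_id → shard_id map
// lives in a highly available store cached in Redis; a Map stands in here.
class DirectoryShardRouter {
  private readonly directory = new Map<number, number>(); // entity_id → shard_id

  constructor(private readonly defaultShards: number) {}

  getShardId(entityId: number): number {
    // Explicit assignment wins; otherwise fall back to a default placement.
    return this.directory.get(entityId) ?? entityId % this.defaultShards;
  }

  // Manual rebalancing: move one entity without remapping any other key.
  moveEntity(entityId: number, toShard: number): void {
    this.directory.set(entityId, toShard);
  }
}

const router = new DirectoryShardRouter(4);
router.moveEntity(42, 3);           // migrate a hot entity off its default shard
console.log(router.getShardId(42)); // → 3
console.log(router.getShardId(41)); // → 1 (default placement: 41 % 4)
```

The per-entity `moveEntity` call is exactly the flexibility the hash and range strategies lack, and exactly why the directory becomes a single point of failure.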

Geographic Sharding

Partition by region: EU users on EU region shards, US users on US shards.

Pros: Low-latency access (data lives near the user), regulatory compliance (GDPR data stays in EU).

Cons: Unbalanced shards if regions have unequal sizes (US often 5–10× larger than others). Cross-region queries require special handling.

| Strategy | Even distribution? | Range queries | Resharding cost | Best for |
| --- | --- | --- | --- | --- |
| Hash | ✅ Yes | ❌ Scatter-gather | High (≈ most keys move) | Write-heavy workloads without range query requirements |
| Range | ⚠️ Depends on key distribution | ✅ Cheap | Medium | Range queries are common; key is uniformly distributed |
| Directory | ✅ Manual control | ✅ Flexible | Low (update directory only) | Requires frequent rebalancing or irregular partitioning |
| Geographic | ⚠️ Region-dependent | ✅ Within region | Low per region | Multi-region data locality or compliance requirements |

When in doubt, start with hash sharding — even distribution is easier to debug than a hotspot you didn't see coming.


Consistent Hashing

Naive hash(key) % n has a catastrophic property: when n changes by 1, nearly all keys change shards. Adding one shard to a 10-shard cluster remaps ~90% of all keys — a massive live migration that takes hours and risks data loss.
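The ~90% figure is easy to verify with a quick simulation of naive modulo placement (a sketch, not production code):

```typescript
// Fraction of keys that change shards when a cluster grows from n to n+1
// under naive modulo placement. (Identity stands in for the hash function;
// the remap fraction is the same for uniformly distributed hashed keys.)
function modRemapFraction(numKeys: number, n: number): number {
  let moved = 0;
  for (let key = 0; key < numKeys; key++) {
    if (key % n !== key % (n + 1)) moved++;
  }
  return moved / numKeys;
}

console.log(modRemapFraction(1_000_000, 10)); // → 0.90909 — ~91% of keys move
```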

Consistent hashing fixes this. Both shards and keys are placed on a circular ring representing the full hash space (0 to 2³² − 1). A key routes to the first shard clockwise from its position.

When a shard is added, only the keys that were previously handled by its clockwise neighbor are affected. That's 1/N of total keys — everything else is untouched.

A circular hash ring with four database nodes (A, B, C, D) placed at different angular positions. Three data keys appear on the ring as orange dots, with dashed arrows routing clockwise to the first node encountered.
Adding node E between A and B moves only A's former key range to E — the rest of the dataset is untouched. This 1/N redistribution property makes consistent hashing the only production-viable approach to resharding without downtime.
class ConsistentHashRing {
  private ring: Map<number, string> = new Map(); // hash position → shard ID
  private sortedPositions: number[] = [];
  private readonly vnodes = 150; // virtual nodes per physical shard

  addShard(shardId: string): void {
    for (let v = 0; v < this.vnodes; v++) {
      const position = murmurhash3(`${shardId}:vnode:${v}`);
      this.ring.set(position, shardId);
    }
    this.sortedPositions = [...this.ring.keys()].sort((a, b) => a - b);
  }

  removeShard(shardId: string): void {
    for (let v = 0; v < this.vnodes; v++) {
      const position = murmurhash3(`${shardId}:vnode:${v}`);
      this.ring.delete(position);
    }
    this.sortedPositions = [...this.ring.keys()].sort((a, b) => a - b);
  }

  getShardForKey(key: string): string {
    if (this.sortedPositions.length === 0) throw new Error('No shards registered');
    const keyPosition = murmurhash3(key);

    // Binary search: first ring position >= keyPosition (first clockwise node).
    // findIndex would also work, but it scans linearly; a lower-bound binary
    // search keeps lookups O(log n) even with thousands of vnode positions.
    let lo = 0;
    let hi = this.sortedPositions.length - 1;
    let idx = -1;
    while (lo <= hi) {
      const mid = (lo + hi) >> 1;
      if (this.sortedPositions[mid] >= keyPosition) {
        idx = mid;
        hi = mid - 1;
      } else {
        lo = mid + 1;
      }
    }
    const targetPosition =
      idx === -1
        ? this.sortedPositions[0] // wrap around: key is past the last node → goes to first
        : this.sortedPositions[idx];

    return this.ring.get(targetPosition)!;
  }
}

Why 150 virtual nodes per shard? Without vnodes, removing a shard causes its entire range to fall on a single neighbor, which then temporarily handles double the traffic. With 150 vnodes, the removed shard's key responsibility is spread proportionally across all remaining shards.

Cassandra uses 256 vnodes by default; DynamoDB uses a similar ring-based approach internally. My recommendation: default to 150–256 vnodes unless you have a specific reason to go lower — the redistribution math pays off exactly when you need it most.


The Hotspot Problem

Shard keys that look uniform can still create severe hotspots due to workload asymmetry.

Consider sharding a social platform's follows table by followed_user_id:

sequenceDiagram
    participant App as ⚙️ App Server
    participant Router as 🔀 Shard Router
    participant S1 as 🗄️ Shard 1 (overloaded)
    participant SN as 🗄️ Shards 2–N

    App->>Router: GET followers of celebrity (user_id=1)
    Router->>S1: SELECT * FROM follows WHERE followed_id = 1
    Note over S1: 50M rows · All follower<br/>reads land here · CPU maxed
    S1-->>App: 50M follower records

    Note over SN: Shards 2-N sit mostly idle

User id=1 (a celebrity with 50M followers) has all 50 million rows with follows.followed_user_id = 1 sitting on Shard 1. That shard is hot; the others are cold.

The mistake I see most often in interviews is proposing a shard key with decent cardinality and calling it done — without asking whether the top 0.1% of users by activity will dominate one shard. Always stress-test your shard key against your power users before committing to it.

| Hotspot type | Root cause | Fix |
| --- | --- | --- |
| Hot key | One entity dominates I/O (celebrity, trending post) | Sub-shard: replicate that entity's data across multiple shards; use a secondary hash on the read path |
| Write hot spot | Monotonically increasing key (timestamp IDs, auto-increment) on a range shard | Switch to hash sharding or append a random suffix to the shard key |
| Data skew | Low-cardinality shard key (status enum, country code) | Choose a higher-cardinality key; combine with a secondary dimension |
| Traffic skew | New feature spikes reads on a specific shard | Add read replicas for that shard; promote a temporary emergency replica |

The celebrity hotspot is the most commonly missed failure in interviews

Proposing user_id sharding for a social platform without calling out the hotspot risk for power users signals surface-level understanding. Always follow a shard key proposal with: "The risk is users with disproportionate activity counts — I'd monitor P99 write QPS per shard and route hot-path reads for those users to a dedicated cache tier."


Cross-Shard Queries

The price of sharding writes is expensive cross-shard reads. When a query's WHERE clause doesn't include the shard key, it must scatter-gather across every shard. I often see candidates add sharding to a design and then propose queries that bypass the shard key entirely — that's paying the full operational cost of sharding for zero routing benefit.

// CHEAP: query includes shard key — single shard touched
async function getUserPosts(userId: number): Promise<Post[]> {
  const shard = router.getShardForUser(userId);
  return shard.query(
    'SELECT * FROM posts WHERE user_id = $1 ORDER BY created_at DESC',
    [userId]
  );
  // Exactly one shard queried. O(log N) B-tree lookup within that shard.
}

// EXPENSIVE: no shard key — must query ALL shards and merge
async function getPostsFromLastHour(): Promise<Post[]> {
  const cutoff = new Date(Date.now() - 3_600_000);

  // Fan out to all shards in parallel — each does a full or partial scan
  const shardResults = await Promise.all(
    allShards.map(shard =>
      shard.query('SELECT * FROM posts WHERE created_at > $1', [cutoff])
    )
  );

  // Merge-sort in the application layer
  return shardResults
    .flat()
    .sort((a, b) => b.created_at.getTime() - a.created_at.getTime());
  // Cost: N full shard scans + O(R log R) merge, where R = total result rows
}

Three ways to avoid the scatter-gather tax:

  1. Co-locate related data via shard key: shard the posts table by user_id so "get user's posts" is always single-shard.
  2. Precomputed aggregates: maintain a global_stats table on a dedicated coordinator shard, updated via Change Data Capture. Avoids scatter-gathering to compute counts.
  3. Secondary index shards: maintain a separate shard keyed by a secondary dimension. Instagram sharded by user_id but maintained a secondary index shard keyed by created_at for time-series feed generation.
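Option 2 might look like the following sketch: a single consumer applies each shard's change events to an in-memory counter, so the aggregate read never fans out (the event shape and names are assumed, not any specific CDC tool's API):

```typescript
// Sketch of a CDC consumer maintaining a cross-shard aggregate so reads
// never scatter-gather. Event shape is assumed, not a specific tool's API.
interface ChangeEvent {
  table: string;
  op: 'INSERT' | 'UPDATE' | 'DELETE';
}

class GlobalStats {
  private postCount = 0;

  // Every shard's change stream feeds its events into this one consumer.
  apply(event: ChangeEvent): void {
    if (event.table !== 'posts') return;
    if (event.op === 'INSERT') this.postCount++;
    if (event.op === 'DELETE') this.postCount--;
  }

  totalPosts(): number {
    return this.postCount; // O(1) read instead of N shard scans
  }
}

const stats = new GlobalStats();
stats.apply({ table: 'posts', op: 'INSERT' });
stats.apply({ table: 'posts', op: 'INSERT' });
stats.apply({ table: 'posts', op: 'DELETE' });
console.log(stats.totalPosts()); // → 1
```

In a real deployment the counter would live in its own durable store and the stream would be replayable, but the shape is the same: pay a small, asynchronous write cost to make the cross-shard read free.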

Design your shard key around the query that runs most often, not the query you find most interesting.


Trade-offs

| Pros | Cons |
| --- | --- |
| Linear write scaling — add a shard, add proportional write capacity | Cross-shard queries require scatter-gather across N shards and application-layer merge |
| Failure isolation — one shard crash affects only its subset of users | Shard key choice is practically irreversible at scale — changing it means migrating every row |
| Smaller per-shard indexes → lower B-tree depth → faster lookups | Transactions spanning two shard keys require distributed 2PC or Saga redesign |
| Independent backup and restore per shard | Total connection count multiplies: app_servers × pool_size × num_shards |
| Schema migrations can roll out per shard in controlled batches | No SQL JOIN across shards — all relationship traversal must happen in application code |
| Supports geographic partitioning for data residency compliance | Operational complexity multiplies: monitoring, alerting, and debugging N independent databases |

The fundamental tension here is write scalability vs. query flexibility. Sharding breaks the implicit promise of a relational database — that any query over any combination of columns is a first-class citizen. Once you shard, your most natural analytical queries become your most expensive operations.

Once you commit to a shard key, you've picked a side — and changing your mind at scale costs months.


When to Use It / When to Avoid It

So when does sharding actually make sense — and when are you just adding complexity you'll regret? Here's how I think about it.

Use sharding when:

  • Write throughput has exhausted the single primary's capacity and all vertical scaling options are depleted.
  • The dataset is too large for a single machine's disk even after pruning and archiving cold data.
  • You need failure isolation — a regulatory or SLO requirement that a subset of users can be served even during a DB incident affecting others.
  • Your workload has a natural, high-cardinality partition dimension that aligns with your hottest query pattern.

Avoid sharding (or be very careful) when:

  • Read load is the problem — that's a caching and read-replica problem. Sharding does not reduce read load.
  • Your queries regularly cross partition boundaries — scatter-gather can be slower than a well-indexed single-node query.
  • You haven't validated indexes, query plans, and connection pooling — a missing index can cost more than the entire sharding effort and is reversible in hours.
  • The team lacks operational DB experience — monitoring N independent databases, managing cross-shard schema migrations, and diagnosing hotspots are non-trivial skills.
  • You're handling fewer than 10 million rows — vertical scaling and read replicas handle this range without the irreversible complexity of sharding.

Sharding is the nuclear option of database scaling — correct when nothing else works, expensive to walk back, and worth every penny when you actually need it.

The sharding readiness checklist

Before sharding, verify: (1) indexes are optimal — run EXPLAIN ANALYZE on slow write paths; (2) connection pooling is in place — PgBouncer in transaction mode; (3) read load is on read replicas, not the primary; (4) the primary is on the largest available instance; (5) write-intensive paths are batched where the business allows. If you've cleared all five and the primary is still write-saturated, shard.


Real-World Examples

Instagram (2012) — From 1 to 512 Logical Shards

When Instagram reached 30 million users, their single PostgreSQL primary was write-saturated. They sharded by user_id, distributing 512 logical shards across 12 physical PostgreSQL nodes — not 512 separate servers. The key insight from their 2012 engineering blog: logical shards (database namespaces) are far cheaper than physical machines.

Initially, multiple logical shards live on the same physical node. To add capacity, you migrate a logical shard from a busy physical node to a new one — no key redistribution, no algorithmic change. The shard key (user_id) worked well for profile lookups but forced scatter-gather for global activity feeds, eventually driving them to build a dedicated fanout service that pre-writes events into follower timelines at write time.
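The logical-to-physical indirection described above can be sketched in a few lines (shard counts are illustrative; this is not Instagram's actual code):

```typescript
// Logical-shard indirection: keys map to one of 512 logical shards forever;
// a mutable table maps logical shards to physical nodes. Adding capacity is
// a config change that moves a namespace — no key ever remaps.
const NUM_LOGICAL_SHARDS = 512;

function logicalShard(userId: number): number {
  return userId % NUM_LOGICAL_SHARDS; // stable for the system's lifetime
}

// logical shard → physical node (illustrative: 512 logical on 4 physical)
const physicalMap: number[] = Array.from(
  { length: NUM_LOGICAL_SHARDS },
  (_, l) => l % 4
);

function physicalNode(userId: number): number {
  return physicalMap[logicalShard(userId)];
}

// Capacity add: migrate logical shard 7's data to a new physical node 4,
// then flip one map entry. Every key's logical home is untouched.
physicalMap[7] = 4;
console.log(logicalShard(7));  // → 7 (unchanged)
console.log(physicalNode(7));  // → 4 (now served by the new node)
```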

Discord (2023) — 1 Trillion Messages Partitioned by Channel

Discord stores over 1 trillion chat messages. Their data model (originally Cassandra, later ScyllaDB) partitions by (channel_id, message_bucket) where message_bucket = floor(epoch_seconds / BUCKET_SIZE). Every message in a channel within a time window lives on the same partition — making "load the last 50 messages in this channel" a fast single-partition scan.

The trade-off was deliberate: "show all messages by a specific user across all channels" (a support query) is scatter-gather across every partition. Discord explicitly optimized for the dominant read pattern (channel history) at the cost of the minority pattern (cross-channel user audit). The lesson: shard key selection is a read-pattern alignment decision, not just a write-distribution decision.

Vitess (2012, YouTube) — Transparent MySQL Sharding at Scale

Vitess is Google's open-source sharding middleware for MySQL, originally built internally at YouTube in 2012 to manage a sprawling set of 50+ manually operated MySQL shards. YouTube's team was spending significant engineering time on shard-specific routing code and inconsistent migration procedures across services. Vitess abstracted all of that behind a single MySQL-compatible endpoint: application code queries a standard MySQL connection, and Vitess transparently routes each query to the correct shard, rewrites scatter-gather queries, and handles resharding online with no downtime.

Key capabilities: VSchema (describing how each table is sharded), scatter-gather with rate-limiting to prevent shard storms, and online schema changes without table locks. GitHub adopted Vitess in 2018 to scale MySQL beyond what a single primary could sustain, migrating to a Vitess cluster handling hundreds of thousands of queries per second. Slack followed a similar path for their message storage tier.

The lesson: building routing logic in application code creates inconsistency and maintenance debt across every service; a middleware layer like Vitess standardizes sharding conventions and keeps application code clean — the shard topology can change without any application deployment. My recommendation: if you're sharding MySQL at scale, Vitess is worth evaluating before building routing logic from scratch.


How This Shows Up in Interviews

When to proactively introduce sharding

Introduce sharding when you identify a write bottleneck — not just "we have a lot of data." In any design where you project more than 50K writes/second to a single service's primary, say: "At this write volume we'll saturate a single DB primary within 18 months. I'd shard by [key] — that distributes writes evenly while keeping [primary query pattern] on a single shard." Naming the key and justifying it in one sentence is the first signal of real depth.

Don't conflate sharding with replication — they solve different problems

Interviewers mark down candidates who propose both sharding and replication without explaining which bottleneck each solves. Read replicas spread read throughput — the primary still takes all writes. Sharding distributes writes across multiple primaries. State the specific bottleneck you're solving before prescribing the solution.

The callouts above cover the conceptual baseline. Senior and staff candidates are expected to go further — to demonstrate operational reasoning, not just pattern recognition. My recommendation: practice naming the shard key and justifying it in a single sentence for every major design you've prepared.

Depth expected at senior/staff level:

  • State the shard key and justify it: "I'd shard by user_id because it aligns with the hottest query pattern — profile lookups — and keeps all of a user's data co-located, avoiding cross-shard joins on the write path."
  • Proactively call out the hotspot problem: "user_id sharding works until we have influencers with 50 million followers. I'd monitor per-shard P99 write latency to detect skew and route hot-key reads to a dedicated cache tier."
  • Distinguish resharding approaches: consistent hashing (1/N keys move) vs. naive modulo (nearly all keys move) — and explain why this matters operationally.
  • Name the cross-shard transaction problem and explain how the design avoids it: "The shard key keeps all order data for a user on one shard — payment and order state are co-located, so no distributed transaction is needed."
  • Do the connection pool math aloud: "50 app servers × 20 connections × 8 shards = 8,000 DB connections. I'd put PgBouncer in transaction-pooling mode in front of each shard."

Common follow-up questions and strong answers:

| Interviewer asks | Strong answer |
| --- | --- |
| "How would you shard the users table for a social platform?" | "Hash shard by user_id with consistent hashing and 150 vnodes per shard. Even write distribution, all user data co-located for profile lookups, and resharding cost is bounded at 1/N keys. Trade-off: aggregate queries like 'count signups this week' require scatter-gather — I'd precompute those into a separate analytics store via Change Data Capture." |
| "What's the difference between sharding and partitioning?" | "Partitioning is intra-database: splitting a table into multiple files within the same instance (PostgreSQL RANGE partitioning, MySQL table partitioning). It gives query pruning and maintenance benefits but no write-throughput scaling. Sharding is inter-database: separate instances, connection pools, and failure domains. Sharding gives write-throughput scaling; partitioning doesn't." |
| "A shard is getting 80% of all writes. How do you diagnose and fix it?" | "Compare per-shard write QPS to the expected 1/N baseline. If one shard is hot: (1) check for low-cardinality shard key, (2) check for a hot key (one entity driving massive write volume — celebrity problem), (3) check for a monotonically increasing key on a range shard. Fix by root cause: redistribute vnodes, route the hot entity to a side cache, or switch to hash sharding for monotonically increasing keys." |
| "How do you handle schema migrations across 50 shards?" | "Expand-Contract: deploy application code that handles both old and new schema before touching any shard. Roll the ALTER TABLE across shards one at a time using gh-ost (no table lock, tracks live writes via binlog). Only after all 50 shards complete migration, remove the old code path. This ensures any mid-migration shard still serves traffic without 500s." |
| "How do cross-shard transactions work?" | "They don't natively — a transaction is scoped to one connection, which is one shard. For operations spanning two shard keys, use either Distributed 2PC (correct but slow, avoid) or the Saga pattern with compensating transactions via message queue (eventually consistent). The best answer is always: design the shard key to co-locate transactionally related data so cross-shard transactions are never needed." |

The interview is not a test of whether you know the term "scatter-gather" — it's a test of whether you know when to pay that cost deliberately.




Quick Recap

  1. Sharding is the only mechanism that scales database write throughput horizontally — caching reduces reads, read replicas spread read load, but only distributing the primary across multiple machines breaks the write ceiling.
  2. The shard key determines which shard a row lives on and is the most consequential schema decision in the system — it must be high-cardinality, write-distributed, aligned with the hottest query pattern, and is practically immutable at scale.
  3. Hash sharding distributes writes evenly but breaks range queries; range sharding preserves query locality but creates hotspots for monotonically increasing keys; consistent hashing bounds resharding cost to 1/N key migration when topology changes.
  4. The hotspot problem — one shard bearing disproportionate traffic because one entity drives concentrated writes — is the most common production failure in real-world sharding and must be anticipated when choosing a shard key.
  5. Cross-shard queries require scatter-gather (fan out to N shards, merge in application), which is expensive — design the shard key to co-locate transactionally related data, and precompute cross-shard aggregates via Change Data Capture.
  6. Start with 50 logical shards on 5 physical machines: logical shards let you add capacity by migrating a namespace to a new machine with one config change, not by remapping keys — zero algorithmic change, bounded migration cost.
  7. In every interview, name the shard key and justify it in the same sentence: "I'd shard by user_id — that distributes writes evenly while keeping all of a user's data co-located, which avoids cross-shard joins on the hot path." That sentence signals you understand what makes sharding hard, not just that it exists.

Related Concepts

  • Databases — Covers the internals sharding depends on: B-tree indexes, connection pooling, WAL throughput, and why single-node write capacity has a hard ceiling that vertical scaling alone cannot escape.
  • Replication — Sharding and replication solve different problems: sharding splits writes across multiple primaries; replication duplicates a primary to serve reads. Production systems use both — every shard has its own replica set.
  • Consistency models — Sharding complicates consistency: cross-shard transactions require distributed coordination, and eventual consistency often becomes the pragmatic default for cross-shard reads. Understand which consistency you're trading before sharding.
  • Caching — The correct first response to a read-heavy DB bottleneck is caching, not sharding. A cache layer in front of a single DB often defers sharding by years. Knowing when caching is insufficient is the prerequisite for knowing when sharding is warranted.
  • Load balancing — Load balancers distribute app-tier traffic; shard routers distribute DB-tier traffic. Both are routing layers operating at different tiers with different failure modes. Understanding health checks and failover at the load balancer layer informs resilient shard router design.
