Over-sharding anti-pattern
Learn why premature sharding creates expensive cross-shard queries, forces you to live with a bad shard key forever, and how to know when you actually need to shard.
TL;DR
- Over-sharding means partitioning your database before you have enough data and access pattern knowledge to choose a good shard key, resulting in a partition strategy you'll be stuck with.
- Choosing the wrong shard key turns single-row lookups into scatter-gather queries across all shards, adding latency, complexity, and cost.
- Most applications don't need sharding until they exceed a few hundred GB of hot data or tens of thousands of writes per second. Most systems never reach these thresholds on a single well-tuned replica set.
- The right progression: vertical scaling, then read replicas, then caching, then functional partitioning, then sharding (only when everything else is exhausted).
- Once you pick a shard key, changing it requires a full data migration. Premature sharding is worse than late sharding because you're locked into a key chosen without mature access pattern data.
The Problem
A startup shards their user table by user_id modulo 8 on day one. Their first 60 days of traffic don't come close to stressing a single Postgres instance. But they have 8 shards now, and they're already paying the complexity tax. Every migration takes 8 times longer. Every backup test runs 8 times. Every new engineer has to understand the routing layer before they can write a query.
I've watched this play out three times. Each time, the team was convinced they'd need sharding "any day now." Each time, they picked a shard key based on the primary key instead of actual query patterns.
The first warning sign is always the same: a product requirement changes, and suddenly the shard key doesn't match. A query that should take 10ms now takes 300ms because it fans out to every shard, waits for the slowest one, merges the results in application code, and then sorts and paginates.
Six months later, the product pivots. The access pattern that previously looked up users by user_id now needs to list all users in a geographic region. This query requires hitting all 8 shards, aggregating results in the application layer, sorting, and paginating. A scatter-gather query that takes 300ms instead of 10ms.
The timing breakdown tells the story:
| Query Type | Single DB | 8 Shards (scatter-gather) |
|---|---|---|
| Single user lookup | 2ms | 3ms (routing overhead) |
| Users by region | 10ms | 300ms (8 parallel queries + merge + sort) |
| User count by region | 5ms | 180ms (8 COUNT queries + sum) |
| Full-text search | 15ms | 400ms (8 searches + merge + rank) |
They can't re-shard easily. The migration requires moving hundreds of millions of rows. Every service that writes user data needs to be updated simultaneously. They're stuck with a bad shard key for years.
The hidden cost of every shard
Each additional shard multiplies operational complexity across every database operation:
| Operation | Single DB | 4 Shards | 8 Shards |
|---|---|---|---|
| Single-key lookup | 1 query | 1 query (+routing) | 1 query (+routing) |
| Range query (unknown shard) | 1 query | 4 queries | 8 queries |
| Cross-shard aggregation | 1 query | 4 queries | 8 queries |
| Schema migration | 1 DDL | 4 coordinated DDLs | 8 coordinated DDLs |
| Backup/restore test | 1 database | 4 databases | 8 databases |
| Transactions | ACID | Cross-shard: saga | Cross-shard: saga |
| Debugging a query | 1 database log | 4 logs to correlate | 8 logs to correlate |
| Connection pool size | N connections | 4N connections | 8N connections |
You're paying all these costs before you've needed them. And the worst part: you're paying them at the exact stage of your company when engineering bandwidth is most scarce.
Why It Happens
Over-sharding comes from individually reasonable decisions:
- "We'll need it eventually." The team anticipates growth that hasn't materialized yet and prepares infrastructure for 100x current load.
- Resume-driven development. Sharding sounds impressive in a design doc. "We shard across 8 nodes" looks better than "we use a single Postgres instance."
- Copying big-company playbooks. Google, Facebook, and Amazon shard everything because they operate at planetary scale. A startup with 10K users does not have the same constraints.
- Fear of migration pain. "If we shard later, the migration will be harder." This is true, but sharding now with the wrong key is worse than sharding later with the right one.
The core misconception: sharding is a scaling solution you can add preemptively. In reality, sharding is a commitment that constrains every future query pattern, schema change, and operational procedure.
One team I worked with spent 3 months building a sharding layer for a service that had 2GB of data. Those 3 months could have shipped 6 features. When I asked what problem sharding solved, the answer was "we might need it in two years." Two years later, the data was 15GB, still easily handled by a single instance.
The opportunity cost is the real killer. Every hour spent on sharding infrastructure is an hour not spent on product features, performance optimization, or paying down tech debt that actually matters. At the startup stage, engineering velocity is your most valuable asset.
How to Detect It
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.