Learn how data partitioning splits rows across nodes for horizontal scalability, when to pick range vs hash vs directory-based strategies, and how to handle hotspots and rebalancing.
42 min read · 2026-04-04 · medium · partitioning · sharding · distributed-systems · databases
Data partitioning splits rows across multiple nodes so no single machine is the bottleneck. It enables horizontal write scaling but introduces query routing complexity and rebalancing pain.
Three core strategies: range partitioning (key ranges map to specific shards), hash partitioning (hash(key) % N picks the shard), and directory-based (an explicit lookup table maps keys to shards).
Range partitioning supports efficient range queries but creates hotspots when keys are sequential (timestamps, auto-increment IDs).
Hash partitioning distributes load evenly but forces scatter-gather for any range query.
Consistent hashing is the production standard because it minimizes data movement when nodes are added or removed (only ~1/N keys migrate instead of ~80%).
Your e-commerce platform has been growing for two years. The founding team made a sensible decision: one PostgreSQL database, vertically scaled. It's now on the largest RDS instance available, 512 GB RAM, NVMe storage.
Last month, the product catalog crossed 800 million rows. The orders table alone is 2 TB. Full table scans for analytics queries that used to take 30 seconds now take 12 minutes. Write latency at peak has crept from 2ms to 45ms because the WAL is competing with index maintenance on a table that no longer fits in memory.
Backups now take 6 hours. A VACUUM FULL on the orders table takes your database offline for 45 minutes. Schema migrations on an 800-million-row table require hours of ALTER TABLE time. The DBA team is spending more time managing this single database than the feature teams spend building features.
And it's getting worse: the data is growing at 50 million rows per month. In 6 months, you'll have 1.1 billion rows. The backup window won't fit in off-peak hours. The indexes will no longer fit in memory, and B-tree lookups will start hitting disk on every query.
You've added read replicas, but the problem isn't reads. It's that a single node is storing, indexing, and writing all 800 million rows. I've watched teams try every optimization in the book at this stage: connection pooling, query tuning, bigger hardware, table partitioning within the same database. Each one buys a few weeks. None of them changes the fundamental math. The most frustrating part is that your database metrics show the problem clearly (CPU at 90%, disk IOPS at ceiling, replication lag growing) but no single-node optimization can fix it.
A single machine has a finite disk, a finite memory bus, and a finite WAL throughput. Once your dataset exceeds what one node can handle, you need to split the data itself.
Replication doesn't solve this. A read replica can offload SELECT queries, but every INSERT still goes to the primary. Caching helps read-heavy workloads, but if your bottleneck is write throughput, cache hits don't reduce write load. Vertical scaling hits a ceiling (you can't buy infinite RAM). The only path forward is splitting the data itself across multiple independent machines.
The fix is partitioning: splitting the 800 million rows across multiple nodes so each node stores and indexes a manageable subset. Instead of one machine doing all the work, you distribute both the storage and the compute.
The concept is simple to state but has deep operational implications: query routing, rebalancing, cross-shard queries, distributed transactions, and per-shard monitoring all become your problem. This is why partitioning is a last resort, not a first optimization.
If you're unsure whether you need partitioning, you probably don't. But when you hit this wall, nothing else fixes it.
Data partitioning is dividing a dataset into smaller, independent subsets (called partitions or shards) and distributing them across multiple nodes. Each partition holds a different slice of the rows but shares the same schema. Together, the partitions hold the complete dataset. The key distinction from replication: replication copies the same data to multiple places (redundancy); partitioning splits different data across multiple places (distribution).
Analogy: Imagine a library with 10 million books on one floor. Finding any book requires walking the entire floor. Now split the library across 10 floors, A-C on floor 1, D-F on floor 2, and so on. Each floor is smaller, faster to search, and can be staffed independently. The directory at the entrance tells you which floor to visit.
That directory is the partition routing logic. The floor assignment rule (alphabetical ranges) is the partitioning strategy. And when the library grows and you add an 11th floor, the books that need to move are the rebalancing problem. The library analogy also explains the scatter-gather cost: if someone asks "find all books by Author X" and Author X has books across 5 floors, you need to send librarians to all 5 floors and merge the results.
Each shard now holds ~270 million rows instead of 800 million. Indexes fit in memory. WAL throughput per node is 3x lower. Write latency drops back to 2ms. For your interview: partitioning is how you scale writes past what a single machine can handle.
These terms are sometimes used loosely, so let's be precise:
Partitioning is the general concept of splitting data into subsets. It can happen within a single database (Postgres table partitioning) or across multiple machines.
Sharding is partitioning across multiple database instances (separate machines). Every shard is a full database server with its own storage, compute, and connection pool.
Replication is copying the same data to multiple nodes for redundancy and read scaling. Replicas hold identical data; partitions hold different data.
You can combine them: shard your data across 4 machines, and replicate each shard to 2 replicas. That gives you both write scaling (4 shards) and read scaling + fault tolerance (2 replicas per shard). In interviews, mentioning this combination shows you understand the full picture.
Here's what happens on every query in a hash-partitioned system, step by step:
Client sends a request: GET /users/user_8374. The app server needs to fetch this user's profile.
Compute the partition key: The app hashes the user ID: hash("user_8374") -> 0xA3F2.... This is deterministic, so the same key always maps to the same shard.
Route to the correct shard: hash_value % num_shards (or a ring lookup for consistent hashing). The router determines shard 2 owns this key.
Execute on the target shard: The query runs on shard 2 only. Other shards are untouched. Index lookups are faster because the shard holds 1/N of the total rows.
Return the result: Shard 2 responds directly. No cross-shard coordination needed for single-key lookups.
The three common routing strategies are:
Client-aware routing: The application library knows the partition map and connects directly to the right shard. The client caches the map locally and refreshes it periodically or on topology changes. Used by Redis Cluster and Cassandra drivers. Lowest latency (no proxy hop), but the client must handle topology changes gracefully.
Proxy-based routing: A middleware layer (like Vitess VTGate, ProxySQL, or PgBouncer with routing rules) intercepts queries and routes them to the correct shard. The application connects to the proxy as if it were a single database. Simplest for application developers, but adds a network hop.
Coordinator-based routing: A centralized query planner dissects the query and distributes subqueries to the appropriate shards. Used by CockroachDB and YugabyteDB. Most transparent (the app sees one logical database), but the coordinator can become a bottleneck.
For most custom sharding setups, proxy-based routing is the best starting point. It keeps the application code clean and centralizes the routing logic.
```typescript
// Simplified partition routing
function getShardForKey(key: string, numShards: number): number {
  const hash = murmurHash3(key); // consistent, fast hash
  return hash % numShards;       // deterministic shard assignment
}

async function getUser(userId: string): Promise<User> {
  const shardId = getShardForKey(userId, NUM_SHARDS);
  const pool = shardPools[shardId]; // each shard has its own connection pool
  const result = await pool.query(
    'SELECT * FROM users WHERE id = $1',
    [userId]
  );
  return result.rows[0];
}
```
For queries that span multiple partition keys (like "find all users in Seattle"), the story changes. The app must scatter the query to all shards and gather the results. This is the most expensive operation in a partitioned system, and your latency equals the slowest shard.
```typescript
// Scatter-gather: query all shards when partition key is unknown
async function findUsersByCity(city: string): Promise<User[]> {
  const promises = shardPools.map((pool, shardId) =>
    pool.query('SELECT * FROM users WHERE city = $1', [city])
  );
  const results = await Promise.all(promises);  // wait for ALL shards
  const merged = results.flatMap(r => r.rows);  // merge result sets
  return merged.sort((a, b) => a.name.localeCompare(b.name)); // sort globally
}
```
This scatter-gather pattern is the fundamental cost of partitioning. Single-key lookups are fast, but queries that don't include the partition key hit every shard. The more shards you have, the higher the scatter-gather overhead. At 8 shards this is manageable; at 200 shards, scatter-gather queries dominate your latency budget.
Scatter-gather is not free
A query missing the partition key in its WHERE clause forces a fan-out to all N shards. At 50 shards, that is 50 parallel queries, 50 result sets to merge, and your latency is governed by the slowest shard (p99 tail latency). Design your partition key around your hottest query pattern, not your data model.
Partition key
The column (or columns) used to determine which shard owns each row. This is the single most important decision in the system: a bad partition key cannot be fixed without resharding all data.
Partition function
The algorithm that maps a key to a shard: hash modulo, range boundary, consistent hash ring, or directory lookup. Determines data distribution and rebalancing behavior.
Partition router
The layer (client library, proxy, or middleware) that directs each query to the correct shard.
Shard
An individual database node (or replica set) holding one slice of the data. Same schema, different rows.
Rebalancer
The process that redistributes data when shards are added, removed, or become uneven. Can run in the background (consistent hashing) or require downtime (hash modulo).
Global secondary index
An optional cross-shard index that maps non-partition-key columns to shard locations, enabling efficient lookups on secondary attributes. Updated asynchronously in most systems.
Directory service
A metadata store (used in directory-based partitioning) that maps key ranges to shard addresses. Must be highly available.
Group keys by contiguous ranges. Shard 1 holds keys A-F, Shard 2 holds G-M, Shard 3 holds N-Z. The boundary points are either fixed at setup time or adjusted dynamically as data grows.
Range partitioning shines for time-series and ordered data. A query like "get all orders from Jan 1-7" hits exactly one shard because those keys are co-located. The downside: sequential writes (timestamps, auto-increment IDs) funnel all traffic into a single "hot" shard. I'll often see teams pick range partitioning for timestamped data and then wonder why one shard is on fire while the rest are idle. The write distribution matters as much as the read pattern.
Hotspot mitigation for time-series data:
```sql
-- ❌ Naive partition key: just the date
-- All writes for today → one shard, crushes it
partition_key = '2026-04-04'

-- ✅ Salted partition key: hash prefix + date
-- Spreads today's writes across 16 shards
partition_key = (hash(user_id) % 16) || '_' || '2026-04-04'

-- Tradeoff: range queries for one date now hit 16 shards
```
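The same salting pattern, sketched in TypeScript. The bucket count, key format, and the simple char-code hash are all illustrative stand-ins, not a specific library's API:

```typescript
// Salted partition key: spread one day's writes across SALT_BUCKETS shards
const SALT_BUCKETS = 16;

function saltBucket(userId: string): number {
  // any stable hash works; a char-code rolling hash is enough for a sketch
  let h = 0;
  for (const c of userId) h = (h * 31 + c.charCodeAt(0)) >>> 0;
  return h % SALT_BUCKETS;
}

// Write path: each event gets exactly one targeted key
function writeKey(userId: string, date: string): string {
  return `${saltBucket(userId)}_${date}`;
}

// Read path: a range query for one date must now fan out to all 16 buckets
function readKeysForDate(date: string): string[] {
  return Array.from({ length: SALT_BUCKETS }, (_, b) => `${b}_${date}`);
}

console.log(writeKey("user_8374", "2026-04-04")); // "<bucket>_2026-04-04"
console.log(readKeysForDate("2026-04-04").length); // 16 keys to query
```

The write path stays single-shard and deterministic; the read path pays the 16-way fan-out, which is exactly the tradeoff the SQL comment above describes.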
Common range partition systems set split points manually or let the database auto-split when a range exceeds a size threshold (HBase auto-splits at 10 GB by default). The key operational concern is monitoring shard sizes and proactively splitting before a shard becomes a bottleneck.
Used in: DynamoDB sort key partitions, Cassandra (with careful partition key design), Google BigTable.
Apply a hash function to the key and use modulo N to pick the shard. hash("user_1001") % 4 → shard 2. This spreads data uniformly regardless of key distribution.
The catch: hash(key) % N breaks when N changes. Adding a fifth shard means hash(key) % 5 produces different assignments for most keys, so ~80% of your data needs to move. That is why naive hash partitioning is rarely used without consistent hashing in production.
The other downside people forget: hash partitioning destroys data locality. If you need to fetch "all orders for user_123 sorted by date," and user_123's orders are spread across 4 shards by hash(order_id), you scatter-gather across all shards and sort the merged results. Partitioning by user_id instead keeps all of one user's data on one shard, making per-user queries single-shard.
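The ~80% figure is easy to verify empirically. This sketch hashes 100,000 synthetic keys (fnv1a is a stand-in for any uniform hash) and counts how many change shards when the modulo goes from 4 to 5:

```typescript
// FNV-1a: a simple, deterministic 32-bit string hash (stand-in for murmur)
function fnv1a(key: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    h ^= key.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// Fraction of keys whose shard changes when the shard count goes from
// `before` to `after` under naive modulo assignment
function reshuffleFraction(numKeys: number, before: number, after: number): number {
  let moved = 0;
  for (let i = 0; i < numKeys; i++) {
    const h = fnv1a(`user_${i}`);
    if (h % before !== h % after) moved++; // shard assignment changed
  }
  return moved / numKeys;
}

console.log(reshuffleFraction(100_000, 4, 5).toFixed(2)); // ~0.80: most keys move
```

The math behind it: a key stays put only when hash % 4 === hash % 5, which holds for 4 of every 20 residues, so roughly 80% of keys migrate.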
Consistent hashing maps both keys and nodes onto the same circular hash ring (typically 0 to 2^32-1). Each key lands on the ring at hash(key) and walks clockwise to find the first node. When a node is added or removed, only the keys between the new/removed node and its neighbor need to move.
Here's a concrete example. You have 4 nodes at positions 0, 90, 180, and 270 on the ring.
Key "user_42" hashes to position 150.
Walking clockwise from 150, the first node is at 180 (Node C), so Node C owns this key.
If you add Node E at position 135, the key at 150 still walks to Node C.
But a key at position 100 that previously walked to Node C (at 180) now walks to Node E (at 135).
Only keys between 90 and 135 migrate from Node C to Node E.
The practical problem with basic consistent hashing is uneven distribution. Four nodes don't divide the ring into equal arcs. Virtual nodes (vnodes) solve this: each physical node gets 100-200 positions on the ring, creating a much more uniform distribution.
Standard hash (4 to 5 nodes): ~80% of keys must move. Consistent hashing (4 to 5 nodes): only ~20% move. Virtual nodes improve balance further, reducing the variance in per-node data share from ~30% to under 5%.
Used in: Cassandra, DynamoDB, Redis Cluster, Amazon S3.
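The ring walk described above can be sketched end to end. This is a toy implementation: the vnode count and the fnv1a hash are illustrative choices, and the linear ring scan would be a sorted-array binary search in production:

```typescript
// Minimal consistent hash ring with virtual nodes (illustrative sketch)
function fnv1a(key: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    h ^= key.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

class HashRing {
  private ring: { pos: number; node: string }[] = [];

  constructor(nodes: string[], private vnodesPerNode = 150) {
    nodes.forEach((n) => this.addNode(n));
  }

  addNode(node: string): void {
    // each physical node gets many positions on the ring
    for (let v = 0; v < this.vnodesPerNode; v++) {
      this.ring.push({ pos: fnv1a(`${node}#${v}`), node });
    }
    this.ring.sort((a, b) => a.pos - b.pos);
  }

  getNode(key: string): string {
    const h = fnv1a(key);
    // walk clockwise: first vnode at or past the key's position, wrapping to 0
    return (this.ring.find((e) => e.pos >= h) ?? this.ring[0]).node;
  }
}

// Measure how many keys migrate when a 5th node joins a 4-node ring
const ringBefore = new HashRing(["A", "B", "C", "D"]);
const ringAfter = new HashRing(["A", "B", "C", "D"]);
ringAfter.addNode("E");

const total = 10_000;
let migrated = 0;
for (let i = 0; i < total; i++) {
  if (ringBefore.getNode(`user_${i}`) !== ringAfter.getNode(`user_${i}`)) migrated++;
}
console.log((migrated / total).toFixed(2)); // close to 0.20, i.e. ~1/N
```

Compare this with the modulo experiment: the same 4-to-5 node change moves roughly a fifth of the keys instead of four fifths.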
An explicit lookup table maps each key or key range to a specific shard. Maximum flexibility: you can migrate specific key ranges without rehashing, and put hot tenants on dedicated hardware.
The tradeoff is that the directory itself becomes a centralized dependency. Every query needs a directory lookup (or a cached copy), and the directory must be highly available. If the directory goes down, no query can be routed. In practice, teams cache the directory in every app server's memory and invalidate on topology changes. Vitess (YouTube's MySQL sharding layer) uses this approach with etcd as the directory store.
Directory-based partitioning is also the only strategy that supports online shard merging: you can consolidate two small shards into one shard by updating the directory, then migrating the data in the background. This is useful when shards become too small to justify the operational overhead of maintaining them.
In practice, many systems combine strategies. A common pattern partitions first by hash (for write distribution) and then by range within each partition (for sorted access within a shard). DynamoDB does exactly this: the partition key is hashed to pick a partition, and the sort key provides ordered access within that partition.
Another hybrid: partition by geography at the top level (US-East, EU-West) and by hash within each region. This gives you data locality for latency-sensitive workloads plus even distribution within each region.
A special case of composite partitioning: route data to the database cluster closest to the user's region. This reduces read latency (data is physically near the user) and can help with data residency regulations (EU data stays in EU data centers). CockroachDB and YugabyteDB support geo-partitioning natively using PARTITION BY LIST on a region column.
The tradeoff: cross-region queries (e.g., a global admin dashboard) must fan out to all regions. And writes to data owned by another region require a cross-region round trip, adding 50-200ms of latency depending on geography.
When you add or remove nodes, data needs to move. The rebalancing strategy determines how painful this is:
Fixed partition count: Create many more partitions than nodes at the start (e.g., 1,000 partitions across 10 nodes). When you add a node, move entire partitions to it rather than re-hashing keys. Each node gives up a few partitions. Used by: Elasticsearch, Riak, Couchbase.
Dynamic splitting: Start with few partitions and split them when they grow beyond a threshold (for example, 10 GB per partition in DynamoDB). Each split produces two smaller partitions, one of which can be moved to a new node. This is operationally simple but can cause brief latency spikes during splits. Used by: HBase, MongoDB (auto-splitting), DynamoDB.
Virtual node rebalancing (consistent hashing): Each physical node owns many virtual positions on the ring (typically 128-256). Adding a node creates new virtual positions, and only keys between the new position and the previous position migrate. This is the most graceful approach and happens incrementally. The downside: more virtual nodes means more metadata to maintain and a slightly larger partition map. Used by: Cassandra, Riak.
My recommendation: for interviews, say "fixed partition count with consistent hashing" and you cover 90% of production systems. The exact mechanics vary, but the principle is always the same: pre-divide data into many small chunks so moving a few chunks is cheaper than re-partitioning the entire dataset.
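The fixed-partition-count idea can be sketched directly. The steal-based rebalance below is illustrative (real systems track the partition-to-node map in a metadata service); the key property is that key-to-partition assignment never changes:

```typescript
// Fixed partition count: key -> partition is fixed forever; rebalancing
// only changes the partition -> node map, moving whole partitions.
const NUM_PARTITIONS = 1000;

function fnv1a(key: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    h ^= key.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// Never re-hashed during rebalancing
const partitionFor = (key: string): number => fnv1a(key) % NUM_PARTITIONS;

// Start with 4 nodes, 250 partitions each (round-robin assignment)
const owner: string[] = Array.from(
  { length: NUM_PARTITIONS },
  (_, p) => `n${(p % 4) + 1}`
);

// A new node steals whole partitions from over-loaded nodes until it
// holds its fair share (NUM_PARTITIONS / nodeCount)
function addNode(map: string[], newNode: string, nodeCount: number): string[] {
  const target = Math.floor(map.length / nodeCount);
  const counts = new Map<string, number>();
  map.forEach((n) => counts.set(n, (counts.get(n) ?? 0) + 1));
  const result = [...map];
  let stolen = 0;
  for (let p = 0; p < result.length && stolen < target; p++) {
    if ((counts.get(result[p]) ?? 0) > target) {
      counts.set(result[p], counts.get(result[p])! - 1);
      result[p] = newNode;
      stolen++;
    }
  }
  return result;
}

const newOwner = addNode(owner, "n5", 5);
const movedPartitions = owner.filter((n, p) => n !== newOwner[p]).length;
console.log(movedPartitions); // 200 of 1000 partitions move; keys stay put
```

Exactly 1/5 of the partitions change owner, and because `partitionFor` is untouched, no key is ever re-hashed: rebalancing is pure data movement, not reassignment.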
For your interview: consistent hashing is the default answer when the interviewer asks how you partition. Range partitioning is the answer when they ask about time-series data.
Benefits:
Each shard's indexes fit in memory, keeping read latency low
Failure isolation: one shard going down doesn't take the whole dataset offline
Independent maintenance: vacuum, reindex, backup per shard
Can place hot shards on better hardware (heterogeneous scaling)
Reduces lock contention (fewer writers per shard)

Costs:
Cross-shard joins are expensive or impossible without denormalization
Transactions spanning multiple shards need distributed 2PC or saga patterns
Schema migrations must be coordinated across all shards simultaneously
Rebalancing when adding/removing nodes is operationally complex
Operational overhead: monitoring N shards, N backup jobs, N failover targets
The fundamental tension is query flexibility vs. write scalability. Every partitioning strategy optimizes for one access pattern and penalizes others. The partition key determines which queries are cheap (single-shard) and which are expensive (scatter-gather). Pick the wrong partition key and your most common query becomes your most expensive one.
Once you partition, you need per-shard observability. The metrics that matter:
Per-shard QPS and latency (p50, p99): Detects hot shards. If one shard's p99 is 5x the others, it's either overloaded or has a skewed data distribution.
Per-shard data size (GB): Detects data skew. Shards should be within 20% of each other. If one shard is 3x larger, rebalance or investigate the partition key.
Scatter-gather query rate: The percentage of queries that hit all shards. If this exceeds 30%, your partition key doesn't match the dominant access pattern. Track this metric after launch and revisit the partition key if it's too high.
Replication lag per shard: In a partitioned + replicated setup, one shard's replica falling behind can cause stale reads for that subset of data. Alert on lag > 1 second.
Without per-shard monitoring, you're flying blind. Aggregate metrics hide partition-level problems. A healthy average latency of 5ms might be hiding one shard at 50ms and the rest at 2ms.
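A per-shard health check is straightforward to sketch. The p99 numbers here are hypothetical; the point is that comparing each shard to the fleet median exposes the outlier that the average hides:

```typescript
// Flag shards whose p99 latency is far above the fleet median
function hotShards(p99ms: number[], factor = 5): number[] {
  const sorted = [...p99ms].sort((a, b) => a - b);
  const median = sorted[Math.floor(sorted.length / 2)];
  return p99ms
    .map((v, shard) => ({ shard, v }))
    .filter(({ v }) => v > factor * median)
    .map(({ shard }) => shard);
}

// Hypothetical per-shard p99s (ms): the average looks fine, shard 3 does not
const p99 = [2, 3, 2, 50, 3, 2, 2, 3];
console.log(hotShards(p99)); // [3]
```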
Secondary indexes are the hardest part of partitioning. I'll often see candidates design a partition key perfectly for their primary access pattern, then realize their second-most-common query doesn't include the partition key at all.
Two approaches exist:
Local secondary indexes (per-shard): Each shard maintains its own index. A query on a non-partition-key column scatters to all shards, each returns local matches, and the router merges results. Writes are fast (only the local shard's index updates), but reads on secondary columns cost O(N shards). This is the default in MySQL and PostgreSQL sharding setups.
Global secondary indexes: A separate index structure maps secondary column values to shard locations. Reads are fast (one index lookup, then targeted shard reads). But index updates are asynchronous, so reads can be stale. DynamoDB's Global Secondary Indexes (GSIs) use this approach with eventual consistency. The lag is typically under one second, but during partitions or high write load, it can spike to several seconds.
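A toy model of the global approach, with the asynchronous index update made explicit. The Map and the pending queue are stand-ins for a real index store and replication stream, not any specific database's API:

```typescript
// Sketch: global secondary index mapping city -> (shard, userId) entries,
// updated asynchronously after the primary write (hence possibly stale reads)
type IndexEntry = { shardId: number; userId: string };

const gsi = new Map<string, IndexEntry[]>(); // city -> entries
const pending: (() => void)[] = [];          // simulated async index updates

function writeUser(shardId: number, userId: string, city: string): void {
  // 1. primary write to the owning shard happens here (omitted)
  // 2. the index update is queued, not applied inline with the write
  pending.push(() => {
    gsi.set(city, [...(gsi.get(city) ?? []), { shardId, userId }]);
  });
}

function flushIndexUpdates(): void {
  while (pending.length) pending.shift()!();
}

writeUser(2, "user_8374", "Seattle");
console.log(gsi.get("Seattle") ?? []); // []  <- stale: update not yet applied
flushIndexUpdates();
console.log(gsi.get("Seattle"));       // [{ shardId: 2, userId: "user_8374" }]
```

The window between the write and the flush is the staleness the text describes: a read against the index in that window misses the new row, even though the primary shard already has it.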
Queries that need data from multiple shards are the most expensive operations in a partitioned system. Four strategies to handle them:
Scatter-gather: Send the query to all N shards, aggregate results at the router. This is unavoidable for global aggregate queries (e.g., "count all orders"). Cost scales linearly with shard count, and your latency equals the slowest shard's response.
Denormalization: Embed frequently joined data into the same shard. If you always fetch orders with their product names, store a copy of the product name in the orders table. Cost: data redundancy and update propagation complexity.
Co-location: Ensure related data lives on the same shard. Shopify partitions by shop_id, so all products, orders, and customers for one shop are on the same shard. Cross-shop queries are expensive, but per-shop queries (the hot path) never cross shards.
Application-side join: Query shard A for a set of IDs, then query shard B with those IDs. Two round trips instead of one database join, but each round trip is a targeted single-shard query. Works well when one side of the join is small.
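The application-side join can be sketched with in-memory arrays standing in for shard connection pools. Routing by the numeric ID suffix and the tiny dataset are purely illustrative:

```typescript
// Application-side join: orders sharded by user_id, products by product_id
type Order = { userId: string; productId: string };
type Product = { id: string; name: string };

// Illustrative routing: numeric suffix mod shard count
const shardFor = (id: string): number => parseInt(id.slice(1), 10) % 2;

// In-memory stand-ins for two shards of each table
const orderShards: Order[][] = [
  [{ userId: "u2", productId: "p2" }, { userId: "u2", productId: "p1" }], // shard 0
  [{ userId: "u1", productId: "p1" }],                                    // shard 1
];
const productShards: Product[][] = [
  [{ id: "p2", name: "Gadget" }], // shard 0
  [{ id: "p1", name: "Widget" }], // shard 1
];

function getOrdersWithNames(userId: string) {
  // Round trip 1: targeted single-shard read for the user's orders
  const orders = orderShards[shardFor(userId)].filter((o) => o.userId === userId);
  // Round trip 2: group needed product IDs by shard, one batched read per shard
  const idsByShard = new Map<number, string[]>();
  for (const o of orders) {
    const s = shardFor(o.productId);
    idsByShard.set(s, [...(idsByShard.get(s) ?? []), o.productId]);
  }
  const names = new Map<string, string>();
  for (const [s, ids] of idsByShard) {
    for (const p of productShards[s]) if (ids.includes(p.id)) names.set(p.id, p.name);
  }
  return orders.map((o) => ({ ...o, productName: names.get(o.productId) }));
}

console.log(getOrdersWithNames("u2"));
// [{ userId: "u2", productId: "p2", productName: "Gadget" },
//  { userId: "u2", productId: "p1", productName: "Widget" }]
```

Note that the second round trip only touches shards that actually own a needed product ID, so it stays targeted rather than scatter-gather.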
My recommendation: co-locate related data on the same shard whenever possible (partition orders and order_items by the same key). For the remaining cross-shard needs, build a CQRS read model. Don't try to make cross-shard joins work at scale; they will always be fragile.
Interview shortcut for secondary indexes
When the interviewer asks about querying non-partition-key columns, say: "local secondary indexes for write-heavy workloads, global secondary indexes for read-heavy workloads, with the tradeoff being write speed vs. read consistency." That covers 90% of what they want to hear.
Don't partition if:
Your team doesn't have the operational maturity to manage shard rebalancing, cross-shard migrations, and distributed schema changes.
You're doing it for read scaling only. Read replicas and caching solve that without the operational burden of partitioning.
Here's the honest answer: most applications never need partitioning. If you do need it, you'll know, because you'll have tried everything else and hit a hard ceiling.
The partition key decision is the most consequential one in the entire system. Here's my framework for picking it:
List your top 5 queries by frequency. Run pg_stat_statements or look at your APM tool.
Which columns appear in the WHERE clause most often?
Pick the column that appears in 60%+ of queries. That is your partition key candidate.
If your top query is WHERE user_id = ? and it accounts for 70% of traffic, partition by user_id.
Check for data skew. Count the rows per unique value of that column.
If one value has 1,000x more rows than the median (think celebrity users, enterprise tenants), hash partitioning on that column will create a hot shard.
Consider a composite key or directory-based assignment for outliers.
Check for write skew. Even if data is evenly distributed, writes might not be.
A partition key of region with 4 regions might look balanced, but if 80% of your users are in the US, the US shard absorbs 80% of writes.
Accept the scatter-gather tax. Every query that does not include the partition key will scatter.
Design a separate read path (CQRS, materialized views, search index) for those queries rather than trying to make one partition key serve all access patterns.
The rule of thumb: partition by the entity your application operates on most frequently. For user-facing apps, that is usually user_id or tenant_id. For event-driven systems, it is often entity_id (the thing the event is about). For time-series, it is a composite of entity_id + time_bucket.
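The skew check in the framework above boils down to one ratio. A sketch with hypothetical rows-per-user counts, where a single celebrity account dominates:

```typescript
// Detect partition-key skew: compare the hottest value's row count to the median
function skewRatio(rowCounts: number[]): number {
  const sorted = [...rowCounts].sort((a, b) => a - b);
  const median = sorted[Math.floor(sorted.length / 2)];
  const max = sorted[sorted.length - 1];
  return max / median;
}

// Hypothetical rows-per-user_id counts: one celebrity account dominates
const counts = [120, 95, 110, 130, 105, 98, 1_200_000];
console.log(skewRatio(counts) > 1000); // true: this key will create a hot shard
```

In practice you would feed this a `GROUP BY partition_key_candidate` count from the real table; a ratio in the thousands is the signal to reach for a composite key or directory-based assignment for the outliers.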
Cassandra (Netflix, Apple): Cassandra uses consistent hashing with 256 virtual nodes per physical node by default. Netflix runs thousands of Cassandra nodes partitioned by content ID and user ID. The partition key is chosen per table based on access pattern, and Netflix engineers have spoken publicly about the importance of avoiding partition hotspots on trending content. Adding a node redistributes only ~1/N of the data. Netflix processes over 1 trillion events per day across these partitions, with each partition capped at 100 MB to keep compaction manageable.
DynamoDB (Amazon): DynamoDB automatically partitions data by the table's partition key using consistent hashing. When a partition grows beyond 10 GB or exceeds 3,000 RCU / 1,000 WCU, DynamoDB splits it transparently. Users choose the partition key; DynamoDB handles everything else. Amazon's retail platform runs entirely on DynamoDB partitions during peak events like Prime Day (over 89 million requests per second in 2021), making it the largest production use of automatic transparent partitioning.
Vitess (YouTube, Slack): Vitess adds a horizontal sharding layer on top of MySQL. It uses directory-based partitioning with a topology service (etcd) that maps key ranges (called "keyspaces") to MySQL instances. YouTube migrated from a single MySQL database to hundreds of Vitess shards to handle billions of videos and metadata rows. Vitess supports online schema changes (applying ALTER TABLE across all shards without downtime) and vReplication for live shard splits. Slack uses Vitess to partition workspace data by team ID, keeping all messages for one workspace on the same shard for efficient queries.
MongoDB (sharded clusters): MongoDB supports automatic sharding with both hash and range shard keys. The mongos router uses a config server (directory-based) to map chunks to shards. MongoDB auto-splits chunks when they exceed 128 MB and auto-balances by migrating chunks between shards. The balancer runs in the background and can be scheduled to avoid peak hours. MongoDB's sharding is widely used for document-oriented workloads where the shard key is typically the tenant ID or document owner.
Instagram (Meta): Instagram shards PostgreSQL by user_id across 512 logical shards mapped to a smaller number of physical nodes. When they first sharded in 2012 (at ~30 million users), 12 physical PostgreSQL servers hosted all 512 logical shards — logical shards (database namespaces) are far cheaper than physical machines. To add capacity, they migrate a logical shard from a busy node to a new one with no key redistribution. The user_id shard key works well for profile lookups but forces scatter-gather for global feeds, which drove them to build a separate fanout service that pre-writes events into follower timelines at write time.
Discord: Discord stores over 1 trillion messages partitioned by (channel_id, message_bucket) where message_bucket = floor(epoch_seconds / BUCKET_SIZE). Originally on Cassandra, they migrated to ScyllaDB in 2023 for lower tail latency. Every message in a channel within a time window lives on the same partition, making "load the last 50 messages in this channel" a fast single-partition scan. The trade-off was deliberate: "show all messages by a specific user across all channels" is scatter-gather across every partition. Discord optimized for the dominant read pattern (channel history) at the cost of the minority pattern (cross-channel user audit) — a textbook example of shard key selection as a read-pattern alignment decision.
When to bring it up: Any time the system stores more data than fits on a single node, or writes exceed single-node throughput. If the interviewer says "100 million users" or "petabyte-scale data," partitioning should come up in your first pass of the high-level design. Don't wait for the interviewer to ask. My recommendation is to briefly state the partition key and strategy in one sentence ("I'd partition users by user_id with consistent hashing") and move on unless the interviewer wants to go deeper.
Depth expected at senior/staff level:
Name 2-3 partitioning strategies and explain when each is appropriate
Explain consistent hashing and why it beats naive modulo for elastic clusters
Discuss the scatter-gather cost of queries that miss the partition key
Address rebalancing: what happens when you add a shard, and how much data moves
Know the secondary index tradeoff (local vs. global)
Discuss hotspots: what causes them, how salting fixes them, and the tradeoff
Articulate why the partition key choice is driven by access patterns, not the data model
Know when partitioning is premature (and what simpler alternatives to try first)
The one sentence that signals partition expertise
Say: "The partition key must match my highest-traffic query's WHERE clause, and I accept that every other query pays a scatter-gather penalty." This shows you understand both the benefit and the cost in a single sentence.
Interviewer asks: "How would you partition this data?"
Strong answer: "I'd partition by user_id using consistent hashing. Most queries are per-user, so they hit a single shard. For the admin dashboard that queries across users, I'd use a separate read-optimized store (Elasticsearch or a materialized view) rather than scatter-gather."

Interviewer asks: "What happens when you add a new shard?"
Strong answer: "With consistent hashing, about 1/N of the keys migrate to the new node. We'd use virtual nodes (150+ per physical node) for even distribution. The migration runs in the background with a dual-read strategy: check the new shard first, fall back to the old shard during migration."

Interviewer asks: "How do you handle hotspots?"
Strong answer: "First, identify the hot key (monitoring per-shard QPS). For time-series hotspots, I'd salt the partition key with a hash prefix. For celebrity-user hotspots (one key getting 100x the traffic), I'd split that key's data across sub-partitions or serve it from a dedicated cache."

Interviewer asks: "How do you query a non-partition-key column?"
Strong answer: "Two options: local secondary indexes (every shard maintains its own, scatter-gather on read) or global secondary indexes (single lookup, but writes are eventually consistent). The choice depends on the read vs. write ratio for that access pattern."

Interviewer asks: "What about cross-shard transactions?"
Strong answer: "I'd avoid them where possible by co-locating related data on the same shard (partitioning by order_id keeps all order items together). When unavoidable, use a saga pattern with compensating transactions rather than distributed 2PC, which has poor availability characteristics."
Data partitioning splits rows across multiple nodes so each shard stores and indexes a manageable subset. It is the only way to scale writes past a single machine's ceiling.
The partition key is the most consequential decision. It should match the WHERE clause of your highest-traffic query. Every query that does not include the partition key pays a scatter-gather penalty.
Range partitioning is excellent for time-ordered queries but creates write hotspots on monotonically increasing keys. Salt the partition key to mitigate this.
Hash partitioning distributes data uniformly but makes range queries expensive (scatter-gather to all shards). Use consistent hashing to avoid the catastrophic reshuffling that hash % N causes when N changes.
Consistent hashing with virtual nodes is the production standard. It minimizes data movement to ~1/N when adding or removing nodes, and vnodes ensure even distribution.
Directory-based partitioning offers maximum flexibility for multi-tenant systems with uneven data sizes, at the cost of maintaining a highly available lookup directory.
Secondary indexes in partitioned systems come in two flavors: local (fast writes, slow cross-shard reads) and global (fast reads, eventually consistent writes). Pick based on your read-to-write ratio.
Exhaust all simpler alternatives before partitioning: caching, read replicas, vertical scaling, query optimization. Partitioning adds irreversible operational complexity that compounds as shard count grows.
Sharding: Sharding is the specific case of partitioning a database across multiple database instances. This article covers the broader partitioning strategies; the sharding article covers operational details like cross-shard joins, schema migrations, and shard management.
Consistent hashing: The hash ring algorithm that makes adding and removing partition nodes efficient. Understanding consistent hashing is essential for explaining rebalancing in interviews.
Databases: Partitioning is a scaling strategy layered on top of your database engine. Understanding how B-trees, WAL, and replication work at the single-node level helps you reason about partition-level behavior.
Replication: Partitioning and replication are complementary. Partitioning splits data for write scaling; replication copies data for read scaling and fault tolerance. Most production systems use both.
Indexing strategies: Indexes within each partition work exactly like single-node indexes, but understanding covering indexes and partial indexes helps you optimize per-shard query performance.
CQRS pattern: When partitioning makes certain read patterns expensive (scatter-gather), CQRS separates the read path into a dedicated store optimized for those queries. This is the standard escape hatch for cross-partition reads.
CAP theorem: Partitioning is the "P" in CAP. Once you distribute data across nodes, network partitions become a real concern, and you must choose between consistency and availability during those partitions.