Notion: Postgres sharding at scale
How Notion migrated from a single Postgres database to horizontally sharded clusters to handle explosive growth, and the infrastructure changes required to make it work transparently.
TL;DR
- Notion grew roughly 4x during 2020 (COVID-driven remote work adoption), and their single Postgres database hit a hard ceiling: CPU saturation above 90%, vacuum stalls blocking queries, and connection exhaustion even with PgBouncer.
- They chose workspace ID as the shard key because it colocates all pages, blocks, permissions, and users for a workspace on a single shard, keeping 99%+ of queries single-shard.
- The routing layer is an application-level library (not a separate service) backed by a lookup table (not a hash ring), which enables workspace-level rebalancing without full resharding.
- Live migration used dual-write, workspace-by-workspace backfill, checksum validation, and incremental cutover with zero downtime over a weeks-to-months timeline.
- Post-migration, Notion operates ~32 Postgres clusters with independent failover, per-shard schema migrations, and hot-shard monitoring for large enterprise workspaces.
The Trigger
By mid-2020, Notion's infrastructure team was fighting fires daily. The single Postgres primary that backed the entire product was CPU-saturated above 90% during peak hours. Vacuum operations (Postgres's mechanism for reclaiming dead rows) stalled during high-traffic windows, bloating tables and degrading query performance for everyone.
PgBouncer, their connection pooler, was maxed out at thousands of concurrent connections. Workers still queued behind it waiting for an available database connection.
The growth was extreme. Notion went from a tool popular with startups and individuals to a collaboration platform used by enterprises, schools, and remote teams worldwide. The COVID-driven shift to remote work in 2020 drove an estimated 4x increase in users over that single year.
Vertical scaling had already been pushed to its limits. They were running on the largest available Postgres-compatible instance class. Adding read replicas helped for analytics queries, but Notion's core workload (collaborative editing of blocks and pages) is write-heavy. Replicas cannot absorb write load.
I've been in this exact situation: staring at a Postgres CPU graph that flatlines at 95% during business hours and knowing that the next traffic spike will tip it into visible user impact. The question stops being "should we shard?" and becomes "how fast can we shard?"
The System Before
Notion's pre-sharding architecture was a single-primary Postgres setup with PgBouncer as the connection multiplexer. This is a common and reasonable architecture for early-to-mid-stage SaaS products, and it served Notion well until the scale outgrew it.
Notion's data model centers on blocks. Everything in Notion is a block: a paragraph, a heading, a database row, an embedded image. Pages are blocks that contain other blocks. Workspaces are the top-level container for all of a team's pages and blocks.
A simplified view of the core schema:
-- Simplified Notion block schema
CREATE TABLE blocks (
id UUID PRIMARY KEY,
workspace_id UUID NOT NULL,
parent_id UUID, -- parent block or page
type TEXT NOT NULL, -- paragraph, heading, database_row, etc.
properties JSONB, -- block content and metadata
created_at TIMESTAMPTZ,
updated_at TIMESTAMPTZ
);
CREATE INDEX idx_blocks_workspace ON blocks(workspace_id);
CREATE INDEX idx_blocks_parent ON blocks(parent_id);
The critical access patterns are: load all blocks for a page (by parent_id), load all pages in a workspace (by workspace_id), and update individual blocks during collaborative editing (by id). All three are heavily tied to the workspace concept.
Why PgBouncer alone was not enough
PgBouncer solves connection multiplexing, not CPU saturation. It reduces the number of physical Postgres connections by pooling and reusing them across application workers. This helps when the bottleneck is connection count (Postgres's per-connection memory overhead limits it to hundreds or low thousands of connections).
When the bottleneck is CPU on the primary, multiplexing connections does not help. You are sending the same volume of queries to the same single CPU.
Notion had already tuned PgBouncer extensively: transaction-level pooling, aggressive connection recycling, queue depth limits. The connection exhaustion problem improved, but the CPU saturation remained. PgBouncer cannot split a workload across multiple Postgres instances. That requires sharding.
Why vertical scaling hit a ceiling
Postgres runs on a single machine. The largest cloud instances available at the time (roughly 96-128 vCPUs and 768GB RAM on AWS) had finite CPU and I/O bandwidth, and Notion was already on or near that ceiling.
Even if bigger instances existed, vertical scaling has a second problem: single point of failure with a massive blast radius. One Postgres primary failing means the entire product is down for every user. Recovery (promote a replica, catch up replication lag) takes minutes, and those minutes affect everyone simultaneously.
Why Not Just Add More Read Replicas?
The first question any engineer asks: "Can we throw more replicas at this?" Notion's workload makes that insufficient for three reasons.
Collaborative editing is write-heavy. When a user types in a Notion page, every keystroke (or debounced batch) generates a write to update the block's content. A single active user produces dozens of writes per minute. A meeting with 10 people simultaneously editing produces hundreds. Replicas do not absorb writes.
Read-after-write consistency is required. When user A edits a block and user B views the same page 200ms later, user B must see user A's edit. This requires reading from the primary (or from a replica that has caught up past that write's LSN). In practice, Notion's core product reads mostly hit the primary to guarantee consistency.
Vacuum pressure scales with writes, not reads. More replicas do not reduce the vacuum load on the primary. Postgres's autovacuum runs on the primary, competing for CPU and I/O with the write workload. At Notion's write rate, vacuum was a constant source of contention.
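The read-after-write requirement above can be sketched concretely. Postgres exposes the primary's write position via `pg_current_wal_lsn()` and a replica's replay position via `pg_last_wal_replay_lsn()`; a router can serve a read from a replica only if the replica has replayed past the user's last write. This is an illustrative Python sketch under those assumptions, not Notion's actual implementation (`pick_read_target` is a hypothetical name):

```python
def parse_lsn(lsn: str) -> int:
    """Convert a Postgres LSN like '16/B374D848' into a comparable integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def pick_read_target(last_write_lsn: str, replica_replay_lsn: str) -> str:
    """Serve from a replica only if it has replayed past the user's last write;
    otherwise fall back to the primary to guarantee read-after-write."""
    if parse_lsn(replica_replay_lsn) >= parse_lsn(last_write_lsn):
        return "replica"
    return "primary"
```

In practice the application would capture `pg_current_wal_lsn()` after each write (for example in the user's session) and compare it against the candidate replica's replay LSN before routing the read.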
The team also evaluated Citus (a Postgres extension for distributed queries) and the middleware-proxy approach exemplified by tools like Vitess (itself MySQL-oriented, but representative of the pattern). Citus would have required significant schema changes and did not support all of Notion's query patterns at the time. A middleware proxy adds latency and a new operational dependency. The team chose to build a thin application-level routing library, keeping the architecture simple and the failure domain narrow.
The Decision
Notion's engineering team decided to horizontally shard Postgres by workspace ID, with an application-level routing library and a lookup-table-based shard assignment. Three critical sub-decisions shaped the design.
Shard key analysis
The shard key determines which data lives together on the same shard. A good shard key keeps related data colocated (minimizing cross-shard queries) while distributing load evenly.
User ID fails because users belong to multiple workspaces. A query like "load all pages in workspace X" would need to scatter across every shard where any workspace member's data lives. For a 500-person company, that could mean hundreds of shards touched per query. Collaborative features (real-time editing, comments, mentions) all become cross-shard operations.
Page ID or Block ID keeps a single page's blocks together, but workspace-level operations scatter. Loading the sidebar (all top-level pages), checking permissions (workspace-level ACLs), and searching within a workspace all become scatter-gather queries.
Workspace ID colocates everything a workspace needs on a single shard: all pages, all blocks, all members, all permissions, all comments. Within a workspace, Notion can still use full Postgres capabilities (JOINs, transactions, foreign keys). The tradeoff: cross-workspace queries (admin analytics, global search) scatter, but these represent less than 1% of query volume.
I've seen teams agonize over shard key selection for weeks, running query analysis and building prototypes. The right approach is to look at your access patterns honestly. If 99% of your queries are scoped to a single entity (workspace, tenant, organization), that entity is your shard key. It really is that straightforward.
Routing mechanism: library, not service
The routing layer is an application-level library embedded in every API server process. When the application executes a query, it calls the router library with the workspace ID. The library resolves the workspace to a shard number, picks the correct connection pool, and returns a database connection.
This avoids adding a network hop (and a new failure domain) between the application and the database. If the router were a separate service, every database query would require an extra round trip, adding latency and creating a single point of failure. As a library, routing is just a function call with a lookup table cached in memory.
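In essence, routing reduces to an in-memory dictionary lookup followed by a pool selection. A minimal sketch, with hypothetical names (Notion has not published its library's API):

```python
class ShardRouter:
    """In-process shard router backed by a lookup table cached in memory."""

    def __init__(self, assignments: dict, pools: dict):
        self.assignments = assignments   # workspace_id -> shard number
        self.pools = pools               # shard number -> connection pool

    def get_connection(self, workspace_id: str):
        # Pure function call: no network hop, no separate routing service.
        shard = self.assignments[workspace_id]
        return self.pools[shard]

# Example: workspace "ws-acme" is assigned to shard 5.
router = ShardRouter(
    assignments={"ws-acme": 5},
    pools={5: "pool-for-shard-5"},   # stand-in for a real connection pool
)
```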
Shard assignment: lookup table, not consistent hashing
Notion uses a lookup table (workspace_id → shard_number) rather than consistent hashing. This is a deliberate choice to support fine-grained workspace-level rebalancing.
With consistent hashing, moving a single hot workspace to a different shard risks displacing many other workspaces as a side effect. With a lookup table, moving workspace X from shard 5 to shard 12 is a single row update. No other workspaces are affected.
The tradeoff: the lookup table must be loaded at application startup and kept in sync across all API servers. But this is a small operational cost compared to the flexibility of per-workspace placement control, especially when workspace sizes span several orders of magnitude.
Lookup table vs. consistent hashing
Use consistent hashing when entities are roughly equal-sized and you rarely need to move individuals between shards. Use a lookup table when entity sizes vary dramatically (like workspaces ranging from 1 user to 10,000 users) and you need fine-grained placement control. Notion's workspace sizes span several orders of magnitude, making the lookup table the right choice.
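The contrast can be demonstrated numerically. The sketch below uses naive modulo hashing as a simplified stand-in for hash-based placement (a proper consistent-hash ring reduces the collateral movement, but still cannot relocate one chosen workspace in isolation), versus a lookup table where a targeted move touches exactly one entry:

```python
import hashlib

def hash_shard(workspace_id: str, shard_count: int) -> int:
    """Naive hash-based placement (md5 for determinism across runs)."""
    digest = hashlib.md5(workspace_id.encode()).hexdigest()
    return int(digest, 16) % shard_count

workspaces = [f"ws-{i}" for i in range(1000)]

# Hash-based placement: growing the fleet from 32 to 33 shards
# reassigns the vast majority of workspaces as a side effect.
reshuffled = sum(
    1 for ws in workspaces if hash_shard(ws, 32) != hash_shard(ws, 33)
)

# Lookup-table placement: moving one hot workspace is one entry update;
# every other workspace stays exactly where it was.
lookup = {ws: hash_shard(ws, 32) for ws in workspaces}
before = dict(lookup)
lookup["ws-7"] = (lookup["ws-7"] + 1) % 32
moved = [ws for ws in workspaces if lookup[ws] != before[ws]]
```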
The Migration Path
The migration had to be live. Notion could not take downtime for a database migration. Users were collaboratively editing documents in real time, and any interruption would mean lost keystrokes, conflicting edits, and broken experiences. The migration ran over weeks to months, workspace by workspace.
Phase 1: Dual writes
Every write from the API servers went to both the original single Postgres and the new sharded clusters simultaneously. The original database remained the source of truth for all reads. If a write to a shard failed, it was logged and retried asynchronously but did not affect the user.
The dual-write layer lived in the routing library. When a write came in, the library resolved the workspace ID to a shard, wrote to that shard, and also wrote to the original database. This phase validated that the sharded clusters could handle Notion's full write throughput without errors or drift.
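The dual-write flow described above can be sketched as follows. The legacy database remains the source of truth, and a failed shard write is queued for asynchronous retry rather than surfaced to the user. All names here are illustrative assumptions, not Notion's actual code:

```python
import logging

class DualWriter:
    """Phase 1 dual-write sketch: legacy DB is authoritative, shard writes
    are best-effort with async retry on failure."""

    def __init__(self, legacy_db, router, retry_queue):
        self.legacy_db = legacy_db
        self.router = router           # resolves workspace_id -> shard conn
        self.retry_queue = retry_queue

    def write(self, workspace_id, statement, params):
        # 1. Write to the legacy primary -- still the source of truth.
        self.legacy_db.execute(statement, params)
        # 2. Best-effort write to the new shard; a failure is logged and
        #    queued for retry, never propagated to the user.
        try:
            shard = self.router.get_connection(workspace_id)
            shard.execute(statement, params)
        except Exception:
            logging.exception("shard write failed; queuing async retry")
            self.retry_queue.append((workspace_id, statement, params))
```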
Phase 2: Backfill and validate
New writes flowed to both systems, but all historical data still lived only in the original Postgres. The backfill worker iterated workspace by workspace, reading all data for each workspace and writing it to the correct shard.
The backfill required careful engineering:
- Throttled to avoid overwhelming either the source (increased read load on an already-saturated primary) or the destinations (write amplification on top of live dual-writes).
- Idempotent so it could be restarted from any checkpoint if it crashed or was paused for load reasons.
- Validated with checksums. For each workspace, the system compared row counts and content hashes between the original and the shard to catch data drift.
This was the longest phase. Hundreds of terabytes of data, workspace by workspace, with continuous validation. The team prioritized active workspaces (those with recent edits) first, since those were most likely to be queried.
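The backfill properties listed above (throttled, idempotent, checksum-validated) can be sketched in miniature. This is an illustrative Python sketch under assumed helper names, not Notion's tooling; the key ideas are upsert-based copying (safe to re-run after a crash) and an order-independent content hash for validation:

```python
import hashlib

def content_checksum(rows):
    """Order-independent hash of a workspace's rows, for drift detection."""
    h = hashlib.sha256()
    for row in sorted(map(repr, rows)):
        h.update(row.encode())
    return h.hexdigest()

def backfill_workspace(source_rows, shard):
    """Copy one workspace's rows to its shard. Upserts make this
    idempotent: restarting from any checkpoint is safe."""
    for row in source_rows:
        shard.upsert(row)

def validate_workspace(source_rows, shard_rows):
    """Compare row counts and content hashes between source and shard."""
    return (len(source_rows) == len(shard_rows)
            and content_checksum(source_rows) == content_checksum(shard_rows))
```

Throttling (omitted here) would pace the copy loop against source and destination load, as described above.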
Phase 3: Workspace-by-workspace cutover
Once a workspace's data was fully backfilled and checksums matched, the team flipped a per-workspace feature flag to route that workspace's reads to the new shard. Writes continued to both systems during a soak period.
If any errors appeared for a cutover workspace, the flag could be flipped back instantly, routing reads to the original database. This made each cutover individually reversible. The team started with low-traffic internal workspaces, expanded to small external ones, and gradually moved to larger and more active workspaces.
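The per-workspace flag mechanism can be sketched as a read router with a reversible default. Names are illustrative, not Notion's actual flag system:

```python
class ReadRouter:
    """Phase 3 cutover sketch: reads route to the new shard only when the
    workspace's flag is set; flipping it back reverts reads instantly."""

    def __init__(self, legacy_db, router, cutover_flags):
        self.legacy_db = legacy_db
        self.router = router
        self.cutover_flags = cutover_flags   # workspace_id -> bool

    def read_connection(self, workspace_id):
        if self.cutover_flags.get(workspace_id, False):
            return self.router.get_connection(workspace_id)
        return self.legacy_db   # default path, and the instant-rollback path
```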
Phase 4: Decommission
After all workspaces were cut over and a soak period passed without issues, writes to the original database were stopped and it was decommissioned. The ~32 sharded Postgres clusters became the sole source of truth.
The longest phase is always backfill
In every large-scale database migration I've been involved with, the backfill phase takes longer than anyone estimates. The bottleneck is not copy speed. It is the validation, the edge cases (orphaned rows, encoding mismatches, timezone drift), and the need to throttle to avoid impacting live traffic. Budget twice your initial estimate.
The System After
Post-migration, Notion's architecture splits the monolithic Postgres into ~32 independent Postgres clusters, each responsible for a subset of workspaces. The routing library in every API server resolves workspace IDs to shards transparently.
The architecture looks more complex on paper, but within any single workspace, the experience is identical to the pre-sharding system. The workspace's shard is a full Postgres instance with JOINs, transactions, and foreign keys working exactly as before. The complexity is in the operational layer, not the query layer.
The Results
| Metric | Before (single Postgres) | After (~32 shards) |
|---|---|---|
| Primary CPU at peak | >90% saturated | ~30-40% per shard (headroom for growth) |
| Vacuum blocking | Frequent stalls during peak hours | Per-shard vacuum, no cross-workspace impact |
| Connection exhaustion | PgBouncer saturated, workers queuing | Per-shard connection pools, no contention |
| Blast radius of failure | 100% of users affected | Only workspaces on the failed shard affected |
| Write throughput ceiling | Single machine I/O limit | Scales linearly with shard count |
| Cross-workspace queries | Simple SQL JOIN | Scatter-gather or routed to BigQuery |
| Schema migration time | Minutes (one database) | Longer (32 sequential or parallel DDL runs) |
| Operational complexity | One primary to monitor | 32 independent clusters to monitor |
The headline improvement: Notion went from a single database that could not keep up with traffic to a system with meaningful headroom on every shard. The CPU saturation problem disappeared because the write load distributed across 32 machines instead of concentrating on one.
But the wins came with new operational costs. The next section covers what changed in day-to-day operations.
Post-migration operational challenges
Cross-shard queries. Any query spanning multiple workspaces (internal dashboards, admin search, analytics) now requires scatter-gather across all 32 shards. Notion moved cross-workspace analytics off the transactional path entirely, streaming events to BigQuery for offline analysis. This is a common pattern: the transactional database handles the product, and a separate analytics pipeline handles reporting.
Schema migrations across 32 shards. An ALTER TABLE that took 2 minutes on one database now needs to run on 32 databases. Notion built tooling for orchestrated multi-shard DDL: run the migration on shard 1, validate, run on shard 2, validate, and so on. Parallel execution is possible but risky (a bad migration on all shards simultaneously is catastrophic). Sequential execution is safer but slower.
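The sequential rollout described above (run, validate, proceed, halt on failure) can be sketched simply. These are hypothetical helper names, not Notion's orchestration tooling; the point is that a bad migration stops after at most one shard:

```python
def run_migration(shards, ddl, validate):
    """Apply DDL shard by shard, validating after each; halt on the first
    failure so a bad migration touches at most one shard."""
    applied = []
    for shard in shards:
        shard.execute(ddl)
        if not validate(shard):
            raise RuntimeError(f"validation failed on {shard!r}; halting rollout")
        applied.append(shard)
    return applied
```

Parallel execution would finish faster but, as noted above, turns a bad migration into a fleet-wide incident instead of a single-shard one.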
Hot shards from large enterprises. Workspace sizes follow a power-law distribution. A large enterprise customer with 10,000 users and millions of blocks generates far more load than a personal workspace. Monitoring shard CPU, I/O, and connection usage is an ongoing task. When a shard gets hot, the team moves the offending workspace to a less-loaded shard by updating the lookup table and running a targeted data migration.
Independent failover events. Each shard is a full Postgres replica set with a primary and replicas. Instead of one failover event that affects everyone, Notion now has 32 potential independent failover events. Each one only affects the workspaces on that shard (much smaller blast radius), but the operational surface area is 32x larger. Monitoring, alerting, and runbooks had to be scaled accordingly.
I've found that teams often underestimate this operational expansion. Sharding is not just a data distribution problem. It is an operational multiplier. Every database operation (backup, restore, upgrade, failover, monitoring) now happens N times instead of once.
What They'd Do Differently
Based on public talks and blog posts from Notion's engineering team, several hindsight lessons emerged:
Build shard-aware infrastructure earlier. If every query had included workspace_id from day one, the migration would have been significantly simpler. Retrofitting workspace-scoped queries into code paths that assumed a single database was a large part of the engineering effort. Requiring a tenant/workspace scope on every database call from the start is cheap insurance.
Invest in migration tooling from the start. The backfill, validation, and cutover tooling was built during the migration. Building reusable data migration infrastructure earlier would have reduced the timeline. This tooling (workspace-by-workspace copy, checksum validation, per-workspace feature flags) is useful for any future infrastructure change, not just this sharding project.
Plan for shard rebalancing as a continuous operation. The initial shard assignment was based on workspace size estimates at migration time. Workspaces grow unpredictably. A startup workspace that was tiny during migration might become a 5,000-person enterprise a year later. Rebalancing tooling needs to be always-ready, not a one-time migration concern.
Architecture Decision Guide
Deciding whether sharding is the right approach comes down to three questions drawn from this case study: Have you exhausted vertical scaling? Is the workload write-heavy, so read replicas cannot absorb it? Are the vast majority of queries scoped to a single tenant? If all three answers are yes, shard by that tenant entity; if not, simpler options (a larger instance, read replicas) still have room.
Transferable Lessons
1. Shard key selection is the highest-leverage decision in any sharding project.
Notion's choice of workspace ID meant 99%+ of queries stayed single-shard with full Postgres capabilities preserved. A different choice (user ID or block ID) would have turned most queries into expensive scatter-gather operations. Before sharding, map your access patterns and find the entity that scopes the vast majority of queries. That entity is your shard key.
2. Lookup tables beat consistent hashing when entity sizes vary dramatically.
Consistent hashing is elegant but assumes roughly uniform entity sizes. When your tenants range from 1 user to 10,000 users, you need fine-grained control over placement. A lookup table lets you move individual tenants between shards with a single row update. The operational overhead of maintaining the table is small compared to the flexibility it provides.
3. Sharding is an operational multiplier, not just a data distribution technique.
Every database operation (backup, restore, schema migration, failover, monitoring) now happens N times. Teams that plan only for the data migration and ignore the ongoing operational expansion are surprised by the sustained engineering cost. Budget for tooling, monitoring, and runbooks proportional to your shard count.
4. Live migrations require workspace-by-workspace (or tenant-by-tenant) granularity.
Notion's ability to cutover individual workspaces independently made the migration safe and reversible. A big-bang cutover (switch everything at once) would have been a single point of failure for the entire migration. Design your migration to be incremental at the tenant level, with per-tenant rollback capability.
5. Build shard-aware infrastructure before you need to shard.
The cheapest time to add a tenant scope to every database call is at the beginning of the project, when it costs a few extra characters per query. The most expensive time is during a sharding migration, when engineers must audit and retrofit thousands of query paths. Even if you never shard, tenant-scoped queries improve security (preventing cross-tenant data leaks) and observability (per-tenant query metrics).
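That "cheap insurance" can be enforced mechanically at the data-access layer. A minimal sketch, assuming a thin wrapper around the database handle (names are illustrative):

```python
class ScopedDB:
    """Wraps a database handle and refuses any query that is not
    explicitly scoped to a workspace -- shard-aware from day one."""

    def __init__(self, db):
        self.db = db

    def query(self, sql, params=None, *, workspace_id=None):
        if workspace_id is None:
            raise ValueError("every query must carry a workspace_id")
        if "workspace_id" not in sql:
            raise ValueError("query must filter on workspace_id")
        merged = {**(params or {}), "workspace_id": workspace_id}
        return self.db.execute(sql, merged)
```

A wrapper like this makes un-scoped queries fail in development rather than surfacing years later as a sharding-migration audit, and it doubles as a guard against cross-tenant data leaks.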
How This Shows Up in Interviews
When designing any multi-tenant SaaS system (collaborative docs, project management tools, CRM), mentioning tenant ID as a natural shard key is a high-signal move. It shows you understand data colocation, access pattern analysis, and the operational realities of sharding.
The sentence to use: "I'd shard by workspace ID so that all data for a tenant colocates on one shard, keeping 99% of queries single-shard while accepting scatter-gather for the rare cross-tenant analytics query."
| Interviewer Asks | Strong Answer Citing This Case Study |
|---|---|
| "How would you handle a database bottleneck in a collaborative editing tool?" | "Notion hit this exact problem. Vertical scaling has a ceiling. I'd shard by workspace ID so all tenant data colocates, keeping within-workspace queries as simple JOINs. Cross-workspace analytics moves to a separate pipeline like BigQuery." |
| "How do you choose a shard key?" | "Map your access patterns. If 99% of queries are scoped to one entity (workspace, org, tenant), that entity is the shard key. Notion evaluated user_id, page_id, and workspace_id. Only workspace_id kept the dominant query pattern single-shard." |
| "How would you migrate to a sharded database with zero downtime?" | "Dual-write to old and new, backfill historical data workspace by workspace with checksum validation, cutover reads per workspace with a feature flag, and decommission the old database after a soak period. Every phase is independently reversible." |
| "What are the downsides of sharding?" | "Operational multiplier: schema migrations run N times, failover happens independently per shard, cross-shard queries require scatter-gather or a separate analytics system. Notion went from monitoring one database to monitoring 32." |
| "Lookup table or consistent hashing for shard assignment?" | "Depends on entity size variance. If tenants range from 1 to 10,000 users, use a lookup table for fine-grained rebalancing. If entities are roughly uniform, consistent hashing is simpler. Notion chose lookup tables because workspace sizes vary by 1000x." |
Quick Recap
- Notion's single Postgres instance could not handle 4x user growth in 2020, with CPU saturation above 90%, vacuum stalls, and connection exhaustion even with PgBouncer.
- Vertical scaling and read replicas were insufficient because the workload was write-heavy and Postgres has a hard ceiling on single-machine resources.
- Workspace ID was chosen as the shard key because it colocates all tenant data (pages, blocks, permissions, users) on a single shard, keeping 99%+ of queries single-shard.
- The routing layer is an in-process library using a lookup table (not consistent hashing) for fine-grained workspace-level rebalancing.
- Live migration used four phases: dual-write, workspace-by-workspace backfill with checksum validation, per-workspace read cutover via feature flags, and decommission.
- Post-migration, ~32 Postgres clusters operate independently, with new operational challenges: cross-shard queries routed to BigQuery, sequential schema migrations, hot-shard monitoring, and 32 independent failover domains.
- The transferable principle: choose your shard key by mapping access patterns, and treat sharding as a permanent operational multiplier, not a one-time migration project.
Related Concepts
- Sharding - The foundational patterns for horizontal data partitioning, including hash-based vs. range-based vs. lookup-table-based shard assignment.
- Databases - Postgres internals (vacuum, MVCC, connection model) that drove Notion's scaling ceiling.
- Replication - Why read replicas could not solve Notion's write-heavy bottleneck, and how replication works within each shard post-migration.