Notion: Postgres sharding at scale
How Notion migrated from a single Postgres database to horizontally sharded clusters to handle explosive growth, and the infrastructure changes required to make it work transparently.
TL;DR
- Notion grew roughly 4x during 2020 (COVID-driven remote work adoption), and their single Postgres database hit a hard ceiling: CPU saturation above 90%, vacuum stalls blocking queries, and connection exhaustion even with PgBouncer.
- They chose workspace ID as the shard key because it colocates all pages, blocks, permissions, and users for a workspace on a single shard, keeping 99%+ of queries single-shard.
- The routing layer is an application-level library (not a separate service) backed by a lookup table (not a hash ring), which enables workspace-level rebalancing without full resharding.
- Live migration used dual-write, workspace-by-workspace backfill, checksum validation, and incremental cutover with zero downtime over a weeks-to-months timeline.
- Post-migration, Notion operates ~32 Postgres clusters with independent failover, per-shard schema migrations, and hot-shard monitoring for large enterprise workspaces.
The Trigger
By mid-2020, Notion's infrastructure team was fighting fires daily. The single Postgres primary that backed the entire product was CPU-saturated above 90% during peak hours. Vacuum operations (Postgres's mechanism for reclaiming dead rows) stalled during high-traffic windows, bloating tables and degrading query performance for everyone.
PgBouncer, their connection pooler, was maxed out at thousands of concurrent connections. Workers still queued behind it waiting for an available database connection.
The growth was extreme. Notion went from a tool popular with startups and individuals to a collaboration platform used by enterprises, schools, and remote teams worldwide. The COVID-driven shift to remote work in 2020 drove an estimated 4x increase in users over that single year.
Vertical scaling had already been pushed to its limits. They were running on the largest available Postgres-compatible instance class. Adding read replicas helped for analytics queries, but Notion's core workload (collaborative editing of blocks and pages) is write-heavy. Replicas cannot absorb write load.
I've been in this exact situation: staring at a Postgres CPU graph that flatlines at 95% during business hours and knowing that the next traffic spike will tip it into visible user impact. The question stops being "should we shard?" and becomes "how fast can we shard?"
The System Before
Notion's pre-sharding architecture was a single-primary Postgres setup with PgBouncer as the connection multiplexer. This is a common and reasonable architecture for early-to-mid-stage SaaS products, and it served Notion well until the scale outgrew it.
Notion's data model centers on blocks. Everything in Notion is a block: a paragraph, a heading, a database row, an embedded image. Pages are blocks that contain other blocks. Workspaces are the top-level container for all of a team's pages and blocks.
A simplified view of the core schema:
-- Simplified Notion block schema
CREATE TABLE blocks (
id UUID PRIMARY KEY,
workspace_id UUID NOT NULL,
parent_id UUID, -- parent block or page
type TEXT NOT NULL, -- paragraph, heading, database_row, etc.
properties JSONB, -- block content and metadata
created_at TIMESTAMPTZ,
updated_at TIMESTAMPTZ
);
CREATE INDEX idx_blocks_workspace ON blocks(workspace_id);
CREATE INDEX idx_blocks_parent ON blocks(parent_id);
The critical access patterns are: load all blocks for a page (by parent_id), load all pages in a workspace (by workspace_id), and update individual blocks during collaborative editing (by id). All three are heavily tied to the workspace concept.
Why PgBouncer alone was not enough
PgBouncer solves connection multiplexing, not CPU saturation. It reduces the number of physical Postgres connections by pooling and reusing them across application workers. This helps when the bottleneck is connection count (Postgres's per-connection memory overhead limits it to hundreds or low thousands of connections).
When the bottleneck is CPU on the primary, multiplexing connections does not help. You are sending the same volume of queries to the same single CPU.
Notion had already tuned PgBouncer extensively: transaction-level pooling, aggressive connection recycling, queue depth limits. The connection exhaustion problem improved, but the CPU saturation remained. PgBouncer cannot split a workload across multiple Postgres instances. That requires sharding.
Why vertical scaling hit a ceiling
Postgres runs on a single machine. The largest cloud instances available at the time (roughly 96-128 vCPUs and 768GB RAM on AWS) had finite CPU and I/O bandwidth, and Notion was already on or near that ceiling.
Even if bigger instances existed, vertical scaling has a second problem: single point of failure with a massive blast radius. One Postgres primary failing means the entire product is down for every user. Recovery (promote a replica, catch up replication lag) takes minutes, and those minutes affect everyone simultaneously.
Why Not Just Add More Read Replicas?
The first question any engineer asks: "Can we throw more replicas at this?" Notion's workload makes that insufficient for three reasons.
Collaborative editing is write-heavy. When a user types in a Notion page, every keystroke (or debounced batch) generates a write to update the block's content. A single active user produces dozens of writes per minute. A meeting with 10 people simultaneously editing produces hundreds. Replicas do not absorb writes.
Read-after-write consistency is required. When user A edits a block and user B views the same page 200ms later, user B must see user A's edit. This requires reading from the primary (or from a replica that has caught up past that write's LSN). In practice, Notion's core product reads mostly hit the primary to guarantee consistency.
Vacuum pressure scales with writes, not reads. More replicas do not reduce the vacuum load on the primary. Postgres's autovacuum runs on the primary, competing for CPU and I/O with the write workload. At Notion's write rate, vacuum was a constant source of contention.
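The read-after-write requirement above can be sketched concretely. Postgres exposes the primary's write position via `pg_current_wal_lsn()` and a replica's replay position via `pg_last_wal_replay_lsn()`; a router can serve a read from a replica only if the replica has replayed past the user's last write. This is an illustrative Python sketch under those assumptions, not Notion's actual implementation (`pick_read_target` is a hypothetical name):

```python
def parse_lsn(lsn: str) -> int:
    """Convert a Postgres LSN like '16/B374D848' into a comparable integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def pick_read_target(last_write_lsn: str, replica_replay_lsn: str) -> str:
    """Serve from a replica only if it has replayed past the user's last write;
    otherwise fall back to the primary to guarantee read-after-write."""
    if parse_lsn(replica_replay_lsn) >= parse_lsn(last_write_lsn):
        return "replica"
    return "primary"
```

In practice the application would capture `pg_current_wal_lsn()` after each write (for example in the user's session) and compare it against the candidate replica's replay LSN before routing the read.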
The team also evaluated Citus (a Postgres extension for distributed queries) and the middleware-proxy approach exemplified by tools like Vitess (itself MySQL-oriented, but representative of the pattern). Citus would have required significant schema changes and did not support all of Notion's query patterns at the time. A middleware proxy adds latency and a new operational dependency. The team chose to build a thin application-level routing library, keeping the architecture simple and the failure domain narrow.
The Decision
Notion's engineering team decided to horizontally shard Postgres by workspace ID, with an application-level routing library and a lookup-table-based shard assignment. Three critical sub-decisions shaped the design.
Shard key analysis
The shard key determines which data lives together on the same shard. A good shard key keeps related data colocated (minimizing cross-shard queries) while distributing load evenly.
User ID fails because users belong to multiple workspaces. A query like "load all pages in workspace X" would need to scatter across every shard where any workspace member's data lives. For a 500-person company, that could mean hundreds of shards touched per query. Collaborative features (real-time editing, comments, mentions) all become cross-shard operations.
Page ID or Block ID keeps a single page's blocks together, but workspace-level operations scatter. Loading the sidebar (all top-level pages), checking permissions (workspace-level ACLs), and searching within a workspace all become scatter-gather queries.
Workspace ID colocates everything a workspace needs on a single shard: all pages, all blocks, all members, all permissions, all comments. Within a workspace, Notion can still use full Postgres capabilities (JOINs, transactions, foreign keys). The tradeoff: cross-workspace queries (admin analytics, global search) scatter, but these represent less than 1% of query volume.
I've seen teams agonize over shard key selection for weeks, running query analysis and building prototypes. The right approach is to look at your access patterns honestly. If 99% of your queries are scoped to a single entity (workspace, tenant, organization), that entity is your shard key. It really is that straightforward.
Routing mechanism: library, not service
The routing layer is an application-level library embedded in every API server process. When the application executes a query, it calls the router library with the workspace ID. The library resolves the workspace to a shard number, picks the correct connection pool, and returns a database connection.
This avoids adding a network hop (and a new failure domain) between the application and the database. If the router were a separate service, every database query would require an extra round trip, adding latency and creating a single point of failure. As a library, routing is just a function call with a lookup table cached in memory.
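In essence, routing reduces to an in-memory dictionary lookup followed by a pool selection. A minimal sketch, with hypothetical names (Notion has not published its library's API):

```python
class ShardRouter:
    """In-process shard router backed by a lookup table cached in memory."""

    def __init__(self, assignments: dict, pools: dict):
        self.assignments = assignments   # workspace_id -> shard number
        self.pools = pools               # shard number -> connection pool

    def get_connection(self, workspace_id: str):
        # Pure function call: no network hop, no separate routing service.
        shard = self.assignments[workspace_id]
        return self.pools[shard]

# Example: workspace "ws-acme" is assigned to shard 5.
router = ShardRouter(
    assignments={"ws-acme": 5},
    pools={5: "pool-for-shard-5"},   # stand-in for a real connection pool
)
```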
Shard assignment: lookup table, not consistent hashing
Notion uses a lookup table (workspace_id → shard_number) rather than consistent hashing. This is a deliberate choice to support fine-grained workspace-level rebalancing.
With consistent hashing, moving a single hot workspace to a different shard risks displacing many other workspaces as a side effect. With a lookup table, moving workspace X from shard 5 to shard 12 is a single row update. No other workspaces are affected.
The tradeoff: the lookup table must be loaded at application startup and kept in sync across all API servers. But this is a small operational cost compared to the flexibility of per-workspace placement control, especially when workspace sizes span several orders of magnitude.
Lookup table vs. consistent hashing
Use consistent hashing when entities are roughly equal-sized and you rarely need to move individuals between shards. Use a lookup table when entity sizes vary dramatically (like workspaces ranging from 1 user to 10,000 users) and you need fine-grained placement control. Notion's workspace sizes span several orders of magnitude, making the lookup table the right choice.
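The contrast can be demonstrated numerically. The sketch below uses naive modulo hashing as a simplified stand-in for hash-based placement (a proper consistent-hash ring reduces the collateral movement, but still cannot relocate one chosen workspace in isolation), versus a lookup table where a targeted move touches exactly one entry:

```python
import hashlib

def hash_shard(workspace_id: str, shard_count: int) -> int:
    """Naive hash-based placement (md5 for determinism across runs)."""
    digest = hashlib.md5(workspace_id.encode()).hexdigest()
    return int(digest, 16) % shard_count

workspaces = [f"ws-{i}" for i in range(1000)]

# Hash-based placement: growing the fleet from 32 to 33 shards
# reassigns the vast majority of workspaces as a side effect.
reshuffled = sum(
    1 for ws in workspaces if hash_shard(ws, 32) != hash_shard(ws, 33)
)

# Lookup-table placement: moving one hot workspace is one entry update;
# every other workspace stays exactly where it was.
lookup = {ws: hash_shard(ws, 32) for ws in workspaces}
before = dict(lookup)
lookup["ws-7"] = (lookup["ws-7"] + 1) % 32
moved = [ws for ws in workspaces if lookup[ws] != before[ws]]
```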
The Migration Path
The migration had to be live. Notion could not take downtime for a database migration. Users were collaboratively editing documents in real time, and any interruption would mean lost keystrokes, conflicting edits, and broken experiences. The migration ran over weeks to months, workspace by workspace.
Phase 1: Dual writes
Every write from the API servers went to both the original single Postgres and the new sharded clusters simultaneously. The original database remained the source of truth for all reads. If a write to a shard failed, it was logged and retried asynchronously but did not affect the user.
The dual-write layer lived in the routing library. When a write came in, the library resolved the workspace ID to a shard, wrote to that shard, and also wrote to the original database. This phase validated that the sharded clusters could handle Notion's full write throughput without errors or drift.
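The dual-write flow described above can be sketched as follows. The legacy database remains the source of truth, and a failed shard write is queued for asynchronous retry rather than surfaced to the user. All names here are illustrative assumptions, not Notion's actual code:

```python
import logging

class DualWriter:
    """Phase 1 dual-write sketch: legacy DB is authoritative, shard writes
    are best-effort with async retry on failure."""

    def __init__(self, legacy_db, router, retry_queue):
        self.legacy_db = legacy_db
        self.router = router           # resolves workspace_id -> shard conn
        self.retry_queue = retry_queue

    def write(self, workspace_id, statement, params):
        # 1. Write to the legacy primary -- still the source of truth.
        self.legacy_db.execute(statement, params)
        # 2. Best-effort write to the new shard; a failure is logged and
        #    queued for retry, never propagated to the user.
        try:
            shard = self.router.get_connection(workspace_id)
            shard.execute(statement, params)
        except Exception:
            logging.exception("shard write failed; queuing async retry")
            self.retry_queue.append((workspace_id, statement, params))
```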
Phase 2: Backfill and validate
New writes flowed to both systems, but all historical data still lived only in the original Postgres. The backfill worker iterated workspace by workspace, reading all data for each workspace and writing it to the correct shard.
The backfill required careful engineering:
- Throttled to avoid overwhelming either the source (increased read load on an already-saturated primary) or the destinations (write amplification on top of live dual-writes).
- Idempotent so it could be restarted from any checkpoint if it crashed or was paused for load reasons.
- Validated with checksums. For each workspace, the system compared row counts and content hashes between the original and the shard to catch data drift.
This was the longest phase. Hundreds of terabytes of data, workspace by workspace, with continuous validation. The team prioritized active workspaces (those with recent edits) first, since those were most likely to be queried.
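The backfill properties listed above (throttled, idempotent, checksum-validated) can be sketched in miniature. This is an illustrative Python sketch under assumed helper names, not Notion's tooling; the key ideas are upsert-based copying (safe to re-run after a crash) and an order-independent content hash for validation:

```python
import hashlib

def content_checksum(rows):
    """Order-independent hash of a workspace's rows, for drift detection."""
    h = hashlib.sha256()
    for row in sorted(map(repr, rows)):
        h.update(row.encode())
    return h.hexdigest()

def backfill_workspace(source_rows, shard):
    """Copy one workspace's rows to its shard. Upserts make this
    idempotent: restarting from any checkpoint is safe."""
    for row in source_rows:
        shard.upsert(row)

def validate_workspace(source_rows, shard_rows):
    """Compare row counts and content hashes between source and shard."""
    return (len(source_rows) == len(shard_rows)
            and content_checksum(source_rows) == content_checksum(shard_rows))
```

Throttling (omitted here) would pace the copy loop against source and destination load, as described above.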
Phase 3: Workspace-by-workspace cutover
Once a workspace's data was fully backfilled and checksums matched, the team flipped a per-workspace feature flag to route that workspace's reads to the new shard. Writes continued to both systems during a soak period.
If any errors appeared for a cutover workspace, the flag could be flipped back instantly, routing reads to the original database. This made each cutover individually reversible. The team started with low-traffic internal workspaces, expanded to small external ones, and gradually moved to larger and more active workspaces.
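The per-workspace flag mechanism can be sketched as a read router with a reversible default. Names are illustrative, not Notion's actual flag system:

```python
class ReadRouter:
    """Phase 3 cutover sketch: reads route to the new shard only when the
    workspace's flag is set; flipping it back reverts reads instantly."""

    def __init__(self, legacy_db, router, cutover_flags):
        self.legacy_db = legacy_db
        self.router = router
        self.cutover_flags = cutover_flags   # workspace_id -> bool

    def read_connection(self, workspace_id):
        if self.cutover_flags.get(workspace_id, False):
            return self.router.get_connection(workspace_id)
        return self.legacy_db   # default path, and the instant-rollback path
```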
Phase 4: Decommission
After all workspaces were cut over and a soak period passed without issues, writes to the original database were stopped and it was decommissioned. The ~32 sharded Postgres clusters became the sole source of truth.
The longest phase is always backfill
In every large-scale database migration I've been involved with, the backfill phase takes longer than anyone estimates. The bottleneck is not copy speed. It is the validation, the edge cases (orphaned rows, encoding mismatches, timezone drift), and the need to throttle to avoid impacting live traffic. Budget twice your initial estimate.
The System After
Post-migration, Notion's architecture splits the monolithic Postgres into ~32 independent Postgres clusters, each responsible for a subset of workspaces. The routing library in every API server resolves workspace IDs to shards transparently.
The architecture looks more complex on paper, but within any single workspace, the experience is identical to the pre-sharding system. The workspace's shard is a full Postgres instance with JOINs, transactions, and foreign keys working exactly as before. The complexity is in the operational layer, not the query layer.
The Results
| Metric | Before (single Postgres) | After (~32 shards) |
|---|---|---|
| Primary CPU at peak | >90% saturated | ~30-40% per shard (headroom for growth) |
| Vacuum blocking | Frequent stalls during peak hours | Per-shard vacuum, no cross-workspace impact |
| Connection exhaustion | PgBouncer saturated, workers queuing | Per-shard connection pools, no contention |
| Blast radius of failure | 100% of users affected | Only workspaces on the failed shard affected |
| Write throughput ceiling | Single machine I/O limit | Scales linearly with shard count |
| Cross-workspace queries | Simple SQL JOIN | Scatter-gather or routed to BigQuery |
| Schema migration time | Minutes (one database) | Longer (32 sequential or parallel DDL runs) |
| Operational complexity | One primary to monitor | 32 independent clusters to monitor |
The headline improvement: Notion went from a single database that could not keep up with traffic to a system with meaningful headroom on every shard. The CPU saturation problem disappeared because the write load distributed across 32 machines instead of concentrating on one.
But the wins came with new operational costs. The next section covers what changed in day-to-day operations.
Post-migration operational challenges
Cross-shard queries. Any query spanning multiple workspaces (internal dashboards, admin search, analytics) now requires scatter-gather across all 32 shards. Notion moved cross-workspace analytics off the transactional path entirely, streaming events to BigQuery for offline analysis. This is a common pattern: the transactional database handles the product, and a separate analytics pipeline handles reporting.
Schema migrations across 32 shards. An ALTER TABLE that took 2 minutes on one database now needs to run on 32 databases. Notion built tooling for orchestrated multi-shard DDL: run the migration on shard 1, validate, run on shard 2, validate, and so on. Parallel execution is possible but risky (a bad migration on all shards simultaneously is catastrophic). Sequential execution is safer but slower.
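The sequential rollout described above (run, validate, proceed, halt on failure) can be sketched simply. These are hypothetical helper names, not Notion's orchestration tooling; the point is that a bad migration stops after at most one shard:

```python
def run_migration(shards, ddl, validate):
    """Apply DDL shard by shard, validating after each; halt on the first
    failure so a bad migration touches at most one shard."""
    applied = []
    for shard in shards:
        shard.execute(ddl)
        if not validate(shard):
            raise RuntimeError(f"validation failed on {shard!r}; halting rollout")
        applied.append(shard)
    return applied
```

Parallel execution would finish faster but, as noted above, turns a bad migration into a fleet-wide incident instead of a single-shard one.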
Hot shards from large enterprises. Workspace sizes follow a power-law distribution. A large enterprise customer with 10,000 users and millions of blocks generates far more load than a personal workspace. Monitoring shard CPU, I/O, and connection usage is an ongoing task. When a shard gets hot, the team moves the offending workspace to a less-loaded shard by updating the lookup table and running a targeted data migration.
Independent failover events. Each shard is a full Postgres replica set with a primary and replicas. Instead of one failover event that affects everyone, Notion now has 32 potential independent failover events. Each one only affects the workspaces on that shard (much smaller blast radius), but the operational surface area is 32x larger. Monitoring, alerting, and runbooks had to be scaled accordingly.
I've found that teams often underestimate this operational expansion. Sharding is not just a data distribution problem. It is an operational multiplier. Every database operation (backup, restore, upgrade, failover, monitoring) now happens N times instead of once.
What They'd Do Differently
Based on public talks and blog posts from Notion's engineering team, several hindsight lessons emerged:
Build shard-aware infrastructure earlier. If every query had included workspace_id from day one, the migration would have been significantly simpler. Retrofitting workspace-scoped queries into code paths that assumed a single database was a large part of the engineering effort. Requiring a tenant/workspace scope on every database call from the start is cheap insurance.
Invest in migration tooling from the start. The backfill, validation, and cutover tooling was built during the migration. Building reusable data migration infrastructure earlier would have reduced the timeline. This tooling (workspace-by-workspace copy, checksum validation, per-workspace feature flags) is useful for any future infrastructure change, not just this sharding project.
Plan for shard rebalancing as a continuous operation. The initial shard assignment was based on workspace size estimates at migration time. Workspaces grow unpredictably. A startup workspace that was tiny during migration might become a 5,000-person enterprise a year later. Rebalancing tooling needs to be always-ready, not a one-time migration concern.
Architecture Decision Guide
Deciding whether sharding is the right approach comes down to three questions drawn from this case study: Have you exhausted vertical scaling? Is the workload write-heavy, so read replicas cannot absorb it? Are the vast majority of queries scoped to a single tenant? If all three answers are yes, shard by that tenant entity; if not, simpler options (a larger instance, read replicas) still have room.
Transferable Lessons
1. Shard key selection is the highest-leverage decision in any sharding project.
Notion's choice of workspace ID meant 99%+ of queries stayed single-shard with full Postgres capabilities preserved. A different choice (user ID or block ID) would have turned most queries into expensive scatter-gather operations. Before sharding, map your access patterns and find the entity that scopes the vast majority of queries. That entity is your shard key.
2. Lookup tables beat consistent hashing when entity sizes vary dramatically.
Consistent hashing is elegant but assumes roughly uniform entity sizes. When your tenants range from 1 user to 10,000 users, you need fine-grained control over placement. A lookup table lets you move individual tenants between shards with a single row update. The operational overhead of maintaining the table is small compared to the flexibility it provides.
3. Sharding is an operational multiplier, not just a data distribution technique.
Every database operation (backup, restore, schema migration, failover, monitoring) now happens N times. Teams that plan only for the data migration and ignore the ongoing operational expansion are surprised by the sustained engineering cost. Budget for tooling, monitoring, and runbooks proportional to your shard count.
4. Live migrations require workspace-by-workspace (or tenant-by-tenant) granularity.
Notion's ability to cutover individual workspaces independently made the migration safe and reversible. A big-bang cutover (switch everything at once) would have been a single point of failure for the entire migration. Design your migration to be incremental at the tenant level, with per-tenant rollback capability.
5. Build shard-aware infrastructure before you need to shard.
The cheapest time to add a tenant scope to every database call is at the beginning of the project, when it costs a few extra characters per query. The most expensive time is during a sharding migration, when engineers must audit and retrofit thousands of query paths. Even if you never shard, tenant-scoped queries improve security (preventing cross-tenant data leaks) and observability (per-tenant query metrics).
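That "cheap insurance" can be enforced mechanically at the data-access layer. A minimal sketch, assuming a thin wrapper around the database handle (names are illustrative):

```python
class ScopedDB:
    """Wraps a database handle and refuses any query that is not
    explicitly scoped to a workspace -- shard-aware from day one."""

    def __init__(self, db):
        self.db = db

    def query(self, sql, params=None, *, workspace_id=None):
        if workspace_id is None:
            raise ValueError("every query must carry a workspace_id")
        if "workspace_id" not in sql:
            raise ValueError("query must filter on workspace_id")
        merged = {**(params or {}), "workspace_id": workspace_id}
        return self.db.execute(sql, merged)
```

A wrapper like this makes un-scoped queries fail in development rather than surfacing years later as a sharding-migration audit, and it doubles as a guard against cross-tenant data leaks.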
How This Shows Up in Interviews
When designing any multi-tenant SaaS system (collaborative docs, project management tools, CRM), mentioning tenant ID as a natural shard key is a high-signal move. It shows you understand data colocation, access pattern analysis, and the operational realities of sharding.
The sentence to use: "I'd shard by workspace ID so that all data for a tenant colocates on one shard, keeping 99% of queries single-shard while accepting scatter-gather for the rare cross-tenant analytics query."
| Interviewer Asks | Strong Answer Citing This Case Study |
|---|---|
| "How would you handle a database bottleneck in a collaborative editing tool?" | "Notion hit this exact problem. Vertical scaling has a ceiling. I'd shard by workspace ID so all tenant data colocates, keeping within-workspace queries as simple JOINs. Cross-workspace analytics moves to a separate pipeline like BigQuery." |
| "How do you choose a shard key?" | "Map your access patterns. If 99% of queries are scoped to one entity (workspace, org, tenant), that entity is the shard key. Notion evaluated user_id, page_id, and workspace_id. Only workspace_id kept the dominant query pattern single-shard." |
| "How would you migrate to a sharded database with zero downtime?" | "Dual-write to old and new, backfill historical data workspace by workspace with checksum validation, cutover reads per workspace with a feature flag, and decommission the old database after a soak period. Every phase is independently reversible." |
| "What are the downsides of sharding?" | "Operational multiplier: schema migrations run N times, failover happens independently per shard, cross-shard queries require scatter-gather or a separate analytics system. Notion went from monitoring one database to monitoring 32." |
| "Lookup table or consistent hashing for shard assignment?" | "Depends on entity size variance. If tenants range from 1 to 10,000 users, use a lookup table for fine-grained rebalancing. If entities are roughly uniform, consistent hashing is simpler. Notion chose lookup tables because workspace sizes vary by 1000x." |
Quick Recap
- Notion's single Postgres instance could not handle 4x user growth in 2020, with CPU saturation above 90%, vacuum stalls, and connection exhaustion even with PgBouncer.
- Vertical scaling and read replicas were insufficient because the workload was write-heavy and Postgres has a hard ceiling on single-machine resources.
- Workspace ID was chosen as the shard key because it colocates all tenant data (pages, blocks, permissions, users) on a single shard, keeping 99%+ of queries single-shard.
- The routing layer is an in-process library using a lookup table (not consistent hashing) for fine-grained workspace-level rebalancing.
- Live migration used four phases: dual-write, workspace-by-workspace backfill with checksum validation, per-workspace read cutover via feature flags, and decommission.
- Post-migration, ~32 Postgres clusters operate independently, with new operational challenges: cross-shard queries routed to BigQuery, sequential schema migrations, hot-shard monitoring, and 32 independent failover domains.
- The transferable principle: choose your shard key by mapping access patterns, and treat sharding as a permanent operational multiplier, not a one-time migration project.
Related Concepts
- Sharding - The foundational patterns for horizontal data partitioning, including hash-based vs. range-based vs. lookup-table-based shard assignment.
- Databases - Postgres internals (vacuum, MVCC, connection model) that drove Notion's scaling ceiling.
- Replication - Why read replicas could not solve Notion's write-heavy bottleneck, and how replication works within each shard post-migration.