Slack: Migrating to a relational model
How Slack redesigned its message storage from a flat text log with metadata columns to a relational channel model, enabling more efficient queries and reducing read amplification.
TL;DR
- Slack's original message storage used a flat schema with workspace_id + channel_id + ts as the primary key, optimized for append-only chat logs.
- As features accumulated (threads in 2017, reactions, edits, bookmarks), every new feature required join chains keyed on the timestamp, a value that also served as sort key, identity, and permalink anchor.
- Slack introduced a canonical message_id column to separate "identity" from "display order," normalized related tables around it, and moved variable metadata into a JSON body column.
- The migration covered hundreds of billions of rows across a Vitess-sharded MySQL cluster, using a dual-write + backfill strategy with zero downtime over 2018-2020.
- Transferable lesson: when your primary key carries multiple responsibilities (sorting, identity, foreign key), schema evolution becomes exponentially harder. Separate concerns early.
The Trigger
By mid-2018, Slack's message storage was buckling under feature weight that nobody anticipated in 2013. Loading a single channel view required joins across six tables, all keyed on a microsecond-precision Unix timestamp that was never designed to be a relational foreign key.
The symptom was read amplification. A "load channel" query that once scanned a single table now fanned out into multiple round trips: fetch messages, then join reactions, then join edits, then join thread metadata, then join bookmarks, then join pinned items. Each join used (workspace_id, channel_id, ts) as the key. At Slack's scale (tens of millions of active users, thousands of messages per second), this fanout was measurable in both latency and database CPU.
The tipping point came when the team started building Slack Connect (cross-organization channels). The flat schema made cross-workspace queries nearly impossible without duplicating data. The engineering team decided that patching the existing model was no longer viable: the schema itself needed to change.
The System Before
Slack started as a side project inside Tiny Speck, a gaming company building Glitch. The internal chat tool became the product. The original message schema reflected that origin: a simple append log optimized for writing messages fast and reading them back in order.
The original messages table looked roughly like this:
-- Simplified original schema (circa 2013-2017)
CREATE TABLE messages (
    workspace_id    VARCHAR(12)   NOT NULL,
    channel_id      VARCHAR(12)   NOT NULL,
    ts              DECIMAL(17,6) NOT NULL,  -- Unix epoch, microsecond precision
    user_id         VARCHAR(12)   NOT NULL,
    type            VARCHAR(32)   NOT NULL,
    text            TEXT,
    has_attachments BOOLEAN DEFAULT FALSE,
    has_reactions   BOOLEAN DEFAULT FALSE,
    is_pinned       BOOLEAN DEFAULT FALSE,
    PRIMARY KEY (workspace_id, channel_id, ts)
);
For the core use case (load 50 most recent messages in a channel), this schema was excellent. One range scan on (workspace_id, channel_id, ts DESC LIMIT 50) returned everything. No joins, no secondary lookups. The write path was equally clean: insert a row with the current timestamp. Done.
The ts column deserves special attention. Slack chose a microsecond-precision Unix timestamp as the primary key for a few reasons. It provided natural sort order (messages sort by time). It was URL-safe for permalinks (/archives/C01234/p1529000000000100). And it was unique enough within a channel because microsecond precision meant collisions were near-impossible.
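The permalink encoding is simple enough to sketch. Assuming the token is just the microsecond count with the decimal point dropped (the p1529000000000100 format above; function names here are illustrative, not Slack's actual code), a round-trip looks like this:

```python
from decimal import Decimal

def ts_to_permalink(ts: Decimal) -> str:
    """Encode a microsecond-precision ts as a permalink token.

    Illustrative sketch: shift the decimal point six places so that
    1529000000.000100 becomes the 16-digit token p1529000000000100.
    """
    micros = int(ts.scaleb(6))  # multiply by 10^6 -> integer microseconds
    return f"p{micros}"

def permalink_to_ts(token: str) -> Decimal:
    """Invert the encoding: strip the leading 'p', restore the decimal point."""
    return Decimal(token[1:]).scaleb(-6)

print(ts_to_permalink(Decimal("1529000000.000100")))  # p1529000000000100
print(permalink_to_ts("p1529000000000100"))           # 1529000000.000100
```

Using Decimal rather than float matters here: microsecond timestamps exceed the 53-bit precision where floats stay exact, so a float round-trip could silently change the permalink.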
I've seen this exact pattern in early-stage systems: pick a naturally ordered value as the primary key because it simplifies the first three features. The problem only surfaces two years later when feature four needs that key to do something it was never designed for.
How feature accretion broke the model
Each new feature added a table, and each new table used ts as its foreign key:
Threads (2017). Slack introduced threaded replies, which meant a message could be a "parent" with child replies. The thread relationship was modeled with a thread_ts column pointing back to the parent message's ts. Loading a thread required a self-join: find all messages where thread_ts = parent.ts. This worked, but now ts was both a sort key and a relational pointer.
Reactions. Emoji reactions lived in a separate table keyed by (workspace_id, channel_id, message_ts, user_id, reaction). Any query that needed to show reactions alongside messages required a join back through ts. Showing a channel with 50 messages, each potentially having reactions, meant 50 point lookups into the reactions table.
Edits. Slack stores edit history for messages. The edits table was keyed by (workspace_id, channel_id, ts, edit_ts). Displaying the current version of a message required joining to the edits table and taking the most recent edit. That is two range scans per message.
Bookmarks and pins. More tables, more joins, same key. By 2018, loading a channel with threads, reactions, and edit history touched six tables, all joined on a Unix timestamp.
The compound effect was devastating for read performance. A query that started as a single range scan now looked like:
-- Pseudocode: loading a channel view by 2018
SELECT m.*,
r.reactions,
e.latest_edit,
t.reply_count,
p.is_pinned,
b.is_bookmarked
FROM messages m
LEFT JOIN reactions r ON (m.workspace_id = r.workspace_id
AND m.channel_id = r.channel_id
AND m.ts = r.message_ts)
LEFT JOIN edits e ON (m.workspace_id = e.workspace_id
AND m.channel_id = e.channel_id
AND m.ts = e.message_ts)
LEFT JOIN threads t ON (m.workspace_id = t.workspace_id
AND m.channel_id = t.channel_id
AND m.ts = t.thread_ts)
LEFT JOIN pins p ON (m.workspace_id = p.workspace_id
AND m.channel_id = p.channel_id
AND m.ts = p.message_ts)
LEFT JOIN bookmarks b ON (m.workspace_id = b.workspace_id
AND m.channel_id = b.channel_id
AND m.ts = b.message_ts)
WHERE m.workspace_id = ? AND m.channel_id = ?
ORDER BY m.ts DESC
LIMIT 50;
Every join key was a three-column composite that included a decimal timestamp. Index lookups on DECIMAL(17,6) are not catastrophically slow, but they are slower than integer lookups, and doing six of them per channel load at Slack's scale was burning real CPU.
The multi-responsibility key trap
When a single column serves as sort key, display value, permalink component, and foreign key simultaneously, you cannot change any of those behaviors independently. Slack could not switch to a more join-friendly key format without breaking permalinks. They could not add a secondary sort dimension without rethinking identity. The ts column was load-bearing in too many directions.
Why Not Just Add Indexes?
The obvious first question: why not add covering indexes and keep the schema?
Slack's team evaluated several incremental approaches before deciding on a full schema redesign.
More indexes on the flat schema. Adding covering indexes for the common join patterns would speed up reads but dramatically increase write amplification. Every message insert would update six indexes (one per join target). At thousands of messages per second, the index maintenance cost was prohibitive.
Denormalize everything into the messages table. Store reactions, edit status, thread metadata, and pin state directly in the messages row. This eliminates joins entirely but creates a new problem: updating a reaction or adding a thread reply requires a write to the messages table, which contends with new message inserts. It also means the messages row grows unboundedly as features accumulate.
Use a document store (MongoDB, DynamoDB). Switching to a document model would let Slack embed all metadata in a single document per message. But Slack's entire infrastructure was built around MySQL. Cross-cutting features like search, compliance exports, and enterprise audit logs ran SQL queries against the messages schema. Switching storage engines would require rewriting dozens of downstream systems.
Cache the join results. Front the database with a cache that stores pre-joined channel views. This helps for hot channels but has invalidation issues: a single reaction, edit, or pin invalidates the cached view for the whole channel. At Slack's feature velocity, cache invalidation would be a permanent engineering drag.
None of these solved the root problem: ts was the wrong type of key for relational operations. The team decided to fix the schema at the source.
| Approach | Read Benefit | Write Penalty | Why Rejected |
|---|---|---|---|
| More indexes | Faster joins | 6x index writes per message | Write amplification too high |
| Denormalize into messages | Zero joins | Update contention on hot rows | Unbounded row growth |
| Switch to document store | No joins needed | Rewrite all downstream SQL | Too much blast radius |
| Cache join results | Sub-ms reads for hot channels | Invalidation on every mutation | Permanent engineering drag |
The Decision
Slack chose a three-part approach: introduce a canonical message_id, normalize the schema around it, and move flexible metadata into a JSON column.
Part 1: Separate identity from display order
The core insight was that ts was doing too many jobs. By introducing a dedicated message_id (a bigint auto-increment), Slack could:
- Use message_id as the stable foreign key for all related tables.
- Keep ts for display ordering and permalink generation.
- Allow ts to be imprecise or even adjusted without breaking relational integrity.
This is the same principle behind surrogate keys in database design. Natural keys (timestamps, emails, usernames) seem convenient but resist schema evolution. Surrogate keys (auto-increment integers, UUIDs) are boring and stable.
Part 2: Normalize around message_id
-- New messages table (simplified)
CREATE TABLE messages (
    message_id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    workspace_id VARCHAR(12)     NOT NULL,
    channel_id   VARCHAR(12)     NOT NULL,
    ts           DECIMAL(17,6)   NOT NULL,
    user_id      VARCHAR(12)     NOT NULL,
    body         JSON,  -- text, attachments, formatting metadata
    PRIMARY KEY (workspace_id, channel_id, message_id),
    UNIQUE KEY (workspace_id, channel_id, ts),
    KEY (message_id)  -- InnoDB requires the AUTO_INCREMENT column to lead an index
);

-- Reactions now key on message_id
CREATE TABLE reactions (
    message_id BIGINT UNSIGNED NOT NULL,
    user_id    VARCHAR(12)     NOT NULL,
    reaction   VARCHAR(64)     NOT NULL,
    created_at TIMESTAMP       NOT NULL,
    PRIMARY KEY (message_id, user_id, reaction)
);

-- Threads reference parent by message_id
CREATE TABLE threads (
    parent_message_id BIGINT UNSIGNED NOT NULL,
    reply_message_id  BIGINT UNSIGNED NOT NULL,
    reply_ts          DECIMAL(17,6)   NOT NULL,
    PRIMARY KEY (parent_message_id, reply_message_id)
);
The difference in join efficiency is significant. Instead of joining on a three-column composite key with a decimal, every join now uses a single BIGINT column. Integer equality checks are the fastest operation a B-tree index can do.
Part 3: JSON body column
Rather than adding a new column every time a feature needed message-level metadata (formatting options, link previews, app integrations), Slack moved variable fields into a body JSON column. This traded relational indexing for schema flexibility.
The tradeoff is real. You cannot write WHERE body->>'$.has_link_preview' = true efficiently across billions of rows. But Slack did not need to: the primary access patterns (load by channel, load by thread) never filter on those fields. They are display metadata, read after the row is fetched.
I've seen teams reach for JSON columns too eagerly, storing data there that they later need to query relationally. Slack's approach worked because they were disciplined about what went into JSON (display-only metadata) versus what stayed as a proper column (anything used in WHERE clauses or joins).
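That discipline is easy to demonstrate. In this sketch (SQLite standing in for the sharded MySQL cluster, schema heavily simplified), the query filters only on indexed columns; the JSON body is decoded in application code after the rows come back, never inside a WHERE clause:

```python
import json
import sqlite3

# In-memory stand-in for the message store; table layout is illustrative.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE messages (
        message_id INTEGER PRIMARY KEY,
        channel_id TEXT NOT NULL,
        ts REAL NOT NULL,
        body TEXT NOT NULL  -- JSON: display-only metadata
    )
""")
db.execute(
    "INSERT INTO messages (channel_id, ts, body) VALUES (?, ?, ?)",
    ("C01234", 1529000000.0001,
     json.dumps({"text": "hello", "has_link_preview": True})),
)

# Filter on proper columns only; decode JSON after the rows are fetched.
rows = db.execute(
    "SELECT message_id, body FROM messages "
    "WHERE channel_id = ? ORDER BY ts DESC LIMIT 50",
    ("C01234",),
).fetchall()
messages = [{"message_id": mid, **json.loads(body)} for mid, body in rows]
print(messages[0]["text"])  # hello
```

The moment a field like has_link_preview needs to appear in a WHERE clause, it has earned promotion to a real indexed column.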
Interview tip: JSON columns
When you propose a JSON or blob column in a design interview, immediately state what you are giving up: "We lose the ability to index or filter on fields inside the JSON. This is acceptable because these fields are only read after row selection, never used in WHERE clauses." That single clarification shows you understand the tradeoff, not just the convenience.
The Migration Path
Migrating hundreds of billions of rows across a sharded MySQL cluster while serving live traffic is not a single clever technique. It is a sequence of carefully ordered phases, each with its own validation and rollback plan.
Phase 1: Dual-write
The application layer was modified to write every new message to both the old (ts-keyed) and new (message_id-keyed) tables. This is the safest starting point because reads still hit the old table exclusively. If the new table has issues, nothing user-facing breaks.
The dual-write path had to handle a subtle ordering problem. The old table's ts was generated by the application at message creation time. The new table's message_id was generated by MySQL's auto-increment. These two values had to be associated so a ts could be resolved to a message_id and vice versa. The application wrote to the new table first (to get the message_id), then wrote to the old table with the ts, and stored the (ts, message_id) mapping.
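The write ordering described above can be sketched as follows. The in-memory dicts and the counter standing in for MySQL's AUTO_INCREMENT are of course stand-ins, not Slack's implementation:

```python
# Hypothetical in-memory stand-ins for the two tables and the mapping.
new_table: dict[int, dict] = {}     # keyed by message_id
old_table: dict[float, dict] = {}   # keyed by ts
ts_to_id: dict[float, int] = {}     # ts -> message_id mapping
_next_id = 0

def dual_write(channel_id: str, text: str, ts: float) -> int:
    """Write ordering from the migration: new table first (to obtain the
    message_id), then the old table, then record the ts<->id mapping so
    either key can be resolved to the other."""
    global _next_id
    _next_id += 1                    # stands in for MySQL AUTO_INCREMENT
    message_id = _next_id
    new_table[message_id] = {"channel_id": channel_id, "ts": ts, "text": text}
    old_table[ts] = {"channel_id": channel_id, "text": text}
    ts_to_id[ts] = message_id
    return message_id

mid = dual_write("C01234", "hello", 1529000000.000100)
assert ts_to_id[1529000000.000100] == mid  # a ts resolves to its message_id
```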
Phase 2: Backfill
The backfill job was the longest phase. Slack had hundreds of billions of historical messages that needed message_id values assigned and written to the new schema.
The backfill ran as a batch job that:
- Scanned the old table shard by shard, workspace by workspace.
- Assigned monotonically increasing message_id values that preserved the ts ordering within each channel.
- Wrote to the new table with aggressive rate limiting to avoid saturating any shard.
Rate limiting was critical. At full speed, the backfill could generate write load equivalent to production traffic, which would degrade live queries. Slack throttled the backfill to run during off-peak hours and paused automatically when shard load crossed thresholds.
This phase took months. Hundreds of billions of rows, each requiring a new column assignment and a write to a new table structure, with no tolerance for data loss or ordering violations.
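A minimal sketch of such a throttled backfill loop follows; the batch size, load threshold, and callable names are invented for illustration:

```python
import time

def backfill_shard(rows, write_batch, shard_load, max_load=0.7, batch_size=1000):
    """Copy old-schema rows to the new table in rate-limited batches.

    `rows` must yield old-table rows in (channel_id, ts) order so that
    ids assigned downstream preserve per-channel ts ordering.
    `shard_load()` returns the shard's current load as a 0.0-1.0 fraction;
    the loop pauses whenever it crosses the threshold.
    """
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            while shard_load() > max_load:  # back off while the shard is busy
                time.sleep(1)
            write_batch(batch)
            batch = []
    if batch:                               # flush the final partial batch
        write_batch(batch)
```

A production version also needs checkpointing (resume after a crash without rewriting rows) and idempotent writes, which this sketch omits.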
Phase 3: Shadow reads and validation
Before cutting over reads, Slack ran shadow queries: every read request went to both the old and new tables, and a background process compared the results. Any discrepancy was logged, investigated, and fixed before proceeding.
The shadow read phase catches bugs that unit tests miss: encoding differences, timezone edge cases, off-by-one errors in ordering, and corner cases around deleted or edited messages.
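The shadow-read wrapper itself is small; the hard work is in comparing and triaging discrepancies. A sketch, with hypothetical callables:

```python
def shadow_read(query_old, query_new, log_discrepancy):
    """Serve the old-table result to the user; compare against the new
    table and log any mismatch for offline investigation. During this
    phase the new table never affects what users see."""
    old_result = query_old()
    new_result = query_new()
    if old_result != new_result:
        log_discrepancy({"old": old_result, "new": new_result})
    return old_result  # reads still come from the old table
```

In practice the comparison would run off the hot path (sampled, or in a background worker) so shadow reads do not double user-facing latency.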
Phase 4: Gradual read cutover
Reads shifted to the new table using a percentage-based ramp. Start at 1% of reads from the new table, monitor latency and error rates, then increase to 5%, 10%, 25%, 50%, 100%. At each step, Slack could pause or roll back to the old table if metrics degraded.
The entire cutover took weeks. For a system serving tens of millions of users, impatience during migration is the fastest path to an outage.
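One common way to implement such a ramp (not necessarily Slack's) is to hash a stable key into a fixed bucket, so a given channel deterministically reads from one table or the other, and raising the percentage only ever moves channels from old to new:

```python
import zlib

def reads_from_new_table(channel_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket a channel into the rollout. The same
    channel always hashes to the same bucket, so ramping 1% -> 100%
    migrates channels monotonically, never flip-flopping a channel
    between tables from one request to the next."""
    bucket = zlib.crc32(channel_id.encode()) % 100
    return bucket < rollout_percent

assert reads_from_new_table("C01234", 100) is True
assert reads_from_new_table("C01234", 0) is False
```

Keying the ramp on channel_id rather than per-request randomness also keeps each channel's reads internally consistent during the cutover.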
Dual-write is not free
Dual-write doubles your write load and introduces consistency risks. If the write to the new table fails but the write to the old table succeeds, you have a gap. Slack handled this with a reconciliation job that detected and repaired gaps by comparing the two tables periodically. In a design interview, if you propose dual-write, mention the reconciliation mechanism. Otherwise you have an incomplete migration strategy.
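A reconciliation pass can be sketched as a periodic diff between the two tables; `repair` here is a hypothetical callback that re-inserts the missing row into the new table:

```python
def reconcile(old_table: dict, new_by_ts: dict, repair):
    """Periodic gap detection for a dual-write migration: any ts present
    in the old table but absent from the new one marks a failed write to
    the new table. Each gap is handed to `repair` and reported."""
    gaps = []
    for ts, row in old_table.items():
        if ts not in new_by_ts:
            gaps.append(ts)
            repair(ts, row)  # re-insert the row into the new table
    return gaps
```

At Slack's scale a full diff per pass is infeasible, so a real job would scan incrementally (e.g. only rows written since the last checkpoint), but the invariant it enforces is the same.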
Phase 5: Cleanup
After 100% of reads served from the new table for a cooling period (weeks to months), the old table was decommissioned. Slack kept the old table in a read-only state during the cooling period as an emergency fallback.
The System After
The post-migration architecture separated concerns cleanly: message_id for relational integrity, ts for display and sorting, JSON for flexible metadata, and Vitess for horizontal scaling.
Vitess as the sharding layer
The messages table is not a single MySQL instance. It is distributed across hundreds of MySQL shards managed by Vitess, the open-source MySQL sharding middleware originally built at YouTube.
Vitess sits between the application and MySQL. The application sends standard SQL queries to VTGate (Vitess's query router), which parses the query, identifies the target shard based on the workspace_id, and forwards the query to the correct MySQL instance.
Using workspace_id as the shard key means all messages, reactions, threads, and metadata for a single workspace live on the same shard. This is critical because the dominant query pattern (load a channel) operates within a single workspace. There are no cross-shard joins for the common case.
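Hash-range routing of this kind can be sketched as follows. This is a simplification of what VTGate's vindex machinery does: the hash function, shard count, and range boundaries here are invented for illustration:

```python
import bisect
import zlib

# Four illustrative shards, each owning a contiguous slice of the
# 32-bit keyspace: [0, 0x40000000), [0x40000000, 0x80000000), etc.
SHARD_UPPER_BOUNDS = [0x40000000, 0x80000000, 0xC0000000, 0x100000000]

def shard_for(workspace_id: str) -> int:
    """Map a workspace to its shard: hash the shard key into the
    keyspace, then find the range that covers it. Every row for one
    workspace lands on one shard, so channel loads stay shard-local."""
    keyspace_value = zlib.crc32(workspace_id.encode())  # 0 .. 2**32 - 1
    return bisect.bisect_right(SHARD_UPPER_BOUNDS, keyspace_value)
```

Splitting a hot shard then means narrowing one range into two and moving the affected rows, with no change visible to the application.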
The tradeoff is that very large workspaces (Slack's own workspace, for example, or major enterprise customers) create hot shards. Vitess handles this through shard splitting: when a shard grows too large, Vitess can split it into two shards by subdividing the key range. This is transparent to the application.
Operational challenges at shard scale
Running hundreds of MySQL shards introduces operational complexity that a single-instance MySQL never faces.
Schema migrations. Changing a column type or adding an index must be applied to every shard. Vitess provides vtctl ApplySchema for coordinated schema changes, but a bad migration can still cascade across shards. Slack adopted a canary shard strategy: apply the migration to one shard, validate, then roll out to the rest.
Cross-shard queries. Admin tools, compliance exports, and search features sometimes need to query across workspaces. These queries cannot use the shard key for routing, so Vitess scatters the query to all shards and gathers results. Scatter-gather queries are expensive and must be used sparingly. Slack built dedicated read replicas for cross-shard analytics workloads.
Monitoring at shard granularity. A slow shard might indicate a hot workspace, a runaway query, or hardware degradation. Slack's monitoring had to shift from "is the database healthy?" to "which of these 200+ shards is unhealthy and why?"
The Results
The migration delivered measurable improvements across read performance, write efficiency, and schema evolvability.
| Metric | Before (ts-keyed flat schema) | After (message_id + Vitess) |
|---|---|---|
| Tables joined per channel load | 6 | 1-2 (message_id joins) |
| Join key type | 3-column composite with DECIMAL | Single BIGINT |
| Schema changes for new features | ALTER TABLE + backfill per feature | Add JSON field (no schema change) |
| Read amplification (queries per channel view) | 6+ round trips | 1-2 round trips |
| Migration duration | N/A | ~2 years (2018-2020) |
| Rows migrated | N/A | Hundreds of billions |
| Downtime during migration | N/A | Zero |
The biggest win was not raw latency (though that improved) but engineering velocity. Before the migration, adding any feature that touched messages required careful planning around the six-table join chain. After the migration, new features could either add a field to the JSON body (no schema change required) or create a new table keyed on message_id (single-column join, trivial).
For the interview: the metric that matters most here is not p99 latency. It is "time to ship a new message-related feature." Schema design directly determines feature velocity.
What They'd Do Differently
Based on Slack's public engineering blog posts and conference talks from 2019-2020:
Start with a surrogate key from day one. The team acknowledged that using ts as the primary key was a reasonable choice for a startup, but the cost of fixing it later was enormous. If they were starting over, they would use an auto-increment or Snowflake ID as the primary key from the beginning, with ts as a secondary indexed column.
Invest in Vitess tooling earlier. The shard-splitting and schema migration tooling in Vitess required significant custom work to integrate with Slack's deployment pipeline. The team wished they had started that integration work six months earlier.
Build the reconciliation system before starting dual-write. The gap-detection system that ensured consistency between old and new tables was built mid-migration. Building it first would have caught issues earlier and given the team more confidence during the backfill phase.
I've seen this pattern at every company that does a large-scale migration: the scaffolding (validation, reconciliation, monitoring) takes longer than the actual data movement. Teams consistently underestimate scaffolding effort by 2-3x.
Architecture Decision Guide
Use this flowchart when deciding whether to refactor your schema or add more denormalization.
Transferable Lessons
1. Separate identity from display order.
Slack's ts was the sort key, the foreign key, and the permalink anchor. When any of those needed to change independently, the entire schema resisted. Use a surrogate key for relational integrity and keep natural keys for presentation only. This applies to any system where the "natural" key seems convenient but carries multiple meanings.
2. Dual-write migrations require reconciliation, not just replication. Writing to two tables simultaneously is the easy part. Detecting and repairing gaps between the two tables is the hard part. Every dual-write migration needs a reconciliation job that runs continuously, comparing source and target, and alerting on discrepancies. Without it, you are migrating on faith.
3. JSON columns are a schema evolution tool, not a data model. Slack used JSON for display metadata that never appears in WHERE clauses. This is the right use case. JSON columns become a liability when you later need to filter, aggregate, or join on fields inside them. Draw the line at write time: structured queries need columns, display-only data can live in JSON.
4. Feature accretion is a schema design force. No schema survives five years of feature development unchanged. The question is not "will we need to migrate?" but "how painful will the migration be?" Schemas with single-purpose keys and clear separation of concerns are cheaper to migrate than schemas where one column does everything.
5. Shard key choice determines your query economics.
Slack chose workspace_id because the dominant access pattern operates within a single workspace. This made common queries shard-local (cheap) and cross-workspace queries scatter-gather (expensive). When picking a shard key, optimize for the 95th percentile query, not the edge case.
How This Shows Up in Interviews
Slack's migration is directly relevant when you are designing a messaging system, a comments system, or any feature-rich content storage layer.
When to cite this case study: Any question involving "design a chat system" or "design a messaging platform" where the interviewer pushes past the basic schema into how threads, reactions, or edits would work.
The sentence to say: "I'd use a surrogate message_id as the relational key and keep the timestamp for display ordering only, because using the timestamp as a foreign key across feature tables creates expensive composite joins at scale."
| Interviewer asks | Strong answer citing this case study |
|---|---|
| "How would you store messages?" | "Append log with workspace + channel + message_id as PK. The message_id is the FK for reactions, threads, edits. Timestamp is indexed separately for display ordering." |
| "How do you handle threads?" | "Separate threads table keyed on parent_message_id. Single BIGINT join, not a composite timestamp key. Slack learned this the hard way when self-joins on ts became their bottleneck." |
| "What about reactions and edits?" | "Separate tables, all keyed on message_id. Variable display metadata goes in a JSON body column to avoid schema changes for every new feature." |
| "How would you migrate at scale?" | "Dual-write to old and new table, backfill historical rows with rate limiting, shadow-read for validation, then gradual % cutover. Zero downtime, takes months at billions of rows." |
| "How would you shard this?" | "Shard on workspace_id using Vitess. All channel queries are shard-local. Cross-workspace queries (admin, compliance) go to dedicated read replicas via scatter-gather." |
Quick Recap
- Slack's original message schema used (workspace_id, channel_id, ts) as the primary key, optimized for fast appends and chronological reads.
- As threads, reactions, edits, bookmarks, and pins accumulated, every feature required joins keyed on the timestamp, a value never designed for relational operations.
- By 2018, loading a channel view required joins across six tables, all using a three-column composite key with a DECIMAL timestamp.
- Slack introduced a BIGINT message_id as the canonical relational key, keeping ts for display and sorting only.
- Variable message metadata moved into a JSON body column, eliminating schema migrations for new display features.
- The migration used dual-write + backfill + shadow-read + gradual cutover across hundreds of billions of rows with zero downtime over 2018-2020.
- Transferable principle: when a primary key carries multiple responsibilities, separate them before the join chain makes migration impossible.
Related Concepts
- Databases - Foundational understanding of relational vs. document models, indexing strategies, and why join key type matters for query performance.
- Sharding - How Vitess implements horizontal partitioning, shard key selection tradeoffs, and the operational cost of scatter-gather queries.
- Consistency models - Why dual-write migrations create temporary inconsistency windows and how reconciliation jobs restore consistency guarantees.