Discord: Cassandra to ScyllaDB migration
How Discord migrated from Apache Cassandra to ScyllaDB to eliminate latency spikes caused by JVM garbage collection, achieving 2-5x throughput improvement at lower cost.
TL;DR
- Discord stored 100+ billion chat messages in Apache Cassandra, but JVM garbage collection pauses caused unpredictable p99 latency spikes up to 300ms.
- ScyllaDB, a C++ rewrite of Cassandra using the Seastar framework, eliminates GC entirely with a shard-per-core architecture.
- The migration required zero application code changes because ScyllaDB speaks the same CQL protocol as Cassandra.
- Post-migration results: p99 read latency dropped from 40-300ms to ~15ms, and node count shrank from 14 to 4.
- Transferable lesson: "protocol-compatible" does not mean "performance-equivalent." The runtime model underneath matters more than the query language on top.
The Trigger
By late 2019, Discord's on-call engineers had a recurring problem. Every few hours, one or more Cassandra nodes would enter a GC pause, and p99 read latencies would spike from single-digit milliseconds to 200-300ms. Users noticed: messages appeared to "hang" before loading, channels felt sluggish, and the experience degraded in ways that were hard to reproduce or predict.
The spikes correlated perfectly with JVM garbage collection cycles. Discord's Cassandra cluster held over 100 billion messages and handled roughly 1 million writes per second at peak. At that throughput, the JVM's heap was under constant memory pressure from compaction, bloom filter maintenance, and memtable flushes.
I've seen this exact pattern at multiple companies running Cassandra at scale. The JVM GC problem is not a bug; it is a fundamental property of running a high-throughput storage engine inside a garbage-collected runtime. The question is always: at what scale does it become unacceptable?
For Discord, the answer was "now." The team had already spent months tuning G1GC parameters, adjusting heap sizes, moving row caches off-heap, and experimenting with compaction strategies. None of it eliminated the spikes. It only made them less frequent and slightly less severe. The engineering team decided that incremental tuning had hit diminishing returns, and they needed a fundamentally different approach.
The System Before
Discord's message storage architecture was a fairly standard Cassandra deployment, well-designed for their access patterns but running into the JVM's hard limits at scale.
The message schema used a composite partition key designed for Discord's primary access pattern: loading the most recent messages in a channel.
```sql
-- Simplified Discord messages schema
CREATE TABLE messages (
    channel_id      bigint,
    bucket          int,      -- monthly time bucket
    message_id      bigint,   -- Snowflake ID, time-ordered
    author_id       bigint,
    content         text,
    has_attachments boolean,
    has_embeds      boolean,
    PRIMARY KEY ((channel_id, bucket), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
```
The channel_id + bucket partition key kept each partition to roughly one month of messages per channel. This prevented unbounded partition growth (a common Cassandra anti-pattern) and made "load recent messages" queries efficient, because the application reads only the most recent bucket or two rather than the channel's entire history.
The message_id clustering key used Discord's Snowflake ID format, which embeds a timestamp. Ordering by message_id DESC means the most recent messages sit at the top of each partition, so a "load channel" query reads sequentially from disk without random I/O.
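Because the Snowflake ID embeds a timestamp, the bucket can be derived from the message ID itself, so the write path and the read path always agree on the partition without a separate lookup. A minimal sketch in Python — the 42-bit timestamp layout and the Discord epoch are publicly documented Snowflake details, but the monthly bucket width here is illustrative, not Discord's actual value:

```python
DISCORD_EPOCH_MS = 1_420_070_400_000          # 2015-01-01, Discord's Snowflake epoch
BUCKET_WIDTH_MS = 30 * 24 * 60 * 60 * 1000    # "monthly" bucket width (illustrative)

def snowflake_timestamp_ms(message_id: int) -> int:
    # The top 42 bits of a Snowflake encode milliseconds since the epoch
    return (message_id >> 22) + DISCORD_EPOCH_MS

def bucket_for(message_id: int) -> int:
    # Both writers and readers compute the same bucket from the ID,
    # e.g. for: SELECT ... WHERE channel_id = ? AND bucket = ? LIMIT 50
    return (snowflake_timestamp_ms(message_id) - DISCORD_EPOCH_MS) // BUCKET_WIDTH_MS
```
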
This schema design was solid. The problem was never the data model. It was the runtime.
How the JVM GC problem manifested
Cassandra's JVM process manages a massive heap: memtables (in-memory write buffers), bloom filters (one per SSTable), row cache entries, and internal metadata. Under Discord's workload, the heap churn was enormous.
Here is the cycle that repeated every few hours on every node:
- Memtable flushes generate short-lived objects at a high rate. G1GC's young generation collections handle most of these, but each flush also creates longer-lived SSTable metadata.
- Compaction reads multiple SSTables, merges them, and writes new ones. During compaction, the JVM allocates large temporary buffers that survive young-gen collection and promote to old-gen.
- Old-gen fills up. G1GC triggers a mixed collection or, in the worst case, a full GC. During this pause, all application threads stop. No reads, no writes, no heartbeats.
- Request pile-up. Clients with pending requests to the paused node see timeouts. The coordinator retries to another replica, adding load to healthy nodes.
The pause duration depended on heap size and live object count. With 8GB heaps, pauses ranged from 50ms to 300ms. Larger heaps made pauses less frequent but longer when they hit. Smaller heaps made pauses shorter but more frequent. There was no winning configuration.
The math behind "some node is always pausing"
This is worth spelling out because it illustrates a broader systems principle. With 14 nodes in the cluster, each node pausing for ~200ms every 30 minutes, you can calculate the probability that at least one node is paused at any given moment.
Each node spends roughly 200ms / (30 * 60 * 1000ms) = 0.011% of its time in GC. The probability that a specific node is not in GC at a given instant is 99.989%. The probability that all 14 nodes are not in GC simultaneously is 0.99989^14 = 99.85%. So about 0.15% of the time, at least one node is paused.
That sounds small, but at 1 million requests per second, 0.15% means roughly 1,500 requests per second hitting a paused node. And those requests do not fail gracefully: they queue, retry, and create cascading load on other nodes. During peak hours with more frequent GC cycles, the probability was significantly higher.
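The arithmetic above is easy to reproduce. Plugging in the same assumptions (a 200ms pause every 30 minutes per node, 14 nodes, 1 million requests per second):

```python
pause_ms = 200
interval_ms = 30 * 60 * 1000

p_node_paused = pause_ms / interval_ms          # ~0.011% of time, per node
p_any_paused = 1 - (1 - p_node_paused) ** 14    # ~0.15% cluster-wide
affected_rps = 1_000_000 * p_any_paused         # ~1,500 requests/sec

print(f"{p_any_paused:.4%} of the time, at least one node is paused")
print(f"~{affected_rps:.0f} requests/sec land on a paused node")
```
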
GC tuning is asymptotic, not linear
Every JVM GC tuning parameter trades one dimension for another. Increasing heap reduces pause frequency but increases pause duration. Off-heap caching reduces heap pressure but introduces manual memory management complexity. Discord found that after 6 months of tuning, they were chasing single-percentage improvements at the cost of significant engineering time.
Tuning could thin the pauses out, but it could never remove them. For a service that needs consistent sub-10ms latency to feel "instant" to users typing messages, that was a dealbreaker.
Why Not Just Tune Cassandra Harder?
The obvious question: why not keep tuning the JVM? The engineering team explored every standard approach before deciding to migrate. Here is what they tried and why each fell short.
G1GC parameter optimization. Discord experimented extensively with G1GC settings: -XX:MaxGCPauseMillis, -XX:G1HeapRegionSize, -XX:InitiatingHeapOccupancyPercent, and dozens more. Each tweak improved one metric while degrading another. Reducing MaxGCPauseMillis from 200ms to 50ms caused more frequent collections that consumed more total CPU. The team estimated they spent roughly 6 months on JVM tuning alone.
Off-heap row cache. Cassandra supports moving its row cache out of the JVM heap, which reduces GC pressure. Discord enabled this, and it helped. But the bloom filters, memtables, and compaction buffers still lived on-heap. Off-heap caching addressed maybe 30% of the problem.
Heap size adjustments. Larger heaps (12-16GB) delayed GC pauses but made each pause longer when it arrived. The p99 improved slightly, but the p999 got worse. For Discord's use case (real-time chat), the worst-case latency matters more than the average case.
Compaction strategy changes. Switching from SizeTieredCompactionStrategy to LeveledCompactionStrategy reduced write amplification but increased disk I/O and, paradoxically, created more compaction-related heap allocations. The JVM was doing more work more consistently, which meant GC pressure was smoother but never absent.
The JVM GC wall
This is a well-documented pattern in the Cassandra community. DataStax, LinkedIn, and Apple have all published engineering blogs about JVM GC challenges at scale. The problem is not Cassandra-specific per se; it is JVM-specific, affecting any storage engine doing heavy allocation under sustained throughput. Elasticsearch, Kafka, and HBase all face variants of this problem.
My recommendation for interviews: if you mention Cassandra in a design, know that JVM GC is the operational ceiling. You do not need to memorize tuning parameters. Just acknowledge that garbage-collected runtimes introduce non-deterministic latency under memory pressure, and that this becomes visible at high throughput. That single sentence shows operational maturity.
The team also considered vertical scaling (bigger machines) and horizontal scaling (more nodes). More nodes would distribute load but also increase the probability that some node is always in GC. Vertical scaling (more RAM per node) would make individual pauses worse when they occurred. Neither addressed the root cause.
The root cause was architectural: a managed-memory runtime running a storage engine that allocates and deallocates at extreme rates. No configuration change fixes that. You either accept the latency profile or change the runtime.
Here is a summary of what the team tried and why each approach hit a ceiling:
| Tuning Approach | Improvement | Why It Wasn't Enough |
|---|---|---|
| G1GC parameter optimization | ~20% fewer pauses | Pauses still occurred, just less often |
| Off-heap row cache | ~30% less heap pressure | Memtables and bloom filters still on-heap |
| Larger heap (12-16GB) | Fewer pauses overall | Individual pauses lasted longer (worse p999) |
| LeveledCompactionStrategy | Smoother compaction load | More compaction I/O, more heap allocations |
| More nodes (horizontal scale) | Lower per-node load | Higher chance that some node is in a GC pause at any moment |
The Decision
Discord evaluated three options:
| Option | Pros | Cons |
|---|---|---|
| Keep Cassandra, accept GC latency | Zero migration effort | p99 stays unpredictable, on-call burden |
| Switch to a different DB (DynamoDB, CockroachDB) | Different performance profile | Requires rewriting all data access code |
| Switch to ScyllaDB | Same CQL protocol, no GC, shard-per-core | Newer project, smaller community |
The team also briefly considered managed Cassandra offerings (DataStax Astra, Amazon Keyspaces). These services abstract away cluster management but still run on the JVM under the hood. They would not solve the fundamental GC problem, just move it behind an API. The latency profile would remain the same.
ScyllaDB won because of one critical property: CQL protocol compatibility. ScyllaDB speaks the exact same query language and wire protocol as Cassandra. Discord's application code (the CQL queries, prepared statements, and connection pooling) needed zero changes. The migration was purely an infrastructure swap.
What makes ScyllaDB different
ScyllaDB is not a fork of Cassandra. It is a from-scratch rewrite in C++ using the Seastar framework. The architectural differences are fundamental:
Shard-per-core model. ScyllaDB assigns each CPU core exclusive ownership of a memory region and a partition range. Core 0 owns partitions A-F, core 1 owns G-L, and so on. Each core runs a single event loop (like Node.js, but for a database). There are no locks, no mutexes, no context switches between cores for most operations.
No garbage collection. Memory is managed manually through C++ RAII patterns and Seastar's memory allocator. Allocation and deallocation are deterministic. There are no stop-the-world pauses, no heap scanning, no generational promotion.
Per-shard compaction. In Cassandra, compaction is a cluster-wide concern that competes with query threads for JVM heap and CPU. In ScyllaDB, each shard compacts its own SSTables independently using its own memory budget. Compaction on core 3 does not affect query latency on core 7.
Userspace I/O scheduling. Seastar issues asynchronous direct I/O (via Linux AIO, or io_uring on newer kernels), bypassing the kernel page cache and scheduling I/O itself. Each core manages its own I/O queue with priority scheduling, so compaction I/O never starves read I/O on the same core.
I've found that when engineers hear "C++ rewrite of a Java database," they assume it is a marginal performance improvement. That is the wrong mental model. The shard-per-core architecture eliminates entire categories of overhead (lock contention, GC pauses, context switching) that dominate tail latency in thread-pool architectures. The improvement is structural, not incremental.
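The ownership idea behind shard-per-core is small enough to model in a toy sketch. Real ScyllaDB routes by token bits inside the partitioner, not by Python's `hash`; this only illustrates why state that is exclusively owned by one shard needs no locks:

```python
class ShardedStore:
    """Toy shard-per-core model: each shard owns private state, and a key
    is always routed to exactly one shard, so shards never share memory
    and never need locks. (Sketch; real ScyllaDB shards by token bits.)"""

    def __init__(self, n_shards: int):
        self.shards = [dict() for _ in range(n_shards)]

    def _shard_for(self, key) -> dict:
        # Deterministic routing: the same key always lands on the same shard
        return self.shards[hash(key) % len(self.shards)]

    def put(self, key, value) -> None:
        self._shard_for(key)[key] = value

    def get(self, key):
        return self._shard_for(key).get(key)
```

In a real engine each shard would also run its own event loop and I/O queue; the point here is only the routing invariant that makes cross-core synchronization unnecessary.
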
The Migration Path
Discord used a four-phase dual-write migration strategy. The key constraint: zero downtime and instant rollback at every phase. With 150 million monthly active users sending messages constantly, any migration that required a maintenance window was off the table.
Phase 1: Dual writes
The message service wrote every new message to both Cassandra and ScyllaDB simultaneously. Cassandra remained the source of truth for reads. ScyllaDB received shadow writes only.
Because ScyllaDB speaks the same CQL protocol, the dual-write implementation was minimal. The service opened a second connection pool to the ScyllaDB cluster and sent the same prepared statements. If a ScyllaDB write failed, it was logged but did not affect the user.
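In sketch form, the dual-write wrapper is little more than a second session and a swallowed exception. The names and interfaces below are assumptions (Discord's service code is not public); the sessions are assumed to expose an `execute(statement, params)` method, as CQL driver sessions do:

```python
import logging

log = logging.getLogger("dual_write")

class DualWriter:
    # 'primary' (Cassandra) and 'shadow' (ScyllaDB) are assumed session
    # objects with an execute(statement, params) method.
    def __init__(self, primary, shadow):
        self.primary = primary
        self.shadow = shadow

    def write(self, statement, params):
        # Primary write: failures still propagate to the caller as before
        result = self.primary.execute(statement, params)
        try:
            # Shadow write: same prepared statement, best-effort only
            self.shadow.execute(statement, params)
        except Exception:
            # A failed shadow write is logged, never user-visible
            log.warning("shadow write to ScyllaDB failed", exc_info=True)
        return result
```
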
This phase ran for several weeks. The team monitored ScyllaDB's write success rate, replication lag, and disk usage to validate that the cluster handled Discord's write throughput.
Phase 2: Backfill and validate
New messages were flowing into ScyllaDB, but the 100+ billion historical messages still only lived in Cassandra. The backfill worker read partitions from Cassandra in bulk and wrote them to ScyllaDB.
The backfill was not a simple bulk copy. It had to handle:
- Rate limiting to avoid overwhelming ScyllaDB while it was also absorbing live writes.
- Partition ordering to backfill recent data first (more likely to be queried).
- Idempotency because the backfill could be restarted from any checkpoint without creating duplicates.
A consistency checker ran in parallel, sampling random partitions from both databases and comparing results. This caught schema drift, encoding issues, and edge cases in the backfill logic.
The backfill was the longest phase of the migration. With 100+ billion rows, even at hundreds of thousands of rows per second, the process took weeks. The team prioritized recent data (last 30 days) first, since that covered the vast majority of active read traffic. Older historical data was backfilled in the background with lower priority.
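The three requirements above reduce to a small loop shape: page through the source, upsert into the destination, checkpoint, and sleep off any surplus throughput. A sketch under assumed interfaces (these callables are illustrations, not Discord's actual pipeline):

```python
import time

def backfill(read_page, write_rows, save_checkpoint, start_token,
             max_rows_per_sec=200_000):
    """Checkpointed, rate-limited copy loop (sketch).

    read_page(token) -> (rows, next_token or None at the end). CQL
    writes are upserts, so restarting from the last checkpoint cannot
    create duplicates -- idempotency comes from the data model."""
    token = start_token
    while token is not None:
        started = time.monotonic()
        rows, next_token = read_page(token)
        write_rows(rows)
        save_checkpoint(token)   # a restart re-copies at most one page
        # Rate limit: if the page finished too fast, sleep off the surplus
        min_duration = len(rows) / max_rows_per_sec
        elapsed = time.monotonic() - started
        if elapsed < min_duration:
            time.sleep(min_duration - elapsed)
        token = next_token
```
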
Interview insight: dual-write migrations
When proposing a database migration in an interview, always mention the dual-write pattern. Explain that you write to both old and new stores, backfill historical data, switch reads, then decommission. Mention that rollback is possible at every phase. This pattern applies to any "swap the storage layer" migration, not just Cassandra-to-ScyllaDB.
Phase 3: Switch reads
Once the backfill was complete and the consistency checker showed no discrepancies, Discord flipped the read path to ScyllaDB. Cassandra stayed warm as a fallback for the first week.
The message service had a feature flag controlling read routing. If ScyllaDB returned an error or timed out, the service fell back to Cassandra transparently. In practice, the fallback was rarely triggered. ScyllaDB's read latency was immediately and dramatically better than Cassandra's.
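A sketch of that routing logic, with the flag service and store interfaces as assumptions. Because both stores speak CQL, both sides of the fallback can expose the same fetch interface:

```python
def read_recent_messages(flags, scylla, cassandra, channel_id, bucket):
    """Feature-flagged read path with transparent fallback (sketch)."""
    if flags.enabled("messages_read_from_scylla"):
        try:
            return scylla.fetch(channel_id, bucket)
        except Exception:
            # Error or timeout: fall through to the warm Cassandra cluster
            pass
    return cassandra.fetch(channel_id, bucket)
```
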
I've seen teams rush this phase and skip the soak period. That is a mistake. The first week after switching reads is when you discover edge cases: queries against very old partitions, unusual access patterns from bots, tombstone-heavy channels where users bulk-deleted messages. Give the new system time to surface surprises.
Phase 4: Decommission Cassandra
After a 7-day soak period with ScyllaDB handling all reads and writes, Discord turned off Cassandra writes and began decommissioning the cluster. The 14-node Cassandra cluster was replaced by a 4-node ScyllaDB cluster handling the same (and growing) workload.
The total migration, from first dual-write to Cassandra decommission, took approximately 3 months.
The System After
The post-migration architecture looks deceptively similar on the surface. Same schema, same queries, same application code. The difference is entirely in the runtime and operational model underneath.
What changed operationally
The shift from Cassandra to ScyllaDB changed more than just latency numbers. The entire operational posture of the team shifted.
No more GC tuning. The team eliminated an entire category of operational work. No more JVM heap monitoring dashboards, no more G1GC flag experiments, no more "which node is in GC right now" alerts. The on-call burden for the message storage tier dropped significantly.
Simpler capacity planning. With Cassandra, capacity planning had to account for GC overhead, which consumed a variable percentage of CPU and memory depending on workload mix. ScyllaDB's resource consumption is more predictable because there is no background GC process competing for resources. If a node uses 60% CPU at current load, you can reliably estimate when you need to add capacity.
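As a worked example of that estimate, with illustrative numbers and the assumption that throughput scales roughly linearly with CPU once no GC process is consuming a variable share:

```python
def max_sustainable_rps(current_rps: float, cpu_util: float,
                        target_util: float = 0.8) -> float:
    # Linear extrapolation: requests/sec the node can absorb before
    # reaching the target utilization ceiling (illustrative model)
    return current_rps * (target_util / cpu_util)

# A node serving 600k rps at 60% CPU can sustain ~800k rps at 80%
headroom = max_sustainable_rps(600_000, 0.6)
```
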
Fewer nodes to manage. Going from 14 to 4 nodes means fewer machines to monitor, patch, and replace. It also means fewer failure domains: with 14 nodes, hardware failures were a monthly occurrence. With 4 nodes, individual node failures are less frequent, though the blast radius per failure is larger.
Compaction no longer affects reads. In Cassandra, compaction and read queries compete for the same JVM heap and I/O bandwidth. A compaction storm could degrade read latency even without GC involvement. In ScyllaDB, per-shard compaction runs on its own I/O budget and cannot starve read requests on the same core.
Monitoring shifted from JVM internals to system metrics. With Cassandra, the team needed deep JVM monitoring: heap utilization by generation, GC pause frequency and duration, promotion rates from young to old gen, and compaction pending bytes. With ScyllaDB, the relevant metrics are standard system-level concerns: CPU usage per core, disk I/O latency, and network throughput. This is a simpler mental model for on-call engineers.
Operational complexity is a real cost
When comparing databases, engineers focus on throughput and latency benchmarks. In practice, the operational burden (tuning, monitoring, incident response, capacity planning) often dominates total cost of ownership. A database that requires less tuning and produces fewer surprises is cheaper to run even if the license costs more.
The Results
Here are the concrete before-and-after numbers from Discord's published data:
| Metric | Cassandra (Before) | ScyllaDB (After) |
|---|---|---|
| p99 read latency | 40-300ms (spiky, GC-dependent) | ~15ms (stable) |
| p99 write latency | ~5ms average, spikes to 50ms+ | ~1ms (stable) |
| Node count | 14 nodes | 4 nodes |
| JVM heap per node | 8GB, heavily tuned G1GC | N/A (no JVM) |
| GC pause events | Multiple per hour across cluster | Zero |
| Compaction impact on reads | Measurable latency degradation | Isolated per-shard, no read impact |
| On-call GC-related incidents | Regular (weekly) | None |
| Application code changes | N/A | Zero (same CQL protocol) |
The p99 improvement was the headline number, but the operational simplification was arguably more valuable. The team reclaimed engineering hours previously spent on JVM tuning, GC monitoring, and incident response for GC-related latency spikes.
The node count reduction (14 to 4) also translated directly to infrastructure cost savings. Fewer nodes means lower compute bills, fewer disks to provision, and less network bandwidth consumed by intra-cluster replication. The reduction also simplified failure handling: with fewer nodes, the probability of hardware failure in any given week drops proportionally.
A subtlety worth highlighting: the improvement was not just in the median. The standard deviation of latency dropped dramatically. Cassandra's latency distribution was bimodal (fast when no GC, slow during GC). ScyllaDB's distribution was tight and unimodal. For user experience, predictable latency matters more than average latency. A p99 of 15ms that never spikes is better than a p50 of 2ms that occasionally jumps to 300ms.
For your interviews: the specific numbers matter less than the pattern. "Moving from a JVM-based to a native storage engine eliminated GC latency entirely and reduced our node count by 70%" is the kind of concrete, quantified claim that signals real operational experience.
What They'd Do Differently
Discord's engineers noted a few things they would adjust in hindsight:
Start the consistency checker earlier. The team built the consistency checker (which compared random partitions between Cassandra and ScyllaDB) during Phase 2. In hindsight, they would have built and deployed it in Phase 1, checking real-time writes as they happened. Earlier validation catches encoding mismatches and edge cases sooner.
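A sampling checker of the kind described is small enough to deploy on day one of dual writes. A sketch, where the reader callables are assumed interfaces that return comparable row lists for a partition key:

```python
import random

def check_consistency(read_cassandra, read_scylla, partition_keys,
                      sample_size=100, seed=None):
    # Sample random partitions and compare both stores row-for-row;
    # any mismatch signals schema drift, encoding bugs, or backfill gaps
    rng = random.Random(seed)
    sample = rng.sample(partition_keys, min(sample_size, len(partition_keys)))
    return [key for key in sample
            if read_cassandra(key) != read_scylla(key)]
```
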
Invest more in backfill observability. The backfill of 100+ billion rows took weeks. During that time, the team wanted better visibility into progress, throughput, and estimated completion time. A more instrumented backfill pipeline would have reduced anxiety and improved communication with stakeholders.
Evaluate ScyllaDB's Alternator (DynamoDB-compatible) API. ScyllaDB also offers a DynamoDB-compatible API called Alternator. Discord did not evaluate this because CQL compatibility was sufficient. But for teams already using DynamoDB, Alternator could be an interesting path to self-hosted, lower-latency wide-column storage.
Run a load test with production-shaped traffic before Phase 3. While the dual-write phase validated write throughput, it did not fully exercise ScyllaDB under Discord's read patterns. A replay of production read traffic against the ScyllaDB cluster would have caught more edge cases before switching the live read path.
I've noticed that every migration retrospective includes "we should have built better observability for the migration itself." This is nearly universal. If you are planning a data migration, invest in migration-specific dashboards and alerting from day one. Track rows migrated, rows remaining, error rate, throughput, and estimated completion time. It always pays for itself.
Architecture Decision Guide
When should you consider switching from a JVM-based database to a non-JVM alternative? Two questions capture the reasoning pattern from Discord's experience: is tail latency your binding constraint, and does a protocol-compatible alternative exist?
The key branching point is whether a protocol-compatible alternative exists. Discord's migration was fast (roughly 3 months) specifically because ScyllaDB speaks CQL natively. If they had needed to rewrite their entire data access layer, the cost-benefit analysis would have been very different.
Transferable Lessons
1. Protocol compatibility is your migration insurance
ScyllaDB being CQL-compatible meant Discord could swap the storage engine without touching application code. When choosing a database, prefer ecosystems with multiple compatible implementations. If your primary choice develops problems at scale, a protocol-compatible alternative lets you migrate without rewriting your data layer.
This applies beyond databases. GraphQL, gRPC, and OpenTelemetry all have multiple implementations. Designing around open protocols, not vendor-specific SDKs, gives you options when your first choice hits a ceiling.
2. GC problems are architectural, not configurable
Discord spent 6 months tuning Cassandra's JVM. The fundamental issue was that garbage-collected runtimes introduce non-deterministic latency under memory pressure. No amount of flag-tuning eliminates this; it only shifts the tradeoff curve.
My advice: if your interview design involves a JVM-based datastore at high throughput, acknowledge the GC tradeoff explicitly. "At very high write rates, JVM GC pauses may affect tail latency. We can mitigate with tuning, but for sub-10ms p99 guarantees, we might need a non-JVM engine." That sentence alone shows depth.
3. Shard-per-core eliminates categories of overhead
The shard-per-core model (used by ScyllaDB, Redpanda, and others) assigns each CPU core exclusive ownership of its data. No cross-core synchronization for most operations. This eliminates lock contention, reduces context switching, and makes performance more predictable.
In system design, "shared nothing" architectures consistently outperform "shared everything" designs at scale. This principle applies to database internals, microservice boundaries, and even team organization.
4. Dual-write migrations are the safe path for storage swaps
Discord's four-phase approach (dual-write, backfill, switch reads, decommission) provided rollback safety at every stage. This is the industry-standard pattern for swapping storage layers without downtime.
If you are migrating any stateful system, default to dual-write unless the migration is truly trivial. The cost of running two systems in parallel for a few months is almost always lower than the cost of a failed big-bang cutover.
5. Operational complexity is an underrated cost
Fourteen Cassandra nodes required more operational work than four ScyllaDB nodes, not because Cassandra is a bad database, but because JVM tuning, GC monitoring, and larger clusters create more surface area for problems. When comparing databases, factor in the operational burden: alerts, tuning knobs, failure modes, and on-call toil.
The cheapest database is not the one with the lowest license fee. It is the one that consumes the least engineering time to keep running. In interviews, when comparing two technologies, mentioning operational cost alongside performance benchmarks sets you apart from candidates who only discuss throughput numbers.
How This Shows Up in Interviews
This case study is directly relevant when you propose Cassandra (or any JVM-based datastore) in a system design interview. It demonstrates operational awareness beyond "Cassandra is good for wide-column workloads."
When to cite this case study: Any time you propose a high-write, latency-sensitive system using Cassandra or another JVM-based database. Mentioning the GC tradeoff proactively shows you understand operational realities, not just data modeling.
The sentence to say: "At very high throughput, JVM garbage collection in Cassandra can cause non-deterministic p99 latency spikes. Discord solved this by migrating to ScyllaDB, which is CQL-compatible but runs on C++ with no GC."
| Interviewer Asks | Strong Answer |
|---|---|
| "What database would you use for a messaging system?" | "Cassandra or ScyllaDB for the message store, partitioned by channel + time bucket. If latency SLAs are strict, ScyllaDB avoids JVM GC spikes that Cassandra hits at high throughput." |
| "What are the risks of using Cassandra at scale?" | "The main operational risk is JVM GC pauses under heavy write load. At Discord's scale (1M writes/sec), GC caused p99 spikes to 200-300ms. Tuning helps but doesn't eliminate the problem." |
| "How would you migrate between two databases?" | "Dual-write pattern: write to both old and new, backfill historical data, switch reads with a feature flag, soak test for a week, then decommission. Each phase has an instant rollback path." |
| "What is a shard-per-core architecture?" | "Each CPU core owns a subset of partitions exclusively. No cross-core locking for most operations. ScyllaDB and Redpanda use this model. It eliminates lock contention and gives predictable tail latency." |
| "Why not just scale horizontally?" | "More nodes means more chances that some node is in GC at any moment. Discord had 14 nodes and at least one was always pausing. The fix was eliminating GC, not adding more nodes to the lottery." |
Quick Recap
- Discord's Cassandra cluster (14 nodes, 100B+ messages) suffered non-deterministic JVM GC pauses causing p99 latency spikes up to 300ms.
- Six months of JVM tuning (G1GC parameters, off-heap caching, compaction strategies) reduced but never eliminated the problem.
- ScyllaDB is a C++ Cassandra rewrite using the Seastar framework: no JVM, no GC, shard-per-core architecture with per-shard compaction.
- CQL protocol compatibility meant the migration required zero application code changes.
- Discord used a four-phase dual-write migration: shadow writes, backfill, switch reads, decommission Cassandra.
- Post-migration results: p99 dropped from 40-300ms to ~15ms, node count reduced from 14 to 4, GC incidents dropped to zero.
- The transferable principle: runtime architecture (GC vs. manual memory, thread pool vs. shard-per-core) determines tail latency at scale more than query language or data model choices.
Related Concepts
- Databases - Covers the fundamentals of database selection, including wide-column stores like Cassandra and ScyllaDB, and when each model fits.
- Replication - Discord's Cassandra and ScyllaDB clusters both use multi-datacenter replication. Understanding replication strategies clarifies why node count and consistency settings matter.
- Sharding - ScyllaDB's shard-per-core model is a form of internal sharding. The principles of partition key design and data distribution apply directly to this case study's schema design.