Discord: Cassandra to ScyllaDB migration
How Discord migrated from Apache Cassandra to ScyllaDB to eliminate latency spikes caused by JVM garbage collection, achieving 2-5x throughput improvement at lower cost.
TL;DR
- Discord stored 100+ billion chat messages in Apache Cassandra, but JVM garbage collection pauses caused unpredictable p99 latency spikes up to 300ms.
- ScyllaDB, a C++ rewrite of Cassandra using the Seastar framework, eliminates GC entirely with a shard-per-core architecture.
- The migration required zero application code changes because ScyllaDB speaks the same CQL protocol as Cassandra.
- Post-migration results: p99 read latency dropped from 40-300ms to ~15ms, and node count shrank from 14 to 4.
- Transferable lesson: "protocol-compatible" does not mean "performance-equivalent." The runtime model underneath matters more than the query language on top.
The Trigger
By late 2019, Discord's on-call engineers had a recurring problem. Every few hours, one or more Cassandra nodes would enter a GC pause, and p99 read latencies would spike from single-digit milliseconds to 200-300ms. Users noticed: messages appeared to "hang" before loading, channels felt sluggish, and the experience degraded in ways that were hard to reproduce or predict.
The spikes correlated perfectly with JVM garbage collection cycles. Discord's Cassandra cluster held over 100 billion messages and handled roughly 1 million writes per second at peak. At that throughput, the JVM's heap was under constant memory pressure from compaction, bloom filter maintenance, and memtable flushes.
I've seen this exact pattern at multiple companies running Cassandra at scale. The JVM GC problem is not a bug; it is a fundamental property of running a high-throughput storage engine inside a garbage-collected runtime. The question is always: at what scale does it become unacceptable?
For Discord, the answer was "now." The team had already spent months tuning G1GC parameters, adjusting heap sizes, moving row caches off-heap, and experimenting with compaction strategies. None of it eliminated the spikes. It only made them less frequent and slightly less severe. The engineering team decided that incremental tuning had hit diminishing returns, and they needed a fundamentally different approach.
The System Before
Discord's message storage architecture was a fairly standard Cassandra deployment, well-designed for their access patterns but running into the JVM's hard limits at scale.
The message schema used a composite partition key designed for Discord's primary access pattern: loading the most recent messages in a channel.
```sql
-- Simplified Discord messages schema
CREATE TABLE messages (
    channel_id      bigint,
    bucket          int,      -- monthly time bucket
    message_id      bigint,   -- Snowflake ID, time-ordered
    author_id       bigint,
    content         text,
    has_attachments boolean,
    has_embeds      boolean,
    PRIMARY KEY ((channel_id, bucket), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
```
The channel_id + bucket partition key kept each partition to roughly one month of messages per channel. This prevented unbounded partition growth (a common Cassandra anti-pattern) and made "load recent messages" queries efficient, because the application reads only the most recent bucket or two rather than the channel's entire history.
The message_id clustering key used Discord's Snowflake ID format, which embeds a timestamp. Ordering by message_id DESC means the most recent messages sit at the top of each partition, so a "load channel" query reads sequentially from disk without random I/O.
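Because the Snowflake ID embeds a timestamp, the bucket can be derived from the message ID itself, so the write path and the read path always agree on the partition without a separate lookup. A minimal sketch in Python — the 42-bit timestamp layout and the Discord epoch are publicly documented Snowflake details, but the monthly bucket width here is illustrative, not Discord's actual value:

```python
DISCORD_EPOCH_MS = 1_420_070_400_000          # 2015-01-01, Discord's Snowflake epoch
BUCKET_WIDTH_MS = 30 * 24 * 60 * 60 * 1000    # "monthly" bucket width (illustrative)

def snowflake_timestamp_ms(message_id: int) -> int:
    # The top 42 bits of a Snowflake encode milliseconds since the epoch
    return (message_id >> 22) + DISCORD_EPOCH_MS

def bucket_for(message_id: int) -> int:
    # Both writers and readers compute the same bucket from the ID,
    # e.g. for: SELECT ... WHERE channel_id = ? AND bucket = ? LIMIT 50
    return (snowflake_timestamp_ms(message_id) - DISCORD_EPOCH_MS) // BUCKET_WIDTH_MS
```
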
This schema design was solid. The problem was never the data model. It was the runtime.
How the JVM GC problem manifested
Cassandra's JVM process manages a massive heap: memtables (in-memory write buffers), bloom filters (one per SSTable), row cache entries, and internal metadata. Under Discord's workload, the heap churn was enormous.
Here is the cycle that repeated every few hours on every node:
- Memtable flushes generate short-lived objects at a high rate. G1GC's young generation collections handle most of these, but each flush also creates longer-lived SSTable metadata.
- Compaction reads multiple SSTables, merges them, and writes new ones. During compaction, the JVM allocates large temporary buffers that survive young-gen collection and promote to old-gen.
- Old-gen fills up. G1GC triggers a mixed collection or, in the worst case, a full GC. During this pause, all application threads stop. No reads, no writes, no heartbeats.
- Request pile-up. Clients with pending requests to the paused node see timeouts. The coordinator retries to another replica, adding load to healthy nodes.
The pause duration depended on heap size and live object count. With 8GB heaps, pauses ranged from 50ms to 300ms. Larger heaps made pauses less frequent but longer when they hit. Smaller heaps made pauses shorter but more frequent. There was no winning configuration.
The math behind "some node is always pausing"
This is worth spelling out because it illustrates a broader systems principle. With 14 nodes in the cluster, each node pausing for ~200ms every 30 minutes, you can calculate the probability that at least one node is paused at any given moment.
Each node spends roughly 200ms / (30 * 60 * 1000ms) = 0.011% of its time in GC. The probability that a specific node is not in GC at a given instant is 99.989%. The probability that all 14 nodes are not in GC simultaneously is 0.99989^14 = 99.85%. So about 0.15% of the time, at least one node is paused.
That sounds small, but at 1 million requests per second, 0.15% means roughly 1,500 requests per second hitting a paused node. And those requests do not fail gracefully: they queue, retry, and create cascading load on other nodes. During peak hours with more frequent GC cycles, the probability was significantly higher.
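The arithmetic above is easy to reproduce. Plugging in the same assumptions (a 200ms pause every 30 minutes per node, 14 nodes, 1 million requests per second):

```python
pause_ms = 200
interval_ms = 30 * 60 * 1000

p_node_paused = pause_ms / interval_ms          # ~0.011% of time, per node
p_any_paused = 1 - (1 - p_node_paused) ** 14    # ~0.15% cluster-wide
affected_rps = 1_000_000 * p_any_paused         # ~1,500 requests/sec

print(f"{p_any_paused:.4%} of the time, at least one node is paused")
print(f"~{affected_rps:.0f} requests/sec land on a paused node")
```
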
GC tuning is asymptotic, not linear
Every JVM GC tuning parameter trades one dimension for another. Increasing heap reduces pause frequency but increases pause duration. Off-heap caching reduces heap pressure but introduces manual memory management complexity. Discord found that after 6 months of tuning, they were chasing single-percentage improvements at the cost of significant engineering time.
Tuning could thin the pauses out, but it could never remove them. For a service that needs consistent sub-10ms latency to feel "instant" to users typing messages, that was a dealbreaker.
Why Not Just Tune Cassandra Harder?
The obvious question: why not keep tuning the JVM? The engineering team explored every standard approach before deciding to migrate. Here is what they tried and why each fell short.
G1GC parameter optimization. Discord experimented extensively with G1GC settings: -XX:MaxGCPauseMillis, -XX:G1HeapRegionSize, -XX:InitiatingHeapOccupancyPercent, and dozens more. Each tweak improved one metric while degrading another. Reducing MaxGCPauseMillis from 200ms to 50ms caused more frequent collections that consumed more total CPU. The team estimated they spent roughly 6 months on JVM tuning alone.
Off-heap row cache. Cassandra supports moving its row cache out of the JVM heap, which reduces GC pressure. Discord enabled this, and it helped. But the bloom filters, memtables, and compaction buffers still lived on-heap. Off-heap caching addressed maybe 30% of the problem.
Heap size adjustments. Larger heaps (12-16GB) delayed GC pauses but made each pause longer when it arrived. The p99 improved slightly, but the p999 got worse. For Discord's use case (real-time chat), the worst-case latency matters more than the average case.
Compaction strategy changes. Switching from SizeTieredCompactionStrategy to LeveledCompactionStrategy reduced write amplification but increased disk I/O and, paradoxically, created more compaction-related heap allocations. The JVM was doing more work more consistently, which meant GC pressure was smoother but never absent.
The JVM GC wall
This is a well-documented pattern in the Cassandra community. DataStax, LinkedIn, and Apple have all published engineering blogs about JVM GC challenges at scale. The problem is not Cassandra-specific per se; it is JVM-specific, affecting any storage engine doing heavy allocation under sustained throughput. Elasticsearch, Kafka, and HBase all face variants of this problem.
My recommendation for interviews: if you mention Cassandra in a design, know that JVM GC is the operational ceiling. You do not need to memorize tuning parameters. Just acknowledge that garbage-collected runtimes introduce non-deterministic latency under memory pressure, and that this becomes visible at high throughput. That single sentence shows operational maturity.
The team also considered vertical scaling (bigger machines) and horizontal scaling (more nodes). More nodes would distribute load but also increase the probability that some node is always in GC. Vertical scaling (more RAM per node) would make individual pauses worse when they occurred. Neither addressed the root cause.
The root cause was architectural: a managed-memory runtime running a storage engine that allocates and deallocates at extreme rates. No configuration change fixes that. You either accept the latency profile or change the runtime.
Here is a summary of what the team tried and why each approach hit a ceiling:
| Tuning Approach | Improvement | Why It Wasn't Enough |
|---|---|---|
| G1GC parameter optimization | ~20% fewer pauses | Pauses still occurred, just less often |
| Off-heap row cache | ~30% less heap pressure | Memtables and bloom filters still on-heap |
| Larger heap (12-16GB) | Fewer pauses overall | Individual pauses lasted longer (worse p999) |
| LeveledCompactionStrategy | Smoother compaction load | More compaction I/O, more heap allocations |
| More nodes (horizontal scale) | Lower per-node load | Higher chance that some node is in a GC pause at any moment |
The Decision
Discord evaluated three options:
| Option | Pros | Cons |
|---|---|---|
| Keep Cassandra, accept GC latency | Zero migration effort | p99 stays unpredictable, on-call burden |
| Switch to a different DB (DynamoDB, CockroachDB) | Different performance profile | Requires rewriting all data access code |
| Switch to ScyllaDB | Same CQL protocol, no GC, shard-per-core | Newer project, smaller community |
The team also briefly considered managed Cassandra offerings (DataStax Astra, Amazon Keyspaces). These services abstract away cluster management but still run on the JVM under the hood. They would not solve the fundamental GC problem, just move it behind an API. The latency profile would remain the same.
ScyllaDB won because of one critical property: CQL protocol compatibility. ScyllaDB speaks the exact same query language and wire protocol as Cassandra. Discord's application code (the CQL queries, prepared statements, and connection pooling) needed zero changes. The migration was purely an infrastructure swap.
What makes ScyllaDB different
ScyllaDB is not a fork of Cassandra. It is a from-scratch rewrite in C++ using the Seastar framework. The architectural differences are fundamental:
Shard-per-core model. ScyllaDB assigns each CPU core exclusive ownership of a memory region and a partition range. Core 0 owns partitions A-F, core 1 owns G-L, and so on. Each core runs a single event loop (like Node.js, but for a database). There are no locks, no mutexes, no context switches between cores for most operations.
No garbage collection. Memory is managed manually through C++ RAII patterns and Seastar's memory allocator. Allocation and deallocation are deterministic. There are no stop-the-world pauses, no heap scanning, no generational promotion.
Per-shard compaction. In Cassandra, compaction is a cluster-wide concern that competes with query threads for JVM heap and CPU. In ScyllaDB, each shard compacts its own SSTables independently using its own memory budget. Compaction on core 3 does not affect query latency on core 7.
Userspace I/O scheduling. Seastar issues asynchronous direct I/O (via Linux AIO, or io_uring on newer kernels), bypassing the kernel page cache and scheduling I/O itself. Each core manages its own I/O queue with priority scheduling, so compaction I/O never starves read I/O on the same core.
I've found that when engineers hear "C++ rewrite of a Java database," they assume it is a marginal performance improvement. That is the wrong mental model. The shard-per-core architecture eliminates entire categories of overhead (lock contention, GC pauses, context switching) that dominate tail latency in thread-pool architectures. The improvement is structural, not incremental.
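The ownership idea behind shard-per-core is small enough to model in a toy sketch. Real ScyllaDB routes by token bits inside the partitioner, not by Python's `hash`; this only illustrates why state that is exclusively owned by one shard needs no locks:

```python
class ShardedStore:
    """Toy shard-per-core model: each shard owns private state, and a key
    is always routed to exactly one shard, so shards never share memory
    and never need locks. (Sketch; real ScyllaDB shards by token bits.)"""

    def __init__(self, n_shards: int):
        self.shards = [dict() for _ in range(n_shards)]

    def _shard_for(self, key) -> dict:
        # Deterministic routing: the same key always lands on the same shard
        return self.shards[hash(key) % len(self.shards)]

    def put(self, key, value) -> None:
        self._shard_for(key)[key] = value

    def get(self, key):
        return self._shard_for(key).get(key)
```

In a real engine each shard would also run its own event loop and I/O queue; the point here is only the routing invariant that makes cross-core synchronization unnecessary.
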
The Migration Path
Discord used a four-phase dual-write migration strategy. The key constraint: zero downtime and instant rollback at every phase. With 150 million monthly active users sending messages constantly, any migration that required a maintenance window was off the table.
Phase 1: Dual writes
The message service wrote every new message to both Cassandra and ScyllaDB simultaneously. Cassandra remained the source of truth for reads. ScyllaDB received shadow writes only.
Because ScyllaDB speaks the same CQL protocol, the dual-write implementation was minimal. The service opened a second connection pool to the ScyllaDB cluster and sent the same prepared statements. If a ScyllaDB write failed, it was logged but did not affect the user.
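In sketch form, the dual-write wrapper is little more than a second session and a swallowed exception. The names and interfaces below are assumptions (Discord's service code is not public); the sessions are assumed to expose an `execute(statement, params)` method, as CQL driver sessions do:

```python
import logging

log = logging.getLogger("dual_write")

class DualWriter:
    # 'primary' (Cassandra) and 'shadow' (ScyllaDB) are assumed session
    # objects with an execute(statement, params) method.
    def __init__(self, primary, shadow):
        self.primary = primary
        self.shadow = shadow

    def write(self, statement, params):
        # Primary write: failures still propagate to the caller as before
        result = self.primary.execute(statement, params)
        try:
            # Shadow write: same prepared statement, best-effort only
            self.shadow.execute(statement, params)
        except Exception:
            # A failed shadow write is logged, never user-visible
            log.warning("shadow write to ScyllaDB failed", exc_info=True)
        return result
```
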
This phase ran for several weeks. The team monitored ScyllaDB's write success rate, replication lag, and disk usage to validate that the cluster handled Discord's write throughput.
Phase 2: Backfill and validate
New messages were flowing into ScyllaDB, but the 100+ billion historical messages still only lived in Cassandra. The backfill worker read partitions from Cassandra in bulk and wrote them to ScyllaDB.
The backfill was not a simple bulk copy. It had to handle:
- Rate limiting to avoid overwhelming ScyllaDB while it was also absorbing live writes.
- Partition ordering to backfill recent data first (more likely to be queried).
- Idempotency because the backfill could be restarted from any checkpoint without creating duplicates.
A consistency checker ran in parallel, sampling random partitions from both databases and comparing results. This caught schema drift, encoding issues, and edge cases in the backfill logic.
The backfill was the longest phase of the migration. With 100+ billion rows, even at hundreds of thousands of rows per second, the process took weeks. The team prioritized recent data (last 30 days) first, since that covered the vast majority of active read traffic. Older historical data was backfilled in the background with lower priority.
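The three requirements above reduce to a small loop shape: page through the source, upsert into the destination, checkpoint, and sleep off any surplus throughput. A sketch under assumed interfaces (these callables are illustrations, not Discord's actual pipeline):

```python
import time

def backfill(read_page, write_rows, save_checkpoint, start_token,
             max_rows_per_sec=200_000):
    """Checkpointed, rate-limited copy loop (sketch).

    read_page(token) -> (rows, next_token or None at the end). CQL
    writes are upserts, so restarting from the last checkpoint cannot
    create duplicates -- idempotency comes from the data model."""
    token = start_token
    while token is not None:
        started = time.monotonic()
        rows, next_token = read_page(token)
        write_rows(rows)
        save_checkpoint(token)   # a restart re-copies at most one page
        # Rate limit: if the page finished too fast, sleep off the surplus
        min_duration = len(rows) / max_rows_per_sec
        elapsed = time.monotonic() - started
        if elapsed < min_duration:
            time.sleep(min_duration - elapsed)
        token = next_token
```
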
Interview insight: dual-write migrations
When proposing a database migration in an interview, always mention the dual-write pattern. Explain that you write to both old and new stores, backfill historical data, switch reads, then decommission. Mention that rollback is possible at every phase. This pattern applies to any "swap the storage layer" migration, not just Cassandra-to-ScyllaDB.
Phase 3: Switch reads
Once the backfill was complete and the consistency checker showed no discrepancies, Discord flipped the read path to ScyllaDB. Cassandra stayed warm as a fallback for the first week.
The message service had a feature flag controlling read routing. If ScyllaDB returned an error or timed out, the service fell back to Cassandra transparently. In practice, the fallback was rarely triggered. ScyllaDB's read latency was immediately and dramatically better than Cassandra's.
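A sketch of that routing logic, with the flag service and store interfaces as assumptions. Because both stores speak CQL, both sides of the fallback can expose the same fetch interface:

```python
def read_recent_messages(flags, scylla, cassandra, channel_id, bucket):
    """Feature-flagged read path with transparent fallback (sketch)."""
    if flags.enabled("messages_read_from_scylla"):
        try:
            return scylla.fetch(channel_id, bucket)
        except Exception:
            # Error or timeout: fall through to the warm Cassandra cluster
            pass
    return cassandra.fetch(channel_id, bucket)
```
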
I've seen teams rush this phase and skip the soak period. That is a mistake. The first week after switching reads is when you discover edge cases: queries against very old partitions, unusual access patterns from bots, tombstone-heavy channels where users bulk-deleted messages. Give the new system time to surface surprises.
Phase 4: Decommission Cassandra
After a 7-day soak period with ScyllaDB handling all reads and writes, Discord turned off Cassandra writes and began decommissioning the cluster. The 14-node Cassandra cluster was replaced by a 4-node ScyllaDB cluster handling the same (and growing) workload.
The total migration, from first dual-write to Cassandra decommission, took approximately 3 months.
The System After
The post-migration architecture looks deceptively similar on the surface. Same schema, same queries, same application code. The difference is entirely in the runtime and operational model underneath.
What changed operationally
The shift from Cassandra to ScyllaDB changed more than just latency numbers. The entire operational posture of the team shifted.
No more GC tuning. The team eliminated an entire category of operational work. No more JVM heap monitoring dashboards, no more G1GC flag experiments, no more "which node is in GC right now" alerts. The on-call burden for the message storage tier dropped significantly.
Simpler capacity planning. With Cassandra, capacity planning had to account for GC overhead, which consumed a variable percentage of CPU and memory depending on workload mix. ScyllaDB's resource consumption is more predictable because there is no background GC process competing for resources. If a node uses 60% CPU at current load, you can reliably estimate when you need to add capacity.
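As a worked example of that estimate, with illustrative numbers and the assumption that throughput scales roughly linearly with CPU once no GC process is consuming a variable share:

```python
def max_sustainable_rps(current_rps: float, cpu_util: float,
                        target_util: float = 0.8) -> float:
    # Linear extrapolation: requests/sec the node can absorb before
    # reaching the target utilization ceiling (illustrative model)
    return current_rps * (target_util / cpu_util)

# A node serving 600k rps at 60% CPU can sustain ~800k rps at 80%
headroom = max_sustainable_rps(600_000, 0.6)
```
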
Fewer nodes to manage. Going from 14 to 4 nodes means fewer machines to monitor, patch, and replace. It also means fewer failure domains: with 14 nodes, hardware failures were a monthly occurrence. With 4 nodes, individual node failures are less frequent, though the blast radius per failure is larger.
Compaction no longer affects reads. In Cassandra, compaction and read queries compete for the same JVM heap and I/O bandwidth. A compaction storm could degrade read latency even without GC involvement. In ScyllaDB, per-shard compaction runs on its own I/O budget and cannot starve read requests on the same core.
Monitoring shifted from JVM internals to system metrics. With Cassandra, the team needed deep JVM monitoring: heap utilization by generation, GC pause frequency and duration, promotion rates from young to old gen, and compaction pending bytes. With ScyllaDB, the relevant metrics are standard system-level concerns: CPU usage per core, disk I/O latency, and network throughput. This is a simpler mental model for on-call engineers.
Operational complexity is a real cost
When comparing databases, engineers focus on throughput and latency benchmarks. In practice, the operational burden (tuning, monitoring, incident response, capacity planning) often dominates total cost of ownership. A database that requires less tuning and produces fewer surprises is cheaper to run even if the license costs more.
The Results
Here are the concrete before-and-after numbers from Discord's published data:
| Metric | Cassandra (Before) | ScyllaDB (After) |
|---|---|---|
| p99 read latency | 40-300ms (spiky, GC-dependent) | ~15ms (stable) |
| p99 write latency | ~5ms average, spikes to 50ms+ | ~1ms (stable) |
| Node count | 14 nodes | 4 nodes |
| JVM heap per node | 8GB, heavily tuned G1GC | N/A (no JVM) |
| GC pause events | Multiple per hour across cluster | Zero |
| Compaction impact on reads | Measurable latency degradation | Isolated per-shard, no read impact |
| On-call GC-related incidents | Regular (weekly) | None |
| Application code changes | N/A | Zero (same CQL protocol) |
The p99 improvement was the headline number, but the operational simplification was arguably more valuable. The team reclaimed engineering hours previously spent on JVM tuning, GC monitoring, and incident response for GC-related latency spikes.
The node count reduction (14 to 4) also translated directly to infrastructure cost savings. Fewer nodes means lower compute bills, fewer disks to provision, and less network bandwidth consumed by intra-cluster replication. The reduction also simplified failure handling: with fewer nodes, the probability of hardware failure in any given week drops proportionally.
A subtlety worth highlighting: the improvement was not just in the median. The standard deviation of latency dropped dramatically. Cassandra's latency distribution was bimodal (fast when no GC, slow during GC). ScyllaDB's distribution was tight and unimodal. For user experience, predictable latency matters more than average latency. A p99 of 15ms that never spikes is better than a p50 of 2ms that occasionally jumps to 300ms.
For your interviews: the specific numbers matter less than the pattern. "Moving from a JVM-based to a native storage engine eliminated GC latency entirely and reduced our node count by 70%" is the kind of concrete, quantified claim that signals real operational experience.
What They'd Do Differently
Discord's engineers noted a few things they would adjust in hindsight:
Start the consistency checker earlier. The team built the consistency checker (which compared random partitions between Cassandra and ScyllaDB) during Phase 2. In hindsight, they would have built and deployed it in Phase 1, checking real-time writes as they happened. Earlier validation catches encoding mismatches and edge cases sooner.
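A sampling checker of the kind described is small enough to deploy on day one of dual writes. A sketch, where the reader callables are assumed interfaces that return comparable row lists for a partition key:

```python
import random

def check_consistency(read_cassandra, read_scylla, partition_keys,
                      sample_size=100, seed=None):
    # Sample random partitions and compare both stores row-for-row;
    # any mismatch signals schema drift, encoding bugs, or backfill gaps
    rng = random.Random(seed)
    sample = rng.sample(partition_keys, min(sample_size, len(partition_keys)))
    return [key for key in sample
            if read_cassandra(key) != read_scylla(key)]
```
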
Invest more in backfill observability. The backfill of 100+ billion rows took weeks. During that time, the team wanted better visibility into progress, throughput, and estimated completion time. A more instrumented backfill pipeline would have reduced anxiety and improved communication with stakeholders.
Evaluate ScyllaDB's Alternator (DynamoDB-compatible) API. ScyllaDB also offers a DynamoDB-compatible API called Alternator. Discord did not evaluate this because CQL compatibility was sufficient. But for teams already using DynamoDB, Alternator could be an interesting path to self-hosted, lower-latency wide-column storage.
Run a load test with production-shaped traffic before Phase 3. While the dual-write phase validated write throughput, it did not fully exercise ScyllaDB under Discord's read patterns. A replay of production read traffic against the ScyllaDB cluster would have caught more edge cases before switching the live read path.
I've noticed that every migration retrospective includes "we should have built better observability for the migration itself." This is nearly universal. If you are planning a data migration, invest in migration-specific dashboards and alerting from day one. Track rows migrated, rows remaining, error rate, throughput, and estimated completion time. It always pays for itself.
Architecture Decision Guide
When should you consider switching from a JVM-based database to a non-JVM alternative? Two questions capture the reasoning pattern from Discord's experience: is tail latency your binding constraint, and does a protocol-compatible alternative exist?
The key branching point is whether a protocol-compatible alternative exists. Discord's migration was fast (roughly 3 months) specifically because ScyllaDB speaks CQL natively. If they had needed to rewrite their entire data access layer, the cost-benefit analysis would have been very different.
Transferable Lessons
1. Protocol compatibility is your migration insurance
ScyllaDB being CQL-compatible meant Discord could swap the storage engine without touching application code. When choosing a database, prefer ecosystems with multiple compatible implementations. If your primary choice develops problems at scale, a protocol-compatible alternative lets you migrate without rewriting your data layer.
This applies beyond databases. GraphQL, gRPC, and OpenTelemetry all have multiple implementations. Designing around open protocols, not vendor-specific SDKs, gives you options when your first choice hits a ceiling.
2. GC problems are architectural, not configurable
Discord spent 6 months tuning Cassandra's JVM. The fundamental issue was that garbage-collected runtimes introduce non-deterministic latency under memory pressure. No amount of flag-tuning eliminates this; it only shifts the tradeoff curve.
My advice: if your interview design involves a JVM-based datastore at high throughput, acknowledge the GC tradeoff explicitly. "At very high write rates, JVM GC pauses may affect tail latency. We can mitigate with tuning, but for sub-10ms p99 guarantees, we might need a non-JVM engine." That sentence alone shows depth.
3. Shard-per-core eliminates categories of overhead
The shard-per-core model (used by ScyllaDB, Redpanda, and others) assigns each CPU core exclusive ownership of its data. No cross-core synchronization for most operations. This eliminates lock contention, reduces context switching, and makes performance more predictable.
In system design, "shared nothing" architectures consistently outperform "shared everything" designs at scale. This principle applies to database internals, microservice boundaries, and even team organization.
4. Dual-write migrations are the safe path for storage swaps
Discord's four-phase approach (dual-write, backfill, switch reads, decommission) provided rollback safety at every stage. This is the industry-standard pattern for swapping storage layers without downtime.
If you are migrating any stateful system, default to dual-write unless the migration is truly trivial. The cost of running two systems in parallel for a few months is almost always lower than the cost of a failed big-bang cutover.
5. Operational complexity is an underrated cost
Fourteen Cassandra nodes required more operational work than four ScyllaDB nodes, not because Cassandra is a bad database, but because JVM tuning, GC monitoring, and larger clusters create more surface area for problems. When comparing databases, factor in the operational burden: alerts, tuning knobs, failure modes, and on-call toil.
The cheapest database is not the one with the lowest license fee. It is the one that consumes the least engineering time to keep running. In interviews, when comparing two technologies, mentioning operational cost alongside performance benchmarks sets you apart from candidates who only discuss throughput numbers.
How This Shows Up in Interviews
This case study is directly relevant when you propose Cassandra (or any JVM-based datastore) in a system design interview. It demonstrates operational awareness beyond "Cassandra is good for wide-column workloads."
When to cite this case study: Any time you propose a high-write, latency-sensitive system using Cassandra or another JVM-based database. Mentioning the GC tradeoff proactively shows you understand operational realities, not just data modeling.
The sentence to say: "At very high throughput, JVM garbage collection in Cassandra can cause non-deterministic p99 latency spikes. Discord solved this by migrating to ScyllaDB, which is CQL-compatible but runs on C++ with no GC."
| Interviewer Asks | Strong Answer |
|---|---|
| "What database would you use for a messaging system?" | "Cassandra or ScyllaDB for the message store, partitioned by channel + time bucket. If latency SLAs are strict, ScyllaDB avoids JVM GC spikes that Cassandra hits at high throughput." |
| "What are the risks of using Cassandra at scale?" | "The main operational risk is JVM GC pauses under heavy write load. At Discord's scale (1M writes/sec), GC caused p99 spikes to 200-300ms. Tuning helps but doesn't eliminate the problem." |
| "How would you migrate between two databases?" | "Dual-write pattern: write to both old and new, backfill historical data, switch reads with a feature flag, soak test for a week, then decommission. Each phase has an instant rollback path." |
| "What is a shard-per-core architecture?" | "Each CPU core owns a subset of partitions exclusively. No cross-core locking for most operations. ScyllaDB and Redpanda use this model. It eliminates lock contention and gives predictable tail latency." |
| "Why not just scale horizontally?" | "More nodes means more chances that some node is in GC at any moment. Discord had 14 nodes and at least one was always pausing. The fix was eliminating GC, not adding more nodes to the lottery." |
Quick Recap
- Discord's Cassandra cluster (14 nodes, 100B+ messages) suffered non-deterministic JVM GC pauses causing p99 latency spikes up to 300ms.
- Six months of JVM tuning (G1GC parameters, off-heap caching, compaction strategies) reduced but never eliminated the problem.
- ScyllaDB is a C++ Cassandra rewrite using the Seastar framework: no JVM, no GC, shard-per-core architecture with per-shard compaction.
- CQL protocol compatibility meant the migration required zero application code changes.
- Discord used a four-phase dual-write migration: shadow writes, backfill, switch reads, decommission Cassandra.
- Post-migration results: p99 dropped from 40-300ms to ~15ms, node count reduced from 14 to 4, GC incidents dropped to zero.
- The transferable principle: runtime architecture (GC vs. manual memory, thread pool vs. shard-per-core) determines tail latency at scale more than query language or data model choices.
Related Concepts
- Databases - Covers the fundamentals of database selection, including wide-column stores like Cassandra and ScyllaDB, and when each model fits.
- Replication - Discord's Cassandra and ScyllaDB clusters both use multi-datacenter replication. Understanding replication strategies clarifies why node count and consistency settings matter.
- Sharding - ScyllaDB's shard-per-core model is a form of internal sharding. The principles of partition key design and data distribution apply directly to this case study's schema design.