Figma's multiplayer architecture
How Figma built real-time collaborative design on top of a Rust-based operational transformation engine, scaling to millions of simultaneous editors with sub-100ms sync latency.
TL;DR
- Figma rewrote their multiplayer engine from Node.js to Rust to eliminate GC pauses and achieve 3-5x throughput per server for CPU-intensive document operations.
- Collaborative vector graphics is fundamentally harder than collaborative text because elements have spatial positions, z-order stacking, and parent-child nesting that all conflict independently.
- Figma uses server-sequenced operational transformation: each client applies operations optimistically, the server assigns a global order, and clients reconcile on divergence.
- Every document lives in exactly one server process. The horizontal scaling unit is active documents, not connections.
- Popular community files handle ~5,000 simultaneous editors on a single server, making "hot documents" the primary scaling bottleneck.
- Transferable lesson: when your collaboration domain has multi-dimensional conflicts (not just text insertion), a central sequencer with deterministic resolution rules is simpler and more predictable than a full CRDT.
The Trigger
By 2019, Figma's multiplayer editing had gone from a clever feature to the defining product experience. Design teams at Microsoft, Uber, and Airbnb were running live collaborative sessions with dozens of participants on a single file. Community files (shared public templates) regularly attracted hundreds to thousands of simultaneous viewers and editors.
The Node.js multiplayer engine was buckling under this load. GC pauses in V8 caused latency spikes on large documents with 10,000+ design nodes. Walking the document tree to apply operations was CPU-bound work that competed with the single-threaded event loop. Memory usage per document was unpredictable because JavaScript object overhead made large in-memory graph structures expensive.
Here is the core tension: Figma's product is a canvas, not a text editor. A typical Figma design file contains thousands of vector nodes organized in a hierarchical tree with spatial coordinates, rotation, opacity, constraints, and parent-child relationships. Every collaborative operation requires traversing or mutating this tree. That makes the multiplayer engine CPU-intensive in a way that, say, a chat server's WebSocket layer never is.
The team had a choice: optimize the Node.js engine incrementally or rewrite the multiplayer core in a systems language. I've seen teams agonize over this kind of rewrite decision for months. Figma's path was clear because the bottleneck was the runtime itself, not the application logic.
Why collaborative vector graphics is harder than text
Before diving into the architecture, it is worth understanding why Figma's problem is fundamentally harder than the one Google Docs or Notion solves.
Collaborative text has one primary conflict dimension: two users inserting characters at the same position. The merge semantics are well-understood (decades of OT research for linear text).
Collaborative vector graphics has at least five independent conflict dimensions:
| Conflict type | Example | Resolution complexity |
|---|---|---|
| Spatial positioning | Two users drag the same rectangle to different coordinates | LWW per axis (x, y independently) |
| Z-order stacking | Two users move elements within the same layer stack | Index adjustment after sequencing |
| Parent-child nesting | Two users move the same element into different parent groups | LWW on parent assignment |
| Group bounding boxes | Two users move different children of the same group simultaneously | Derived recalculation after all ops apply |
| Constraint propagation | Resizing a frame updates constrained children | All constraint recalculations must converge identically |
Each of these conflict types requires its own resolution rule. And they interact: moving a node into a new parent changes its spatial coordinates (relative to the new parent's frame), which changes the parent's bounding box, which may trigger constraint recalculation on sibling nodes. The resolution logic must handle all these cascades deterministically.
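To make the "LWW per axis" rule from the table concrete, here is a minimal sketch. It assumes every write carries the sequence number assigned by the server; the `Lww` and `Position` types are hypothetical names for illustration, not Figma's actual data model.

```rust
/// A property value tagged with the server sequence number that last wrote it.
/// (Hypothetical type for illustration.)
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Lww {
    pub value: f64,
    pub seq: u64,
}

impl Lww {
    /// Accept the write only if it was sequenced after the current value.
    /// Stale or duplicate writes are ignored, so replays are harmless.
    pub fn write(&mut self, value: f64, seq: u64) {
        if seq > self.seq {
            self.value = value;
            self.seq = seq;
        }
    }
}

/// x and y are resolved independently, so a later write to one axis
/// never discards an earlier write to the other.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Position {
    pub x: Lww,
    pub y: Lww,
}

impl Position {
    pub fn origin() -> Self {
        let zero = Lww { value: 0.0, seq: 0 };
        Position { x: zero, y: zero }
    }
}
```

If one user's drag writes both axes at sequence 1 and another user's nudge writes only `y` at sequence 2, the result keeps the first user's `x` and the second user's `y`.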
This is why Figma's multiplayer engine is CPU-intensive. It is not just relaying messages between WebSocket clients. It is walking a complex tree, applying operations, resolving conflicts, propagating constraints, and recomputing derived state on every single operation.
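The "derived recalculation" row in the table can be sketched the same way. The idea is that a group's bounding box is never written directly by clients; it is recomputed from the children after the sequenced operations have been applied, so every replica arrives at the same box. The `Rect` type and `bounding_box` function below are illustrative assumptions.

```rust
/// Axis-aligned rectangle in the parent's coordinate space (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Rect {
    pub x: f64,
    pub y: f64,
    pub w: f64,
    pub h: f64,
}

/// Smallest rectangle containing every child; `None` for an empty group.
/// Because this is a pure function of the children, it does not matter
/// which user moved which child, only what the children look like now.
pub fn bounding_box(children: &[Rect]) -> Option<Rect> {
    let first = children.first()?;
    let (mut min_x, mut min_y) = (first.x, first.y);
    let (mut max_x, mut max_y) = (first.x + first.w, first.y + first.h);
    for c in &children[1..] {
        min_x = min_x.min(c.x);
        min_y = min_y.min(c.y);
        max_x = max_x.max(c.x + c.w);
        max_y = max_y.max(c.y + c.h);
    }
    Some(Rect { x: min_x, y: min_y, w: max_x - min_x, h: max_y - min_y })
}
```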
The System Before
Figma's original multiplayer stack was a Node.js service handling WebSocket connections. Every open document had a corresponding process holding the full document state in memory. The process received operations from connected clients, sequenced them, and broadcast the results.
The architecture was built around a key insight: collaborative editing on a single document is fundamentally a single-writer problem. Only one server needs to sequence operations for a given document. This simplification avoided the complexity of distributed coordination protocols.
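A single-writer sequencer is small enough to sketch in a few lines. This is an illustration under stated assumptions, not Figma's code: `ClientOp`, `SequencedOp`, and `DocumentSequencer` are hypothetical names, and the payload is a plain string to keep the example self-contained.

```rust
/// An operation as received from a client, before ordering (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct ClientOp {
    pub client_id: u32,
    pub payload: String,
}

/// The same operation after the server has fixed its place in the global order.
#[derive(Debug, Clone, PartialEq)]
pub struct SequencedOp {
    pub seq: u64,
    pub op: ClientOp,
}

/// One sequencer per open document. Because only this value ever assigns
/// sequence numbers for the document, no locks or consensus are needed.
#[derive(Debug, Default)]
pub struct DocumentSequencer {
    next_seq: u64,
    log: Vec<SequencedOp>,
}

impl DocumentSequencer {
    pub fn new() -> Self {
        Self::default()
    }

    /// Assign the next sequence number and record the op.
    /// The returned value is what would be broadcast to every client.
    pub fn submit(&mut self, op: ClientOp) -> SequencedOp {
        self.next_seq += 1;
        let sequenced = SequencedOp { seq: self.next_seq, op };
        self.log.push(sequenced.clone());
        sequenced
    }

    /// The ordered stream of everything sequenced so far.
    pub fn log(&self) -> &[SequencedOp] {
        &self.log
    }
}
```

Arrival order at `submit` is the global order; two clients that send conflicting operations at the same moment are disambiguated simply by which message the process handles first.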
The architecture was conceptually clean. One document, one process, one sequencer. The problem was not the design but the execution environment. The topology was sound enough that it survived the Rust rewrite unchanged. What needed to change was what happened inside each process.
What worked
The single-server-per-document model eliminated distributed coordination for the common case. No cross-server operation merging, no distributed locks, no consensus protocol for operation ordering. Simplicity was the strength.
Sticky routing by document ID meant that all clients editing the same file connected to the same server process. The load balancer hashed the document ID and routed accordingly. This made the operation sequencing problem local: one process, one event loop, one ordered stream of operations.
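Sticky routing of this kind can be sketched as a hash of the document ID reduced to a server index. The `route` function and the modulo scheme are assumptions for illustration; the source does not say which hash or mapping Figma's load balancer uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Map a document ID to one of `server_count` multiplayer servers.
/// The same ID always yields the same index, so every client opening
/// that document lands on the same process.
pub fn route(document_id: &str, server_count: usize) -> usize {
    assert!(server_count > 0, "need at least one server");
    // DefaultHasher::new() uses fixed keys, so results are repeatable
    // within a build. A production router would pick an explicitly
    // specified hash so the mapping survives toolchain upgrades.
    let mut hasher = DefaultHasher::new();
    document_id.hash(&mut hasher);
    (hasher.finish() % server_count as u64) as usize
}
```

Plain modulo remaps most documents whenever the server count changes; a consistent-hashing ring or a lookup table would limit that churn, at the cost of more moving parts.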
What didn't work
Node.js V8 garbage collection caused latency spikes of 50-200ms on large documents. A design file with 10,000+ nodes meant a large in-memory object graph. Every major GC cycle froze the event loop, blocking all WebSocket I/O for every user on that document. Users experienced this as a periodic "freeze": their cursor stopped, and changes from other users appeared all at once after the GC pause ended.
CPU-intensive operations (tree walks for parent-child resolution, bounding box recalculation, constraint propagation) competed with I/O on the single-threaded event loop. A complex operation on a large doc could block incoming WebSocket frames for all connected users. Node.js worker threads could theoretically offload this work, but the document tree is too large and deeply referenced to serialize efficiently across the thread boundary.
Memory overhead was substantial. JavaScript objects carry hidden class metadata, property maps, and V8 heap overhead. A simple rectangle node in Figma has dozens of properties (x, y, width, height, rotation, opacity, fill, stroke, constraints, etc.). Many of those property values (doubles, strings, nested objects) are separately heap-allocated in V8 and carry their own object metadata. The same logical document consumed 3-5x more memory in Node.js than a compact native representation.
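For contrast, here is a rough sketch of what a "compact native representation" could look like. The field list is a deliberately reduced, hypothetical subset of a rectangle's properties, chosen only to show that a fixed-layout struct has no per-property metadata.

```rust
use std::mem::size_of;

/// Hypothetical, simplified rectangle node with a fixed C-style layout.
/// Six f32 fields (24 bytes), two packed RGBA colors (8 bytes), and two
/// one-byte constraint modes (2 bytes), padded to a multiple of 4.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct RectNode {
    pub x: f32,
    pub y: f32,
    pub width: f32,
    pub height: f32,
    pub rotation: f32,
    pub opacity: f32,
    pub fill_rgba: u32,
    pub stroke_rgba: u32,
    pub h_constraint: u8,
    pub v_constraint: u8,
}

/// Bytes needed to store `node_count` such nodes contiguously.
pub fn document_bytes(node_count: usize) -> usize {
    node_count * size_of::<RectNode>()
}
```

Under these assumptions each node occupies 36 bytes, so 20,000 nodes fit in 720,000 bytes with no per-field headers. A real node has many more properties, but the per-property cost stays the size of the value itself.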
Why Not Just Optimize Node.js?
The obvious fix: profile the Node.js code, optimize hot paths, reduce allocations, tune GC settings.
Figma's team tried incremental optimization first. They reduced unnecessary object allocations, restructured the document tree for better GC behavior, and tuned V8 flags. These changes helped but hit a ceiling defined by the runtime itself.
V8's garbage collector is generational and concurrent, but it still has stop-the-world phases for major collections. Large, long-lived heaps (which document trees inherently are) trigger these major collections more frequently. You can tune GC thresholds, but you cannot eliminate pauses for a heap-heavy, long-lived process.
The event loop architecture was the other constraint. CPU-intensive tree walks block I/O. You can offload work to worker threads, but then you serialize/deserialize document state across the thread boundary, adding overhead and complexity that partially defeats the purpose. The document tree is a deeply interconnected graph with parent-child references, sibling ordering, and shared constraint data. Serializing it for cross-thread communication is expensive.
You might also consider using a native addon (C++ binding called from Node.js via N-API). Figma explored this path and found that the marshalling cost between JavaScript and native code on every operation was significant. Every operation requires reading from and writing to the document tree, which means data crosses the JS/native boundary multiple times per operation. A pure Rust process avoids this boundary entirely.
I've watched teams spend six months optimizing a runtime-constrained system only to hit the same wall at 2x scale. Figma's engineering blog makes clear they reached the point where a rewrite was the rational choice: the return on further Node.js optimization was diminishing, and the multiplayer engine was critical enough to justify it.
The math that forced the decision
Consider a document with 20,000 nodes. Each node has ~30 properties. That is 20,000 × 30 = 600,000 property values on V8's heap, spread across objects that each carry hidden class pointers, property storage, and GC metadata. V8's major GC must scan and potentially compact this entire heap.
At Figma's scale (millions of active documents, some with 50,000+ nodes), the tail latency from GC pauses was not an optimization opportunity. It was a user experience problem. Designers reported "lag spikes" that corresponded exactly to V8 major GC events in server logs. The spike duration scaled with document size: larger documents produced longer GC pauses, and the largest documents (community files with thousands of components) produced the worst user experience.
The team measured that, even with aggressive V8 tuning (increasing the old-generation heap size, reducing promotion thresholds), they could only reduce GC pause frequency, not eliminate it. The largest documents still produced pauses exceeding 150ms at least once per minute. For a real-time collaboration tool, that is unacceptable.
When to rewrite vs. optimize
Rewrite when the bottleneck is the runtime, not the algorithm. If profiling shows GC pauses and memory overhead as the top contributors, no amount of application-level optimization fixes the fundamental constraint. If the bottleneck is your algorithm or data structure, fix that first in any language.
The Decision
Figma chose Rust for the multiplayer engine rewrite. The key factors:
No garbage collector. Rust uses ownership and borrowing for memory management. Document trees stay in memory for the lifetime of an editing session (minutes to hours). In a GC language, these long-lived large heap objects trigger expensive major collections. In Rust, memory is freed deterministically when the owning value goes out of scope or is explicitly dropped. For a real-time system, "no surprise pauses" is the single most important runtime property.
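The deterministic release point is easy to observe in a small sketch. `EditingSession` and its `released` flag are hypothetical; the flag exists only so the example can report when `Drop` ran.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical per-document session that owns its node storage.
pub struct EditingSession {
    nodes: Vec<u64>,
    // Shared flag so callers can observe the moment of release (demo only).
    released: Rc<Cell<bool>>,
}

impl EditingSession {
    pub fn open(node_count: usize, released: Rc<Cell<bool>>) -> Self {
        Self { nodes: vec![0; node_count], released }
    }

    pub fn node_count(&self) -> usize {
        self.nodes.len()
    }
}

impl Drop for EditingSession {
    fn drop(&mut self) {
        // Runs exactly when the owner goes out of scope; the Vec's buffer
        // is freed immediately afterwards, with no collector involved.
        self.released.set(true);
    }
}

/// Opens a session in an inner scope and returns the release flag as
/// observed while the session was alive and after the scope ended.
pub fn session_lifetime_demo() -> (bool, bool) {
    let released = Rc::new(Cell::new(false));
    let while_open;
    {
        let session = EditingSession::open(50_000, Rc::clone(&released));
        let _count = session.node_count();
        while_open = released.get();
    } // `session` is dropped here, at a point visible in the source.
    (while_open, released.get())
}
```

The release happens at the closing brace every time, which is what makes the latency of teardown a property of the code rather than of a background collector.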
Arena allocation for document nodes. Figma's document tree maps naturally to arena allocation: allocate all nodes for a document from a single memory pool, deallocate the entire arena when the session ends. This gives cache-friendly traversals (nodes are contiguous in memory) and zero per-node allocation overhead. A document tree with 50,000 nodes fits in a compact, cache-efficient arena instead of scattered across the V8 heap.
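A minimal index-based arena might look like the following. This is a sketch of the general technique, not Figma's implementation: `DocumentArena`, `NodeId`, and the first-child/next-sibling links are assumptions chosen so that nodes need no per-node heap allocation of their own.

```rust
/// Handle into the arena: an index, not a pointer (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NodeId(u32);

/// Tree links are stored as indices, so a node is plain fixed-size data.
#[derive(Debug, Clone, Copy)]
pub struct Node {
    pub parent: Option<NodeId>,
    pub first_child: Option<NodeId>,
    pub next_sibling: Option<NodeId>,
}

/// All nodes of one document live in a single contiguous Vec.
/// Dropping the arena frees every node at once.
#[derive(Debug)]
pub struct DocumentArena {
    nodes: Vec<Node>,
}

impl DocumentArena {
    pub fn with_capacity(capacity: usize) -> Self {
        Self { nodes: Vec::with_capacity(capacity) }
    }

    /// Append a node and link it as the newest child of `parent`.
    /// Children are therefore iterated newest-first.
    pub fn alloc(&mut self, parent: Option<NodeId>) -> NodeId {
        let id = NodeId(self.nodes.len() as u32);
        let next_sibling = match parent {
            Some(p) => self.nodes[p.0 as usize].first_child.replace(id),
            None => None,
        };
        self.nodes.push(Node { parent, first_child: None, next_sibling });
        id
    }

    pub fn len(&self) -> usize {
        self.nodes.len()
    }

    pub fn child_count(&self, id: NodeId) -> usize {
        let mut count = 0;
        let mut current = self.nodes[id.0 as usize].first_child;
        while let Some(child) = current {
            count += 1;
            current = self.nodes[child.0 as usize].next_sibling;
        }
        count
    }

    /// Number of ancestors between `id` and the root.
    pub fn depth(&self, id: NodeId) -> usize {
        let mut depth = 0;
        let mut current = self.nodes[id.0 as usize].parent;
        while let Some(parent) = current {
            depth += 1;
            current = self.nodes[parent.0 as usize].parent;
        }
        depth
    }
}
```

Using indices instead of references also sidesteps borrow-checker friction with parent pointers, at the cost of bounds checks and the need to handle deleted slots, which this sketch omits.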
Predictable latency. No GC pauses means the 99th percentile latency is determined by the algorithm, not runtime housekeeping. For a real-time collaboration system where users feel any delay over 100ms, this predictability matters more than raw throughput. You can optimize an algorithm iteratively. You cannot optimize away a GC pause in a language that has GC.
Strong type system. The operation model (SET_PROPERTY, CREATE_NODE, DELETE_NODE, MOVE_NODE) maps well to Rust enums with exhaustive pattern matching. The compiler catches unhandled operation types at build time, eliminating an entire class of bugs. When you have four operation types with dozens of conflict resolution paths, exhaustive matching is a safety net that catches things code review misses.
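The four operation types named above map onto a Rust enum roughly as follows. The field names, the `Document` type, and the resolution rules (ignore writes to deleted nodes, ignore duplicate creates) are illustrative assumptions; only the four variant names come from the text.

```rust
use std::collections::HashMap;

/// The four operation types, with hypothetical payloads.
#[derive(Debug, Clone, PartialEq)]
pub enum Op {
    SetProperty { node: u32, key: String, value: f64 },
    CreateNode { node: u32, parent: Option<u32> },
    DeleteNode { node: u32 },
    MoveNode { node: u32, new_parent: Option<u32> },
}

#[derive(Debug, Default)]
pub struct NodeState {
    pub parent: Option<u32>,
    pub props: HashMap<String, f64>,
}

#[derive(Debug, Default)]
pub struct Document {
    pub nodes: HashMap<u32, NodeState>,
}

impl Document {
    /// Apply one sequenced op. Returns true if the document changed.
    /// There is no wildcard arm, so adding a fifth variant to `Op`
    /// makes this function fail to compile until it is handled.
    pub fn apply(&mut self, op: Op) -> bool {
        match op {
            Op::SetProperty { node, key, value } => match self.nodes.get_mut(&node) {
                Some(state) => {
                    state.props.insert(key, value);
                    true
                }
                // The node was deleted by an earlier op: drop the write.
                None => false,
            },
            Op::CreateNode { node, parent } => {
                if self.nodes.contains_key(&node) {
                    return false;
                }
                self.nodes.insert(node, NodeState { parent, props: HashMap::new() });
                true
            }
            // Simplification: children of a deleted node are not removed here.
            Op::DeleteNode { node } => self.nodes.remove(&node).is_some(),
            Op::MoveNode { node, new_parent } => match self.nodes.get_mut(&node) {
                Some(state) => {
                    state.parent = new_parent;
                    true
                }
                None => false,
            },
        }
    }
}
```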
Memory safety without GC. The team explicitly considered C++ and rejected it. Rust's ownership model prevents use-after-free, buffer overflows, and data races at compile time. For a service handling millions of WebSocket connections with complex in-memory state, eliminating those bug classes was worth the learning curve. A use-after-free bug in a long-running multiplayer server is the kind of issue that manifests as a rare, irreproducible crash under load, exactly the class of bug you most want to prevent.
What stayed the same
The Rust rewrite was surgical. Only the multiplayer engine (document state management, operation application, conflict resolution, WebSocket handling per document) was rewritten. The following components stayed in their original stack:
- API servers (authentication, file metadata, team management) remained in their existing language and framework.
- Persistence layer (document snapshot storage, operation log) was unchanged. The Rust engine wrote to the same storage layer.
- Frontend (the design canvas, UI components, file browser) obviously stayed as JavaScript/TypeScript in the browser.
- Load balancer and routing configuration stayed the same; the Rust processes accepted WebSocket connections at the same endpoints.
This is the critical point about surgical rewrites: you replace only the component that is the bottleneck. The interfaces stay the same. The rest of the system does not know or care that the multiplayer engine changed languages.