Figma's multiplayer architecture
How Figma built real-time collaborative design on top of a Rust-based operational transformation engine, scaling to millions of simultaneous editors with sub-100ms sync latency.
TL;DR
- Figma rewrote their multiplayer engine from Node.js to Rust to eliminate GC pauses and achieve 3-5x throughput per server for CPU-intensive document operations.
- Collaborative vector graphics is fundamentally harder than collaborative text because elements have spatial positions, z-order stacking, and parent-child nesting that all conflict independently.
- Figma uses server-sequenced operational transformation: each client applies operations optimistically, the server assigns a global order, and clients reconcile on divergence.
- Every document lives in exactly one server process. The horizontal scaling unit is active documents, not connections.
- Popular community files handle ~5,000 simultaneous editors on a single server, making "hot documents" the primary scaling bottleneck.
- Transferable lesson: when your collaboration domain has multi-dimensional conflicts (not just text insertion), a central sequencer with deterministic resolution rules is simpler and more predictable than a full CRDT.
The Trigger
By 2019, Figma's multiplayer editing had gone from a clever feature to the defining product experience. Design teams at Microsoft, Uber, and Airbnb were running live collaborative sessions with dozens of participants on a single file. Community files (shared public templates) regularly attracted hundreds to thousands of simultaneous viewers and editors.
The Node.js multiplayer engine was buckling under this load. GC pauses in V8 caused latency spikes on large documents with 10,000+ design nodes. Walking the document tree to apply operations was CPU-bound work that competed with the single-threaded event loop. Memory usage per document was unpredictable because JavaScript object overhead made large in-memory graph structures expensive.
Here is the core tension: Figma's product is a canvas, not a text editor. A typical Figma design file contains thousands of vector nodes organized in a hierarchical tree with spatial coordinates, rotation, opacity, constraints, and parent-child relationships. Every collaborative operation requires traversing or mutating this tree. That makes the multiplayer engine CPU-intensive in a way that, say, a chat server's WebSocket layer never is.
The team had a choice: optimize the Node.js engine incrementally or rewrite the multiplayer core in a systems language. I've seen teams agonize over this kind of rewrite decision for months. Figma's path was clear because the bottleneck was the runtime itself, not the application logic.
Why collaborative vector graphics is harder than text
Before diving into the architecture, it is worth understanding why Figma's problem is fundamentally harder than Google Docs or Notion.
Collaborative text has one primary conflict dimension: two users inserting characters at the same position. The merge semantics are well-understood (decades of OT research for linear text).
Collaborative vector graphics has at least five independent conflict dimensions:
| Conflict type | Example | Resolution complexity |
|---|---|---|
| Spatial positioning | Two users drag the same rectangle to different coordinates | LWW per axis (x, y independently) |
| Z-order stacking | Two users move elements within the same layer stack | Index adjustment after sequencing |
| Parent-child nesting | Two users move the same element into different parent groups | LWW on parent assignment |
| Group bounding boxes | Two users move different children of the same group simultaneously | Derived recalculation after all ops apply |
| Constraint propagation | Resizing a frame updates constrained children | All constraint recalculations must converge identically |
Each of these conflict types requires its own resolution rule. And they interact: moving a node into a new parent changes its spatial coordinates (relative to the new parent's frame), which changes the parent's bounding box, which may trigger constraint recalculation on sibling nodes. The resolution logic must handle all these cascades deterministically.
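To make the reparenting cascade concrete, here is a minimal sketch (the `DesignNode` shape and `reparent` helper are illustrative assumptions, not Figma's actual code): moving a node under a new parent must rewrite its relative coordinates so its absolute position on the canvas is unchanged.

```typescript
// Hypothetical sketch of the reparenting cascade. Coordinates are
// stored relative to the parent, so a MOVE must convert them.
interface DesignNode {
  id: string;
  x: number; // position relative to parent
  y: number;
  parentId: string | null;
}

// Absolute position = sum of relative offsets up the parent chain.
function absolutePosition(node: DesignNode, byId: Map<string, DesignNode>): { x: number; y: number } {
  let x = node.x, y = node.y;
  let parent = node.parentId ? byId.get(node.parentId) : undefined;
  while (parent) {
    x += parent.x;
    y += parent.y;
    parent = parent.parentId ? byId.get(parent.parentId) : undefined;
  }
  return { x, y };
}

// Reparenting rewrites the child's relative coordinates so its
// absolute position stays fixed under the new parent.
function reparent(node: DesignNode, newParent: DesignNode, byId: Map<string, DesignNode>): void {
  const abs = absolutePosition(node, byId);
  const parentAbs = absolutePosition(newParent, byId);
  node.parentId = newParent.id;
  node.x = abs.x - parentAbs.x;
  node.y = abs.y - parentAbs.y;
}
```

The same pattern repeats downstream: after the coordinate rewrite, the old and new parents' bounding boxes must be recomputed, which may in turn trigger constraint recalculation on siblings.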
This is why Figma's multiplayer engine is CPU-intensive. It is not just relaying messages between WebSocket clients. It is walking a complex tree, applying operations, resolving conflicts, propagating constraints, and recomputing derived state on every single operation.
The System Before
Figma's original multiplayer stack was a Node.js service handling WebSocket connections. Every open document had a corresponding process holding the full document state in memory. The process received operations from connected clients, sequenced them, and broadcast the results.
The architecture was built around a key insight: collaborative editing on a single document is fundamentally a single-writer problem. Only one server needs to sequence operations for a given document. This simplification avoided the complexity of distributed coordination protocols.
The architecture was conceptually clean. One document, one process, one sequencer. The problem was not the design; it was the execution environment. The topology was sound enough to survive the Rust rewrite unchanged. What needed to change was what happened inside each process.
What worked
The single-server-per-document model eliminated distributed coordination for the common case. No cross-server operation merging, no distributed locks, no consensus protocol for operation ordering. Simplicity was the strength.
Sticky routing by document ID meant that all clients editing the same file connected to the same server process. The load balancer hashed the document ID and routed accordingly. This made the operation sequencing problem local: one process, one event loop, one ordered stream of operations.
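Hash-based sticky routing can be sketched in a few lines. This is an assumption about the mechanism, not Figma's actual load balancer code; the hash function (FNV-1a here) and server list are illustrative.

```typescript
// Illustrative sticky routing: hash the document ID to pick a server,
// so every client editing the same file lands on the same process.
function routeDocument(documentId: string, servers: string[]): string {
  // FNV-1a string hash; any stable hash works for sticky routing.
  let hash = 2166136261;
  for (let i = 0; i < documentId.length; i++) {
    hash ^= documentId.charCodeAt(i);
    hash = Math.imul(hash, 16777619);
  }
  return servers[(hash >>> 0) % servers.length];
}
```

Because the hash is deterministic, the same document ID always resolves to the same server. A production deployment would layer consistent hashing or an explicit routing table on top so that adding or removing servers does not reshuffle every document.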
What didn't work
Node.js V8 garbage collection caused latency spikes of 50-200ms on large documents. A design file with 10,000+ nodes meant a large in-memory object graph. Every major GC cycle froze the event loop, blocking all WebSocket I/O for every user on that document. Users experienced this as a periodic "freeze": cursors stopped moving, then changes from other users appeared all at once when the GC pause ended.
CPU-intensive operations (tree walks for parent-child resolution, bounding box recalculation, constraint propagation) competed with I/O on the single-threaded event loop. A complex operation on a large doc could block incoming WebSocket frames for all connected users. Node.js worker threads could theoretically offload this work, but the document tree is too large and deeply referenced to serialize efficiently across the thread boundary.
Memory overhead was substantial. JavaScript objects carry hidden class metadata, property maps, and V8 heap overhead. A simple rectangle node in Figma has dozens of properties (x, y, width, height, rotation, opacity, fill, stroke, constraints, etc.). Each property in V8 is a heap-allocated value with object metadata. The same logical document consumed 3-5x more memory in Node.js than a compact native representation.
Why Not Just Optimize Node.js?
The obvious fix: profile the Node.js code, optimize hot paths, reduce allocations, tune GC settings.
Figma's team tried incremental optimization first. They reduced unnecessary object allocations, restructured the document tree for better GC behavior, and tuned V8 flags. These changes helped but hit a ceiling defined by the runtime itself.
V8's garbage collector is generational and concurrent, but it still has stop-the-world phases for major collections. Large, long-lived heaps (which document trees inherently are) trigger these major collections more frequently. You can tune GC thresholds, but you cannot eliminate pauses for a heap-heavy, long-lived process.
The event loop architecture was the other constraint. CPU-intensive tree walks block I/O. You can offload work to worker threads, but then you serialize/deserialize document state across the thread boundary, adding overhead and complexity that partially defeats the purpose. The document tree is a deeply interconnected graph with parent-child references, sibling ordering, and shared constraint data. Serializing it for cross-thread communication is expensive.
You might also consider using a native addon (C++ binding called from Node.js via N-API). Figma explored this path and found that the marshalling cost between JavaScript and native code on every operation was significant. Every operation requires reading from and writing to the document tree, which means data crosses the JS/native boundary multiple times per operation. A pure Rust process avoids this boundary entirely.
I've watched teams spend six months optimizing a runtime-constrained system only to hit the same wall at 2x scale. Figma's engineering blog makes clear they reached the rational decision point: the ROI on further Node.js optimization was diminishing, and the multiplayer engine was critical enough to justify a rewrite.
The math that forced the decision
Consider a document with 20,000 nodes. Each node has ~30 properties. That is 600,000 JavaScript objects in V8's heap, each with hidden class pointers, property storage, and GC metadata. V8's major GC must scan and potentially compact this entire heap.
At Figma's scale (millions of active documents, some with 50,000+ nodes), the tail latency from GC pauses was not an optimization opportunity. It was a user experience problem. Designers reported "lag spikes" that corresponded exactly to V8 major GC events in server logs. The spike duration scaled with document size: larger documents produced longer GC pauses, and the largest documents (community files with thousands of components) produced the worst user experience.
The team measured that, even with aggressive V8 tuning (increasing the old-generation heap size, reducing promotion thresholds), they could only reduce GC pause frequency, not eliminate it. The largest documents still produced pauses exceeding 150ms at least once per minute. For a real-time collaboration tool, that is unacceptable.
When to rewrite vs. optimize
Rewrite when the bottleneck is the runtime, not the algorithm. If profiling shows GC pauses and memory overhead as the top contributors, no amount of application-level optimization fixes the fundamental constraint. If the bottleneck is your algorithm or data structure, fix that first in any language.
The Decision
Figma chose Rust for the multiplayer engine rewrite. The key factors:
No garbage collector. Rust uses ownership and borrowing for memory management. Document trees stay in memory for the lifetime of an editing session (minutes to hours). In a GC language, these long-lived large heap objects trigger expensive major collections. In Rust, memory is freed deterministically when ownership transfers or scope ends. For a real-time system, "no surprise pauses" is the single most important runtime property.
Arena allocation for document nodes. Figma's document tree maps naturally to arena allocation: allocate all nodes for a document from a single memory pool, deallocate the entire arena when the session ends. This gives cache-friendly traversals (nodes are contiguous in memory) and zero per-node allocation overhead. A document tree with 50,000 nodes fits in a compact, cache-efficient arena instead of scattered across the V8 heap.
Predictable latency. No GC pauses means the 99th percentile latency is determined by the algorithm, not runtime housekeeping. For a real-time collaboration system where users feel any delay over 100ms, this predictability matters more than raw throughput. You can optimize an algorithm iteratively. You cannot optimize away a GC pause in a language that has GC.
Strong type system. The operation model (SET_PROPERTY, CREATE_NODE, DELETE_NODE, MOVE_NODE) maps well to Rust enums with exhaustive pattern matching. The compiler catches unhandled operation types at build time, eliminating an entire class of bugs. When you have four operation types with dozens of conflict resolution paths, exhaustive matching is a safety net that catches things code review misses.
Memory safety without GC. The team explicitly considered C++ and rejected it. Rust's ownership model prevents use-after-free, buffer overflows, and data races at compile time. For a service handling millions of WebSocket connections with complex in-memory state, eliminating those bug classes was worth the learning curve. A use-after-free bug in a long-running multiplayer server is the kind of issue that manifests as a rare, irreproducible crash under load, exactly the class of bug you most want to prevent.
What stayed the same
The Rust rewrite was surgical. Only the multiplayer engine (document state management, operation application, conflict resolution, WebSocket handling per document) was rewritten. The following components stayed in their original stack:
- API servers (authentication, file metadata, team management) remained in their existing language and framework.
- Persistence layer (document snapshot storage, operation log) was unchanged. The Rust engine wrote to the same storage layer.
- Frontend (the design canvas, UI components, file browser) obviously stayed as JavaScript/TypeScript in the browser.
- Load balancer and routing configuration stayed the same; the Rust processes accepted WebSocket connections at the same endpoints.
This is the critical point about surgical rewrites: you replace only the component that is the bottleneck. The interfaces stay the same. The rest of the system does not know or care that the multiplayer engine changed languages.
Rust is not always the answer
Figma's rewrite made sense because the bottleneck was CPU-intensive tree operations on large in-memory heaps. If your service is I/O-bound (proxying requests, reading from databases, serving static content), Rust's advantages over Node.js or Go are marginal. The rewrite overhead only pays off when the runtime is the measurable bottleneck.
The Migration Path
Figma did not do a big-bang rewrite. They migrated incrementally over several months, running both engines in parallel during the transition. This is the standard pattern for any rewrite of a critical production system, and it is worth understanding each phase because interviewers will ask about migration strategy.
Phase 1: Rust engine in shadow mode
The Rust multiplayer engine ran alongside the Node.js engine on the same document traffic. Both received identical operations. The Rust engine processed them independently but did not serve results to clients. Engineers compared the output of both engines to verify correctness.
Shadow mode caught edge cases in the Rust implementation's conflict resolution logic. Collaborative vector graphics has hundreds of edge cases around parent-child moves, z-order conflicts, and constraint propagation. Shadow mode validated each case against production Node.js behavior. When the outputs diverged, engineers debugged the Rust implementation and fixed it before any user was affected.
This phase ran long enough to process millions of operations across thousands of documents. The cost was running double the compute for the multiplayer tier, but the confidence was worth it. The team built automated comparison tooling that flagged divergences between the two engines in real time, allowing engineers to investigate and fix issues within hours.
The types of bugs shadow mode caught were instructive:
- Floating-point differences. Bounding box calculations that produced slightly different results due to operation ordering and floating-point non-associativity.
- Edge cases in deletion sequencing. A delete arriving between two related property updates (e.g., changing x and y of the same node) where one property change was applied and the other was dropped.
- Constraint propagation ordering. Two constraint-dependent nodes updated in different order, producing different final positions due to propagation sequence sensitivity.
Phase 2: Rust engine serves reads, Node.js handles writes
Clients began receiving document state from the Rust engine while Node.js still acted as the authoritative sequencer. This validated that the Rust engine's in-memory document state matched the Node.js engine's state exactly after applying the same operation sequence.
This phase is particularly clever because it tests the most dangerous class of bugs: state divergence. If the Rust engine's document tree ever differs from the Node.js engine's after applying the same operations, you catch it here before the Rust engine has any authority. State divergence bugs are the hardest to diagnose in production because they manifest as "my screen shows something different from my colleague's screen," which users report as a vague "sync issue."
Phase 3: Rust engine takes full ownership
The Rust engine became the authoritative sequencer and the Node.js engine was decommissioned for each migrated document. Rollback was possible at each phase boundary because both engines maintained independent state from the same operation log. Documents were migrated gradually, not all at once, starting with smaller, less active files and progressing to larger, more active ones.
Interview tip: migration strategy
When discussing rewrites in interviews, always describe the shadow-mode pattern: run old and new systems in parallel, compare outputs, then gradually shift traffic. Interviewers want to hear you won't do a big-bang cutover. The phrase "shadow mode" is a signal of production experience.
Phase 4: Optimization without constraints
With the Rust engine in production, the team optimized memory layout, added arena allocation for document nodes, and tuned the operation application hot paths. They could restructure the in-memory document representation without worrying about V8's object model or GC behavior. These optimizations were impossible in Node.js because they required fine-grained control over memory layout and allocation patterns.
Specific optimizations in this phase included:
- Struct-of-arrays layout for frequently accessed properties (x, y, width, height), enabling SIMD-friendly access patterns during bounding box calculations.
- Pre-allocated operation buffers to avoid per-operation heap allocation in the WebSocket receive path.
- Inline small values (enums, booleans, small strings) directly in the node struct rather than behind pointers, reducing cache misses during tree traversal.
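The struct-of-arrays idea can be sketched in a few lines. The `NodeGeometry` class below is a hedged illustration, not Figma's implementation: hot geometry properties live in parallel typed arrays instead of one object per node, so a bounding-box pass streams through contiguous memory.

```typescript
// Struct-of-arrays sketch: parallel typed arrays for hot properties.
class NodeGeometry {
  x: Float64Array;
  y: Float64Array;
  width: Float64Array;
  height: Float64Array;

  constructor(capacity: number) {
    this.x = new Float64Array(capacity);
    this.y = new Float64Array(capacity);
    this.width = new Float64Array(capacity);
    this.height = new Float64Array(capacity);
  }

  // Union bounding box over node indices: a tight loop over flat
  // arrays, friendly to caches and (in native code) SIMD.
  boundingBox(indices: number[]): { minX: number; minY: number; maxX: number; maxY: number } {
    let minX = Infinity, minY = Infinity, maxX = -Infinity, maxY = -Infinity;
    for (const i of indices) {
      minX = Math.min(minX, this.x[i]);
      minY = Math.min(minY, this.y[i]);
      maxX = Math.max(maxX, this.x[i] + this.width[i]);
      maxY = Math.max(maxY, this.y[i] + this.height[i]);
    }
    return { minX, minY, maxX, maxY };
  }
}
```

The contrast with an array-of-structs layout is that a bounding-box pass touches only the four arrays it needs, instead of pulling every property of every node through the cache.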
The System After
The Rust-based architecture preserves Figma's core design principle: one document, one process, one sequencer. The change was the execution environment, not the architectural topology.
The topology is identical to the Node.js version. The difference is inside each process: arena-allocated document trees, zero GC pauses, and 3-5x better CPU efficiency per operation.
How arena allocation works for document trees
Arena allocation is central to the Rust engine's performance, and worth understanding because it comes up in performance-focused interviews.
In a traditional allocator (or a GC-managed heap), each document node is allocated independently. Creating a node is a heap allocation; deleting it frees memory. The GC must track every allocation individually.
In an arena allocator, all nodes for a single document are allocated from a contiguous block of memory. Creating a node bumps a pointer within the arena. Deleting a node marks it as dead but does not free memory immediately. When the editing session ends and all users disconnect, the entire arena is freed in one operation.
The benefits are significant:
- Cache locality. Nodes allocated sequentially from an arena tend to be contiguous in memory. Tree traversals hit L1/L2 cache much more frequently than traversals over scattered heap allocations.
- Zero per-node allocation overhead. No object headers, no GC metadata, no reference counting. Each node is exactly the size of its data.
- Instant deallocation. When a document session ends, freeing the arena is O(1). No need to trace or finalize individual nodes.
- No fragmentation. Long-running sessions (hours of editing) do not fragment the heap because allocations are sequential within the arena. In a traditional allocator, hours of create/delete cycles would fragment memory. In an arena, dead nodes leave gaps that do not affect the allocator's performance.
For a document tree that is created, mutated heavily for hours, and then fully discarded, arena allocation is the ideal memory strategy. This is why the Rust rewrite unlocked performance that was architecturally impossible in V8. The memory model matches the workload perfectly: allocate for the session, mutate in place, free everything at once.
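A bump allocator captures the mechanics described above. The sketch below is an assumption for illustration (Figma's arena is in Rust; `DocumentArena` and its fixed four-field slot layout are invented here): node slots come from one contiguous buffer, allocation is a pointer bump, and freeing the session frees everything at once.

```typescript
// Minimal bump-allocator sketch of the arena idea.
const FIELDS = 4; // x, y, width, height per node slot

class DocumentArena {
  private buffer: Float64Array;
  private next = 0; // bump pointer, in node slots

  constructor(capacity: number) {
    // One contiguous block for the whole document session.
    this.buffer = new Float64Array(capacity * FIELDS);
  }

  // Allocating a node is a pointer bump: O(1), no per-node header.
  allocNode(x: number, y: number, width: number, height: number): number {
    const slot = this.next++;
    const base = slot * FIELDS;
    this.buffer[base] = x;
    this.buffer[base + 1] = y;
    this.buffer[base + 2] = width;
    this.buffer[base + 3] = height;
    return slot; // the node's "id" is its slot index
  }

  getX(slot: number): number {
    return this.buffer[slot * FIELDS];
  }

  // Ending the session releases the whole arena in one step.
  free(): void {
    this.buffer = new Float64Array(0);
    this.next = 0;
  }
}
```

Deletion in this model is a tombstone flag on the slot, not a free: dead slots leave gaps until the session ends and the entire buffer is dropped at once.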
Document isolation as the scaling unit
Each document runs in its own process. All WebSocket connections for a document route to the same server. This gives you:
- Scaling unit = active documents. Adding servers increases the number of documents you can serve simultaneously. It does not increase the capacity of any single document.
- No distributed state. Operation sequencing is local to one process. No distributed consensus, no cross-server coordination for the common case.
- Document migration happens by serializing the full document state and handing it to a new process. This is infrequent (server scaling events, not per-request).
The "hot document" problem is the primary scaling challenge. A viral community file with 5,000+ simultaneous editors loads one server heavily. Figma addresses this with optimized per-process throughput (the Rust rewrite helps here) and by ensuring the operation application path is as efficient as possible.
This is the "hot document" problem in concrete terms: one viral file with 5,000 editors consumes an entire server's capacity. The only mitigation is making each server as efficient as possible, which is exactly what the Rust rewrite achieved.
How document migration works
When Figma needs to move a document from one server to another (during scaling events, server maintenance, or rebalancing), the process is:
- Pause incoming operations. The current server stops accepting new operations from clients. Existing in-flight operations finish processing.
- Serialize document state. The full document tree (all nodes, properties, current operation sequence number) is serialized into a binary format.
- Transfer to new server. The serialized state is sent to the target server process, which deserializes it into a new arena.
- Redirect connections. The load balancer updates routing for that document ID. All WebSocket clients reconnect to the new server.
- Resume operations. The new server begins accepting operations. Clients that buffered operations during the pause replay them.
The pause duration depends on document size. Small documents (a few hundred nodes) migrate in under 100ms. Very large community files (100,000+ nodes) can take several seconds. This is the primary motivation for the incremental migration protocol on Figma's roadmap.
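The handoff steps above can be sketched as follows. This is a hedged illustration of the protocol as described, not Figma's code; `DocumentServer` and `SerializedDoc` are invented names, and a flat property map stands in for the real document tree.

```typescript
// Sketch of the pause → serialize → deserialize migration handoff.
interface SerializedDoc {
  documentId: string;
  nodes: Array<{ id: string; props: Record<string, number | string> }>;
  lastSequence: number; // operation sequence number at pause time
}

class DocumentServer {
  private paused = false;

  constructor(
    public documentId: string,
    private nodes: Map<string, Record<string, number | string>>,
    private lastSequence: number,
  ) {}

  // Step 1: stop accepting new operations.
  pause(): void { this.paused = true; }
  isPaused(): boolean { return this.paused; }

  // Step 2: serialize full state plus the sequence watermark.
  serialize(): SerializedDoc {
    return {
      documentId: this.documentId,
      nodes: [...this.nodes].map(([id, props]) => ({ id, props: { ...props } })),
      lastSequence: this.lastSequence,
    };
  }

  // Step 3: the target server rebuilds state and resumes from the watermark.
  static deserialize(doc: SerializedDoc): DocumentServer {
    const nodes = new Map(
      doc.nodes.map(n => [n.id, { ...n.props }] as [string, Record<string, number | string>]),
    );
    return new DocumentServer(doc.documentId, nodes, doc.lastSequence);
  }
}
```

The sequence watermark is the important detail: clients that buffered operations during the pause replay everything after `lastSequence`, so no operation is lost or double-applied across the handoff.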
How the Operation Model Works
Understanding Figma's operation model is essential because it is the core of their multiplayer system. I find this is the part candidates struggle with most when discussing collaborative editing in interviews. Most people jump to "just use CRDTs" without understanding the underlying operation semantics.
Why operations, not state snapshots
The first question to answer: why does Figma send operations instead of state snapshots?
If you send full document state (or even a diff/patch) after every change, concurrent edits overwrite each other. Designer A changes fill = blue and sends the full document. Designer B changes x = 200 and sends the full document. Whichever arrives second overwrites the other's change entirely.
Operations solve this by expressing intent: "change property P of node N to value V." Two concurrent operations on different properties of the same node can both apply without conflict. This is only possible because the system works at the operation level, not the state level.
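A toy comparison makes the difference concrete (the helper names here are illustrative): two concurrent snapshot writes clobber each other, while two property operations on different properties both survive.

```typescript
// Snapshot model vs. operation model for the Designer A / Designer B case.
type Props = Record<string, string | number>;

// Snapshot model: the whole node state is overwritten on every write.
function applySnapshot(_current: Props, snapshot: Props): Props {
  return { ...snapshot };
}

// Operation model: only the named property changes.
function applySetProperty(current: Props, property: string, value: string | number): Props {
  return { ...current, [property]: value };
}
```

Starting from `{ fill: "red", x: 100 }`, if A sends a full snapshot with `fill: "blue"` and B then sends a full snapshot with `x: 200`, B's snapshot silently reverts A's fill change. The same two edits expressed as `SET_PROPERTY` operations compose cleanly in either order.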
The operation types
Every mutation to a Figma document is expressed as one of four operation types:
```typescript
// Simplified Figma operation model
type Operation =
  | { type: 'SET_PROPERTY'; nodeId: string; property: string; value: any }
  | { type: 'CREATE_NODE'; nodeId: string; parentId: string; index: number }
  | { type: 'DELETE_NODE'; nodeId: string }
  | { type: 'MOVE_NODE'; nodeId: string; newParentId: string; index: number };
```
A user dragging a rectangle to a new position generates SET_PROPERTY(nodeId, 'x', 350) and SET_PROPERTY(nodeId, 'y', 200). Changing a fill color generates SET_PROPERTY(nodeId, 'fill', '#FF0000'). Creating a new frame generates CREATE_NODE with a client-generated unique ID. Nesting an element inside a group generates MOVE_NODE with the new parent reference.
This operation model is deliberately minimal. Four operation types cover every possible mutation to the document tree. The simplicity is critical: fewer operation types means fewer conflict resolution rules, which means fewer edge cases and easier reasoning about correctness.
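A sketch of applying these four operation types to an in-memory tree is below. The `DocNode` shape is an assumption for illustration, and the operation union is restated so the snippet is self-contained. The `never` check in the `default` branch is TypeScript's analogue of the Rust exhaustive enum matching mentioned earlier: adding a fifth operation type without handling it becomes a compile error.

```typescript
// Hedged sketch of operation application over a node map.
interface DocNode {
  id: string;
  parentId: string | null;
  children: string[]; // child ids in z-order
  props: Record<string, unknown>;
}

type Op =
  | { type: 'SET_PROPERTY'; nodeId: string; property: string; value: unknown }
  | { type: 'CREATE_NODE'; nodeId: string; parentId: string; index: number }
  | { type: 'DELETE_NODE'; nodeId: string }
  | { type: 'MOVE_NODE'; nodeId: string; newParentId: string; index: number };

function applyOp(doc: Map<string, DocNode>, op: Op): void {
  switch (op.type) {
    case 'SET_PROPERTY': {
      const node = doc.get(op.nodeId);
      if (node) node.props[op.property] = op.value; // edits to deleted nodes are dropped
      break;
    }
    case 'CREATE_NODE': {
      doc.set(op.nodeId, { id: op.nodeId, parentId: op.parentId, children: [], props: {} });
      doc.get(op.parentId)?.children.splice(op.index, 0, op.nodeId);
      break;
    }
    case 'DELETE_NODE': {
      const node = doc.get(op.nodeId);
      if (!node) break;
      const parent = node.parentId ? doc.get(node.parentId) : undefined;
      if (parent) parent.children = parent.children.filter(id => id !== op.nodeId);
      doc.delete(op.nodeId);
      break;
    }
    case 'MOVE_NODE': {
      const node = doc.get(op.nodeId);
      if (!node) break;
      const oldParent = node.parentId ? doc.get(node.parentId) : undefined;
      if (oldParent) oldParent.children = oldParent.children.filter(id => id !== op.nodeId);
      node.parentId = op.newParentId;
      doc.get(op.newParentId)?.children.splice(op.index, 0, op.nodeId);
      break;
    }
    default: {
      const _exhaustive: never = op; // compile-time exhaustiveness check
      throw new Error(`unhandled op ${String(_exhaustive)}`);
    }
  }
}
```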
Why not just send diffs?
An alternative design sends the full modified state (or a JSON diff) instead of structured operations. This approach fails for concurrent editing because two simultaneous diffs to different properties would overwrite each other. Operations are composable and order-independent (when they commute), making them the right abstraction for collaborative systems.
Commutativity rules
The commutativity rules are the heart of Figma's conflict resolution system. Understanding them is essential for both using this case study in interviews and designing your own collaborative systems.
Rule 1: Different properties on the same node commute. If Designer A sets x = 100 and Designer B sets fill = blue on the same rectangle, the order does not matter. Both operations apply cleanly regardless of sequence. This is the most common case in practice because designers rarely edit the exact same property at the exact same instant.
Rule 2: Same property on the same node conflicts. If Designer A sets x = 100 and Designer B sets x = 200, the resolution is last-write-wins (LWW) based on the server-assigned sequence number. Both clients converge to the same value. One designer "loses" their change, but in practice this is rare and the lost change is usually a drag operation that the designer simply repeats.
Rule 3: Different nodes always commute. Two designers working on different parts of the canvas never conflict. Their operations are independent and apply in any order with the same result. This is the vast majority of collaborative editing: different people working on different elements.
Rule 4: Structural operations have explicit precedence. CREATE, DELETE, and MOVE operations follow specific precedence rules. DELETE takes precedence over concurrent property edits (you cannot edit a deleted node). MOVE operations on the same node use LWW for the parent assignment. CREATE operations never conflict because each node has a unique ID generated by the creating client.
These four rules cover every possible operation interaction. The beauty of this design is that most operations fall into Rules 1 and 3 (no conflict), making conflict resolution an edge case rather than the common path.
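The four rules compress into a small predicate. This sketch assumes each operation names at most one target node; the `MpOp` type and `commutes` function are illustrative, not Figma's code. Pairs that do not commute fall back to the server's precedence rules (LWW for property and parent conflicts, DELETE wins over edits).

```typescript
// Compact encoding of the four commutativity rules.
type MpOp =
  | { type: 'SET_PROPERTY'; nodeId: string; property: string }
  | { type: 'CREATE_NODE'; nodeId: string }
  | { type: 'DELETE_NODE'; nodeId: string }
  | { type: 'MOVE_NODE'; nodeId: string };

function commutes(a: MpOp, b: MpOp): boolean {
  // Rule 3: operations on different nodes always commute.
  if (a.nodeId !== b.nodeId) return true;
  // Rules 1 and 2: property edits on the same node commute iff the
  // properties differ.
  if (a.type === 'SET_PROPERTY' && b.type === 'SET_PROPERTY') {
    return a.property !== b.property;
  }
  // Rule 4: structural operations on the same node never commute; the
  // server's precedence rules (DELETE wins, MOVE is LWW) decide instead.
  return false;
}
```

Note that the first check does most of the work in practice: since different people usually edit different elements, the common path returns `true` immediately.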
The tricky cases: spatial and structural conflicts
This is where collaborative vector graphics diverges from collaborative text editing.
Parent-child move conflicts. Designer A moves element E into Group G1. Designer B moves the same element E into Group G2. Both operations are MOVE_NODE with the same nodeId but different newParentId. Resolution: LWW. The later operation wins. The element ends up in one group, not both.
Z-order conflicts. Two designers reorder elements in the same layer stack. Both insert at a specific index. Resolution: the server sequences the inserts and adjusts indices. The second insert shifts to accommodate the first.
Bounding box consistency. A group's bounding box is computed from its children's positions. If two designers simultaneously move different children of the same group, both clients must compute the same final bounding box. Since each SET_PROPERTY on a child's position commutes (different nodes), the bounding box recomputation is a derived calculation that converges once all operations apply.
Deletion during edit. Designer A deletes a node while Designer B is editing a property of that same node. The delete takes precedence: once a node is deleted, any subsequent property changes targeting it are silently dropped. The server's sequence order determines whether B's edit arrives before or after the delete. If B's edit arrives first, it applies, then the delete removes the node. If the delete arrives first, B's edit is a no-op targeting a non-existent node.
Concurrent creation at the same index. Two designers both create a new element inside the same parent at the same index position. Both CREATE_NODE operations have unique IDs (generated client-side with UUIDs), so there is no ID conflict. The index conflict is resolved by the server sequencing: the second creation shifts to index+1 to accommodate the first. Both elements exist; they just end up in a slightly different order than either designer intended.
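The index-shifting behavior for concurrent inserts can be sketched as a standard OT index transform (an assumed implementation, not Figma's): when the server sequences two inserts aimed at the same slot of the same parent, the later one is pushed one position down rather than displacing the first.

```typescript
// Server-side sequencing of concurrent inserts into one child list.
function sequenceInserts(
  children: string[],
  inserts: Array<{ nodeId: string; index: number }>,
): string[] {
  const result = [...children];
  const applied: number[] = []; // effective indices of earlier sequenced inserts
  for (const insert of inserts) {
    let index = insert.index;
    // Transform against every earlier insert: an earlier insert at or
    // before our index pushes us one slot later.
    for (const earlier of applied) {
      if (earlier <= index) index++;
    }
    index = Math.min(index, result.length);
    result.splice(index, 0, insert.nodeId);
    applied.push(index);
  }
  return result;
}
```

Two concurrent creations at index 0 end up as `[first, second]`: both elements exist, in sequence order, exactly as described above.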
The client-side reconciliation loop
The happy path is straightforward: apply locally, send to the server, receive the sequenced broadcast. The full client-side reconciliation loop is more nuanced and worth understanding because it explains why the system feels instant despite server round-trips.
Step 1: Optimistic apply. When a user drags a rectangle, the client immediately applies SET_PROPERTY(rect1, x, newValue) to its local document state and renders the update. The user sees zero latency. The operation is simultaneously sent to the server over the WebSocket.
Step 2: Server sequencing. The server receives the operation, assigns it a global sequence number, applies it to the authoritative document state, and broadcasts the sequenced operation to all connected clients (including the sender).
Step 3: Acknowledgment. When the originating client receives its own operation back from the server (with a sequence number), it knows the server accepted it. If no conflicting operations arrived in between, the local state already matches the server state. No action needed.
Step 4: Reconciliation on conflict. If the server broadcasts a conflicting operation (from another client) that changes the same property, the originating client must reconcile. It rolls back its optimistic state, replays the server's authoritative sequence, and re-renders. The rollback is typically invisible to the user because it happens within a single frame.
Step 5: Divergence handling. In rare cases (network partition, extreme latency), the client's local state may diverge significantly from the server's. When the client reconnects, it receives the full authoritative operation sequence it missed, replays it from its last confirmed state, and discards any optimistic operations that conflict.
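Steps 1 through 4 can be sketched as a small client. This is a hedged illustration under simplifying assumptions (one property per operation, `CollabClient` and `SequencedOp` are invented names): the "rollback and replay" of Step 4 is modeled by always recomputing the visible state from confirmed server state plus the pending optimistic operations.

```typescript
// Sketch of optimistic apply + reconciliation on the client.
interface SetOp { nodeId: string; property: string; value: unknown }
interface SequencedOp extends SetOp { seq: number; clientId: string }

class CollabClient {
  confirmed: Record<string, unknown> = {}; // state as of the last server-sequenced op
  pending: SetOp[] = [];                   // optimistic, not yet acknowledged

  constructor(public clientId: string) {}

  private key(op: SetOp): string { return `${op.nodeId}.${op.property}`; }

  // Step 1: optimistic apply — the edit is visible immediately via render().
  localEdit(op: SetOp): void { this.pending.push(op); }

  // Steps 3-4: a sequenced op arrives (possibly our own, echoed back).
  onServerOp(op: SequencedOp): void {
    if (op.clientId === this.clientId) {
      // Acknowledgment: our edit is now part of confirmed history.
      this.pending = this.pending.filter(p => this.key(p) !== this.key(op));
    }
    // The server's order is authoritative; fold it into confirmed state.
    this.confirmed[this.key(op)] = op.value;
  }

  // What the user sees: confirmed state with pending ops replayed on top.
  render(): Record<string, unknown> {
    const view = { ...this.confirmed };
    for (const p of this.pending) view[this.key(p)] = p.value;
    return view;
  }
}
```

The `render()` method is the reconciliation: because it rebuilds the view from the confirmed sequence every time, a conflicting remote operation never leaves stale optimistic state behind once the client's own operation is acknowledged or dropped.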
This loop is the reason Figma feels instant while maintaining consistency. The optimistic apply hides network latency; the reconciliation maintains correctness. The combination is what makes real-time collaboration feel real-time.
For your interview: the key insight is that optimistic local application gives instant feedback, server sequencing gives consistency, and rollback handles divergence. Say that in one sentence and you have nailed the core mechanism.
The six-word summary
Optimistic apply, server sequence, client reconcile. That is Figma's entire multiplayer loop in six words. If an interviewer asks how real-time collaboration works, start there and expand.
OT vs CRDT: Why Figma Chose a Central Sequencer
Figma describes their system as "CRDT-inspired" but it is, architecturally, an operational transformation system with a central sequencer. The distinction matters in interviews, and I find it is one of the most commonly confused points.
How OT with a central sequencer works (Figma's model)
A server assigns a global order to all operations. Clients send operations to the server, apply them locally for instant feedback, and reconcile when the server's order differs from their optimistic state. The server is the single source of truth for ordering.
The key property: the server sees every operation before any client sees a peer's operation. This means conflict resolution is centralized, deterministic, and simple. The server does not need to reason about concurrent branches or vector clocks. It processes operations one at a time in the order they arrive, applies its resolution rules, and broadcasts the result.
The simplicity is striking. There is no need for:
- Vector clocks (tracking causal relationships between operations across replicas).
- Merge functions (combining divergent states from multiple replicas).
- Tombstone garbage collection (cleaning up markers for deleted elements).
- Anti-entropy protocols (detecting and repairing inconsistencies between replicas).
All of these are required in CRDT systems. None are needed when a single server sequences everything.
The tradeoff: requires a live server connection. Offline editing is limited to buffering operations locally and replaying them when connectivity returns. For a design tool where you need to see collaborator cursors and live changes, this tradeoff is acceptable.
How CRDT works (the alternative)
Each client applies operations independently with no central authority. Operations carry metadata (logical timestamps, vector clocks, unique IDs) that allow any replica to merge them in any order and arrive at the same state.
CRDTs enable fully peer-to-peer editing. No server needed for consistency. Libraries like Yjs and Automerge implement text CRDTs that work well for documents and notes. These libraries handle the complexity of causal ordering, tombstones, and merge resolution internally.
The tradeoff: CRDT metadata overhead is significant for complex data types. Every node in a CRDT needs a unique ID, a logical timestamp, and tombstone tracking for deletes. For a vector graphics document with 50,000 nodes, each with dozens of properties, this metadata adds non-trivial memory and network overhead. Vector graphics CRDTs are an active research area with high implementation complexity. No production-quality library exists for the full set of operations Figma needs (spatial positioning, z-order, parent-child nesting, constraint propagation).
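To make the metadata overhead concrete, here is a sketch of a single last-writer-wins register, the simplest CRDT building block. The class and field names are illustrative, not from Yjs or Automerge; the point is that even one property needs a timestamp and a client id just to merge safely without a server.

```python
class LWWRegister:
    """One property value plus the metadata a replica needs to merge without
    a server: a Lamport timestamp and a client id to break ties."""
    def __init__(self, value, timestamp, client_id):
        self.value = value
        self.timestamp = timestamp
        self.client_id = client_id

    def merge(self, other):
        # Any two replicas can merge in either order and converge:
        # higher timestamp wins; equal timestamps fall back to client id.
        if (other.timestamp, other.client_id) > (self.timestamp, self.client_id):
            return LWWRegister(other.value, other.timestamp, other.client_id)
        return LWWRegister(self.value, self.timestamp, self.client_id)


# Two replicas set the same property concurrently, with no server involved.
a = LWWRegister("red", timestamp=7, client_id="alice")
b = LWWRegister("blue", timestamp=7, client_id="bob")
# Merge commutes: both orders yield the same winner ("bob" > "alice" on ties).
assert a.merge(b).value == b.merge(a).value == "blue"
```

Now multiply that per-property metadata by 50,000 nodes with dozens of properties each, add tombstones so deletes survive merges, and the overhead described above stops being abstract.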
Why Figma picked OT
Figma chose OT because their product requires a live connection anyway (you need to see other cursors, selection highlights, and real-time presence indicators). The central sequencer is simpler to implement correctly for the complex conflict cases in vector graphics. CRDT would add metadata overhead and implementation complexity without enabling a use case Figma needs (offline-first editing).
Consider the implementation complexity difference concretely. Figma's OT system needs:
- A server that assigns sequence numbers and broadcasts operations.
- Client-side optimistic apply with rollback on conflict.
- Four conflict resolution rules (different properties commute, same property LWW, structural precedence, deletion priority).
A CRDT system for the same use case would need:
- Unique IDs per node and per operation (UUID generation, deduplication).
- Logical clocks or vector clocks per client.
- Merge functions for every data type (positions, z-order lists, tree structure).
- Tombstone tracking for all deleted nodes (with garbage collection to bound memory).
- Anti-entropy protocol for detecting missed operations after reconnection.
The OT approach has fewer moving parts and fewer places for bugs to hide.
The simplicity argument is underrated. Figma's conflict resolution for vector graphics already involves four operation types, multiple commutativity rules, cascading constraint propagation, and derived state recomputation. Adding CRDT merge semantics on top of that would roughly double the complexity of the conflict resolution layer for zero product benefit.
My recommendation for interviews: if the system requires a live server connection anyway, default to OT with a central sequencer. Only reach for CRDT when offline-first editing is a core product requirement. When in doubt, start with the simpler architecture and migrate to CRDT later if the product requirements change.
Don't conflate Figma's system with pure CRDT
Figma's engineering blog uses the phrase "inspired by CRDTs" because their operations are designed to commute when possible. But the system relies on a central server for ordering and conflict resolution. In an interview, be precise: Figma uses server-sequenced OT with commutative operations, not a true CRDT.
The Results
The Rust rewrite delivered measurable improvements across every dimension Figma cared about.
| Metric | Node.js Engine | Rust Engine |
|---|---|---|
| Operation latency (p99) | 80-200ms (GC spikes) | <50ms (deterministic) |
| Throughput per server | Baseline | 3-5x higher |
| Memory per document | High (V8 object overhead) | 60-70% less (arena allocation) |
| Server count for same load | Baseline | ~40-60% fewer servers |
| Max editors per hot document | ~1,000-2,000 practical limit | ~5,000+ simultaneous editors |
| GC pause impact | 50-200ms stop-the-world | None (no GC) |
The p99 latency improvement was the most user-visible change. Designers on large files no longer experienced intermittent lag spikes during collaboration. In a tool where cursor movements, selection changes, and drag operations all happen at 60fps, even a 100ms pause is perceptible and frustrating.
The throughput improvement meant fewer servers for the same traffic, directly reducing infrastructure cost. Figma's multiplayer tier shrank by roughly half in server count while handling more concurrent documents.
The ~5,000 simultaneous editor capacity on community files was a product differentiator. Figma's community files (shared design templates) can go viral when a popular designer publishes a new UI kit. Before the Rust rewrite, these viral files required careful capacity management and sometimes degraded under load. After, they worked within the engine's normal operating envelope.
The memory reduction deserves special attention. Arena-allocated document nodes in Rust use a fraction of the memory that V8 objects require for the same logical structure. This means each server can hold more documents in memory simultaneously, which directly increases the number of active documents the fleet can support before needing to scale horizontally.
The operational impact was also significant. Before the rewrite, the on-call team monitored GC-related latency alerts. After, those alerts disappeared entirely. The p99 latency became stable and predictable, determined only by document size and operation complexity. This predictability made capacity planning straightforward: given a document of size N with M editors, the server's resource usage is deterministic, not probabilistic.
Interview tip: always quantify
When citing Figma's rewrite in an interview, lead with the concrete numbers: 3-5x throughput, p99 under 50ms, 5,000 simultaneous editors. Vague claims like "it was faster" do not demonstrate engineering judgment. Specific numbers show you understand what mattered and why.
What They'd Do Differently
Figma's engineering team has been relatively quiet about regrets, but a few themes emerge from conference talks and blog posts.
More automated correctness testing earlier. Shadow mode caught many edge cases, but some conflict resolution bugs only surfaced months after the Rust engine went live. Property-based testing (generating random operation sequences and verifying that all clients converge to the same state) would have caught more issues during development. The operation model has enough combinatorial complexity that manual test cases cannot cover all interaction patterns.
I've seen this pattern repeatedly in rewrite projects: the team is so focused on matching the old system's behavior that they under-invest in generative testing that finds behaviors neither system handled correctly. If I were advising the Figma team, I would have recommended a property-based test harness from day one that generates random sequences of all four operation types against random document trees and asserts convergence.
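A property-based harness of the kind described above can be small. This toy version generates random same-property edits only; a real harness would generate all four operation types against random document trees. The `sequencer` and `replay` functions are stand-ins for the engine under test, not Figma's code.

```python
import random

def sequencer(ops):
    """Stand-in central server: the global order is simply arrival order."""
    return list(enumerate(ops, start=1))

def replay(sequenced):
    """Stand-in client reconciliation: replay the authoritative sequence
    from scratch; same-property conflicts resolve last-writer-wins."""
    state = {}
    for _seq, (key, value) in sequenced:
        state[key] = value
    return state

for trial in range(100):
    rng = random.Random(trial)               # seeded so failures reproduce
    ops = [(f"node{rng.randrange(5)}.x", rng.randrange(100))
           for _ in range(rng.randrange(1, 20))]
    sequenced = sequencer(ops)
    # Convergence property: every client that replays the same server
    # order must arrive at the same state.
    states = [replay(sequenced) for _client in range(3)]
    assert states[0] == states[1] == states[2]
```

The value of this style is that the generator, not the engineer, decides which operation interleavings to try, which is exactly what manual test cases cannot do for a combinatorially large operation model.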
The Rust learning curve was real. Several engineers on the team had no prior Rust experience. The borrow checker's learning curve slowed initial development velocity significantly. Patterns that are trivial in JavaScript (passing a mutable reference to a callback) require careful structuring in Rust. In hindsight, investing in Rust training before the project started (rather than learning during it) would have smoothed the ramp and avoided some early design decisions that had to be refactored once the team internalized Rust idioms.
Document migration could be smoother. Moving a live document from one server to another requires serializing the full document state and briefly pausing editing. For very large documents (100,000+ nodes), this pause can last several seconds and is noticeable to users. A more incremental migration protocol, where document state streams to the new server while operations continue flowing, remains on their long-term roadmap. This is a hard problem because you need to transfer a consistent snapshot while the document is actively being mutated.
Hot document mitigation is still imperfect. When a community file goes viral and attracts 5,000+ simultaneous editors, the single server hosting that document is under heavy load. The Rust rewrite raised the ceiling significantly, but the fundamental architecture (one document, one process) means there is always a per-document capacity limit. Sharding a single document across multiple servers would require distributed operation sequencing, which defeats the simplicity benefit that makes the whole architecture work. Figma has not publicly described a solution to this constraint beyond "make each server faster."
This is an honest limitation. Every architecture has a ceiling, and the single-server-per-document model's ceiling is the throughput of one machine. For Figma, the Rust rewrite pushed that ceiling high enough that it is rarely hit, but it has not been eliminated.
Architecture Decision Guide
Use these flowcharts when deciding whether your system needs a similar architecture: a central sequencer, with a potential rewrite to a systems language. The first covers the runtime decision; the second covers the collaboration strategy.
Should you rewrite in a systems language?
The following flowchart walks through the decision criteria Figma applied. Start from the top and follow the branches.
The key filter: a runtime rewrite only makes sense when the bottleneck is the runtime itself (GC pauses, memory overhead), not your algorithms. If your algorithms are inefficient, rewriting in Rust just makes them inefficient faster.
Which collaboration strategy should you use?
The OT vs CRDT decision flowchart from the earlier section also serves as an architecture decision guide. To summarize the decision criteria as a table:
| Factor | Favors OT (central sequencer) | Favors CRDT |
|---|---|---|
| Network requirement | Users always online | Offline-first is core feature |
| Data complexity | Rich structures (graphics, CAD) | Text, simple key-value |
| Implementation complexity | Lower (centralized logic) | Higher (distributed merge) |
| Metadata overhead | Minimal | Significant per-node |
| Scaling model | Vertical (per-document) | Horizontal (any replica) |
| Consistency guarantee | Sequential (server-ordered) | Eventual (convergence) |
If you are building a product where users are always connected and data types are complex, OT with a central sequencer is the pragmatic choice. This is Figma's position, and it is the right default for most collaborative design and editing tools.
Transferable Lessons
1. The scaling unit determines your architecture.
Figma's scaling unit is documents, not connections. Every architectural decision follows from this: one process per document, sticky routing by document ID, horizontal scaling by adding document capacity. When designing any system, identify the scaling unit first. Everything else follows from that choice. For Twitter, the scaling unit is the celebrity account (fan-out). For Uber, the scaling unit is the geographic region. For Figma, it is the individual document.
2. A central sequencer is simpler than distributed consensus for real-time collaboration.
Figma avoids the complexity of CRDTs and distributed conflict resolution by routing all operations for a document through one server. This eliminates an entire category of distributed systems problems: no vector clocks, no merge functions, no tombstone garbage collection, no causal ordering across replicas. If your collaboration system requires a live connection anyway, a central sequencer is the right default. The simplicity pays dividends in debuggability, correctness, and development velocity.
3. Rewrite the bottleneck, not the system.
Figma did not rewrite their entire backend in Rust. They rewrote the multiplayer engine, the one component where GC pauses and CPU efficiency were the actual bottleneck. The rest of their stack (API servers, persistence, frontend, authentication) stayed in their original languages. Surgical rewrites beat big-bang rewrites every time. The multiplayer engine is maybe 10% of Figma's total codebase, but it accounted for 90% of the runtime performance problems.
4. Deterministic conflict resolution beats flexible conflict resolution.
Figma's conflict rules are simple: independent properties commute, same-property conflicts use LWW, structural conflicts have explicit resolution rules. These rules are not always "fair" (the last writer wins, even if the first writer's intent was better), but they are predictable. Every client converges to the same state regardless of network timing or operation ordering.
I've seen teams try to build "smart" conflict resolution that considers user intent, analyzes operation semantics, or prompts users to resolve conflicts manually. It always collapses under edge cases. The combinatorial explosion of possible conflict scenarios in a rich data model like vector graphics makes any "smart" resolution brittle. Predictability beats sophistication.
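What "deterministic beats smart" looks like in code: a single dispatch function that applies server-sequenced operations with fixed rules and no user-intent heuristics. This is an illustrative sketch, not Figma's implementation; MOVE_NODE is omitted for brevity, and the document model is a bare dict.

```python
def apply_op(doc, op):
    """Apply one server-sequenced operation to the document state.
    doc maps node_id -> {property: value}; deleted nodes are simply absent."""
    kind, node = op["kind"], op["node"]
    if kind == "DELETE_NODE":
        doc.pop(node, None)                  # deletion priority: the node is gone
    elif kind == "CREATE_NODE":
        doc.setdefault(node, {})
    elif kind == "SET_PROPERTY":
        if node in doc:                      # edits to a deleted node are dropped
            doc[node][op["prop"]] = op["value"]  # same property: LWW by sequence
    return doc


doc = {}
ops = [  # already in server-assigned sequence order
    {"kind": "CREATE_NODE", "node": "rect1"},
    {"kind": "SET_PROPERTY", "node": "rect1", "prop": "x", "value": 10},
    {"kind": "SET_PROPERTY", "node": "rect1", "prop": "fill", "value": "red"},
    {"kind": "SET_PROPERTY", "node": "rect1", "prop": "x", "value": 40},  # LWW
    {"kind": "DELETE_NODE", "node": "rect1"},
    {"kind": "SET_PROPERTY", "node": "rect1", "prop": "x", "value": 99},  # dropped
]
for op in ops:
    apply_op(doc, op)
assert doc == {}   # the delete won; the late edit was silently discarded
```

Notice that the late `SET_PROPERTY` is dropped without any attempt to ask "did the user mean to resurrect the node?" That discard is arguably unfair to the last editor, but every client that replays this sequence reaches the identical state, which is the property that matters.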
5. Shadow-mode validation is non-negotiable for rewrites.
Running old and new systems in parallel, comparing outputs, and only cutting over after extensive validation is the standard pattern for zero-downtime rewrites. Figma ran shadow mode long enough to process millions of operations across thousands of documents, catching hundreds of edge cases in their conflict resolution logic. I've seen teams skip shadow mode to save time and regret it within the first week of production. The confidence it provides is worth every dollar of the duplicated compute cost.
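The shadow-mode pattern reduces to a small harness: feed the same operation to both engines, keep the old engine authoritative, and log any divergence for offline analysis instead of failing user traffic. Everything here is a toy stand-in; the engine functions and the deliberately buggy "new" engine exist only to show the mechanism.

```python
def shadow_apply(old_engine, new_engine, op, divergences):
    """Old engine stays authoritative; new engine runs in shadow."""
    old_state = old_engine(op)
    new_state = new_engine(op)
    if new_state != old_state:
        # Record for offline analysis; users only ever see old_state.
        divergences.append({"op": op, "old": old_state, "new": new_state})
    return old_state


# Toy engines: both fold SET ops into a dict, but the "new" engine has a
# bug (it drops empty-string values) that shadow mode surfaces.
old_doc, new_doc = {}, {}

def old_engine(op):
    old_doc[op["key"]] = op["value"]
    return dict(old_doc)

def new_engine(op):
    if op["value"] != "":                    # the bug shadow mode will catch
        new_doc[op["key"]] = op["value"]
    return dict(new_doc)


divergences = []
shadow_apply(old_engine, new_engine, {"key": "name", "value": "Frame 1"}, divergences)
shadow_apply(old_engine, new_engine, {"key": "name", "value": ""}, divergences)
assert len(divergences) == 1    # the buggy case was caught; users were unaffected
```

The duplicated compute cost of running both engines is the price of finding exactly this class of bug before it owns production state.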
How This Shows Up in Interviews
Figma's architecture is directly relevant when you are asked to design any collaborative editing system, real-time document editor, or multiplayer feature. It also comes up in discussions about language/runtime tradeoffs and scaling strategies.
When to cite it: Any question involving "design a collaborative X" (whiteboard, document editor, design tool, spreadsheet). Also relevant for "when would you rewrite in a systems language?" questions. And it is a strong example for "how would you migrate a production system without downtime?"
The one sentence to say: "Figma routes all operations for a document through a single Rust server process that acts as the sequencer, using LWW for property conflicts and deterministic rules for structural conflicts, avoiding the complexity of distributed CRDTs."
Depth expected at senior/staff level:
- Explain why vector graphics collaboration is harder than text collaboration (multiple conflict dimensions)
- Describe the operation model and commutativity rules, including structural conflicts
- Articulate the OT vs CRDT tradeoff and when each is appropriate
- Explain why document isolation is the scaling unit and what the hot-document limit is
- Describe the shadow-mode migration pattern and why big-bang rewrites are risky
- Discuss arena allocation and why it matters for GC-free document management
| Interviewer asks | Strong answer citing Figma |
|---|---|
| "How would you handle conflicts in a collaborative editor?" | "I'd use a central sequencer model like Figma: operations on different properties commute automatically, same-property conflicts resolve with LWW based on server-assigned sequence numbers. Simpler than full CRDT and sufficient when users have a live connection." |
| "When would you rewrite a service in Rust or C++?" | "When the bottleneck is the runtime, not the algorithm. Figma rewrote their multiplayer engine because V8 GC pauses caused user-visible latency spikes on large in-memory document trees. The rest of their stack stayed in its original language." |
| "How do you scale a real-time collaboration system?" | "Figma's scaling unit is documents, not connections. Each document runs in one process with sticky routing. Horizontal scaling adds document capacity. The hard problem is hot documents (viral files with 5,000+ editors), which you address with per-process throughput optimization, not by sharding the document." |
| "OT vs CRDT, which would you choose?" | "OT with a central sequencer if the product requires a live connection (like Figma). CRDT if offline-first editing is a core requirement (like a local-first notes app). OT is simpler to implement correctly for complex data types like vector graphics." |
| "How would you migrate from one runtime to another safely?" | "Shadow mode first: run both systems on the same traffic, compare outputs, fix divergences. Then shift reads to the new system, then writes. Figma ran shadow mode for months before the Rust engine took ownership. Each phase had a rollback path." |
Quick Recap
- Figma rewrote their multiplayer engine from Node.js to Rust to eliminate GC pauses and reduce memory overhead for large in-memory document trees.
- Every document runs in a single server process acting as the operation sequencer, avoiding distributed coordination for the common case.
- The operation model uses four types (SET_PROPERTY, CREATE_NODE, DELETE_NODE, MOVE_NODE) with deterministic conflict resolution: independent properties commute, same-property conflicts use LWW.
- Collaborative vector graphics is harder than collaborative text because of spatial positioning, z-order, parent-child nesting, and bounding box consistency conflicts.
- Figma uses OT with a central sequencer, not true CRDT, because the product requires a live connection and the central sequencer is simpler for complex conflict cases.
- The migration used a four-phase shadow-mode approach: shadow processing, read shifting, write cutover, then optimization, with rollback possible at each phase.
- The horizontal scaling unit is active documents, with "hot documents" (viral community files, 5,000+ editors) as the primary scaling challenge.
- Transferable principle: identify your scaling unit first, rewrite only the bottleneck component, and prefer deterministic conflict resolution over flexible resolution in collaborative systems.
- The key interview sentence: "Optimistic apply, server sequence, client reconcile" captures the entire multiplayer loop.
Related Concepts
- Consistency models: Figma's server-sequenced OT enforces sequential consistency within a document. Understanding consistency guarantees helps you reason about why the central sequencer works and what it gives up (no offline writes without buffering).
- Design collaborative docs: The system design question that directly applies Figma's architecture. Use this case study as your primary reference for conflict resolution strategy, operation model, and scaling approach.
- WebSockets and real-time communication: Figma's multiplayer engine relies on persistent WebSocket connections for bi-directional operation streaming. Understanding WebSocket lifecycle, reconnection, and sticky session routing is essential context.
- Scalability: The document-as-scaling-unit model is a specific instance of vertical partitioning. Understanding when to scale vertically (bigger servers per unit) vs. horizontally (more units) clarifies why Figma optimized per-process throughput rather than distributing single documents.
- Replication: Figma's operation log is conceptually a replicated log. The server is the leader, clients are followers. Understanding leader-based replication helps reason about the client reconciliation loop and why the server's ordering is authoritative.