Collaborative Docs
Design a real-time collaborative document editor like Google Docs or Notion, covering conflict-free concurrent edits, operational transforms vs CRDTs, persistent storage, and live presence at millions of concurrent editors.
What is a collaborative document editing system?
Google Docs lets multiple people edit the same paragraph at once while seeing each other's cursors move in real time. The engineering challenge is not the UI. It is that two users can type at the same cursor position within the same millisecond, and both keystrokes must survive without either one disappearing. This question tests real-time communication protocol design, distributed state merge algorithms (OT vs CRDTs), and the append-only operation log patterns that let a document survive server crashes mid-session.
Functional Requirements
Core Requirements
- Multiple users can edit the same document simultaneously with their changes visible to all collaborators.
- Changes from all editors appear in every connected client within 500ms.
- Concurrent edits never overwrite or lose each other's work.
- Documents are persisted durably and can be reopened after the session ends (or after a server crash).
Below the Line (out of scope)
- Rich media editing (images, tables, embedded spreadsheets) - focus on plain text editing.
- Version history and rollback - the operation log makes this possible, but building the UI and diffing logic is a separate concern.
- Spell check, grammar suggestions, and AI writing assistance.
The hardest part in scope: Merging concurrent edits without data loss. When two users type at the same position at revision 42, both operations claim to be "against revision 42." Applying them naively in arrival order destroys the second user's intent. We will dedicate a full deep dive to the merge algorithm that makes this work.
Rich media editing is below the line because it changes the data model fundamentally. Text is a linear sequence of characters. Images and tables are embedded objects with their own dimensions and layout properties. Adding them requires a tree-structured document model instead of a flat string, which is a multi-month engineering project separate from the concurrency problem. To build it, I would model the document as a tree of blocks (similar to Notion's block model) and apply CRDT semantics at the block level.
Version history and rollback is below the line but nearly free once you have a persistent operation log. Every operation is already recorded with a user ID and timestamp. The rollback feature is essentially a query over that log: "give me the document state at revision N." Implementation is a read path addition on top of the storage design we will build anyway.
Non-Functional Requirements
Core Requirements
- Consistency: Zero data loss. Every committed operation survives a server crash. Concurrent edits must both appear in the final document with no operation silently discarded.
- Latency: Server acknowledges each operation in under 100ms. The operation broadcasts to all collaborators within 500ms (p99).
- Scale: 10M DAU, up to 500K concurrently active documents. Support up to 100 simultaneous editors per document. Peak system-wide operation rate: approximately 1M ops/sec during business hours.
- Availability: 99.99% uptime for document serving. A user should never lose their work due to an infrastructure failure mid-session.
- Storage: Retain the full operation history for 90 days. Keep compressed snapshots indefinitely.
Below the Line
- Sub-50ms operation propagation (WebSocket pub/sub achieves 100-200ms; sub-50ms requires region-local servers and is outside scope)
- Real-time spell-check or grammar feedback
Write pattern: Collaborative documents are write-intensive in a way most systems are not. A user typing at a normal 40 WPM produces roughly 3-4 character operations per second, so an active session with 10 people typing generates roughly 40 operations per second for a single document. At 500K concurrently active documents with an average of 3 active users each, that is 1.5M connected editors; even with only a fraction of them typing at any given instant, peak write load approaches the ~1M ops/sec figure from our scale requirement. The storage layer must handle this without becoming a bottleneck on the hot path.
The 100ms server-acknowledge target is deliberately generous. It lets us use a standard relational database for the operation log on the hot path rather than forcing a specialized write buffer. Above 100ms, users perceive their own keystrokes as laggy, which breaks the feeling of local responsiveness.
The 99.99% availability target drives the replication strategy. We need at least two replicas of the operation log, and the WebSocket serving layer must restart sessions transparently on instance failure.
Core Entities
- Document: The container. Carries a document ID, title, owner user ID, created timestamp, and a pointer to the latest materialized snapshot revision.
- Operation: One atomic edit event. Carries an op ID, document ID, authoring user ID, the client revision it was written against, the server-assigned revision after serialization, op type (insert or delete), position, and content.
- Snapshot: A materialized full-text copy of the document at a specific revision. Used on load to avoid replaying thousands of operations from scratch.
- Session: A live editing session. Carries session ID, document ID, user ID, WebSocket connection ID, current cursor position, and last-heartbeat timestamp.
- User: Account entity. Carries user ID, display name, and a color used for cursor rendering in the editor.
The full schema (indexes, foreign keys, partition strategy) is deferred to the data model deep dive. These five entities drive the API and High-Level Design.
API Design
Group endpoints by the functional requirement they satisfy.
FR 1 and FR 4 - Create, list, and open documents:
POST /documents
Body: { title }
Response: { doc_id, title, created_at }
A POST because this creates a resource. The response gives back the doc_id the client uses for all subsequent calls.
GET /documents/{doc_id}
Response: { doc_id, title, content, revision, owner_id }
The revision field in the response is critical. The client uses it to stamp every outgoing operation with the document revision it was written against. Without it, the server cannot detect concurrent edits.
GET /documents
Response: { documents: [...], next_cursor: "..." }
Cursor-based pagination over the user's document list. Users with hundreds of documents need paginated results.
FR 2 and FR 3 - Real-time editing and conflict-free merges:
The naive approach is a REST endpoint:
POST /documents/{doc_id}/operations
Body: { revision, op_type, position, content }
Response: { server_revision }
This fails the 500ms latency requirement immediately. Every collaborator would need to poll GET /documents/{doc_id}/operations?since=revision to pick up others' changes. With 100 editors polling every 100ms, that is 1,000 HTTP requests per second for a single document. Worse, polling introduces up to 100ms of additional latency per poll cycle.
The evolved shape uses a persistent WebSocket connection:
WebSocket: wss://collab.example.com/documents/{doc_id}
Client connects → Server sends: { type: "sync", content, revision }
Client sends:
{ type: "op", revision: 42, op_type: "insert", position: 15, content: "hello" }
Server sends to client (acknowledgment):
{ type: "op_ack", server_revision: 43 }
Server broadcasts to all other clients on this document:
{ type: "op_broadcast", user_id: "u-123", server_revision: 43,
op_type: "insert", position: 15, content: "hello" }
The client sends its local revision in every operation. The server assigns a monotonically increasing server_revision and broadcasts to all other connected clients. This is the entire real-time editing contract.
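The contract above can be sketched from the client's side. This is a simplified, illustrative model (the class and field names are assumptions, and it omits transforming broadcasts against locally pending ops, which a real client must do):

```python
# Minimal sketch of the client side of the WebSocket editing contract,
# assuming the JSON message shapes shown above. Pending-op transformation
# is deliberately omitted for brevity.

class EditorClient:
    def __init__(self):
        self.content = ""
        self.revision = 0
        self.outbox = []          # ops sent but not yet acknowledged

    def local_insert(self, position, text):
        """Apply locally first, then queue the op stamped with our revision."""
        self.content = self.content[:position] + text + self.content[position:]
        op = {"type": "op", "revision": self.revision,
              "op_type": "insert", "position": position, "content": text}
        self.outbox.append(op)
        return op                  # in a real client, sent over the WebSocket

    def on_message(self, msg):
        if msg["type"] == "sync":
            self.content, self.revision = msg["content"], msg["revision"]
        elif msg["type"] == "op_ack":
            self.outbox.pop(0)     # our op was serialized at this revision
            self.revision = msg["server_revision"]
        elif msg["type"] == "op_broadcast":
            # Another user's op, already transformed by the server.
            p, t = msg["position"], msg["content"]
            self.content = self.content[:p] + t + self.content[p:]
            self.revision = msg["server_revision"]

client = EditorClient()
client.on_message({"type": "sync", "content": "cat", "revision": 42})
client.local_insert(3, "s")       # locally "cats", awaiting ack
client.on_message({"type": "op_ack", "server_revision": 43})
client.on_message({"type": "op_broadcast", "user_id": "u-456",
                   "server_revision": 44, "op_type": "insert",
                   "position": 4, "content": "!"})
# client.content is now "cats!" at revision 44
```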
Bonus - Live presence (not among the core FRs, but cheap to add):
Client sends over existing WebSocket:
{ type: "cursor", position: 47 }
Server broadcasts to other clients:
{ type: "cursor_update", user_id: "u-123", position: 47 }
Presence piggybacks on the same WebSocket connection at zero additional infrastructure cost. I'd add it from day one.
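Presence tracking itself can be sketched with a heartbeat window: every cursor message doubles as a heartbeat, and a user whose last heartbeat is older than a liveness threshold is treated as offline. The names and the 10-second threshold below are illustrative assumptions (an in-memory dict stands in for something like a Redis sorted set scored by heartbeat time):

```python
# Sketch of presence tracking keyed off cursor messages. An in-memory dict
# stands in for a shared store; the 10-second window is an assumption.

OFFLINE_AFTER_SECONDS = 10

class PresenceTracker:
    def __init__(self):
        self.heartbeats = {}   # user_id -> last heartbeat timestamp
        self.cursors = {}      # user_id -> last reported cursor position

    def on_cursor(self, user_id, position, now):
        """Record a cursor message; any message doubles as a heartbeat."""
        self.heartbeats[user_id] = now
        self.cursors[user_id] = position

    def online(self, now):
        """Users whose last heartbeat is within the liveness window."""
        return {u: self.cursors[u] for u, ts in self.heartbeats.items()
                if now - ts <= OFFLINE_AFTER_SECONDS}

p = PresenceTracker()
p.on_cursor("u-123", 47, now=100.0)
p.on_cursor("u-456", 12, now=104.0)
# At t=109 both heartbeats are fresh; at t=111, u-123's is 11s old.
```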
High-Level Design
1. Users can create and open documents
The document load path: fetch the latest snapshot, replay any operations applied after that snapshot, and return the reconstructed content.
Components:
- Client: Web browser running the editor UI.
- API Server: Handles document CRUD, serves document content on load.
- Document DB (PostgreSQL): Stores documents, operation log, and snapshots.
Request walkthrough:
- Client sends GET /documents/{doc_id}.
- API Server queries for the latest snapshot for this document (carries revision N and full text).
- API Server queries for all operations where server_revision > N.
- API Server reconstructs the current content in memory by replaying those ops onto the snapshot text.
- API Server returns { content, revision } to the client.
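The load path can be sketched as follows. The function and its inputs are illustrative stand-ins for the PostgreSQL queries, and deletes are assumed to remove exactly one character for simplicity:

```python
# Sketch of the document load path: start from the newest snapshot, replay
# every later op in server_revision order, return content plus revision.

def load_document(snapshot, ops_after_snapshot):
    """Rebuild current content from a snapshot and the op-log tail."""
    content, revision = snapshot["content"], snapshot["revision"]
    for op in sorted(ops_after_snapshot, key=lambda o: o["server_revision"]):
        p = op["position"]
        if op["op_type"] == "insert":
            content = content[:p] + op["content"] + content[p:]
        else:  # delete of one character (simplifying assumption)
            content = content[:p] + content[p + 1:]
        revision = op["server_revision"]
    return {"content": content, "revision": revision}

doc = load_document(
    snapshot={"content": "cat", "revision": 42},
    ops_after_snapshot=[
        {"server_revision": 43, "op_type": "insert", "position": 3, "content": "s"},
        {"server_revision": 44, "op_type": "insert", "position": 4, "content": "!"},
    ],
)
# doc == {"content": "cats!", "revision": 44}
```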
This handles single-user document viewing. It breaks immediately when two users open the same document and start typing.
2. Multiple users edit simultaneously
Phase 1 - Naive (HTTP polling):
Components:
- Same API Server and PostgreSQL.
- Client polls for new operations via GET /documents/{doc_id}/operations?since={revision}.
Walkthrough:
- User A types a character. Client sends POST /operations.
- User B's client polls GET /operations?since=42 every 500ms to find it.
- This meets the 500ms requirement only if the poll happens immediately after the operation is committed.
Break it: With 100 collaborators each polling every 100ms, a single active document generates 1,000 HTTP requests per second. At 500K active documents, that is 500M requests per second - orders of magnitude beyond what any reasonable server fleet handles. Polling also adds up to a full poll interval of latency in the worst case: an operation committed just after a poll waits ~100ms, plus a round trip, before anyone sees it.
Phase 2 - Evolved (WebSocket relay):
The key insight is that the server knows who is editing which document. Push the operation to every connected client instead of waiting for them to ask. A WebSocket connection keeps state server-side that HTTP cannot.
Components:
- Client: Opens a WebSocket connection to the Collab Server on document open. Closes it on document close.
- Collab Server: Manages WebSocket connections. Receives operations from any connected client, assigns a server_revision, writes to the DB, and broadcasts to all other clients connected to the same document.
- Redis Pub/Sub: When there are multiple Collab Server instances, they cannot share in-process subscriber lists. Redis acts as the message bus: each Collab Server subscribes to a channel per active document. When one instance receives and commits an operation, it publishes to Redis. All other instances subscribed to that document channel receive it and relay it to their local WebSocket connections.
- PostgreSQL: Appends every committed operation to the operation log.
Request walkthrough:
- Both User A and User B open doc_123. Each establishes a WebSocket to one of the Collab Server instances (may be different instances behind a load balancer).
- User A types "h". Client A sends { type: "op", revision: 42, op_type: "insert", position: 0, content: "h" }.
- Collab Server Instance 1 assigns server_revision: 43 and writes the op to the operation log in PostgreSQL.
- Instance 1 publishes { doc_id: "doc_123", op, server_revision: 43 } to the Redis doc:doc_123 channel.
- Instance 2 (where User B is connected) receives the Redis message and broadcasts it over User B's WebSocket.
- User B's editor applies the operation at server_revision: 43. User A sees the server ack.
User A gets an immediate ACK from their local server instance. User B receives the broadcast within one Redis pub/sub hop, typically under 5ms. The WebSocket relay layer adds under 20ms of latency on top of the DB write, well inside the 500ms budget.
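The fan-out path can be sketched with an in-memory bus standing in for Redis Pub/Sub. All names are illustrative, and for brevity this version relays to every local connection, including the sender (a real server would skip the originating client):

```python
# Sketch of multi-instance fan-out: each Collab Server instance subscribes
# to a per-document channel on a shared bus and relays incoming messages to
# its locally connected WebSockets. Lists stand in for WebSocket sends.

from collections import defaultdict

class Bus:                                       # stand-in for Redis Pub/Sub
    def __init__(self):
        self.subscribers = defaultdict(list)
    def subscribe(self, channel, handler):
        self.subscribers[channel].append(handler)
    def publish(self, channel, message):
        for handler in self.subscribers[channel]:
            handler(message)

class CollabServer:
    def __init__(self, bus):
        self.bus = bus
        self.local_clients = defaultdict(list)   # doc_id -> connections

    def open(self, doc_id, websocket):
        if doc_id not in self.local_clients:     # first local client: subscribe
            self.bus.subscribe(f"doc:{doc_id}", self._relay)
        self.local_clients[doc_id].append(websocket)

    def commit(self, doc_id, op):
        # (the op-log write to PostgreSQL would happen here, before publishing)
        self.bus.publish(f"doc:{doc_id}", {"doc_id": doc_id, **op})

    def _relay(self, message):
        for ws in self.local_clients[message["doc_id"]]:
            ws.append(message)                   # stand-in for ws.send()

bus = Bus()
s1, s2 = CollabServer(bus), CollabServer(bus)
inbox_a, inbox_b = [], []
s1.open("doc_123", inbox_a)
s2.open("doc_123", inbox_b)                      # User B, different instance
s1.commit("doc_123", {"server_revision": 43, "op_type": "insert",
                      "position": 0, "content": "h"})
# Both inboxes receive the broadcast, including User B on instance 2.
```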
3. Concurrent edits never overwrite each other's work
This is where the architecture gets interesting. The WebSocket relay solves the delivery problem. It does not solve the consistency problem.
Consider this scenario. User A and User B both have the document at revision 42. The document contains the text "cat". User A inserts "s" at position 3 to make "cats". At the same instant, User B inserts "!" at position 3 to make "cat!". Both operations are written against revision 42.
If the server applies A's op first (revision 43: insert "s" at 3), the document becomes "cats". Now B's op arrives: insert "!" at position 3. Applied literally, this produces "cat!s" instead of the correct "cats!". User B's cursor was pointing after the "t", not after the "s". The position has shifted, and naive application misplaces the insert.
I'll treat the actual transform algorithm as a black box for now and cover OT vs CRDTs in detail in the deep dives. At the High-Level Design level, the key architectural decision is: the server is the serialization point.
Components:
- Collab Server: All existing components. Additionally, when an operation arrives with client_revision < server_revision, the server transforms the operation against all ops applied between client_revision and server_revision before committing it.
- PostgreSQL: The operation log serves double duty here. It is both durable storage and the transform history the server queries when it needs to reconcile concurrent ops.
Request walkthrough (concurrent ops):
- Both A and B are at revision 42. Both send ops claiming revision: 42.
- A's op arrives first. Server assigns server_revision: 43, commits, broadcasts.
- B's op arrives: { revision: 42, op_type: "insert", position: 3, content: "!" }.
- Server sees client_revision: 42 but the current doc is at 43. It fetches all ops between rev 42 and 43 (just A's op).
- Server transforms B's op against A's op: A inserted at position 3, so B's position 3 shifts to position 4. Transformed op: { op_type: "insert", position: 4, content: "!" }.
- Server applies the transformed op at server_revision: 44. Document is now "cats!".
- Server broadcasts the transformed op to all clients. Every client applies the same sequence: revision 43 = A's insert, revision 44 = B's shifted insert. All clients converge.
The server transforms and commits at server_revision: 44. The final document "cats!" is what both users intended.
The transform function is the heart of Operational Transformation. For simple insert/delete on plain text, the case analysis is small: insert vs insert shifts positions, and the insert/delete pairings either shift positions, shrink a deletion, or cancel an op outright. The full algorithm becomes significantly more complex for rich text with formatting, nested structure, or undo operations. That complexity is why production systems either use a proven OT library or switch to CRDTs entirely.
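A minimal sketch of that transform for single-character ops, under two simplifying assumptions: deletes remove exactly one character, and on a position tie the earlier-committed op lands first (names are illustrative, not from a specific OT library):

```python
# Sketch of a pairwise OT transform for plain-text insert/delete ops.

def transform(op, committed):
    """Rewrite op (written against an older revision) so it applies
    correctly after committed has already taken effect."""
    pos, cpos = op["position"], committed["position"]
    if committed["op_type"] == "insert":
        if pos >= cpos:                      # insert vs insert: shift right
            pos += len(committed["content"])
    else:                                    # committed op deleted one char
        if pos > cpos:
            pos -= 1                         # everything after it shifts left
        elif pos == cpos and op["op_type"] == "delete":
            return None                      # same char deleted twice: drop
    return {**op, "position": pos}

def apply_op(text, op):
    if op is None:
        return text
    p = op["position"]
    if op["op_type"] == "insert":
        return text[:p] + op["content"] + text[p:]
    return text[:p] + text[p + 1:]

# The walkthrough's scenario: A's insert commits first at revision 43.
doc = "cat"
op_a = {"op_type": "insert", "position": 3, "content": "s"}
op_b = {"op_type": "insert", "position": 3, "content": "!"}
doc = apply_op(doc, op_a)            # revision 43: "cats"
op_b_prime = transform(op_b, op_a)   # position 3 shifts to 4
doc = apply_op(doc, op_b_prime)      # revision 44: "cats!"
```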
4. Documents persist durably and survive server crashes
Components:
- Collab Server: On every operation, the server writes the op to the operation log first and publishes to Redis only after the write commits (a Redis publish cannot join a database transaction, so the durable write must gate the broadcast). If the write fails, the operation is rejected and the client retries.
- Snapshot Worker: A background process that periodically materializes a snapshot. The trigger is operation count, not time: every 1,000 operations, take a snapshot of the current document state and write it to the snapshots table.
- PostgreSQL: Two tables matter here. The operations table is an append-only log (doc_id, server_revision, op data). The snapshots table stores full text content at periodic revisions.
Request walkthrough (crash recovery):
- During an active session, a Collab Server instance crashes after committing operation rev 5,043 but before broadcasting it.
- The client's WebSocket disconnects. Client reconnects to a different instance.
- Client sends GET /documents/{doc_id} with its last known revision 5,040.
- New server fetches the latest snapshot (say, at revision 5,000) and replays operations 5,001 through 5,043 from the log.
- Returns the current content at revision 5,043 to the client. Session resumes.
The Snapshot Worker must checkpoint the snapshot revision atomically with the snapshot content. Writing the text without updating the snapshot revision pointer means a future load will replay ops the snapshot already includes, producing duplicated characters. Use a single transaction: INSERT snapshot + UPDATE document.snapshot_revision.
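That checkpoint can be sketched as one transaction, here with SQLite standing in for PostgreSQL. Table and column names follow the entities above but are illustrative:

```python
# Sketch of the atomic snapshot checkpoint: the snapshot row and the
# document's snapshot_revision pointer commit together or not at all.

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE documents (doc_id TEXT PRIMARY KEY, snapshot_revision INTEGER);
    CREATE TABLE snapshots (doc_id TEXT, revision INTEGER, content TEXT,
                            PRIMARY KEY (doc_id, revision));
    INSERT INTO documents VALUES ('doc_123', 4000);
""")

def checkpoint(db, doc_id, revision, content):
    with db:  # one transaction: both writes commit, or neither does
        db.execute("INSERT INTO snapshots VALUES (?, ?, ?)",
                   (doc_id, revision, content))
        db.execute("UPDATE documents SET snapshot_revision = ? WHERE doc_id = ?",
                   (revision, doc_id))

checkpoint(db, "doc_123", 5000, "full document text at rev 5000")
row = db.execute("SELECT snapshot_revision FROM documents "
                 "WHERE doc_id = 'doc_123'").fetchone()
# row == (5000,)
```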
The operation log is never deleted within the 90-day retention window. This means the server can reconstruct the document at any historical revision, which is the foundation of version history if we ever build it.
Potential Deep Dives
1. How do we handle concurrent edit conflicts without losing data?
There are three hard constraints:
- Both concurrent operations must appear in the final document (no data loss).
- The final document must be the same on every client (convergence).
- The algorithm must be fast enough to run on every keystroke without buffering.
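The convergence constraint can be demonstrated concretely: with a symmetric transform and a deterministic tiebreak for equal positions, both application orders of two concurrent inserts yield the same document. A sketch, with illustrative names and a user-id tiebreak as the assumed rule:

```python
# Sketch of the convergence property for concurrent inserts: a symmetric
# transform with a user-id tiebreak makes both orders produce one result.

def transform_insert(op, against):
    """Shift op past against if against sorts first (position, then user)."""
    if (op["position"], op["user"]) > (against["position"], against["user"]):
        return {**op, "position": op["position"] + len(against["content"])}
    return op

def insert(text, op):
    return text[:op["position"]] + op["content"] + text[op["position"]:]

base = "cat"
a = {"user": "alice", "position": 3, "content": "s"}
b = {"user": "bob",   "position": 3, "content": "!"}

# Order 1: a commits first, b is transformed against it.
order1 = insert(insert(base, a), transform_insert(b, a))
# Order 2: b commits first, a is transformed against it.
order2 = insert(insert(base, b), transform_insert(a, b))
# Both orders converge to "cats!".
```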
2. How do we persist document state durably without bottlenecking writes?
Two constraints pull in opposite directions. Writes must be fast (every keystroke generates one operation). Reads must be fast (loading a 3-year-old document should not replay tens of thousands of operations sequentially).
3. How do we show live cursors and presence of other editors?
Presence is the feature that makes collaborative editing feel alive: seeing the other person's cursor move tells you where attention is focused and prevents two people from editing the same sentence simultaneously.
Final Architecture
The central design insight is the clean separation of concerns across four layers. The Collab Servers are the serialization layer: they own the server revision counter and the transform/CRDT merge logic, but they hold no durable state. Redis handles the fan-out and ephemeral presence in memory. PostgreSQL is the durable ledger: the operation log is the source of truth, and snapshots are a load-time optimization built on top of it. A Collab Server can crash and restart with zero data loss because every committed operation is already in PostgreSQL.
Interview Cheat Sheet
- Start by establishing scale: 10M DAU, up to 100 simultaneous editors per document, 500ms max propagation latency.
- A collaborative editor is fundamentally a distributed state convergence problem. Two users typing at the same position in the same millisecond is your hardest constraint.
- HTTP polling fails immediately: 100 editors polling every 100ms means 1,000 requests per second per document. Use WebSocket connections that push operations to all connected clients.
- The server must be the serialization point: it assigns a global server_revision to every committed operation, in strict order. This global order is what all clients converge to.
- For plain text in a server-controlled product, OT works. For rich text, offline editing, or P2P architectures, use a CRDT library (Yjs/YATA is the production default).
- CRDT key insight: assign a unique ID to every character insert. Operations reference parent character IDs, not numeric positions. Positions never shift, so concurrent inserts are always safe to merge by ID tiebreak.
- OT key insight: transform the incoming operation's position against all committed ops since the client's revision. insert-vs-insert shifts positions; delete-vs-insert cancels or shifts.
- Never delete characters from a CRDT - mark them as tombstoned. Tombstoned characters hold the positional references that keep the list structure valid. Run a GC pass periodically to collect them.
- Persistence is an append-only operation log. Never update the document row in place.
- Snapshots are a load-time optimization, not the source of truth. Trigger them by operation count (every 1,000 ops), not by a timer. This bounds load-time replay to at most 999 operations regardless of document age.
- Presence piggybacks on the existing WebSocket at zero additional infrastructure cost. Use a Redis sorted set scored by heartbeat timestamp. Entries older than 10 seconds are considered offline.
- On Collab Server crash, clients reconnect, send their last known revision, and receive the tail of operations since that revision. No data is lost because all ops are in PostgreSQL.
- Multi-instance fan-out uses Redis Pub/Sub with one channel per active document. Collab Server instances subscribe on document open and unsubscribe on last client disconnect.
- For the interview: pick one merge strategy and commit to it. "CRDT with Yjs, server still provides ordering and durability, clients merge independently" is a complete and defensible architecture.