Collaborative Docs
Design a real-time collaborative document editor like Google Docs or Notion, covering conflict-free concurrent edits, operational transforms vs CRDTs, persistent storage, and live presence at millions of concurrent editors.
What is a collaborative document editing system?
Google Docs lets multiple people edit the same paragraph at once while seeing each other's cursors move in real time. The engineering challenge is not the UI. It is that two users can type at the same cursor position within the same millisecond, and both keystrokes must survive without either one disappearing. This question tests real-time communication protocol design, distributed state merge algorithms (OT vs CRDTs), and the append-only operation log patterns that let a document survive server crashes mid-session.
Functional Requirements
Core Requirements
- Multiple users can edit the same document simultaneously with their changes visible to all collaborators.
- Changes from all editors appear in every connected client within 500ms.
- Concurrent edits never overwrite or lose each other's work.
- Documents are persisted durably and can be reopened after the session ends (or after a server crash).
Below the Line (out of scope)
- Rich media editing (images, tables, embedded spreadsheets) - focus on plain text editing.
- Version history and rollback - the operation log makes this possible, but building the UI and diffing logic is a separate concern.
- Spell check, grammar suggestions, and AI writing assistance.
The hardest part in scope: Merging concurrent edits without data loss. When two users type at the same position at revision 42, both operations claim to be "against revision 42." Applying them naively in arrival order destroys the second user's intent. We will dedicate a full deep dive to the merge algorithm that makes this work.
Rich media editing is below the line because it changes the data model fundamentally. Text is a linear sequence of characters. Images and tables are embedded objects with their own dimensions and layout properties. Adding them requires a tree-structured document model instead of a flat string, which is a multi-month engineering project separate from the concurrency problem. To build it, I would model the document as a tree of blocks (similar to Notion's block model) and apply CRDT semantics at the block level.
Version history and rollback is below the line but nearly free once you have a persistent operation log. Every operation is already recorded with a user ID and timestamp. The rollback feature is essentially a query over that log: "give me the document state at revision N." Implementation is a read path addition on top of the storage design we will build anyway.
Non-Functional Requirements
Core Requirements
- Consistency: Zero data loss. Every committed operation survives a server crash. Concurrent edits must both appear in the final document with no operation silently discarded.
- Latency: Server acknowledges each operation in under 100ms. The operation broadcasts to all collaborators within 500ms (p99).
- Scale: 10M DAU, up to 500K concurrently active documents. Support up to 100 simultaneous editors per document. Peak system-wide operation rate: approximately 1M ops/sec during business hours.
- Availability: 99.99% uptime for document serving. A user should never lose their work due to an infrastructure failure mid-session.
- Storage: Retain the full operation history for 90 days. Keep compressed snapshots indefinitely.
Below the Line
- Sub-50ms operation propagation (WebSocket pub/sub achieves 100-200ms; sub-50ms requires region-local servers and is outside scope)
- Real-time spell-check or grammar feedback
Write pattern: Collaborative documents are write-intensive in a way most systems are not. A user typing at a normal 40 WPM produces roughly 3-4 character operations per second, so an active session with 10 people typing generates roughly 40 operations per second for a single document. At 500K concurrently active documents with an average of 3 active users each, that is 1.5M connected editors; even with only a fraction of them typing at any given instant, peak write load approaches the ~1M ops/sec figure from our scale requirement. The storage layer must handle this without becoming a bottleneck on the hot path.
The 100ms server-acknowledge target is deliberately generous. It lets us use a standard relational database for the operation log on the hot path rather than forcing a specialized write buffer. Above 100ms, users perceive their own keystrokes as laggy, which breaks the feeling of local responsiveness.
The 99.99% availability target drives the replication strategy. We need at least two replicas of the operation log, and the WebSocket serving layer must restart sessions transparently on instance failure.
Core Entities
- Document: The container. Carries a document ID, title, owner user ID, created timestamp, and a pointer to the latest materialized snapshot revision.
- Operation: One atomic edit event. Carries an op ID, document ID, authoring user ID, the client revision it was written against, the server-assigned revision after serialization, op type (insert or delete), position, and content.
- Snapshot: A materialized full-text copy of the document at a specific revision. Used on load to avoid replaying thousands of operations from scratch.
- Session: A live editing session. Carries session ID, document ID, user ID, WebSocket connection ID, current cursor position, and last-heartbeat timestamp.
- User: Account entity. Carries user ID, display name, and a color used for cursor rendering in the editor.
The full schema (indexes, foreign keys, partition strategy) is deferred to the data model deep dive. These five entities drive the API and High-Level Design.
API Design
Group endpoints by the functional requirement they satisfy.
FR 1 and FR 4 - Create, list, and open documents:
POST /documents
Body: { title }
Response: { doc_id, title, created_at }
A POST because this creates a resource. The response gives back the doc_id the client uses for all subsequent calls.
GET /documents/{doc_id}
Response: { doc_id, title, content, revision, owner_id }
The revision field in the response is critical. The client uses it to stamp every outgoing operation with the document revision it was written against. Without it, the server cannot detect concurrent edits.
GET /documents
Response: { documents: [...], next_cursor: "..." }
Cursor-based pagination over the user's document list. Users with hundreds of documents need paginated results.
FR 2 and FR 3 - Real-time editing and conflict-free merges:
The naive approach is a REST endpoint:
POST /documents/{doc_id}/operations
Body: { revision, op_type, position, content }
Response: { server_revision }
This fails the 500ms latency requirement immediately. Every collaborator would need to poll GET /documents/{doc_id}/operations?since=revision to pick up others' changes. With 100 editors polling every 100ms, that is 1,000 HTTP requests per second for a single document. Worse, polling introduces up to 100ms of additional latency per poll cycle.
The evolved shape uses a persistent WebSocket connection:
WebSocket: wss://collab.example.com/documents/{doc_id}
Client connects → Server sends: { type: "sync", content, revision }
Client sends:
{ type: "op", revision: 42, op_type: "insert", position: 15, content: "hello" }
Server sends to client (acknowledgment):
{ type: "op_ack", server_revision: 43 }
Server broadcasts to all other clients on this document:
{ type: "op_broadcast", user_id: "u-123", server_revision: 43,
op_type: "insert", position: 15, content: "hello" }
The client sends its local revision in every operation. The server assigns a monotonically increasing server_revision and broadcasts to all other connected clients. This is the entire real-time editing contract.
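The contract above can be sketched from the client's side. This is a simplified, illustrative model (the class and field names are assumptions, and it omits transforming broadcasts against locally pending ops, which a real client must do):

```python
# Minimal sketch of the client side of the WebSocket editing contract,
# assuming the JSON message shapes shown above. Pending-op transformation
# is deliberately omitted for brevity.

class EditorClient:
    def __init__(self):
        self.content = ""
        self.revision = 0
        self.outbox = []          # ops sent but not yet acknowledged

    def local_insert(self, position, text):
        """Apply locally first, then queue the op stamped with our revision."""
        self.content = self.content[:position] + text + self.content[position:]
        op = {"type": "op", "revision": self.revision,
              "op_type": "insert", "position": position, "content": text}
        self.outbox.append(op)
        return op                  # in a real client, sent over the WebSocket

    def on_message(self, msg):
        if msg["type"] == "sync":
            self.content, self.revision = msg["content"], msg["revision"]
        elif msg["type"] == "op_ack":
            self.outbox.pop(0)     # our op was serialized at this revision
            self.revision = msg["server_revision"]
        elif msg["type"] == "op_broadcast":
            # Another user's op, already transformed by the server.
            p, t = msg["position"], msg["content"]
            self.content = self.content[:p] + t + self.content[p:]
            self.revision = msg["server_revision"]

client = EditorClient()
client.on_message({"type": "sync", "content": "cat", "revision": 42})
client.local_insert(3, "s")       # locally "cats", awaiting ack
client.on_message({"type": "op_ack", "server_revision": 43})
client.on_message({"type": "op_broadcast", "user_id": "u-456",
                   "server_revision": 44, "op_type": "insert",
                   "position": 4, "content": "!"})
# client.content is now "cats!" at revision 44
```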
Bonus - Live presence (not among the core FRs, but cheap to add):
Client sends over existing WebSocket:
{ type: "cursor", position: 47 }
Server broadcasts to other clients:
{ type: "cursor_update", user_id: "u-123", position: 47 }
Presence piggybacks on the same WebSocket connection at zero additional infrastructure cost. I'd add it from day one.
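Presence tracking itself can be sketched with a heartbeat window: every cursor message doubles as a heartbeat, and a user whose last heartbeat is older than a liveness threshold is treated as offline. The names and the 10-second threshold below are illustrative assumptions (an in-memory dict stands in for something like a Redis sorted set scored by heartbeat time):

```python
# Sketch of presence tracking keyed off cursor messages. An in-memory dict
# stands in for a shared store; the 10-second window is an assumption.

OFFLINE_AFTER_SECONDS = 10

class PresenceTracker:
    def __init__(self):
        self.heartbeats = {}   # user_id -> last heartbeat timestamp
        self.cursors = {}      # user_id -> last reported cursor position

    def on_cursor(self, user_id, position, now):
        """Record a cursor message; any message doubles as a heartbeat."""
        self.heartbeats[user_id] = now
        self.cursors[user_id] = position

    def online(self, now):
        """Users whose last heartbeat is within the liveness window."""
        return {u: self.cursors[u] for u, ts in self.heartbeats.items()
                if now - ts <= OFFLINE_AFTER_SECONDS}

p = PresenceTracker()
p.on_cursor("u-123", 47, now=100.0)
p.on_cursor("u-456", 12, now=104.0)
# At t=109 both heartbeats are fresh; at t=111, u-123's is 11s old.
```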
High-Level Design
1. Users can create and open documents
The document load path: fetch the latest snapshot, replay any operations applied after that snapshot, and return the reconstructed content.
Components:
- Client: Web browser running the editor UI.
- API Server: Handles document CRUD, serves document content on load.
- Document DB (PostgreSQL): Stores documents, operation log, and snapshots.
Request walkthrough:
- Client sends GET /documents/{doc_id}.
- API Server queries for the latest snapshot for this document (carries revision N and full text).
- API Server queries for all operations where server_revision > N.
- API Server reconstructs the current content in memory by replaying those ops onto the snapshot text.
- API Server returns { content, revision } to the client.
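The load path can be sketched as follows. The function and its inputs are illustrative stand-ins for the PostgreSQL queries, and deletes are assumed to remove exactly one character for simplicity:

```python
# Sketch of the document load path: start from the newest snapshot, replay
# every later op in server_revision order, return content plus revision.

def load_document(snapshot, ops_after_snapshot):
    """Rebuild current content from a snapshot and the op-log tail."""
    content, revision = snapshot["content"], snapshot["revision"]
    for op in sorted(ops_after_snapshot, key=lambda o: o["server_revision"]):
        p = op["position"]
        if op["op_type"] == "insert":
            content = content[:p] + op["content"] + content[p:]
        else:  # delete of one character (simplifying assumption)
            content = content[:p] + content[p + 1:]
        revision = op["server_revision"]
    return {"content": content, "revision": revision}

doc = load_document(
    snapshot={"content": "cat", "revision": 42},
    ops_after_snapshot=[
        {"server_revision": 43, "op_type": "insert", "position": 3, "content": "s"},
        {"server_revision": 44, "op_type": "insert", "position": 4, "content": "!"},
    ],
)
# doc == {"content": "cats!", "revision": 44}
```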
This handles single-user document viewing. It breaks immediately when two users open the same document and start typing.
2. Multiple users edit simultaneously
Phase 1 - Naive (HTTP polling):
Components:
- Same API Server and PostgreSQL.
- Client polls for new operations via GET /documents/{doc_id}/operations?since={revision}.
Walkthrough:
- User A types a character. Client sends POST /operations.
- User B's client polls GET /operations?since=42 every 500ms to find it.
- This meets the 500ms requirement only if the poll happens immediately after the operation is committed.
Break it: With 100 collaborators each polling every 100ms, a single active document generates 1,000 HTTP requests per second. At 500K active documents, that is 500M requests per second - orders of magnitude beyond what any reasonable server fleet handles. Polling also adds up to a full poll interval of latency in the worst case: an operation committed just after a poll waits ~100ms, plus a round trip, before anyone sees it.
Phase 2 - Evolved (WebSocket relay):
The key insight is that the server knows who is editing which document. Push the operation to every connected client instead of waiting for them to ask. A WebSocket connection keeps state server-side that HTTP cannot.
Components:
- Client: Opens a WebSocket connection to the Collab Server on document open. Closes it on document close.
- Collab Server: Manages WebSocket connections. Receives operations from any connected client, assigns a server_revision, writes to the DB, and broadcasts to all other clients connected to the same document.
- Redis Pub/Sub: When there are multiple Collab Server instances, they cannot share in-process subscriber lists. Redis acts as the message bus: each Collab Server subscribes to a channel per active document. When one instance receives and commits an operation, it publishes to Redis. All other instances subscribed to that document channel receive it and relay it to their local WebSocket connections.
- PostgreSQL: Appends every committed operation to the operation log.
Request walkthrough:
- Both User A and User B open doc_123. Each establishes a WebSocket to one of the Collab Server instances (may be different instances behind a load balancer).
- User A types "h". Client A sends { type: "op", revision: 42, op_type: "insert", position: 0, content: "h" }.
- Collab Server Instance 1 assigns server_revision: 43 and writes the op to the operation log in PostgreSQL.
- Instance 1 publishes { doc_id: "doc_123", op, server_revision: 43 } to the Redis doc:doc_123 channel.
- Instance 2 (where User B is connected) receives the Redis message and broadcasts it over User B's WebSocket.
- User B's editor applies the operation at server_revision: 43. User A sees the server ack.
User A gets an immediate ACK from their local server instance. User B receives the broadcast within one Redis pub/sub hop, typically under 5ms. The WebSocket relay layer adds under 20ms of latency on top of the DB write, well inside the 500ms budget.
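The fan-out path can be sketched with an in-memory bus standing in for Redis Pub/Sub. All names are illustrative, and for brevity this version relays to every local connection, including the sender (a real server would skip the originating client):

```python
# Sketch of multi-instance fan-out: each Collab Server instance subscribes
# to a per-document channel on a shared bus and relays incoming messages to
# its locally connected WebSockets. Lists stand in for WebSocket sends.

from collections import defaultdict

class Bus:                                       # stand-in for Redis Pub/Sub
    def __init__(self):
        self.subscribers = defaultdict(list)
    def subscribe(self, channel, handler):
        self.subscribers[channel].append(handler)
    def publish(self, channel, message):
        for handler in self.subscribers[channel]:
            handler(message)

class CollabServer:
    def __init__(self, bus):
        self.bus = bus
        self.local_clients = defaultdict(list)   # doc_id -> connections

    def open(self, doc_id, websocket):
        if doc_id not in self.local_clients:     # first local client: subscribe
            self.bus.subscribe(f"doc:{doc_id}", self._relay)
        self.local_clients[doc_id].append(websocket)

    def commit(self, doc_id, op):
        # (the op-log write to PostgreSQL would happen here, before publishing)
        self.bus.publish(f"doc:{doc_id}", {"doc_id": doc_id, **op})

    def _relay(self, message):
        for ws in self.local_clients[message["doc_id"]]:
            ws.append(message)                   # stand-in for ws.send()

bus = Bus()
s1, s2 = CollabServer(bus), CollabServer(bus)
inbox_a, inbox_b = [], []
s1.open("doc_123", inbox_a)
s2.open("doc_123", inbox_b)                      # User B, different instance
s1.commit("doc_123", {"server_revision": 43, "op_type": "insert",
                      "position": 0, "content": "h"})
# Both inboxes receive the broadcast, including User B on instance 2.
```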
3. Concurrent edits never overwrite each other's work
This is where the architecture gets interesting. The WebSocket relay solves the delivery problem. It does not solve the consistency problem.
Consider this scenario. User A and User B both have the document at revision 42. The document contains the text "cat". User A inserts "s" at position 3 to make "cats". At the same instant, User B inserts "!" at position 3 to make "cat!". Both operations are written against revision 42.
If the server applies A's op first (revision 43: insert "s" at 3), the document becomes "cats". Now B's op arrives: insert "!" at position 3. Applied literally, this produces "cat!s" instead of the correct "cats!". User B's cursor was pointing after the "t", not after the "s". The position has shifted, and naive application misplaces the insert.
I'll treat the actual transform algorithm as a black box for now and cover OT vs CRDTs in detail in the deep dives. At the High-Level Design level, the key architectural decision is: the server is the serialization point.
Components:
- Collab Server: All existing components. Additionally, when an operation arrives with client_revision < server_revision, the server transforms the operation against all ops applied between client_revision and server_revision before committing it.
- PostgreSQL: The operation log serves double duty here. It is both durable storage and the transform history the server queries when it needs to reconcile concurrent ops.
Request walkthrough (concurrent ops):
- Both A and B are at revision 42. Both send ops claiming revision: 42.
- A's op arrives first. Server assigns server_revision: 43, commits, broadcasts.
- B's op arrives: { revision: 42, op_type: "insert", position: 3, content: "!" }.
- Server sees client_revision: 42 but the current doc is at 43. It fetches all ops between rev 42 and 43 (just A's op).
- Server transforms B's op against A's op: A inserted at position 3, so B's position 3 shifts to position 4. Transformed op: { op_type: "insert", position: 4, content: "!" }.
- Server applies the transformed op at server_revision: 44. Document is now "cats!".
- Server broadcasts the transformed op to all clients. Every client applies the same sequence: revision 43 = A's insert, revision 44 = B's shifted insert. All clients converge.
The server transforms and commits at server_revision: 44. The final document "cats!" is what both users intended.
The transform function is the heart of Operational Transformation. For simple insert/delete on plain text, the case analysis is small: insert vs insert shifts positions, and the insert/delete pairings either shift positions, shrink a deletion, or cancel an op outright. The full algorithm becomes significantly more complex for rich text with formatting, nested structure, or undo operations. That complexity is why production systems either use a proven OT library or switch to CRDTs entirely.
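A minimal sketch of that transform for single-character ops, under two simplifying assumptions: deletes remove exactly one character, and on a position tie the earlier-committed op lands first (names are illustrative, not from a specific OT library):

```python
# Sketch of a pairwise OT transform for plain-text insert/delete ops.

def transform(op, committed):
    """Rewrite op (written against an older revision) so it applies
    correctly after committed has already taken effect."""
    pos, cpos = op["position"], committed["position"]
    if committed["op_type"] == "insert":
        if pos >= cpos:                      # insert vs insert: shift right
            pos += len(committed["content"])
    else:                                    # committed op deleted one char
        if pos > cpos:
            pos -= 1                         # everything after it shifts left
        elif pos == cpos and op["op_type"] == "delete":
            return None                      # same char deleted twice: drop
    return {**op, "position": pos}

def apply_op(text, op):
    if op is None:
        return text
    p = op["position"]
    if op["op_type"] == "insert":
        return text[:p] + op["content"] + text[p:]
    return text[:p] + text[p + 1:]

# The walkthrough's scenario: A's insert commits first at revision 43.
doc = "cat"
op_a = {"op_type": "insert", "position": 3, "content": "s"}
op_b = {"op_type": "insert", "position": 3, "content": "!"}
doc = apply_op(doc, op_a)            # revision 43: "cats"
op_b_prime = transform(op_b, op_a)   # position 3 shifts to 4
doc = apply_op(doc, op_b_prime)      # revision 44: "cats!"
```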
4. Documents persist durably and survive server crashes
Components:
- Collab Server: On every operation, the server writes the op to the operation log first and publishes to Redis only after the write commits (a Redis publish cannot join a database transaction, so the durable write must gate the broadcast). If the write fails, the operation is rejected and the client retries.
- Snapshot Worker: A background process that periodically materializes a snapshot. The trigger is operation count, not time: every 1,000 operations, take a snapshot of the current document state and write it to the snapshots table.
- PostgreSQL: Two tables matter here. The operations table is an append-only log (doc_id, server_revision, op data). The snapshots table stores full text content at periodic revisions.
Request walkthrough (crash recovery):
- During an active session, a Collab Server instance crashes after committing operation rev 5,043 but before broadcasting it.
- The client's WebSocket disconnects. Client reconnects to a different instance.
- Client sends GET /documents/{doc_id} with its last known revision 5,040.
- New server fetches the latest snapshot (say, at revision 5,000) and replays operations 5,001 through 5,043 from the log.
- Returns the current content at revision 5,043 to the client. Session resumes.
The Snapshot Worker must checkpoint the snapshot revision atomically with the snapshot content. Writing the text without updating the snapshot revision pointer means a future load will replay ops the snapshot already includes, producing duplicated characters. Use a single transaction: INSERT snapshot + UPDATE document.snapshot_revision.
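That checkpoint can be sketched as one transaction, here with SQLite standing in for PostgreSQL. Table and column names follow the entities above but are illustrative:

```python
# Sketch of the atomic snapshot checkpoint: the snapshot row and the
# document's snapshot_revision pointer commit together or not at all.

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE documents (doc_id TEXT PRIMARY KEY, snapshot_revision INTEGER);
    CREATE TABLE snapshots (doc_id TEXT, revision INTEGER, content TEXT,
                            PRIMARY KEY (doc_id, revision));
    INSERT INTO documents VALUES ('doc_123', 4000);
""")

def checkpoint(db, doc_id, revision, content):
    with db:  # one transaction: both writes commit, or neither does
        db.execute("INSERT INTO snapshots VALUES (?, ?, ?)",
                   (doc_id, revision, content))
        db.execute("UPDATE documents SET snapshot_revision = ? WHERE doc_id = ?",
                   (revision, doc_id))

checkpoint(db, "doc_123", 5000, "full document text at rev 5000")
row = db.execute("SELECT snapshot_revision FROM documents "
                 "WHERE doc_id = 'doc_123'").fetchone()
# row == (5000,)
```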
The operation log is never deleted within the 90-day retention window. This means the server can reconstruct the document at any historical revision, which is the foundation of version history if we ever build it.
Potential Deep Dives
1. How do we handle concurrent edit conflicts without losing data?
There are three hard constraints:
- Both concurrent operations must appear in the final document (no data loss).
- The final document must be the same on every client (convergence).
- The algorithm must be fast enough to run on every keystroke without buffering.
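The convergence constraint can be demonstrated concretely: with a symmetric transform and a deterministic tiebreak for equal positions, both application orders of two concurrent inserts yield the same document. A sketch, with illustrative names and a user-id tiebreak as the assumed rule:

```python
# Sketch of the convergence property for concurrent inserts: a symmetric
# transform with a user-id tiebreak makes both orders produce one result.

def transform_insert(op, against):
    """Shift op past against if against sorts first (position, then user)."""
    if (op["position"], op["user"]) > (against["position"], against["user"]):
        return {**op, "position": op["position"] + len(against["content"])}
    return op

def insert(text, op):
    return text[:op["position"]] + op["content"] + text[op["position"]:]

base = "cat"
a = {"user": "alice", "position": 3, "content": "s"}
b = {"user": "bob",   "position": 3, "content": "!"}

# Order 1: a commits first, b is transformed against it.
order1 = insert(insert(base, a), transform_insert(b, a))
# Order 2: b commits first, a is transformed against it.
order2 = insert(insert(base, b), transform_insert(a, b))
# Both orders converge to "cats!".
```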
2. How do we persist document state durably without bottlenecking writes?
Two constraints pull in opposite directions. Writes must be fast (every keystroke generates one operation). Reads must be fast (loading a 3-year-old document should not replay tens of thousands of operations sequentially).
3. How do we show live cursors and presence of other editors?
Presence is the feature that makes collaborative editing feel alive: seeing the other person's cursor move tells you where attention is focused and prevents two people from editing the same sentence simultaneously.
Final Architecture
The central design insight is the clean separation of concerns across four layers. The Collab Servers are the serialization layer: they own the server revision counter and the transform/CRDT merge logic, but they hold no durable state. Redis handles the fan-out and ephemeral presence in memory. PostgreSQL is the durable ledger: the operation log is the source of truth, and snapshots are a load-time optimization built on top of it. A Collab Server can crash and restart with zero data loss because every committed operation is already in PostgreSQL.
Interview Cheat Sheet
- Start by establishing scale: 10M DAU, up to 100 simultaneous editors per document, 500ms max propagation latency.
- A collaborative editor is fundamentally a distributed state convergence problem. Two users typing at the same position in the same millisecond is your hardest constraint.
- HTTP polling fails immediately: 100 editors polling every 100ms means 1,000 requests per second per document. Use WebSocket connections that push operations to all connected clients.
- The server must be the serialization point: it assigns a global server_revision to every committed operation, in strict order. This global order is what all clients converge to.
- For plain text in a server-controlled product, OT works. For rich text, offline editing, or P2P architectures, use a CRDT library (Yjs/YATA is the production default).
- CRDT key insight: assign a unique ID to every character insert. Operations reference parent character IDs, not numeric positions. Positions never shift, so concurrent inserts are always safe to merge by ID tiebreak.
- OT key insight: transform the incoming operation's position against all committed ops since the client's revision. insert-vs-insert shifts positions; delete-vs-insert cancels or shifts.
- Never delete characters from a CRDT - mark them as tombstoned. Tombstoned characters hold the positional references that keep the list structure valid. Run a GC pass periodically to collect them.
- Persistence is an append-only operation log. Never update the document row in place.
- Snapshots are a load-time optimization, not the source of truth. Trigger them by operation count (every 1,000 ops), not by a timer. This bounds load-time replay to at most 999 operations regardless of document age.
- Presence piggybacks on the existing WebSocket at zero additional infrastructure cost. Use a Redis sorted set scored by heartbeat timestamp. Entries older than 10 seconds are considered offline.
- On Collab Server crash, clients reconnect, send their last known revision, and receive the tail of operations since that revision. No data is lost because all ops are in PostgreSQL.
- Multi-instance fan-out uses Redis Pub/Sub with one channel per active document. Collab Server instances subscribe on document open and unsubscribe on last client disconnect.
- For the interview: pick one merge strategy and commit to it. "CRDT with Yjs, server still provides ordering and durability, clients merge independently" is a complete and defensible architecture.