Document Vault
Design a scalable document management system like Google Docs or Notion: versioned storage, collaborative editing, access control, full-text search, and real-time sync across clients.
What is a document management system?
A document management system lets users create, organize, and retrieve documents in a folder hierarchy, with version history and per-node access control. The interesting engineering challenge isn't the CRUD; it's storing 1,000 edits without consuming 50MB per document, enforcing permission inheritance across a deep folder tree without an O(depth) query on every read, and making 1 billion documents searchable in under 500ms.
I'd frame this question around the version storage and ACL inheritance tradeoffs early, because they force completely different design decisions depending on which constraint you prioritize. This combination of storage efficiency, hierarchical access control, and full-text indexing is why the question shows up consistently in senior interviews.
Functional Requirements
Core Requirements
- Users can create, read, update, and delete documents organized in folders.
- The system maintains a version history for every document and supports restoring to any previous version.
- Users can share documents and folders with specific people, controlling read, write, and manage permissions.
- Users can search documents by title and full-text content.
Below the Line (out of scope)
- Real-time collaborative editing (concurrent multi-user editing using OT or CRDT)
- Large binary attachments embedded in documents (images, PDFs, videos)
- Comment threads on specific document sections
- Document templates and publishing or export to PDF
The hardest parts in scope: Version history storage and hierarchical access control. Storing 1,000 edits naively at 50KB per copy consumes 50MB per document. Permission inheritance across a deep folder tree requires careful design to avoid a slow tree walk on every read request.
Real-time collaborative editing is below the line because it requires Operational Transformation or CRDT-based merge logic plus a WebSocket broadcast layer. To add it, introduce a Collaboration Service that serializes concurrent operations and broadcasts change deltas to all connected clients over WebSocket, sitting alongside the Document Service rather than inside it. The server resolves conflicts by serializing concurrent ops against a shared document state, so two users editing the same paragraph see a merged result rather than a last-write-wins overwrite.
Large binary attachments are below the line because they introduce a separate upload flow (chunked multipart, virus scanning, CDN delivery) without changing document metadata or versioning semantics. To add them, store the binary in S3, embed an attachment_id reference link in the document body, and lazy-load the binary on the client. The document itself stays lightweight text; the attachment reference is just a URL pointer.
Comment threads are below the line because they introduce a separate content type with its own read patterns, including threading, reactions, and notifications. To add them, store comments as a separate entity anchored to a (document_id, anchor_offset) tuple, using the same ACL table to control visibility.
Document templates and export are below the line because they are rendering concerns that don't affect the storage or access control design. To add export, a background renderer consumes the latest document version and produces a PDF, storing it in S3 as a derived artifact.
Non-Functional Requirements
Core Requirements
- Scale: 100M registered users, 1B documents total.
- Storage: Average document 50KB; 50TB for text content. With delta encoding, version history adds roughly 10-20TB on top, bringing the total to roughly 60-70TB.
- Write latency: Document save acknowledges in under 500ms p99.
- Read latency: Document load completes in under 300ms p99. Search results return in under 500ms p99.
- Availability: 99.9% uptime. Consistency over availability for document writes: a confirmed save must never be silently lost.
- Durability: Document content and version history must survive any single-region failure.
Below the Line
- Sub-100ms global read latency via CDN edge caching (achievable with aggressive caching but not a core NFR here)
- Exactly-once change event delivery to the search indexer (at-least-once with idempotent indexing is sufficient)
Read/write ratio: 10:1. Documents are read far more than written. This opens the door for a Redis cache in front of hot metadata reads and justifies async search index updates rather than synchronous writes on every save.
The 500ms write latency budget is generous enough to absorb a synchronous write to PostgreSQL plus an async delta computation. It is not generous enough to also synchronously update the search index, so search replication goes through an async pipeline with a few seconds of lag.
Storage math drives a key design decision: 1B documents at 50KB each is 50TB for content alone. With version history averaging 10 deltas per document at 1-2KB per delta, that adds another 10-20TB. Delta encoding is not optional at this scale.
I'd run this math on the whiteboard early. It anchors the interviewer's expectations and makes the case for delta encoding before you propose it.
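The storage math above can be checked in a few lines. This is a sketch of the arithmetic only; the delta count and sizes are the averages assumed in the NFRs, not measured figures.

```python
# Back-of-envelope storage math from the NFRs (illustrative arithmetic only).
DOCS = 1_000_000_000        # 1B documents
DOC_KB = 50                 # average document body, in KB
DELTAS_PER_DOC = 10         # average retained deltas per document
DELTA_KB = 1.5              # midpoint of the assumed 1-2KB compressed delta

KB_PER_TB = 1e9

content_tb = DOCS * DOC_KB / KB_PER_TB                    # content alone
delta_tb = DOCS * DELTAS_PER_DOC * DELTA_KB / KB_PER_TB   # delta-encoded history
snapshot_tb = DOCS * DELTAS_PER_DOC * DOC_KB / KB_PER_TB  # naive full snapshots

print(f"content: {content_tb:.0f} TB, deltas: {delta_tb:.0f} TB, "
      f"naive snapshots: {snapshot_tb:.0f} TB")
```

The last line is the whiteboard punchline: naive snapshots cost an order of magnitude more than delta-encoded history for the same retention.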
Core Entities
- Document: The primary content item. Contains `document_id`, `title`, `owner_id`, `parent_folder_id`, `created_at`, and `updated_at`. The content body lives in object storage; the database row holds only metadata and a pointer to the latest version.
- Folder: A container node in the hierarchy. Shares the same `nodes` table as documents, distinguished by a `node_type` discriminator column. Folders can nest arbitrarily deep.
- Version: A recorded revision of a document. Contains `version_id`, `document_id`, `created_by`, `created_at`, `parent_version_id`, and an S3 key pointing to the compressed delta (or a periodic full snapshot anchor).
- AclEntry: A permission record linking a principal (user or group) to a node (document or folder). Contains `node_id`, `principal_id`, `permission` (read, write, manage), and an `is_override` flag that stops permission propagation from overwriting this entry.
Full schema, index design, and delta format are covered in the deep dives. These four entities are sufficient to drive the API and high-level architecture.
API Design
One endpoint group per functional requirement, evolved where the naive shape breaks down.
FR 1 - CRUD on documents and folders:
# Create a document inside a folder
POST /nodes
Body: { type: "document", title, parent_folder_id, content }
Response: 201 Created · { node_id, version_id }
# Read a document (content fetched directly from S3 via presigned URL)
GET /nodes/{node_id}
Response: 200 · { title, parent_folder_id, current_version_id, content_url }
# Update a document
PATCH /nodes/{node_id}
Body: { title?, content? }
Response: 200 · { version_id }
# Delete a node
DELETE /nodes/{node_id}
Response: 204 No Content
PATCH over PUT for updates: the client sends only changed fields rather than the full document, keeping request payloads small. The content_url in the GET response is a presigned S3 URL so the client fetches the body directly from S3, avoiding large payload routing through the API tier.
I'd call out the presigned URL pattern here because interviewers often probe it. The alternative (streaming the document body through the API server) works at small scale but becomes a CPU and bandwidth bottleneck at 1B documents. Presigned URLs let S3 handle the heavy lifting of content delivery.
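The core idea behind a presigned URL can be shown with a toy HMAC scheme: the API tier signs (bucket, key, expiry) with a secret shared with the storage tier, which then authorizes the download without consulting the metadata database. This is strictly an illustration of the pattern; real S3 presigning uses AWS Signature Version 4, and the host and key names here are hypothetical.

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

# Hypothetical key shared between the API tier and the storage tier.
SECRET = b"shared-signing-key"

def presign(bucket: str, key: str, ttl_s: int = 300) -> str:
    """Sign (bucket, key, expiry) so the storage tier can authorize a
    download on its own, keeping the payload off the API servers."""
    expires = int(time.time()) + ttl_s
    payload = f"{bucket}/{key}:{expires}".encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    query = urlencode({"expires": expires, "sig": sig})
    return f"https://storage.example.com/{bucket}/{key}?{query}"

def verify(bucket: str, key: str, expires: int, sig: str) -> bool:
    """Storage-tier check: recompute the MAC and reject expired links."""
    payload = f"{bucket}/{key}:{expires}".encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig) and time.time() < expires
```

Because the signature covers the exact key and expiry, a leaked URL grants access to one object for a bounded window, nothing more.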
FR 2 - Version history and restore:
# List version history
GET /nodes/{node_id}/versions?cursor=<opaque>&limit=20
Response: 200 · { versions: [...], next_cursor }
# Restore to a specific version (creates a new version, never mutates history)
POST /nodes/{node_id}/versions/{version_id}/restore
Response: 201 Created · { new_version_id }
POST for restore because restoring appends a new version to history rather than mutating the document in place. The old version is never deleted; the timeline gains a restore marker pointing back to the target version.
FR 3 - Access control:
# Grant or update access for a principal
PUT /nodes/{node_id}/acl
Body: { principal_id, permission, is_override }
Response: 200 · { acl_entry_id }
# List current ACL for a node
GET /nodes/{node_id}/acl
Response: 200 · { entries: [...] }
FR 4 - Search:
# Full-text search across accessible documents
GET /search?q=quarterly+report&cursor=<opaque>&limit=20
Response: 200 · { results: [...], next_cursor }
Search returns only results the requesting user has at least read permission for. The Document Service resolves the user's accessible node_ids before forwarding the query to Elasticsearch, rather than relying on Elasticsearch alone for access control. Cursor-based pagination handles deep result sets without the expensive OFFSET scans that would kill a 1B-document index.
High-Level Design
1. CRUD and folder hierarchy
The simplest design that satisfies FR 1 separates metadata from content: metadata (title, parent, timestamps) lives in PostgreSQL while the document body lives in S3. The Document Service mediates all access and generates presigned S3 URLs for direct client downloads.
Components:
- Client: Web or mobile app sending REST requests to the Document Service.
- Document Service: Validates requests, enforces authorization, reads and writes metadata to PostgreSQL, writes content blobs to S3, and generates presigned URLs for client downloads.
- PostgreSQL: Stores the `nodes` table (documents and folders), the `versions` table, and the `acl_entries` table added in FR 3.
- S3 (Object Storage): Stores document content blobs at unique keys per version. Presigned URLs let clients upload and download content directly without routing payloads through the API tier.
Request walkthrough (create document):
- Client sends `POST /nodes` with the title, `parent_folder_id`, and content body.
- Document Service inserts a row into `nodes` in PostgreSQL.
- Document Service writes the content blob to S3 at key `docs/{node_id}/v1`.
- Document Service inserts a row into `versions` pointing to the S3 key.
- Document Service returns `{ node_id, version_id }` to the client.
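The create flow can be sketched end to end with in-memory stand-ins for the PostgreSQL tables and the S3 bucket. All names here are illustrative, not a real schema or client library.

```python
import uuid

# In-memory stand-ins for the PostgreSQL tables and the S3 bucket.
nodes: dict = {}
versions: dict = {}
s3: dict = {}

def create_document(title: str, parent_folder_id: str, content: str) -> dict:
    """POST /nodes: metadata row, content blob at a versioned key, version row."""
    node_id = str(uuid.uuid4())
    version_id = str(uuid.uuid4())
    nodes[node_id] = {"node_type": "document", "title": title,
                      "parent_folder_id": parent_folder_id,
                      "current_version_id": version_id}
    s3_key = f"docs/{node_id}/v1"
    s3[s3_key] = content  # body goes to object storage, not the metadata store
    versions[version_id] = {"document_id": node_id, "s3_key": s3_key,
                            "parent_version_id": None}
    return {"node_id": node_id, "version_id": version_id}
```

The point the sketch makes: the `nodes` row carries a pointer to the current version, and the version row carries a pointer to the blob; nothing large ever lands in the metadata store.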
Storing content in S3 rather than as a database BLOB keeps PostgreSQL row sizes small and avoids streaming large payloads through the application tier. I treat S3 as a key-value store at this stage and defer CDN caching to the final architecture.
2. Version history
FR 2 adds version recording. The naive approach stores a full document snapshot on every save. This section shows why that fails and how delta encoding solves it.
Naive approach: full snapshots
On every PATCH, the Document Service writes the entire content body to a new S3 key and inserts a new versions row pointing to it.
This is correct but unsustainable. At 50KB per save and 20 saves per day per active document, a single document accumulates 1MB of version data daily. At 1B documents, total version storage grows linearly with no cross-version deduplication.
I often see candidates jump straight to "use diffs" without showing why full snapshots fail first. Walking through this math in the interview demonstrates that you reason from constraints, not from memorized answers.
Evolved approach: delta encoding
The key insight is that most edits change only a few lines out of potentially hundreds. Store the diff (delta) between the new content and its parent version rather than a full snapshot.
Updated components:
- Version Service: On every PATCH, fetches the current content from S3, computes a Myers diff against the new content, stores the compressed delta, and inserts a `versions` row with the delta key and parent version pointer.
Request walkthrough (save with delta):
- Client sends `PATCH /nodes/{id}` with updated content.
- Document Service calls Version Service to record the new version.
- Version Service fetches the current document body from S3.
- Version Service computes the diff between the old and new body and compresses the result with Zstandard.
- Version Service stores the delta at `docs/{id}/deltas/{version_id}` and inserts the `versions` row.
When a user restores to an old version, the system calls save_version with the reconstructed content as the new body, creating a forward version that matches the target. No history is ever deleted; the timeline shows a restore marker.
Periodic snapshot anchors every 50 versions cap the reconstruction chain. Without anchors, restoring version 1,001 from scratch requires replaying 1,000 diffs. With an anchor at version 1,000, reconstruction replays at most 50 diffs, completing well within the 300ms read budget.
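The delta and anchor mechanics can be sketched with standard-library stand-ins: `difflib.SequenceMatcher` in place of a Myers-diff library, and `zlib` in place of Zstandard. The opcode format is invented for the illustration.

```python
import json
import zlib
from difflib import SequenceMatcher

ANCHOR_EVERY = 50  # a full snapshot every N versions caps the replay chain

def make_delta(old: str, new: str) -> bytes:
    """Encode new relative to old as copy/emit opcodes, then compress."""
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, old, new).get_opcodes():
        if tag == "equal":
            ops.append(["copy", i1, i2])      # unchanged span, not stored twice
        else:
            ops.append(["emit", new[j1:j2]])  # inserted or replacement text
    return zlib.compress(json.dumps(ops).encode())

def apply_delta(old: str, delta: bytes) -> str:
    """Rebuild a child version from its parent plus the stored delta."""
    out = []
    for op in json.loads(zlib.decompress(delta)):
        out.append(old[op[1]:op[2]] if op[0] == "copy" else op[1])
    return "".join(out)

def reconstruct(n: int, anchors: dict, deltas: dict) -> str:
    """Replay deltas forward from the nearest snapshot anchor at or below n."""
    base = max(v for v in anchors if v <= n)
    body = anchors[base]
    for v in range(base + 1, n + 1):
        body = apply_delta(body, deltas[v])
    return body
```

`reconstruct` is where the anchor spacing pays off: the loop runs at most `ANCHOR_EVERY` iterations regardless of how deep the version history is.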
3. Access control
FR 3 introduces an acl_entries table. The Document Service checks this table before serving any request. I defer the inheritance strategy to the deep dive but establish the data model here.
Components (new or changed):
- AclEntry table: Rows contain `node_id`, `principal_id`, `permission`, and `is_override`. Added to the PostgreSQL schema alongside `nodes` and `versions`.
Request walkthrough (access check on read):
- Client sends `GET /nodes/{id}` with an auth token.
- Document Service resolves the user's effective permissions for this node.
- If the user has at least read permission, the service returns the metadata and a presigned S3 URL. Otherwise it returns 403.
Never grant access based only on folder membership without also checking whether a document-level override exists. A document can be privately shared with a user who has no access to the containing folder. Skipping this check leaks private documents.
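The resolution semantics can be sketched as a nearest-grant-wins walk toward the root, using hypothetical in-memory stand-ins for the tables (`parents` maps node to parent folder, `acl` maps `(node_id, principal_id)` to a grant). The deep dive replaces this O(depth) walk with a precomputed lookup; the sketch only pins down the behavior.

```python
# Hypothetical in-memory stand-ins for the nodes and acl_entries tables.
PERM_RANK = {"read": 1, "write": 2, "manage": 3}

def has_permission(node_id, principal_id, needed, parents, acl):
    """Walk from the node toward the root; the nearest explicit grant for
    this principal wins. A document-level grant is found before any folder
    grant, which is what lets a privately shared document bypass folder
    membership entirely."""
    current = node_id
    while current is not None:
        granted = acl.get((current, principal_id))
        if granted is not None:
            return PERM_RANK[granted] >= PERM_RANK[needed]
        current = parents.get(current)
    return False
```

Checking the document before its ancestors is exactly the override check described above: a user with a direct grant on the document passes even with zero access to the containing folder.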
4. Full-text search
FR 4 adds asynchronous search indexing. Updating Elasticsearch synchronously on every write would add latency beyond the 500ms budget and couple the write path to search cluster availability.
New components:
- Kafka (CDC events): PostgreSQL's logical replication (via Debezium) publishes a change event to Kafka on every write to the `nodes` table. Alternatively, the Document Service can publish an explicit application event after each successful write.
- Search Indexer: A Kafka consumer that reads change events, fetches the latest document content from S3, tokenizes it, and writes an inverted index entry to Elasticsearch.
- Elasticsearch: Stores the full-text inverted index across document titles and content. Queries include a `node_id` allow-list filter derived from the user's effective permissions.
Request walkthrough (search):
- Client sends `GET /search?q=quarterly+report`.
- Document Service fetches the list of `node_id`s the user has read access to.
- Document Service sends the query plus the `node_id` allow-list to Elasticsearch as a filtered query.
- Elasticsearch returns ranked results. Document Service enriches them with titles and metadata.
- Client receives the result list in under 500ms.
Search results are eventually consistent: a newly saved document appears in search after the indexer processes the CDC event, typically within a few seconds. This lag is acceptable under the stated NFRs; real-time search consistency would require synchronous indexing and a fundamentally different write path.
One point I'd raise proactively in an interview: the node_id allow-list approach works at moderate scale, but a user with access to 500,000 documents generates a massive filter clause. For power users, precompute the accessible document set in a dedicated permissions index that Elasticsearch references, rather than passing 500K IDs in the query.
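The filtered-query shape can be sketched as the request body the Document Service would send. The field names, the title boost, and the index mapping are assumptions for illustration, not a fixed schema; the structural point is that the ACL allow-list goes in a non-scoring `filter` clause while the text match stays in `must` for relevance ranking.

```python
def build_search_query(q: str, accessible_node_ids: list,
                       page_size: int = 20) -> dict:
    """Elasticsearch request body: relevance scoring over title and content,
    with a terms filter enforcing the ACL allow-list."""
    return {
        "size": page_size,
        "query": {
            "bool": {
                "must": [
                    {"multi_match": {"query": q,
                                     "fields": ["title^2", "content"]}}
                ],
                "filter": [
                    {"terms": {"node_id": accessible_node_ids}}
                ],
            }
        },
    }
```

For the power-user case above, the inline ID list would be swapped for Elasticsearch's terms-lookup form, which references the precomputed permissions index instead of shipping 500K IDs in the request.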
Potential Deep Dives
1. How do we store version history without blowing up storage?
Three constraints drive this design:
- A typical document is 50KB. At 20 saves per day on an active document, multiplied across 1B documents, full-snapshot storage is economically unsustainable.
- Version reconstruction must complete in under 300ms, since restoring a version is a user-facing read operation with the same latency budget as a normal document load.
- Content that is identical across versions (unchanged paragraphs present in both version 5 and version 6) should not be stored twice.
2. How do we implement hierarchical access control efficiently?
Three constraints drive this design:
- Permissions must inherit from parent folders unless overridden on a specific node.
- Access checks happen on every read and write, not just sharing operations, and must complete in under 10ms.
- A user can have access to a specific document even if they have no access to its containing folder.
Final Architecture
PostgreSQL is the source of truth for all metadata; S3 holds all content, whether full snapshots or deltas. Elasticsearch holds a derived index that can be rebuilt from S3 and PostgreSQL if lost. Redis holds only effective permissions and hot metadata with short TTLs, so any cache failure degrades latency but never correctness.
Interview Cheat Sheet
- Store document content in S3, not in the database. PostgreSQL holds only metadata (title, parent, timestamps, version pointers). This keeps row sizes small and avoids routing large payloads through the application tier.
- Use delta encoding (Myers diff) rather than full snapshots for version history. A typical edit changes 1-5 lines; the compressed delta is 0.5-2KB versus 12KB for a compressed full snapshot.
- Write periodic snapshot anchors every 50 versions to cap reconstruction chain length. Without anchors, restoring an old version requires replaying every delta from the document's creation forward.
- Restore is always append, never delete. When a user restores to version 42, the system creates a new forward version with the same content as version 42. Version history is immutable.
- Content-addressable storage deduplicates identical snapshot anchors. Two documents that diverged from the same starting state share one anchor blob in S3, stored once.
- Store ACL entries at the point of explicit grant and maintain a separate `effective_perms` table for fast lookups. Access checks read a single indexed row rather than traversing the ancestor tree.
- The `is_override` flag on an ACL entry stops propagation from overwriting a document-level permission. A privately shared document retains its access even when the containing folder permissions change.
- Use Elasticsearch for full-text search with an inverted index across title and content. Update the index asynchronously via CDC events from PostgreSQL through Kafka.
- Never rely on Elasticsearch alone for access control. Resolve the user's accessible `node_id`s from PostgreSQL first, then pass that allow-list as a filter to Elasticsearch.
- Search index updates are eventually consistent. A newly saved document appears in search after the indexer processes the CDC event, typically within a few seconds. This is acceptable at 99.9% availability without real-time search guarantees.
- Cache effective permissions in Redis with a 30-60 second TTL to absorb read spikes without a PostgreSQL query on every document load. Stale permissions for 60 seconds are acceptable; silent data leakage is not.
- The 10:1 read/write ratio justifies read replicas for PostgreSQL and a Redis cache in front of hot metadata. The primary database handles only writes during normal operation.
- Hierarchical permission propagation is asynchronous. Batch large recursive CTEs into chunks of 10,000 rows to avoid long-running transactions that block reads during folder-level sharing operations.