Design a text summarization API

TL;DR

The core architectural decision is document length detection and routing: documents under 128K tokens go through a single-pass summary, documents over 128K tokens go through chunked map-reduce with hierarchical merging.
Use GPT-4o-mini ($0.15/1M input tokens) for 90%+ of requests. Reserve GPT-4o ($2.50/1M input) for high-fidelity summaries only when the caller explicitly requests premium quality.
Semantic chunking (splitting at paragraph or section boundaries) produces 30-40% better coherence scores than fixed-size token splitting, because it preserves the author's logical structure.
Stream partial results using Server-Sent Events (SSE). Users see the first summary tokens in under 2 seconds even for 100-page documents.
The production lesson: most summarization failures are not model failures. They are chunking failures where a key argument spans two chunks and gets lost at the boundary.
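
To make the pricing trade-off concrete, here is a small worked calculation. The input prices are the ones quoted above; the output prices ($0.60 and $10.00 per 1M tokens) and the 50,000 / 500 token counts are assumptions chosen for illustration, and list prices can change.

```python
# Prices in USD per 1M tokens. Input prices come from the text above;
# output prices are assumptions for this sketch and may be out of date.
PRICES = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one summarization call in USD."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical request: a 50K-token document and a 500-token summary.
standard = estimate_cost_usd("gpt-4o-mini", 50_000, 500)
premium = estimate_cost_usd("gpt-4o", 50_000, 500)
print(f"standard: ${standard:.4f}, premium: ${premium:.4f}")
```

Under these assumptions the standard tier costs roughly $0.008 per summary and the premium tier roughly $0.13, which is why the premium model is worth reserving for callers who ask for it.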

Requirements

Functional requirements

Users can submit a document (plain text, PDF, or URL) and receive a text summary within a specified length (short, medium, or detailed).
Users can stream the summary response in real time as it is generated, instead of waiting for the full result.
The system handles documents from 500 tokens to 500K tokens without manual intervention from the user.
Users can specify a target language for the summary, enabling cross-lingual summarization.
The system provides a faithfulness score with each summary, indicating how accurately the summary reflects the source document.

Non-functional requirements

P95 latency under 5 seconds for documents under 10K tokens (single-pass path).
P95 latency under 30 seconds for documents up to 500K tokens (chunked path).
Cost per summary under $0.01 for documents under 50K tokens using the default model tier.
Throughput: 500 concurrent summarization requests with graceful degradation under load.
Availability: 99.9% uptime with automatic failover between model providers.
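Faithfulness: ROUGE-L score above 0.35 on the CNN/DailyMail benchmark for short summaries. ROUGE-L measures overlap with reference summaries, so treat it as a proxy for content coverage rather than a direct check of factual consistency.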
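
For readers unfamiliar with the ROUGE-L target, the sketch below shows roughly how the metric is computed: it finds the longest common subsequence (LCS) of tokens between a candidate summary and a reference, then combines LCS-based precision and recall into an F1 score. This is a simplified illustration that assumes lowercase whitespace tokenization and no stemming; benchmark numbers should come from an established evaluation package, not from this function.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L F1 between a candidate and a reference summary."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 4))
```

In this toy pair, five of six tokens form the common subsequence, so precision and recall are both 5/6.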

The hardest engineering problem here: maintaining coherence when a 500K-token document is split into chunks. Each chunk is summarized independently, and the merge step must produce a summary that reads like it was written from the full document, not stitched from fragments.
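
One common way to reduce boundary loss is to split at paragraph boundaries and repeat the trailing paragraph of each chunk at the start of the next. The sketch below illustrates that idea under loud assumptions: `count_tokens` uses whitespace words as a stand-in for a real tokenizer, paragraphs are separated by blank lines, and the function names are hypothetical. The overlap is added on top of the budget, so a chunk can exceed `max_tokens` by the size of the carried paragraphs.

```python
def count_tokens(text: str) -> int:
    # Assumption: whitespace words stand in for model tokens in this sketch.
    return len(text.split())


def chunk_by_paragraph(
    document: str, max_tokens: int, overlap_paragraphs: int = 1
) -> list[str]:
    """Pack whole paragraphs into chunks, repeating trailing paragraphs as overlap."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    chunks: list[str] = []
    overlap: list[str] = []   # paragraphs carried over from the previous chunk
    fresh: list[str] = []     # paragraphs that are new in the current chunk
    fresh_tokens = 0

    for para in paragraphs:
        para_tokens = count_tokens(para)
        # Close the current chunk when the next paragraph would exceed the budget.
        if fresh and fresh_tokens + para_tokens > max_tokens:
            chunks.append("\n\n".join(overlap + fresh))
            overlap = fresh[-overlap_paragraphs:] if overlap_paragraphs > 0 else []
            fresh, fresh_tokens = [], 0
        fresh.append(para)
        fresh_tokens += para_tokens

    if fresh:
        chunks.append("\n\n".join(overlap + fresh))
    return chunks


doc = "A a a\n\nB b b\n\nC c c\n\nD d d"
for i, chunk in enumerate(chunk_by_paragraph(doc, max_tokens=6)):
    print(i, repr(chunk))
```

Because paragraph "B b b" appears in both chunks, an argument that starts in B and continues in C is visible to the model in a single call. A paragraph larger than the budget is kept whole here; a production splitter would need a fallback such as sentence-level splitting.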

The core entities

SummarizationRequest

request_id, user_id, input_type (text/pdf/url), raw_content, token_count, target_length (short/medium/detailed), target_language, model_tier (standard/premium), created_at

DocumentChunk

chunk_id, request_id, chunk_index, text, token_count, overlap_tokens, section_boundary (boolean)

SummaryResult

result_id, request_id, summary_text, model_used, total_input_tokens, total_output_tokens, cost_usd, faithfulness_score, latency_ms, chunks_processed

ModelConfig

model_id, provider, model_name, max_context_tokens, cost_per_1m_input, cost_per_1m_output, avg_latency_ms, is_active
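
As one possible rendering of these entities, the sketch below uses Python dataclasses. The field names come from the lists above; the types, the defaults, and the placement of `faithfulness_score` at the end of `SummaryResult` (so it can default to `None`) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Literal, Optional


@dataclass
class SummarizationRequest:
    request_id: str
    user_id: str
    input_type: Literal["text", "pdf", "url"]
    raw_content: str
    token_count: int
    target_length: Literal["short", "medium", "detailed"]
    target_language: str = "en"
    model_tier: Literal["standard", "premium"] = "standard"
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class DocumentChunk:
    chunk_id: str
    request_id: str
    chunk_index: int
    text: str
    token_count: int
    overlap_tokens: int = 0          # tokens repeated from the previous chunk
    section_boundary: bool = False   # True if the chunk starts at a section break


@dataclass
class SummaryResult:
    result_id: str
    request_id: str
    summary_text: str
    model_used: str
    total_input_tokens: int
    total_output_tokens: int
    cost_usd: float
    latency_ms: int
    chunks_processed: int = 1
    faithfulness_score: Optional[float] = None  # unset until scoring has run


@dataclass
class ModelConfig:
    model_id: str
    provider: str
    model_name: str
    max_context_tokens: int
    cost_per_1m_input: float
    cost_per_1m_output: float
    avg_latency_ms: int
    is_active: bool = True
```

Keeping `max_context_tokens` and the per-token costs on `ModelConfig` lets the routing and cost logic read its limits from data instead of hard-coding them per model.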

API design

POST /api/summarize - submit a document for summarization

Request: {
  "content": "full document text, base64-encoded PDF, or source URL",
  "input_type": "text" | "pdf" | "url",
  "target_length": "short" | "medium" | "detailed",
  "target_language": "en",
  "model_tier": "standard",
  "stream": true
}
Response (stream=false): {
  "request_id": "req_abc123",
  "summary": "The document discusses...",
  "model_used": "gpt-4o-mini",
  "token_count": { "input": 45200, "output": 512 },
  "cost_usd": 0.0072,
  "faithfulness_score": 0.82,
  "latency_ms": 3400
}
Response (stream=true): SSE stream
  data: {"type": "chunk", "text": "The document "}
  data: {"type": "chunk", "text": "discusses "}
  data: {"type": "done", "request_id": "req_abc123", "faithfulness_score": 0.82, "cost_usd": 0.0072}
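
The stream above can be produced by a small generator that serializes each event in Server-Sent Events framing, where every event is a `data:` line followed by a blank line. This is a minimal sketch: the function name `sse_events` and its parameters are hypothetical, and the token source is assumed to be any iterable of text fragments coming back from the model.

```python
import json
from typing import Iterable, Iterator


def sse_events(
    tokens: Iterable[str],
    request_id: str,
    faithfulness_score: float,
    cost_usd: float,
) -> Iterator[str]:
    """Yield SSE-framed strings: one 'chunk' event per fragment, then 'done'."""
    for text in tokens:
        payload = json.dumps({"type": "chunk", "text": text})
        # The blank line (second newline) marks the end of one SSE event.
        yield f"data: {payload}\n\n"

    done = json.dumps(
        {
            "type": "done",
            "request_id": request_id,
            "faithfulness_score": faithfulness_score,
            "cost_usd": cost_usd,
        }
    )
    yield f"data: {done}\n\n"


for event in sse_events(["The document ", "discusses "], "req_abc123", 0.82, 0.0072):
    print(event, end="")
```

Sending the score and cost only in the final event matches the example stream: both values are known only after generation finishes.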

GET /api/summarize/:request_id - retrieve a previously generated summary

Response: {
  "request_id": "req_abc123",
  "status": "completed" | "processing" | "failed",
  "summary": "...",
  "metadata": { "model_used": "gpt-4o-mini", "chunks_processed": 1, "cost_usd": 0.0072 }
}

GET /api/summarize/:request_id/chunks - inspect chunk-level details for debugging

Response: {
  "request_id": "req_abc123",
  "chunks": [
    { "chunk_index": 0, "token_count": 3800, "chunk_summary": "...", "section_boundary": true },
    { "chunk_index": 1, "token_count": 4100, "chunk_summary": "...", "section_boundary": false }
  ]
}

High-level design

The summarization service has two main paths determined by document length. The "short path" handles documents that fit within a single context window (under 128K tokens for GPT-4o-mini, less the room needed for the prompt and the generated summary). The "long path" handles everything else using map-reduce chunking with hierarchical merge.

Every request enters through an API gateway that authenticates, rate-limits, and routes to the summarization service. The service first detects the document length by running a fast tokenizer count. If it is under the context window limit, the document goes straight to the LLM. If it exceeds the limit, the chunking pipeline splits it, summarizes each chunk independently, then merges the chunk summaries into a final output.
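
The routing decision described above can be sketched in a few lines. The 128K limit comes from the text; the 4,000-token reserve for instructions and output, the characters-per-token heuristic, and the function names are assumptions for illustration. A real service would count tokens with the tokenizer that matches the target model.

```python
from typing import Literal

CONTEXT_WINDOW_TOKENS = 128_000  # from the text: GPT-4o-mini context window
RESERVED_TOKENS = 4_000          # assumption: headroom for prompt and summary

Path = Literal["single_pass", "map_reduce"]


def approx_token_count(text: str) -> int:
    # Assumption: roughly 4 characters per token for English text.
    return len(text) // 4 + 1


def choose_path(
    token_count: int,
    context_window: int = CONTEXT_WINDOW_TOKENS,
    reserved: int = RESERVED_TOKENS,
) -> Path:
    """Route to a single LLM call when the document fits, else to chunking."""
    usable = context_window - reserved
    return "single_pass" if token_count <= usable else "map_reduce"


document = "word " * 1_000
print(choose_path(approx_token_count(document)))
print(choose_path(200_000))
```

Running the count before anything else keeps the decision cheap: no model call is made until the service knows which path the document belongs on.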

I have seen candidates skip the length detection step and jump straight to "just call the API." The interviewer will immediately ask: "What happens when someone sends a 200-page PDF?" If you do not have a chunking story, you have not designed the system. Always start with the routing decision.

The pipeline comes alive when you see the step-by-step execution. Here is the full flow for a long document that requires chunking:
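
A minimal sketch of that flow, assuming the document has already been split into chunks: the map step summarizes each chunk, and the reduce step merges summaries in groups until one remains. The `summarize` callable is a placeholder for the LLM call, and `fan_in` (how many summaries are merged per call) is a hypothetical parameter. The map step runs sequentially here for clarity; since chunks are summarized independently, a real service could run those calls concurrently.

```python
from typing import Callable

Summarizer = Callable[[str], str]  # placeholder for the LLM call


def hierarchical_summarize(
    chunks: list[str], summarize: Summarizer, fan_in: int = 4
) -> str:
    """Map-reduce summarization with a hierarchical merge."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2 so each level shrinks")
    if not chunks:
        return ""

    # Map: summarize every chunk on its own.
    summaries = [summarize(chunk) for chunk in chunks]

    # Reduce: merge groups of summaries level by level until one is left.
    while len(summaries) > 1:
        groups = [
            summaries[i : i + fan_in] for i in range(0, len(summaries), fan_in)
        ]
        summaries = [
            summarize("\n\n".join(group)) if len(group) > 1 else group[0]
            for group in groups
        ]
    return summaries[0]


def first_sentence(text: str) -> str:
    # Stub summarizer for demonstration only: keeps the first sentence.
    return text.split(".")[0].strip() + "."


print(hierarchical_summarize(["Alpha. More.", "Beta. More."], first_sentence))
```

The merge prompt is where coherence is won or lost: each level should ask the model to write one summary from its inputs, not to concatenate them.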

