Design a text summarization API
Walk through designing a text summarization service that handles documents from 500 to 500K tokens, with chunking strategies, streaming output, and cost optimization under $0.01 per summary.
TL;DR
- The core architectural decision is document length detection and routing: documents under 128K tokens go through a single-pass summary, documents over 128K tokens go through chunked map-reduce with hierarchical merging.
- Use GPT-4o-mini ($0.15/1M input tokens) for 90%+ of requests. Reserve GPT-4o ($2.50/1M input) for high-fidelity summaries only when the caller explicitly requests premium quality.
- Semantic chunking (splitting at paragraph or section boundaries) produces 30-40% better coherence scores than fixed-size token splitting, because it preserves the author's logical structure.
- Stream partial results using Server-Sent Events (SSE). Users see the first summary tokens in under 2 seconds even for 100-page documents.
- The production lesson: most summarization failures are not model failures. They are chunking failures where a key argument spans two chunks and gets lost at the boundary.
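The chunking points above can be made concrete with a small sketch. It assumes paragraphs are separated by blank lines and uses a whitespace word count as a stand-in for a real tokenizer; `count_tokens`, `chunk_by_paragraph`, and `overlap_paragraphs` are hypothetical names, not part of any library. Carrying the last paragraph of one chunk into the next is one simple way to reduce the boundary losses described in the last bullet.

```python
from typing import List


def count_tokens(text: str) -> int:
    # Stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def chunk_by_paragraph(
    document: str, max_tokens: int, overlap_paragraphs: int = 1
) -> List[str]:
    """Greedily pack whole paragraphs into chunks, never splitting a paragraph."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for para in paragraphs:
        n = count_tokens(para)
        # Close the current chunk if this paragraph would overflow the budget.
        if current and current_tokens + n > max_tokens:
            chunks.append("\n\n".join(current))
            # Carry trailing paragraphs forward as overlap. Note that overlap
            # (or one oversized paragraph) can push a chunk past max_tokens.
            current = current[-overlap_paragraphs:] if overlap_paragraphs > 0 else []
            current_tokens = sum(count_tokens(p) for p in current)
        current.append(para)
        current_tokens += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A production version would likely also split on section headings and handle paragraphs that exceed the budget on their own.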
Requirements
Functional requirements
- Users can submit a document (plain text, PDF, or URL) and receive a text summary within a specified length (short, medium, or detailed).
- Users can stream the summary response in real time as it is generated, instead of waiting for the full result.
- The system handles documents from 500 tokens to 500K tokens without manual intervention from the user.
- Users can specify a target language for the summary, enabling cross-lingual summarization.
- The system provides a faithfulness score with each summary, indicating how accurately the summary reflects the source document.
Non-functional requirements
- P95 latency under 5 seconds for documents under 10K tokens (single-pass path).
- P95 latency under 30 seconds for documents up to 500K tokens (chunked path).
- Cost per summary under $0.01 for documents under 50K tokens using the default model tier.
- Throughput: 500 concurrent summarization requests with graceful degradation under load.
- Availability: 99.9% uptime with automatic failover between model providers.
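- Summary quality: ROUGE-L score above 0.35 on the CNN/DailyMail benchmark for short summaries. ROUGE-L measures overlap with reference summaries, so treat it as a proxy for quality rather than a direct measure of faithfulness to the source.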
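The $0.01 cost target can be sanity-checked with simple arithmetic. The sketch below uses the input prices quoted in the TL;DR; the output prices are assumptions added for illustration and should be checked against the provider's current price list. `estimate_cost_usd` is a hypothetical helper.

```python
# Input prices are the ones quoted in this article.
# Output prices are ASSUMED for illustration only.
PRICES_PER_1M = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one model call: tokens times the per-million-token price."""
    price = PRICES_PER_1M[model]
    return (
        input_tokens * price["input"] + output_tokens * price["output"]
    ) / 1_000_000


# A 50K-token document with a 500-token summary on each tier.
standard = estimate_cost_usd("gpt-4o-mini", 50_000, 500)
premium = estimate_cost_usd("gpt-4o", 50_000, 500)
```

Under these assumed prices the standard tier comes to roughly $0.0078, inside the budget, while the premium tier comes to roughly $0.13, which is why the premium model is reserved for explicit requests.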
The hardest engineering problem here: maintaining coherence when a 500K-token document is split into chunks. Each chunk is summarized independently, and the merge step must produce a summary that reads like it was written from the full document, not stitched from fragments.
The core entities
SummarizationRequest
request_id, user_id, input_type (text/pdf/url), raw_content, token_count, target_length (short/medium/detailed), target_language, model_tier (standard/premium), created_at
DocumentChunk
chunk_id, request_id, chunk_index, text, token_count, overlap_tokens, section_boundary (boolean)
SummaryResult
result_id, request_id, summary_text, model_used, total_input_tokens, total_output_tokens, cost_usd, faithfulness_score, latency_ms, chunks_processed
ModelConfig
model_id, provider, model_name, max_context_tokens, cost_per_1m_input, cost_per_1m_output, avg_latency_ms, is_active
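As a rough illustration, two of these entities could be modeled as Python dataclasses. The field names come from the lists above; the types and default values are assumptions, since the article does not specify them.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Literal


@dataclass
class SummarizationRequest:
    request_id: str
    user_id: str
    input_type: Literal["text", "pdf", "url"]
    raw_content: str
    token_count: int
    # Defaults below are assumed, not taken from the article.
    target_length: Literal["short", "medium", "detailed"] = "medium"
    target_language: str = "en"
    model_tier: Literal["standard", "premium"] = "standard"
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


@dataclass
class DocumentChunk:
    chunk_id: str
    request_id: str
    chunk_index: int
    text: str
    token_count: int
    overlap_tokens: int = 0          # tokens repeated from the previous chunk
    section_boundary: bool = False   # True if the chunk ends at a section break
```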
API design
POST /api/summarize - submit a document for summarization
Request: {
"content": "full document text, base64-encoded PDF, or source URL",
"input_type": "text" | "pdf" | "url",
"target_length": "short" | "medium" | "detailed",
"target_language": "en",
"model_tier": "standard",
"stream": true
}
Response (stream=false): {
"request_id": "req_abc123",
"summary": "The document discusses...",
"model_used": "gpt-4o-mini",
"token_count": { "input": 45200, "output": 512 },
"cost_usd": 0.0072,
"faithfulness_score": 0.82,
"latency_ms": 3400
}
Response (stream=true): SSE stream
data: {"type": "chunk", "text": "The document "}
data: {"type": "chunk", "text": "discusses "}
data: {"type": "done", "request_id": "req_abc123", "faithfulness_score": 0.82, "cost_usd": 0.0072}
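A minimal sketch of how a server might produce the stream above, assuming the summary pieces arrive from the model as an iterable of strings. `sse_events` is a hypothetical helper; the only protocol facts it relies on are that each SSE event is a `data:` line terminated by a blank line.

```python
import json
from typing import Iterable, Iterator


def sse_events(
    pieces: Iterable[str],
    request_id: str,
    faithfulness_score: float,
    cost_usd: float,
) -> Iterator[str]:
    """Yield SSE frames: one per text piece, then a final 'done' event."""
    for text in pieces:
        payload = {"type": "chunk", "text": text}
        # Each SSE event ends with a blank line, hence the double newline.
        yield "data: " + json.dumps(payload) + "\n\n"
    done = {
        "type": "done",
        "request_id": request_id,
        "faithfulness_score": faithfulness_score,
        "cost_usd": cost_usd,
    }
    yield "data: " + json.dumps(done) + "\n\n"
```

In a web framework, this generator would typically be returned as a streaming response with the `text/event-stream` content type. Sending the score and cost only in the final event keeps the first tokens from waiting on post-processing.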
GET /api/summarize/:request_id - retrieve a previously generated summary
Response: {
"request_id": "req_abc123",
"status": "completed" | "processing" | "failed",
"summary": "...",
"metadata": { "model_used": "gpt-4o-mini", "chunks_processed": 1, "cost_usd": 0.0072 }
}
GET /api/summarize/:request_id/chunks - inspect chunk-level details for debugging
Response: {
"request_id": "req_abc123",
"chunks": [
{ "chunk_index": 0, "token_count": 3800, "chunk_summary": "...", "section_boundary": true },
{ "chunk_index": 1, "token_count": 4100, "chunk_summary": "...", "section_boundary": false }
]
}
High-level design
The summarization service has two main paths determined by document length. The "short path" handles documents that fit within a single context window (under 128K tokens for GPT-4o-mini). In practice the cutoff sits somewhat below the full window, because the prompt and the generated summary share the same token budget. The "long path" handles everything else using map-reduce chunking with hierarchical merge.
Every request enters through an API gateway that authenticates, rate-limits, and routes to the summarization service. The service first detects the document length by running a fast tokenizer count. If it is under the context window limit, the document goes straight to the LLM. If it exceeds the limit, the chunking pipeline splits it, summarizes each chunk independently, then merges the chunk summaries into a final output.
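The routing decision and the map-reduce path can be sketched in a few lines. This is an illustration under stated assumptions, not the service's actual implementation: `summarize` stands for any function that calls the model, token counting is approximated by word count, the chunker is a fixed-size splitter chosen only for brevity, and `RESERVED_TOKENS` is an assumed headroom value.

```python
from typing import Callable, List

CONTEXT_LIMIT = 128_000   # context window quoted in the article for GPT-4o-mini
RESERVED_TOKENS = 4_000   # ASSUMED headroom for the prompt and the output


def count_tokens(text: str) -> int:
    # Stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def split_fixed(text: str, max_tokens: int) -> List[str]:
    # Fixed-size splitting, used here only to keep the sketch short.
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]


def summarize_document(
    text: str,
    summarize: Callable[[str], str],
    limit: int = CONTEXT_LIMIT - RESERVED_TOKENS,
    chunk_tokens: int = 4_000,
) -> str:
    # Short path: the whole document fits in one call.
    if count_tokens(text) <= limit:
        return summarize(text)

    # Long path, map step: summarize each chunk independently.
    merged = "\n".join(summarize(c) for c in split_fixed(text, chunk_tokens))

    # Reduce step: re-summarize groups of summaries until they fit in one call.
    while count_tokens(merged) > limit:
        shorter = "\n".join(
            summarize(g) for g in split_fixed(merged, chunk_tokens)
        )
        if count_tokens(shorter) >= count_tokens(merged):
            raise RuntimeError("summaries are not shrinking; aborting merge")
        merged = shorter

    return summarize(merged)
```

The map step is embarrassingly parallel, so a real service would likely issue those calls concurrently; the sequential loop here only keeps the control flow easy to read.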
I have seen candidates skip the length detection step and jump straight to "just call the API." The interviewer will immediately ask: "What happens when someone sends a 200-page PDF?" If you do not have a chunking story, you have not designed the system. Always start with the routing decision.
The pipeline comes alive when you see the step-by-step execution. Here is the full flow for a long document that requires chunking: