Design a real-time AI translation service
Walk through designing a low-latency translation service supporting 50 language pairs, streaming output, glossary enforcement, and 10K concurrent translation sessions with quality on par with DeepL.
TL;DR
- Use a tiered model architecture: fine-tuned NMT models (Marian/NLLB) at <100ms for high-traffic pairs like EN-ES and EN-ZH, NLLB-200 for mid-traffic pairs, and LLM fallback for the long tail. This strategy cuts cost from ~$25/1M chars (DeepL API) to ~$2-5/1M chars for 80% of traffic.
- Streaming translation uses chunk-based speculative output. Buffer input by sentence boundary detection (punctuation + 300ms pause heuristic), speculatively translate partial buffers, then replace with the refined output when the full sentence arrives. Users see translations appearing within 200ms of typing.
- Glossary enforcement through constrained decoding (logit biasing during beam search) achieves 98%+ term compliance, compared to 75-80% with naive prompt injection and ~60% with post-processing string replacement.
- Quality estimation without human references combines COMETkiwi (0.85+ correlation with human judgment), a glossary compliance checker, and back-translation semantic similarity. Translations scoring below the threshold route to a human post-editing queue automatically.
- The production lesson: language pair difficulty varies by orders of magnitude. EN-FR translation is nearly solved (BLEU 45+), while EN-JA or EN-ZH requires dealing with completely different word order, writing systems, and formality registers. Your model tier routing is the single most important architectural decision.
Requirements
Functional requirements
- Users submit text (up to 5,000 characters per request) and receive a translation in their selected target language, with support for at least 50 language pairs.
- The system detects the source language automatically when not specified by the user.
- Translation output streams to the client in real time as the user types, with partial translations appearing within 500ms.
- Customers can configure per-account glossaries (brand names, technical terms) that the translation engine must preserve exactly in the target language.
- The system provides a quality score for each translation, flagging low-confidence outputs for optional human review.
- The system supports document-level context, maintaining consistency in terminology and formality across paragraphs within a single session.
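The quality-score requirement above can be sketched as a small aggregation step. This is a minimal illustration, not the article's actual scoring: the weights and the review threshold are hypothetical, since the source does not state how `aggregate_score` is computed.

```python
from dataclasses import dataclass

# Hypothetical weights and threshold; the article does not specify them.
WEIGHTS = {"comet": 0.5, "glossary": 0.3, "back_translation": 0.2}
REVIEW_THRESHOLD = 0.75


@dataclass
class QualityEstimate:
    comet_score: float
    glossary_compliance: float
    back_translation_similarity: float

    @property
    def aggregate_score(self) -> float:
        # Weighted sum of the three reference-free signals.
        return (
            WEIGHTS["comet"] * self.comet_score
            + WEIGHTS["glossary"] * self.glossary_compliance
            + WEIGHTS["back_translation"] * self.back_translation_similarity
        )

    @property
    def flagged_for_review(self) -> bool:
        # Low aggregate scores go to the human post-editing queue.
        return self.aggregate_score < REVIEW_THRESHOLD
```

In practice the weights would likely be tuned against human ratings per language pair rather than fixed globally.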
Non-functional requirements
- P95 latency under 500ms for single-sentence translation (excluding network round-trip).
- P95 latency under 100ms for high-traffic language pairs (EN-ES, EN-FR, EN-DE, EN-ZH) using optimized NMT models.
- Support 10,000 concurrent translation sessions over WebSocket connections.
- Translation quality: COMET score within 2 points of DeepL on WMT23 benchmarks for top-10 language pairs.
- Cost per 1M characters under $5 for NMT-served traffic and under $20 for LLM-served traffic.
- Availability: 99.9% uptime (roughly 43 minutes of downtime per month). Translation is a real-time, user-facing feature, so extended outages are not acceptable.
The hardest engineering problem here is streaming translation coherence. Unlike LLM chat, where you stream tokens sequentially, translation must handle word reordering across languages (English SVO to Japanese SOV, for example). You cannot translate word by word because the target word order may be completely different. The system must buffer intelligently, produce speculative partial translations, and then gracefully replace them once the full sentence context is available.
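A minimal sketch of the buffering side of this problem, using the punctuation plus 300ms pause heuristic from the TL;DR. The class name, the job dictionary shape, and the assumption that clients send incremental (not cumulative) chunks are all illustrative choices, not details from the article.

```python
import re
from typing import Optional

# Sentence-final punctuation for a few scripts; a real detector would be
# language-aware (assumption: this small set is enough for a sketch).
SENTENCE_END = re.compile(r"[.!?。！？]\s*$")
PAUSE_MS = 300  # pause heuristic mentioned in the TL;DR


class SentenceBuffer:
    """Accumulates typed chunks and decides what to send for translation."""

    def __init__(self) -> None:
        self.text = ""
        self.last_input_ms: Optional[int] = None

    def push(self, chunk: str, now_ms: int) -> dict:
        """Append a chunk and return a translation job for the buffer."""
        self.text += chunk
        self.last_input_ms = now_ms
        if SENTENCE_END.search(self.text):
            return self._flush()
        # No boundary yet: translate the partial buffer speculatively.
        return {"text": self.text, "is_speculative": True}

    def tick(self, now_ms: int) -> Optional[dict]:
        """Call periodically; a pause of PAUSE_MS closes the sentence."""
        if not self.text or self.last_input_ms is None:
            return None
        if now_ms - self.last_input_ms >= PAUSE_MS:
            return self._flush()
        return None

    def _flush(self) -> dict:
        job = {"text": self.text.strip(), "is_speculative": False}
        self.text = ""
        return job
```

A punctuation-only check will misfire on abbreviations and decimals typed mid-sentence, so a production detector would probably need a tokenizer-aware boundary model.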
The core entities
TranslationRequest
request_id, session_id, source_text, source_lang (nullable, auto-detected), target_lang, glossary_id (nullable), context_window (previous sentences for consistency), stream (boolean), created_at
TranslationResponse
response_id, request_id, translated_text, model_used, model_tier (nmt/nllb/llm), latency_ms, quality_score, glossary_terms_enforced (count), is_speculative (boolean), created_at
LanguagePair
pair_id, source_lang, target_lang, model_tier (nmt_optimized/nllb_general/llm_fallback), model_id, avg_latency_ms, avg_comet_score, daily_request_volume
Glossary
glossary_id, account_id, source_lang, target_lang, entries (list of source_term/target_term pairs), entry_count, created_at, updated_at
QualityEstimate
estimate_id, response_id, comet_score, glossary_compliance_pct, fluency_score, back_translation_similarity, aggregate_score, flagged_for_review (boolean)
TranslationMemory
memory_id, account_id, source_hash, source_lang, target_lang, translated_text, model_used, quality_score, hit_count, created_at, ttl
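To make two of these entities concrete, here is a rough sketch of `TranslationRequest` as a dataclass and an in-memory `TranslationMemory` keyed by account, language pair, and `source_hash`. The whitespace normalization, the choice of SHA-256, and the dictionary storage are assumptions for illustration; the article does not describe how the hash or the store are implemented.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class TranslationRequest:
    request_id: str
    session_id: str
    source_text: str
    target_lang: str
    source_lang: Optional[str] = None  # None means auto-detect
    glossary_id: Optional[str] = None
    context_window: List[str] = field(default_factory=list)
    stream: bool = False


def source_hash(text: str) -> str:
    # Assumed normalization: collapse whitespace before hashing.
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class TranslationMemory:
    """In-memory stand-in for the TranslationMemory entity."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, str, str, str], dict] = {}

    def put(self, account_id: str, source_lang: str, target_lang: str,
            source_text: str, translated_text: str) -> None:
        key = (account_id, source_lang, target_lang, source_hash(source_text))
        self._entries[key] = {"translated_text": translated_text, "hit_count": 0}

    def get(self, account_id: str, source_lang: str, target_lang: str,
            source_text: str) -> Optional[dict]:
        key = (account_id, source_lang, target_lang, source_hash(source_text))
        entry = self._entries.get(key)
        if entry is not None:
            entry["hit_count"] += 1  # mirrors the hit_count field
        return entry
```

Scoping the key by `account_id` keeps one customer's glossary-influenced translations from leaking into another customer's cache hits.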
API design
POST /v1/translate - single request translation (REST)
Request: {
  "text": "Kubernetes orchestrates containerized applications across clusters.",
  "source_lang": "en",
  "target_lang": "ja",
  "glossary_id": "gloss_abc123",
  "context": ["Previous sentence for consistency."],
  "quality_check": true
}
Response: {
  "request_id": "tr_req_xyz789",
  "translated_text": "Kubernetesは、クラスター全体でコンテナ化されたアプリケーションをオーケストレーションします。",
  "source_lang_detected": "en",
  "model_used": "nmt-en-ja-v3",
  "model_tier": "nmt_optimized",
  "latency_ms": 82,
  "quality": {
    "comet_score": 0.91,
    "glossary_compliance": 1.0,
    "aggregate_score": 0.94,
    "flagged": false
  },
  "glossary_terms": [
    { "source": "Kubernetes", "target": "Kubernetes", "enforced": true }
  ]
}
The REST endpoint handles batch and single-shot requests. Teams integrating translation for document processing or email workflows use this. The context field passes up to 3 previous sentences for cross-sentence consistency.
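The limits stated for this endpoint (5,000 characters per request, up to 3 context sentences) can be checked before any model is invoked. The following is a sketch of such a validator; the function name and error messages are made up for illustration.

```python
from typing import List

MAX_TEXT_CHARS = 5000       # per-request limit from the functional requirements
MAX_CONTEXT_SENTENCES = 3   # context limit described for the REST endpoint


def validate_translate_request(payload: dict) -> List[str]:
    """Return a list of validation errors; an empty list means valid."""
    errors: List[str] = []

    text = payload.get("text")
    if not isinstance(text, str) or not text.strip():
        errors.append("text must be a non-empty string")
    elif len(text) > MAX_TEXT_CHARS:
        errors.append(f"text exceeds {MAX_TEXT_CHARS} characters")

    target_lang = payload.get("target_lang")
    if not isinstance(target_lang, str) or not target_lang:
        errors.append("target_lang is required")

    # source_lang may be omitted, in which case it is auto-detected.
    source_lang = payload.get("source_lang")
    if source_lang is not None and not isinstance(source_lang, str):
        errors.append("source_lang must be a string or omitted")

    context = payload.get("context", [])
    if not isinstance(context, list) or len(context) > MAX_CONTEXT_SENTENCES:
        errors.append(f"context accepts at most {MAX_CONTEXT_SENTENCES} sentences")

    return errors
```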
WebSocket /v1/translate/stream - real-time streaming translation
Client sends: {
  "type": "text_chunk",
  "session_id": "sess_abc",
  "text": "Kubernetes orchestrates",
  "source_lang": "en",
  "target_lang": "ja",
  "glossary_id": "gloss_abc123",
  "is_final": false
}
Server sends: {
  "type": "partial_translation",
  "session_id": "sess_abc",
  "translated_text": "Kubernetesはオーケストレーション...",
  "is_speculative": true,
  "confidence": 0.72
}
Server sends (after sentence completes): {
  "type": "final_translation",
  "session_id": "sess_abc",
  "translated_text": "Kubernetesは、コンテナ化されたアプリケーションをオーケストレーションします。",
  "is_speculative": false,
  "quality_score": 0.91
}
The WebSocket endpoint powers the real-time typing experience. Clients send text chunks as the user types. The server returns speculative partial translations (marked is_speculative: true) that get replaced by final translations once the full sentence context is available. This is the critical UX differentiator over batch translation APIs.
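On the client side, the replace-on-final behavior can be modeled as a small state holder that consumes the server messages shown above. This sketch assumes one in-flight sentence per session and joins committed sentences without a separator, which suits Japanese but would need adjusting for space-delimited languages; `StreamView` is a hypothetical name.

```python
from typing import List


class StreamView:
    """Tracks what the user should see for one streaming session."""

    def __init__(self) -> None:
        self.committed: List[str] = []  # sentences confirmed by final_translation
        self.pending = ""               # latest speculative text, may be replaced

    def apply(self, msg: dict) -> str:
        if msg["type"] == "partial_translation":
            # Speculative output overwrites any earlier speculation.
            self.pending = msg["translated_text"]
        elif msg["type"] == "final_translation":
            # The final sentence replaces the speculative text entirely.
            self.committed.append(msg["translated_text"])
            self.pending = ""
        return self.render()

    def render(self) -> str:
        return "".join(self.committed) + self.pending
```

A UI would typically style `pending` differently (for example, dimmed) so the replacement does not surprise the user.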
POST /v1/glossaries - create or update a customer glossary
Request: {
  "account_id": "acct_456",
  "source_lang": "en",
  "target_lang": "ja",
  "entries": [
    { "source": "Kubernetes", "target": "Kubernetes" },
    { "source": "microservice", "target": "マイクロサービス" },
    { "source": "pod", "target": "Pod" }
  ]
}
Response: {
  "glossary_id": "gloss_abc123",
  "entry_count": 3,
  "created_at": "2026-04-11T10:00:00Z"
}
Glossaries are scoped per account and per language pair. Enterprise customers maintain glossaries with 500-2,000 terms covering brand names, product features, and domain-specific vocabulary.
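A glossary compliance check over these entries might look like the sketch below. It only measures compliance after the fact; it is not the constrained-decoding enforcement mentioned in the TL;DR. The matching rules (case-insensitive whole-word match on the source, exact substring match on the target) are assumptions chosen for an English source and a target script without word boundaries.

```python
import re
from typing import Dict, List


def glossary_compliance(source: str, translation: str,
                        entries: List[Dict[str, str]]) -> float:
    """Fraction of applicable glossary terms whose target form appears."""
    applicable = [
        e for e in entries
        # Whole-word match so "pod" does not fire on "podcast".
        if re.search(r"\b" + re.escape(e["source"]) + r"\b", source, re.IGNORECASE)
    ]
    if not applicable:
        return 1.0  # nothing to enforce counts as fully compliant
    hits = sum(1 for e in applicable if e["target"] in translation)
    return hits / len(applicable)
```

For example, a translation that renders "pod" as "ポッド" instead of the required "Pod" would lower the score for any source sentence containing that term.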
GET /v1/quality/report - translation quality analytics
Response: {
  "period": "2026-04-01 to 2026-04-10",
  "total_translations": 2450000,
  "avg_comet_score": 0.88,
  "by_tier": [
    { "tier": "nmt_optimized", "volume_pct": 65, "avg_comet": 0.91, "avg_latency_ms": 72 },
    { "tier": "nllb_general", "volume_pct": 25, "avg_comet": 0.84, "avg_latency_ms": 145 },
    { "tier": "llm_fallback", "volume_pct": 10, "avg_comet": 0.87, "avg_latency_ms": 380 }
  ],
  "flagged_for_review": 3200,
  "glossary_compliance_avg": 0.97
}
This endpoint returns quality analytics for monitoring translation accuracy across tiers. Operations teams use it to decide when to promote a language pair from NLLB to a fine-tuned NMT model.
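Two small helpers illustrate how this report could feed decisions. The first computes a volume-weighted average from `by_tier`; with the sample figures it gives roughly 0.89 for COMET, close to the reported 0.88. The second picks promotion candidates from `LanguagePair`-style records; its thresholds are hypothetical, since the article does not state the promotion criteria.

```python
from typing import Dict, List


def volume_weighted(by_tier: List[Dict], key: str) -> float:
    """Average of `key` across tiers, weighted by each tier's volume_pct."""
    total = sum(t["volume_pct"] for t in by_tier)
    return sum(t["volume_pct"] * t[key] for t in by_tier) / total


def promotion_candidates(pairs: List[Dict],
                         min_daily_volume: int = 100_000,  # hypothetical
                         max_comet: float = 0.86) -> List[str]:  # hypothetical
    """NLLB-served pairs with high traffic and room for quality gains."""
    return [
        p["pair_id"]
        for p in pairs
        if p["model_tier"] == "nllb_general"
        and p["daily_request_volume"] >= min_daily_volume
        and p["avg_comet_score"] < max_comet
    ]
```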
High-level design
The translation service is built around a tiered model architecture. Not all language pairs are equal: EN-ES and EN-FR have decades of parallel corpora and highly optimized NMT models, while EN-MY (Burmese) or EN-AM (Amharic) have sparse training data and need a different approach entirely. Trying to serve all 50 language pairs through a single model is either expensive (GPT-4o for everything) or low quality (a single multilingual model optimized for nothing).
The architecture splits into three planes. The real-time plane handles WebSocket connections, input buffering, and streaming output. The translation plane manages model routing, inference, and glossary enforcement. The quality plane scores every translation asynchronously and routes low-confidence output to human review.
I have seen teams attempt to build a translation service as a thin wrapper around Google Translate or DeepL APIs. That works for prototypes, but production requirements (custom glossaries, streaming, quality control, cost management at scale) push you toward self-hosted models for high-traffic pairs within months.
The model router is the decision point. It maps each language pair to a tier based on available model quality, latency requirements, and cost. Here is the routing logic:
| Language Pair Category | Examples | Model Tier | Avg Latency | Cost/1M chars |
|---|---|---|---|---|
| High-traffic (top 10) | EN-ES, EN-FR, EN-DE, EN-ZH, EN-JA | Fine-tuned NMT (Marian/OPUS, ONNX) | 50-100ms | $2-3 |
| Mid-traffic (next 30) | EN-TH, EN-VI, EN-PL, EN-TR | NLLB-200 (3.3B params) | 100-200ms | $4-6 |
| Low-resource (long tail) | EN-AM, EN-MY, EN-LO | LLM with few-shot examples | 200-500ms | $15-20 |
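The routing table above can be expressed as a lookup with an LLM fallback for anything unlisted. This is a sketch only: it registers just the example pairs from the table, assumes each listed pair is served in both directions, and uses made-up model IDs (`nmt-en-ja`, `nllb-200-3.3b`, `llm-few-shot`).

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Route:
    tier: str
    model_id: str  # hypothetical identifiers, not from the article


ROUTES: Dict[Tuple[str, str], Route] = {}

# High-traffic examples from the table: one fine-tuned model per direction.
for lang in ("es", "fr", "de", "zh", "ja"):
    ROUTES[("en", lang)] = Route("nmt_optimized", f"nmt-en-{lang}")
    ROUTES[(lang, "en")] = Route("nmt_optimized", f"nmt-{lang}-en")

# Mid-traffic examples: a shared multilingual NLLB-200 model.
for lang in ("th", "vi", "pl", "tr"):
    ROUTES[("en", lang)] = Route("nllb_general", "nllb-200-3.3b")
    ROUTES[(lang, "en")] = Route("nllb_general", "nllb-200-3.3b")

# Everything else is treated as long tail.
LLM_FALLBACK = Route("llm_fallback", "llm-few-shot")


def route(source_lang: str, target_lang: str) -> Route:
    """Pick a model tier for a language pair, defaulting to the LLM tier."""
    key = (source_lang.lower(), target_lang.lower())
    return ROUTES.get(key, LLM_FALLBACK)
```

Keeping the table as data rather than code makes promoting a pair between tiers a configuration change instead of a deploy.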