Design a sentiment analysis pipeline
Walk through designing a real-time sentiment analysis system that classifies 50K posts per minute across multiple languages using tiered models, with batch reprocessing and drift detection.
TL;DR
- The core architectural decision is tiered classification: a fine-tuned DistilBERT handles ~90% of posts at 5ms per inference, while ambiguous cases (low confidence scores) escalate to GPT-4o-mini at ~200ms per inference. This drops daily cost from $75K to under $500.
- Real-time path uses Kafka consumers with GPU-backed DistilBERT workers processing 50K posts/min. Batch path runs nightly with GPT-4o-mini to audit the fast classifier's accuracy and generate retraining labels.
- Drift detection is not optional. Language evolves, new slang appears, and your model's accuracy will silently degrade. Monitor the prediction confidence distribution daily and trigger retraining when the low-confidence bucket grows by more than 15%.
- Aspect-based sentiment (product great but shipping terrible) requires a separate extraction step before classification. Do not try to shove aspect detection into the same model call.
- The production lesson: most sentiment analysis failures are not model failures. They are label noise failures where your training data has 20% mislabeled examples because human annotators disagreed on sarcasm and neutral-vs-negative boundaries.
Requirements
Functional requirements
- The system classifies incoming text (social media posts, reviews, support tickets) as positive, negative, or neutral with a confidence score between 0 and 1.
- The system supports aspect-based sentiment extraction, identifying sentiment per aspect (e.g., "product quality: positive, shipping speed: negative") when requested.
- The system processes posts in real time via a streaming pipeline and also accepts on-demand classification via a REST API.
- Users can query aggregated sentiment trends over time windows (hourly, daily, weekly) filtered by source, language, or topic.
- The system supports at least 10 languages, with automatic language detection before classification.
- The system flags posts with confidence below a configurable threshold for human review or LLM re-classification.
Non-functional requirements
- Throughput: 50,000 posts per minute sustained (72M posts/day).
- P95 latency under 50ms for the real-time classification path (fast classifier).
- P95 latency under 500ms for the on-demand API path (which may use the LLM fallback).
- Classification accuracy: F1 score above 0.88 on the held-out test set, measured weekly.
- Cost: under $500/day for the real-time pipeline at full throughput.
- Availability: 99.9% uptime with graceful degradation (queue backpressure, not dropped posts).
The hardest engineering problem here: keeping accuracy stable over time. Language drifts, new slang appears, sarcasm patterns evolve, and your fine-tuned model's confidence distribution shifts silently. Without continuous drift detection and a retraining pipeline, accuracy degrades by 2-5% per quarter.
The core entities
ClassificationRequest
request_id,source(twitter/reddit/api/support),raw_text,detected_language,char_count,received_at,priority(realtime/batch)
ClassificationResult
result_id,request_id,sentiment(positive/negative/neutral),confidence,model_used(distilbert/gpt-4o-mini),aspects[](optional),latency_ms,classified_at
AspectSentiment
aspect_id,result_id,aspect_name(e.g., "shipping", "product quality"),sentiment,confidence,text_span
ModelVersion
version_id,model_type,training_dataset_size,f1_score,deployed_at,is_active,drift_score
DriftSnapshot
snapshot_id,timestamp,confidence_histogram,low_confidence_pct,accuracy_estimate,retraining_triggered(boolean)
API design
POST /api/classify - classify a single text on demand
Request: {
"text": "This product is amazing but shipping took forever",
"language": "auto",
"aspects": true,
"priority": "standard"
}
Response: {
"request_id": "req_s3nt1",
"sentiment": "mixed",
"confidence": 0.82,
"model_used": "distilbert-v3",
"aspects": [
{ "aspect": "product quality", "sentiment": "positive", "confidence": 0.95 },
{ "aspect": "shipping", "sentiment": "negative", "confidence": 0.91 }
],
"language_detected": "en",
"latency_ms": 38
}
On-demand classification for integrations that need a synchronous response. Routes through the same tiered model pipeline as the streaming path.
POST /api/classify/batch - submit a batch of texts for classification
Request: {
"texts": [
{ "id": "ext_001", "text": "Love this app!", "language": "auto" },
{ "id": "ext_002", "text": "Terrible experience", "language": "auto" }
],
"webhook_url": "https://client.example.com/sentiment-callback"
}
Response: {
"batch_id": "batch_abc123",
"count": 2,
"status": "queued",
"estimated_completion_ms": 500
}
For clients that have a backlog of texts to classify. Results are delivered via webhook to avoid long-polling.
GET /api/trends - query aggregated sentiment over time
Request: GET /api/trends?source=twitter&window=hourly&start=2026-04-01&end=2026-04-10
Response: {
"source": "twitter",
"window": "hourly",
"data": [
{ "timestamp": "2026-04-01T00:00:00Z", "positive": 4521, "negative": 1203, "neutral": 8876, "avg_confidence": 0.87 },
{ "timestamp": "2026-04-01T01:00:00Z", "positive": 4102, "negative": 1445, "neutral": 9012, "avg_confidence": 0.85 }
]
}
Powers the dashboard. Pre-aggregated in ClickHouse for sub-second queries across billions of classified posts.
GET /api/model/health - model performance and drift metrics
Response: {
"active_model": "distilbert-v3",
"f1_score": 0.89,
"low_confidence_pct": 8.2,
"drift_score": 0.03,
"last_retrained": "2026-03-28",
"retraining_triggered": false
}
Exposes model health for the ops dashboard and alerting. Drift score above 0.10 triggers a PagerDuty alert.
High-level design
The system has two parallel paths: a real-time streaming pipeline and a batch reprocessing pipeline. Both feed into the same results store and aggregation layer.
The real-time path is the workhorse. Posts arrive from multiple sources (Twitter firehose, Reddit API, support ticket webhooks) into a Kafka topic. A pool of GPU-backed consumer workers pulls posts, runs language detection, preprocesses text, and classifies using a fine-tuned DistilBERT model. Posts with confidence below 0.75 escalate to GPT-4o-mini for a second opinion. All results land in PostgreSQL (for individual lookups) and ClickHouse (for aggregated trend queries).
The batch path runs nightly. It re-classifies a random 5% sample of the day's posts using GPT-4o-mini as a quality audit. If the agreement rate between DistilBERT and GPT-4o-mini drops below 90%, the system flags a drift alert and queues a retraining job. I have seen teams skip the batch audit path and regret it three months later when accuracy silently dropped from 0.91 to 0.78.
For your interview: draw two paths (real-time and batch) from the start. Interviewers love seeing that you think about data quality, not just throughput.
Here is the real-time classification pipeline animated step by step. Notice how the confidence check acts as a router, sending 90% of posts down the fast path and only 10% to the expensive LLM.
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.