Design a sentiment analysis pipeline

TL;DR

The core architectural decision is tiered classification: a fine-tuned DistilBERT handles ~90% of posts at 5ms per inference, while ambiguous cases (low confidence scores) escalate to GPT-4o-mini at ~200ms per inference. This drops daily cost from $75K to under $500.
Real-time path uses Kafka consumers with GPU-backed DistilBERT workers processing 50K posts/min. Batch path runs nightly with GPT-4o-mini to audit the fast classifier's accuracy and generate retraining labels.
Drift detection is not optional. Language evolves, new slang appears, and your model's accuracy will silently degrade. Monitor the prediction confidence distribution daily and trigger retraining when the low-confidence bucket grows by more than 15%.
Aspect-based sentiment (product great but shipping terrible) requires a separate extraction step before classification. Do not try to shove aspect detection into the same model call.
The production lesson: most sentiment analysis failures are not model failures. They are label noise failures where your training data has 20% mislabeled examples because human annotators disagreed on sarcasm and neutral-vs-negative boundaries.

Requirements

Functional requirements

The system classifies incoming text (social media posts, reviews, support tickets) as positive, negative, or neutral with a confidence score between 0 and 1.
The system supports aspect-based sentiment extraction, identifying sentiment per aspect (e.g., "product quality: positive, shipping speed: negative") when requested.
The system processes posts in real time via a streaming pipeline and also accepts on-demand classification via a REST API.
Users can query aggregated sentiment trends over time windows (hourly, daily, weekly) filtered by source, language, or topic.
The system supports at least 10 languages, with automatic language detection before classification.
The system flags posts with confidence below a configurable threshold for human review or LLM re-classification.

Non-functional requirements

Throughput: 50,000 posts per minute sustained (72M posts/day).
P95 latency under 50ms for the real-time classification path (fast classifier).
P95 latency under 500ms for the on-demand API path (which may use the LLM fallback).
Classification accuracy: F1 score above 0.88 on the held-out test set, measured weekly.
Cost: under $500/day for the real-time pipeline at full throughput.
Availability: 99.9% uptime with graceful degradation (queue backpressure, not dropped posts).

The hardest engineering problem here: keeping accuracy stable over time. Language drifts, new slang appears, sarcasm patterns evolve, and your fine-tuned model's confidence distribution shifts silently. Without continuous drift detection and a retraining pipeline, accuracy degrades by 2-5% per quarter.

The core entities

ClassificationRequest

request_id, source (twitter/reddit/api/support), raw_text, detected_language, char_count, received_at, priority (realtime/batch)

ClassificationResult

result_id, request_id, sentiment (positive/negative/neutral), confidence, model_used (distilbert/gpt-4o-mini), aspects[] (optional), latency_ms, classified_at

AspectSentiment

aspect_id, result_id, aspect_name (e.g., "shipping", "product quality"), sentiment, confidence, text_span

ModelVersion

version_id, model_type, training_dataset_size, f1_score, deployed_at, is_active, drift_score

DriftSnapshot

snapshot_id, timestamp, confidence_histogram, low_confidence_pct, accuracy_estimate, retraining_triggered (boolean)

API design

POST /api/classify - classify a single text on demand

Request: {
  "text": "This product is amazing but shipping took forever",
  "language": "auto",
  "aspects": true,
  "priority": "standard"
}
Response: {
  "request_id": "req_s3nt1",
  "sentiment": "mixed",
  "confidence": 0.82,
  "model_used": "distilbert-v3",
  "aspects": [
    { "aspect": "product quality", "sentiment": "positive", "confidence": 0.95 },
    { "aspect": "shipping", "sentiment": "negative", "confidence": 0.91 }
  ],
  "language_detected": "en",
  "latency_ms": 38
}

On-demand classification for integrations that need a synchronous response. Routes through the same tiered model pipeline as the streaming path.

POST /api/classify/batch - submit a batch of texts for classification

Request: {
  "texts": [
    { "id": "ext_001", "text": "Love this app!", "language": "auto" },
    { "id": "ext_002", "text": "Terrible experience", "language": "auto" }
  ],
  "webhook_url": "https://client.example.com/sentiment-callback"
}
Response: {
  "batch_id": "batch_abc123",
  "count": 2,
  "status": "queued",
  "estimated_completion_ms": 500
}

For clients that have a backlog of texts to classify. Results are delivered via webhook to avoid long-polling.

GET /api/trends - query aggregated sentiment over time

Request: GET /api/trends?source=twitter&window=hourly&start=2026-04-01&end=2026-04-10
Response: {
  "source": "twitter",
  "window": "hourly",
  "data": [
    { "timestamp": "2026-04-01T00:00:00Z", "positive": 4521, "negative": 1203, "neutral": 8876, "avg_confidence": 0.87 },
    { "timestamp": "2026-04-01T01:00:00Z", "positive": 4102, "negative": 1445, "neutral": 9012, "avg_confidence": 0.85 }
  ]
}

Powers the dashboard. Pre-aggregated in ClickHouse for sub-second queries across billions of classified posts.

GET /api/model/health - model performance and drift metrics

Response: {
  "active_model": "distilbert-v3",
  "f1_score": 0.89,
  "low_confidence_pct": 8.2,
  "drift_score": 0.03,
  "last_retrained": "2026-03-28",
  "retraining_triggered": false
}

Exposes model health for the ops dashboard and alerting. Drift score above 0.10 triggers a PagerDuty alert.

High-level design

The system has two parallel paths: a real-time streaming pipeline and a batch reprocessing pipeline. Both feed into the same results store and aggregation layer.

The real-time path is the workhorse. Posts arrive from multiple sources (Twitter firehose, Reddit API, support ticket webhooks) into a Kafka topic. A pool of GPU-backed consumer workers pulls posts, runs language detection, preprocesses text, and classifies using a fine-tuned DistilBERT model. Posts with confidence below 0.75 escalate to GPT-4o-mini for a second opinion. All results land in PostgreSQL (for individual lookups) and ClickHouse (for aggregated trend queries).

The batch path runs nightly. It re-classifies a random 5% sample of the day's posts using GPT-4o-mini as a quality audit. If the agreement rate between DistilBERT and GPT-4o-mini drops below 90%, the system flags a drift alert and queues a retraining job. I have seen teams skip the batch audit path and regret it three months later when accuracy silently dropped from 0.91 to 0.78.

For your interview: draw two paths (real-time and batch) from the start. Interviewers love seeing that you think about data quality, not just throughput.

Here is the real-time classification pipeline animated step by step. Notice how the confidence check acts as a router, sending 90% of posts down the fast path and only 10% to the expensive LLM.

TL;DR

The core architectural decision is tiered classification: a fine-tuned DistilBERT handles ~90% of posts at 5ms per inference, while ambiguous cases (low confidence scores) escalate to GPT-4o-mini at ~200ms per inference. This drops daily cost from $75K to under $500.
Real-time path uses Kafka consumers with GPU-backed DistilBERT workers processing 50K posts/min. Batch path runs nightly with GPT-4o-mini to audit the fast classifier's accuracy and generate retraining labels.
Drift detection is not optional. Language evolves, new slang appears, and your model's accuracy will silently degrade. Monitor the prediction confidence distribution daily and trigger retraining when the low-confidence bucket grows by more than 15%.
Aspect-based sentiment (product great but shipping terrible) requires a separate extraction step before classification. Do not try to shove aspect detection into the same model call.
The production lesson: most sentiment analysis failures are not model failures. They are label noise failures where your training data has 20% mislabeled examples because human annotators disagreed on sarcasm and neutral-vs-negative boundaries.

Requirements

Functional requirements

The system classifies incoming text (social media posts, reviews, support tickets) as positive, negative, or neutral with a confidence score between 0 and 1.
The system supports aspect-based sentiment extraction, identifying sentiment per aspect (e.g., "product quality: positive, shipping speed: negative") when requested.
The system processes posts in real time via a streaming pipeline and also accepts on-demand classification via a REST API.
Users can query aggregated sentiment trends over time windows (hourly, daily, weekly) filtered by source, language, or topic.
The system supports at least 10 languages, with automatic language detection before classification.
The system flags posts with confidence below a configurable threshold for human review or LLM re-classification.

Non-functional requirements

Throughput: 50,000 posts per minute sustained (72M posts/day).
P95 latency under 50ms for the real-time classification path (fast classifier).
P95 latency under 500ms for the on-demand API path (which may use the LLM fallback).
Classification accuracy: F1 score above 0.88 on the held-out test set, measured weekly.
Cost: under $500/day for the real-time pipeline at full throughput.
Availability: 99.9% uptime with graceful degradation (queue backpressure, not dropped posts).

The hardest engineering problem here: keeping accuracy stable over time. Language drifts, new slang appears, sarcasm patterns evolve, and your fine-tuned model's confidence distribution shifts silently. Without continuous drift detection and a retraining pipeline, accuracy degrades by 2-5% per quarter.

The core entities

ClassificationRequest

request_id, source (twitter/reddit/api/support), raw_text, detected_language, char_count, received_at, priority (realtime/batch)

ClassificationResult

result_id, request_id, sentiment (positive/negative/neutral), confidence, model_used (distilbert/gpt-4o-mini), aspects[] (optional), latency_ms, classified_at

AspectSentiment

aspect_id, result_id, aspect_name (e.g., "shipping", "product quality"), sentiment, confidence, text_span

ModelVersion

version_id, model_type, training_dataset_size, f1_score, deployed_at, is_active, drift_score

DriftSnapshot

snapshot_id, timestamp, confidence_histogram, low_confidence_pct, accuracy_estimate, retraining_triggered (boolean)

API design

POST /api/classify - classify a single text on demand

Request: {
  "text": "This product is amazing but shipping took forever",
  "language": "auto",
  "aspects": true,
  "priority": "standard"
}
Response: {
  "request_id": "req_s3nt1",
  "sentiment": "mixed",
  "confidence": 0.82,
  "model_used": "distilbert-v3",
  "aspects": [
    { "aspect": "product quality", "sentiment": "positive", "confidence": 0.95 },
    { "aspect": "shipping", "sentiment": "negative", "confidence": 0.91 }
  ],
  "language_detected": "en",
  "latency_ms": 38
}

On-demand classification for integrations that need a synchronous response. Routes through the same tiered model pipeline as the streaming path.

POST /api/classify/batch - submit a batch of texts for classification

Request: {
  "texts": [
    { "id": "ext_001", "text": "Love this app!", "language": "auto" },
    { "id": "ext_002", "text": "Terrible experience", "language": "auto" }
  ],
  "webhook_url": "https://client.example.com/sentiment-callback"
}
Response: {
  "batch_id": "batch_abc123",
  "count": 2,
  "status": "queued",
  "estimated_completion_ms": 500
}

For clients that have a backlog of texts to classify. Results are delivered via webhook to avoid long-polling.

GET /api/trends - query aggregated sentiment over time

Request: GET /api/trends?source=twitter&window=hourly&start=2026-04-01&end=2026-04-10
Response: {
  "source": "twitter",
  "window": "hourly",
  "data": [
    { "timestamp": "2026-04-01T00:00:00Z", "positive": 4521, "negative": 1203, "neutral": 8876, "avg_confidence": 0.87 },
    { "timestamp": "2026-04-01T01:00:00Z", "positive": 4102, "negative": 1445, "neutral": 9012, "avg_confidence": 0.85 }
  ]
}

Powers the dashboard. Pre-aggregated in ClickHouse for sub-second queries across billions of classified posts.

GET /api/model/health - model performance and drift metrics

Response: {
  "active_model": "distilbert-v3",
  "f1_score": 0.89,
  "low_confidence_pct": 8.2,
  "drift_score": 0.03,
  "last_retrained": "2026-03-28",
  "retraining_triggered": false
}

Exposes model health for the ops dashboard and alerting. Drift score above 0.10 triggers a PagerDuty alert.

High-level design

The system has two parallel paths: a real-time streaming pipeline and a batch reprocessing pipeline. Both feed into the same results store and aggregation layer.

For your interview: draw two paths (real-time and batch) from the start. Interviewers love seeing that you think about data quality, not just throughput.

Here is the real-time classification pipeline animated step by step. Notice how the confidence check acts as a router, sending 90% of posts down the fast path and only 10% to the expensive LLM.

Design a sentiment analysis pipeline

TL;DR

Requirements

Functional requirements

Non-functional requirements

The core entities

API design

High-level design

Continue Reading with Premium

Comments

Design a sentiment analysis pipeline

TL;DR

Requirements

Functional requirements

Non-functional requirements

The core entities

API design

High-level design

Continue Reading with Premium

Comments