Design AI content moderation

TL;DR

Four tiers handle content at different costs and latencies: blocklist (under 1ms), ML classifier (under 10ms), LLM classifier (under 100ms), and human review (24-48 hours).
Each tier escalates to the next only when uncertain. The LLM classifier touches only 5-10% of total volume, keeping cost manageable.
At this scale, false positives can do more damage than false negatives. At a 0.5% false positive rate, 10M posts per day means up to 50,000 wrongful removals of legitimate content every day.
Context is critical for LLM moderation. A phrase that is harmful in one community is normal discussion in another. Send category and community metadata alongside the content.
Human reviewer decisions are the ground truth. Every decision feeds back into classifier training.

Requirements

Functional requirements

All content submissions (text and images) are screened before appearing on the platform.
Clearly harmful content (known hate terms, verified spam) is rejected immediately with a policy violation explanation.
Borderline content is routed to human review within 24 hours, and a temporary hold is applied until review is complete.
Users whose content is removed can submit an appeal with additional context.
The moderation decision and the tier that made it are logged for every piece of content for auditability.

Non-functional requirements

Volume: 10M posts per day, peak 500 posts per second during live events.
End-to-end latency for automated decision: under 100ms for 95% of content.
False positive rate (FPR): under 0.1% for clearly non-harmful content.
Human review queue: target turnaround of 24 hours; hard SLA of 48 hours.
System availability: 99.95% (no moderation means content posts without review, which is unacceptable).
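
The volume and error-rate targets above translate into a few back-of-envelope numbers. The sketch below is only arithmetic on the figures already stated; as a simplifying assumption it treats every post as legitimate, so the wrongful-removal counts are upper bounds.

```python
SECONDS_PER_DAY = 24 * 60 * 60

POSTS_PER_DAY = 10_000_000      # from the volume requirement
PEAK_POSTS_PER_SECOND = 500     # from the volume requirement


def average_posts_per_second(posts_per_day: int) -> float:
    """Average submission rate implied by a daily volume."""
    return posts_per_day / SECONDS_PER_DAY


def wrongful_removals_per_day(posts_per_day: int, fp_rate: float) -> int:
    """Upper bound on wrongful removals, assuming every post is legitimate."""
    return round(posts_per_day * fp_rate)


avg_rate = average_posts_per_second(POSTS_PER_DAY)
peak_to_average = PEAK_POSTS_PER_SECOND / avg_rate

print(f"average rate: {avg_rate:.1f} posts/s")
print(f"peak is {peak_to_average:.1f}x the average")
print("removals at 0.1% FPR:", wrongful_removals_per_day(POSTS_PER_DAY, 0.001))
print("removals at 0.5% FPR:", wrongful_removals_per_day(POSTS_PER_DAY, 0.005))
```

The average works out to roughly 116 posts per second, so the 500 posts per second peak is a bit over four times the steady-state load. Meeting the 0.1% target still allows up to 10,000 wrongful removals per day.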

The core entities

ContentItem

content_id, creator_id, text, image_url, community_id, category, submitted_at, status (pending/approved/rejected/under_review)

ModerationDecision

decision_id, content_id, tier_reached (1/2/3/4), verdict (approved/rejected/escalated), confidence, policy_category, decided_at, model_version

HumanReviewTask

task_id, content_id, assigned_to, status, decision, reasoning, decided_at, training_label

Appeal

appeal_id, content_id, creator_id, creator_context, assigned_to, status, final_verdict, created_at
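
As a rough illustration, two of these entities could be modeled as Python dataclasses. The field names come from the lists above; the field types, the enum classes, and the validation rules are assumptions made for this sketch. HumanReviewTask and Appeal would follow the same pattern.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class ContentStatus(str, Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"
    UNDER_REVIEW = "under_review"


class Verdict(str, Enum):
    APPROVED = "approved"
    REJECTED = "rejected"
    ESCALATED = "escalated"


@dataclass
class ContentItem:
    content_id: str
    creator_id: str
    text: str
    image_url: Optional[str]  # assumed optional: text-only posts have no image
    community_id: str
    category: str
    submitted_at: datetime
    status: ContentStatus = ContentStatus.PENDING


@dataclass
class ModerationDecision:
    decision_id: str
    content_id: str
    tier_reached: int                # 1 to 4
    verdict: Verdict
    confidence: float                # assumed to be in [0, 1]
    policy_category: Optional[str]   # None when no policy was violated
    decided_at: datetime
    model_version: Optional[str]     # assumed None for human (Tier 4) decisions

    def __post_init__(self) -> None:
        # Reject records that cannot correspond to a real tier or probability.
        if self.tier_reached not in (1, 2, 3, 4):
            raise ValueError(f"tier_reached must be 1-4, got {self.tier_reached}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence must be in [0, 1], got {self.confidence}")
```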

API design

POST /api/moderate (synchronous, real-time submission)

Request:  { "content_id": "c_abc", "text": "...", "image_url": "...", "community_id": "tech-news", "creator_id": "u_123" }
Response: { "verdict": "approved" | "rejected" | "under_review", "tier_reached": 2, "policy_category": null, "confidence": 0.97 }

POST /api/moderate/batch (async batch for re-moderation of existing content)

Request:  { "content_ids": ["c_1", "c_2", ...], "reason": "policy_update" }
Response: { "batch_id": "batch_xyz", "queued": 5000, "estimated_completion_s": 300 }

POST /api/appeals (creator submits an appeal)

Request:  { "content_id": "c_abc", "creator_context": "This is a medical discussion, not glorifying harm" }
Response: { "appeal_id": "appeal_99", "status": "pending", "expected_resolution_hours": 24 }

GET /api/moderation/metrics

Response: { "fpr_24h": 0.0008, "fnr_24h": 0.0041, "human_queue_depth": 1240, "avg_review_time_h": 6.2 }
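
One plausible way to derive fpr_24h and fnr_24h is to compare automated verdicts against human labels on an audited sample, since human decisions are treated as ground truth. The sketch below assumes such a sample exists; LabeledDecision and error_rates are hypothetical names, not part of the API above.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class LabeledDecision:
    automated_rejected: bool  # what the automated tiers decided
    truly_harmful: bool       # what a human reviewer decided (ground truth)


def error_rates(sample: Iterable[LabeledDecision]) -> Tuple[float, float]:
    """Return (false positive rate, false negative rate) for a labeled sample.

    FPR = FP / (FP + TN): share of legitimate content wrongly rejected.
    FNR = FN / (FN + TP): share of harmful content wrongly approved.
    """
    fp = tn = fn = tp = 0
    for d in sample:
        if d.truly_harmful:
            if d.automated_rejected:
                tp += 1
            else:
                fn += 1
        else:
            if d.automated_rejected:
                fp += 1
            else:
                tn += 1
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr
```

Note that both rates are conditioned on the true label, so the FPR denominator is legitimate content only, not total volume.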

The four tiers operate as a chain where each level handles what it can and escalates the rest. Tier 1 (blocklist) is deterministic and handles the easiest cases in under 1ms. Tier 2 (ML classifier) handles the bulk of ambiguous cases in under 10ms. Tier 3 (LLM) handles genuinely difficult cases in under 100ms. Tier 4 (human) handles the hardest cases and all appeals asynchronously.
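
The chain described above can be sketched as a loop over tiers, where each tier either returns a verdict or signals uncertainty and passes the content on. This is a minimal sketch, not the actual pipeline: the blocklist contents, the thresholds, and the scoring functions are all placeholders, and any real ML or LLM classifier would replace the toy scorers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence


@dataclass(frozen=True)
class TierResult:
    verdict: Optional[str]  # "approved", "rejected", or None when the tier is unsure
    confidence: float


Tier = Callable[[str], TierResult]

# Hypothetical blocklist; a real one would be loaded from policy data.
BLOCKED_TERMS = frozenset({"spamword"})


def blocklist_tier(text: str) -> TierResult:
    """Tier 1: deterministic match. It can reject but never approve."""
    if any(token in BLOCKED_TERMS for token in text.lower().split()):
        return TierResult("rejected", 1.0)
    return TierResult(None, 0.0)


def threshold_tier(score_fn: Callable[[str], float],
                   approve_below: float, reject_above: float) -> Tier:
    """Wrap a harm scorer (0 = benign, 1 = harmful) as a tier with an unsure band."""
    def tier(text: str) -> TierResult:
        harm = score_fn(text)
        if harm >= reject_above:
            return TierResult("rejected", harm)
        if harm <= approve_below:
            return TierResult("approved", 1.0 - harm)
        return TierResult(None, max(harm, 1.0 - harm))  # unsure: escalate
    return tier


def moderate(text: str, tiers: Sequence[Tier]) -> Dict[str, object]:
    """Run tiers in order; the first confident tier decides, else human review."""
    confidence = 0.0
    for tier_number, tier in enumerate(tiers, start=1):
        result = tier(text)
        confidence = result.confidence
        if result.verdict is not None:
            return {"verdict": result.verdict, "tier_reached": tier_number,
                    "confidence": confidence}
    # No automated tier was confident: hold the content for Tier 4 (human review).
    return {"verdict": "under_review", "tier_reached": len(tiers) + 1,
            "confidence": confidence}


# Toy scorers standing in for the ML (Tier 2) and LLM (Tier 3) classifiers.
tiers = [
    blocklist_tier,
    threshold_tier(lambda text: 0.5, approve_below=0.05, reject_above=0.95),
    threshold_tier(lambda text: 0.02, approve_below=0.20, reject_above=0.80),
]
print(moderate("buy spamword now", tiers))
print(moderate("hello world", tiers))
```

With these toy scorers the first call is rejected at tier 1 and the second is approved at tier 3, because the stand-in Tier 2 scorer always lands in its unsure band.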

The key insight is that each tier handles a smaller share of volume than the one before it. Tier 1 screens 100% of posts but resolves only the clear-cut blocklist matches, roughly 20%. Tier 2 sees the remaining 80% or so (the posts not caught by the blocklist). Tier 3 sees only the borderline subset from Tier 2, around 10% of total volume. Tier 4 sees the cases where Tier 3 confidence is low, around 1-2%. This means the expensive LLM call runs on a small fraction of total volume.
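
Applied to the 10M posts per day target, these approximate shares give the following daily volumes per tier. The shares are the rough figures from the paragraph above; the 1.5% used for Tier 4 is an assumed midpoint of the 1-2% range.

```python
POSTS_PER_DAY = 10_000_000

# Approximate share of total posts that reaches each tier.
TIER_SHARE = {
    "tier1_blocklist": 1.00,
    "tier2_ml": 0.80,
    "tier3_llm": 0.10,
    "tier4_human": 0.015,  # assumed midpoint of the 1-2% range
}


def daily_volume(posts_per_day: int, shares: dict) -> dict:
    """Number of posts per day that each tier has to process."""
    return {name: round(posts_per_day * share) for name, share in shares.items()}


volumes = daily_volume(POSTS_PER_DAY, TIER_SHARE)
for name, count in volumes.items():
    print(f"{name:16s} {count:>12,d} posts/day")
```

Even at a 10% share, Tier 3 still means about one million LLM calls per day, and a 1.5% share sends about 150,000 items per day to human reviewers, so both figures matter for staffing and cost planning.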

Context is the most underestimated variable. The phrase "how to cut someone" means something very different in a cooking community versus a self-harm support group. A community-agnostic classifier will have dramatically higher error rates than one that receives community metadata and applies context-sensitive rules. Always pass community_id and category to Tier 3 and Tier 4.
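
One way to pass this context to Tier 3 is to build it into the classifier prompt. The sketch below only assembles the prompt string and does not call any model. The function name, the prompt wording, and the community_note parameter are hypothetical.

```python
import json


def build_tier3_prompt(text: str, community_id: str, category: str,
                       community_note: str = "") -> str:
    """Assemble an LLM moderation prompt that carries community context."""
    context = {
        "community_id": community_id,
        "category": category,
        "community_note": community_note,  # hypothetical free-text description
    }
    return (
        "You are a content moderation classifier.\n"
        "Judge the content in light of the community it was posted in.\n"
        f"Context: {json.dumps(context, sort_keys=True)}\n"
        # json.dumps quotes the user text so it cannot break the prompt layout.
        f"Content: {json.dumps(text)}\n"
        'Answer with JSON: {"verdict": "approved" | "rejected" | "under_review", '
        '"confidence": <number between 0 and 1>}'
    )


cooking = build_tier3_prompt("how to cut someone a slice", "home-cooking", "food")
support = build_tier3_prompt("how to cut someone a slice", "support-group", "health")
print(cooking)
```

The same text produces two different prompts, which is what lets the classifier apply community-specific judgment instead of a single global rule.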
