Design AI content moderation
Walk through designing a multi-layer AI content moderation system that handles 10M posts per day with sub-100ms latency, minimizes false positives, and routes borderline content to human review.
TL;DR
- Four tiers handle content at different costs and latencies: blocklist (under 1ms), ML classifier (under 10ms), LLM classifier (under 100ms), and human review (24-48 hours).
- Each tier escalates to the next only when uncertain. The LLM classifier touches only 5-10% of total volume, keeping cost manageable.
- At sufficient scale, false positives can do more damage than false negatives. A 0.5% false positive rate on 10M posts per day means 50,000 legitimate posts wrongly removed every day.
- Context is critical for LLM moderation. A phrase that is harmful in one community is normal discussion in another. Send category and community metadata alongside the content.
- Human reviewer decisions are the ground truth. Every decision feeds back into classifier training.
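The false positive arithmetic above can be checked with a few lines of Python. This is a rough sketch: the function name is illustrative, and it makes the simplifying assumption that the rate applies to the full daily volume, as the bullet does.

```python
DAILY_POSTS = 10_000_000  # volume stated in the design brief


def wrongful_removals(daily_posts: int, false_positive_rate: float) -> int:
    """Expected number of legitimate posts removed per day.

    Simplification: the rate is applied to the whole daily volume,
    i.e. nearly all posts are assumed to be legitimate.
    """
    return round(daily_posts * false_positive_rate)


print(wrongful_removals(DAILY_POSTS, 0.005))  # 0.5% FP rate -> 50000
print(wrongful_removals(DAILY_POSTS, 0.001))  # 0.1% FP rate -> 10000
```

Even at the 0.1% target used in the requirements, that is still 10,000 wrongful removals per day, which is why an appeal path and human review remain part of the design.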
Requirements
Functional requirements
- All content submissions (text and images) are screened before appearing on the platform.
- Clearly harmful content (known hate terms, verified spam) is rejected immediately with a policy violation explanation.
- Borderline content is routed to human review within 24 hours, and a temporary hold is applied until review is complete.
- Users whose content is removed can submit an appeal with additional context.
- The moderation decision and the tier that made it are logged for every piece of content for auditability.
Non-functional requirements
- Volume: 10M posts per day, peak 500 posts per second during live events.
- End-to-end latency for automated decision: under 100ms for 95% of content.
- False positive rate (FPR): under 0.1% for clearly non-harmful content.
- Human review queue: target of 24 hours per item, with a hard SLA of 48 hours.
- System availability: 99.95%. If moderation is unavailable, content would post without review, which is unacceptable.
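To make these targets concrete, the sketch below converts them into an average request rate, a peak-to-average ratio, and a yearly downtime budget. The helper names are illustrative and not part of the design.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_YEAR = 365 * 24 * 60


def average_rps(daily_posts: int) -> float:
    """Average posts per second implied by a daily volume."""
    return daily_posts / SECONDS_PER_DAY


def peak_to_average(peak_rps: float, daily_posts: int) -> float:
    """How many times higher the peak is than the daily average."""
    return peak_rps / average_rps(daily_posts)


def downtime_budget_minutes(availability: float) -> float:
    """Minutes of allowed downtime per year at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


print(f"average: {average_rps(10_000_000):.1f} posts/s")
print(f"peak/average: {peak_to_average(500, 10_000_000):.2f}x")
print(f"downtime budget: {downtime_budget_minutes(0.9995):.0f} min/year")
```

With the numbers above this gives roughly 116 posts per second on average, a peak about 4.3 times the average, and about 263 minutes (a little under 4.5 hours) of downtime per year. Capacity should be sized for the peak, not the average.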
The core entities
ContentItem
content_id, creator_id, text, image_url, community_id, category, submitted_at, status (pending/approved/rejected/under_review)
ModerationDecision
decision_id, content_id, tier_reached (1/2/3/4), verdict (approved/rejected/escalated), confidence, policy_category, decided_at, model_version
HumanReviewTask
task_id, content_id, assigned_to, status, decision, reasoning, decided_at, training_label
Appeal
appeal_id, content_id, creator_id, creator_context, assigned_to, status, final_verdict, created_at
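As an illustration, the first two entities could be modeled as Python dataclasses along these lines. The field names come from the lists above; the types, optional fields, and validation rules are assumptions, not something the design specifies. HumanReviewTask and Appeal would follow the same pattern.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class ContentStatus(str, Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"
    UNDER_REVIEW = "under_review"


class Verdict(str, Enum):
    APPROVED = "approved"
    REJECTED = "rejected"
    ESCALATED = "escalated"


@dataclass
class ContentItem:
    content_id: str
    creator_id: str
    community_id: str
    category: str
    submitted_at: datetime
    text: Optional[str] = None       # assumed optional: a post may be image-only
    image_url: Optional[str] = None  # assumed optional: a post may be text-only
    status: ContentStatus = ContentStatus.PENDING


@dataclass
class ModerationDecision:
    decision_id: str
    content_id: str
    tier_reached: int                # 1 through 4
    verdict: Verdict
    confidence: float                # assumed to be a probability in [0, 1]
    decided_at: datetime
    model_version: str
    policy_category: Optional[str] = None  # None when no policy was violated

    def __post_init__(self) -> None:
        # Assumed sanity checks; the design does not spell these out.
        if self.tier_reached not in (1, 2, 3, 4):
            raise ValueError("tier_reached must be 1, 2, 3, or 4")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
```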
API design
POST /api/moderate (synchronous, real-time submission)
Request: { "content_id": "c_abc", "text": "...", "image_url": "...", "community_id": "tech-news", "creator_id": "u_123" }
Response: { "verdict": "approved" | "rejected" | "under_review", "tier_reached": 2, "policy_category": null, "confidence": 0.97 }
POST /api/moderate/batch (async batch for re-moderation of existing content)
Request: { "content_ids": ["c_1", "c_2", ...], "reason": "policy_update" }
Response: { "batch_id": "batch_xyz", "queued": 5000, "estimated_completion_s": 300 }
POST /api/appeals (creator submits an appeal)
Request: { "content_id": "c_abc", "creator_context": "This is a medical discussion, not glorifying harm" }
Response: { "appeal_id": "appeal_99", "status": "pending", "expected_resolution_hours": 24 }
GET /api/moderation/metrics
Response: { "fpr_24h": 0.0008, "fnr_24h": 0.0041, "human_queue_depth": 1240, "avg_review_time_h": 6.2 }
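The ModerationDecision entity uses the verdicts approved/rejected/escalated, while POST /api/moderate returns approved/rejected/under_review. One plausible reading is that an internal "escalated" verdict is surfaced to callers as "under_review". The sketch below shows that translation; the function name and the mapping itself are assumptions rather than part of the stated API.

```python
from typing import Optional

# Assumed mapping from the internal verdict to the verdict exposed by the API.
_API_VERDICT = {
    "approved": "approved",
    "rejected": "rejected",
    "escalated": "under_review",
}


def to_api_response(
    verdict: str,
    tier_reached: int,
    confidence: float,
    policy_category: Optional[str] = None,
) -> dict:
    """Build the POST /api/moderate response body from an internal decision."""
    if verdict not in _API_VERDICT:
        raise ValueError(f"unknown verdict: {verdict!r}")
    return {
        "verdict": _API_VERDICT[verdict],
        "tier_reached": tier_reached,
        # Approved content violates no policy, so the category is always null.
        "policy_category": None if verdict == "approved" else policy_category,
        "confidence": confidence,
    }


print(to_api_response("approved", 2, 0.97))
print(to_api_response("escalated", 3, 0.55, "harassment"))
```

The first call reproduces the shape of the sample response shown for POST /api/moderate.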
High-level design
The four tiers operate as a chain where each level handles what it can and escalates the rest. Tier 1 (blocklist) is deterministic and handles the easiest cases in under 1ms. Tier 2 (ML classifier) handles the bulk of ambiguous cases in under 10ms. Tier 3 (LLM) handles genuinely difficult cases in under 100ms. Tier 4 (human) handles the hardest cases and all appeals asynchronously.
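The escalation chain can be sketched as a loop over tiers, where each tier either returns a verdict or signals that it is uncertain. This is a minimal illustration, not a production implementation: the blocklist term, the scoring functions, and the 0.9/0.1 thresholds are all placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class TierResult:
    verdict: Optional[str]  # "approved", "rejected", or None when uncertain
    confidence: float


Tier = Callable[[str], TierResult]

BLOCKLIST = {"spamlink.example"}  # placeholder term for illustration


def blocklist_tier(text: str) -> TierResult:
    """Tier 1: deterministic match; can only reject or pass the post on."""
    if any(term in text.lower() for term in BLOCKLIST):
        return TierResult("rejected", 1.0)
    return TierResult(None, 0.0)


def make_score_tier(score_fn: Callable[[str], float],
                    reject_at: float = 0.9,
                    approve_at: float = 0.1) -> Tier:
    """Tiers 2 and 3: decide only when the harm score is clearly high or low.

    score_fn stands in for an ML or LLM classifier returning P(harmful).
    The thresholds are assumed values, not taken from the design.
    """
    def tier(text: str) -> TierResult:
        score = score_fn(text)
        if score >= reject_at:
            return TierResult("rejected", score)
        if score <= approve_at:
            return TierResult("approved", 1.0 - score)
        return TierResult(None, max(score, 1.0 - score))
    return tier


def moderate(text: str, tiers: Sequence[Tier]) -> dict:
    """Run the automated tiers in order; stop at the first confident verdict."""
    last_confidence = 0.0
    for tier_number, tier in enumerate(tiers, start=1):
        result = tier(text)
        last_confidence = result.confidence
        if result.verdict is not None:
            return {"verdict": result.verdict,
                    "tier_reached": tier_number,
                    "confidence": result.confidence}
    # No automated tier was confident: hold the post for human review.
    return {"verdict": "under_review",
            "tier_reached": len(tiers) + 1,
            "confidence": last_confidence}
```

With three automated tiers in the list, a post that none of them can decide comes back with tier_reached set to 4, matching the human review tier.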
The key insight is that each tier handles sharply decreasing volume. Tier 1 screens 100% of posts but resolves only the clear-cut blocklist matches. Tier 2 sees roughly 80% of posts (the ones the blocklist did not catch) and resolves most of them. Tier 3 sees only the borderline subset from Tier 2, around 10% of total volume. Tier 4 sees the cases where Tier 3 confidence is low, around 1-2%. The expensive LLM call therefore runs on only a small fraction of total volume.
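Turning those percentages into daily counts shows what each tier has to absorb. The sketch below assumes every share is a fraction of total daily volume, which is consistent with the TL;DR's statement that the LLM touches 5-10% of total volume; the dictionary keys are illustrative labels.

```python
DAILY_POSTS = 10_000_000
SECONDS_PER_DAY = 24 * 60 * 60

# Approximate share of total volume reaching each tier (assumed to be of total).
TIER_SHARE = {
    "tier1_blocklist": 1.00,
    "tier2_ml": 0.80,
    "tier3_llm": 0.10,
    "tier4_human_low": 0.01,
    "tier4_human_high": 0.02,
}


def daily_volume(share: float, daily_posts: int = DAILY_POSTS) -> int:
    """Posts per day reaching a tier that sees the given share of traffic."""
    return round(daily_posts * share)


for name, share in TIER_SHARE.items():
    print(f"{name}: {daily_volume(share):,} posts/day")

llm_rps = daily_volume(TIER_SHARE["tier3_llm"]) / SECONDS_PER_DAY
print(f"average LLM calls: {llm_rps:.1f} per second")
```

Under these assumptions the LLM handles about 1,000,000 posts per day (roughly 11.6 calls per second on average), and the human queue receives 100,000 to 200,000 items per day, which is the number that drives reviewer staffing.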
Context is the most underestimated variable. The phrase "how to cut someone" means something very different in a cooking community versus a self-harm support group. A community-agnostic classifier will have dramatically higher error rates than one that receives community metadata and applies context-sensitive rules. Always pass community_id and category to Tier 3 and Tier 4.
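One way to pass that context to Tier 3 is to build it into the classifier prompt. The sketch below is hypothetical throughout: the community names, the guidance strings, and the prompt wording are invented for illustration, and no particular LLM API is assumed.

```python
# Hypothetical per-community guidance; a real system would load this from
# policy configuration rather than hard-code it.
COMMUNITY_NOTES = {
    "cooking": "Knife techniques and butchery terms are normal discussion here.",
    "self-harm-support": "Treat requests for methods or instructions as high risk.",
}


def build_tier3_prompt(text: str, community_id: str, category: str) -> str:
    """Assemble the Tier 3 input so the model sees the post in its context."""
    notes = COMMUNITY_NOTES.get(
        community_id, "No community-specific guidance available."
    )
    return (
        "You are a content moderation classifier.\n"
        f"Community: {community_id}\n"
        f"Category: {category}\n"
        f"Community guidance: {notes}\n"
        "Post:\n"
        f"{text}\n"
        "Answer with one of: approved, rejected, escalate."
    )


post = "What is the cleanest way to cut along the bone?"
print(build_tier3_prompt(post, "cooking", "recipes"))
```

The same post text produces a different prompt for each community, so the model can apply different standards to identical wording.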