Ad Platform
Walk through a complete ad platform design, from a basic campaign CRUD service to a two-stage retrieval-scoring pipeline serving the best ad per impression in under 100ms at 10B daily impressions, with smooth budget pacing and reliable attribution.
What is an ad management and serving system?
An ad management and serving system connects advertisers with users by selecting the best-matching creative per impression in under 100ms. The engineering challenge is the selection pipeline: thousands of campaigns compete for every impression and the winner must be chosen by relevance, bid, and remaining budget in single-digit milliseconds. This question tests candidate-retrieval data structures, budget pacing with distributed counters, high-throughput event ingestion, and delayed attribution, making it one of the richest hard-tier system design questions.
Functional Requirements
Core Requirements
- Advertisers can create campaigns with targeting criteria (demographics, interests, keywords), daily and total budgets, a schedule, and creative assets (image, headline, CTA, destination URL).
- When a user's feed loads, the system selects the best-matching ad(s) to show in under 100ms.
- Ad impressions and clicks are recorded and attributed to campaigns without data loss.
- Advertisers can view spend, impressions, clicks, and conversion dashboards, updated within minutes.
Below the Line (out of scope)
- ML model training for click-through-rate (CTR) prediction.
- Real-time bidding (RTB) with external demand-side platforms (DSPs).
- Billing, invoicing, and payment processing.
- Content review and brand safety moderation.
The hardest part in scope: Selecting the best ad per impression in under 100ms. With tens of thousands of active campaigns and complex targeting predicates, a naive database scan fails immediately. The selection pipeline needs a candidate retrieval stage (milliseconds, coarse) followed by a scoring stage (ranks by expected revenue per impression).
ML model training is below the line because it is a data-infrastructure problem separate from serving. The serving layer consumes a pre-trained CTR model as a black box. To add training, I would ship impression and click events to a feature store (Feast), run periodic training jobs with Spark or XGBoost, and deploy model artifacts to a model registry read by the scoring service.
RTB is below the line because it introduces sub-50ms latency budgets, the OpenRTB protocol, and external DSP integrations. To add RTB, the selection response from the first-party serving layer would be submitted as a floor price in an exchange auction alongside bids from external DSPs.
Billing does not change the serving path and is handled by a standalone billing service. A billing service would accumulate spend debits against a prepaid credit ledger and pause campaigns when the balance reaches zero.
Content review is below the line because moderation requires human reviewers or a specialized ML classifier and does not affect the serving hot path. To add it, new campaigns would route through a content classification service before transitioning from pending_review to active.
Non-Functional Requirements
Core Requirements
- Selection latency: Ad selection completes in under 100ms p99. The user's feed must not block on ad selection.
- Scale: 10 billion impressions per day, approximately 120,000 impression requests per second at peak. Click rate is typically 1 to 2 percent, giving 100 to 200 million clicks per day.
- Budget accuracy: A campaign's daily spend never exceeds its budget by more than 5%. Overspend protection is a contractual obligation with advertisers.
- Event durability: Impressions and clicks are captured with at-least-once delivery. Kafka retains events for 7 days for replay; acceptable duplicate rate after deduplication is under 0.01%. Losing an event is worse than recording it twice.
- Reporting freshness: Campaign dashboards reflect spend and engagement within 5 minutes. Exact real-time is not required; billing reconciliation runs hourly.
- Availability: 99.99% uptime for ad selection. Missing an impression is a direct revenue loss event. The serving path needs active-active multi-region deployment.
Below the Line
- Sub-10ms ad selection (would require on-device ML and edge-cached candidate sets)
- Exactly-once event delivery end-to-end (at-least-once with idempotent deduplication is sufficient and far cheaper)
Read/write ratio: For every campaign created, tens of millions of impression requests are evaluated against it. The serving path dwarfs campaign management by a factor of 10 million or more. Optimize for serving latency and throughput; campaign CRUD is a rounding error.
Under 100ms selection latency means a full database scan over all active campaigns is never viable at 120K requests per second. Even a 1ms per-request overhead against a single database creates a bottleneck no single instance can absorb. The selection tier must operate from in-memory data structures pre-loaded from the campaign store.
Budget accuracy at 120K impressions per second requires a distributed spend counter that can be decremented atomically without a centralized lock. This directly drives the budget pacing deep dive.
Core Entities
- Campaign: A business's advertising unit: budget (daily and total), schedule, targeting criteria, and a list of associated ads.
- Ad: A single creative unit (headline, image URL, destination URL, CTA) belonging to one campaign.
- Impression: A recorded event of an ad being displayed to a user. The primary billing and reporting unit.
- Click: A recorded event of a user clicking on an ad, referencing the originating impression ID.
- Conversion: A recorded event (purchase, sign-up) after a click or impression, reported by the advertiser's pixel or S2S postback.
- User: A platform user with behavioral signals, owned by the user profile service; the targeting layer queries it on each selection request.
The primary relationship is Campaign (1) to Ad (many). Impressions reference both the Ad shown and the User who saw it. Clicks reference an Impression. Conversions reference either a Click (click-through) or an Impression (view-through). Full schema and indexing are deferred to the deep dives.
API Design
One endpoint per core functional requirement, grouped by the requirement it satisfies.
FR 1: Create a campaign
POST /campaigns
Authorization: Bearer {advertiser_token}
Body: {
name: string,
daily_budget_cents: number,
total_budget_cents: number,
start_date: "YYYY-MM-DD",
end_date: "YYYY-MM-DD",
targeting: {
age_range: [25, 45],
interests: ["travel", "photography"],
geo: { country: "US", regions: ["CA", "NY"] },
keywords: ["mirrorless camera"]
}
}
Response 201: { campaign_id: "cmp_7x9q2k", status: "pending_review" }
The campaign starts in pending_review to allow a lightweight content check before it enters the serving pool. No ads are served from a pending_review campaign.
Create an ad inside a campaign:
POST /campaigns/{campaign_id}/ads
Authorization: Bearer {advertiser_token}
Body: {
headline: string,
image_url: string,
destination_url: string,
cta_text: string,
bid_cents: number
}
Response 201: { ad_id: "ad_4f8vja" }
bid_cents is the advertiser's maximum cost-per-click. Combined with predicted CTR, it drives the eCPM score used for selection: eCPM = bid_cents * predicted_ctr * 1000.
FR 2: Ad selection (the serving endpoint)
GET /ads/select?user_id=u_abc&placement=feed&count=1
Authorization: Bearer {platform_token}
Response 200: {
ads: [{
ad_id: "ad_4f8vja",
impression_id: "imp_9z3x1q",
headline: "Capture every moment",
image_url: "https://cdn.example.com/creatives/ad_4f8vja.jpg",
destination_url: "https://advertiser.example.com/cameras",
cta_text: "Shop now"
}]
}
The impression_id is generated server-side before responding so the client can fire an impression event using this same ID without a round trip. This separates selection latency from event recording latency.
FR 3: Record impressions and clicks
POST /events/impression
Body: { impression_id, ad_id, campaign_id, user_id, timestamp, placement }
Response 200: { ok: true }
POST /events/click
Body: { impression_id, ad_id, campaign_id, user_id, timestamp }
Response 200: { ok: true }
The client fires both endpoints as non-blocking fire-and-forget calls. Clicks include impression_id so the attribution pipeline can link the click back to the impression that preceded it. Implementation is a non-blocking Kafka publish, not a synchronous DB INSERT; I cover the details in the deep dive.
FR 4: Campaign reporting
GET /campaigns/{campaign_id}/stats?start_date=...&end_date=...&granularity=day
Authorization: Bearer {advertiser_token}
Response 200: {
impressions: 4200000,
clicks: 63000,
ctr: 0.015,
spend_cents: 441000,
conversions: 840,
data: [{ date, impressions, clicks, spend_cents }]
}
Data is served from pre-aggregated rollup tables, not a live query over billions of raw event rows. Freshness is within 5 minutes per the NFR.
High-Level Design
The system has four distinct flows, each mapping to one functional requirement. I build the design incrementally: start with the simplest component pair, then add exactly what each new requirement demands.
1. Advertisers can create campaigns with targeting and creative assets
The write path is straightforward. An advertiser submits a campaign definition; the Management Service validates it, stores it in the Campaign DB, and enqueues a review job. Once approved, the campaign becomes eligible for serving.
Components:
- Management Service: Validates campaign and ad payloads, enforces budget minimums, writes to Campaign DB.
- Campaign DB: PostgreSQL. Stores campaigns, ads, targeting criteria, budget configurations, and approval states. Low write volume (hundreds per minute, not per second).
- Review Queue: A lightweight async queue. Campaign review is a human or rule-based process that runs outside the critical path.
Request walkthrough:
- Advertiser sends
POST /campaignswith targeting rules and budget. - Management Service validates the payload (budget > 0, dates are valid, targeting schema is correct).
- Management Service inserts the campaign into Campaign DB with
status = pending_review. - Management Service enqueues a review task. Returns
campaign_idto the advertiser. - Review process approves the campaign, updating
status = active. - Campaign Sync Service (see next section) picks up the active campaign and loads it into the serving layer's in-memory index.
Creative assets (images, video) are uploaded directly to S3 by the advertiser and served globally via a CDN (CloudFront or Fastly). When an advertiser creates an ad, the image_url field contains the CDN URL, not the S3 origin URL. The Selection Service returns this CDN URL verbatim in the ad selection response; the client fetches the image directly from the nearest CDN edge node without touching the ad platform. A 500KB creative served from the nearest point of presence loads in under 20ms for 95% of users globally.
This covers campaign management and creative asset serving. Ad selection does not touch Campaign DB on the hot path; the serving layer reads from a pre-built in-memory index.
2. User's feed loads and the system selects the best ad in under 100ms
This is the hard part. Let me start with the naive approach and show why it fails before evolving to the correct design.
Naive approach: query Campaign DB directly
Components:
- Selection Service: Receives the ad select request with user context.
- Campaign DB: Scanned to find matching active campaigns.
Request walkthrough:
- Feed service sends
GET /ads/select?user_id=u_abc&placement=feed. - Selection Service queries Campaign DB:
SELECT * FROM campaigns WHERE status='active' AND geo matches AND interest_tags overlap AND budget_remaining > 0. - Selection Service scores the results and returns the top ad.
What breaks: Campaign DB has 50,000 active campaigns. A SELECT with an interests && ? array overlap operator and a remaining_budget > 0 check scans every row. At 120,000 requests per second, each query is 10 to 50ms on a warm DB. The database saturates well below this rate; no amount of indexing saves a full-table relevance scan under this load.
The fix: pre-built in-memory candidate index + two-stage selection
The key insight is that the candidate set for any given user is small (a few hundred campaigns) but the full campaign set is large (tens of thousands). We separate the problem into two steps: candidate retrieval (coarse, fast) and scoring (fine, computed on the small candidate set).
Evolved components:
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.