A/B Testing Platform
Design an experimentation platform like Optimizely or Google Experiments that assigns users to treatments consistently, measures statistical impact on key metrics, and lets teams run hundreds of concurrent experiments safely.
What is an A/B testing system?
An A/B testing platform assigns users to experiment variants (control vs. one or more treatments) and measures whether the treatment changes a business metric. The interesting engineering challenges are not the statistics: they are ensuring every user sees the same variant on every request without querying a database, isolating hundreds of concurrent experiments from contaminating each other's results, and streaming millions of conversion events into per-variant aggregates fast enough for teams to act on them.
I've seen teams spend months building a sophisticated stats engine only to discover their assignment logic was flipping users between variants mid-session. Get assignment right first; the statistics are the easy part.
Functional Requirements
Core Requirements
- Experiment owners define variants (control and one or more treatments) with traffic allocation percentages and targeting rules.
- The system assigns a user to exactly one variant per experiment, and that assignment never changes for the duration of the experiment.
- Client SDKs retrieve all active experiment assignments for a user in under 10ms.
- Success metrics (impressions, conversions, revenue) are collected and computed per variant.
Below the Line (out of scope)
- Statistical significance calculation and p-value computation (delegate to a stats library such as SciPy or statsmodels).
- Multi-armed bandit and Bayesian optimization.
- Fraud and bot filtering for experiment traffic.
- Real-time metric dashboards (accept up to 5-minute lag in aggregates).
The hardest part in scope: Consistent user assignment without a per-request database lookup. This is the constraint that shapes the entire read path. If assignment is slow, every page load in your product is slow.
Statistical significance calculation is below the line because the platform only needs to produce the raw aggregates (exposures and conversions per variant). Calling scipy.stats.chi2_contingency on those aggregates is a one-line operation any experiment analyst can run outside the platform. Designing the significance engine adds months of complexity for a feature that any analyst can replicate locally in 30 seconds.
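To make the "run it outside the platform" point concrete, here is a minimal sketch of what an analyst could do with the raw aggregates. It swaps scipy.stats.chi2_contingency for a hand-rolled Pearson chi-square on a 2x2 table using only the standard library, and it applies no continuity correction (SciPy applies Yates' correction to 2x2 tables by default, so its numbers will differ slightly). The function name chi_square_2x2 is hypothetical.

```python
import math


def chi_square_2x2(conv_a, exposed_a, conv_b, exposed_b):
    """Pearson chi-square test for two variants (1 degree of freedom).

    Takes conversions and exposures per variant and returns (chi2, p_value).
    No continuity correction is applied.
    """
    a, b = conv_a, exposed_a - conv_a  # variant A: converted, not converted
    c, d = conv_b, exposed_b - conv_b  # variant B: converted, not converted
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Illustrative aggregates: 5.0% vs 6.0% conversion on 250,000 exposures each.
chi2, p_value = chi_square_2x2(12_500, 250_000, 15_000, 250_000)
print(f"chi2={chi2:.1f}, p={p_value:.3g}")
```

The platform's only job in this picture is to produce trustworthy exposure and conversion counts per variant.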
Multi-armed bandit optimization is below the line because it requires changing variant weights mid-experiment based on incoming results, which breaks the assumption that assignment probabilities are stable for the analysis window. Designing that safely (avoiding peeking problems and inflated false-positive rates) is a research-level problem that deserves its own article.
Fraud and bot filtering is below the line because it requires ML-based classification that sits orthogonal to the assignment and metrics pipeline. To add it, I would enrich incoming exposure events with a bot-score from an off-the-shelf fingerprinting service and filter bot users out of the analytics aggregation step.
Non-Functional Requirements
Core Requirements
- Assignment consistency: Once a user is assigned to a variant, they see the same variant for the experiment lifetime. No flipping mid-experiment.
- Availability: 99.99% uptime for the assignment path. A failure in the assignment logic must fall back to the control variant, never crash the client.
- Latency: SDK retrieves all active experiment assignments in under 10ms p99. This is the NFR that drives every architectural decision on the read path.
- Scale: 500 concurrent active experiments, 50 million DAU, peak assignment lookup rate of approximately 50,000 requests per second. Each experiment write (create, update, launch) happens at most a few times per day per experiment.
Below the Line
- Sub-5ms assignment latency via pure in-process computation (covered in the deep dive but not a primary target).
- Sub-second metric freshness (5-minute lag is acceptable).
Read/write ratio: For every experiment created or updated (roughly 100 writes per day across all experiments), there are approximately 50,000 assignment lookups per second at peak. Sustained over a day, that is about 4.3 billion lookups (50,000 × 86,400 seconds) against 100 writes, a read/write ratio of roughly 43 million to 1. This extreme imbalance means the assignment path must never touch the primary database. Every design decision on the read path exists to eliminate that database call.
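The ratio is quick to sanity-check. The sketch below assumes the peak lookup rate is sustained for a full day, which makes it an upper bound rather than a measured figure.

```python
SECONDS_PER_DAY = 24 * 60 * 60

peak_lookups_per_second = 50_000  # from the scale requirement
config_writes_per_day = 100       # rough estimate across all experiments

# Assumption: peak traffic is sustained all day, so this is an upper bound.
lookups_per_day = peak_lookups_per_second * SECONDS_PER_DAY
read_write_ratio = lookups_per_day // config_writes_per_day

print(f"{lookups_per_day:,} lookups/day, about {read_write_ratio:,} reads per write")
```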
Under 10ms assignment latency means a round trip to a remote cache is already risky. A Redis lookup adds 1ms under ideal conditions, but p99 latency on a busy cluster can spike to 5-10ms, consuming the entire budget.
I'd call out this latency budget early in the interview because it eliminates most caching architectures before you even draw a diagram. The safe approach serves experiment configs from an in-process SDK cache seeded by a CDN-backed config endpoint. Assignment computation then becomes a pure in-memory operation measured in microseconds.
Core Entities
- Experiment: The container for a test. Carries a unique key, status (draft, active, paused, concluded), targeting rules, and the date range for the analysis window.
- Variant: One arm of an experiment (control or a named treatment). Carries a variant key, a traffic allocation percentage, and an arbitrary JSON config payload that the client SDK uses to alter the experience.
- Assignment: A durable record of which variant a specific user was placed into and when. Written at first exposure. Serves as the ground truth for metric attribution.
- Event: A user action used as a success metric. Carries a user ID, event name, optional numeric value (e.g. order amount), and a timestamp. Events are linked to assignments at aggregation time, not at tracking time.
- Metric: A named aggregation definition tied to an experiment variant. Stores the exposure count, conversion count, and optional sum (for revenue-type metrics) over the analysis window.
The full schema and column types will be revisited during the data model deep dive if scope expands to include it; the entities above are sufficient to drive the API design and High-Level Design.
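As a rough illustration of how Assignment and Event relate, the sketch below joins events to assignments by user ID at aggregation time and rolls them up into the per-variant counts a Metric stores. The field names, the aggregate function, and two of its rules (a user counts as at most one conversion, and events that predate the assignment are ignored) are assumptions made for the example, not something the design above specifies.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Assignment:
    user_id: str
    experiment_key: str
    variant_key: str
    assigned_at: float  # epoch seconds of first exposure


@dataclass(frozen=True)
class Event:
    user_id: str
    name: str
    timestamp: float  # epoch seconds
    value: Optional[float] = None  # e.g. order amount


def aggregate(assignments, events, experiment_key, metric_name):
    """Join events to assignments by user_id and roll up per variant."""
    by_user = {a.user_id: a for a in assignments
               if a.experiment_key == experiment_key}
    totals = defaultdict(lambda: {"exposures": 0, "conversions": 0, "sum": 0.0})
    for a in by_user.values():
        totals[a.variant_key]["exposures"] += 1

    converted_users = set()
    for e in events:
        a = by_user.get(e.user_id)
        # Skip unassigned users, other metrics, and pre-exposure events.
        if a is None or e.name != metric_name or e.timestamp < a.assigned_at:
            continue
        bucket = totals[a.variant_key]
        if e.user_id not in converted_users:  # one conversion per user
            converted_users.add(e.user_id)
            bucket["conversions"] += 1
        bucket["sum"] += e.value or 0.0
    return dict(totals)
```

Because the join happens here rather than at tracking time, the client never needs to know which experiments an event belongs to.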
API Design
FR 1 and FR 4 - Create and launch an experiment:
# Create a new experiment in draft status
POST /experiments
Body: {
  key: "signup_button_color",
  variants: [
    { key: "control", allocation: 50, config: {} },
    { key: "treatment_a", allocation: 50, config: { button_color: "green" } }
  ],
  targeting: { platforms: ["web"], user_segments: ["new_users"] },
  metrics: ["signup_conversion", "revenue_30d"]
}
Response: { experiment_id, status: "draft" }

# Transition experiment from draft to active (launches it)
PATCH /experiments/{experiment_id}
Body: { status: "active" }
Response: { experiment_id, status: "active" }
The launch uses PATCH rather than PUT because it modifies one field on an existing resource instead of replacing the whole resource. Separating create from launch lets teams configure an experiment in draft before exposing it to users.
FR 2 and FR 3 - Retrieve assignments for a user:
# Fetch all active experiment assignments for a user in one call
GET /assignments?user_id={user_id}
Response: {
  assignments: {
    "signup_button_color": "treatment_a",
    "homepage_layout": "control"
  }
}
The SDK calls this endpoint once per session (or polls periodically) and caches the result in memory. Returning all active experiment assignments in one payload avoids per-experiment round trips. An SDK making 500 separate calls for a user enrolled in 500 experiments would be unusable.
FR 4 - Track an event:
# Track batched user events; server returns after Kafka publish, not after aggregation
POST /events
Body: {
  user_id: "u123",
  events: [
    { name: "signup_conversion", timestamp: "2026-04-02T10:00:00Z" },
    { name: "purchase", value: 49.99, timestamp: "2026-04-02T10:01:00Z" }
  ]
}
Response: { received: 2 }
Events are batched on the client SDK and flushed in bulk to reduce request overhead. The server validates the schema and publishes to the event pipeline without waiting for downstream aggregation to complete. A 202 Accepted response confirms receipt, not processing.
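A minimal sketch of the client-side batching described above, assuming a hypothetical EventBatcher class and an injected send callable that stands in for the real POST /events request. The max_batch threshold is an arbitrary illustrative value; a real SDK would likely also flush on a timer and on page unload.

```python
import time


class EventBatcher:
    """Buffers events in memory and flushes them as one POST /events body."""

    def __init__(self, user_id, send, max_batch=20):
        self.user_id = user_id
        self._send = send            # hypothetical transport: takes the body dict
        self._max_batch = max_batch  # illustrative threshold, not a recommendation
        self._buffer = []

    def track(self, name, value=None):
        event = {
            "name": name,
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        }
        if value is not None:
            event["value"] = value
        self._buffer.append(event)
        if len(self._buffer) >= self._max_batch:
            self.flush()

    def flush(self):
        if not self._buffer:
            return
        body = {"user_id": self.user_id, "events": self._buffer}
        self._buffer = []  # swap the buffer before sending
        self._send(body)


sent = []
batcher = EventBatcher("u123", send=sent.append, max_batch=2)
batcher.track("signup_conversion")
batcher.track("purchase", value=49.99)  # second event triggers a flush
```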
FR 4 - Retrieve metric results for an experiment:
# Retrieve per-variant metric aggregates for an experiment
GET /experiments/{experiment_id}/results
Response: {
  variants: [
    {
      key: "control",
      exposures: 250000,
      metrics: {
        "signup_conversion": { conversions: 12500, rate: 0.050 },
        "revenue_30d": { sum: 875000.00, mean_per_user: 3.50 }
      }
    },
    {
      key: "treatment_a",
      exposures: 250000,
      metrics: {
        "signup_conversion": { conversions: 15000, rate: 0.060 },
        "revenue_30d": { sum: 1050000.00, mean_per_user: 4.20 }
      }
    }
  ]
}
This endpoint is read-only and expensive; cache its response for 60 seconds keyed on experiment ID to prevent analysts from triggering repeated full-table scans when refreshing the results page.
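One way to picture the 60-second cache is a small in-process TTL map keyed on experiment ID. The ResultsCache class and its compute callback are hypothetical names for this sketch; a production deployment would more likely put this in a shared cache so every API instance benefits from it.

```python
import time


class ResultsCache:
    """Caches expensive results per experiment_id for a fixed TTL."""

    def __init__(self, compute, ttl_seconds=60.0, clock=time.monotonic):
        self._compute = compute  # hypothetical: runs the expensive aggregate query
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested
        self._entries = {}       # experiment_id -> (expires_at, value)

    def get(self, experiment_id):
        now = self._clock()
        entry = self._entries.get(experiment_id)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: skip the expensive query
        value = self._compute(experiment_id)
        self._entries[experiment_id] = (now + self._ttl, value)
        return value
```

Repeated refreshes of the results page within the TTL then cost one query instead of one per click.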
High-Level Design
1. Experiment owners define variants and targeting rules
The write path for experiment configuration. Admins create and launch experiments through a management API that writes to a relational database.
Components:
- Admin Client: The product team's web UI or CI/CD tooling sending experiment definitions.
- Experiment API: Validates variant allocation percentages sum to 100%, persists the experiment and variant records, and invalidates the config cache on any change.
- Experiment DB: The source of truth for all experiment definitions. Relational storage suits this well: experiments and variants are small structured records with clear relationships.
Request walkthrough:
- Admin sends POST /experiments with variant definitions and targeting rules.
- API validates that allocation percentages sum to exactly 100%.
- API writes one row to the experiments table and one row per variant to the variants table.
- API publishes a config-invalidation event so the cache tier reflects the new experiment.
- API returns { experiment_id, status: "draft" }.
- Admin sends PATCH /experiments/{id} with status: "active" to launch.
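The allocation check in the walkthrough can be sketched as a small validator. This assumes allocations are whole-number percentages, as in the API example; with fractional allocations the equality check would need a tolerance. The function name validate_allocations is hypothetical.

```python
def validate_allocations(variants):
    """Reject variant lists whose traffic allocations do not sum to 100.

    Assumes each variant is a dict with an integer "allocation" percentage.
    """
    if not variants:
        raise ValueError("an experiment needs at least one variant")
    for v in variants:
        if v["allocation"] < 0:
            raise ValueError(f"variant {v['key']!r} has a negative allocation")
    total = sum(v["allocation"] for v in variants)
    if total != 100:
        raise ValueError(f"allocations sum to {total}, expected 100")


validate_allocations([
    {"key": "control", "allocation": 50},
    {"key": "treatment_a", "allocation": 50},
])  # passes silently
```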
This is the write path only. The read path that distributes these configs to SDKs comes next.
2. Users are assigned to variants consistently
Every user request needs to know which variant to show. The naive approach is to call the Experiment API on every request. I would never recommend that in practice: at 50K assignment lookups per second, even a small increase in API latency directly degrades every page in the product. This is the section where the interview is won or lost. If your assignment path touches a database on every request, the interviewer knows the design won't hold.
The key insight is that assignment does not require a network call. If the experiment config is available locally (which variant holds which bucket range), assignment is a deterministic hash computation: hash(user_id + experiment_id) modulo 100, compared against the variant allocation ranges.
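A minimal sketch of that computation follows. It makes a few assumptions beyond the text: SHA-256 as the hash (Python's built-in hash() is randomized per process for strings, so it would not give stable buckets), a ":" separator between the two IDs so that different ID pairs cannot concatenate to the same string, and variants listed in a fixed order with the first one acting as the fallback. The function names are hypothetical.

```python
import hashlib


def bucket_for(user_id, experiment_id):
    """Map a (user, experiment) pair to a stable bucket in [0, 100)."""
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % 100


def assign_variant(user_id, experiment_id, variants):
    """Pick the variant whose cumulative allocation range contains the bucket."""
    bucket = bucket_for(user_id, experiment_id)
    upper = 0
    for variant in variants:
        upper += variant["allocation"]
        if bucket < upper:
            return variant["key"]
    # Only reached if allocations sum to less than 100: fall back to the first variant.
    return variants[0]["key"]


variants = [
    {"key": "control", "allocation": 50},
    {"key": "treatment_a", "allocation": 50},
]
print(assign_variant("u123", "signup_button_color", variants))
```

Including the experiment ID in the hash input means a user's bucket in one experiment is independent of their bucket in another, which keeps concurrent experiments from being correlated.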
Components:
- Client SDK: An in-process library embedded in every service that needs assignments. Holds a local copy of the active experiment configs, refreshed periodically from the Config endpoint.
- Assignment Logic: Pure in-memory computation inside the SDK. No network call needed once configs are loaded.
- Config Cache (Redis): Serves as the intermediary between the Experiment DB and SDKs. The Experiment API invalidates this cache on every experiment change.
Request walkthrough:
- SDK calls GET /assignments?user_id=u123 on startup and after each config refresh.
- Assignment Service fetches active experiment configs from Redis (sub-millisecond).
- For each active experiment, the service computes hash(user_id + experiment_id) % 100 and maps the result to a variant bucket.
- The service returns all assignments for the user in a single response object.
- SDK caches the assignments in process memory for the remainder of the session.
- All SDK calls (getVariant("signup_button_color")) are now pure memory lookups.
Once the SDK has this map, every subsequent call to getVariant() is a hash-map lookup in the process heap. That is how sub-millisecond assignment is achievable even without edge caching.
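The SDK side of this can be sketched as a thin wrapper around a dictionary. The ExperimentClient class and the injected fetch_assignments callable (standing in for the GET /assignments request) are hypothetical; the fallback to control on any failure follows the availability requirement stated earlier.

```python
class ExperimentClient:
    """Holds a user's assignments in memory; lookups never hit the network."""

    def __init__(self, fetch_assignments, default_variant="control"):
        self._fetch = fetch_assignments  # hypothetical: user_id -> {experiment: variant}
        self._default = default_variant
        self._assignments = {}

    def refresh(self, user_id):
        try:
            self._assignments = dict(self._fetch(user_id))
        except Exception:
            # Keep the last known assignments; never crash the host application.
            pass

    def get_variant(self, experiment_key):
        # Unknown or not-yet-loaded experiments fall back to control.
        return self._assignments.get(experiment_key, self._default)


client = ExperimentClient(lambda user_id: {"signup_button_color": "treatment_a"})
client.refresh("u123")
print(client.get_variant("signup_button_color"))
print(client.get_variant("unknown_experiment"))
```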
3. Client SDKs retrieve assignments in under 10ms
The assignment endpoint above works for server-side SDKs. Client-side SDKs (browser and mobile) face an additional constraint: the first page load cannot afford a round trip to the Assignment Service before rendering. Users would see a flash of the control version before the treatment loads.
The fix is to push experiment configs to the edge so the SDK can compute assignments locally without a server round trip. This is the CDN-distributed config pattern.
Components:
- Config Endpoint: Serializes all active experiment definitions into a single JSON snapshot with an ETag fingerprint. The CDN caches this response globally.
- CDN Edge: Serves the config snapshot from the nearest edge node. Cache TTL of 30 seconds, so an experiment launch reaches every edge node within about 30 seconds; clients pick it up on their next poll after that.
- Client SDK: Polls the Config Endpoint periodically (every 30 seconds) using conditional If-None-Match requests. Stores configs in localStorage so they survive page reloads. Computes assignments in-process using the same deterministic hash.
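The ETag handshake between the SDK and the Config Endpoint can be sketched without any network code. The function names, the choice of a truncated SHA-256 digest as the fingerprint, and the exact Cache-Control value are assumptions for illustration; the request is modeled as a plain function call returning (status, headers, body).

```python
import hashlib
import json


def build_snapshot(experiments):
    """Serialize active experiments deterministically and fingerprint the bytes."""
    body = json.dumps(experiments, sort_keys=True, separators=(",", ":")).encode("utf-8")
    etag = '"' + hashlib.sha256(body).hexdigest()[:16] + '"'
    return body, etag


def handle_config_request(experiments, if_none_match=None):
    """Return (status, headers, body) for a GET on the config endpoint."""
    body, etag = build_snapshot(experiments)
    headers = {"ETag": etag, "Cache-Control": "public, max-age=30"}
    if if_none_match == etag:
        return 304, headers, b""  # client copy is current: send no body
    return 200, headers, body


experiments = {"signup_button_color": {"control": 50, "treatment_a": 50}}
status, headers, body = handle_config_request(experiments)
# A later poll sends the stored ETag back in If-None-Match.
status_again, _, body_again = handle_config_request(experiments, headers["ETag"])
print(status, status_again)
```

Because the snapshot is serialized with sorted keys, the fingerprint changes only when the experiment definitions change, so most polls cost a header exchange and no payload.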
Request walkthrough: