A/B Testing Platform
Design an experimentation platform like Optimizely or Google Experiments that assigns users to treatments consistently, measures statistical impact on key metrics, and lets teams run hundreds of concurrent experiments safely.
What is an A/B testing system?
An A/B testing platform assigns users to experiment variants (control vs. one or more treatments) and measures whether the treatment changes a business metric. The interesting engineering challenges are not the statistics: they are ensuring every user sees the same variant on every request without querying a database, isolating hundreds of concurrent experiments from contaminating each other's results, and streaming millions of conversion events into per-variant aggregates fast enough for teams to act on them.
I've seen teams spend months building a sophisticated stats engine only to discover their assignment logic was flipping users between variants mid-session. Get assignment right first; the statistics are the easy part.
Functional Requirements
Core Requirements
- Experiment owners define variants (control and one or more treatments) with traffic allocation percentages and targeting rules.
- The system assigns a user to exactly one variant per experiment, and that assignment never changes for the duration of the experiment.
- Client SDKs retrieve all active experiment assignments for a user in under 10ms.
- Success metrics (impressions, conversions, revenue) are collected and computed per variant.
Below the Line (out of scope)
- Statistical significance calculation and p-value computation (delegate to a stats library such as SciPy or statsmodels).
- Multi-armed bandit and Bayesian optimization.
- Fraud and bot filtering for experiment traffic.
- Real-time metric dashboards (accept up to 5-minute lag in aggregates).
The hardest part in scope: Consistent user assignment without a per-request database lookup. This is the constraint that shapes the entire read path. If assignment is slow, every page load in your product is slow.
Statistical significance calculation is below the line because the platform only needs to produce the raw aggregates (exposures and conversions per variant). Calling scipy.stats.chi2_contingency on those aggregates is a one-line operation any experiment analyst can run outside the platform. Designing the significance engine adds months of complexity for a feature that any analyst can replicate locally in 30 seconds.
Multi-armed bandit optimization is below the line because it requires changing variant weights mid-experiment based on incoming results, which breaks the assumption that assignment probabilities are stable for the analysis window. Designing that safely (avoiding peeking problems and inflated false-positive rates) is a research-level problem that deserves its own article.
Fraud and bot filtering is below the line because it requires ML-based classification that sits orthogonal to the assignment and metrics pipeline. To add it, I would enrich incoming exposure events with a bot-score from an off-the-shelf fingerprinting service and filter bot users out of the analytics aggregation step.
Non-Functional Requirements
Core Requirements
- Assignment consistency: Once a user is assigned to a variant, they see the same variant for the experiment lifetime. No flipping mid-experiment.
- Availability: 99.99% uptime for the assignment path. A failure in the assignment logic must fall back to the control variant, never crash the client.
- Latency: SDK retrieves all active experiment assignments in under 10ms p99. This is the NFR that drives every architectural decision on the read path.
- Scale: 500 concurrent active experiments, 50 million DAU, peak assignment lookup rate of approximately 50,000 requests per second. Each experiment write (create, update, launch) happens at most a few times per day per experiment.
Below the Line
- Sub-5ms assignment latency via pure in-process computation (covered in the deep dive but not a primary target).
- Sub-second metric freshness (5-minute lag is acceptable).
Read/write ratio: For every experiment created or updated (roughly 100 writes per day across all experiments), there are approximately 50,000 assignment lookups per second. That is a read/write ratio of roughly 40 million to 1. This extreme imbalance means the assignment path must never touch the primary database. Every design decision on the read path exists to eliminate that database call.
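A quick arithmetic check of that ratio, using the two estimates stated above:

```python
# Back-of-envelope check of the read/write ratio quoted above.
reads_per_sec = 50_000          # peak assignment lookup rate from the NFRs
writes_per_day = 100            # experiment creates/updates across all teams

reads_per_day = reads_per_sec * 86_400   # seconds in a day
ratio = reads_per_day // writes_per_day

print(f"{reads_per_day:,} reads/day vs {writes_per_day} writes/day")
print(f"read/write ratio: {ratio:,} to 1")   # 43,200,000 to 1, i.e. roughly 40M:1
```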
Under 10ms assignment latency means a round trip to a remote cache is already risky. A Redis lookup adds 1ms under ideal conditions, but p99 latency on a busy cluster can spike to 5-10ms, consuming the entire budget.
I'd call out this latency budget early in the interview because it eliminates most caching architectures before you even draw a diagram. The safe approach serves experiment configs from an in-process SDK cache seeded by a CDN-backed config endpoint. Assignment computation then becomes a pure in-memory operation measured in microseconds.
Core Entities
- Experiment: The container for a test. Carries a unique key, status (draft, active, paused, concluded), targeting rules, and the date range for the analysis window.
- Variant: One arm of an experiment (control or a named treatment). Carries a variant key, a traffic allocation percentage, and an arbitrary JSON config payload that the client SDK uses to alter the experience.
- Assignment: A durable record of which variant a specific user was placed into and when. Written at first exposure. Serves as the ground truth for metric attribution.
- Event: A user action used as a success metric. Carries a user ID, event name, optional numeric value (e.g. order amount), and a timestamp. Events are linked to assignments at aggregation time, not at tracking time.
- Metric: A named aggregation definition tied to an experiment variant. Stores the exposure count, conversion count, and optional sum (for revenue-type metrics) over the analysis window.
The full schema and column types will be revisited during the data model deep dive if scope expands to include it; the entities above are sufficient to drive the API design and High-Level Design.
API Design
FR 1 and FR 4 - Create and launch an experiment:
# Create a new experiment in draft status
POST /experiments
Body: {
key: "signup_button_color",
variants: [
{ key: "control", allocation: 50, config: {} },
{ key: "treatment_a", allocation: 50, config: { button_color: "green" } }
],
targeting: { platforms: ["web"], user_segments: ["new_users"] },
metrics: ["signup_conversion", "revenue_30d"]
}
Response: { experiment_id, status: "draft" }
# Transition experiment from draft to active (launches it)
PATCH /experiments/{experiment_id}
Body: { status: "active" }
Response: { experiment_id, status: "active" }
PATCH over PUT for status transitions because we are modifying one field on an existing resource. Separating create from launch lets teams configure an experiment in draft before exposing it to users.
FR 2 and FR 3 - Retrieve assignments for a user:
# Fetch all active experiment assignments for a user in one call
GET /assignments?user_id={user_id}
Response: {
assignments: {
"signup_button_color": "treatment_a",
"homepage_layout": "control"
}
}
The SDK calls this endpoint once per session (or polls periodically) and caches the result in memory. Returning all active experiment assignments in one payload avoids per-experiment round trips. An SDK making 500 separate calls for a user enrolled in 500 experiments would be unusable.
FR 4 - Track an event:
# Track batched user events; server returns after Kafka publish, not after aggregation
POST /events
Body: {
user_id: "u123",
events: [
{ name: "signup_conversion", timestamp: "2026-04-02T10:00:00Z" },
{ name: "purchase", value: 49.99, timestamp: "2026-04-02T10:01:00Z" }
]
}
Response: { received: 2 }
Events are batched on the client SDK and flushed in bulk to reduce request overhead. The server validates schema and publishes to the event pipeline without waiting for downstream aggregation to complete. A 201 response confirms receipt, not processing.
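The client-side batching described above can be sketched as a small buffer. This is illustrative, not a real SDK API; `transport` and `flush_size` are assumed names, and a production SDK would also flush on a timer and on page unload:

```python
import time

class EventBuffer:
    """Sketch of SDK-side event batching (illustrative names, not a real SDK).

    Events accumulate in memory and are flushed as one POST /events body
    once the batch reaches flush_size.
    """

    def __init__(self, user_id, transport, flush_size=20):
        self.user_id = user_id
        self.transport = transport      # callable that performs the HTTP POST
        self.flush_size = flush_size
        self.pending = []

    def track(self, name, value=None):
        event = {"name": name, "timestamp": time.time()}
        if value is not None:
            event["value"] = value      # e.g. an order amount
        self.pending.append(event)
        if len(self.pending) >= self.flush_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        body = {"user_id": self.user_id, "events": self.pending}
        self.transport(body)            # fire one batched request
        self.pending = []
```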
FR 4 - Retrieve metric results for an experiment:
# Retrieve per-variant metric aggregates for an experiment
GET /experiments/{experiment_id}/results
Response: {
variants: [
{
key: "control",
exposures: 250000,
metrics: {
"signup_conversion": { conversions: 12500, rate: 0.050 },
"revenue_30d": { sum: 875000.00, mean_per_user: 3.50 }
}
},
{
key: "treatment_a",
exposures: 250000,
metrics: {
"signup_conversion": { conversions: 15000, rate: 0.060 },
"revenue_30d": { sum: 1050000.00, mean_per_user: 4.20 }
}
}
]
}
This endpoint is read-only and expensive; cache its response for 60 seconds keyed on experiment ID to prevent analysts from triggering repeated full-table scans when refreshing the results page.
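A minimal sketch of that per-experiment TTL cache, with the clock injected so expiry is testable (names are illustrative):

```python
import time

class TTLCache:
    """Tiny per-experiment results cache matching the 60-second TTL above."""

    def __init__(self, ttl_seconds=60, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock            # injectable for testing
        self.entries = {}             # experiment_id -> (expires_at, payload)

    def get_or_compute(self, experiment_id, compute):
        now = self.clock()
        hit = self.entries.get(experiment_id)
        if hit and hit[0] > now:
            return hit[1]             # fresh: skip the expensive scan
        payload = compute(experiment_id)   # e.g. the analytics aggregation query
        self.entries[experiment_id] = (now + self.ttl, payload)
        return payload
```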
High-Level Design
1. Experiment owners define variants and targeting rules
The write path for experiment configuration. Admins create and launch experiments through a management API that writes to a relational database.
Components:
- Admin Client: The product team's web UI or CI/CD tooling sending experiment definitions.
- Experiment API: Validates variant allocation percentages sum to 100%, persists the experiment and variant records, and invalidates the config cache on any change.
- Experiment DB: The source of truth for all experiment definitions. Relational storage suits this well: experiments and variants are small structured records with clear relationships.
Request walkthrough:
- Admin sends POST /experiments with variant definitions and targeting rules.
- API validates that allocation percentages sum to exactly 100%.
- API writes one row to the experiments table and one row per variant to the variants table.
- API publishes a config-invalidation event so the cache tier reflects the new experiment.
- API returns { experiment_id, status: "draft" }.
- Admin sends PATCH /experiments/{id} with status: "active" to launch.

This is the write path only. The read path that distributes these configs to SDKs comes next.
2. Users are assigned to variants consistently
Every user request needs to know which variant to show. The naive approach is to call the Experiment API on every request. I would never recommend that in practice: at 50K assignment lookups per second, even a small increase in API latency directly degrades every page in the product. This is the section where the interview is won or lost. If your assignment path touches a database on every request, the interviewer knows the design won't hold.
The key insight is that assignment does not require a network call. If the experiment config is available locally (which variant holds which bucket range), assignment is a deterministic hash computation: hash(user_id + experiment_id) modulo 100, compared against the variant allocation ranges.
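A minimal sketch of that computation. It uses stdlib hashlib so the example is self-contained; as the cheat sheet notes, production systems typically prefer a fast non-cryptographic hash such as MurmurHash3:

```python
import hashlib

def bucket(user_id: str, experiment_id: str) -> int:
    """Deterministically map a (user, experiment) pair to a bucket in 0-99."""
    digest = hashlib.sha256(f"{user_id}:{experiment_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def assign(user_id: str, experiment_id: str, variants) -> str:
    """variants: list of (variant_key, allocation_pct) summing to 100."""
    b = bucket(user_id, experiment_id)
    cumulative = 0
    for key, allocation in variants:
        cumulative += allocation
        if b < cumulative:          # bucket falls in this variant's range
            return key
    return "control"                # defensive fallback; allocations cover 0-99

variants = [("control", 50), ("treatment_a", 50)]
print(assign("u123", "signup_button_color", variants))
```

Because the function is pure, the same user gets the same variant on every call, on every device, with no storage involved.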
Components:
- Client SDK: An in-process library (exists in every service that needs assignments). Holds a local copy of the active experiment configs, refreshed periodically from the Config endpoint.
- Assignment Logic: Pure in-memory computation inside the SDK. No network call needed once configs are loaded.
- Config Cache (Redis): Serves as the intermediary between the Experiment DB and SDKs. The Experiment API invalidates this cache on every experiment change.
Request walkthrough:
- SDK calls GET /assignments?user_id=u123 on startup and after each config refresh.
- Assignment Service fetches active experiment configs from Redis (sub-millisecond).
- For each active experiment, the service computes hash(user_id + experiment_id) % 100 and maps the result to a variant bucket.
- The service returns all assignments for the user in a single response object.
- SDK caches the assignments in process memory for the remainder of the session.
- All SDK calls (getVariant("signup_button_color")) are now pure memory lookups.
Once the SDK has this map, every subsequent call to getVariant() is a hash-map lookup in the process heap. That is how sub-millisecond assignment is achievable even without edge caching.
3. Client SDKs retrieve assignments in under 10ms
The assignment endpoint above works for server-side SDKs. Client-side SDKs (browser and mobile) face an additional constraint: the first page load cannot afford a round trip to the Assignment Service before rendering. Users would see a flash of the control version before the treatment loads.
The fix is to push experiment configs to the edge so the SDK can compute assignments locally without a server round trip. This is the CDN-distributed config pattern.
Components:
- Config Endpoint: Serializes all active experiment definitions into a single JSON snapshot with an ETag fingerprint. The CDN caches this response globally.
- CDN Edge: Serves the config snapshot from the nearest edge node. Cache TTL of 30 seconds, so experiment launches propagate worldwide within 30 seconds.
- Client SDK: Polls the Config Endpoint periodically (every 30 seconds) using conditional If-None-Match requests. Stores configs in localStorage so they survive page reloads. Computes assignments in-process using the same deterministic hash.
Request walkthrough:
- SDK on first load checks localStorage for a cached config and ETag.
- SDK sends GET /config with If-None-Match: {cached_etag} to the CDN.
- CDN responds with 304 Not Modified if configs have not changed (zero bytes downloaded).
- If configs changed, CDN forwards the miss to the Config Endpoint, which queries the DB and returns a fresh JSON snapshot with a new ETag.
- SDK updates localStorage and recomputes in-memory assignments.
- All getVariant() calls return from memory immediately. No network round trip during rendering.
With configs cached in localStorage, the SDK can ride out a CDN outage by continuing to serve the last snapshot it fetched. If no snapshot was ever cached, the fallback is serving the control variant for all experiments, which is always the safe default.
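The SDK's refresh loop can be sketched as follows. The HTTP call is abstracted behind an injected `fetch_config` callable (a hypothetical name); a real browser SDK would use fetch plus localStorage, but the control flow is the same:

```python
class ConfigCache:
    """Sketch of the SDK config refresh loop (illustrative, not a real SDK).

    fetch_config(etag) models the conditional GET /config request: it returns
    (status, new_etag, config), where status is 304 when nothing changed.
    On any error we keep the stale snapshot; with no snapshot at all, the
    SDK falls back to serving control for every experiment.
    """

    def __init__(self, fetch_config):
        self.fetch_config = fetch_config
        self.etag = None
        self.config = None   # None means "no config yet": serve control

    def refresh(self):
        try:
            status, etag, config = self.fetch_config(self.etag)
        except Exception:
            return            # network/CDN failure: keep serving cached config
        if status == 304:
            return            # unchanged; nothing downloaded
        self.etag, self.config = etag, config

    def variant(self, experiment_key, compute):
        if self.config is None or experiment_key not in self.config:
            return "control"  # safe default per the availability requirement
        return compute(self.config[experiment_key])
```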
4. Success metrics are collected and computed per variant
Measuring the impact of a treatment requires two data streams: exposure events (which users saw which variant) and goal events (which users performed the target action). The platform joins these two streams to compute conversion rates per variant.
Components:
- Event Ingestion API: Accepts batched events from SDKs. Validates schema and publishes to Kafka without blocking on downstream aggregation.
- Kafka Event Topic: A durable, partitioned log of all user events. Partitioned by user_id so all events for a given user land on the same partition, enabling efficient temporal joins.
- Metrics Worker: Consumes both the exposure stream and the goal event stream. Performs a temporal join: for each goal event, find the most recent exposure for that user in the same experiment, and credit the conversion to that variant.
- Analytics DB: Stores the raw event log and pre-aggregated metrics tables per variant. ClickHouse or Druid works well here: they are columnar stores optimized for GROUP BY queries over time-series event data.
Request walkthrough:
- SDK batches exposure events (user u123 saw treatment_a of experiment signup_button_color) and goal events (user u123 clicked the button) and sends them to the Event Ingestion API.
- Ingestion API validates the event schema and publishes each event to the appropriate Kafka topic (exposures or events).
- Metrics Worker consumes from both topics. For each goal event, it looks up the user's active assignment in the exposures topic within the analysis window.
- Worker writes (experiment_id, variant_key, +1 exposure, +1 conversion) increments to the Analytics DB.
- Analytics DB materializes per-variant aggregate views that the Results API reads.
The event pipeline is intentionally async. The ingestion API does not wait for the Metrics Worker to finish before returning. This keeps the write latency on the hot path (SDK to Ingestion API) under 50ms even when the Metrics Worker falls behind. I'd make this design decision explicit in an interview: decouple ingestion from aggregation. The moment you couple them, a slow ClickHouse query blocks event collection and you lose data.
Potential Deep Dives
1. How do we ensure user assignment is consistent across sessions?
The constraint is strict: a user who sees treatment_a on Monday must see treatment_a on Tuesday and on mobile, even if the experiment config was briefly unavailable. Assignment must never flip mid-experiment.
There are three levels of consistency to consider:
- The same user must see the same variant on every request.
- The variant must be stable even if the SDK restarts, the user clears cookies, or they switch devices.
- The assignment must hold even if the Experiment API or cache is temporarily unavailable.
2. How do we serve experiment configs in under 10ms?
Every SDK assignment call depends on an up-to-date view of the active experiment configs (which experiments are live, what the variant allocations are, what the targeting rules are). Fetching this from the Experiment DB on every call is not viable. There are three progressively better approaches.
3. How do we measure success metrics per treatment?
Measuring the impact of a treatment requires joining two independent event streams: who was exposed to which variant, and who performed the target action (signup, purchase, etc.). The join is the hard part.
I'd flag the temporal join early because it's the subtle correctness requirement most candidates miss. The constraints from the NFRs:
- 50M DAU generating exposure events and goal events.
- Up to 5-minute lag in metric aggregates is acceptable.
- The join must be temporally aware: a purchase made before the user was exposed to the experiment should not count as a treatment-driven conversion.
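The join rule above can be sketched as a pure function over the two streams (illustrative record shapes, not the real worker, which would consume from Kafka incrementally):

```python
def credit_conversions(exposures, goals):
    """Temporal join sketch: credit a goal event to the variant the user was
    exposed to, and only if the exposure happened at or before the goal.

    exposures: list of (user_id, experiment_id, variant_key, ts)
    goals:     list of (user_id, ts)
    Returns {(experiment_id, variant_key): conversion_count}.
    """
    # Index exposures per user, sorted by time, so we can find the most
    # recent exposure preceding each goal event.
    by_user = {}
    for user, exp, variant, ts in exposures:
        by_user.setdefault(user, []).append((ts, exp, variant))
    for entries in by_user.values():
        entries.sort()

    credits = {}
    for user, goal_ts in goals:
        latest = {}  # experiment_id -> variant of most recent prior exposure
        for ts, exp, variant in by_user.get(user, []):
            if ts <= goal_ts:
                latest[exp] = variant   # later entries overwrite earlier ones
        for exp, variant in latest.items():
            key = (exp, variant)
            credits[key] = credits.get(key, 0) + 1
    return credits
```

Note that a pre-exposure conversion simply finds no qualifying exposure and is dropped, which is exactly the correctness requirement.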
4. How do we prevent experiment interactions when running hundreds of experiments simultaneously?
At scale (500 concurrent experiments), two experiments will inevitably target overlapping user populations. If Experiment A tests a red vs. blue button and Experiment B tests a one-click vs. two-click checkout, both experiments target all users. Any user in Experiment A's treatment is also in Experiment B, and over-representation of one variant across population groups biases both experiments' results.
In my experience, this is the deep dive that separates senior candidates from everyone else. Most people stop at "hash the user ID" and never think about what happens when 500 experiments share the same hash space.
Constraints to design against:
- A user can be enrolled in multiple experiments simultaneously.
- Assignments across experiments must be statistically independent (correlation in enrollment breaks significance tests).
- The mechanism must scale to hundreds of experiments without requiring experiment owners to manually declare exclusions.
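One way to realize that independence is to salt the bucket hash with a per-layer key, so experiments in different layers shuffle the same users independently. A sketch, with illustrative layer names (stdlib hash stands in for MurmurHash3):

```python
import hashlib

def layer_bucket(user_id: str, layer_salt: str) -> int:
    """Bucket a user within one layer; different salts give independent shuffles."""
    digest = hashlib.sha256(f"{layer_salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

# Two experiments in DIFFERENT layers: their buckets are uncorrelated, so a
# user in experiment A's treatment is equally likely to land in any of B's
# variants -- no systematic overlap bias between the two populations.
users = [f"u{i}" for i in range(10_000)]
in_a_treatment = [u for u in users if layer_bucket(u, "layer_ui") >= 50]
also_in_b = sum(1 for u in in_a_treatment
                if layer_bucket(u, "layer_checkout") >= 50)
print(f"{also_in_b / len(in_a_treatment):.2f}")  # close to 0.50
```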
5. How do we safely toggle experiments without user-visible flicker?
When an experiment is launched or modified, client-side SDKs that have an older config cached will show a different experience from SDKs that have already received the update. For visual experiments (new button color, new layout), this means some users see the control flash briefly before switching to the treatment.
Final Architecture
The read path for assignment is the core architectural insight: the CDN distributes config snapshots globally so SDK assignment computation requires zero network I/O during active sessions. The write path (Admin → Experiment API → DB) touches the database once per experiment launch, not once per user. The metrics pipeline is entirely async, keeping event ingestion latency under 50ms regardless of how long aggregation takes.
Interview Cheat Sheet
- Start by locking down 4 core features: experiment definition, consistent user assignment, fast config retrieval (under 10ms), and per-variant metric collection. Everything else (statistics, multi-armed bandit, real-time dashboards) is below the line.
- State the read/write ratio early: 40 million assignment lookups for every 1 experiment write. This ratio is the reason the read path must never touch the primary database.
- Deterministic hash assignment (hash(user_id + experiment_id) % 100) eliminates per-request storage entirely. Same inputs always produce the same bucket, so consistency across sessions and devices is free.
- Assignment computation is a hash and a bucket comparison. Once configs are in memory, each getVariant() call runs in under 1 microsecond. No network, no cache lookup.
- Serve experiment configs from a CDN with ETag-based conditional requests. SDK instances receive 304 Not Modified on 99% of polls when experiments are stable. Config size is typically under 150KB before compression.
- Session-locked configs (via sessionStorage) prevent user-visible flicker. Users see one consistent variant for the duration of a session, regardless of when experiments are launched or paused.
- Use MurmurHash3 (not MD5 or SHA-1) for bucket assignment. It produces a uniform distribution across the 0-99 range, which is required for allocation percentages to be accurate at all traffic levels.
- Layer-based orthogonal assignment eliminates inter-experiment correlation without any explicit exclusion rules. Experiments in different layers use independent hash salts, making assignments statistically independent.
- The temporal join in the Metrics Worker is the correctness-critical step: a conversion only credits the variant the user was exposed to, and only if the conversion happened after the exposure. Pre-exposure conversions are ignored.
- Use ClickHouse (or Druid) for the Analytics DB. Columnar storage and MergeTree partitioning let the Results API scan only the relevant experiment's rows, returning results in seconds even with billions of events.
- At-least-once Kafka delivery requires idempotent metric increments. Use a unique event ID as a deduplication key in the upsert, not a blind INCREMENT.
- The Assignment Service is stateless. It reads from Redis (or falls back to the DB) and computes a hash. Horizontal scaling requires no coordination between pods.
- Gradual experiment rollout is a one-line extension: if hash(user_id + exp_id) % 100 >= rollout_pct: return null. Increasing rollout from 5% to 50% does not re-assign users already in the experiment window.
- For an urgent experiment kill switch, use a CDN cache-bust URL and a short TTL (5 seconds) on the emergency config. This propagates worldwide in under 10 seconds regardless of session locking.
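The rollout gate composes with variant selection in a few lines. One hedged refinement over the one-liner above: this sketch salts the rollout hash separately from the variant hash, so the enrolled sub-population at a small rollout still spans all variant buckets rather than being all-control (names are illustrative; stdlib hash stands in for MurmurHash3):

```python
import hashlib

def _bucket(user_id: str, experiment_id: str, salt: str = "") -> int:
    digest = hashlib.sha256(f"{salt}:{user_id}:{experiment_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def get_variant(user_id, experiment, rollout_pct=100):
    """Gate enrollment before variant selection (illustrative sketch).

    Users hashed at or above rollout_pct are not enrolled at all: they see
    control and emit no exposure event. Raising rollout_pct only ADDS users;
    anyone below the old threshold is still below the new one, so nobody flips.
    """
    if _bucket(user_id, experiment["key"], salt="rollout") >= rollout_pct:
        return None  # not enrolled; caller falls back to control
    b = _bucket(user_id, experiment["key"])
    cumulative = 0
    for key, allocation in experiment["variants"]:
        cumulative += allocation
        if b < cumulative:
            return key
    return None
```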