Analytics Pipeline
Design an analytics platform like Google Analytics that collects billions of user events per day, processes them through a streaming and batch pipeline, and serves query results on dashboards in seconds.
What is a user analytics pipeline?
A user analytics pipeline collects every click, page view, and custom event from your product, stores billions of those events durably, and lets you query them seconds later on a dashboard. The deceptively hard part is not collecting events; it is reconciling two contradictory requirements that live in the same system.
Real-time dashboards demand low-latency streaming. Historical funnels and cohort analysis demand high-throughput batch scans over months of data. Building both without building two separate systems that drift apart is the core engineering challenge here. I've seen teams burn months maintaining a Lambda architecture where the batch and streaming layers compute slightly different numbers, and nobody trusts either. This question tests pipeline architecture, columnar storage selection, the Lambda vs Kappa architecture debate, and write-heavy system design.
Functional Requirements
Core Requirements
- Collect page views, clicks, and custom events from web and mobile clients.
- Store events durably and allow querying by time range, user segment, and event type.
- Serve near-real-time and historical dashboards with aggregated metrics.
- Support funnel analysis: track the sequence of steps users take toward a conversion goal.
Below the Line (out of scope)
- ML-based anomaly detection
- GDPR-compliant data deletion workflows
- A/B test assignment and exposure logging
- Real-time personalization scoring
The hardest part in scope: Balancing query freshness against query cost. Scanning raw events for every dashboard request at 1B events/day is prohibitively slow. Pre-aggregating too aggressively locks you into fixed query shapes. The architecture must support both approximate fresh aggregates and exact historical scans without duplicating the full pipeline.
ML-based anomaly detection is below the line because it is a consumer of the pipeline, not the pipeline itself. To add it, attach a Flink job to the raw Kafka topic that scores each event against a trained model and emits anomaly signals to a separate alert topic. The ingestion and storage layers are unchanged.
GDPR deletion is below the line because it requires finding and purging a specific user's events across all partitions of a columnar store (ClickHouse, BigQuery) where rows are immutable by design. To add it, maintain a deletion log in a mutable store (PostgreSQL). The query layer joins against the deletion log at read time to filter suppressed user IDs. Periodic compaction jobs physically remove the rows during off-peak hours.
A/B test assignment is below the line because it requires a consistent assignment service (low-latency, avoid re-assigning users mid-experiment) and the logging of assignment events, which is a write path concern separate from the analytics query path.
Non-Functional Requirements
Core Requirements
- Durability: 99.9% of events must be delivered and stored. Losing a handful of debug-level events during a broker restart is tolerable; losing conversion events is not.
- Write throughput: 1 billion events per day, roughly 11,500 events/second sustained. Peak traffic (product launches, sales events) can reach 3x that: approximately 35,000 events/second.
- Query latency: Dashboard aggregate queries must return in under 5 seconds at p95. Funnel computation for 90-day cohorts must complete in under 30 seconds.
- Freshness: Real-time dashboard panels must reflect events no older than 60 seconds. Historical reports are batch-computed and may be up to 1 hour stale.
- Retention: 90 days of hot storage (fully indexed, fast query). 3 years of cold archival (object storage, queryable via batch scan).
- Scale: At 1 KB average event size, 1B events/day is roughly 1 TB of raw ingestion per day. After 90 days that is 90 TB of hot storage.
Below the Line
- Sub-second dashboard latency (requires fully pre-materialized views for every possible query shape)
- Per-customer data isolation with separate encryption keys (multi-tenant SaaS concern)
- Event schema enforcement and breaking-change detection
Read/write ratio: Writes completely dominate this system. 1B events per day hit the ingestion layer continuously. Dashboard queries are sporadic: a few thousand users check their dashboards per hour, each query scanning billions of rows. At roughly 1B writes per day against tens of thousands of interactive reads per day, the ratio is on the order of 10,000:1. Every architectural decision flows from this asymmetry: the write path must be cheap and embarrassingly parallel; the read path must be powered by pre-computation rather than raw scans.
I treat the 5-second query SLA as the forcing function for every storage decision. Anything that cannot return aggregates in 5 seconds at 90-day data volumes is architecturally wrong for this system.
Core Entities
- Event: A single user action with event type, anonymous or identified user ID, session ID, timestamp, device and geo metadata, and a free-form properties map.
- User: An identity record linking anonymous IDs (browser fingerprint, cookie) to an identified user ID after login. Many anonymous IDs may map to one user.
- Session: A bounded window of contiguous user activity (max 30 minutes of inactivity). Groups events for funnel computation.
- Funnel: A named, ordered sequence of event types representing a conversion path (e.g., View Product -> Add to Cart -> Purchase).
- AggregateResult: A pre-computed rollup: metric name, dimension values (event type, country, device), time bucket, and count or sum. The unit that powers dashboard panels.
- Dashboard: A saved collection of query definitions rendered as charts. Backed by the query service.
Schema details and partitioning strategies are deferred to the deep dives. These six entities are sufficient to drive the API and High-Level Design.
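The Session entity's 30-minute inactivity rule is easy to state but worth seeing as code. A minimal sketch of the grouping pass (names and the tuple shape are illustrative, not a prescribed schema):

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # inactivity bound from the Session definition

def sessionize(events):
    """Group one user's time-ordered events into sessions.

    `events` is a list of (timestamp, event_type) tuples sorted by timestamp.
    A new session starts whenever the gap since the previous event
    exceeds 30 minutes.
    """
    sessions, current = [], []
    last_ts = None
    for ts, event_type in events:
        if last_ts is not None and ts - last_ts > SESSION_GAP:
            sessions.append(current)
            current = []
        current.append((ts, event_type))
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions
```

In practice this logic runs per user in the stream processor or as a batch job; the sketch only pins down the boundary rule that funnel computation depends on.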
API Design
Start with one endpoint per functional requirement. Evolve where the naive shape breaks at scale.
FR 1 - Ingest events:
Naive shape:
POST /events
Body: { event_type, user_id, session_id, timestamp, properties }
Response: { event_id }
This breaks at 11,500 events/second: each client sending individual HTTP requests generates enormous per-request overhead. Clients on mobile networks with 100ms+ round-trip times can barely sustain 10 requests/second per connection. The evolved shape batches events:
POST /events/batch
Body: { events: [ { event_type, user_id, session_id, timestamp, properties }, ... ] }
Response: { accepted: 247, failed: 0, batch_id }
Batch size caps at 500 events or 512KB, whichever is smaller. The server acknowledges receipt of the batch and writes to Kafka asynchronously. The client retries the entire batch on failure, so each event must carry a client-generated event_id UUID for deduplication downstream.
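The client-side half of this contract can be sketched as a small buffer that flushes on whichever cap is hit first. This is a hedged illustration, not a real SDK: `send_batch` is an injected stand-in for the POST to /events/batch, and the byte accounting is approximate.

```python
import time
import uuid

class EventBuffer:
    """Client-side buffer: flush at 500 events, 512 KB, or 10 s, whichever first.

    `send_batch` is injected so the sketch stays transport-agnostic; a real
    SDK would POST to /events/batch and retry the whole batch on failure.
    """
    MAX_EVENTS = 500
    MAX_BYTES = 512 * 1024
    MAX_AGE_S = 10.0

    def __init__(self, send_batch):
        self.send_batch = send_batch
        self.events = []
        self.bytes = 0
        self.oldest = None

    def track(self, event_type, user_id, session_id, properties=None):
        event = {
            # client-generated UUID survives retries, enabling dedup downstream
            "event_id": str(uuid.uuid4()),
            "event_type": event_type,
            "user_id": user_id,
            "session_id": session_id,
            "timestamp": time.time(),
            "properties": properties or {},
        }
        self.events.append(event)
        self.bytes += len(str(event))  # rough size estimate for the 512 KB cap
        self.oldest = self.oldest or time.time()
        if (len(self.events) >= self.MAX_EVENTS
                or self.bytes >= self.MAX_BYTES
                or time.time() - self.oldest >= self.MAX_AGE_S):
            self.flush()

    def flush(self):
        if self.events:
            self.send_batch({"events": self.events})
            self.events, self.bytes, self.oldest = [], 0, None
```

The important property is that `event_id` is assigned at capture time, before any network attempt, so a retried batch carries identical IDs and the server can deduplicate.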
FR 2 - Query aggregated metrics:
GET /query
Query params: metric, event_type, start_time, end_time, granularity (minute|hour|day), group_by
Response: {
data: [ { timestamp, value, dimensions } ],
next_cursor: "...",
freshness_seconds: 42
}
The freshness_seconds field tells the dashboard UI whether it is rendering the streaming aggregate (slightly stale) or a fully materialized batch result. This lets the UI render a staleness badge without a second round-trip.
FR 3 - Funnel analysis:
POST /funnels
Body: { steps: ["view_product", "add_to_cart", "purchase"], window_hours: 24, start_date, end_date }
Response: { funnel_id }
GET /funnels/{funnel_id}/results
Response: {
steps: [ { name, users_entered, users_completed, conversion_rate } ],
computed_at: "2026-03-29T14:00:00Z"
}
Funnel computation is expensive (correlated scan across user sessions). It is submitted asynchronously and polled. The funnel_id pattern avoids a synchronous 30-second HTTP hold.
High-Level Design
1. Ingesting events from clients
The write path must handle 11,500 events/second sustainably and absorb 3x traffic spikes without dropping events.
Naive approach: Client sends individual events synchronously to an ingestion server that writes directly to a PostgreSQL events table.
This fails immediately under real load. At 11,500 writes/second, a single PostgreSQL instance saturates. Network round-trips from mobile clients add latency that makes synchronous per-event writes impractical. A single spike trips a circuit breaker on the database and events are lost.
The key insight is to decouple ingestion acknowledgment from storage write. I always mention this decoupling principle early in the interview because it applies to every write-heavy system, not just analytics. The ingestion service acknowledges the client as soon as the batch lands in Kafka. The database write is asynchronous and retryable.
Evolved components:
- Client SDK: Buffers events in memory up to 500 events or 10 seconds and sends batches via POST /events/batch.
- Ingestion Service: Stateless HTTP servers. Validate event schema, assign a server-side receive timestamp, publish the batch to Kafka. Scales horizontally.
- Kafka: Write-ahead log. Partitioned by user_id to preserve per-user ordering. Retains events for 7 days as a replay buffer.
- Event Store (ClickHouse): Columnar OLAP database. Kafka consumers write events in bulk using asynchronous insert batches.
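The consumer side of the write path can be sketched as a micro-batching loop. The `poll` and `bulk_insert` callables are stand-ins for the real Kafka client and ClickHouse driver (assumptions for illustration); the point is the buffering discipline, since ClickHouse favors large, infrequent inserts over row-at-a-time writes.

```python
def consume_loop(poll, bulk_insert, max_batch=10_000, max_wait_polls=5):
    """Accumulate events from Kafka-like `poll()` calls and flush in bulk.

    Buffers up to `max_batch` events, or flushes early after
    `max_wait_polls` consecutive empty polls so a quiet topic still
    drains. A real consumer would commit offsets only after a successful
    insert, preserving at-least-once delivery.
    """
    buffer, idle = [], 0
    while True:
        records = poll()
        if records is None:          # sentinel: shutting down, flush remainder
            if buffer:
                bulk_insert(buffer)
            return
        if records:
            buffer.extend(records)
            idle = 0
        else:
            idle += 1
        if len(buffer) >= max_batch or (buffer and idle >= max_wait_polls):
            bulk_insert(buffer)
            buffer, idle = [], 0
```

At-least-once delivery means duplicates can reach this loop; the dedup check described later sits between `poll` and `bulk_insert`.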
This is the write path only. The read path and query layer come next.
2. Querying stored events
With 90TB of data in ClickHouse after 90 days, a scan of raw events for every dashboard request is expensive. Even ClickHouse scans billions of rows in seconds only when the query hits the right partition.
Components added:
- Query Service: Accepts metric queries, checks the query cache first, routes to ClickHouse materialized views for recent data or raw event partitions for long-range historical queries.
- Query Cache (Redis): Caches the results of identical or near-identical queries. Cache key includes the query parameters and the time bucket. TTL of 60 seconds for real-time panels, 1 hour for historical.
Routing the query to the right table is the query planner's job. Recent data in materialized views returns in under 500ms. Full raw scans for multi-month cohorts take 5-20 seconds on properly sized ClickHouse hardware. I'd tell the interviewer that this tiered routing is the single most important read-path optimization: it converts a scan across 90 TB of raw data into a lookup against a few gigabytes of pre-aggregated rollups for 95% of dashboard queries.
The two-tier routing shown above (materialized views for data under roughly 7 days old, raw partition scans beyond that) gains a third tier in the next section: events less than 60 seconds old are served from a dedicated streaming summary table written by Flink. The final routing model has three tiers: sub-60-second real-time, hourly materialized view, and raw partitioned scan.
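The three-tier routing decision reduces to a comparison on the query's start time. A minimal sketch, assuming the 60-second and 7-day boundaries described above; the table names are illustrative:

```python
from datetime import datetime, timedelta, timezone

def route_query(start_time, end_time, now=None):
    """Pick the cheapest table that can answer the requested time range.

    Tiers follow the routing model in the text: a streaming summary for
    the freshest data, hourly materialized views for recent history, and
    raw partitioned scans for long ranges.
    """
    now = now or datetime.now(timezone.utc)
    if now - start_time <= timedelta(seconds=60):
        return "realtime_summary"          # 1-minute streaming aggregates
    if now - start_time <= timedelta(days=7):
        return "hourly_materialized_view"  # exact hourly rollups
    return "events_raw"                    # full columnar scan, partitioned by date
```

A production planner also splits a single query across tiers (e.g. 30 days of rollups plus the trailing minute of streaming data) and stitches the results; the sketch shows only the boundary logic.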
3. Near-real-time dashboards
The Query Cache keeps repeated identical queries fast, but the first request for a real-time panel still hits ClickHouse. For dashboards that must reflect events within 60 seconds, we need a streaming aggregation layer that writes pre-computed results ahead of any query.
Components added:
- Stream Processor (Flink): Consumes the Kafka events_raw topic. Computes rolling 1-minute aggregate windows per event type, user segment, and geography. Writes results to the OLAP summary table in ClickHouse.
- Batch Processor: Runs hourly and daily rollup jobs that recompute exact aggregates from raw events. Overwrites the streaming approximations with exact values. This is the "lambda correction" step.
Real-time panels read from the ClickHouse Summaries table. The freshness of any panel is the age of the newest aggregate written by Flink, which is at most 60 seconds. Historical panels read from either the summaries (if granularity is hourly or coarser) or from the raw partitioned table for fine-grained analysis.
For your interview: state clearly that Flink is only needed for the sub-60-second freshness requirement. If the interviewer relaxes that to 5-minute freshness, you can drop Flink entirely and rely on ClickHouse materialized views alone. Knowing when to simplify the architecture is more impressive than always adding components.
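What the streaming job computes can be illustrated with a plain tumbling-window count. This is a sketch of the windowing logic only, not Flink's API; field names are illustrative:

```python
from collections import defaultdict

WINDOW_S = 60  # 1-minute tumbling windows, matching the freshness budget

def tumbling_counts(events):
    """Count events per (window_start, event_type, country).

    `events` is an iterable of dicts with an epoch-second `timestamp`,
    `event_type`, and `country`. Each event lands in exactly one window:
    window_start = timestamp - (timestamp % WINDOW_S). A real Flink job
    adds watermarks to decide when a window can close despite late,
    out-of-order events.
    """
    counts = defaultdict(int)
    for e in events:
        window_start = e["timestamp"] - (e["timestamp"] % WINDOW_S)
        counts[(window_start, e["event_type"], e["country"])] += 1
    return dict(counts)
```

Each (window, dimensions) row is upserted into the ClickHouse summary table, which is exactly what the hourly batch job later overwrites with exact values.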
4. Funnel analysis
Funnel analysis requires more than counting events. It requires correlating events across a user's session: did this specific user complete step A before step B within a 24-hour window? That question does not fit the columnar aggregate model. It requires grouping events by user identity and time window.
Components added:
- Funnel Service: Accepts funnel definitions. Runs a ClickHouse query that uses the windowFunnel function to evaluate ordered event sequences per user over the requested date range.
- Session Store (Redis): Caches active user sessions for real-time funnel monitoring. Stores the ordered event list per session_id with a 30-minute TTL.
The windowFunnel function in ClickHouse evaluates the ordered event sequence natively in the storage engine. A 90-day funnel over 100M users returns in under 30 seconds on a 6-shard ClickHouse cluster because the query is parallelized across shards partitioned by user_id. I always mention this built-in function by name because it shows you know the tool, not just the theory.
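To make the semantics concrete, here is a Python sketch mirroring what windowFunnel computes per user: the deepest funnel step reached by an in-order chain whose matched events all fall within the window measured from the first matched step. This is an illustration of the semantics, not how you would run it in production (ClickHouse does this inside the storage engine):

```python
def funnel_level(events, steps, window_s):
    """Deepest funnel step a user reached within the time window.

    `events` is a time-sorted list of (timestamp, event_type) tuples;
    `steps` is the ordered funnel, e.g. ["view_product", "add_to_cart",
    "purchase"]. Returns 0..len(steps).
    """
    best = 0
    for i, (start_ts, et) in enumerate(events):
        if et != steps[0]:
            continue                      # chains can only start at step 1
        level, next_step = 1, 1
        for ts, et2 in events[i + 1:]:
            if ts - start_ts > window_s:
                break                     # window measured from the chain's start
            if next_step < len(steps) and et2 == steps[next_step]:
                level += 1
                next_step += 1
        best = max(best, level)
        if best == len(steps):
            break
    return best
```

The funnel result is then just a histogram of levels across users: users_entered at step k is the count of users with level >= k.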
Potential Deep Dives
1. How do we ingest 1 billion events per day without dropping data?
Client volume at 11,500 events/second means the write path needs to handle bursty traffic, flaky mobile connections, and client retries that produce duplicates.
2. How do we balance real-time dashboards with historical queries?
This is the core architectural tension: streaming is fresh but approximate; batch is exact but stale. Two architectural patterns attack this differently.
3. What database do you use for the event store and query layer?
The choice of database determines the entire query performance envelope. This is one of the highest-leverage decisions in the design.
4. How do you handle event deduplication at scale?
Clients retry batches on network failure. Mobile apps resume events after going offline. This produces duplicate events that corrupt aggregate counts and funnel metrics.
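The standard answer pairs the client-generated event_id with a 24-hour first-seen window in Redis (SET with NX and a TTL). A minimal in-memory stand-in for that semantics, assuming the Redis-backed version in production:

```python
import time

class DedupWindow:
    """24-hour first-seen filter keyed by client-generated event_id.

    In production this would be a Redis SET <event_id> NX EX 86400 per
    event; this in-memory stand-in shows the semantics. `is_new` returns
    True exactly once per event_id within the TTL window.
    """
    def __init__(self, ttl_s=86_400, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self.seen = {}  # event_id -> expiry time

    def is_new(self, event_id):
        now = self.clock()
        expiry = self.seen.get(event_id)
        if expiry is not None and expiry > now:
            return False        # duplicate: drop before the ClickHouse write
        self.seen[event_id] = now + self.ttl_s
        return True
```

The check runs in the Kafka consumer, in front of the bulk insert, so duplicates from client retries never reach the aggregate tables.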
Final Architecture
Kafka as the write-ahead buffer combined with ClickHouse columnar storage is what makes 1 billion events per day tractable. Kafka absorbs ingestion spikes without back-pressure to clients. ClickHouse materialized views pre-compute aggregates at insert time, eliminating the need for a separate stream-processing job for standard metrics. The Query Service's three-tier routing (real-time summary, hourly materialized view, raw scan) ensures that 95% of dashboard queries hit pre-aggregated data and return in under 1 second.
The architecture is intentionally columnar-first. I'd frame the entire interview answer around the write/read asymmetry because every component choice traces back to it.
Interview Cheat Sheet
- State the write/read asymmetry up front: 1B events/day ingestion vs. a few thousand dashboard queries per hour. Every architecture decision follows from this ratio.
- Never have clients write events directly to the database. Buffer on the client (500 events, 10s max), acknowledge at Kafka, write to ClickHouse asynchronously. The database is never on the critical path for event capture.
- Kafka is the write-ahead buffer, not the event store. 7-day retention covers replay during ClickHouse downtime. ClickHouse is the actual store.
- ClickHouse reads only the columns touched by a query. A COUNT by event type over 1 billion 1 KB rows scans just the event_type and timestamp columns, a few gigabytes, instead of the roughly 1 TB a row store would read. That two-orders-of-magnitude reduction is why columnar storage is mandatory for analytics at this scale.
- Pre-aggregated materialized views in ClickHouse eliminate the need for a separate Flink or Spark aggregation job for standard metrics. The view is populated automatically at INSERT time.
- For real-time panels (under 60 seconds freshness), run a Flink job writing 1-minute tumbling window aggregates to a dedicated summary table. For dashboards where hourly granularity is acceptable, the materialized view alone is sufficient.
- Deduplication: assign a UUID event_id at the moment of capture on the client. This survives retries. Check the UUID against a 24-hour Redis dedup window in the Kafka consumer before writing to ClickHouse.
- Funnel analysis requires ordered event correlation per user. Use ClickHouse's built-in windowFunnel() function, which evaluates ordered event sequences natively in the storage engine.
- Funnel queries are expensive. Submit them asynchronously, return a funnel_id, and poll for results. Do not hold a synchronous HTTP connection open for 30 seconds.
- For query routing: real-time panels read from the streaming summary table; daily/weekly dashboards read from the hourly materialized view; long-range cohort scans read from raw event partitions. The Query Service abstracts this routing behind a single API.
- Cold archival is a nightly Parquet export to S3. Three years at 1 TB/day raw is roughly 1.1 PB; at the 5-10x compression Parquet typically achieves on event data, that is on the order of 100-200 TB in S3. Use Athena for ad-hoc historical analysis; do not keep 3 years in the ClickHouse hot tier.
- The difference between Lambda architecture (two separate pipelines, code duplication) and materialized views (one pipeline, aggregation at insert time) is a common follow-up. Materialized views win for standard metrics; Lambda is only justified when the batch layer uses dramatically different logic from the streaming layer.
- Common follow-up questions to prepare for: How do you handle late-arriving events? (Watermarking in stream processing, re-aggregation windows.) How do you route traffic from millions of websites to your pipeline without a central bottleneck?