Analytics Pipeline
Design an analytics platform like Google Analytics that collects billions of user events per day, processes them through a streaming and batch pipeline, and serves query results on dashboards in seconds.
What is a user analytics pipeline?
A user analytics pipeline collects every click, page view, and custom event from your product, stores billions of those events durably, and lets you query them seconds later on a dashboard. The deceptively hard part is not collecting events; it is reconciling two contradictory requirements that live in the same system.
Real-time dashboards demand low-latency streaming. Historical funnels and cohort analysis demand high-throughput batch scans over months of data. Building both without building two separate systems that drift apart is the core engineering challenge here. I've seen teams burn months maintaining a Lambda architecture where the batch and streaming layers compute slightly different numbers, and nobody trusts either. This question tests pipeline architecture, columnar storage selection, the Lambda vs Kappa architecture debate, and write-heavy system design.
Functional Requirements
Core Requirements
- Collect page views, clicks, and custom events from web and mobile clients.
- Store events durably and allow querying by time range, user segment, and event type.
- Serve near-real-time and historical dashboards with aggregated metrics.
- Support funnel analysis: track the sequence of steps users take toward a conversion goal.
Below the Line (out of scope)
- ML-based anomaly detection
- GDPR-compliant data deletion workflows
- A/B test assignment and exposure logging
- Real-time personalization scoring
The hardest part in scope: Balancing query freshness against query cost. Scanning raw events for every dashboard request at 1B events/day is prohibitively slow. Pre-aggregating too aggressively locks you into fixed query shapes. The architecture must support both approximate fresh aggregates and exact historical scans without duplicating the full pipeline.
ML-based anomaly detection is below the line because it is a consumer of the pipeline, not the pipeline itself. To add it, attach a Flink job to the raw Kafka topic that scores each event against a trained model and emits anomaly signals to a separate alert topic. The ingestion and storage layers are unchanged.
GDPR deletion is below the line because it requires finding and purging a specific user's events across all partitions of a columnar store (ClickHouse, BigQuery) where rows are immutable by design. To add it, maintain a deletion log in a mutable store (PostgreSQL). The query layer joins against the deletion log at read time to filter suppressed user IDs. Periodic compaction jobs physically remove the rows during off-peak hours.
A/B test assignment is below the line because it requires a consistent assignment service (low-latency, avoid re-assigning users mid-experiment) and the logging of assignment events, which is a write path concern separate from the analytics query path.
Non-Functional Requirements
Core Requirements
- Durability: 99.9% of events must be delivered and stored. Losing a handful of debug-level events during a broker restart is tolerable; losing conversion events is not.
- Write throughput: 1 billion events per day, roughly 11,500 events/second sustained. Peak traffic (product launches, sales events) can reach 3x that: approximately 35,000 events/second.
- Query latency: Dashboard aggregate queries must return in under 5 seconds at p95. Funnel computation for 90-day cohorts must complete in under 30 seconds.
- Freshness: Real-time dashboard panels must reflect events no older than 60 seconds. Historical reports are batch-computed and may be up to 1 hour stale.
- Retention: 90 days of hot storage (fully indexed, fast query). 3 years of cold archival (object storage, queryable via batch scan).
- Scale: At 1 KB average event size, 1B events/day is roughly 1 TB of raw ingestion per day. After 90 days that is 90 TB of hot storage.
Below the Line
- Sub-second dashboard latency (requires fully pre-materialized views for every possible query shape)
- Per-customer data isolation with separate encryption keys (multi-tenant SaaS concern)
- Event schema enforcement and breaking-change detection
Read/write ratio: Writes completely dominate this system. 1B events per day hit the ingestion layer continuously. Dashboard queries are sporadic: a few thousand users check their dashboards per hour, each query scanning billions of rows. At roughly 1B writes per day against tens of thousands of interactive reads per day, the ratio is on the order of 10,000:1. Every architectural decision flows from this asymmetry: the write path must be cheap and embarrassingly parallel; the read path must be powered by pre-computation rather than raw scans.
I treat the 5-second query SLA as the forcing function for every storage decision. Anything that cannot return aggregates in 5 seconds at 90-day data volumes is architecturally wrong for this system.
Core Entities
- Event: A single user action with event type, anonymous or identified user ID, session ID, timestamp, device and geo metadata, and a free-form properties map.
- User: An identity record linking anonymous IDs (browser fingerprint, cookie) to an identified user ID after login. Many anonymous IDs may map to one user.
- Session: A bounded window of contiguous user activity (max 30 minutes of inactivity). Groups events for funnel computation.
- Funnel: A named, ordered sequence of event types representing a conversion path (e.g., View Product -> Add to Cart -> Purchase).
- AggregateResult: A pre-computed rollup: metric name, dimension values (event type, country, device), time bucket, and count or sum. The unit that powers dashboard panels.
- Dashboard: A saved collection of query definitions rendered as charts. Backed by the query service.
Schema details and partitioning strategies are deferred to the deep dives. These six entities are sufficient to drive the API and High-Level Design.
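The Session entity's 30-minute inactivity rule is easy to state but worth seeing as code. A minimal sketch of the grouping pass (names and the tuple shape are illustrative, not a prescribed schema):

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # inactivity bound from the Session definition

def sessionize(events):
    """Group one user's time-ordered events into sessions.

    `events` is a list of (timestamp, event_type) tuples sorted by timestamp.
    A new session starts whenever the gap since the previous event
    exceeds 30 minutes.
    """
    sessions, current = [], []
    last_ts = None
    for ts, event_type in events:
        if last_ts is not None and ts - last_ts > SESSION_GAP:
            sessions.append(current)
            current = []
        current.append((ts, event_type))
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions
```

In practice this logic runs per user in the stream processor or as a batch job; the sketch only pins down the boundary rule that funnel computation depends on.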
API Design
Start with one endpoint per functional requirement. Evolve where the naive shape breaks at scale.
FR 1 - Ingest events:
Naive shape:
POST /events
Body: { event_type, user_id, session_id, timestamp, properties }
Response: { event_id }
This breaks at 11,500 events/second: each client sending individual HTTP requests generates enormous per-request overhead. Clients on mobile networks with 100ms+ round-trip times can barely sustain 10 requests/second per connection. The evolved shape batches events:
POST /events/batch
Body: { events: [ { event_type, user_id, session_id, timestamp, properties }, ... ] }
Response: { accepted: 247, failed: 0, batch_id }
Batch size caps at 500 events or 512KB, whichever is smaller. The server acknowledges receipt of the batch and writes to Kafka asynchronously. The client retries the entire batch on failure, so each event must carry a client-generated event_id UUID for deduplication downstream.
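The client-side half of this contract can be sketched as a small buffer that flushes on whichever cap is hit first. This is a hedged illustration, not a real SDK: `send_batch` is an injected stand-in for the POST to /events/batch, and the byte accounting is approximate.

```python
import time
import uuid

class EventBuffer:
    """Client-side buffer: flush at 500 events, 512 KB, or 10 s, whichever first.

    `send_batch` is injected so the sketch stays transport-agnostic; a real
    SDK would POST to /events/batch and retry the whole batch on failure.
    """
    MAX_EVENTS = 500
    MAX_BYTES = 512 * 1024
    MAX_AGE_S = 10.0

    def __init__(self, send_batch):
        self.send_batch = send_batch
        self.events = []
        self.bytes = 0
        self.oldest = None

    def track(self, event_type, user_id, session_id, properties=None):
        event = {
            # client-generated UUID survives retries, enabling dedup downstream
            "event_id": str(uuid.uuid4()),
            "event_type": event_type,
            "user_id": user_id,
            "session_id": session_id,
            "timestamp": time.time(),
            "properties": properties or {},
        }
        self.events.append(event)
        self.bytes += len(str(event))  # rough size estimate for the 512 KB cap
        self.oldest = self.oldest or time.time()
        if (len(self.events) >= self.MAX_EVENTS
                or self.bytes >= self.MAX_BYTES
                or time.time() - self.oldest >= self.MAX_AGE_S):
            self.flush()

    def flush(self):
        if self.events:
            self.send_batch({"events": self.events})
            self.events, self.bytes, self.oldest = [], 0, None
```

The important property is that `event_id` is assigned at capture time, before any network attempt, so a retried batch carries identical IDs and the server can deduplicate.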
FR 2 - Query aggregated metrics:
GET /query
Query params: metric, event_type, start_time, end_time, granularity (minute|hour|day), group_by
Response: {
data: [ { timestamp, value, dimensions } ],
next_cursor: "...",
freshness_seconds: 42
}
The freshness_seconds field tells the dashboard UI whether it is rendering the streaming aggregate (slightly stale) or a fully materialized batch result. This lets the UI render a staleness badge without a second round-trip.
FR 3 - Funnel analysis:
POST /funnels
Body: { steps: ["view_product", "add_to_cart", "purchase"], window_hours: 24, start_date, end_date }
Response: { funnel_id }
GET /funnels/{funnel_id}/results
Response: {
steps: [ { name, users_entered, users_completed, conversion_rate } ],
computed_at: "2026-03-29T14:00:00Z"
}
Funnel computation is expensive (correlated scan across user sessions). It is submitted asynchronously and polled. The funnel_id pattern avoids a synchronous 30-second HTTP hold.
High-Level Design
1. Ingesting events from clients
The write path must handle 11,500 events/second sustainably and absorb 3x traffic spikes without dropping events.
Naive approach: Client sends individual events synchronously to an ingestion server that writes directly to a PostgreSQL events table.
This fails immediately under real load. At 11,500 writes/second, a single PostgreSQL instance saturates. Network round-trips from mobile clients add latency that makes synchronous per-event writes impractical. A single spike trips a circuit breaker on the database and events are lost.
The key insight is to decouple ingestion acknowledgment from storage write. I always mention this decoupling principle early in the interview because it applies to every write-heavy system, not just analytics. The ingestion service acknowledges the client as soon as the batch lands in Kafka. The database write is asynchronous and retryable.
Evolved components:
- Client SDK: Buffers events in memory up to 500 events or 10 seconds and sends batches via POST /events/batch.
- Ingestion Service: Stateless HTTP servers. Validate event schema, assign a server-side receive timestamp, publish the batch to Kafka. Scales horizontally.
- Kafka: Write-ahead log. Partitioned by user_id to preserve per-user ordering. Retains events for 7 days as a replay buffer.
- Event Store (ClickHouse): Columnar OLAP database. Kafka consumers write events in bulk using asynchronous insert batches.
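The consumer side of the write path can be sketched as a micro-batching loop. The `poll` and `bulk_insert` callables are stand-ins for the real Kafka client and ClickHouse driver (assumptions for illustration); the point is the buffering discipline, since ClickHouse favors large, infrequent inserts over row-at-a-time writes.

```python
def consume_loop(poll, bulk_insert, max_batch=10_000, max_wait_polls=5):
    """Accumulate events from Kafka-like `poll()` calls and flush in bulk.

    Buffers up to `max_batch` events, or flushes early after
    `max_wait_polls` consecutive empty polls so a quiet topic still
    drains. A real consumer would commit offsets only after a successful
    insert, preserving at-least-once delivery.
    """
    buffer, idle = [], 0
    while True:
        records = poll()
        if records is None:          # sentinel: shutting down, flush remainder
            if buffer:
                bulk_insert(buffer)
            return
        if records:
            buffer.extend(records)
            idle = 0
        else:
            idle += 1
        if len(buffer) >= max_batch or (buffer and idle >= max_wait_polls):
            bulk_insert(buffer)
            buffer, idle = [], 0
```

At-least-once delivery means duplicates can reach this loop; the dedup check described later sits between `poll` and `bulk_insert`.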
This is the write path only. The read path and query layer come next.
2. Querying stored events
With 90TB of data in ClickHouse after 90 days, a scan of raw events for every dashboard request is expensive. Even ClickHouse scans billions of rows in seconds only when the query hits the right partition.
Components added:
- Query Service: Accepts metric queries, checks the query cache first, routes to ClickHouse materialized views for recent data or raw event partitions for long-range historical queries.
- Query Cache (Redis): Caches the results of identical or near-identical queries. Cache key includes the query parameters and the time bucket. TTL of 60 seconds for real-time panels, 1 hour for historical.
Routing the query to the right table is the query planner's job. Recent data in materialized views returns in under 500ms. Full raw scans for multi-month cohorts take 5-20 seconds on properly sized ClickHouse hardware. I'd tell the interviewer that this tiered routing is the single most important read-path optimization: it converts a scan across 90 TB of raw data into a lookup against a few gigabytes of pre-aggregated rollups for 95% of dashboard queries.
The two-tier routing shown above (materialized views for data under roughly 7 days old, raw partition scans beyond that) gains a third tier in the next section: events less than 60 seconds old are served from a dedicated streaming summary table written by Flink. The final routing model has three tiers: sub-60-second real-time, hourly materialized view, and raw partitioned scan.
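The three-tier routing decision reduces to a comparison on the query's start time. A minimal sketch, assuming the 60-second and 7-day boundaries described above; the table names are illustrative:

```python
from datetime import datetime, timedelta, timezone

def route_query(start_time, end_time, now=None):
    """Pick the cheapest table that can answer the requested time range.

    Tiers follow the routing model in the text: a streaming summary for
    the freshest data, hourly materialized views for recent history, and
    raw partitioned scans for long ranges.
    """
    now = now or datetime.now(timezone.utc)
    if now - start_time <= timedelta(seconds=60):
        return "realtime_summary"          # 1-minute streaming aggregates
    if now - start_time <= timedelta(days=7):
        return "hourly_materialized_view"  # exact hourly rollups
    return "events_raw"                    # full columnar scan, partitioned by date
```

A production planner also splits a single query across tiers (e.g. 30 days of rollups plus the trailing minute of streaming data) and stitches the results; the sketch shows only the boundary logic.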
3. Near-real-time dashboards
The Query Cache keeps repeated identical queries fast, but the first request for a real-time panel still hits ClickHouse. For dashboards that must reflect events within 60 seconds, we need a streaming aggregation layer that writes pre-computed results ahead of any query.
Components added:
- Stream Processor (Flink): Consumes the Kafka events_raw topic. Computes rolling 1-minute aggregate windows per event type, user segment, and geography. Writes results to the OLAP summary table in ClickHouse.
- Batch Processor: Runs hourly and daily rollup jobs that recompute exact aggregates from raw events. Overwrites the streaming approximations with exact values. This is the "lambda correction" step.
Real-time panels read from the ClickHouse Summaries table. The freshness of any panel is the age of the newest aggregate written by Flink, which is at most 60 seconds. Historical panels read from either the summaries (if granularity is hourly or coarser) or from the raw partitioned table for fine-grained analysis.
For your interview: state clearly that Flink is only needed for the sub-60-second freshness requirement. If the interviewer relaxes that to 5-minute freshness, you can drop Flink entirely and rely on ClickHouse materialized views alone. Knowing when to simplify the architecture is more impressive than always adding components.
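What the streaming job computes can be illustrated with a plain tumbling-window count. This is a sketch of the windowing logic only, not Flink's API; field names are illustrative:

```python
from collections import defaultdict

WINDOW_S = 60  # 1-minute tumbling windows, matching the freshness budget

def tumbling_counts(events):
    """Count events per (window_start, event_type, country).

    `events` is an iterable of dicts with an epoch-second `timestamp`,
    `event_type`, and `country`. Each event lands in exactly one window:
    window_start = timestamp - (timestamp % WINDOW_S). A real Flink job
    adds watermarks to decide when a window can close despite late,
    out-of-order events.
    """
    counts = defaultdict(int)
    for e in events:
        window_start = e["timestamp"] - (e["timestamp"] % WINDOW_S)
        counts[(window_start, e["event_type"], e["country"])] += 1
    return dict(counts)
```

Each (window, dimensions) row is upserted into the ClickHouse summary table, which is exactly what the hourly batch job later overwrites with exact values.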
4. Funnel analysis
Funnel analysis requires more than counting events. It requires correlating events across a user's session: did this specific user complete step A before step B within a 24-hour window? That question does not fit the columnar aggregate model. It requires grouping events by user identity and time window.
Components added:
- Funnel Service: Accepts funnel definitions. Runs a ClickHouse query that uses the windowFunnel function to evaluate ordered event sequences per user over the requested date range.
- Session Store (Redis): Caches active user sessions for real-time funnel monitoring. Stores the ordered event list per session_id with a 30-minute TTL.
The windowFunnel function in ClickHouse evaluates the ordered event sequence natively in the storage engine. A 90-day funnel over 100M users returns in under 30 seconds on a 6-shard ClickHouse cluster because the query is parallelized across shards partitioned by user_id. I always mention this built-in function by name because it shows you know the tool, not just the theory.
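To make the semantics concrete, here is a Python sketch mirroring what windowFunnel computes per user: the deepest funnel step reached by an in-order chain whose matched events all fall within the window measured from the first matched step. This is an illustration of the semantics, not how you would run it in production (ClickHouse does this inside the storage engine):

```python
def funnel_level(events, steps, window_s):
    """Deepest funnel step a user reached within the time window.

    `events` is a time-sorted list of (timestamp, event_type) tuples;
    `steps` is the ordered funnel, e.g. ["view_product", "add_to_cart",
    "purchase"]. Returns 0..len(steps).
    """
    best = 0
    for i, (start_ts, et) in enumerate(events):
        if et != steps[0]:
            continue                      # chains can only start at step 1
        level, next_step = 1, 1
        for ts, et2 in events[i + 1:]:
            if ts - start_ts > window_s:
                break                     # window measured from the chain's start
            if next_step < len(steps) and et2 == steps[next_step]:
                level += 1
                next_step += 1
        best = max(best, level)
        if best == len(steps):
            break
    return best
```

The funnel result is then just a histogram of levels across users: users_entered at step k is the count of users with level >= k.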
Potential Deep Dives
1. How do we ingest 1 billion events per day without dropping data?
Client volume at 11,500 events/second means the write path needs to handle bursty traffic, flaky mobile connections, and client retries that produce duplicates.
2. How do we balance real-time dashboards with historical queries?
This is the core architectural tension: streaming is fresh but approximate; batch is exact but stale. Two architectural patterns attack this differently.
3. What database do you use for the event store and query layer?
The choice of database determines the entire query performance envelope. This is one of the highest-leverage decisions in the design.
4. How do you handle event deduplication at scale?
Clients retry batches on network failure. Mobile apps resume events after going offline. This produces duplicate events that corrupt aggregate counts and funnel metrics.
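The standard answer pairs the client-generated event_id with a 24-hour first-seen window in Redis (SET with NX and a TTL). A minimal in-memory stand-in for that semantics, assuming the Redis-backed version in production:

```python
import time

class DedupWindow:
    """24-hour first-seen filter keyed by client-generated event_id.

    In production this would be a Redis SET <event_id> NX EX 86400 per
    event; this in-memory stand-in shows the semantics. `is_new` returns
    True exactly once per event_id within the TTL window.
    """
    def __init__(self, ttl_s=86_400, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self.seen = {}  # event_id -> expiry time

    def is_new(self, event_id):
        now = self.clock()
        expiry = self.seen.get(event_id)
        if expiry is not None and expiry > now:
            return False        # duplicate: drop before the ClickHouse write
        self.seen[event_id] = now + self.ttl_s
        return True
```

The check runs in the Kafka consumer, in front of the bulk insert, so duplicates from client retries never reach the aggregate tables.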
Final Architecture
Kafka as the write-ahead buffer combined with ClickHouse columnar storage is what makes 1 billion events per day tractable. Kafka absorbs ingestion spikes without back-pressure to clients. ClickHouse materialized views pre-compute aggregates at insert time, eliminating the need for a separate stream-processing job for standard metrics. The Query Service's three-tier routing (real-time summary, hourly materialized view, raw scan) ensures that 95% of dashboard queries hit pre-aggregated data and return in under 1 second.
The architecture is intentionally columnar-first. I'd frame the entire interview answer around the write/read asymmetry because every component choice traces back to it.
Interview Cheat Sheet
- State the write/read asymmetry up front: 1B events/day ingestion vs. a few thousand dashboard queries per hour. Every architecture decision follows from this ratio.
- Never have clients write events directly to the database. Buffer on the client (500 events, 10s max), acknowledge at Kafka, write to ClickHouse asynchronously. The database is never on the critical path for event capture.
- Kafka is the write-ahead buffer, not the event store. 7-day retention covers replay during ClickHouse downtime. ClickHouse is the actual store.
- ClickHouse reads only the columns touched by a query. A COUNT by event type over 1 billion 1 KB rows scans just the event_type and timestamp columns, a few gigabytes, instead of the roughly 1 TB a row store would read. That two-orders-of-magnitude reduction is why columnar storage is mandatory for analytics at this scale.
- Pre-aggregated materialized views in ClickHouse eliminate the need for a separate Flink or Spark aggregation job for standard metrics. The view is populated automatically at INSERT time.
- For real-time panels (under 60 seconds freshness), run a Flink job writing 1-minute tumbling window aggregates to a dedicated summary table. For dashboards where hourly granularity is acceptable, the materialized view alone is sufficient.
- Deduplication: assign a UUID event_id at the moment of capture on the client. This survives retries. Check the UUID against a 24-hour Redis dedup window in the Kafka consumer before writing to ClickHouse.
- Funnel analysis requires ordered event correlation per user. Use ClickHouse's built-in windowFunnel() function, which evaluates ordered event sequences natively in the storage engine.
- Funnel queries are expensive. Submit them asynchronously, return a funnel_id, and poll for results. Do not hold a synchronous HTTP connection open for 30 seconds.
- For query routing: real-time panels read from the streaming summary table; daily/weekly dashboards read from the hourly materialized view; long-range cohort scans read from raw event partitions. The Query Service abstracts this routing behind a single API.
- Cold archival is a nightly Parquet export to S3. Three years at 1 TB/day raw is roughly 1.1 PB; at the 5-10x compression Parquet typically achieves on event data, that is on the order of 100-200 TB in S3. Use Athena for ad-hoc historical analysis; do not keep 3 years in the ClickHouse hot tier.
- The difference between Lambda architecture (two separate pipelines, code duplication) and materialized views (one pipeline, aggregation at insert time) is a common follow-up. Materialized views win for standard metrics; Lambda is only justified when the batch layer uses dramatically different logic from the streaming layer.
- Common follow-up questions to prepare for: How do you handle late-arriving events? (Watermarking in stream processing, re-aggregation windows.) How do you route traffic from millions of websites to your pipeline without a central bottleneck?