Surge Pricing
Design a dynamic pricing system for a ride-sharing platform that detects demand-supply imbalances in real time, computes a surge multiplier, and integrates pricing seamlessly into the matching and booking flows.
What is a surge pricing system?
A surge pricing system detects when ride demand outpaces driver supply in a geographic area, computes a price multiplier, and applies it to new trips in real time. The engineering challenge is not the math; it is doing geo-cell aggregation at sub-30-second freshness across millions of concurrent requests without the multiplier oscillating every few seconds and eroding rider trust. I like opening with this question in interviews because it forces you to reason about feedback loops, not just request flows. It tests real-time stream processing, geospatial partitioning, cache design, and feedback-loop control in one question.
Functional Requirements
Core Requirements
- Detect when demand (ride requests) exceeds supply (available drivers) in a geographic area in near real time.
- Compute a surge multiplier for that area and apply it to new trip prices.
- Show riders the current surge multiplier before they confirm a booking.
- Automatically remove surge when supply and demand rebalance.
Scope Exclusions
- Driver incentive programs for surges.
- Long-term price forecasting.
The hardest problem in scope: Preventing surge oscillation. A pure real-time system where any demand spike triggers a multiplier increase and any supply recovery immediately drops it back to 1.0 creates a sawtooth wave that confuses riders and drivers alike. The deep dive on oscillation prevention is the most interview-differentiating part of this design.
Driver incentive programs are below the line because they run after a surge is detected, not during the detection or computation path. To add them: publish a SurgeActivated event to a Kafka topic; a separate Incentive Service subscribes, evaluates driver eligibility, and sends push notifications. It sits beside the Pricing Service, not inside it.
Long-term price forecasting is below the line because it relies on historical data warehousing and model training infrastructure that does not touch the real-time multiplier path. To add it: feed SurgeEvent records from the audit log into an offline ML pipeline that predicts surge windows by time of day and area and pre-publishes expected multipliers via a separate ForecastedSurge table.
Non-Functional Requirements
Core Requirements
- Multiplier latency: The surge multiplier is visible to a rider within 2 seconds of a demand spike in that cell.
- Freshness: Multipliers are recomputed every 30 seconds per geo-cell. A 30-second-old multiplier is stale but acceptable; a 5-minute-old one causes mis-pricing complaints.
- Read throughput: The Pricing Service handles multiplier lookups at the rate of the booking request volume, roughly 50K requests per second at peak globally.
- Availability: 99.99% uptime. Fail open: if the Pricing Service is unreachable, the Booking Service uses a multiplier of 1.0 rather than blocking the booking.
- Scale: 5M active drivers globally, 10M concurrent riders, and thousands of geo-cells active simultaneously.
- Consistency: Eventual consistency for multiplier reads is acceptable. A rider seeing a multiplier that is 30 seconds stale is a minor UX issue; an unavailable pricing endpoint is a revenue-stopping outage.
Below the Line
- Per-rider personalized pricing
- Cross-city multiplier normalization
- Real-time audit fraud detection on multiplier values
The hardest architectural constraint: 50K multiplier lookups per second with a 2-second visibility SLA. This eliminates any design where the Booking Service queries a database directly on every request. The multiplier must live in an in-process or near-process cache that is refreshed by an independently operating computation pipeline.
Per-rider personalized pricing is below the line because it requires user-level demand modeling that adds significant latency to the lookup path and changes the Pricing Service from a cell-keyed read to a user-keyed computation. To add it: run a separate Personalization Service in parallel and merge its multiplier adjustment with the cell-level multiplier before returning the final price.
Cross-city multiplier normalization is below the line because surge in one city does not propagate to another. Each city's cells are independent. Global multiplier coordination would require a consensus layer that adds latency without a clear user benefit.
Core Entities
- GeoCell: A geographic unit (identified by cell_id) with the current demand_count, supply_count, multiplier, and computed_at timestamp. The primary keyed object the Pricing Service reads and writes.
- SurgeEvent: An immutable audit record written whenever a multiplier changes. Contains cell_id, old_multiplier, new_multiplier, demand_count, supply_count, and event_time. Feeds ML training and compliance reporting.
- RideRequest: A demand signal. Carries request_id, rider_id, pickup_cell_id, status, and created_at. The aggregator counts pending requests per cell to compute demand.
- Driver: A supply signal. Carries driver_id, status (available, on_trip, offline), and current_cell_id. The aggregator counts available drivers per cell to compute supply.
Full schema, partition keys, and indexes are deferred to the deep dives. These four entities are sufficient for the API design and High-Level Design.
API Design
FR 1 and FR 3 - Rider sees surge multiplier before booking:
GET /pricing/surge?cell_id={cell_id}
Response: { cell_id, multiplier, computed_at }
This is the hot read path. The Booking Service calls this endpoint on every ride request before presenting the final price to the rider. The response must be fast (under 10ms) because it sits in the critical path of booking. The computed_at field allows the Booking Service to surface "prices updated 15s ago" messaging in the UI.
FR 2 - Internal: Aggregator publishes demand and supply counts:
The aggregator does not expose an HTTP endpoint. It reads from two Kafka topics and writes computed cell states to Redis.
// Kafka topic: ride.requests
// Published by Booking Service on every ride request
{ request_id, pickup_cell_id, event_time }
// Kafka topic: driver.status
// Published by Location Service on every driver availability change
{ driver_id, cell_id, status, event_time }
FR 2 - Internal: Pricing Service computes and stores multiplier:
POST /pricing/compute (internal, called by Aggregator Worker)
Body: { cell_id, demand_count, supply_count }
Response: { cell_id, multiplier }
In the evolved design this becomes a Redis write directly from the Aggregator Worker rather than an HTTP call; the HTTP shape is shown here to make the contract explicit before the architecture evolves.
FR 4 - Surge removal is implicit: The Aggregator recomputes every cell on a 30-second schedule. When demand_count / supply_count falls below the deactivation threshold, the multiplier resets to 1.0 and a SurgeEvent is written with new_multiplier = 1.0. No separate "remove surge" API is needed.
High-Level Design
1. Detecting demand-supply imbalance
The system must count pending ride requests and available drivers per geo-cell, then compare them. The naive approach polls a database on a timer.
Components:
- Booking Service: Records every ride request and publishes a demand event.
- Location Service: Tracks driver availability and publishes supply events.
- Aggregator: A scheduled job that queries the primary DB for demand and supply counts per cell every 60 seconds.
- Surge DB: Stores GeoCell records with demand, supply, and multiplier.
Request walkthrough:
- Rider submits a ride request; Booking Service inserts it into Surge DB with status pending.
- Driver sends a GPS update; Location Service updates driver status in Surge DB.
- Aggregator runs every 60 seconds, executing SELECT pickup_cell_id, COUNT(*) FROM ride_requests WHERE status = 'pending' GROUP BY pickup_cell_id and SELECT current_cell_id, COUNT(*) FROM drivers WHERE status = 'available' GROUP BY current_cell_id.
- Aggregator writes updated demand_count and supply_count into each GeoCell row.
- Pricing Service reads the GeoCell table when the Booking Service queries for the multiplier.
This covers demand detection and multiplier reads. The 60-second poll interval already violates the 2-second freshness NFR. The next section evolves the design to streaming aggregation.
2. Real-time aggregation with streaming events
A 60-second cron job cannot meet a 2-second freshness SLA. The fix is replacing the batch poll with a streaming event pipeline that maintains rolling counts in memory.
I always draw the naive cron-poll version on the whiteboard first and let the interviewer see its two failure modes before introducing Kafka. It shows you know why the streaming approach exists, not just that it exists.
The DB-poll approach has two failure modes. First, a COUNT(*) ... GROUP BY cell query across millions of ride-request and driver rows runs a full table scan; at peak it adds seconds of query latency before any multiplier update is written. Second, the 60-second window means a demand spike from a concert ending triggers no surge response for up to a minute, during which every rider sees a 1.0 multiplier and the system fails to clear the queue.
The fix is Kafka plus a stateful Aggregator Worker. The Booking Service publishes a RideRequested event to Kafka on every request. The Location Service publishes a DriverStatusChanged event on every status transition.
The Aggregator Worker maintains an in-memory counter map per cell, increments or decrements it on each event, and flushes to Redis every 30 seconds.
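The counter-and-flush loop described above can be sketched in a few lines. This is a minimal illustration, not the production worker: the class and method names are invented, a plain dict stands in for Redis, and the multiplier formula is the placeholder ratio clamp that the deep dives later refine.

```python
from collections import defaultdict

class AggregatorWorker:
    """Stateful aggregator sketch: per-cell in-memory counters,
    flushed to a key-value store on a timer. Names are illustrative."""

    def __init__(self, store):
        self.store = store              # dict stands in for Redis in this sketch
        self.demand = defaultdict(int)  # cell_id -> pending ride requests
        self.supply = defaultdict(int)  # cell_id -> available drivers

    def on_ride_requested(self, event):
        # Consumed from the ride.requests topic
        self.demand[event["pickup_cell_id"]] += 1

    def on_driver_status(self, event):
        # Consumed from the driver.status topic
        delta = 1 if event["status"] == "available" else -1
        cell = event["cell_id"]
        self.supply[cell] = max(0, self.supply[cell] + delta)

    def flush(self):
        # Called every 30 seconds; in production this is
        # SET surge:cell:{cell_id} {multiplier} EX 90 against Redis
        for cell_id in set(self.demand) | set(self.supply):
            self.store[f"surge:cell:{cell_id}"] = self.compute(
                self.demand[cell_id], self.supply[cell_id]
            )

    @staticmethod
    def compute(demand, supply):
        # Black-box placeholder: clamp the raw ratio to [1.0, 3.0]
        if supply == 0:
            return 3.0 if demand > 0 else 1.0
        return min(3.0, max(1.0, demand / supply))
```

Because Kafka delivers all events for a cell to one worker instance, these counters never need locks or cross-process coordination.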
Components:
- Kafka (ride.requests topic): Receives demand events from Booking Service. Partitioned by cell_id so all events for the same cell route to the same Aggregator Worker partition.
- Kafka (driver.status topic): Receives supply events from Location Service. Also partitioned by cell_id.
- Aggregator Worker: Consumes both topics, maintains HashMap<cell_id, (demand, supply)> in memory, and flushes updated cell states to Redis every 30 seconds.
- Pricing Service: Reads the multiplier directly from Redis on every Booking Service query. No DB read in the hot path.
- Redis (multiplier cache): Stores surge:cell:{cell_id} as a simple key-value with the serialized multiplier. TTL is 90 seconds as a safety net.
Request walkthrough:
- Rider submits a ride request; Booking Service publishes { request_id, cell_id } to ride.requests.
- Driver goes offline; Location Service publishes { driver_id, cell_id, status: offline } to driver.status.
- Aggregator Worker consumes events, updates in-memory counters: demand[cell_a]++, supply[cell_a]--.
- Every 30 seconds, Aggregator Worker computes multiplier = compute(demand, supply) per cell and writes SET surge:cell:{cell_id} {multiplier} EX 90 to Redis.
- Booking Service calls GET /pricing/surge?cell_id=X; Pricing Service executes GET surge:cell:{cell_id} from Redis and returns in under 1ms.
Kafka partitioning by cell_id is the key move. It ensures all events for a given cell arrive at the same Aggregator Worker instance, so the in-memory counter map never needs cross-process coordination. I would call this out explicitly on the whiteboard because it is the sentence that tells the interviewer you understand stateful stream processing. Treat the multiplier computation formula as a black box here; the deep dives cover it.
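Keyed partitioning is just a stable hash of the key modulo the partition count. The sketch below illustrates the property that matters here, namely that the same cell_id always lands on the same partition; it uses crc32 purely for demonstration, whereas Kafka's default Java partitioner actually hashes the key bytes with murmur2.

```python
import zlib

def partition_for(cell_id: str, num_partitions: int) -> int:
    """Illustrative keyed partitioner: any stable hash works, as long as
    the same cell_id always maps to the same partition (and therefore to
    the same Aggregator Worker instance in its consumer group)."""
    return zlib.crc32(cell_id.encode("utf-8")) % num_partitions
```

With this routing, each worker owns a disjoint set of cells, so the in-memory counter maps are exclusive by construction.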
3. Showing riders the surge multiplier before booking
The Pricing Service read path sits in the critical booking flow but must not add perceptible latency. The multiplier must be visible to the rider within 2 seconds of a demand spike.
Components:
- Booking Service: Calls GET /pricing/surge?cell_id=X before returning the price estimate to the rider UI.
- Pricing Service: A stateless read service that executes one Redis GET per request.
- Redis (multiplier cache): Pre-populated by the Aggregator Worker. The Pricing Service never writes to Redis.
Request walkthrough:
- Rider app requests a price estimate; Booking Service receives the call.
- Booking Service determines the rider's pickup cell_id (via H3 or geographic lookup).
- Booking Service calls Pricing Service: GET /pricing/surge?cell_id={cell_id}.
- Pricing Service runs GET surge:cell:{cell_id} against Redis.
- If the key is missing (cold start or TTL expired): Pricing Service returns { multiplier: 1.0, source: "default" } and logs the cache miss. This is the fail-open behavior required by the 99.99% availability NFR.
- Pricing Service returns { cell_id, multiplier, computed_at } to Booking Service.
- Booking Service multiplies the base fare by the multiplier and presents the final price to the rider.
The Pricing Service is intentionally stateless. It owns no data; it only translates a cell_id into a Redis key lookup. This means it can be horizontally scaled to any number of instances behind a load balancer with no coordination overhead.
Never put the multiplier computation inside the Pricing Service read path. Computing surge on every booking request would couple read latency to computation complexity and make the 50K reads/second SLA impossible to hold. Computation belongs exclusively in the Aggregator Worker, which runs asynchronously on its own schedule.
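The fail-open read path is small enough to sketch in full. This is an illustrative shape, not the real service: the function name is invented, and the Redis client is injected as a plain callable so the behavior (miss or error defaults to 1.0) is explicit.

```python
def lookup_multiplier(redis_get, cell_id, default=1.0):
    """Fail-open multiplier read, per the availability NFR: a cache miss
    or a Redis error returns the default 1.0 instead of blocking the
    booking. redis_get is any callable key -> value-or-None."""
    try:
        raw = redis_get(f"surge:cell:{cell_id}")
    except Exception:
        # Redis unreachable: degrade to base price, never block the booking
        return {"cell_id": cell_id, "multiplier": default, "source": "default"}
    if raw is None:
        # Cold start or TTL expired
        return {"cell_id": cell_id, "multiplier": default, "source": "default"}
    return {"cell_id": cell_id, "multiplier": float(raw), "source": "cache"}
```

The source field makes degraded responses observable, so a spike in "default" reads can page the on-call before it becomes a pricing incident.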
4. Automatic surge removal on rebalance
When drivers flood into a surge area, supply recovers, and the multiplier must return to 1.0 without manual intervention. The Aggregator Worker handles this on its next flush cycle.
The Aggregator Worker recomputes every active cell on every 30-second flush. When demand_count / supply_count drops below the deactivation threshold, it writes multiplier = 1.0 to Redis and inserts a SurgeEvent with new_multiplier = 1.0 to the audit log. The Pricing Service reads the updated value on the next request; no separate deactivation signal is needed.
Components:
- Aggregator Worker: Evaluates the deactivation condition on every flush cycle.
- Redis (multiplier cache): Updated to 1.0 when the condition clears.
- Surge DB: New SurgeEvent row written for auditing and ML training.
Request walkthrough:
- Drivers enter a surge cell; Location Service publishes DriverStatusChanged events with status: available.
- Aggregator Worker consumes events; supply[cell_a] increments until demand / supply < deactivation_threshold.
- On the next 30-second flush: SET surge:cell:{cell_a} 1.0 EX 90 overwrites the previous multiplier.
- Aggregator Worker writes SurgeEvent { old_multiplier: 1.8, new_multiplier: 1.0 } to Surge DB.
- All subsequent Pricing Service reads for cell_a return 1.0.
Surge removal is symmetric with surge activation. The same flush loop that raises multipliers also lowers them. I like to point out this symmetry explicitly because interviewers sometimes expect a separate deactivation service; showing that the same 30-second flush handles both directions demonstrates clean design. The only difference is the threshold comparison, which the EWMA deep dive covers in detail.
Potential Deep Dives
1. How do we partition geography into cells?
Every multiplier is scoped to a geo-cell. The cell definition determines surge granularity, shard distribution in Redis, and whether a high-demand street corner incorrectly inflates the multiplier for a quiet street two blocks away.
2. How do we compute the surge multiplier?
Given demand and supply counts for a cell, how do you translate them into a multiplier? The naive formula is trivial; the production formula prevents wild swings that undermine rider trust.
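As a starting point for that discussion, here is a sketch of the naive mapping. The cap, step size, and function name are assumptions for illustration; the real production formula layers smoothing and hysteresis on top of this.

```python
def naive_multiplier(demand: int, supply: int,
                     cap: float = 3.0, step: float = 0.1) -> float:
    """Naive demand/supply -> multiplier mapping (illustrative): clamp
    the raw ratio to [1.0, cap] and round to 0.1 steps so riders see
    1.3x, not 1.2847x. This is the formula that swings wildly without
    smoothing, which is exactly why the deep dive exists."""
    ratio = demand / supply if supply > 0 else cap
    clamped = min(cap, max(1.0, ratio))
    return round(round(clamped / step) * step, 2)
```

Rounding to coarse steps also limits how often the displayed price changes, which matters once SurgeEvent rows are only written on meaningful multiplier changes.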
3. How do we prevent surge oscillation?
Oscillation is the surge pricing failure mode that appears in production but not in staging. A cell enters and exits surge every 30-60 seconds, riders see the multiplier change every time they open the app, and trust collapses. The EWMA formula mitigates this but does not eliminate it without an explicit state machine.
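The three mitigations named in this document (EWMA with alpha=0.3, hysteresis at 1.3/1.1, and a 2-cycle confirmation) compose into a small per-cell state machine. The thresholds below come from the cheat sheet; the class and field names are illustrative, and the exact composition is one reasonable design, not the only one.

```python
class SurgeStateMachine:
    """Oscillation-damping sketch: EWMA smoothing of the demand/supply
    ratio, asymmetric activate/deactivate thresholds (hysteresis), and a
    consecutive-cycle confirmation before surge turns on."""

    def __init__(self, alpha=0.3, activate_at=1.3, deactivate_below=1.1,
                 cycles_required=2):
        self.alpha = alpha
        self.activate_at = activate_at
        self.deactivate_below = deactivate_below
        self.cycles_required = cycles_required
        self.ewma = 1.0        # smoothed demand/supply ratio
        self.hot_cycles = 0    # consecutive cycles at/above activation
        self.surging = False

    def observe(self, demand: int, supply: int) -> bool:
        """Feed one 30-second flush cycle; returns whether surge is active."""
        ratio = demand / supply if supply > 0 else self.activate_at
        self.ewma = self.alpha * ratio + (1 - self.alpha) * self.ewma

        if not self.surging:
            if self.ewma >= self.activate_at:
                self.hot_cycles += 1
                if self.hot_cycles >= self.cycles_required:
                    self.surging = True
            else:
                self.hot_cycles = 0  # one noisy reading cannot spike it
        elif self.ewma < self.deactivate_below:
            # Asymmetric exit: the 1.3 -> 1.1 gap is the hysteresis band
            self.surging = False
            self.hot_cycles = 0
        return self.surging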
4. How do we scale the Pricing Service to 50K reads per second?
The Pricing Service sits in the critical booking path. Every booking request triggers one multiplier read. At 50K bookings per second globally, the Pricing Service must return under 10ms without a database query in the hot path.
Final Architecture
The core insight is the push inversion: the Aggregator Worker pre-computes multipliers and pushes them to Redis every 30 seconds, so the Booking Service reads a pre-computed value in under 1ms with no coordination. The Pricing Service exits the critical booking path entirely and becomes an external-facing read API only. If I had to summarize this entire design in one sentence for a whiteboard, it would be: "the Aggregator pushes and the Booking Service reads; nothing in the booking path computes."
Interview Cheat Sheet
- State the freshness SLA (2-second visibility, 30-second recompute interval) before discussing any architecture; it eliminates the DB-poll approach immediately.
- Use H3 at resolution 7 for geo-cells: hexagons tile the sphere without distortion, and each cell is a separate Redis key that shards naturally.
- Apply k=1 ring expansion when computing demand and supply per cell; it prevents cliff-edge effects at cell boundaries.
- The Kafka topics must be partitioned by cell_id so all events for a given cell route to the same Aggregator Worker instance and the in-memory counter map needs no cross-process coordination.
- EWMA with alpha=0.3 gives recent readings 30% weight and prior history 70% weight, smoothing transient noise without lagging genuine surges.
- Hysteresis prevents oscillation: activate surge at ratio 1.3, deactivate only when ratio drops below 1.1. The asymmetric gap is the key.
- Require 2 consecutive flush cycles above the activation threshold before triggering surge; one noisy reading cannot spike the multiplier.
- The Booking Service must read from Redis directly, not via an HTTP call to the Pricing Service; the extra network hop is 5-20ms and is unacceptable in the booking critical path.
- Redis handles 100K+ simple GET operations per second per node; a 3-node cluster comfortably serves 50K booking reads per second at under 1ms p99.
- The Pricing Service is fail-open: if Redis is unreachable, return multiplier 1.0 and log the miss. A blocked booking is worse than a temporarily under-priced one.
- Write SurgeEvent records only when the multiplier changes by 0.1 or more, not on every flush cycle; this keeps the audit log from growing at two rows per active cell per minute.
- Treat EWMA alpha and hysteresis thresholds as market-specific config, not code constants; surge dynamics in Manhattan differ from those at an airport pickup lane.
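The last point can be made concrete with a config-as-data sketch. The field names, market keys, and override values here are hypothetical; the point is that tuning parameters ship as per-market configuration with sane defaults, not as code constants.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SurgeConfig:
    """Market-tunable surge parameters. Defaults mirror the values used
    throughout this design; everything here is illustrative."""
    ewma_alpha: float = 0.3
    activation_ratio: float = 1.3
    deactivation_ratio: float = 1.1
    confirmation_cycles: int = 2
    max_multiplier: float = 3.0

# Hypothetical per-market overrides: an airport pickup lane can tolerate
# a faster-reacting surge than a dense downtown grid.
MARKET_OVERRIDES = {
    "nyc-manhattan": SurgeConfig(ewma_alpha=0.2, confirmation_cycles=3),
    "sfo-airport": SurgeConfig(ewma_alpha=0.5, max_multiplier=4.0),
}

def config_for(market: str) -> SurgeConfig:
    # Fall back to global defaults for markets without explicit tuning
    return MARKET_OVERRIDES.get(market, SurgeConfig())
```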