Surge Pricing
Design a dynamic pricing system for a ride-sharing platform that detects demand-supply imbalances in real time, computes a surge multiplier, and integrates pricing seamlessly into the matching and booking flows.
What is a surge pricing system?
A surge pricing system detects when ride demand outpaces driver supply in a geographic area, computes a price multiplier, and applies it to new trips in real time. The engineering challenge is not the math; it is doing geo-cell aggregation at sub-30-second freshness across millions of concurrent requests without the multiplier oscillating every few seconds and eroding rider trust. I like opening with this question in interviews because it forces you to reason about feedback loops, not just request flows. It tests real-time stream processing, geospatial partitioning, cache design, and feedback-loop control in one question.
Functional Requirements
Core Requirements
- Detect when demand (ride requests) exceeds supply (available drivers) in a geographic area in near real time.
- Compute a surge multiplier for that area and apply it to new trip prices.
- Show riders the current surge multiplier before they confirm a booking.
- Automatically remove surge when supply and demand rebalance.
Scope Exclusions
- Driver incentive programs for surges.
- Long-term price forecasting.
The hardest problem in scope: Preventing surge oscillation. A pure real-time system where any demand spike triggers a multiplier increase and any supply recovery immediately drops it back to 1.0 creates a sawtooth wave that confuses riders and drivers alike. The deep dive on oscillation prevention is the most interview-differentiating part of this design.
Driver incentive programs are below the line because they run after a surge is detected, not during the detection or computation path. To add them: publish a SurgeActivated event to a Kafka topic; a separate Incentive Service subscribes, evaluates driver eligibility, and sends push notifications. It sits beside the Pricing Service, not inside it.
Long-term price forecasting is below the line because it relies on historical data warehousing and model training infrastructure that does not touch the real-time multiplier path. To add it: feed SurgeEvent records from the audit log into an offline ML pipeline that predicts surge windows by time of day and area and pre-publishes expected multipliers via a separate ForecastedSurge table.
Non-Functional Requirements
Core Requirements
- Multiplier latency: The surge multiplier is visible to a rider within 2 seconds of a demand spike in that cell.
- Freshness: Multipliers are recomputed every 30 seconds per geo-cell. A 30-second-old multiplier is stale but acceptable; a 5-minute-old one causes mis-pricing complaints.
- Read throughput: The Pricing Service handles multiplier lookups at the rate of the booking request volume, roughly 50K requests per second at peak globally.
- Availability: 99.99% uptime. Fail open: if the Pricing Service is unreachable, the Booking Service uses a multiplier of 1.0 rather than blocking the booking.
- Scale: 5M active drivers globally, 10M concurrent riders, and thousands of geo-cells active simultaneously.
- Consistency: Eventual consistency for multiplier reads is acceptable. A rider seeing a multiplier that is 30 seconds stale is a minor UX issue; an unavailable pricing endpoint is a revenue-stopping outage.
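The fail-open rule in the availability requirement can be sketched as a thin wrapper on the Booking Service side. This is a minimal illustration, not the platform's actual client: `lookup_multiplier` and the injected `fetch` callable are hypothetical names, and the extra guard against values below 1.0 is an assumption rather than something the requirements state.

```python
from typing import Callable

DEFAULT_MULTIPLIER = 1.0  # fail-open value from the availability requirement


def lookup_multiplier(cell_id: str, fetch: Callable[[str], float]) -> float:
    """Return the surge multiplier for a cell, falling back to 1.0 on failure.

    `fetch` is a hypothetical client call into the Pricing Service; it is
    expected to raise on timeouts or connection errors.
    """
    try:
        multiplier = fetch(cell_id)
    except Exception:
        # Fail open: an unreachable Pricing Service must not block the booking.
        return DEFAULT_MULTIPLIER
    # Assumption: surge never discounts, so anything below 1.0 is treated as bad data.
    return multiplier if multiplier >= 1.0 else DEFAULT_MULTIPLIER


def broken_fetch(cell_id: str) -> float:
    """Simulates an outage of the Pricing Service."""
    raise TimeoutError("pricing service unreachable")
```

Calling `lookup_multiplier("cell_a", broken_fetch)` yields the default multiplier instead of raising, so the booking proceeds at the base price.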
Below the Line
- Per-rider personalized pricing
- Cross-city multiplier normalization
- Real-time audit fraud detection on multiplier values
The hardest architectural constraint: 50K multiplier lookups per second with a 2-second visibility SLA. This eliminates any design where the Booking Service queries a database directly on every request. The multiplier must live in an in-process or near-process cache that is refreshed by an independently operating computation pipeline.
Per-rider personalized pricing is below the line because it requires user-level demand modeling that adds significant latency to the lookup path and changes the Pricing Service from a cell-keyed read to a user-keyed computation. To add it: run a separate Personalization Service in parallel and merge its multiplier adjustment with the cell-level multiplier before returning the final price.
Cross-city multiplier normalization is below the line because surge in one city does not propagate to another. Each city's cells are independent. Global multiplier coordination would require a consensus layer that adds latency without a clear user benefit.
Core Entities
- GeoCell: A geographic unit (identified by cell_id) with the current demand_count, supply_count, multiplier, and computed_at timestamp. The primary keyed object the Pricing Service reads and writes.
- SurgeEvent: An immutable audit record written whenever a multiplier changes. Contains cell_id, old_multiplier, new_multiplier, demand_count, supply_count, and event_time. Feeds ML training and compliance reporting.
- RideRequest: A demand signal. Carries request_id, rider_id, pickup_cell_id, status, and created_at. The aggregator counts pending requests per cell to compute demand.
- Driver: A supply signal. Carries driver_id, status (available, on_trip, offline), and current_cell_id. The aggregator counts available drivers per cell to compute supply.
Full schema, partition keys, and indexes are deferred to the deep dives. These four entities are sufficient for the API design and High-Level Design.
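As a rough sketch, the four entities could be written as Python dataclasses. The field names come from the list above; the field types, the `DriverStatus` enum, and the choice of a frozen dataclass for the audit record are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class DriverStatus(Enum):
    AVAILABLE = "available"
    ON_TRIP = "on_trip"
    OFFLINE = "offline"


@dataclass
class GeoCell:
    cell_id: str
    demand_count: int
    supply_count: int
    multiplier: float
    computed_at: datetime


@dataclass(frozen=True)  # frozen mirrors the "immutable audit record" property
class SurgeEvent:
    cell_id: str
    old_multiplier: float
    new_multiplier: float
    demand_count: int
    supply_count: int
    event_time: datetime


@dataclass
class RideRequest:
    request_id: str
    rider_id: str
    pickup_cell_id: str
    status: str  # "pending" is the only value the text names
    created_at: datetime


@dataclass
class Driver:
    driver_id: str
    status: DriverStatus
    current_cell_id: str
```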
API Design
FR 1 and FR 3 - Rider sees surge multiplier before booking:
GET /pricing/surge?cell_id={cell_id}
Response: { cell_id, multiplier, computed_at }
This is the hot read path. The Booking Service calls this endpoint on every ride request before presenting the final price to the rider. The response must be fast (under 10ms) because it sits in the critical path of booking. The computed_at field allows the Booking Service to surface "prices updated 15s ago" messaging in the UI.
FR 2 - Internal: Aggregator publishes demand and supply counts:
The aggregator does not expose an HTTP endpoint. It reads from two Kafka topics and writes computed cell states to Redis.
// Kafka topic: ride.requests
// Published by Booking Service on every ride request
{ request_id, pickup_cell_id, event_time }
// Kafka topic: driver.status
// Published by Location Service on every driver availability change
{ driver_id, cell_id, status, event_time }
FR 2 - Internal: Pricing Service computes and stores multiplier:
POST /pricing/compute (internal, called by Aggregator Worker)
Body: { cell_id, demand_count, supply_count }
Response: { cell_id, multiplier }
In the evolved design this becomes a Redis write directly from the Aggregator Worker rather than an HTTP call; the HTTP shape is shown here to make the contract explicit before the architecture evolves.
FR 4 - Surge removal is implicit: The Aggregator recomputes every cell on a 30-second schedule. When demand_count / supply_count falls below the deactivation threshold, the multiplier resets to 1.0 and a SurgeEvent is written with new_multiplier = 1.0. No separate "remove surge" API is needed.
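A deactivation threshold implies a separate, higher activation threshold, which is the usual hysteresis pattern. The sketch below shows one way the per-cell computation might look. The threshold values, the cap, and the "multiplier tracks the ratio" curve are all assumptions for illustration; the text does not specify them.

```python
ACTIVATE_RATIO = 1.5    # assumed: surge turns on above this demand/supply ratio
DEACTIVATE_RATIO = 1.2  # assumed: surge turns off below this ratio
MAX_MULTIPLIER = 3.0    # assumed cap


def next_multiplier(demand: int, supply: int, current: float) -> float:
    """Compute the next multiplier for one cell from its current state."""
    ratio = demand / max(supply, 1)  # avoid division by zero when no drivers
    surging = current > 1.0
    if surging and ratio < DEACTIVATE_RATIO:
        return 1.0  # implicit surge removal
    if not surging and ratio < ACTIVATE_RATIO:
        return 1.0  # imbalance too small to start surging
    # Assumed pricing curve: multiplier follows the ratio in 0.1 steps, capped.
    return min(MAX_MULTIPLIER, max(1.1, round(ratio, 1)))
```

With these numbers, a ratio of 1.3 leaves a non-surging cell at 1.0 but keeps an already-surging cell at 1.3. That gap between the two thresholds is what stops the multiplier from flipping on every small fluctuation.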
High-Level Design
1. Detecting demand-supply imbalance
The system must count pending ride requests and available drivers per geo-cell, then compare them. The naive approach polls a database on a timer.
Components:
- Booking Service: Records every ride request and publishes a demand event.
- Location Service: Tracks driver availability and publishes supply events.
- Aggregator: A scheduled job that queries the primary DB for demand and supply counts per cell every 60 seconds.
- Surge DB: Stores GeoCell records with demand, supply, and multiplier.
Request walkthrough:
- Rider submits a ride request; Booking Service inserts it into Surge DB with status pending.
- Driver sends a GPS update; Location Service updates driver status in Surge DB.
- Aggregator runs every 60 seconds and executes SELECT pickup_cell_id, COUNT(*) FROM ride_requests WHERE status='pending' GROUP BY pickup_cell_id and SELECT current_cell_id, COUNT(*) FROM drivers WHERE status='available' GROUP BY current_cell_id.
- Aggregator writes updated demand_count and supply_count into each GeoCell row.
- Pricing Service reads the GeoCell table when the Booking Service queries for the multiplier.
This covers demand detection and multiplier reads. The 60-second poll interval already violates both the 30-second freshness NFR and the 2-second visibility NFR. The next section evolves to streaming aggregation.
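The naive poll can be reproduced with an in-memory SQLite database. This is a toy sketch under stated assumptions: the table layouts are inferred from the entity fields, and demand counts only pending requests, matching the RideRequest description. A production system would not run on SQLite; it is used here only so the example runs without setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ride_requests (request_id TEXT, pickup_cell_id TEXT, status TEXT);
CREATE TABLE drivers (driver_id TEXT, current_cell_id TEXT, status TEXT);
INSERT INTO ride_requests VALUES
  ('r1', 'cell_a', 'pending'), ('r2', 'cell_a', 'pending'), ('r3', 'cell_b', 'completed');
INSERT INTO drivers VALUES
  ('d1', 'cell_a', 'available'), ('d2', 'cell_a', 'on_trip'), ('d3', 'cell_b', 'available');
""")


def poll_counts(conn: sqlite3.Connection) -> dict[str, tuple[int, int]]:
    """One aggregator run: (demand, supply) per cell from two GROUP BY scans."""
    demand = dict(conn.execute(
        "SELECT pickup_cell_id, COUNT(*) FROM ride_requests "
        "WHERE status = 'pending' GROUP BY pickup_cell_id"))
    supply = dict(conn.execute(
        "SELECT current_cell_id, COUNT(*) FROM drivers "
        "WHERE status = 'available' GROUP BY current_cell_id"))
    cells = demand.keys() | supply.keys()
    return {c: (demand.get(c, 0), supply.get(c, 0)) for c in cells}


counts = poll_counts(conn)
```

Both queries touch every row of their table on every run, which is the scaling problem the streaming design removes.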
2. Real-time aggregation with streaming events
A 60-second cron job cannot meet the 30-second freshness NFR, let alone the 2-second visibility target. The fix is to replace the batch poll with a streaming event pipeline that maintains rolling counts in memory.
I always draw the naive cron-poll version on the whiteboard first and let the interviewer see its two failure modes before introducing Kafka. It shows you know why the streaming approach exists, not just that it exists.
The DB-poll approach has two failure modes. First, a SELECT COUNT GROUP BY cell across millions of ride requests and drivers runs a full table scan; at peak it adds seconds of query time before any multiplier update is written. Second, the 60-second window means a demand spike from a concert ending triggers no surge response for up to a minute, during which all riders see a 1.0 multiplier and the system fails to clear the queue.
The fix is Kafka plus a stateful Aggregator Worker. The Booking Service publishes a RideRequested event to Kafka on every request. The Location Service publishes a DriverStatusChanged event on every status transition.
The Aggregator Worker maintains an in-memory counter map per cell, increments or decrements it on each event, and flushes to Redis every 30 seconds.
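A minimal sketch of that counter map follows. Because the driver.status event carries only the new status, this version also remembers each driver's last known cell and status so it knows when to decrement. That bookkeeping, and the class and method names, are assumptions for illustration. The sketch also ignores the case where a driver moves to a cell owned by a different worker.

```python
from collections import defaultdict


class CellCounters:
    """In-memory rolling counts per cell, as kept by one Aggregator Worker."""

    def __init__(self) -> None:
        self.demand: dict[str, int] = defaultdict(int)
        self.supply: dict[str, int] = defaultdict(int)
        # Last known (cell_id, status) per driver, needed to undo old supply.
        self._drivers: dict[str, tuple[str, str]] = {}

    def on_ride_requested(self, cell_id: str) -> None:
        self.demand[cell_id] += 1

    def on_driver_status(self, driver_id: str, cell_id: str, status: str) -> None:
        prev = self._drivers.get(driver_id)
        if prev is not None and prev[1] == "available":
            self.supply[prev[0]] -= 1  # driver no longer counted where it was
        if status == "available":
            self.supply[cell_id] += 1
        self._drivers[driver_id] = (cell_id, status)

    def snapshot(self) -> dict[str, tuple[int, int]]:
        """Return (demand, supply) per cell for the periodic flush."""
        cells = self.demand.keys() | self.supply.keys()
        return {c: (self.demand[c], self.supply[c]) for c in cells}
```

Each event costs a constant-time dictionary update, so the work scales with event volume rather than table size.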
Components:
- Kafka (ride.requests topic): Receives demand events from Booking Service. Partitioned by cell_id so all events for the same cell route to the same Aggregator Worker partition.
- Kafka (driver.status topic): Receives supply events from Location Service. Also partitioned by cell_id.
- Aggregator Worker: Consumes both topics, maintains HashMap<cell_id, (demand, supply)> in memory, and flushes updated cell states to Redis every 30 seconds.
- Pricing Service: Reads the multiplier directly from Redis on every Booking Service query. No DB read in the hot path.
- Redis (multiplier cache): Stores surge:cell:{cell_id} as a simple key-value with the serialized multiplier. TTL is 90 seconds as a safety net.
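Keying both topics by cell_id works because a stable hash of the key always selects the same partition. The sketch below illustrates the idea with CRC32 as a stand-in; it is not Kafka's actual partitioner hash, and the partition count is an assumed value.

```python
import zlib

NUM_PARTITIONS = 12  # assumed topic size


def partition_for(cell_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a cell to a partition with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(cell_id.encode("utf-8")) % num_partitions
```

Python's built-in `hash()` is salted per process for strings, so it would not give the cross-process stability this routing depends on.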
Request walkthrough:
- Rider submits a ride request; Booking Service publishes { request_id, cell_id } to ride.requests.
- Driver goes offline; Location Service publishes { driver_id, cell_id, status: offline } to driver.status.
- Aggregator Worker consumes events, updates in-memory counters: demand[cell_a]++, supply[cell_a]--.
- Every 30 seconds, Aggregator Worker computes multiplier = compute(demand, supply) per cell and writes SET surge:cell:{cell_id} {multiplier} EX 90 to Redis.
- Booking Service calls GET /pricing/surge?cell_id=X; Pricing Service executes GET surge:cell:{cell_id} from Redis and returns in under 1ms.
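The last two steps, the periodic flush and the hot-path read, can be sketched against a tiny in-memory stand-in for Redis so the example runs without a server. `FakeRedis` is hypothetical and only mimics the `set(key, value, ex=...)` and `get(key)` shape. Treating a missing or expired key as a multiplier of 1.0 is an assumption, chosen to match the fail-open requirement.

```python
import time


class FakeRedis:
    """In-memory stand-in for the two Redis commands used here."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data: dict[str, tuple[str, float]] = {}

    def set(self, key: str, value: str, ex: int) -> None:
        self._data[key] = (value, self._clock() + ex)  # store expiry deadline

    def get(self, key: str):
        entry = self._data.get(key)
        if entry is None or self._clock() >= entry[1]:
            return None  # missing or expired
        return entry[0]


def flush(cache, multipliers: dict[str, float], ttl_seconds: int = 90) -> None:
    """Aggregator side: write each cell's multiplier with the safety-net TTL."""
    for cell_id, multiplier in multipliers.items():
        cache.set(f"surge:cell:{cell_id}", f"{multiplier:.1f}", ex=ttl_seconds)


def read_multiplier(cache, cell_id: str) -> float:
    """Pricing Service side: assumed default of 1.0 when the key is absent."""
    raw = cache.get(f"surge:cell:{cell_id}")
    return float(raw) if raw is not None else 1.0
```

The 90-second TTL spans three flush intervals. A cell whose worker stalls therefore keeps its last multiplier through two missed flushes, then falls back to no surge rather than serving a stale value indefinitely.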