Surge Pricing
Design a dynamic pricing system for a ride-sharing platform that detects demand-supply imbalances in real time, computes a surge multiplier, and integrates pricing seamlessly into the matching and booking flows.
What is a surge pricing system?
A surge pricing system detects when ride demand outpaces driver supply in a geographic area, computes a price multiplier, and applies it to new trips in real time. The engineering challenge is not the math; it is doing geo-cell aggregation at sub-30-second freshness across millions of concurrent requests without the multiplier oscillating every few seconds and eroding rider trust. I like opening with this question in interviews because it forces you to reason about feedback loops, not just request flows. It tests real-time stream processing, geospatial partitioning, cache design, and feedback-loop control in one question.
Functional Requirements
Core Requirements
- Detect when demand (ride requests) exceeds supply (available drivers) in a geographic area in near real time.
- Compute a surge multiplier for that area and apply it to new trip prices.
- Show riders the current surge multiplier before they confirm a booking.
- Automatically remove surge when supply and demand rebalance.
Scope Exclusions
- Driver incentive programs for surges.
- Long-term price forecasting.
The hardest problem in scope: Preventing surge oscillation. A pure real-time system where any demand spike triggers a multiplier increase and any supply recovery immediately drops it back to 1.0 creates a sawtooth wave that confuses riders and drivers alike. The deep dive on oscillation prevention is the most interview-differentiating part of this design.
Driver incentive programs are below the line because they run after a surge is detected, not during the detection or computation path. To add them: publish a SurgeActivated event to a Kafka topic; a separate Incentive Service subscribes, evaluates driver eligibility, and sends push notifications. It sits beside the Pricing Service, not inside it.
Long-term price forecasting is below the line because it relies on historical data warehousing and model training infrastructure that does not touch the real-time multiplier path. To add it: feed SurgeEvent records from the audit log into an offline ML pipeline that predicts surge windows by time of day and area and pre-publishes expected multipliers via a separate ForecastedSurge table.
Non-Functional Requirements
Core Requirements
- Multiplier latency: The surge multiplier is visible to a rider within 2 seconds of a demand spike in that cell.
- Freshness: Multipliers are recomputed every 30 seconds per geo-cell. A 30-second-old multiplier is stale but acceptable; a 5-minute-old one causes mis-pricing complaints.
- Read throughput: The Pricing Service handles multiplier lookups at the rate of the booking request volume, roughly 50K requests per second at peak globally.
- Availability: 99.99% uptime. Fail open: if the Pricing Service is unreachable, the Booking Service uses a multiplier of 1.0 rather than blocking the booking.
- Scale: 5M active drivers globally, 10M concurrent riders, and thousands of geo-cells active simultaneously.
- Consistency: Eventual consistency for multiplier reads is acceptable. A rider seeing a multiplier that is 30 seconds stale is a minor UX issue; an unavailable pricing endpoint is a revenue-stopping outage.
Below the Line
- Per-rider personalized pricing
- Cross-city multiplier normalization
- Real-time audit fraud detection on multiplier values
The hardest architectural constraint: 50K multiplier lookups per second with a 2-second visibility SLA. This eliminates any design where the Booking Service queries a database directly on every request. The multiplier must live in an in-process or near-process cache that is refreshed by an independently operating computation pipeline.
Per-rider personalized pricing is below the line because it requires user-level demand modeling that adds significant latency to the lookup path and changes the Pricing Service from a cell-keyed read to a user-keyed computation. To add it: run a separate Personalization Service in parallel and merge its multiplier adjustment with the cell-level multiplier before returning the final price.
Cross-city multiplier normalization is below the line because surge in one city does not propagate to another. Each city's cells are independent. Global multiplier coordination would require a consensus layer that adds latency without a clear user benefit.
Core Entities
- GeoCell: A geographic unit (identified by cell_id) with the current demand_count, supply_count, multiplier, and computed_at timestamp. The primary keyed object the Pricing Service reads and writes.
- SurgeEvent: An immutable audit record written whenever a multiplier changes. Contains cell_id, old_multiplier, new_multiplier, demand_count, supply_count, and event_time. Feeds ML training and compliance reporting.
- RideRequest: A demand signal. Carries request_id, rider_id, pickup_cell_id, status, and created_at. The aggregator counts pending requests per cell to compute demand.
- Driver: A supply signal. Carries driver_id, status (available, on_trip, offline), and current_cell_id. The aggregator counts available drivers per cell to compute supply.
Full schema, partition keys, and indexes are deferred to the deep dives. These four entities are sufficient for the API design and High-Level Design.
API Design
FR 1 and FR 3 - Rider sees surge multiplier before booking:
GET /pricing/surge?cell_id={cell_id}
Response: { cell_id, multiplier, computed_at }
This is the hot read path. The Booking Service calls this endpoint on every ride request before presenting the final price to the rider. The response must be fast (under 10ms) because it sits in the critical path of booking. The computed_at field allows the Booking Service to surface "prices updated 15s ago" messaging in the UI.
FR 2 - Internal: Aggregator publishes demand and supply counts:
The aggregator does not expose an HTTP endpoint. It reads from two Kafka topics and writes computed cell states to Redis.
// Kafka topic: ride.requests
// Published by Booking Service on every ride request
{ request_id, pickup_cell_id, event_time }
// Kafka topic: driver.status
// Published by Location Service on every driver availability change
{ driver_id, cell_id, status, event_time }
FR 2 - Internal: Pricing Service computes and stores multiplier:
POST /pricing/compute (internal, called by Aggregator Worker)
Body: { cell_id, demand_count, supply_count }
Response: { cell_id, multiplier }
In the evolved design this becomes a Redis write directly from the Aggregator Worker rather than an HTTP call; the HTTP shape is shown here to make the contract explicit before the architecture evolves.
FR 4 - Surge removal is implicit: The Aggregator recomputes every cell on a 30-second schedule. When demand_count / supply_count falls below the deactivation threshold, the multiplier resets to 1.0 and a SurgeEvent is written with new_multiplier = 1.0. No separate "remove surge" API is needed.
High-Level Design
1. Detecting demand-supply imbalance
The system must count pending ride requests and available drivers per geo-cell, then compare them. The naive approach polls a database on a timer.
Components:
- Booking Service: Records every ride request and publishes a demand event.
- Location Service: Tracks driver availability and publishes supply events.
- Aggregator: A scheduled job that queries the primary DB for demand and supply counts per cell every 60 seconds.
- Surge DB: Stores GeoCell records with demand, supply, and multiplier.
Request walkthrough:
- Rider submits a ride request; Booking Service inserts it into Surge DB with status pending.
- Driver sends a GPS update; Location Service updates driver status in Surge DB.
- Aggregator runs every 60 seconds, executing SELECT pickup_cell_id, COUNT(*) FROM ride_requests WHERE status = 'pending' GROUP BY pickup_cell_id and SELECT current_cell_id, COUNT(*) FROM drivers WHERE status = 'available' GROUP BY current_cell_id.
- Aggregator writes updated demand_count and supply_count into each GeoCell row.
- Pricing Service reads the GeoCell table when the Booking Service queries for the multiplier.
This covers demand detection and multiplier reads. The 60-second poll interval already violates the 2-second freshness NFR. The next section evolves the design to streaming aggregation.
2. Real-time aggregation with streaming events
A 60-second cron job cannot meet a 2-second freshness SLA. The fix is replacing the batch poll with a streaming event pipeline that maintains rolling counts in memory.
I always draw the naive cron-poll version on the whiteboard first and let the interviewer see its two failure modes before introducing Kafka. It shows you know why the streaming approach exists, not just that it exists.
The DB-poll approach has two failure modes. First, a COUNT(*) ... GROUP BY cell query across millions of ride-request and driver rows runs a full table scan; at peak it adds seconds of query latency before any multiplier update is written. Second, the 60-second window means a demand spike from a concert ending triggers no surge response for up to a minute, during which every rider sees a 1.0 multiplier and the system fails to clear the queue.
The fix is Kafka plus a stateful Aggregator Worker. The Booking Service publishes a RideRequested event to Kafka on every request. The Location Service publishes a DriverStatusChanged event on every status transition.
The Aggregator Worker maintains an in-memory counter map per cell, increments or decrements it on each event, and flushes to Redis every 30 seconds.
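The counter-and-flush loop described above can be sketched in a few lines. This is a minimal illustration, not the production worker: the class and method names are invented, a plain dict stands in for Redis, and the multiplier formula is the placeholder ratio clamp that the deep dives later refine.

```python
from collections import defaultdict

class AggregatorWorker:
    """Stateful aggregator sketch: per-cell in-memory counters,
    flushed to a key-value store on a timer. Names are illustrative."""

    def __init__(self, store):
        self.store = store              # dict stands in for Redis in this sketch
        self.demand = defaultdict(int)  # cell_id -> pending ride requests
        self.supply = defaultdict(int)  # cell_id -> available drivers

    def on_ride_requested(self, event):
        # Consumed from the ride.requests topic
        self.demand[event["pickup_cell_id"]] += 1

    def on_driver_status(self, event):
        # Consumed from the driver.status topic
        delta = 1 if event["status"] == "available" else -1
        cell = event["cell_id"]
        self.supply[cell] = max(0, self.supply[cell] + delta)

    def flush(self):
        # Called every 30 seconds; in production this is
        # SET surge:cell:{cell_id} {multiplier} EX 90 against Redis
        for cell_id in set(self.demand) | set(self.supply):
            self.store[f"surge:cell:{cell_id}"] = self.compute(
                self.demand[cell_id], self.supply[cell_id]
            )

    @staticmethod
    def compute(demand, supply):
        # Black-box placeholder: clamp the raw ratio to [1.0, 3.0]
        if supply == 0:
            return 3.0 if demand > 0 else 1.0
        return min(3.0, max(1.0, demand / supply))
```

Because Kafka delivers all events for a cell to one worker instance, these counters never need locks or cross-process coordination.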
Components:
- Kafka (ride.requests topic): Receives demand events from Booking Service. Partitioned by cell_id so all events for the same cell route to the same Aggregator Worker partition.
- Kafka (driver.status topic): Receives supply events from Location Service. Also partitioned by cell_id.
- Aggregator Worker: Consumes both topics, maintains HashMap<cell_id, (demand, supply)> in memory, and flushes updated cell states to Redis every 30 seconds.
- Pricing Service: Reads the multiplier directly from Redis on every Booking Service query. No DB read in the hot path.
- Redis (multiplier cache): Stores surge:cell:{cell_id} as a simple key-value with the serialized multiplier. TTL is 90 seconds as a safety net.
Request walkthrough:
- Rider submits a ride request; Booking Service publishes { request_id, cell_id } to ride.requests.
- Driver goes offline; Location Service publishes { driver_id, cell_id, status: offline } to driver.status.
- Aggregator Worker consumes events, updates in-memory counters: demand[cell_a]++, supply[cell_a]--.
- Every 30 seconds, Aggregator Worker computes multiplier = compute(demand, supply) per cell and writes SET surge:cell:{cell_id} {multiplier} EX 90 to Redis.
- Booking Service calls GET /pricing/surge?cell_id=X; Pricing Service executes GET surge:cell:{cell_id} from Redis and returns in under 1ms.
Kafka partitioning by cell_id is the key move. It ensures all events for a given cell arrive at the same Aggregator Worker instance, so the in-memory counter map never needs cross-process coordination. I would call this out explicitly on the whiteboard because it is the sentence that tells the interviewer you understand stateful stream processing. Treat the multiplier computation formula as a black box here; the deep dives cover it.
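Keyed partitioning is just a stable hash of the key modulo the partition count. The sketch below illustrates the property that matters here, namely that the same cell_id always lands on the same partition; it uses crc32 purely for demonstration, whereas Kafka's default Java partitioner actually hashes the key bytes with murmur2.

```python
import zlib

def partition_for(cell_id: str, num_partitions: int) -> int:
    """Illustrative keyed partitioner: any stable hash works, as long as
    the same cell_id always maps to the same partition (and therefore to
    the same Aggregator Worker instance in its consumer group)."""
    return zlib.crc32(cell_id.encode("utf-8")) % num_partitions
```

With this routing, each worker owns a disjoint set of cells, so the in-memory counter maps are exclusive by construction.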
3. Showing riders the surge multiplier before booking
The Pricing Service read path sits in the critical booking flow but must not add perceptible latency. The multiplier must be visible to the rider within 2 seconds of a demand spike.
Components:
- Booking Service: Calls GET /pricing/surge?cell_id=X before returning the price estimate to the rider UI.
- Pricing Service: A stateless read service that executes one Redis GET per request.
- Redis (multiplier cache): Pre-populated by the Aggregator Worker. The Pricing Service never writes to Redis.
Request walkthrough:
- Rider app requests a price estimate; Booking Service receives the call.
- Booking Service determines the rider's pickup cell_id (via H3 or geographic lookup).
- Booking Service calls Pricing Service: GET /pricing/surge?cell_id={cell_id}.
- Pricing Service runs GET surge:cell:{cell_id} against Redis.
- If the key is missing (cold start or TTL expired): Pricing Service returns { multiplier: 1.0, source: "default" } and logs the cache miss. This is the fail-open behavior required by the 99.99% availability NFR.
- Pricing Service returns { cell_id, multiplier, computed_at } to Booking Service.
- Booking Service multiplies the base fare by the multiplier and presents the final price to the rider.
The Pricing Service is intentionally stateless. It owns no data; it only translates a cell_id into a Redis key lookup. This means it can be horizontally scaled to any number of instances behind a load balancer with no coordination overhead.
Never put the multiplier computation inside the Pricing Service read path. Computing surge on every booking request would couple read latency to computation complexity and make the 50K reads/second SLA impossible to hold. Computation belongs exclusively in the Aggregator Worker, which runs asynchronously on its own schedule.
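The fail-open read path is small enough to sketch in full. This is an illustrative shape, not the real service: the function name is invented, and the Redis client is injected as a plain callable so the behavior (miss or error defaults to 1.0) is explicit.

```python
def lookup_multiplier(redis_get, cell_id, default=1.0):
    """Fail-open multiplier read, per the availability NFR: a cache miss
    or a Redis error returns the default 1.0 instead of blocking the
    booking. redis_get is any callable key -> value-or-None."""
    try:
        raw = redis_get(f"surge:cell:{cell_id}")
    except Exception:
        # Redis unreachable: degrade to base price, never block the booking
        return {"cell_id": cell_id, "multiplier": default, "source": "default"}
    if raw is None:
        # Cold start or TTL expired
        return {"cell_id": cell_id, "multiplier": default, "source": "default"}
    return {"cell_id": cell_id, "multiplier": float(raw), "source": "cache"}
```

The source field makes degraded responses observable, so a spike in "default" reads can page the on-call before it becomes a pricing incident.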
4. Automatic surge removal on rebalance
When drivers flood into a surge area, supply recovers, and the multiplier must return to 1.0 without manual intervention. The Aggregator Worker handles this on its next flush cycle.
The Aggregator Worker recomputes every active cell on every 30-second flush. When demand_count / supply_count drops below the deactivation threshold, it writes multiplier = 1.0 to Redis and inserts a SurgeEvent with new_multiplier = 1.0 to the audit log. The Pricing Service reads the updated value on the next request; no separate deactivation signal is needed.
Components:
- Aggregator Worker: Evaluates the deactivation condition on every flush cycle.
- Redis (multiplier cache): Updated to 1.0 when the condition clears.
- Surge DB: New SurgeEvent row written for auditing and ML training.
Request walkthrough:
- Drivers enter a surge cell; Location Service publishes DriverStatusChanged events with status: available.
- Aggregator Worker consumes events; supply[cell_a] increments until demand / supply < deactivation_threshold.
- On the next 30-second flush: SET surge:cell:{cell_a} 1.0 EX 90 overwrites the previous multiplier.
- Aggregator Worker writes SurgeEvent { old_multiplier: 1.8, new_multiplier: 1.0 } to Surge DB.
- All subsequent Pricing Service reads for cell_a return 1.0.
Surge removal is symmetric with surge activation. The same flush loop that raises multipliers also lowers them. I like to point out this symmetry explicitly because interviewers sometimes expect a separate deactivation service; showing that the same 30-second flush handles both directions demonstrates clean design. The only difference is the threshold comparison, which the EWMA deep dive covers in detail.
Potential Deep Dives
1. How do we partition geography into cells?
Every multiplier is scoped to a geo-cell. The cell definition determines surge granularity, shard distribution in Redis, and whether a high-demand street corner incorrectly inflates the multiplier for a quiet street two blocks away.
2. How do we compute the surge multiplier?
Given demand and supply counts for a cell, how do you translate them into a multiplier? The naive formula is trivial; the production formula prevents wild swings that undermine rider trust.
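As a starting point for that discussion, here is a sketch of the naive mapping. The cap, step size, and function name are assumptions for illustration; the real production formula layers smoothing and hysteresis on top of this.

```python
def naive_multiplier(demand: int, supply: int,
                     cap: float = 3.0, step: float = 0.1) -> float:
    """Naive demand/supply -> multiplier mapping (illustrative): clamp
    the raw ratio to [1.0, cap] and round to 0.1 steps so riders see
    1.3x, not 1.2847x. This is the formula that swings wildly without
    smoothing, which is exactly why the deep dive exists."""
    ratio = demand / supply if supply > 0 else cap
    clamped = min(cap, max(1.0, ratio))
    return round(round(clamped / step) * step, 2)
```

Rounding to coarse steps also limits how often the displayed price changes, which matters once SurgeEvent rows are only written on meaningful multiplier changes.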
3. How do we prevent surge oscillation?
Oscillation is the surge pricing failure mode that appears in production but not in staging. A cell enters and exits surge every 30-60 seconds, riders see the multiplier change every time they open the app, and trust collapses. The EWMA formula mitigates this but does not eliminate it without an explicit state machine.
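The three mitigations named in this document (EWMA with alpha=0.3, hysteresis at 1.3/1.1, and a 2-cycle confirmation) compose into a small per-cell state machine. The thresholds below come from the cheat sheet; the class and field names are illustrative, and the exact composition is one reasonable design, not the only one.

```python
class SurgeStateMachine:
    """Oscillation-damping sketch: EWMA smoothing of the demand/supply
    ratio, asymmetric activate/deactivate thresholds (hysteresis), and a
    consecutive-cycle confirmation before surge turns on."""

    def __init__(self, alpha=0.3, activate_at=1.3, deactivate_below=1.1,
                 cycles_required=2):
        self.alpha = alpha
        self.activate_at = activate_at
        self.deactivate_below = deactivate_below
        self.cycles_required = cycles_required
        self.ewma = 1.0        # smoothed demand/supply ratio
        self.hot_cycles = 0    # consecutive cycles at/above activation
        self.surging = False

    def observe(self, demand: int, supply: int) -> bool:
        """Feed one 30-second flush cycle; returns whether surge is active."""
        ratio = demand / supply if supply > 0 else self.activate_at
        self.ewma = self.alpha * ratio + (1 - self.alpha) * self.ewma

        if not self.surging:
            if self.ewma >= self.activate_at:
                self.hot_cycles += 1
                if self.hot_cycles >= self.cycles_required:
                    self.surging = True
            else:
                self.hot_cycles = 0  # one noisy reading cannot spike it
        elif self.ewma < self.deactivate_below:
            # Asymmetric exit: the 1.3 -> 1.1 gap is the hysteresis band
            self.surging = False
            self.hot_cycles = 0
        return self.surging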
4. How do we scale the Pricing Service to 50K reads per second?
The Pricing Service sits in the critical booking path. Every booking request triggers one multiplier read. At 50K bookings per second globally, the Pricing Service must return under 10ms without a database query in the hot path.
Final Architecture
The core insight is the push inversion: the Aggregator Worker pre-computes multipliers and pushes them to Redis every 30 seconds, so the Booking Service reads a pre-computed value in under 1ms with no coordination. The Pricing Service exits the critical booking path entirely and becomes an external-facing read API only. If I had to summarize this entire design in one sentence for a whiteboard, it would be: "the Aggregator pushes and the Booking Service reads; nothing in the booking path computes."
Interview Cheat Sheet
- State the freshness SLA (2-second visibility, 30-second recompute interval) before discussing any architecture; it eliminates the DB-poll approach immediately.
- Use H3 at resolution 7 for geo-cells: hexagons tile the sphere without distortion, and each cell is a separate Redis key that shards naturally.
- Apply k=1 ring expansion when computing demand and supply per cell; it prevents cliff-edge effects at cell boundaries.
- The Kafka topics must be partitioned by cell_id so all events for a given cell route to the same Aggregator Worker instance and the in-memory counter map needs no cross-process coordination.
- EWMA with alpha=0.3 gives recent readings 30% weight and prior history 70% weight, smoothing transient noise without lagging genuine surges.
- Hysteresis prevents oscillation: activate surge at ratio 1.3, deactivate only when ratio drops below 1.1. The asymmetric gap is the key.
- Require 2 consecutive flush cycles above the activation threshold before triggering surge; one noisy reading cannot spike the multiplier.
- The Booking Service must read from Redis directly, not via an HTTP call to the Pricing Service; the extra network hop is 5-20ms and is unacceptable in the booking critical path.
- Redis handles 100K+ simple GET operations per second per node; a 3-node cluster comfortably serves 50K booking reads per second at under 1ms p99.
- The Pricing Service is fail-open: if Redis is unreachable, return multiplier 1.0 and log the miss. A blocked booking is worse than a temporarily under-priced one.
- Write SurgeEvent records only when the multiplier changes by 0.1 or more, not on every flush cycle; this keeps the audit log from growing at two rows per active cell per minute.
- Treat EWMA alpha and hysteresis thresholds as market-specific config, not code constants; surge dynamics in Manhattan differ from those at an airport pickup lane.
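The last point can be made concrete with a config-as-data sketch. The field names, market keys, and override values here are hypothetical; the point is that tuning parameters ship as per-market configuration with sane defaults, not as code constants.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SurgeConfig:
    """Market-tunable surge parameters. Defaults mirror the values used
    throughout this design; everything here is illustrative."""
    ewma_alpha: float = 0.3
    activation_ratio: float = 1.3
    deactivation_ratio: float = 1.1
    confirmation_cycles: int = 2
    max_multiplier: float = 3.0

# Hypothetical per-market overrides: an airport pickup lane can tolerate
# a faster-reacting surge than a dense downtown grid.
MARKET_OVERRIDES = {
    "nyc-manhattan": SurgeConfig(ewma_alpha=0.2, confirmation_cycles=3),
    "sfo-airport": SurgeConfig(ewma_alpha=0.5, max_multiplier=4.0),
}

def config_for(market: str) -> SurgeConfig:
    # Fall back to global defaults for markets without explicit tuning
    return MARKET_OVERRIDES.get(market, SurgeConfig())
```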