ETA Service
Design an ETA service for a ride-sharing app that computes accurate travel time estimates in real time, using Contraction Hierarchies, a GPS probe pipeline, and SSE push updates at 100M requests per minute.
What is an ETA service?
An ETA service takes two coordinates (pickup and dropoff) and returns the time a driver will arrive. The apparent simplicity collapses fast: your road network graph has millions of edges, traffic conditions change every minute, and you need a fresh answer in under 200ms for every one of 100 million requests per minute.
This is a strong interview question because it tests graph algorithms, real-time stream processing, geospatial data structures, and push notification design all inside one system.
Functional Requirements
Core Requirements
- Given a pickup and dropoff location, return an estimated travel time.
- ETA updates continuously as the driver moves (every 5-15 seconds).
- Live traffic conditions and road closures are incorporated into the estimate.
- ETA recalculates automatically when the driver deviates from the predicted route.
Below the Line
- Turn-by-turn navigation instructions.
- Multi-modal routing (transit, walking, cycling).
- Traffic incident reporting from user-submitted data.
- Historical ETA accuracy reports and analytics dashboards.
The hardest part in scope: Computing a route ETA on a real-world road network in under 200ms requires pre-processed hierarchical graph structures and real-time traffic overlays. We will dedicate an entire deep dive to the routing algorithm choice.
Turn-by-turn navigation is below the line because it requires storing and serving a fully rendered instruction set per maneuver, which involves 10-50x more data than a single ETA number plus a separate rendering pipeline. To add it later, we would compute the full path during the routing step and serialize the maneuver sequence alongside the ETA result.
Multi-modal routing is below the line because each transport mode (bus, subway, walking) uses a different graph topology and scheduling model. Combining them requires a transit schedule database and a separate graph search pass.
Traffic incident reporting is below the line because it requires a moderation pipeline to reject false reports and a classification model to distinguish genuine accidents from normal congestion signals. It does not affect the core routing or push architecture we are designing here.
Historical ETA accuracy dashboards are below the line because they require a separate OLAP store and aggregation pipeline for post-trip analysis. They do not affect real-time latency or correctness.
Non-Functional Requirements
Core Requirements
- Latency: Initial ETA returned in under 200ms p99. Live ETA updates reflected within 2 seconds of a driver location ping.
- Freshness: Traffic conditions reflected within 30 seconds of a real-world change (an accident slowing a corridor, a road closure).
- Scale: 10M concurrent active rides. 100M ETA requests per minute (roughly 1.67M per second).
- Availability: 99.99% uptime. Availability over consistency. A 30-second stale ETA is acceptable; a 503 error is not.
- Accuracy: Average ETA error within 10% of actual travel time at the 80th percentile.
Under 200ms latency for route computation means we cannot run a naive Dijkstra search on a full city-scale road network at request time. New York's road network has roughly 370,000 nodes and 950,000 edges, and full Dijkstra with no preprocessing takes 100-500ms on that graph. Pre-processing is mandatory.
99.99% availability over consistency means we tolerate slightly stale traffic data before accuracy degrades meaningfully. A driver going 5 mph instead of 10 mph on one segment changes that segment's contribution by 1-2 minutes on a 20-minute trip overall.
Below the Line
- Sub-second global traffic update propagation.
- ML-based historical pattern modeling (time-of-day, day-of-week speed adjustments).
Sub-second global propagation is deferred because our freshness NFR is 30 seconds, not 1 second. Achieving sub-second would require per-ping streaming writes with a very different Redis write pattern and a much tighter Flink window, significantly increasing infrastructure cost for marginal accuracy gain.
ML-based historical modeling is deferred because it requires an offline training pipeline and inference infrastructure. The static CH speed overlay already satisfies the 10% accuracy NFR without it.
Read/write ratio: For every 1 driver GPS ping written, expect 10-15 ETA reads (rider app updates, internal recalculation triggers, pricing engine queries). The location ingestion layer is write-heavy at millions of pings per second. The ETA query layer is read-heavy. These two paths need independent scaling or the write pipeline saturates the routing nodes.
I'd call out this separation immediately in an interview because it's the architectural decision that shapes everything else. The moment you couple the write-heavy ingest path with the read-heavy query path, you've built a system that can't scale either one independently.
Core Entities
- Route: A computed path between two geographic coordinates, represented as an ordered list of road segment IDs with total distance and initial ETA.
- RoadSegment: An edge in the road network graph. Stores static geometry (start node, end node, distance in meters) plus a dynamic field for current travel speed updated from the traffic ingestion pipeline.
- DriverLocation: Real-time position snapshot (driver_id, latitude, longitude, speed, heading, timestamp). Written every 5-15 seconds per active driver.
- ETAResult: The current estimated arrival for a trip (trip_id, eta_seconds, route_id, computed_at). Cached and refreshed on each meaningful location update.
- TrafficSnapshot: Aggregated speed data per road segment, derived from GPS probe vehicles and third-party data feeds. Applied as an overlay on RoadSegment edge weights.
We will define detailed schema (indexes, partition keys, denormalization decisions) in the deep dives. The entities above are enough to reason about the API and data flow.
API Design
Start with one endpoint per core functional requirement.
FR 1 - Compute an initial ETA:
POST /eta
Body: { origin: { lat, lng }, destination: { lat, lng } }
Response: { eta_seconds: 1140, route_id: "rt_abc123", estimated_arrival_at: "2026-04-02T10:23:00Z" }
POST because we are computing a new resource (a route), not fetching a pre-existing one. The route_id in the response lets the client subscribe to updates without re-sending coordinates on every subsequent call.
FR 2 - Get current ETA for an active trip:
GET /trips/{trip_id}/eta
Response: { eta_seconds: 840, driver_location: { lat: 37.78, lng: -122.41 }, updated_at: "2026-04-02T10:14:30Z" }
The rider app calls this on each polling cycle. The updated_at field lets the client detect stale responses if the driver location pipeline falls behind.
FR 3 - Driver pushes a location update:
POST /drivers/{driver_id}/location
Body: { lat, lng, speed, heading, timestamp }
Response: 204 No Content
Every ping triggers ETA recomputation if the driver has progressed significantly (more than 100 meters or 30 seconds since the last trigger). We detail the threshold logic in the deep dives.
FR 4 - Subscribe to live ETA updates via SSE:
GET /trips/{trip_id}/eta/stream
Response: text/event-stream
data: { eta_seconds: 820, updated_at: "..." }
data: { eta_seconds: 790, updated_at: "..." }
The rider app opens this persistent connection once. The server pushes each new ETAResult over the stream as the driver moves. We detail the push mechanism in Deep Dive 3.
Authentication is out of scope for the core design. In production, all endpoints require a signed JWT. The driver location endpoint additionally validates that the driver_id in the JWT matches the URL parameter to prevent location spoofing.
High-Level Design
1. Return an ETA given a pickup and dropoff location
The core path: the client submits two coordinates, the ETA Service computes a route on the road network, and returns a travel time estimate.
Components:
- Client: Rider or driver app sending
POST /etawith origin and destination. - ETA Service: Receives the request, finds the nearest road network nodes to the coordinates, calls the Routing Engine, and returns the result.
- Routing Engine: Runs the graph search algorithm against the road network. I treat this as a black box here and detail the algorithm choice in Deep Dive 1.
- Road Network DB: Stores the graph (nodes are intersections, edges are road segments with distance and speed). Read-heavy, infrequently updated.
Request walkthrough:
- Client sends
POST /etawith origin and destination coordinates. - ETA Service finds the nearest graph nodes to both coordinates using a geospatial index.
- ETA Service calls Routing Engine with the source node and destination node.
- Routing Engine runs the shortest-path algorithm and returns an ordered list of segment IDs plus total travel time.
- ETA Service caches the result keyed by
trip_id. - ETA Service returns
eta_secondsandroute_idto the client.
This diagram shows the initial ETA computation only. The traffic ingestion pipeline and real-time update loop come in the next two requirements.
2. Incorporate live traffic conditions
A static road network gives wrong answers fast. A corridor that normally flows at 35 mph drops to 5 mph during an accident. Without live traffic, every ETA on that corridor stays wrong by 6-7x until the jam clears. I've seen this exact failure mode in production: a static graph returns a confident 8-minute ETA while the rider watches the driver sit in gridlock for 40 minutes.
The fix is a Traffic Ingestion Pipeline that continuously collects GPS speed data from driver probe vehicles, aggregates it per road segment, and updates edge weights in the Road Network DB. Every driver in the fleet is already a traffic sensor.
Components added:
- Location Ingestor: A stateless service that receives driver pings and publishes them to Kafka. It does not write to the DB directly (that would become a write bottleneck at millions of pings per second).
- Kafka (driver-locations topic): Decouples the write path (driver pings) from the processing path. Pings are published at ingestion rate and consumed at the rate the Traffic Processor can sustain.
- Traffic Processor: A stream processing job (Flink or Kafka Streams) that consumes the location stream, groups pings by road segment, computes a 5-minute rolling average speed per segment, and writes the updated speed to the Road Network DB.
Request walkthrough (traffic update path):
- Driver app sends
POST /drivers/{driver_id}/locationevery 10 seconds. - Location Ingestor validates and publishes the ping to the
driver-locationsKafka topic. - Traffic Processor consumes from the topic, matches each ping to a road segment using the ping coordinates, and adds it to a 30-second tumbling window per segment.
- At the end of each window, Traffic Processor writes the updated median speed to the RoadSegment edge in the Road Network DB.
- The next ETA query for a route traversing that segment reads the fresh speed and returns a corrected estimate.
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.