ETA Service
Design an ETA service for a ride-sharing app that computes accurate travel time estimates in real time, using Contraction Hierarchies, a GPS probe pipeline, and SSE push updates, while sustaining 100M requests per minute.
What is an ETA service?
An ETA service takes two coordinates (pickup and dropoff) and returns the time a driver will arrive. The apparent simplicity collapses fast: your road network graph has millions of edges, traffic conditions change every minute, and you need a fresh answer in under 200ms for every one of 100 million requests per minute.
This is a strong interview question because it tests graph algorithms, real-time stream processing, geospatial data structures, and push notification design all inside one system.
Functional Requirements
Core Requirements
- Given a pickup and dropoff location, return an estimated travel time.
- ETA updates continuously as the driver moves (every 5-15 seconds).
- Live traffic conditions and road closures are incorporated into the estimate.
- ETA recalculates automatically when the driver deviates from the predicted route.
Below the Line
- Turn-by-turn navigation instructions.
- Multi-modal routing (transit, walking, cycling).
- Traffic incident reporting from user-submitted data.
- Historical ETA accuracy reports and analytics dashboards.
The hardest part in scope: Computing a route ETA on a real-world road network in under 200ms requires pre-processed hierarchical graph structures and real-time traffic overlays. We will dedicate an entire deep dive to the routing algorithm choice.
Turn-by-turn navigation is below the line because it requires storing and serving a fully rendered instruction set per maneuver, which involves 10-50x more data than a single ETA number plus a separate rendering pipeline. To add it later, we would compute the full path during the routing step and serialize the maneuver sequence alongside the ETA result.
Multi-modal routing is below the line because each transport mode (bus, subway, walking) uses a different graph topology and scheduling model. Combining them requires a transit schedule database and a separate graph search pass.
Traffic incident reporting is below the line because it requires a moderation pipeline to reject false reports and a classification model to distinguish genuine accidents from normal congestion signals. It does not affect the core routing or push architecture we are designing here.
Historical ETA accuracy dashboards are below the line because they require a separate OLAP store and aggregation pipeline for post-trip analysis. They do not affect real-time latency or correctness.
Non-Functional Requirements
Core Requirements
- Latency: Initial ETA returned in under 200ms p99. Live ETA updates reflected within 2 seconds of a driver location ping.
- Freshness: Traffic conditions reflected within 30 seconds of a real-world change (an accident slowing a corridor, a road closure).
- Scale: 10M concurrent active rides. 100M ETA requests per minute (roughly 1.67M per second).
- Availability: 99.99% uptime. Availability over consistency. A 30-second stale ETA is acceptable; a 503 error is not.
- Accuracy: ETA error within 10% of actual travel time at the 80th percentile.
Under 200ms latency for route computation means we cannot run a naive Dijkstra search on a full city-scale road network at request time. New York's road network has roughly 370,000 nodes and 950,000 edges, and full Dijkstra with no preprocessing takes 100-500ms on that graph. Pre-processing is mandatory.
Choosing availability over consistency means we serve slightly stale traffic data rather than fail the request, and staleness costs little accuracy: a driver going 5 mph instead of 10 mph on one segment changes that segment's contribution by only 1-2 minutes on a 20-minute trip.
Below the Line
- Sub-second global traffic update propagation.
- ML-based historical pattern modeling (time-of-day, day-of-week speed adjustments).
Sub-second global propagation is deferred because our freshness NFR is 30 seconds, not 1 second. Achieving sub-second would require per-ping streaming writes with a very different Redis write pattern and a much tighter Flink window, significantly increasing infrastructure cost for marginal accuracy gain.
ML-based historical modeling is deferred because it requires an offline training pipeline and inference infrastructure. The static CH speed overlay already satisfies the 10% accuracy NFR without it.
Read/write ratio: For every 1 driver GPS ping written, expect 10-15 ETA reads (rider app updates, internal recalculation triggers, pricing engine queries). The location ingestion layer is write-heavy at millions of pings per second. The ETA query layer is read-heavy. These two paths need independent scaling or the write pipeline saturates the routing nodes.
I'd call out this separation immediately in an interview because it's the architectural decision that shapes everything else. The moment you couple the write-heavy ingest path with the read-heavy query path, you've built a system that can't scale either one independently.
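As a quick sanity check, the stated NFRs pin down the load on each path. A minimal sketch (assuming a 10-second ping interval, the middle of the 5-15s range):

```python
# Back-of-envelope load figures for the two independent paths.
ACTIVE_RIDES = 10_000_000          # concurrent rides (NFR)
PING_INTERVAL_S = 10               # assumed midpoint of the 5-15s ping range
ETA_REQUESTS_PER_MIN = 100_000_000 # stated global ETA request rate

ping_writes_per_s = ACTIVE_RIDES / PING_INTERVAL_S   # ingest (write) path
eta_reads_per_s = ETA_REQUESTS_PER_MIN / 60          # query (read) path

print(f"ingest path: {ping_writes_per_s:,.0f} pings/s")       # 1,000,000
print(f"query path:  {eta_reads_per_s:,.0f} ETA requests/s")  # ~1,666,667
```

A million writes per second and well over a million reads per second clearly cannot share one datastore or one service tier, which is exactly the separation argued above.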
Core Entities
- Route: A computed path between two geographic coordinates, represented as an ordered list of road segment IDs with total distance and initial ETA.
- RoadSegment: An edge in the road network graph. Stores static geometry (start node, end node, distance in meters) plus a dynamic field for current travel speed updated from the traffic ingestion pipeline.
- DriverLocation: Real-time position snapshot (driver_id, latitude, longitude, speed, heading, timestamp). Written every 5-15 seconds per active driver.
- ETAResult: The current estimated arrival for a trip (trip_id, eta_seconds, route_id, computed_at). Cached and refreshed on each meaningful location update.
- TrafficSnapshot: Aggregated speed data per road segment, derived from GPS probe vehicles and third-party data feeds. Applied as an overlay on RoadSegment edge weights.
We will define detailed schema (indexes, partition keys, denormalization decisions) in the deep dives. The entities above are enough to reason about the API and data flow.
API Design
Start with one endpoint per core functional requirement.
FR 1 - Compute an initial ETA:
POST /eta
Body: { origin: { lat, lng }, destination: { lat, lng } }
Response: { eta_seconds: 1140, route_id: "rt_abc123", estimated_arrival_at: "2026-04-02T10:23:00Z" }
POST because we are computing a new resource (a route), not fetching a pre-existing one. The route_id in the response lets the client subscribe to updates without re-sending coordinates on every subsequent call.
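The response shape can be pinned down with a small sketch; `build_eta_response` is a hypothetical helper showing that the server derives `estimated_arrival_at` from its own clock so clients never mix in their local time:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class EtaResponse:
    eta_seconds: int
    route_id: str
    estimated_arrival_at: str  # ISO 8601, UTC

def build_eta_response(eta_seconds: int, route_id: str, now: datetime) -> EtaResponse:
    # Server-side clock is authoritative; the client only displays the result.
    arrival = now + timedelta(seconds=eta_seconds)
    return EtaResponse(eta_seconds, route_id, arrival.strftime("%Y-%m-%dT%H:%M:%SZ"))

resp = build_eta_response(1140, "rt_abc123",
                          datetime(2026, 4, 2, 10, 4, tzinfo=timezone.utc))
# resp.estimated_arrival_at == "2026-04-02T10:23:00Z"
```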
FR 2 - Get current ETA for an active trip:
GET /trips/{trip_id}/eta
Response: { eta_seconds: 840, driver_location: { lat: 37.78, lng: -122.41 }, updated_at: "2026-04-02T10:14:30Z" }
The rider app calls this on each polling cycle. The updated_at field lets the client detect stale responses if the driver location pipeline falls behind.
FR 3 - Driver pushes a location update:
POST /drivers/{driver_id}/location
Body: { lat, lng, speed, heading, timestamp }
Response: 204 No Content
Every ping triggers ETA recomputation if the driver has progressed significantly (more than 100 meters or 30 seconds since the last trigger). We detail the threshold logic in the deep dives.
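The threshold check described above can be sketched as a pure function; this is an illustrative implementation (haversine distance plus an elapsed-time check), not a specific production algorithm:

```python
import math

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two WGS84 points."""
    r = 6_371_000
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lng2 - lng1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def should_recalculate(last, current, min_meters=100, min_seconds=30):
    """Fire a recalc when the driver moved far enough OR enough time passed."""
    moved = haversine_m(last["lat"], last["lng"], current["lat"], current["lng"])
    elapsed = current["ts"] - last["ts"]
    return moved >= min_meters or elapsed >= min_seconds
```

Keeping this as a side-effect-free predicate makes the 100m/30s thresholds trivially tunable and testable.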
FR 4 - Subscribe to live ETA updates via SSE:
GET /trips/{trip_id}/eta/stream
Response: text/event-stream
data: { eta_seconds: 820, updated_at: "..." }
data: { eta_seconds: 790, updated_at: "..." }
The rider app opens this persistent connection once. The server pushes each new ETAResult over the stream as the driver moves. We detail the push mechanism in Deep Dive 3.
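On the client side, consuming the stream reduces to parsing `data:` lines from the `text/event-stream` body. A minimal sketch of that parsing (a real client would also handle `id:`, `retry:`, and comment lines per the SSE spec):

```python
import json

def parse_sse(stream_lines):
    """Yield decoded JSON payloads from a text/event-stream line iterator.

    Minimal SSE subset: 'data:' lines accumulate the payload,
    a blank line terminates the event.
    """
    buf = []
    for line in stream_lines:
        if line.startswith("data:"):
            buf.append(line[len("data:"):].strip())
        elif line == "" and buf:
            yield json.loads("\n".join(buf))
            buf = []

raw = [
    'data: {"eta_seconds": 820, "updated_at": "2026-04-02T10:15:00Z"}',
    "",
    'data: {"eta_seconds": 790, "updated_at": "2026-04-02T10:15:20Z"}',
    "",
]
events = list(parse_sse(raw))  # two ETA updates, in order
```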
Authentication is out of scope for the core design. In production, all endpoints require a signed JWT. The driver location endpoint additionally validates that the driver_id in the JWT matches the URL parameter to prevent location spoofing.
High-Level Design
1. Return an ETA given a pickup and dropoff location
The core path: the client submits two coordinates, the ETA Service computes a route on the road network, and returns a travel time estimate.
Components:
- Client: Rider or driver app sending POST /eta with origin and destination.
- ETA Service: Receives the request, finds the nearest road network nodes to the coordinates, calls the Routing Engine, and returns the result.
- Routing Engine: Runs the graph search algorithm against the road network. I treat this as a black box here and detail the algorithm choice in Deep Dive 1.
- Road Network DB: Stores the graph (nodes are intersections, edges are road segments with distance and speed). Read-heavy, infrequently updated.
Request walkthrough:
- Client sends POST /eta with origin and destination coordinates.
- ETA Service finds the nearest graph nodes to both coordinates using a geospatial index.
- ETA Service calls Routing Engine with the source node and destination node.
- Routing Engine runs the shortest-path algorithm and returns an ordered list of segment IDs plus total travel time.
- ETA Service caches the result keyed by trip_id.
- ETA Service returns eta_seconds and route_id to the client.
This diagram shows the initial ETA computation only. The traffic ingestion pipeline and real-time update loop come in the next two requirements.
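The walkthrough above can be sketched as one orchestration function. The dependencies (`geo_index`, `routing_engine`, `cache`) are stand-ins for the real components; only the call sequence mirrors the design:

```python
import uuid

def compute_eta(origin, destination, geo_index, routing_engine, cache, trip_id):
    """Orchestrate the initial ETA path: snap -> route -> cache -> respond."""
    src = geo_index.nearest_node(origin)            # step 2: snap to graph
    dst = geo_index.nearest_node(destination)
    segments, eta_seconds = routing_engine.shortest_path(src, dst)  # steps 3-4
    route_id = f"rt_{uuid.uuid4().hex[:8]}"
    cache.set(trip_id, {"eta_seconds": eta_seconds,  # step 5: cache by trip_id
                        "route_id": route_id,
                        "segments": segments})
    return {"eta_seconds": eta_seconds, "route_id": route_id}  # step 6
```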
2. Incorporate live traffic conditions
A static road network gives wrong answers fast. A corridor that normally flows at 35 mph drops to 5 mph during an accident. Without live traffic, every ETA on that corridor stays wrong by 6-7x until the jam clears. I've seen this exact failure mode in production: a static graph returns a confident 8-minute ETA while the rider watches the driver sit in gridlock for 40 minutes.
The fix is a Traffic Ingestion Pipeline that continuously collects GPS speed data from driver probe vehicles, aggregates it per road segment, and updates edge weights in the Road Network DB. Every driver in the fleet is already a traffic sensor.
Components added:
- Location Ingestor: A stateless service that receives driver pings and publishes them to Kafka. It does not write to the DB directly (that would become a write bottleneck at millions of pings per second).
- Kafka (driver-locations topic): Decouples the write path (driver pings) from the processing path. Pings are published at ingestion rate and consumed at the rate the Traffic Processor can sustain.
- Traffic Processor: A stream processing job (Flink or Kafka Streams) that consumes the location stream, groups pings by road segment, computes an aggregate speed per segment over short windows (a median over 30-second tumbling windows in the walkthrough below), and writes the updated speed to the Road Network DB.
Request walkthrough (traffic update path):
- Driver app sends POST /drivers/{driver_id}/location every 10 seconds.
- Location Ingestor validates and publishes the ping to the driver-locations Kafka topic.
- Traffic Processor consumes from the topic, matches each ping to a road segment using the ping coordinates, and adds it to a 30-second tumbling window per segment.
- At the end of each window, Traffic Processor writes the updated median speed to the RoadSegment edge in the Road Network DB.
- The next ETA query for a route traversing that segment reads the fresh speed and returns a corrected estimate.
The Routing Engine now reads live edge weights, not static ones. The Traffic Processor maintains a 30-second lag from real-world conditions, which meets our freshness NFR.
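The per-segment windowing can be sketched in plain Python as a toy stand-in for the Flink job (names and the in-memory bucket map are illustrative; the real job keys state by segment and emits on window close):

```python
from collections import defaultdict
from statistics import median

class SegmentSpeedWindow:
    """Toy tumbling-window aggregator: median speed per segment per 30s window."""

    def __init__(self, window_s=30):
        self.window_s = window_s
        # (segment_id, window_start) -> list of observed speeds in m/s
        self.buckets = defaultdict(list)

    def add(self, segment_id, speed_mps, ts):
        """Bucket a map-matched ping into its tumbling window."""
        window_start = ts - (ts % self.window_s)
        self.buckets[(segment_id, window_start)].append(speed_mps)

    def flush(self, now):
        """Emit {segment_id: median_speed} for every window that has closed."""
        out = {}
        for (seg, start), speeds in list(self.buckets.items()):
            if start + self.window_s <= now:
                out[seg] = median(speeds)
                del self.buckets[(seg, start)]
        return out
```

In the real pipeline, `flush` output is what the Traffic Processor writes downstream every 30 seconds, which is what bounds traffic staleness at the freshness NFR.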
3. Continuous ETA updates as the driver moves
A static ETA computed at trip start becomes incorrect 3 minutes in. The rider needs to see the estimate change as the driver progresses (or gets stuck in traffic).
The naive approach is HTTP polling: the rider app calls GET /trips/{trip_id}/eta every 5 seconds. This works at small scale but generates 10M concurrent rides x 1 request/5 seconds = 2M requests/second, all hitting the ETA Service even when the ETA has not changed.
The key insight is to separate the recalculation trigger from the rider read path. Only recompute when the driver moves a meaningful threshold (100+ meters since last recalculation). Store the latest ETAResult in Redis so rider reads hit the cache directly, not the routing engine. This is one of those design choices I'd highlight on the whiteboard because it reduces routing engine load by 6x compared to the polling approach.
Components added:
- ETA Cache (Redis): Stores the latest ETAResult per trip_id. Rider app reads come here. Updated only when a recalculation trigger fires.
- Recalculation Trigger: Logic inside the Location Ingestor. If the driver has moved 100+ meters since the last recalculation, publish a RecalcRequired event to a second Kafka topic.
- ETA Updater: Consumes RecalcRequired events, calls the Routing Engine with the driver's current position as the new origin, and writes the fresh ETAResult to Redis and to the ETAResult store.
Request walkthrough (live update cycle):
- Driver sends a location ping. Location Ingestor publishes to Kafka and checks the threshold.
- If the threshold is exceeded, Location Ingestor publishes RecalcRequired { trip_id, current_lat, current_lng } to the recalc topic.
- ETA Updater consumes the event and calls Routing Engine with the current driver position and the original destination.
- Routing Engine returns the updated travel time.
- ETA Updater writes the new ETAResult to Redis (and persists it to the audit store).
- Rider app polls GET /trips/{trip_id}/eta, hits Redis, and gets the latest value in under 1ms.
The rider app reads exclusively from Redis. The Routing Engine is triggered only when the driver moves, not on every rider poll. At 10M concurrent rides and one meaningful location change per 30 seconds per driver, the routing engine handles roughly 333K requests per second instead of 2M.
4. ETA recalculates when the driver deviates
GPS drift, missed turns, and traffic diversions mean drivers sometimes leave the computed route. The ETA for the old route becomes invalid when this happens. The system needs to detect the deviation and trigger a full reroute.
Components added:
- Route Adherence Checker: Logic inside the Location Ingestor alongside the threshold check. For each driver ping, it checks whether the current position is within a 50-meter tolerance band of the expected position on the active route. If outside the band, it publishes a RerouteRequired event.
- Reroute Handler: Consumes RerouteRequired events. It calls the Routing Engine with the driver's current position as origin (not the original pickup) and the original destination, then writes the new route and ETAResult to the cache.
A hard 50-meter tolerance causes false reroutes on curved roads and in areas with poor GPS accuracy (smartphone GPS has 20-50m natural drift). Production implementations (Uber, Lyft) use a dynamic tolerance based on road curvature and reported GPS accuracy. A flat threshold is a reasonable starting point for the interview but flag this trade-off when the interviewer probes the deviation detection design.
The reroute event flow reuses the same Kafka topics and ETA Updater infrastructure already in place. The only new logic is the deviation detection algorithm running inside the Location Ingestor per ping.
Deviation detection and normal progress recalculation use the same downstream pipeline. The only code that changes is the classification inside the Location Ingestor: a RecalcRequired event means progress on the current route, while a RerouteRequired event means the ETA Updater must call the Routing Engine with no route context (full reroute from scratch).
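The flat-threshold deviation check amounts to a point-to-polyline distance test. A minimal sketch, using a local equirectangular projection (accurate enough at street scale) rather than a full geodesic computation:

```python
import math

def point_to_segment_m(p, a, b):
    """Approximate distance in meters from point p to segment a-b.

    Points are (lat, lng) tuples in degrees; a local equirectangular
    projection turns them into planar meters before the usual
    point-to-segment projection.
    """
    m_per_deg = 111_320
    cos_lat = math.cos(math.radians(p[0]))

    def proj(q):
        return (q[1] * m_per_deg * cos_lat, q[0] * m_per_deg)

    px, py = proj(p)
    ax, ay = proj(a)
    bx, by = proj(b)
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Clamp the projection parameter so we measure to the segment, not the line.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def off_route(position, polyline, tolerance_m=50):
    """True when the ping is outside the tolerance band of every route segment."""
    return all(point_to_segment_m(position, a, b) > tolerance_m
               for a, b in zip(polyline, polyline[1:]))
```

A production checker would only test segments near the driver's expected progress rather than the whole polyline, and would widen `tolerance_m` using reported GPS accuracy, per the trade-off above.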
Potential Deep Dives
1. How do we compute a route and ETA efficiently?
Three constraints shape the algorithm choice:
- The road network graph for a single major city has hundreds of thousands of nodes and millions of edges.
- Each request must return in under 200ms p99.
- Traffic updates change edge weights every 30 seconds, so we cannot fully pre-compute all paths offline.
I'd set up these three constraints at the start of this deep dive because each one eliminates a candidate algorithm. The intersection of all three is what forces you toward Contraction Hierarchies.
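The third constraint, in particular, forces traffic to be applied as a query-time overlay rather than baked into preprocessing. A minimal sketch of that edge-weight function (field names and the speed floor are illustrative):

```python
def edge_travel_seconds(edge, overlay, min_speed_mps=1.0):
    """Query-time edge weight: static geometry, live speed.

    edge carries static fields (length_m, freeflow_mps); overlay is the live
    per-segment speed map written by the traffic pipeline. The preprocessing
    step never sees the overlay, so traffic updates don't require re-running it.
    """
    speed = overlay.get(edge["segment_id"], edge["freeflow_mps"])
    # Floor the speed so a stopped probe vehicle can't produce infinite weight.
    return edge["length_m"] / max(speed, min_speed_mps)
```

Every candidate algorithm in this deep dive is evaluated against this interface: the weights it searches over change every 30 seconds, but the graph topology does not.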
2. How do we ingest live traffic at scale?
Constraints:
- 10M active drivers emit location pings every 10 seconds = 1M pings per second at peak.
- Each ping must be matched to a road segment to update that segment's speed.
- Segment speed updates must propagate to routing servers within 30 seconds.
3. How do we push ETA updates to the rider with low latency?
Constraints:
- 10M concurrent riders each need updates every 5-15 seconds.
- Each update reflects the driver's latest position.
- The update must reach the rider within 2 seconds of the driver ping (not 30 seconds).
4. How do we scale the routing computation to 1.67M requests per second?
Constraints:
- 1.67M route computation requests per second globally.
- Each CH query takes 1-10ms on a single core.
- A full global road network does not fit in memory on any single server.
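These constraints translate directly into a cluster size. A back-of-envelope sketch, assuming a mid-range 5ms query cost, 32-core servers, and 50% target utilization (all three are illustrative assumptions, not stated requirements):

```python
import math

QPS = 1_670_000            # global routing requests per second (from the NFRs)
MS_PER_QUERY = 5           # assumed midpoint of the 1-10ms CH query range
CORES_PER_SERVER = 32      # assumed server size
TARGET_UTILIZATION = 0.5   # headroom for peaks and jitter

queries_per_core = 1000 / MS_PER_QUERY                     # 200 queries/s/core
cores_needed = QPS / (queries_per_core * TARGET_UTILIZATION)
servers = math.ceil(cores_needed / CORES_PER_SERVER)

print(f"{cores_needed:,.0f} cores -> {servers} servers")   # 16,700 cores -> 522 servers
```

Hundreds of servers is the scale at which the geographic sharding discussed below stops being an optimization and becomes the only way to keep each server's graph in memory.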
Final Architecture
The Routing Cluster and Traffic Processor define this system's two independent scalability limits. Routing is a pure CPU-bound problem: CH queries run entirely in-memory with no DB I/O, and adding routing servers is linear scale-out with no shared mutable state. The Traffic Processor is constrained by the 30-second freshness NFR; its Flink cluster scales by adding Kafka partitions and Flink task slots without any application code changes. In my experience, interviewers love when you identify these two independent scaling axes explicitly because it shows you understand where the system's real bottlenecks live.
Interview Cheat Sheet
- State the read/write asymmetry early: 10 ETA reads per driver location write. The ingestion path and query path scale differently and must be architecturally separated from the start.
- Contraction Hierarchies reduce a full-graph Dijkstra (100-500ms) to a bidirectional upward search (1-10ms) by contracting low-importance nodes offline. The 10-50x speedup is what makes sub-200ms routing possible at this scale.
- CH preprocessing reflects graph topology only, not traffic. Apply dynamic traffic as a speed overlay at query time. This means traffic updates never require re-running the expensive preprocessing step. This is the single most important fact to state early because it's the reason CH works with live traffic at all.
- Every driver in the fleet is a real-time traffic sensor. GPS probe vehicles are the ground truth source of segment speeds. Third-party feeds (TomTom, HERE) supplement for low-density roads.
- Use a Hidden Markov Model map matcher to assign GPS pings to road segments. Nearest-node lookup misattributes 10-20% of pings to parallel roads (highway vs. off-ramp). HMM uses speed and heading to disambiguate.
- Write traffic aggregation results to a Redis speed overlay, not the Road Network DB. Redis handles 1M+ SET ops per second. The main DB is never on the live traffic write path.
- Only trigger ETA recalculation when the driver moves 100+ meters or deviates 50+ meters from the route polyline. Recalculating on every ping is 10x the necessary routing compute.
- SSE over HTTP/2 is the correct push mechanism. It sends data only when the ETA changes, uses 2-4 KB of server memory per stream (versus 8-16 KB for raw TCP), and reconnects automatically on mobile network interruption.
- Use a consistent-hash load balancer to route each trip_id to the same Push Service instance. This co-locates Kafka consumption and SSE connection state, eliminating inter-instance coordination on the push path.
- Shard the routing cluster geographically by tile. Each server loads 10-15 GB of CH graph for its region. Over 95% of ride-sharing trips are intra-tile; the 5% that cross tiles use a two-hop boundary-node approximation.
- CH preprocessing must re-run when road topology changes (new bridge, permanent closure), not when traffic changes. These are separate operational events and should be treated as separate pipelines.
- For cross-cluster queries in logistics (long-haul trucking, airport transfers), maintain a highway-class global CH graph covering only major road edges. At 1-2% of total segments, it fits in 2-4 GB on any server and answers long-distance queries without requiring a full-world graph.
- The ETAResult Store is append-only and used for audit and post-trip accuracy analysis. Trip accuracy over time is the primary signal for tuning the CH speed overlay calibration and HMM map-match thresholds.
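The consistent-hash pinning of trip_id to a Push Service instance mentioned above can be sketched with a minimal hash ring (virtual-node count and hash choice are illustrative):

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring pinning each trip_id to one push instance.

    Virtual nodes smooth the key distribution; when an instance joins or
    leaves, only ~1/N of trips remap, so most SSE connections stay put.
    """

    def __init__(self, instances, vnodes=100):
        self.ring = sorted(
            (self._h(f"{inst}#{i}"), inst)
            for inst in instances for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def instance_for(self, trip_id):
        # First ring position clockwise from the key's hash (wrapping around).
        idx = bisect.bisect(self.keys, self._h(trip_id)) % len(self.ring)
        return self.ring[idx][1]
```

The load balancer and the Push Service's Kafka consumer both evaluate the same ring, which is what co-locates a trip's event consumption with its open SSE connection.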