Service discovery
How distributed services find and communicate with each other in dynamic environments, covering client-side vs. server-side discovery, DNS-based vs. registry-based approaches, and health checking.
TL;DR
- Service discovery solves the question "how does Service A know where to send requests to Service B?" in dynamic environments where instances start, stop, and move.
- Two models: client-side discovery (client queries registry, picks an instance) and server-side discovery (client sends to a load balancer, which consults the registry).
- Registry backends: DNS (simple, eventually consistent), dedicated registry (Consul, etcd, ZooKeeper: strongly consistent, with health-check support), platform-native (Kubernetes Services).
- Health checking is inseparable from service discovery: a registry without health checks returns dead endpoints.
- In Kubernetes, service discovery is provided by the platform and most teams never build it themselves.
The Problem It Solves
You're running three instances of the payments service behind a load balancer. Friday afternoon, the team deploys version 2.1 with a rolling update. Kubernetes terminates the old pods and spins up new ones. Five minutes later, the orders service starts logging connection refused errors. It was configured with hardcoded IPs for the old payments pods, and nobody updated the config.
In static infrastructure, you hardcode payments.internal:8080 once and it stays valid for months. In dynamic environments with containers, auto-scaling, and rolling deploys, the set of instances serving a given service changes constantly. An IP address that was valid five minutes ago might now point to a terminated container.
The core question: how does Service A know where Service B lives right now?
Without a discovery mechanism, every deployment requires restarting all upstream services with new configuration. Every scale-out event means updating config files across the cluster. Every crashed instance keeps receiving traffic until someone manually removes it. I've seen teams spend more time on configuration management than on actual feature work because of this exact problem.
Service discovery eliminates this manual coordination entirely.
What Is It?
Service discovery is the mechanism that automatically tracks which instances are available for each service, where they live (IP and port), and whether they're healthy enough to receive traffic. It replaces hardcoded configuration with a dynamic, self-updating directory of services.
Think of it like the difference between a printed phone directory and a modern contact list that syncs automatically. The printed directory is correct the day it ships, but entries go stale as people move. A synced contact list updates itself in real time: when someone gets a new number, every device sees the change. Service discovery is the synced contact list for your infrastructure.
Every service registers itself on startup and deregisters on shutdown. The registry health-checks each instance continuously. Callers query the registry to get a list of healthy instances, and they never touch a configuration file. For your interview: say "service discovery replaces static configuration with a dynamic registry that health-checks and routes traffic automatically" and move on.
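To make the register, health-check, and lookup cycle concrete, here is a minimal in-memory sketch. The `Registry` class and its method names are hypothetical and exist only for illustration; a real registry is a separate, replicated service rather than a dictionary in one process.

```python
class Registry:
    """Toy registry: maps (service, endpoint) to a health flag."""

    def __init__(self):
        self._healthy = {}  # (service name, "ip:port") -> bool

    def register(self, service, endpoint):
        # Called by an instance on startup.
        self._healthy[(service, endpoint)] = True

    def deregister(self, service, endpoint):
        # Called by an instance on clean shutdown.
        self._healthy.pop((service, endpoint), None)

    def set_health(self, service, endpoint, healthy):
        # Called by whatever component runs the health checks.
        if (service, endpoint) in self._healthy:
            self._healthy[(service, endpoint)] = healthy

    def lookup(self, service):
        # Callers only ever see instances currently marked healthy.
        return sorted(
            ep for (svc, ep), ok in self._healthy.items()
            if svc == service and ok
        )


registry = Registry()
registry.register("payments-service", "10.0.3.22:8080")
registry.register("payments-service", "10.0.3.25:8080")
registry.set_health("payments-service", "10.0.3.25:8080", False)
print(registry.lookup("payments-service"))
```

The point of the sketch is that callers depend only on the service name; which endpoints come back is decided by registrations and health status, not by a config file.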
How It Works
Here's the lifecycle of a single service instance, from boot to receiving traffic to shutdown:
Steps in detail:
- Register: The service starts on some host with IP 10.0.3.22. It sends a registration call to the registry: "I'm payments-service at 10.0.3.22:8080, my health endpoint is /health."
- Health checking begins: The registry starts polling the instance's /health endpoint every 5 seconds. Three consecutive failures mark the instance as unhealthy.
- Lookup: The orders service needs to call payments. It queries the registry for all healthy instances of payments-service.
- Instance selection: The caller receives a list like [10.0.3.22:8080, 10.0.3.25:8080] and picks one using round-robin, random, or least-connections.
- Request: The caller sends its HTTP/gRPC request to the chosen instance.
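- Deregistration: On clean shutdown, the instance deregisters immediately. On a crash, the health checks start failing; after three consecutive failures the registry stops returning the instance to callers, and it removes the entry entirely once the deregistration timeout expires.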
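The health-check and instance-selection steps above can be sketched in a few lines. This is an illustrative sketch, not any particular registry's implementation: `HealthTracker` and `round_robin` are hypothetical names, and the choice to reset the failure count on a single successful probe is an assumption.

```python
import itertools

FAILURE_THRESHOLD = 3  # from the steps above: three consecutive failures


class HealthTracker:
    """Counts consecutive failed probes per endpoint."""

    def __init__(self, threshold=FAILURE_THRESHOLD):
        self.threshold = threshold
        self.failures = {}  # endpoint -> consecutive failed probes

    def record(self, endpoint, probe_ok):
        """Record one probe result; return True while the endpoint counts as healthy."""
        if probe_ok:
            self.failures[endpoint] = 0  # assumption: one success resets the count
        else:
            self.failures[endpoint] = self.failures.get(endpoint, 0) + 1
        return self.failures[endpoint] < self.threshold


def round_robin(endpoints):
    """Return an iterator that rotates through the endpoints indefinitely."""
    return itertools.cycle(endpoints)


tracker = HealthTracker()
for _ in range(3):
    still_healthy = tracker.record("10.0.3.22:8080", probe_ok=False)
print(still_healthy)  # False: the third consecutive failure crosses the threshold

picker = round_robin(["10.0.3.22:8080", "10.0.3.25:8080"])
print(next(picker), next(picker), next(picker))
```

A caller would rebuild the round-robin iterator whenever the list of healthy instances changes; random or least-connections selection would replace `round_robin` without touching the health tracking.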
Here's what a Consul registration call looks like:
```
PUT /v1/agent/service/register
{
  "Name": "payments-service",
  "ID": "payments-service-10.0.3.22",
  "Address": "10.0.3.22",
  "Port": 8080,
  "Check": {
    "HTTP": "http://10.0.3.22:8080/health",
    "Interval": "5s",
    "DeregisterCriticalServiceAfter": "30s"
  }
}
```
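The DeregisterCriticalServiceAfter field is key: once the health check has been critical for longer than this timeout, Consul removes the instance from the registry entirely. This prevents stale entries from accumulating when instances crash without deregistering. Note that Consul enforces a minimum of one minute for this timeout and reaps critical services only about every 30 seconds, so a configured value of 30s takes effect later than the number suggests.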
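A service could issue that registration call itself on startup. The sketch below builds the same payload and sends it with Python's standard library. The helper names `build_registration` and `register` are hypothetical, and the agent URL assumes a local Consul agent on its default HTTP port (8500); adjust both to your setup.

```python
import json
import urllib.request


def build_registration(name, address, port, interval="5s", deregister_after="30s"):
    """Build the JSON body for PUT /v1/agent/service/register."""
    return {
        "Name": name,
        "ID": f"{name}-{address}",
        "Address": address,
        "Port": port,
        "Check": {
            "HTTP": f"http://{address}:{port}/health",
            "Interval": interval,
            "DeregisterCriticalServiceAfter": deregister_after,
        },
    }


def register(payload, agent_url="http://127.0.0.1:8500"):
    """Send the registration to a Consul agent (assumed to be running locally)."""
    request = urllib.request.Request(
        f"{agent_url}/v1/agent/service/register",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


payload = build_registration("payments-service", "10.0.3.22", 8080)
print(json.dumps(payload, indent=2))
# register(payload)  # uncomment when a Consul agent is reachable
```

Keeping payload construction separate from the network call makes the health-check settings easy to unit test without a running agent.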
I often see teams skip the health check configuration and then wonder why their registry is full of dead endpoints. Health checking isn't optional. It's the mechanism that makes service discovery reliable instead of just a different kind of stale configuration.
Key Components
| Component | Role |
|---|---|
| Service Registry | Central database storing service names, instance IPs, ports, and health status. Examples: Consul catalog, etcd key-value store, K8s Endpoints objects. |
| Health Checker | Probes each registered instance on a schedule (HTTP GET, TCP connect, or gRPC health). Removes unhealthy instances from the available pool. |
| Service Instance | Any running process that registers itself. Sends a registration on startup, heartbeats periodically, and deregisters on clean shutdown. |
| Client Library / SDK | Queries the registry, caches results locally, and handles load balancing across returned instances. Used in client-side discovery. |
| Load Balancer / Proxy | Queries the registry on behalf of clients and forwards traffic to healthy instances. Used in server-side discovery. |
| DNS Resolver | Translates service names to IP addresses using the standard DNS protocol. Simple but limited: no built-in health check integration, and TTL-based caching creates staleness. |
| Sidecar Proxy | Runs alongside each service instance (e.g., Envoy in a service mesh). Intercepts outbound traffic and handles discovery transparently, so the application code doesn't touch the registry at all. |
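The "caches results locally" part of the client library row is worth a sketch, because it is where staleness creeps back in. The class below is a hypothetical illustration, not a real SDK: `CachedLookup`, its `fetch` callable, and the 5-second TTL are all assumptions.

```python
import time


class CachedLookup:
    """Caches registry answers per service for a short TTL."""

    def __init__(self, fetch, ttl_seconds=5.0, clock=time.monotonic):
        self._fetch = fetch        # callable: service name -> list of endpoints
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the cache can be tested
        self._cache = {}           # service -> (expires_at, endpoints)

    def get(self, service):
        now = self._clock()
        entry = self._cache.get(service)
        if entry is not None and now < entry[0]:
            return entry[1]        # still fresh: skip the registry round trip
        endpoints = self._fetch(service)
        self._cache[service] = (now + self._ttl, endpoints)
        return endpoints


def fetch_from_registry(service):
    # Stand-in for a real registry query.
    return ["10.0.3.22:8080", "10.0.3.25:8080"]


lookup = CachedLookup(fetch_from_registry)
print(lookup.get("payments-service"))
```

The TTL is a trade-off: a longer value reduces load on the registry, but a dead instance can keep receiving traffic for up to that long after the registry has dropped it.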
Types / Variations
Client-side discovery
The calling service queries the registry directly and picks an instance using its own load-balancing logic.
The client has full control over the load-balancing algorithm and can make smart decisions (prefer same-zone instances, weight by latency). The downside: every service in every language needs a discovery-aware client library.
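As a rough illustration of those "smart decisions", the sketch below prefers same-zone instances and then weights the choice inversely by observed latency. The `pick_instance` function and the `zone` and `latency_ms` fields are hypothetical; a real client library would track these values itself.

```python
import random


def pick_instance(instances, local_zone, rng=random):
    """Prefer same-zone instances, then weight inversely by latency."""
    same_zone = [i for i in instances if i["zone"] == local_zone]
    candidates = same_zone or instances  # fall back to any zone if none is local
    if not candidates:
        raise LookupError("no healthy instances")
    # Lower latency means a higher weight; clamp to avoid division by zero.
    weights = [1.0 / max(i["latency_ms"], 1.0) for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


instances = [
    {"endpoint": "10.0.3.22:8080", "zone": "a", "latency_ms": 4.0},
    {"endpoint": "10.0.3.25:8080", "zone": "b", "latency_ms": 1.0},
]
print(pick_instance(instances, local_zone="a")["endpoint"])
```

With only one instance in zone "a", the call above always returns it; the latency weighting matters once several candidates share the caller's zone.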
Server-side discovery
The calling service sends to a fixed load balancer address. The load balancer handles registry lookups and routing.
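In code, the difference from the client-side model is where the lookup happens. Here is a minimal sketch, assuming a hypothetical `LoadBalancer` class that is handed a registry lookup function; real load balancers also forward the request itself, which is omitted here.

```python
import itertools


class LoadBalancer:
    """Resolves a service name to one healthy endpoint on the caller's behalf."""

    def __init__(self, lookup):
        self._lookup = lookup              # callable: service -> healthy endpoints
        self._counter = itertools.count()  # shared round-robin position

    def route(self, service):
        endpoints = self._lookup(service)  # the client never sees this list
        if not endpoints:
            raise LookupError(f"no healthy instances for {service}")
        return endpoints[next(self._counter) % len(endpoints)]


registry_view = {"payments-service": ["10.0.3.22:8080", "10.0.3.25:8080"]}
balancer = LoadBalancer(lambda service: registry_view.get(service, []))
print(balancer.route("payments-service"))
print(balancer.route("payments-service"))
```

The caller only needs the load balancer's fixed address, so no discovery-aware library is required in each service; the cost is an extra network hop and one more component to keep available.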