Service discovery
How distributed services find and communicate with each other in dynamic environments, covering client-side vs. server-side discovery, DNS-based vs. registry-based approaches, and health checking.
TL;DR
- Service discovery solves the question "how does Service A know where to send requests to Service B?" in dynamic environments where instances start, stop, and move.
- Two models: client-side discovery (client queries registry, picks an instance) and server-side discovery (client sends to a load balancer, which consults the registry).
- Registry backends: DNS (simple, eventually consistent), dedicated registry (Consul, etcd, ZooKeeper: strongly consistent, supports health checks), platform-native (Kubernetes Services).
- Health checking is inseparable from service discovery: a registry without health checks returns dead endpoints.
- In Kubernetes, service discovery is provided by the platform and most teams never build it themselves.
The Problem It Solves
You're running three instances of the payments service behind a load balancer. Friday afternoon, the team deploys version 2.1 with a rolling update. Kubernetes terminates the old pods and spins up new ones. Five minutes later, the orders service starts logging connection refused errors. It was configured with hardcoded IPs for the old payments pods, and nobody updated the config.
In static infrastructure, you hardcode payments.internal:8080 once and it stays valid for months. In dynamic environments with containers, auto-scaling, and rolling deploys, the set of instances serving a given service changes constantly. An IP address that was valid five minutes ago might now point to a terminated container.
The core question: how does Service A know where Service B lives right now?
Without a discovery mechanism, every deployment requires restarting all upstream services with new configuration. Every scale-out event means updating config files across the cluster. Every crashed instance keeps receiving traffic until someone manually removes it. I've seen teams spend more time on configuration management than on actual feature work because of this exact problem.
Service discovery eliminates this manual coordination entirely.
What Is It?
Service discovery is the mechanism that automatically tracks which instances are available for each service, where they live (IP and port), and whether they're healthy enough to receive traffic. It replaces hardcoded configuration with a dynamic, self-updating directory of services.
Think of it like the difference between a printed phone directory and a modern contact list that syncs automatically. The printed directory is correct the day it ships, but entries go stale as people move. A synced contact list updates itself in real time: when someone gets a new number, every device sees the change. Service discovery is the synced contact list for your infrastructure.
Every service registers itself on startup and deregisters on shutdown. The registry health-checks each instance continuously. Callers query the registry to get a list of healthy instances, and they never touch a configuration file. For your interview: say "service discovery replaces static configuration with a dynamic registry that health-checks and routes traffic automatically" and move on.
How It Works
Here's the lifecycle of a single service instance, from boot to receiving traffic to shutdown:
Steps in detail:
- Register: The service starts on some host with IP 10.0.3.22. It sends a registration call to the registry: "I'm payments-service at 10.0.3.22:8080, my health endpoint is /health."
- Health checking begins: The registry starts polling the instance's `/health` endpoint every 5 seconds. Three consecutive failures mark the instance as unhealthy.
- Lookup: The orders service needs to call payments. It queries the registry for all healthy instances of `payments-service`.
- Instance selection: The caller receives a list like `[10.0.3.22:8080, 10.0.3.25:8080]` and picks one using round-robin, random, or least-connections.
- Request: The caller sends its HTTP/gRPC request to the chosen instance.
- Deregistration: On clean shutdown, the instance deregisters immediately. On a crash, the health check fails the configured number of consecutive rounds (three in this example), and the registry removes it automatically.
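The register/lookup/selection steps above can be sketched with a toy in-memory registry. This is illustrative only: the service names, instance IDs, and addresses are made up, and a real registry would persist state and run health checks out of band.

```python
import itertools

class Registry:
    """Toy in-memory service registry mirroring the lifecycle above."""

    def __init__(self):
        self._services = {}  # service name -> {instance_id: (address, healthy)}

    def register(self, name, instance_id, address):
        # "I'm payments-service at 10.0.3.22:8080"
        self._services.setdefault(name, {})[instance_id] = (address, True)

    def deregister(self, name, instance_id):
        # Clean shutdown: remove the instance immediately.
        self._services.get(name, {}).pop(instance_id, None)

    def mark_unhealthy(self, name, instance_id):
        # What the health checker does after consecutive failed probes.
        addr, _ = self._services[name][instance_id]
        self._services[name][instance_id] = (addr, False)

    def healthy_instances(self, name):
        # Lookup step: callers only ever see healthy instances.
        return [addr for addr, ok in self._services.get(name, {}).values() if ok]

registry = Registry()
registry.register("payments-service", "pay-1", "10.0.3.22:8080")
registry.register("payments-service", "pay-2", "10.0.3.25:8080")

# Instance selection on the caller side: simple round-robin.
rr = itertools.cycle(registry.healthy_instances("payments-service"))
picks = [next(rr) for _ in range(4)]  # alternates between the two instances
```

The caller would then send its request to the picked address; when a health check marks an instance unhealthy, subsequent lookups simply stop returning it.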
Here's what a Consul registration call looks like:
```
PUT /v1/agent/service/register
{
  "Name": "payments-service",
  "ID": "payments-service-10.0.3.22",
  "Address": "10.0.3.22",
  "Port": 8080,
  "Check": {
    "HTTP": "http://10.0.3.22:8080/health",
    "Interval": "5s",
    "DeregisterCriticalServiceAfter": "30s"
  }
}
```
The DeregisterCriticalServiceAfter field is key: after 30 seconds of failed health checks, the instance is removed entirely from the registry. This prevents stale entries from accumulating when instances crash without deregistering.
I often see teams skip the health check configuration and then wonder why their registry is full of dead endpoints. Health checking isn't optional. It's the mechanism that makes service discovery reliable instead of just a different kind of stale configuration.
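The two mechanisms in the Consul config above (consecutive-failure counting plus a deregistration deadline) can be sketched like this. The thresholds and the clock-injection pattern are my own illustrative choices, not Consul's implementation:

```python
import time

FAILURE_THRESHOLD = 3    # consecutive failed probes before marking unhealthy
DEREGISTER_AFTER = 30.0  # seconds spent critical before removal
                         # (mirrors DeregisterCriticalServiceAfter)

class HealthChecker:
    """Toy health checker: failure counting plus a removal deadline."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.instances = {}  # id -> {"failures": int, "critical_since": float|None}

    def track(self, instance_id):
        self.instances[instance_id] = {"failures": 0, "critical_since": None}

    def record_probe(self, instance_id, ok):
        state = self.instances[instance_id]
        if ok:
            # Any success resets the failure streak.
            state["failures"] = 0
            state["critical_since"] = None
            return
        state["failures"] += 1
        if state["failures"] >= FAILURE_THRESHOLD and state["critical_since"] is None:
            # Start the deregistration clock the moment we go critical.
            state["critical_since"] = self.clock()

    def is_healthy(self, instance_id):
        return self.instances[instance_id]["failures"] < FAILURE_THRESHOLD

    def sweep(self):
        """Remove instances that have been critical past the deadline."""
        now = self.clock()
        for iid, st in list(self.instances.items()):
            if st["critical_since"] is not None and now - st["critical_since"] >= DEREGISTER_AFTER:
                del self.instances[iid]
```

Notice the two-stage behavior: an instance is pulled from the routing pool quickly (three missed probes, ~15 seconds at a 5-second interval) but only removed from the registry entirely after the longer deadline, which gives briefly-flapping instances a chance to recover without re-registering.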
Key Components
| Component | Role |
|---|---|
| Service Registry | Central database storing service names, instance IPs, ports, and health status. Examples: Consul catalog, etcd key-value store, K8s Endpoints objects. |
| Health Checker | Probes each registered instance on a schedule (HTTP GET, TCP connect, or gRPC health). Removes unhealthy instances from the available pool. |
| Service Instance | Any running process that registers itself. Sends a registration on startup, heartbeats periodically, and deregisters on clean shutdown. |
| Client Library / SDK | Queries the registry, caches results locally, and handles load balancing across returned instances. Used in client-side discovery. |
| Load Balancer / Proxy | Queries the registry on behalf of clients and forwards traffic to healthy instances. Used in server-side discovery. |
| DNS Resolver | Translates service names to IP addresses using standard DNS protocol. Simple but limited: no health check integration, TTL-based caching creates staleness. |
| Sidecar Proxy | Runs alongside each service instance (e.g., Envoy in a service mesh). Intercepts outbound traffic and handles discovery transparently, so the application code doesn't touch the registry at all. |
Types / Variations
Client-side discovery
The calling service queries the registry directly and picks an instance using its own load-balancing logic.
The client has full control over the load-balancing algorithm and can make smart decisions (prefer same-zone instances, weight by latency). The downside: every service in every language needs a discovery-aware client library.
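One of those "smart decisions" (same-zone preference) fits in a few lines. A hedged sketch, assuming the registry returns each instance tagged with its zone; the addresses and zone names are invented:

```python
import random

def pick_instance(instances, my_zone):
    """Client-side selection: prefer same-zone instances, fall back to any.

    `instances` is a list of (address, zone) tuples as a hypothetical
    registry might return them.
    """
    same_zone = [addr for addr, zone in instances if zone == my_zone]
    pool = same_zone or [addr for addr, _ in instances]
    return random.choice(pool)

instances = [
    ("10.0.3.22:8080", "us-east-1a"),
    ("10.0.3.25:8080", "us-east-1b"),
    ("10.0.3.30:8080", "us-east-1a"),
]
```

This kind of logic is exactly what has to be reimplemented (or wrapped in an SDK) for every language in the fleet, which is the main cost of the client-side model.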
Server-side discovery
The calling service sends to a fixed load balancer address. The load balancer handles registry lookups and routing.
Clients stay simple: they just send to a stable address. The load balancer is a single point of failure, but running it in HA mode (multiple replicas behind a VIP) mitigates this. My recommendation for most teams: start with server-side discovery unless you have a specific reason to need client-side control.
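The division of labor can be sketched as a toy routing function inside the load balancer: the client only ever supplies a service name and a path, and the LB does the registry lookup and instance selection. Names and addresses are illustrative:

```python
def route(registry, counters, service_name, path):
    """Server-side discovery sketch: the LB, not the client, consults the
    registry and picks a backend round-robin.

    `registry` maps service name -> list of healthy addresses;
    `counters` holds per-service round-robin state.
    """
    instances = registry[service_name]
    if not instances:
        raise RuntimeError(f"no healthy instances for {service_name}")
    i = counters.get(service_name, 0)
    target = instances[i % len(instances)]
    counters[service_name] = i + 1
    # The real LB would now proxy the request; we just return the target URL.
    return f"http://{target}{path}"

registry = {"payments-service": ["10.0.3.22:8080", "10.0.3.25:8080"]}
```

The client's view is just "send to the LB's stable address"; all the discovery machinery lives on the server side, which is why this model works identically across every language.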
Comparison
| Aspect | Client-side | Server-side | DNS-based | Platform-native (K8s) |
|---|---|---|---|---|
| Client complexity | High (needs SDK) | Low (fixed address) | Low (DNS lookup) | Low (DNS lookup) |
| Extra network hop | No | Yes (through LB) | No | Depends (kube-proxy) |
| Health checking | Registry-driven | LB + registry | None (TTL only) | Built-in (kubelet probes) |
| LB control | Full (client decides) | Centralized (LB decides) | Round-robin DNS | kube-proxy rules |
| Language independence | No (each needs SDK) | Yes | Yes | Yes |
| Multi-datacenter | Possible (registry-aware) | Yes (global LB) | DNS-based routing | Federation (complex) |
| Example | Netflix Ribbon + Eureka | AWS ALB, Envoy | Route53 | K8s ClusterIP Service |
DNS-based discovery
Services register as DNS A or SRV records. Clients use standard DNS lookups to resolve service names to IPs.
DNS is the simplest form of service discovery and requires zero library dependencies. The limitation is TTL-based caching: DNS resolvers cache entries for the record's TTL (commonly 30-60 seconds), so clients may route to stale or terminated instances for up to a minute after a change. DNS also provides no health checking, no metadata (version, capacity), and no notification when instances change.
For managed environments with slow change rates, DNS is adequate. For high-churn environments with frequent deploys and auto-scaling, a dedicated registry is almost always better.
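The TTL staleness problem is easy to see in a minimal DNS-style cache. This is a sketch with an injected clock and a fake resolver (no real DNS lookups), purely to show why clients keep routing to a replaced instance until the TTL expires:

```python
class TTLCache:
    """Minimal DNS-resolver-style TTL cache."""

    def __init__(self, resolve, ttl, clock):
        self.resolve = resolve  # callable: name -> list of IPs
        self.ttl = ttl          # seconds to serve a cached answer
        self.clock = clock      # injected so the test can advance time
        self._cache = {}        # name -> (ips, expires_at)

    def lookup(self, name):
        entry = self._cache.get(name)
        if entry and self.clock() < entry[1]:
            return entry[0]  # served from cache: possibly stale
        ips = self.resolve(name)
        self._cache[name] = (ips, self.clock() + self.ttl)
        return ips
```

With a 30-second TTL, a client that resolved just before a rolling deploy will keep hitting the old (now terminated) IP for up to 30 seconds, which is exactly the window a registry with active health checking closes.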
Platform-native (Kubernetes)
In Kubernetes, a Service object creates a stable DNS name (e.g., payments-service.default.svc.cluster.local) that automatically load-balances to all healthy pods matching a label selector. The kubelet checks pod health via liveness and readiness probes, and kube-proxy maintains iptables/IPVS rules that route traffic.
Most teams running on Kubernetes use built-in Service discovery and don't need an external registry. If you're in an interview and the system runs on K8s, you can say "Kubernetes Services handle discovery natively" and spend your time on more interesting problems.
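For concreteness, here is what a minimal Service manifest looks like (the name, label, and ports are illustrative):

```yaml
# Gives pods labeled app=payments a stable DNS name
# (payments-service.default.svc.cluster.local) and
# load-balances across the ready ones.
apiVersion: v1
kind: Service
metadata:
  name: payments-service
spec:
  selector:
    app: payments        # any pod with this label becomes a backend
  ports:
    - port: 80           # port the Service exposes
      targetPort: 8080   # port the pods actually listen on
```

Pods that fail their readiness probe are removed from the Service's endpoints automatically, which is the health-checking half of discovery handled for free.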
DNS is not service discovery
I see candidates treat DNS resolution as equivalent to full service discovery. DNS gives you IP addresses with TTL-based caching, but it has no health checking, no real-time instance updates, and no metadata. If an instance crashes, clients keep sending traffic to it until the cached TTL expires (30-60 seconds). Real service discovery requires health checking and near-instant deregistration of failed instances.
Service mesh / sidecar approach
In a service mesh (Istio, Linkerd), each service instance gets a sidecar proxy (typically Envoy). The sidecar intercepts all outbound network calls, queries the registry (the mesh's control plane), and routes to a healthy instance. The application code makes a plain HTTP call to http://payments-service/ and the sidecar handles everything.
This is the most transparent approach: zero library dependencies, zero code changes, and the sidecar can add mTLS, retries, circuit breaking, and observability for free. The cost is operational complexity (running a control plane plus sidecars on every pod) and a small latency overhead per hop.
Trade-offs
| Advantage | Disadvantage |
|---|---|
| Eliminates hardcoded configuration | Adds a new infrastructure component (the registry) that must be highly available |
| Auto-removes crashed instances via health checks | Health check tuning is tricky: too aggressive causes flapping, too lenient means stale routing |
| Enables auto-scaling without config changes | Client-side approaches require discovery-aware libraries per language |
| Supports rolling deploys with zero downtime | Server-side approaches add a network hop and a potential single point of failure |
| Provides a single source of truth for instance topology | The registry itself needs replication and consistency, which is a distributed systems problem |
| Can carry metadata (version, region, capacity) | Lookup latency adds up (mitigated by caching, but caching introduces staleness) |
The fundamental tension is freshness vs. availability. A registry that provides perfectly fresh data (no caching, synchronous health checks) is slower and more fragile. A registry that caches aggressively is fast and resilient but routes to stale instances during changes. Every implementation picks a point on this spectrum.
When to Use It / When to Avoid It
Use service discovery when:
- You have 3 or more services that need to communicate
- Your infrastructure is dynamic (containers, auto-scaling, cloud VMs)
- You do frequent deployments (multiple per day)
- You're running on Kubernetes (you already have it for free)
- You need multi-region routing or failover
Avoid service discovery when:
- You're running a monolith (no inter-service communication)
- Your infrastructure is static (fixed set of VMs, rarely redeployed)
- You have exactly 2 services (a config file is fine, seriously)
- You're fully serverless (API Gateway / Lambda handles routing at the platform level)
- The operational overhead of running a registry exceeds the benefit (small teams with few services)
Here's the honest answer: if you're building microservices, you need service discovery. The question isn't whether, it's which approach. If you're on K8s, you already have it. If you're on bare VMs with few services, a static config file or DNS is probably fine. Don't introduce Consul for two services.
Real-World Examples
Netflix Eureka pioneered client-side discovery at scale. At peak, Netflix runs hundreds of microservices on AWS, with thousands of instances registering and deregistering as auto-scaling groups resize. Eureka is AP (available and partition-tolerant) by design: each Eureka server maintains a full copy of the registry, and instances cache registry data locally. If Eureka goes down entirely, services continue routing using their cached instance lists. This was a deliberate choice: Netflix decided stale routing data is better than no routing data.
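That AP trade-off (stale routing data beats no routing data) reduces to a small fallback pattern on the client. A sketch, not Eureka's actual client code:

```python
class CachingClient:
    """AP-style discovery client: serve the last-known instance list
    when the registry is unreachable."""

    def __init__(self, fetch):
        self.fetch = fetch    # callable: service name -> list of addresses
        self._last_known = {}

    def instances(self, name):
        try:
            result = self.fetch(name)
            self._last_known[name] = result  # refresh the local cache
            return result
        except ConnectionError:
            if name in self._last_known:
                return self._last_known[name]  # stale beats nothing
            raise  # never seen this service: nothing to fall back to
```

The cost is the same staleness discussed under the freshness-vs-availability trade-off: during a registry outage, the client may route to instances that have since died, and it relies on request-level retries to paper over that.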
HashiCorp Consul handles service discovery for organizations running 10,000+ services across multiple datacenters. Consul uses Raft consensus for strong consistency (CP), supports DNS and HTTP APIs for lookups, and includes native health checking with configurable intervals. It's the most popular standalone registry outside of Kubernetes environments.
Kubernetes Services have become the de facto standard for containerized workloads. Every K8s cluster has CoreDNS and kube-proxy, giving every pod automatic DNS-based service discovery with health checking via liveness/readiness probes. At Google, Borg (Kubernetes' predecessor) ran service discovery at a scale of millions of containers, and the design principles carried directly into K8s.
How This Shows Up in Interviews
When to bring it up: Mention service discovery whenever your architecture involves 3+ services that need to find each other. Don't spend more than 30 seconds on it unless the interviewer asks for depth. "Services register with a service registry (or use Kubernetes DNS), and callers look up healthy instances dynamically" is usually enough.
Depth expected at senior/staff level:
- Explain the difference between client-side and server-side discovery and when you'd choose each
- Know that DNS-based discovery has TTL caching limitations and no health checking
- Understand that K8s Services handle discovery natively for containerized workloads
- Be ready to discuss health checking strategies (active polling vs. passive circuit-breaking)
- Mention the self-registration vs. third-party registration trade-off (who is responsible for keeping the registry accurate?)
Interview shortcut: K8s is the default answer
If the system design is running on containers/K8s, say "Kubernetes Services handle service discovery natively via CoreDNS and kube-proxy" and move on. Only go deeper if the interviewer asks about custom discovery requirements like multi-datacenter routing, weighted routing, or canary deployments.
| Interviewer asks | Strong answer |
|---|---|
| "How do services find each other?" | "Each service registers with a service registry on startup. Callers query the registry for healthy instances and load-balance across them. On K8s, this is built-in via Services and CoreDNS." |
| "What if the registry goes down?" | "Client-side caching. Clients cache the last-known instance list and continue routing even if the registry is temporarily unreachable. Netflix Eureka was designed around this exact scenario." |
| "How do you handle a bad deploy that passes health checks?" | "Canary deployments: route a small percentage of traffic to the new version, monitor error rates, and roll back if metrics spike. The registry doesn't solve bad code, it just knows who's alive." |
| "Client-side vs. server-side discovery?" | "Client-side gives the caller full control over load balancing and avoids an extra hop. Server-side keeps clients simple and works across any language. For most teams, server-side (or K8s native) is the right default." |
| "DNS vs. dedicated registry?" | "DNS is simple and dependency-free but has TTL caching (30-60s stale data) and no health checking. A dedicated registry like Consul provides real-time health checking and instant deregistration. Use DNS for slow-changing environments, registry for dynamic ones." |
Test Your Understanding
Quick Recap
- Service discovery replaces hardcoded IP configuration with a dynamic registry that tracks which instances are available, where they live, and whether they're healthy.
- Client-side discovery queries the registry directly and picks an instance (full control, but requires a library per language). Server-side discovery routes through a load balancer that handles the lookup (simpler clients, extra hop).
- DNS-based discovery is the simplest approach but has no health checking and relies on TTL-based caching, which means stale routing during changes.
- Consul and etcd provide strongly consistent registries with native health checking. Kubernetes Services provide DNS-based discovery with health checking baked into the platform.
- Health checking is not optional. A registry without health checks actively makes things worse by returning dead instances as if they were alive.
- Use TTL-based heartbeats as the deregistration fallback for crashed instances that can't self-deregister on shutdown.
- If you're on Kubernetes, service discovery is already solved. Mention it in interviews, but don't over-design what the platform gives you for free.
Related Concepts
- Load Balancing: Service discovery tells you where instances are; load balancing decides which instance gets each request. They're complementary: discovery provides the list, balancing picks from it.
- Service Mesh: A service mesh bundles service discovery, load balancing, mTLS, and observability into sidecar proxies. It's the "all-in-one" approach that makes discovery transparent to application code.
- Microservices: Service discovery becomes necessary as soon as you split a monolith into multiple services. It's foundational infrastructure for any microservices architecture.
- API Gateway: API gateways handle external-to-internal routing. Service discovery handles internal-to-internal routing. In many architectures, the gateway itself uses service discovery to find backend services.