Service mesh
Learn how a service mesh eliminates duplicated networking code across microservices, enforces zero-trust mTLS by default, and gives you end-to-end observability without touching your application code.
TL;DR
- A service mesh is a dedicated infrastructure layer for all service-to-service communication, implemented as sidecar proxies co-located with every service instance.
- Without a mesh, every team independently implements the same six concerns (auth, retries, timeouts, circuit breaking, metrics, and distributed tracing), producing 50+ divergent implementations inside a 10-service architecture.
- The data plane is every sidecar proxy (typically Envoy) in your cluster intercepting traffic. The control plane (Istio, Linkerd, Consul) pushes certificates, routing rules, and policies to those proxies without touching your service code.
- Core capabilities: mutual TLS (mTLS) encrypts all east-west traffic by default; traffic management enables canary traffic shifting and instant rollback with zero pod restarts (just change a weight in YAML); circuit breaking is configured as policy, not library code; observability emerges automatically from proxy telemetry.
- The break-even point is roughly 10 microservices: below that, the operational complexity of running a control plane outweighs the benefits. Above that, the cost of NOT having a mesh grows with every new service you add.
The Problem It Solves
You've just hit 15 microservices. Congratulations. Now look at what every team is doing in their service code.
Team A (User Service) wrote their own retry middleware with exponential backoff: 3 lines in a shared library. Team B (Order Service) wrote theirs from scratch because they missed the library: 47 lines with a subtle bug that retries on client errors. Team C (Payment Service) is using a circuit breaker from a different library than Team D (Inventory Service). Nobody's distributed tracing correlates because they each use a different header format for the trace ID.
And none of it is visible from a single place. When a latency spike hits the Order Service, you can't tell whether it's coming from the User Service, the Inventory Service, or the Payment Service gateway, because each team's observability points only at their own service.
I've seen this pattern at companies from Series B startups to Fortune 50 enterprises. The problem isn't that teams are negligent. The problem is that these cross-cutting concerns are genuinely hard to standardize across teams, and so they don't get standardized until something breaks at 3am.
The hidden cost of N² duplication
15 services × 6 cross-cutting concerns (auth, retries, timeouts, circuit breaking, metrics, tracing) = 90 separate implementations. Each one was correct when written. Two quarters later, they've diverged. The retry logic in your checkout service retries on 503s; the retry logic in your recommendation service doesn't. You discover the inconsistency when a downstream service starts returning 503s during a deploy and checkout loops while recommendations fail fast.
The fix is not another shared library. Shared libraries have the same problem: different services on different versions, and the library can only do what it knows the caller is willing to do. The fix is moving networking concerns out of service code entirely and into the network layer itself.
What Is It?
A service mesh is an infrastructure layer that handles all service-to-service communication in a microservices architecture. It works by deploying a lightweight network proxy (typically Envoy) as a sidecar alongside every service instance. All traffic, both inbound and outbound, is transparently intercepted by this proxy before the service code sees it.
Analogy: Think of an airport. Every flight needs the same things: takeoff clearance, collision avoidance, weather routing, and a landing slot. One option: every pilot manages all of this manually in their own cockpit. The other option: air traffic control handles it for every plane uniformly, and pilots focus on their flight. A service mesh is air traffic control for your microservices. Your services focus on their logic; the mesh handles the networking.
The key word above is transparently. Your service doesn't know the proxy exists. Linux iptables rules in the pod redirect all outbound and inbound traffic to the proxy process (Envoy listens on port 15001 for outbound, 15006 for inbound). Your service code makes a plain HTTP call to http://user-service; the proxy intercepts it, verifies mTLS with the destination proxy, applies retry policy, records a trace span, and emits a metric, all before the bytes leave the pod.
The result is a clean separation of concerns: service code handles business logic, the network layer handles reliability and security. A team shipping features never touches auth middleware or retry configuration. They declare a policy in YAML, and the control plane distributes it to the right proxies.
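How does the proxy get into the pod in the first place? In Istio, the usual route is automatic sidecar injection, switched on per namespace with a label. A minimal sketch, assuming a hypothetical namespace called `shop` (the name is an illustration, not something from this article):

```yaml
# Hypothetical namespace; only the istio-injection label matters here.
apiVersion: v1
kind: Namespace
metadata:
  name: shop
  labels:
    # Tells Istio's injection webhook to add the sidecar to new pods
    istio-injection: enabled
```

Pods created in that namespace after the label is applied get the sidecar. Pods that were already running need a restart to pick it up.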
How It Works
Every service mesh has two layers working together. Understanding the split is the key to understanding everything else.
The Data Plane
The data plane is every Envoy proxy running in your cluster. It's the layer that actually makes or breaks each individual request. When Service A calls Service B, the sequence is:
- Service A makes an outbound call: e.g., GET http://order-service/orders/456. The service code sees nothing unusual.
- iptables intercepts the packet: before the SYN packet leaves the pod's network namespace, a rule redirects it to Envoy's outbound listener on port 15001.
- Envoy processes the request: the proxy determines the destination (order-service), looks up its current load balancing state (via EDS, the Endpoint Discovery Service), applies retry + timeout policy from its xDS config, and initiates an mTLS handshake with the destination-side Envoy.
- mTLS handshake completes: both proxies present their SPIFFE certificates (issued by the mesh's Certificate Authority). Both sides are authenticated. The connection is encrypted.
- Request arrives at destination proxy: the Order Service's Envoy receives the request and applies inbound policies: is this caller authorized? Is the request rate within allowance? It then proxies the request to the Order Service on localhost.
- Order Service responds on localhost: the response travels back through the destination Envoy (which records the response metrics and finishes the trace span) and back through the source Envoy to Service A's code.
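The "is this caller authorized?" check on the destination side is also declared as policy. Here is a rough sketch of what that could look like as an Istio AuthorizationPolicy. The `default` namespace, the `app: order-service` label, and the `checkout` service account are all assumptions made up for illustration:

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: order-service-allow-checkout   # hypothetical name
  namespace: default                   # assumed namespace
spec:
  selector:
    matchLabels:
      app: order-service               # assumed pod label
  action: ALLOW
  rules:
    - from:
        - source:
            # Caller identity taken from its mTLS certificate
            # (assumed service account "checkout" in "default")
            principals: ["cluster.local/ns/default/sa/checkout"]
      to:
        - operation:
            methods: ["GET"]
            paths: ["/orders/*"]
```

Once an ALLOW policy selects a workload, requests that match none of its rules are denied, so a policy like this doubles as a default-deny for every other caller.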
Total added latency from two proxy hops: approximately 3 to 8 ms for a typical gRPC or HTTP request. For inter-datacenter calls that already cost 10 to 50 ms, this is negligible. For sub-millisecond in-memory calls, it's not, but those shouldn't be remote calls at all.
```yaml
# Istio DestinationRule: circuit breaker (outlier detection) + connection pool
# Retries are NOT configured here; use VirtualService for retry policy
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: order-service
spec:
  host: order-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 50
        maxRequestsPerConnection: 10
    outlierDetection:
      # Circuit breaker: eject hosts with 3 consecutive 5xx errors for 30s
      consecutive5xxErrors: 3
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
```

```yaml
# VirtualService: retry policy for the same destination
# Retry config lives here, not in DestinationRule
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: order-service
spec:
  hosts:
    - order-service
  http:
    - retries:
        attempts: 3
        perTryTimeout: 500ms
        retryOn: "5xx,reset,connect-failure"
      route:
        - destination:
            host: order-service
```
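Timeouts, another of the six concerns, follow the same pattern. As a sketch, the same VirtualService could carry an overall request timeout next to the retry policy. The 2s value is an assumption picked for illustration, not a recommendation:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: order-service
spec:
  hosts:
    - order-service
  http:
    - timeout: 2s            # assumed overall budget for the whole request
      retries:
        attempts: 3
        perTryTimeout: 500ms # budget for each individual try
        retryOn: "5xx,reset,connect-failure"
      route:
        - destination:
            host: order-service
```

The overall timeout caps the total time spent across all tries, so a slow destination can't stretch a single call indefinitely through repeated retries.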
The Control Plane
The control plane manages configuration distribution, certificate lifecycle, and policy enforcement. It never sits in the request path; it's the out-of-band management layer.
In Istio (the most widely deployed mesh), the control plane is a single binary called istiod, which consolidates three functions:
- Pilot: watches Kubernetes Service/Deployment resources and translates them into Envoy xDS configuration. When a new pod starts or an old one dies, Pilot updates every relevant Envoy's endpoint table within seconds.
- Citadel: the certificate authority. Issues X.509 SVID certificates to every service (bound to Kubernetes service accounts). Rotates them automatically every 24 hours.
- Webhook validation: user-submitted Istio CRDs (VirtualServices, DestinationRules) are validated at kubectl apply time via a Kubernetes admission webhook registered by istiod. Misconfigured resources are rejected before they ever reach Pilot. (In Istio < 1.5, this was a separate component called Galley.)
The data plane and control plane communicate via the xDS API (discovery services), an open protocol Envoy implements. This means the control plane doesn't have to be Istio: any system that speaks xDS can manage Envoy proxies. This is why Consul can act as a control plane for Envoy, while Linkerd takes a different route and ships its own proxy with its own control plane.
Key Components
| Component | Role | What breaks without it |
|---|---|---|
| Sidecar proxy | Envoy or Linkerd-proxy deployed alongside every service instance in the same pod | No traffic interception; services must handle networking themselves |
| Data plane | The collective set of all sidecar proxies making and receiving requests | No enforcement of policies, no metrics, no mTLS |
| Control plane | Manages config distribution (Pilot), certificate issuance (Citadel), and policy validation | Proxies hold stale routes and stale certificates; new services are never discovered |
| SPIFFE/SVID | Cryptographic service identity tied to the workload (Kubernetes service account) | No way to assert "I am the payment service"; mTLS verification is impossible |
| Certificate Authority (CA) | Issues and rotates X.509 certificates that proxies use for mTLS | Expired certs fail handshakes; no cert rotation means one compromised cert is permanent |
| xDS API | Protocol over which the control plane pushes config updates to proxies | Config changes require proxy restarts or are never distributed |
| VirtualService | Istio CRD declaring traffic routing rules (canary %, header matching, retries, fault injection) | No layer-7 traffic management; you're back to DNS-only routing |
| DestinationRule | Istio CRD declaring connection policies per destination (circuit breaker, connection pool, mTLS mode) | Policies are global instead of per-destination; no circuit breaker declarations |
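To make the canary percentage in the VirtualService row concrete, here is a minimal sketch of a 90/10 traffic split. It assumes the Order Service pods carry a hypothetical `version: v1` or `version: v2` label. The resource names and weights are illustrative, and in practice you would merge these fields into your existing DestinationRule and VirtualService for the host rather than creating a second pair:

```yaml
# Subsets map a name to pods selected by label (labels are assumed)
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: order-service-versions   # hypothetical name
spec:
  host: order-service
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
---
# Weighted routing: 90% of requests to v1, 10% to the v2 canary
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: order-service-canary     # hypothetical name
spec:
  hosts:
    - order-service
  http:
    - route:
        - destination:
            host: order-service
            subset: v1
          weight: 90
        - destination:
            host: order-service
            subset: v2
          weight: 10
```

Rolling back is a matter of setting the weights to 100 and 0 and re-applying. No pods restart, because only the proxies' routing config changes.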
Core Capabilities
Mutual TLS: Zero-Trust by Default
Regular TLS is one-way: the client authenticates the server (you verify the bank's certificate). Mutual TLS (mTLS) requires both sides to present a certificate. The payment service proves it is the payment service; the order service proves it is the order service. No code changes required; the proxies handle the handshake.
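In Istio, how strictly this is enforced is itself a policy. A minimal sketch, assuming Istio is installed with its default root namespace `istio-system`, that makes every workload in the mesh reject plaintext traffic:

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # assumed root namespace; makes the policy mesh-wide
spec:
  mtls:
    mode: STRICT            # only mTLS connections are accepted
```

Without a policy like this, Istio proxies run in PERMISSIVE mode: they accept both mTLS and plaintext, which is handy during migration but is not yet zero-trust.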