Microservices
Learn how microservices decompose monolithic applications into independently deployable services, when the operational overhead is worth it, and how to avoid the distributed traps that swallow engineering teams.
TL;DR
- Microservices split a large application into small, independently deployable services, each owning its own data, its own runtime, and a single bounded domain of business logic.
- The payoff is independent deployment and scaling: the checkout service handles Black Friday load without touching the notification service. Teams ship features without coordinating across seven squads.
- The cost is operational complexity: every benefit requires distributed tracing, service discovery, API gateways, and eventually-consistent data across service boundaries, none of which you need in a monolith.
- The #1 mistake is moving to microservices before you've felt the monolith's pain. The infrastructure overhead costs 3-6 months of engineering time minimum to set up correctly.
- If you're under 50 engineers or under 500K DAU and asking "should we do microservices?", the answer is no.
The Problem It Solves
It's 11 a.m. on Black Friday. Your e-commerce platform is handling 200,000 concurrent users. A bug in the payment module throws an uncaught exception, exhausts the JVM heap, and starts cascading OOM errors across the process. Within 90 seconds, the entire application is down, including the homepage, product search, and user profiles that have nothing to do with payment.
Your on-call engineer opens the deployment pipeline to hotfix it. Estimated build time: 22 minutes. Because the entire 300,000-line codebase compiles, tests, and ships as a single deployable unit. The payment team has a fix ready in 8 minutes. The other 11 teams sit and wait.
By the time the fix ships, you've been down for four hours. Cart abandonment is in the millions. The post-mortem has two findings: the payment bug, and the fact that a payment bug had no business taking down the homepage.
I often see engineers attribute these problems to "technical debt" or "bad code." The real culprit is the architecture. No amount of refactoring fixes the fundamental problem: when everything runs in one process, everything fails in one process.
The monolith's hidden failure mode
It's not that monoliths are slow or broken; most start out fast. The failure mode is coupling. One bad deployment window, one shared database schema migration, one memory leak in any module, and the whole application pays. That coupling is invisible until traffic and team size make it expensive.
The fix isn't better code inside the monolith. It's a different architectural boundary: isolate the payment domain so it can fail, scale, and deploy without touching anything else. That's the job microservices exist to do.
What Is It?
A microservice is a small, independently deployable application that owns a single bounded domain of business logic and its own persistent data store. Microservices communicate over the network, typically via REST, gRPC, or a message broker, and can be deployed, scaled, and updated without coordinating with other services.
Analogy: Think of a city with specialized districts: a financial district, a restaurant quarter, a hospital complex. Each district has its own staff, its own operating hours, and its own supply chain. The financial district going dark doesn't close the restaurants. The hospital scaling for a flu season doesn't affect the banks.
A monolith is a city run out of one building: when one department floods, everyone evacuates. Microservices restructure that into separate, self-sufficient districts that communicate via well-defined interfaces ("I need payment authorization") while keeping their internal workings and data completely isolated.
```mermaid
flowchart TD
    subgraph Internet["Client Layer"]
        Users(["Users\nMobile · Web · Partners"])
    end
    subgraph Gateway["API Gateway Layer"]
        GW["API Gateway\nAuth · Rate limit · Routing\nSSL termination · Request fan-out"]
    end
    subgraph Services["Microservices - Independently Deployed"]
        US["User Service\nAuth · Profile · JWT"]
        OS["Order Service\nCart · Checkout\nOrder state machine"]
        PS["Product Service\nCatalog · Inventory\nSearch indices"]
        NS["Notification Service\nEmail · SMS · Push\nPreferences"]
    end
    subgraph Async["Async Messaging Layer"]
        MB["Message Broker\nKafka · RabbitMQ\nEvent fan-out · Durable log"]
    end
    subgraph DataTier["Data Tier - Isolated Per Service"]
        UD[("User DB\nPostgres")]
        OD[("Order DB\nPostgres")]
        PD[("Product DB\nMongoDB")]
        Cache["Shared Cache\nRedis · Read-aside · Hot reads"]
    end
    Users -->|"HTTPS · TLS"| GW
    GW -->|"Route /users · Auth"| US
    GW -->|"Route /orders"| OS
    GW -->|"Route /products"| PS
    OS -->|"Publish: order.placed"| MB
    MB -->|"Subscribe: order.placed"| NS
    US -->|"Reads / writes"| UD
    OS -->|"Reads / writes"| OD
    PS -->|"Reads / writes"| PD
    US & OS & PS -->|"Hot reads · < 1ms"| Cache
```
Each service is a self-contained island. Order Service doesn't know or care how Notification Service works: it publishes an event and moves on. That decoupling is the core architectural property microservices optimize for.
For your interview: when you introduce microservices, name your service boundaries (don't say "a bunch of services") and immediately move to the API Gateway, because that's where most of the interesting design complexity lives.
How It Works
Here's exactly what happens when a user places an order on a microservices platform:
1. Client sends request: `POST /orders` hits the API Gateway. The gateway validates the JWT, checks the rate limit, and routes the request to the Order Service.
2. Order Service handles the write: it creates the order record in its own Postgres database. It does NOT call the Notification Service directly.
3. Event published to broker: the Order Service publishes an `order.placed` event to Kafka with the order ID and user ID. This takes ~5ms with `acks=all` (ISR acknowledgement). The Order Service immediately returns `201` with `status: pending` to the client.
4. Notification Service consumes the event: it reads from the Kafka topic, fetches the user's email preference from its own store, and sends the confirmation email. This happens ~50-200ms after step 3, completely asynchronously.
5. Payment Service processes: it also consumes the `order.placed` event, but checks idempotency first: "Have I already processed `order_id: abc123`?" If yes, it discards the duplicate. If no, it charges the card and publishes `payment.completed`. If payment fails, a compensating event fires and the Order Service updates the order status to `payment_failed`, triggering a user notification.
The critical insight: the client gets a `201` with `status: pending` in step 3, not a confirmed order. Order Service owns the order record and updates its status as downstream events arrive. The decoupling means order creation and payment processing scale and fail independently, but the `201` is a promise to process, not a confirmation of completion.
```typescript
// Order Service - handles POST /orders
export async function createOrder(
  userId: string,
  items: OrderItem[],
): Promise<Order> {
  // 1. Write to Order Service's own database (isolated schema)
  const order = await orderRepo.create({
    userId,
    items,
    status: 'pending',
    createdAt: new Date(),
  });

  // 2. Publish event - wait for the broker ack (acks=all), NOT for consumers to finish
  //    ~5ms with acks=all (ISR acknowledgement - required for payment-adjacent flows)
  await eventBus.publish('order.placed', {
    orderId: order.id,
    userId: order.userId,
    totalAmount: order.totalAmount,
    timestamp: order.createdAt.toISOString(),
  });

  // 3. Return immediately with status: pending - not a payment confirmation
  return { ...order, status: 'pending' };
}

// Payment Service - idempotent Kafka consumer
kafkaConsumer.subscribe('order.placed', async (event: OrderPlacedEvent) => {
  // CRITICAL: Kafka guarantees at-least-once delivery - duplicates happen on
  // consumer restart, partition rebalance, or broker hiccup. Always check first.
  const alreadyProcessed = await paymentRepo.existsByOrderId(event.orderId);
  if (alreadyProcessed) return; // Discard duplicate - idempotency key = orderId

  const charge = await paymentGateway.charge(event.userId, event.totalAmount);
  await paymentRepo.save({ orderId: event.orderId, chargeId: charge.id });
  await eventBus.publish('payment.completed', { orderId: event.orderId, chargeId: charge.id });
});

// Notification Service - idempotent consumer (email deduplication)
kafkaConsumer.subscribe('order.placed', async (event: OrderPlacedEvent) => {
  const sent = await notifRepo.existsByOrderId(event.orderId);
  if (sent) return; // Do not send a duplicate confirmation email

  const user = await userClient.getById(event.userId); // gRPC, cached in Redis
  await emailService.sendOrderConfirmation(user.email, event.orderId);
  await notifRepo.markSent(event.orderId);
});
```
Interview tip: state the pending status and idempotency
Two things to say explicitly when you draw this flow: (1) "The `201` carries `status: pending`; it's a promise to process, not a receipt." (2) "Every Kafka consumer must be idempotent because Kafka delivers at-least-once; I'd use the order ID as the deduplication key." Both signal distributed systems fluency. Most candidates forget both.
In an interview, I'll walk through both the sync path (direct service calls) and the async path (event-driven), and explicitly say which I'm choosing for each interaction and why. Interviewers care less about the specific choice and more about whether you can defend it.
Key Components
| Component | Role |
|---|---|
| API Gateway | Single entry point for all client traffic. Handles auth, rate limiting, request routing, protocol translation (REST to gRPC), and SSL termination. Without it, every client must know every service's address and implement its own auth. |
| Service Registry | A live directory of running service instances and their health. Consul, Kubernetes DNS, or AWS Cloud Map. Services register on startup, deregister on shutdown. The gateway uses it for routing decisions. Stale registrations are evicted by health-check TTL. |
| Message Broker | Decouples producers from consumers for async flows. Kafka for high-throughput durable event streams (millions of events/second). RabbitMQ for lower-throughput task queues. Guarantees at-least-once delivery without direct service coupling. |
| Circuit Breaker | Wraps outbound service calls and opens (fails fast) when a downstream service is consistently unavailable, preventing cascade failures from thread pool exhaustion. Resilience4j, Hystrix, or Istio sidecar proxy. Non-negotiable for any synchronous inter-service call in production. |
| Distributed Tracing | Propagates trace IDs (e.g., traceparent via W3C TraceContext) across service boundaries so a single user request can be reconstructed end-to-end across 10+ services. Jaeger, Zipkin, or OpenTelemetry. Without this, debugging a cross-service latency spike is guesswork. |
| Container Orchestrator | Kubernetes (or ECS) manages service deployment, scaling, health checks, and inter-service networking. Each microservice runs as a container with its own resource envelope. Rolling deployments and auto-scaling operate per service independently. |
| Service Mesh | Optional sidecar proxy layer (Istio, Linkerd) that handles mTLS, retries, circuit breaking, and traffic observability at the infrastructure level โ without any code changes in services. Worth the complexity only at 50+ services with strict security or observability requirements. |
| Service Identity / mTLS | Each service has a cryptographic identity (SPIFFE/SPIRE or your mesh's built-in CA). Services prove identity via mutual TLS on every internal call: no hard-coded secrets, no IP allow-lists. Without this, a compromised service can freely call Billing, Auth, or Payment. The API Gateway handles TLS from external clients; your mesh handles mTLS internally. These are different trust boundaries. |
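To make the circuit breaker row concrete, here's a minimal sketch of the closed/open/half-open state machine in TypeScript. This is illustrative only, not Resilience4j's or Hystrix's actual API; the `failureThreshold` and `resetTimeoutMs` knobs are names made up for this example.

```typescript
// Minimal circuit breaker sketch (illustrative, not a real library's API).
// closed: calls pass through. open: fail fast. half-open: allow one probe.
type State = 'closed' | 'open' | 'half-open';

class CircuitBreaker {
  private state: State = 'closed';
  private failures = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold = 5,    // consecutive failures before opening
    private resetTimeoutMs = 30_000, // how long to stay open before probing
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.openedAt < this.resetTimeoutMs) {
        throw new Error('circuit open: failing fast');
      }
      this.state = 'half-open'; // timeout elapsed: let one probe through
    }
    try {
      const result = await fn();
      this.failures = 0;
      this.state = 'closed'; // probe (or normal call) succeeded
      return result;
    } catch (err) {
      this.failures++;
      if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
        this.state = 'open'; // fail fast from now on
        this.openedAt = Date.now();
      }
      throw err;
    }
  }

  get currentState(): State {
    return this.state;
  }
}
```

The point of the fail-fast path is that callers stop burning threads and timeouts on a downstream that is already known to be dead, which is exactly what prevents the cascade.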
Communication Patterns
So when does a service call another service synchronously versus publish an event? Getting this wrong is the source of most microservices production incidents.
Synchronous (REST / gRPC)
Use synchronous calls when the caller needs the response to continue. The canonical cases:
- Auth check: verify the JWT before serving any request
- Inventory check: "Is this item in stock?" must be answered before showing Add to Cart
- Read path: fetching user profile data to render a page
The danger is latency chaining: if Service A calls B and B calls C, the total latency is A + B + C. At scale, a 200ms downstream call turns a 50ms endpoint into a 250ms endpoint. Under load, that compounds hard.
```typescript
// Synchronous gRPC call from Order Service to Inventory Service.
// If Inventory takes 200ms, order creation blocks for 200ms.
const inventoryStatus = await inventoryClient.checkAvailability({
  productId: item.productId,
  quantity: item.quantity,
});

if (!inventoryStatus.available) {
  throw new OutOfStockError(item.productId);
}
// Only continue if inventory is confirmed - synchronous is correct here
```
Asynchronous (Events / Message Broker)
Use async when the caller does not need the response to proceed. The canonical cases:
- Notifications: confirmation email after an order is placed
- Analytics: log the page view without blocking the page response
- Side effects: update the search index, invalidate downstream caches, trigger fulfillment
The payoff is fault isolation: if the Notification Service is down, the Order Service is unaffected. The event sits durably in Kafka until Notification recovers and processes it, with no data loss and no coupling.
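Recovery in practice pairs the durable event log with retries in the consumer. A minimal exponential-backoff-with-jitter wrapper might look like this; it's an illustrative helper, not part of any Kafka client library, and `maxAttempts` and `baseDelayMs` are assumed knobs:

```typescript
// Exponential backoff with full jitter for an async handler (sketch).
// A real consumer would pair this with offset commits so that an
// exhausted retry lands in a dead-letter topic instead of being lost.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 100,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Full jitter: random delay in [0, base * 2^attempt), capped at 10s,
      // so a thundering herd of recovering consumers spreads out.
      const cap = Math.min(baseDelayMs * 2 ** attempt, 10_000);
      const delay = Math.random() * cap;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError; // exhausted: surface to dead-letter handling
}
```

The jitter matters: without it, every consumer that failed at the same moment retries at the same moment, and the downstream gets hammered in synchronized waves.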
```mermaid
sequenceDiagram
    participant C as Client
    participant GW as API Gateway
    participant OS as Order Service
    participant K as Kafka
    participant PS as Payment Service
    participant NS as Notification Service

    C->>GW: POST /orders
    GW->>OS: Route (auth validated)
    activate OS
    OS->>OS: Write order to DB · ~5ms
    OS->>K: Publish order.placed event · ~5ms (acks=all)
    OS-->>GW: 201 Created · {orderId}
    deactivate OS
    GW-->>C: 201 Created · {orderId}

    Note over K,PS: Async - decoupled from client latency
    K->>PS: Consume order.placed · charge card
    K->>NS: Consume order.placed · send email
    PS-->>K: Publish payment.completed
    Note over NS: Confirmation email sent ~150ms after order
```
The client gets a response after ~15ms total. Payment and Notification run on their own timelines with their own retry logic and their own failure modes, fully decoupled from the order response.
The rule: if the failure of the downstream operation would change your response to the client, use sync. If not, use async. That one rule eliminates most incorrect communication pattern decisions in system design interviews.
Trade-offs
| Benefit | Cost |
|---|---|
| Independent deployment: ship one service without touching others | Distributed systems complexity: network partitions, partial failures, retry storms |
| Independent scaling: scale only the bottleneck service 10× during a sale | Data consistency challenges: no ACID across service boundaries; eventual consistency only |
| Fault isolation: one service crash doesn't cascade through the system | Operational overhead: each service needs its own CI/CD pipeline, alerting, log aggregation |
| Technology freedom: Postgres for Orders, MongoDB for Products, Redis for Sessions | Latency cost per hop: every inter-service call adds 1-5ms network overhead |
| Team autonomy: the Orders team owns their full stack end-to-end | Distributed tracing required: debugging request failures across 10 services without trace IDs is impossible |
| Smaller, understandable codebases: engineers master their bounded context | Testing complexity: integration tests require live dependent services or complex mocks |
The fundamental tension here is developer velocity versus operational complexity.
A well-run microservices organization ships features faster because teams are autonomous and deployments are independent. But getting to "well-run" costs months of platform engineering investment: Kubernetes, service mesh, distributed tracing, contract testing. In the meantime, developer velocity is worse than in the monolith.
When to Use It / When to Avoid It
Ok, but here's the honest answer on when this actually makes sense.
Use microservices when:
- Multiple teams own separate domains: 3+ teams stepping on each other's deployment windows. The coordination tax exceeds the migration cost.
- Services have wildly different scaling needs: your video transcoding service needs GPU nodes and 10× the memory of your auth service. Independent scaling is the only clean answer.
- Failure isolation is a hard requirement: a payment outage must not affect product browsing. Financial or healthcare systems often mandate this.
- Compliance demands separation: PCI-DSS for payment, HIPAA for health data. Each domain needs isolated access control, encryption, and audit logs.
- You've already felt the monolith's pain: you can point to specific incidents, such as blocked deployments, 30-minute CI runs, cascading failures. If those incidents are real, microservices are worth the migration cost.
Avoid microservices when:
- Under 50 engineers: you lack the bandwidth to run separate CI/CD pipelines, service registries, and distributed tracing per service. Platform work will eat your product roadmap.
- Under 500K DAU: a well-tuned modular monolith handles millions of daily requests. You're paying the complexity tax without needing the scaling benefit.
- Your team is new to distributed systems: microservices surface network partition failures, split-brain scenarios, and partial failures that monoliths hide entirely. Knowledge gaps here cause production incidents within weeks.
- Your domain boundaries aren't clear: if you can't articulate exactly where one service's responsibility ends and another's begins, your service cut will be wrong. You'll end up with a distributed monolith: all the complexity, none of the benefits.
If you're unsure whether you need microservices, you probably don't yet. Start with a modular monolith, identify the actual coupling pain points when you hit them, then extract the one service causing the most team friction.
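One way to set up that later extraction inside a modular monolith: define the future service boundary as an in-process interface today, so extraction becomes an implementation swap rather than a rewrite. A sketch under that assumption, with all names (`PaymentPort`, `InProcessPayments`) hypothetical:

```typescript
// Carve the boundary inside the monolith first (illustrative types).
// The rest of the codebase depends only on this interface, never on
// the payments module's tables or internals.
interface PaymentPort {
  authorize(orderId: string, amountCents: number): Promise<{ approved: boolean }>;
}

// Today: an in-process implementation, shipped in the same deployable.
class InProcessPayments implements PaymentPort {
  async authorize(orderId: string, amountCents: number) {
    // Stand-in for real gateway logic; decline zero/negative amounts.
    return { approved: amountCents > 0 };
  }
}

// Later: swap in an HTTP/gRPC client that implements the same
// PaymentPort. Callers don't change - the extraction is invisible.
```

If you can't write this interface cleanly, that's the signal your domain boundary isn't clear yet, and a network hop won't fix it.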
Real-World Examples
Amazon built the "two-pizza team" rule in the early 2000s. The counter-intuitive lesson: Amazon restructured the org chart before restructuring the architecture. Teams became small and service-owning first; the microservices topology followed Conway's Law, not the other way around. If you adopt microservices without the corresponding team structure (splitting one large team into smaller ones with real ownership), you get the deployment artifacts of microservices with none of the velocity. The architecture didn't make Amazon fast. The org design did.
Netflix decomposed their monolith in 2009 after a database corruption incident took the entire service down for three days. The lesson isn't the 700+ services they ended up with; it's what they were forced to build because of them: Hystrix (circuit breaker), Eureka (service registry), Zuul (API gateway), Chaos Monkey (resilience testing). None of these existed before Netflix needed them. When you adopt microservices today, you inherit this infrastructure via Kubernetes and Istio. If you skip the operational discipline Netflix built anyway (distributed tracing, circuit breakers, chaos engineering), you're running Netflix's 2009 architecture on 2025 infra with 2009's failure modes.
Uber decomposed into 1,000+ microservices by 2016. The less-told part: by 2020, the service graph had become operationally invisible. No team knew which services called which at runtime, latency attribution was guesswork, and debugging a checkout regression required paging five on-call rotations. Uber introduced an Envoy-based service mesh and internal tooling (Cadence, Peloton) specifically to regain visibility. The lesson: at 1,000+ services, the problem shifts entirely from deployment independence to governance, visibility, and policy enforcement. Microservices solve the deployment bottleneck and introduce the observability problem.
How This Shows Up in Interviews
Microservices come up organically in almost every system design question above a certain scale. You won't be asked "design a microservices architecture"; you'll be asked to design Twitter, Netflix, or Uber, and the right answer naturally involves decomposing the system into bounded domain services.
When to proactively bring it up:
I'll almost always introduce microservices proactively when the prompt mentions "multiple teams," "independent scaling," or "high availability per component." If none of those is mentioned, I wait until the interviewer asks why I'm not using a monolith, and have an honest answer ready: "At this scale, yes. If you told me this was a 10-engineer team at Day 1, I'd start with a modular monolith."
- When the system has clearly distinct domains with wildly different scaling profiles (video transcoding vs. user auth in YouTube need fundamentally different infrastructure)
- When the interviewer introduces team structure or asks about deployment strategy
- When you're describing how components can fail without cascading; fault isolation is microservices' strongest argument
What depth is expected at senior/staff level:
- Name specific services and their ownership boundaries: not "a bunch of services," but "User Service, Feed Service, Notification Service, each owning its own database"
- Explain why each service is a service: what coupling would exist if it were combined with another
- Describe the communication pattern choice: which calls are sync vs. async, with explicit reasoning, and say "the 201 carries status: pending because payment is async"
- Address data consistency: how do you handle an operation that needs atomic-looking behavior across the Order DB and User DB? Name the Saga pattern.
- Cover idempotency: every async consumer must be idempotent. Kafka delivers at-least-once. The order ID is your deduplication key.
- Know the failure modes: circuit breaker for cascading failures, distributed tracing for diagnosis, consumer-driven contract tests for API breakage
Interview Q&A:
| Interviewer asks | Strong answer |
|---|---|
| "How do services find each other?" | "Service registry: Consul or Kubernetes DNS. Services register on startup, deregister on shutdown. Stale entries are evicted by health-check TTL. The API Gateway resolves names at request time." |
| "What if a downstream service is down?" | "Circuit breaker opens after N consecutive failures; subsequent calls fail fast instead of blocking. Graceful degradation: serve stale cached data or a reduced response. Async consumers retry from the broker with exponential backoff." |
| "How do you handle a transaction that spans services?" | "Saga pattern: a sequence of local transactions with compensating events on failure. Orchestrator-based sagas for critical checkout flows (explicit, observable), choreography for loosely-coupled side effects. No distributed ACID at any point." |
| "How do you debug a request that spans 10 services?" | "Distributed tracing with W3C traceparent headers. Each service adds a span. Reconstruct the full call graph in Jaeger or Datadog. Without this, a 4-second p99 across 10 services is completely opaque." |
| "How do services authenticate to each other?" | "mTLS: each service has a cryptographic identity via SPIFFE/SPIRE or the service mesh CA. No hard-coded secrets, no IP allow-lists. The API Gateway does TLS termination for external traffic; the mesh handles mTLS internally. These are different trust boundaries." |
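An orchestrator-based saga can be sketched as a list of steps, each paired with a compensating action, run by a tiny coordinator. This is an illustrative in-memory version only; a production orchestrator (Temporal, Cadence, or hand-rolled) persists saga state so it survives crashes mid-flow, and the step names below are hypothetical.

```typescript
// Orchestrator-based saga sketch: local transactions plus compensations.
interface SagaStep {
  name: string;
  action: () => Promise<void>;      // the local transaction
  compensate: () => Promise<void>;  // how to undo it if a later step fails
}

async function runSaga(steps: SagaStep[]): Promise<'completed' | 'compensated'> {
  const done: SagaStep[] = [];
  for (const step of steps) {
    try {
      await step.action();
      done.push(step);
    } catch {
      // A step failed: undo every completed step in reverse order.
      // Not atomic - just a sequenced rollback toward a consistent state.
      for (const s of done.reverse()) await s.compensate();
      return 'compensated';
    }
  }
  return 'completed';
}
```

For a checkout flow the steps might be reserveInventory, chargeCard, createShipment: if chargeCard throws, the orchestrator runs releaseInventory, and the order lands in a failed-but-consistent state rather than a half-committed one.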
My recommendation: pick one design decision to go deep on, usually the async event flow with idempotency or the data consistency challenge with sagas. That's where most of the signal lives. Candidates who stay at the topology layer ("here's my service diagram") look shallow compared to those who can defend precisely why the 201 carries `status: pending` and what breaks if a Kafka consumer isn't idempotent.
Test Your Understanding
Quick Recap
- Microservices split a system into independently deployable services, each owning its own bounded domain and its own database; failure and deployment are isolated by design.
- The API Gateway is the single entry point: it handles auth, rate limiting, and routing before requests reach any service, removing the need for every client to know every service address.
- Synchronous calls (REST/gRPC) chain latency; use them only when the downstream response is required to continue. Every other inter-service interaction is a candidate for async events.
- Asynchronous events via Kafka or RabbitMQ decouple producers from consumers: Order Service doesn't know or care whether Notification Service is running when it publishes `order.placed`.
- Data consistency across service boundaries requires the Saga pattern: a sequenced set of local transactions with compensating events on failure. ACID transactions do not exist across microservice boundaries.
- Distributed tracing with a propagated trace ID is non-negotiable beyond 3 services; without it, diagnosing cross-service latency spikes or partial failures is guesswork in production.
- If you're debating whether you need microservices, you don't yet. Find the specific coupling pain in your monolith first, then extract the one service that eliminates it.
Related Concepts
- API Gateway: the required front door for every microservices architecture. Covers auth, routing, rate limiting, and the failure modes of gateway centralization.
- Message Queues: the async backbone for decoupled microservices. Covers Kafka vs. RabbitMQ, consumer group semantics, and exactly-once delivery guarantees.
- Service Mesh: the infrastructure layer for large-scale microservices: mTLS, retries, circuit breaking, and observability via Istio or Linkerd without code changes.
- Circuit Breaker: the pattern that prevents cascading failures across synchronously-coupled services. Essential reading after this article.
- Monolith vs. Microservices: the full trade-off breakdown, covering when the architectural overhead pays off and when it actively hurts you.