Monolith vs microservices
Learn exactly when to split a monolith into microservices, what the real costs are (hint: it's not lines of code), and how to make that call without destroying your team.
TL;DR
- Use a monolith when your team is under 10 engineers, the domain is not yet fully understood, or you're building an MVP. A monolith is not a failure state; it's a starting point.
- Use microservices when independent teams need independent deployment cadences, when components have wildly different scaling profiles, or when a compliance boundary (PCI, HIPAA) demands physical isolation.
- The biggest microservices cost is not infrastructure. It's operational complexity: distributed tracing, eventual consistency, network failure handling, and the cognitive load of reasoning across service boundaries.
- "Distributed monolith" is the worst outcome: you split services but kept shared databases or synchronous chains of 6 blocking calls. You now have all the complexity with none of the benefits.
- A modular monolith with strong internal boundaries is almost always the right intermediate step before splitting into services.
The Framing
In 2019, a well-funded startup rewrote their 18-month-old Rails monolith into 23 microservices. The reason given in the retrospective: "We wanted to scale like Netflix."
Eighteen months later, they wrote a second post. Their p99 API latency had gone from 80ms to 380ms; on-call now required expertise in Kubernetes, Istio, distributed tracing, and 12 separate service repositories; and a bug in the user-profile service took down checkout through an undiscovered synchronous dependency chain. Three senior engineers quit.
The actual traffic when they migrated: 8,000 daily active users.
Now compare that with Shopify. For over a decade, Shopify ran one of the largest e-commerce platforms in the world on a single Ruby on Rails monolith. They served hundreds of thousands of merchants and survived Black Friday spikes that would topple most architectures.
Their secret was not microservices. It was a well-modularized monolith with relentless attention to database performance and caching.
The question is never "which is better?" It's "which is right for where we are today, given our team size, operational maturity, and domain clarity?"
I've seen this exact retrospective written at least a half-dozen times across different companies. The details change; the timeline doesn't.
How Each Works
The Monolith
A monolith is a single deployable unit where all application concerns live in the same process. Presentation logic, business logic, and data access code all run together and share the same memory space.
When UserService calls OrderService, it's a function call in the same process: sub-microsecond, no network hop, no serialization, no retry budget. When you deploy, you build one artifact, push it, and restart one process.
The operational profile is simple: one binary to monitor, one log stream, one set of metrics, one deployment pipeline. Your junior engineer on-call at 2am is dealing with one thing that's broken, not hunting through twelve service dashboards to find which one threw the first error.
The monolith's advantage is operability and simplicity. Its disadvantage is coupling: everything deploys together, and a bug anywhere can affect everything.
I've seen teams describe themselves as 'microservices shops' while every service still deploys in lockstep from the same pipeline. The architecture label changed; the operational risk didn't.
Microservices
A microservices architecture splits the application into independently deployable services, each responsible for one well-defined business capability. The services communicate over the network, typically via REST, gRPC, or an async message queue.
When Order Service needs user data, it makes an HTTP or gRPC call to User Service. That's a network hop: 1-10ms under normal conditions, potentially much longer under load or during failures. The call can time out, fail, or return stale data.
The core trade-off: you get independent deployability and independent scaling, but you take on the full complexity of distributed systems. Timeouts, retries, circuit breakers, distributed tracing, eventual consistency, and service discovery all become your problem.
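Those resilience patterns are less exotic than they sound. Here is a minimal circuit breaker sketched in Python; the threshold and reset window are illustrative defaults, and production code would normally reach for a library or a service mesh rather than hand-rolling this:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch: opens after `threshold` consecutive
    failures, short-circuits calls for `reset_after` seconds, then allows
    one probe call through (the "half-open" state)."""

    def __init__(self, threshold=3, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        # While open, return the fallback instead of hammering a service
        # that is already failing.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None  # half-open: let one probe call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            return fallback()
        self.failures = 0  # any success closes the circuit again
        return result
```

A caller wraps each downstream request, e.g. `breaker.call(lambda: fetch_user(42), lambda: CACHED_PROFILE)`, so one failing dependency degrades gracefully instead of cascading.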
Head-to-Head Comparison
| Dimension | Monolith | Microservices |
|---|---|---|
| Deployment unit | Single artifact (JAR, binary, Docker image): deploy all or nothing | Per-service artifact, deployable independently |
| Inter-component latency | Sub-microsecond (in-process function call) | 1-10ms per network hop (compounds across chains) |
| Failure isolation | A bug anywhere can crash everything | A failing service is isolated (if circuit breakers are configured) |
| Scaling granularity | Scale the entire app vertically or clone the whole thing | Scale individual services independently (CPU-heavy vs. I/O-heavy) |
| Data consistency | ACID transactions across all data | Each service has its own DB; cross-service requires Saga (2PC is possible but generally avoided due to coordination overhead and availability impact) |
| Operational overhead | One log stream, one APM service, one deploy pipeline | Per-service logging, tracing, alerting, deployment, runbooks |
| Developer onboarding | Clone one repo, run one service, debug one process | Learn service graph, network topology, local dev orchestration |
| Organizational fit | Small, unified team. Mono-repo friendly | Multiple teams with clear bounded contexts and on-call ownership |
| Domain modeling clarity | Boundaries are internal (modules, packages), easy to refactor | Boundaries are API contracts; changing them requires coordination and versioning |
| Testing complexity | One integration test suite | Per-service tests plus contract tests plus end-to-end across service boundaries |
| Time to first production deployment | Weeks for a minimal version | Months (infrastructure, service mesh, CI/CD pipelines, runbooks must exist first) |
| Right baseline team size | 2-15 engineers | 10+ engineers (at least one platform team of 3-4) |
The fundamental tension here is independent deployability vs. operational simplicity. Microservices buy you one; monoliths give you the other.
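To make the latency row concrete, here is the back-of-envelope arithmetic for a synchronous call chain; the per-hop figures are order-of-magnitude assumptions, not measurements:

```python
# Rough, assumed figures: an in-process call costs on the order of 100ns;
# an intra-datacenter RPC costs on the order of 1-10ms (3ms as a midpoint).
IN_PROCESS_CALL_S = 1e-7
NETWORK_HOP_S = 3e-3

def chain_latency(hops, per_hop):
    """Total added latency for `hops` sequential, blocking calls."""
    return hops * per_hop

# A checkout path that fans out through 6 synchronous services:
monolith = chain_latency(6, IN_PROCESS_CALL_S)
services = chain_latency(6, NETWORK_HOP_S)
print(f"monolith: {monolith * 1000:.4f} ms, microservices: {services * 1000:.1f} ms")
# -> monolith: 0.0006 ms, microservices: 18.0 ms
```

Eighteen milliseconds added to every request is the optimistic case: it assumes no retries, no queueing under load, and no tail-latency amplification across the chain.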
When the Monolith Wins
Here's the honest answer on when to stay with (or start with) a monolith.
The domain isn't fully understood yet. Service boundaries drawn on day one are almost always wrong.
Changing one in a monolith is an afternoon refactor; changing one across services means a migration, versioned APIs, dual-write periods, and coordinated team deploys. Get the domain right first, inside the monolith.
Your team is under 10 engineers. A microservices architecture requires a platform engineering function: someone to own the CI/CD pipelines, the service mesh, the distributed tracing, the secrets management.
If this is the same person who's also writing features, they will be perpetually behind. The overhead eats the productivity gain.
Deployment cadence is shared anyway. If all your services deploy together because they share database migrations or version-matched APIs, you don't have independent deployability. You have a distributed monolith with all the operational costs and none of the benefits.
Stay in one place until you can genuinely release independently.
You're debugging your second incident this week. A monolith is much easier to debug: one log stream, one stack trace, one process to restart.
Before adding distributed systems complexity, push the monolith to its ceiling: read replicas, caching, async jobs, better indexing.
I often see teams reach for microservices when what they actually need is a better database query or a Redis cache. The root bottleneck rarely requires distributed systems.
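A read-through cache is often the actual fix hiding behind a "we need microservices" symptom. Here is a minimal in-memory sketch; in production the store would typically be Redis, but the access pattern is identical:

```python
import time

class ReadThroughCache:
    """Tiny in-memory read-through cache with a TTL, as a sketch of the
    pattern. `loader` is whatever hits the database on a cache miss."""

    def __init__(self, loader, ttl=60.0):
        self.loader = loader
        self.ttl = ttl
        self.store = {}  # key -> (value, expires_at)

    def get(self, key):
        hit = self.store.get(key)
        if hit is not None and hit[1] > time.monotonic():
            return hit[0]  # cache hit: no database round trip
        value = self.loader(key)  # cache miss: load and remember
        self.store[key] = (value, time.monotonic() + self.ttl)
        return value
```

If the hot path is dominated by a handful of repeated reads, this removes the bottleneck for a fraction of the cost of a service extraction.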
The modular monolith is the underrated middle ground. Structure your monolith with strict module boundaries: each major domain (users, orders, payments) is a separate module with a published API surface, no direct cross-module database joins, and a separate schema. You get most of the architectural benefits without the operational cost.
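What "strict module boundaries" can look like in code, sketched with hypothetical module names: the users module exposes one published function, and the orders module consumes that instead of touching the users tables directly:

```python
# Hypothetical layout for a modular monolith (names are illustrative):
#
#   app/users/api.py        <- the ONLY file other modules may import
#   app/users/internal/...  <- models and queries: off-limits outside users/
#   app/orders/...          <- depends on users only via users.api

from dataclasses import dataclass

# --- users/api.py: the published surface of the users module ---
@dataclass(frozen=True)
class UserSummary:
    user_id: int
    email: str

def get_user_summary(user_id: int) -> UserSummary:
    """Public entry point; hides the users schema from other modules."""
    row = _load_user_row(user_id)  # internal query, never exported
    return UserSummary(row["id"], row["email"])

def _load_user_row(user_id):
    # Stand-in for a real query against the users schema.
    return {"id": user_id, "email": f"user{user_id}@example.com"}

# --- orders module: calls the API, never joins against users tables ---
def order_confirmation_email(user_id: int) -> str:
    return get_user_summary(user_id).email
```

Tools like import linters (or separate packages with explicit dependencies) can enforce that nothing outside `users/` imports from `users/internal/`, which is what keeps the boundary honest over time.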
Shopify ran this model profitably for over a decade.
When Microservices Win
So when does it actually make sense to split?
Different components have genuinely different scaling profiles. Your image-processing pipeline is CPU-bound and needs 16-core instances, your API tier is I/O-bound and runs on many small containers, and your reporting service is bursty once per day.
Forced co-location means over-provisioning everything to the worst case, or running identical instances where only 10% of the code is in active use.
Compliance requires physical isolation. PCI DSS scope reduction: if your payment processing is in a separate service, only that service's infrastructure needs to be in PCI scope. HIPAA data segmentation works the same way, and this is not a 'nice to have': it's a concrete regulatory benefit that saves months of audit work.
Teams are genuinely stepping on each other. If three teams are all making PRs to the same service and blocking each other's releases, that's a Conway's Law problem.
The cure is not just splitting the code: it's splitting ownership. Each service needs one team that fully owns it: product decisions, deployments, and the 3am on-call page. Without that assignment, you get a service without an owner and an incident without a responder.
In my experience, the team friction signal shows up before the scaling signal every time. By the time you're hitting CPU limits, you've usually already had the merge-conflict fight.
You need to experiment and ship at different rates. A/B testing, ML model updates, algorithm changes in a recommendation service should not be gated behind the review cycle of your checkout team. Independent services enable independent iteration rates.
For your interview: you need microservices when you have team-level autonomy requirements, not just technical scaling requirements. The organizational reason is almost always stronger than the performance reason.
The Nuance
The Distributed Monolith Anti-Pattern
This is the failure mode I see most often. A team splits their monolith into 10 services, but:
- Order Service directly queries User Service's database (shared schema, just a different process)
- Checkout Service makes 6 synchronous HTTP calls to 6 other services in sequence before returning a response
- All 10 services deploy from the same pipeline in lockstep because service-b depends on service-a's v2 API
You now have all the complexity of microservices (distributed failure modes, network overhead, tracing requirements) and none of the benefits (independent scaling, independent deployability). This is worse than the original monolith.
The 'distributed monolith' is worse than both alternatives
If any two of the above are true, you have all the complexity of distributed systems with zero of the autonomy. Fix boundaries before adding more services.
The distributed monolith is the worst outcome: you paid the migration cost and ended up slower and more fragile than before.
The Modular Monolith: The Option Nobody Talks About
A modular monolith sits between "spaghetti code" and "microservices". It's a monolith where domain boundaries are enforced at the code level:
- Each domain (users, orders, payments) is a separate module with a published interface
- Cross-module access happens only through that interface, never direct DB access
- Each module has its own schema namespace (separate tables, ideally separate DB credentials)
- Module interfaces look like service interfaces; you're essentially pre-practicing the API contract
When you're ready to extract a service, you almost don't have to change the interface: just deploy it separately and switch in-process calls for HTTP calls. The modular monolith is the architecturally honest intermediate step.
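That extraction story can be sketched as one interface with two interchangeable implementations; the names, URL shape, and injected HTTP client below are all illustrative:

```python
from typing import Protocol

class UserDirectory(Protocol):
    """The module interface; callers never know which transport backs it."""
    def get_email(self, user_id: int) -> str: ...

class InProcessUserDirectory:
    """Phase 1: the users module, called directly inside the monolith."""
    def get_email(self, user_id: int) -> str:
        return f"user{user_id}@example.com"  # stand-in for a local query

class HttpUserDirectory:
    """Phase 2: same contract, now a network call to the extracted service.
    (A real version also needs timeouts, retries, and error handling.)"""
    def __init__(self, base_url: str, http_get):
        self.base_url = base_url
        self.http_get = http_get  # injected client, e.g. requests.get

    def get_email(self, user_id: int) -> str:
        return self.http_get(f"{self.base_url}/users/{user_id}/email")

def send_receipt(directory: UserDirectory, user_id: int) -> str:
    # Caller code is unchanged whether the directory is local or remote.
    return f"receipt sent to {directory.get_email(user_id)}"
```

The switch from phase 1 to phase 2 is a one-line change at the composition root, which is exactly the property the modular monolith buys you.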
The 'premature extraction' cost is real
Martin Fowler coined "microservices premium": the operational overhead that only becomes worth paying once you have multiple product teams with genuine service ownership. He argued the benefit only materializes above that threshold. Most teams that migrate at 10 engineers are paying the premium years before they earn the return.
The microservices premium is real. The question is not whether to pay it, but whether your team is large enough and your domain clear enough to earn the return.
You Must Be This Tall to Ride
Microservices have organizational prerequisites. If these don't exist, your migration will fail or create the distributed monolith anti-pattern:
- Platform engineering function. Someone owns the CI/CD pipelines for all services, the container orchestration layer (Kubernetes), secrets management, and service discovery. This is a full-time job at 5+ services.
- Distributed tracing from day one. Without OpenTelemetry or Jaeger, a 400ms p95 response in production is impossible to diagnose when the request touched 6 services. You will be blind.
- Each service has one team on-call owner. A service without an on-call owner is a service that goes down and nobody fixes it at 3am.
- Contract testing between services. Without Pact or similar, a schema change in User Service silently breaks Order Service in production because integration tests didn't catch it.
- Feature flags and canary deployments. When a service has 100K requests/minute, you can't just deploy and hope. You need the ability to shift 5% of traffic to the new version and watch error rates before promoting.
Get these prerequisites in place before you cut the first service boundary, not after.
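To make the contract-testing prerequisite concrete, here is the core idea behind consumer-driven contracts, sketched without Pact; the field names and response shapes are hypothetical:

```python
# The consumer declares the fields it actually depends on; the provider's
# CI checks its real response shape against that contract before deploying.

ORDER_SERVICE_EXPECTS = {  # what Order Service reads from GET /users/{id}
    "id": int,
    "email": str,
    "status": str,
}

def satisfies_contract(response: dict, contract: dict) -> bool:
    """True if every field the consumer relies on is present with the
    expected type. Extra provider fields are fine: additions are safe,
    removals and renames are what break consumers."""
    return all(
        field in response and isinstance(response[field], expected_type)
        for field, expected_type in contract.items()
    )

# Provider side: the current response shape (illustrative)
current = {"id": 42, "email": "a@example.com", "status": "active", "plan": "pro"}
assert satisfies_contract(current, ORDER_SERVICE_EXPECTS)

# Dropping or renaming `status` fails in CI, not in production:
assert not satisfies_contract({"id": 42, "email": "a@example.com"}, ORDER_SERVICE_EXPECTS)
```

Real contract-testing tools add versioning, broker storage, and provider verification workflows, but the assertion at the center is this one.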
Service Boundaries: Where Migrations Actually Fail
Wrong service boundaries are worse than a monolith. Here's how to draw them right.
The theoretically correct approach comes from Domain-Driven Design: find bounded contexts, which are domains where a term means one consistent thing and has one consistent owner. "Order" in the commerce context and "Order" in the fulfillment context are different things: they share an ID but diverge in lifecycle, state machine, and data model, and that divergence is where the service boundary belongs.
In practice, the correct signals for splitting are:
- Different scaling characteristics. Image processing wants 32 CPUs. API handling wants 64 small containers. Forced co-location means you provision for the extreme case across all instances.
- Compliance isolation. Cardholder data (PCI) must be physically separated. Health records (HIPAA) must have audited access paths. Service boundaries here are mandated by law, not preference.
- Team ownership and commit frequency. If Team A ships Payment Service 5 times a day and Team B ships Notification Service once a week, combined deployment is a drag. Split at the team boundary.
- Genuine data isolation. If Service A never needs to join its data with Service B's data in the same query, they can have separate databases. The moment you need cross-service joins constantly, your boundary is wrong.
The wrong boundary is the most expensive mistake I see in real migrations. Once it's set in stone as a service contract, fixing it requires another multi-week extraction project.
Interview tip: the boundary question
When an interviewer asks how you'd structure services for a design like Uber or DoorDash, don't just list features. Say: "I'd separate the ride-matching engine from the driver-tracking service because their scaling profiles are completely different: matching is CPU-heavy and bursty, tracking is write-heavy and continuous. I'd keep user profiles and preferences in the same service initially because they're almost always accessed together." That reasoning shows you understand the why, not just the what.
The database boundary is the architectural truth test: if two services share schema migrations, they are not actually independent services yet.
Real-World Examples
| Company | Architecture | Key learning |
|---|---|---|
| Shopify | Single Rails monolith for 12+ years, serving 600K+ merchants, $120B+ GMV annually | Proved that a monolith with excellent database design, caching (Redis), and CDN can scale to enormous traffic without microservices. They eventually introduced "modular monolith" patterns internally to enforce domain separation. The constraint was team discipline, not technology. |
| Netflix | ~1,000 microservices for video streaming, recommendation, and billing | Netflix was one of the earliest at this scale. What they don't tell you: they also built Hystrix, Eureka, Ribbon, Zuul, and Chaos Monkey because the platform had to exist before the services ran reliably. Budget 18 months of platform investment before the first service is stable. |
| Amazon | Full microservices transformation in the early 2000s (the famous Bezos API mandate) | The Bezos mandate was not "use microservices." It was "no direct database access between teams, all communication via APIs." This was primarily about organizational independence and eliminating hidden dependencies. The architecture followed the org, not the other way around. |
| Stack Overflow | Single monolith (primarily one SQL Server) handling ~15M+ page views per day | As of 2023, Stack Overflow (one of the top 50 US websites) still runs primarily on a monolith with approximately 9 web servers. The site's performance gains came from SQL optimization, caching layers, and CDN, not architectural decomposition. |
| Segment (acquired by Twilio) | Migrated FROM microservices back to a monolith, documented in their 2018 "Goodbye Microservices" post, after hitting operational overhead | At 140+ engineers, Segment had built ~130 microservices. Maintaining them consumed operational runbooks, on-call rotations, and roughly 40% of engineering time on infrastructure; the overhead exceeded the benefit. They consolidated back to a controlled monolith + async event routing. |
Segment's 'goodbye microservices' is required reading
Segment published a detailed post-mortem on their return to a monolith. The key finding: their services were too small ("nano-services"), so every feature required changes to 5+ repos, 5+ test suites, and 5+ deployments. They had split along technical lines (one data ingestion step per service) rather than business domain lines. The lesson: service granularity matters as much as the decision to split.
The pattern I see across all five: architectures built for the wrong team size or the wrong domain clarity eventually pay for it on-call.
How This Shows Up in Interviews
When to Bring It Up
You should raise the monolith vs. microservices trade-off in your system design interview as soon as the topic naturally arises, usually after defining non-functional requirements. Don't wait to be asked.
"Before I sketch the architecture, I want to flag one key decision: should this be a monolith or microservices? Given the team is still small and the domain is early, I'd start with a modular monolith and extract services only where we see scaling pressure or team friction. Does that approach make sense, or do you want me to assume we're already at scale?"
That question signals architectural maturity, not just knowledge of the pattern. Most interviewers will say "assume we're already at scale." By asking, you've already demonstrated that the correct default is to question the assumption, not blindly accept microservices as the starting point.
I've used this framing in interviews and the response is nearly always the same: the interviewer nods, says "assume scale," and the question alone has already differentiated you from the candidate who just started drawing boxes.
Depth Expected at Senior/Staff Level
- Articulate the distributed monolith anti-pattern and how to detect it
- Know the Conway's Law implication: service boundaries follow team boundaries, not feature boundaries
- Know the platform engineering prerequisites (service discovery, distributed tracing, contract testing, secret management)
- Know the Saga pattern and why cross-service ACID transactions require it
- Know when a modular monolith is the right answer and how it differs from a poorly organized monolith
- Articulate the latency tax: estimate inter-service call overhead and how it compounds across call chains
- Give real examples of companies that stayed with a monolith at scale (Stack Overflow, Shopify); this shows you're not cargo-culting Netflix
Follow-up Q&A
| Interviewer asks | Strong answer |
|---|---|
| "How do microservices communicate?" | "Synchronous: HTTP/REST for request-response where the caller needs an immediate answer. gRPC for high-throughput internal calls that benefit from binary protocol and streaming. Async: Kafka or SQS for event-driven flows where the caller doesn't need an immediate response: orders, notifications, analytics. The default I'd use: gRPC internal, REST external (client-facing), Kafka for anything post-transaction." |
| "What happens if a downstream service is unavailable?" | "Without resilience patterns, a synchronous call failure cascades upward; the caller gets a 500, and if the caller is also called synchronously, the failure propagates. The fix: circuit breaker (short-circuit calls to a failing service after N failures), timeout (never wait indefinitely), and fallback (cached or degraded response). In practice: a circuit breaker + 500ms timeout prevents one failing service from taking down everything else." |
| "How do you handle data consistency across services?" | "You give up ACID. Cross-service operations use eventual consistency via the Saga pattern: each service completes its local transaction and publishes an event. If a downstream step fails, you execute compensating transactions (refund the payment, restore the inventory). The Outbox Pattern ensures events aren't lost even if the event broker is temporarily down." |
| "When would you merge two services back together?" | "When two services change together on every PR (they should be one service), when a synchronous call between them is on the hot path and network overhead is measurable (check your trace spans; if service-to-service latency is 20% of total response time, that boundary has a cost), or when no single team owns both services but both are required for every feature. Service merging is a valid operation." |
| "How do you migrate from a monolith to microservices?" | "Strangler Fig pattern: don't rewrite, strangle. Put an API Gateway in front of the monolith. Identify the first service to extract based on a clear boundary signal (compliance, scaling, team ownership). Build the new service alongside the monolith, route a percentage of traffic to it, verify, then cut over. Never do a big-bang rewrite. Every retrospective from Amazon to Segment concludes the same: full rewrites slip, drift from the original requirements, and usually ship half-broken or get cancelled." |
Test Your Understanding
Quick Recap
- A monolith is a single deployable unit where all application concerns share one process and typically one database. It is simpler to develop, test, and operate, but couples deployment and scaling across all components.
- Microservices decompose the application into independently deployable units. Each service owns its data and its deployment, but you take on the full cost of distributed systems: network failures, eventual consistency, distributed tracing, and service discovery.
- The distributed monolith is the failure mode to avoid: services that share databases or synchronous call chains of 5+ hops have all the complexity with none of the autonomy.
- The correct default for a new product or small team is a modular monolith: one deployable, enforced module boundaries, no cross-module database access, interfaces that look like future service APIs.
- The signal to extract a service: compliance isolation requirement, genuinely different scaling profile, or team ownership friction that module separation cannot fix.
- Before going microservices, you need: a platform team, distributed tracing, per-service CI/CD, circuit breakers, and an on-call assignment for each service. Without these prerequisites, you will create a distributed monolith.
- In interviews, naming Shopify's decade-long monolith success and Segment's microservices-to-monolith reversal shows you understand this is a real tradeoff, not a ladder from worse to better.
Related Trade-offs
- Sync vs async communication: The communication pattern between microservices shapes failure modes, consistency guarantees, and latency. Choosing between REST, gRPC, and Kafka is the first decision after choosing microservices.
- API gateway: The API Gateway is the entry point to a microservices cluster and handles cross-cutting concerns (auth, rate limiting, routing) that would otherwise be duplicated in every service.
- Saga pattern: When microservices need to coordinate state changes across multiple databases, the Saga pattern is the standard replacement for ACID transactions. Required knowledge if you're designing any transactional microservices flow.
- Service mesh: At 10+ services, a service mesh (Istio, Linkerd) automates the circuit breakers, retries, mutual TLS, and observability that you'd otherwise build per-service. The point at which it becomes worth the overhead.
- CQRS: When microservices diverge on read versus write access patterns, CQRS provides the pattern to separate these paths without coupling service deployments to shared query models.