Airbnb: Breaking up the monolith
How Airbnb migrated from a Ruby on Rails monolith to a service-oriented architecture, what it took to extract services safely, and the lessons from a decade-long process.
TL;DR
- Airbnb ran a single Ruby on Rails monolith (nicknamed "Monorail") for nearly a decade, growing to 100K+ lines of Ruby with 1,000+ engineers contributing to one codebase.
- By 2018, deploys took 45+ minutes, the test suite ran 30+ minutes, and a single bad commit could take down every product surface simultaneously.
- They rejected a "big bang" rewrite in favor of the strangler fig pattern: new features as services, gradual extraction of existing domains at natural seams.
- The hardest part was data extraction, not code extraction. The monolith's 400+ table database had foreign keys spanning every domain boundary.
- Transferable lesson: invest in your service platform (mesh, gateway, registry) before extracting services. Without it, you get a distributed monolith.
The Trigger
In 2017, Airbnb's deploy pipeline had become the company's most expensive bottleneck. A single change to Monorail, no matter how small, required a full deploy cycle of 45+ minutes. Engineers batched their changes into weekly "release trains" to avoid burning half their day waiting on deploys.
The math was brutal. With 1,000+ engineers and a 45-minute deploy window, a failed deploy didn't just block one team. It blocked every team waiting in the queue behind it. Rollbacks happened frequently enough that the deploy pipeline had its own incident taxonomy.
I've seen this exact scenario at three different companies. The monolith isn't "too big" in some abstract sense. It becomes too big when the deploy pipeline is the primary constraint on engineering velocity, and no amount of CI optimization can fix it because the problem is architectural.
The triggering incident was a cascade failure in late 2017 where a payments-related change caused a brief outage in the search and listing pages. Payments, search, and listings had no logical dependency, but they shared the same deploy artifact. One bad import path in a payments module triggered an exception at startup that took down the entire Monorail process. That incident crystallized what engineers already knew: the blast radius of every change was "everything."
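The "one bad import takes down everything" failure mode is easy to reproduce in miniature. Here is a toy Ruby sketch of why a load-time error in any one module has a blast radius of the whole process; the module names are invented, not Airbnb's actual code:

```ruby
# Toy illustration of shared-artifact blast radius: in a monolith, every
# module loads into a single process, so a load-time error in one module
# prevents all of them from serving traffic.
MODULES = {
  "payments" => -> { raise NameError, "uninitialized constant Payments::LegacyHelper" },
  "search"   => -> { :loaded },
  "listings" => -> { :loaded },
}

def boot(modules)
  modules.each_value(&:call) # one bad import aborts the entire boot
  :running
end

begin
  status = boot(MODULES)
rescue NameError
  status = :crashed # search and listings are down too, not just payments
end
```

In a service-per-domain architecture, the equivalent failure crashes only the payments deploy; search and listings keep serving.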
The System Before
Monorail was a textbook Rails monolith. Every domain (listings, bookings, payments, messaging, search, reviews, user profiles, pricing, calendar) lived in a single Ruby on Rails application backed by a single MySQL database cluster.
This architecture worked from 2008 to roughly 2015. A single codebase meant every engineer could grep the entire system. Debugging was straightforward because everything ran in one process. Database transactions were simple because every table lived in the same MySQL instance.
The cracks showed as scale increased:
| Metric | 2012 | 2018 |
|---|---|---|
| Engineers on Monorail | ~50 | 1,000+ |
| Lines of Ruby | ~20K | 100K+ |
| Deploy time | ~5 min | 45+ min |
| Test suite runtime | ~3 min | 30+ min |
| DB tables | ~80 | 400+ |
| Deploy failures/month | Rare | Weekly |
The database was particularly tangled. Foreign key constraints spanned every domain boundary. The bookings table had FKs pointing to listings, users, payments, and calendar_events. The reviews table referenced bookings, users, and listings. No table was an island.
Teams couldn't deploy independently, couldn't scale independently, and couldn't fail independently. A memory leak in the reviews module increased latency for search. A slow migration on the payments table locked deploys for everyone.
The monolith that launched a $100B company had become its primary engineering bottleneck.
Why Not Just Rewrite Everything?
The obvious answer ("rewrite it as microservices") was the first idea on the table and the first to be rejected.
Airbnb's engineering leadership evaluated a full decomposition and identified three blockers that made a big-bang approach unacceptable.
Distributed transactions are genuinely hard. Booking a listing in the monolith was a single database transaction: check availability, create booking, charge payment, send confirmation, update calendar. All atomic, all in one MySQL transaction. As separate services, each step is a separate database with its own transaction scope. You need saga patterns or two-phase commit, and both add latency and failure modes that didn't exist before.
Data ownership was undefined. The 400+ table database had no documented ownership. The users table was referenced by every module. The listings table contained pricing, availability, and property data that multiple domains needed. Before you can split a database, you need to know who owns each table. That knowledge didn't exist in documentation; it lived in tribal knowledge spread across dozens of teams.
Teams couldn't pause feature work. Airbnb was in a competitive market with Booking.com and Vrbo. Telling 1,000 engineers to freeze features for a year while the architecture team rewrote the backend was not a realistic option. The migration had to happen alongside ongoing product development.
The 'big bang rewrite' trap
Every company at this scale considers a full rewrite. Almost none succeed. The ones that try typically end up maintaining two systems in parallel for years, with the rewrite perpetually "90% done." Airbnb's leadership recognized this pattern early and explicitly rejected it.
The strategic decision: migrate incrementally using the strangler fig pattern. New features ship as services from day one. Existing functionality gets extracted domain by domain, at natural seams, when a team has the bandwidth and the boundary is clear.
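The routing side of the strangler fig can be sketched in a few lines. This is an illustrative toy, not Airbnb's gateway; the route table and service names are assumptions:

```ruby
# Strangler-fig edge routing: extracted domains are sent to their new
# services; everything else falls through to the monolith by default.
SERVICE_ROUTES = {
  %r{\A/reviews}  => "reviews-service",
  %r{\A/messages} => "messaging-service",
}.freeze

def route(path)
  SERVICE_ROUTES.each do |pattern, service|
    return service if path.match?(pattern)
  end
  "monorail" # unextracted domains stay on the monolith
end

route("/reviews/123") # => "reviews-service"
route("/bookings/9")  # => "monorail"
```

As extraction proceeds, routes migrate out of the default branch and into the table; the monolith shrinks without ever needing a cutover day.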
The Decision
Airbnb chose a three-pronged strategy executed over roughly three years (2018-2021):
- Platform first, services second. Build the service infrastructure (mesh, gateway, registry, observability) before any team extracts a service.
- Strangler fig extraction. New features launch as independent services. Existing domains get extracted one at a time, starting with the most cleanly bounded.
- Service owns its data. Every extracted service owns its own database. No shared tables, no cross-service foreign keys, no backdoor SQL queries.
The "platform first" decision was the most important and least intuitive. Engineering leadership pushed back on teams that wanted to immediately start extracting services. The reasoning: if 10 teams each build their own service discovery, health checking, retry logic, and circuit breaking, you haven't decomposed a monolith. You've created 10 tightly coupled services with 10 different approaches to the same cross-cutting concerns.
I've watched two companies skip this step and regret it within six months. One ended up with services that couldn't talk to each other reliably during partial outages because each team had implemented retries differently. The platform-first approach costs more upfront but pays off by the third or fourth service extraction.
The platform investment included:
- Envoy + Istio service mesh: Sidecar proxies handle all service-to-service communication. Retries with exponential backoff, circuit breaking, mutual TLS, and distributed tracing all come from the mesh, not from application code.
- Internal service registry: Services register themselves and their capabilities. Other services discover endpoints through the registry, not hardcoded URLs.
- API gateway: All external traffic enters through a unified gateway that handles routing, authentication, rate limiting, and request transformation.
- Shared client libraries: Standard libraries for service communication, so every service speaks the same protocol with the same error handling.
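To make "circuit breaking from the mesh, not from application code" concrete, here is a toy circuit breaker showing the behavior the sidecar provides for free. The threshold and error messages are invented; a real mesh policy also handles half-open probing and timeouts:

```ruby
# Minimal circuit breaker sketch: after enough consecutive failures the
# circuit opens, and further calls fail fast instead of hammering a sick
# dependency (which is what turns one slow service into a cascade).
class CircuitBreaker
  attr_reader :state

  def initialize(failure_threshold: 3)
    @failure_threshold = failure_threshold
    @failures = 0
    @state = :closed
  end

  def call
    raise "circuit open: failing fast" if @state == :open

    begin
      result = yield
      @failures = 0 # success resets the count
      result
    rescue
      @failures += 1
      @state = :open if @failures >= @failure_threshold
      raise
    end
  end
end

cb = CircuitBreaker.new(failure_threshold: 2)
2.times { cb.call { raise "upstream timeout" } rescue nil }
cb.state # => :open
```

The point of the platform-first decision is that no team writes this class: Envoy enforces the equivalent policy uniformly for every service.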
Interview tip: platform before services
When discussing monolith decomposition in interviews, mention the service platform first. Say: "Before extracting any service, I'd invest in a service mesh and API gateway so every extracted service gets observability, retries, and circuit breaking for free." This shows you understand that the hard part of microservices is operations, not code splitting.
The Migration Path
Airbnb's extraction followed the strangler fig pattern in three phases. Each phase had clear entry criteria, rollback plans, and success metrics.
Phase 1: Platform foundation
Before any service existed, the infrastructure team spent roughly 6-9 months building the service platform. This felt slow to product teams eager to extract their domains, but it prevented the "distributed monolith" antipattern.
The key deliverables: Envoy sidecar proxies running alongside every process, an Istio control plane managing traffic policies, and a service registry where new services could announce themselves. The mesh gave every service automatic retries, circuit breaking, mutual TLS, and distributed tracing without writing a single line of application code.
Phase 2: New features as services
Once the platform existed, the policy changed: all new product features launched as independent services, not as additions to Monorail. This was the strangler fig in action. The monolith stopped growing.
New services like improved search ranking and enhanced pricing algorithms were built outside Monorail from day one. They registered with the service mesh, got their own databases, and communicated with Monorail through well-defined APIs. This validated the platform under real production traffic before any existing domain was extracted.
Phase 3: Domain extraction
This was the hard phase, repeated for each domain. Airbnb extracted domains in order of boundary clarity, starting with the most cleanly separable.
Reviews went first. Reviews had relatively few inbound dependencies (reviews reference bookings, users, and listings, but no other domain writes back into reviews). The data model was self-contained enough to extract without complex saga patterns.
Reviews went first. Reviews had relatively few inbound dependencies (reviews reference bookings, users, and listings, but no other domain writes back into reviews). The data model was self-contained enough to extract without complex saga patterns.
Messaging followed. Host-guest messaging had clear API boundaries and minimal shared state with other domains.
Payments came later, after the team had practice. Payments was the most complex extraction because it touched bookings, listings, user billing, and host payouts. It required saga patterns for multi-step financial operations.
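A saga replaces the single ACID transaction with a sequence of local transactions plus compensations that undo completed steps when a later one fails. A minimal sketch of the idea, with invented step names standing in for the real payment operations:

```ruby
# Saga sketch: each step pairs an action with a compensation. On failure,
# completed steps are compensated in reverse order, because there is no
# longer a database rollback that spans services.
class Saga
  Step = Struct.new(:name, :action, :compensation)

  def initialize
    @steps = []
  end

  def step(name, action:, compensation:)
    @steps << Step.new(name, action, compensation)
    self
  end

  def run
    completed = []
    @steps.each do |s|
      begin
        s.action.call
        completed << s
      rescue => e
        completed.reverse_each { |done| done.compensation.call }
        return { status: :rolled_back, failed_step: s.name, error: e.message }
      end
    end
    { status: :committed }
  end
end

log = []
result = Saga.new
  .step("reserve_dates", action: -> { log << "reserved" },
                         compensation: -> { log << "unreserved" })
  .step("charge_guest",  action: -> { log << "charged" },
                         compensation: -> { log << "refunded" })
  .step("pay_host",      action: -> { raise "payout service unavailable" },
                         compensation: -> {})
  .run

result[:status] # => :rolled_back
log             # => ["reserved", "charged", "refunded", "unreserved"]
```

Note what the monolith gave you for free: in one MySQL transaction, "refund" and "unreserve" were a single implicit `ROLLBACK`. In the saga, every compensation is code you must write, test, and keep idempotent.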
The data extraction problem
Code extraction is the easy part. Data extraction is where teams get stuck, and Airbnb was no exception. The monolith's MySQL database had 400+ tables with foreign keys crossing every domain boundary.
Here's what the dependency graph looked like for a single domain:
Monorail MySQL: domain entanglement
listings --[FK]--> users
bookings --[FK]--> listings, users, payments, calendar_events
reviews --[FK]--> bookings, users, listings
payments --[FK]--> bookings, users
messages --[FK]--> bookings, users
For each domain extraction, the team followed a four-step data migration:
- Remove foreign key constraints. Replace DB-enforced referential integrity with application-level validation. This is the scariest step because you lose the safety net the database provided for free.
- Replicate referenced data. Copy the data the new service needs into its own database. For example, the Reviews service needs user display names and listing titles, so those get replicated (not moved) to the Reviews DB.
- Cut over reads and writes. The new service starts reading from its own database. Monorail stops writing to the extracted tables. A dual-write period validates consistency.
- Delete original tables. Once the new service is stable and the dual-write period shows zero discrepancies, drop the original tables from the monolith database.
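Step 3's dual-write period is what makes step 4 safe, so it's worth sketching. This is a toy with hypothetical in-memory stores standing in for the two databases:

```ruby
# Dual-write sketch: writes go to both the legacy store and the new
# service's store; reads still serve from legacy while cross-checking the
# new copy. Step 4 (dropping the old tables) waits until the discrepancy
# log stays empty.
class DualWriter
  attr_reader :discrepancies

  def initialize(legacy_store, service_store)
    @legacy = legacy_store
    @service = service_store
    @discrepancies = []
  end

  def write(id, record)
    @legacy[id] = record
    @service[id] = record
  end

  def read(id)
    record = @legacy[id]
    @discrepancies << id unless @service[id] == record # log drift, serve legacy
    record
  end
end

legacy = {}
service = {}
dw = DualWriter.new(legacy, service)
dw.write(1, { rating: 5 })
service[1] = { rating: 4 } # simulate replication drift
dw.read(1)                 # serves the legacy copy, records the mismatch
dw.discrepancies           # => [1]
```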
Step 1 is the point of no return
Removing foreign key constraints is irreversible in practice. Once application code assumes FKs don't exist, re-adding them requires auditing every write path for consistency. Plan this step carefully and add application-level consistency checks before removing the constraint, not after.
Step 1 alone took weeks per domain. The team had to audit every query that relied on the FK constraint (joins, cascading deletes, ON DELETE SET NULL behaviors) and replace them with explicit application logic. I've seen teams underestimate this step by 3-4x consistently. If you're planning a similar migration, double your estimate for data extraction and then add buffer.
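What "replace DB-enforced referential integrity with application-level validation" looks like in practice, sketched with invented class and client names (a production version also has to handle the race between the existence check and the insert):

```ruby
# Before extraction, a foreign key guaranteed reviews.booking_id pointed at
# a real booking. After extraction, the Reviews service must ask the owning
# service before writing.
class OrphanedReferenceError < StandardError; end

class Review
  def initialize(booking_id:, bookings_client:)
    @booking_id = booking_id
    @bookings = bookings_client
  end

  def save!
    unless @bookings.exists?(@booking_id)
      raise OrphanedReferenceError, "booking #{@booking_id} not found"
    end
    :saved # a real implementation would insert into the Reviews DB here
  end
end

# Stand-in for an RPC client to the Bookings service.
BookingsClient = Struct.new(:known_ids) do
  def exists?(id)
    known_ids.include?(id)
  end
end

client = BookingsClient.new([42])
Review.new(booking_id: 42, bookings_client: client).save! # => :saved
```

This is also the code that makes step 1 irreversible in practice: once write paths rely on checks like this instead of the constraint, every one of them must be audited before an FK could ever come back.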
For your interview: the phrase "data extraction is harder than code extraction" immediately signals to an interviewer that you understand the real complexity of monolith decomposition. Lead with it.
The System After
After three years of incremental extraction, Airbnb's architecture looked fundamentally different. Monorail still existed but handled a shrinking set of domains. The majority of traffic flowed through independent services communicating via the service mesh.
The key architectural properties of the new system:
- Independent deploys. The Reviews team deploys without affecting Payments. Deploy times dropped from 45 minutes to under 5 minutes per service.
- Independent scaling. Search handles 10x more read traffic than Messaging. Each service scales its compute and database independently.
- Blast radius containment. A bug in the Pricing service doesn't crash the Messaging service. Circuit breakers in the mesh prevent cascading failures.
- Service-owned data. Every service owns its tables. Cross-service data access happens through APIs or event consumption, never through shared database queries.
The team topology changed alongside the architecture. Each domain service was owned by a dedicated team (6-10 engineers) with full ownership of their service, database, deploy pipeline, and on-call rotation. Conway's Law in action: the org structure mirrored the service boundaries.
The Results
| Metric | Before (Monorail) | After (SOA) |
|---|---|---|
| Deploy time | 45+ minutes | < 5 minutes per service |
| Test suite runtime | 30+ minutes (full) | 2-5 minutes per service |
| Deploy frequency | Weekly release trains | Multiple deploys per day per team |
| Blast radius | Entire platform | Single service |
| Engineer onboarding | Weeks (understand full monolith) | Days (understand one service) |
| Incident scope | Cross-domain cascades common | Isolated to service boundary |
| DB schema changes | Blocked all teams | Independent per service |
The deploy velocity improvement was the most visible win. Teams went from weekly coordinated releases to shipping multiple times per day. A payments engineer could push a billing fix without waiting for the search team to finish their migration.
The less visible but equally important win was incident isolation. Before the migration, the on-call rotation for Monorail required understanding every domain. After, on-call engineers only needed deep knowledge of their own service. Mean time to resolution (MTTR) dropped because the person debugging the issue was the person who wrote the code.
The cost was real, though. The engineering team invested roughly 18 months in platform and early extraction work before seeing broad productivity improvements. The total migration stretched over three years with dedicated infrastructure engineers. For a smaller company, this investment would be disproportionate to the benefit.
What They'd Do Differently
Based on public talks and retrospectives from Airbnb engineers:
Start the service registry earlier. The registry was one of the last platform components built, but it turned out to be one of the most critical. Without it, early services used hardcoded URLs that broke during deployments. Teams that started extraction before the registry was ready had to retrofit service discovery later.
Define data ownership before writing code. The hardest part of every extraction was answering "who owns this table?" Airbnb eventually created a formal data ownership catalog, but it came 6+ months into the migration. Starting with that catalog would have accelerated every extraction.
Invest more in contract testing. Integration tests between services were initially written as end-to-end tests that were slow and flaky. Contract testing (where each service verifies the API contract independently) would have caught interface mismatches earlier without the overhead of full integration environments.
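The core idea of contract testing fits in a few lines: the consumer records the response shape it depends on, and the provider verifies that shape independently, with no shared integration environment. A hedged sketch with an invented contract format:

```ruby
# Contract-testing sketch: the consumer (e.g., a search page) declares the
# keys it needs from the Reviews API; the provider checks its own responses
# against that declaration in its own test suite.
CONTRACT = {
  endpoint: "/reviews/:id",
  response_keys: %i[id rating body author_id],
}.freeze

# Provider-side check: does a real response satisfy the consumer's contract?
def honors_contract?(response, contract)
  contract[:response_keys].all? { |key| response.key?(key) }
end

full_response = { id: 1, rating: 5, body: "Great stay", author_id: 7 }
honors_contract?(full_response, CONTRACT)        # => true
honors_contract?({ id: 1, rating: 5 }, CONTRACT) # => false, contract broken
```

A failing check here surfaces an interface mismatch at unit-test speed, where the end-to-end tests Airbnb started with would only surface it after deploying both services to a shared environment.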
Honest assessment: the migration took longer than projected and required more infrastructure investment than initially estimated. But the alternative (continuing with Monorail) would have constrained engineering velocity indefinitely. The question was never "should we migrate" but "when and how fast."
Architecture Decision Guide
Use the questions below when deciding whether your own system needs a monolith breakup.
The first question is always "is the monolith actually causing problems?" If deploys are fast, teams aren't stepping on each other, and failures are contained, you don't need microservices. Airbnb's monolith served them well for a decade. Premature decomposition creates more problems than it solves.
Transferable Lessons
1. Platform before services, always.
Airbnb invested 6-9 months in service mesh, API gateway, and service registry before extracting a single domain. This feels slow, but it meant every extracted service got observability, circuit breaking, and service discovery for free. Without the platform, each team would have built these capabilities independently, creating inconsistent implementations that fail differently under load. The general rule: if you're planning to extract more than 2-3 services, the platform investment pays for itself.
2. Data extraction is the real migration, not code extraction.
Moving Ruby code from Monorail into a new service took weeks. Untangling 400+ tables with cross-domain foreign keys took months per domain. The database is where coupling actually lives. In any monolith breakup, estimate the data migration at 3-4x the code migration effort and you'll be closer to reality.
3. Extract at the seams, not at the center.
Airbnb started with Reviews (few inbound dependencies) and Messaging (clear API boundary), not with Bookings (touches every other domain). The order of extraction matters enormously. Start with the domain that has the fewest inbound foreign keys and the clearest API surface. Save the most entangled domains for last, after your team has practice and tooling.
4. The monolith isn't the enemy until it is.
Monorail was the right architecture for a startup figuring out product-market fit. It became the wrong architecture when 1,000+ engineers needed to ship independently. The trigger for decomposition should be measurable pain (deploy times, incident blast radius, team coordination overhead), not architectural ideology.
5. Org structure follows architecture (Conway's Law in practice).
Each extracted service at Airbnb got a dedicated team with full ownership. The team owned the code, the database, the deploy pipeline, and the on-call rotation. Extracting a service without changing the team structure just creates a service that still requires cross-team coordination, defeating the purpose.
How This Shows Up in Interviews
When an interviewer asks "would you build this as a monolith or microservices?", Airbnb is a perfect case study to cite because it illustrates both sides of the tradeoff.
The key sentence: "Start with a monolith to learn your domain boundaries, then extract services when the monolith's coordination cost exceeds the distribution tax."
| Interviewer asks | Strong answer citing this case study |
|---|---|
| "Monolith or microservices?" | "Monolith first, like Airbnb did for 10 years. Extract when deploy times and blast radius become the bottleneck, not before." |
| "How would you migrate?" | "Strangler fig pattern: new features as services, existing domains extracted at seams. Platform (mesh + gateway) built before any extraction." |
| "What's the hardest part?" | "Data extraction, not code. Airbnb's 400+ tables with cross-domain FKs took months per domain. I'd start by mapping table ownership." |
| "How do you avoid a distributed monolith?" | "Invest in a service platform first. Mesh gives you retries, circuit breaking, and tracing. Without it, each team reinvents those wheels differently." |
| "When is the monolith 'too big'?" | "When deploy time exceeds 15-20 minutes, when unrelated changes cause cross-domain failures, and when 50+ engineers are on one codebase." |
The money quote
Memorize this: "Decomposition without platform investment creates a distributed monolith." It's a one-liner that demonstrates you understand why most microservice migrations fail.
Quick Recap
- Airbnb's Monorail served them well for 10 years but hit team-scale limits: 45-minute deploys, 30-minute test suites, and weekly cross-domain incidents with 1,000+ contributors.
- They rejected a big-bang rewrite because distributed transactions, undefined data ownership, and ongoing feature pressure made it impractical.
- The strangler fig pattern let them migrate incrementally: new features as services, existing domains extracted at natural seams (Reviews, then Messaging, then Payments).
- Data extraction was the hardest part. Removing foreign key constraints from 400+ tables, replicating data to service-local databases, and replacing DB-enforced consistency with application-level checks took months per domain.
- They invested 6-9 months in a service platform (Envoy + Istio mesh, API gateway, service registry) before extracting any domain, which prevented the distributed monolith antipattern.
- The result: deploy times from 45+ minutes to under 5, weekly release trains to multiple deploys per day, and incident blast radius contained to individual services.
- The transferable lesson: decompose at domain boundaries, invest in platform infrastructure first, and start extraction with the cleanest seams.
Related Concepts
- Monolith vs. microservices - The core tradeoff Airbnb navigated: when the coordination cost of a monolith exceeds the distribution tax of services.
- Strangler fig pattern - The migration pattern Airbnb used to incrementally replace Monorail without a big-bang rewrite.
- Premature microservices - Why Airbnb was right to keep the monolith for a decade before decomposing, and what goes wrong when you split too early.