Airbnb: Breaking up the monolith
How Airbnb migrated from a Ruby on Rails monolith to a service-oriented architecture, what it took to extract services safely, and the lessons from a multi-year migration.
TL;DR
- Airbnb ran a single Ruby on Rails monolith (nicknamed "Monorail") for nearly a decade, growing to 100K+ lines of Ruby with 1,000+ engineers contributing to one codebase.
- By 2018, deploys took 45+ minutes, the test suite ran 30+ minutes, and a single bad commit could take down every product surface simultaneously.
- They rejected a "big bang" rewrite in favor of the strangler fig pattern: new features as services, gradual extraction of existing domains at natural seams.
- The hardest part was data extraction, not code extraction. The monolith's 400+ table database had foreign keys spanning every domain boundary.
- Transferable lesson: invest in your service platform (mesh, gateway, registry) before extracting services. Without it, you get a distributed monolith.
The Trigger
In 2017, Airbnb's deploy pipeline had become the company's most expensive bottleneck. A single change to Monorail, no matter how small, required a full deploy cycle of 45+ minutes. Engineers batched their changes into weekly "release trains" to avoid burning half their day waiting on deploys.
The math was brutal. With 1,000+ engineers and a 45-minute deploy window, a failed deploy didn't just block one team. It blocked every team waiting in the queue behind it. Rollbacks happened frequently enough that the deploy pipeline had its own incident taxonomy.
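A back-of-the-envelope sketch of that queueing effect, assuming deploys run strictly one at a time through a single pipeline and that a failed deploy costs one extra 45-minute cycle for the rollback. The 8-hour window, the queue position, and the function names are illustrative assumptions, not figures from Airbnb.

```python
import math


def max_deploys_per_day(deploy_minutes: float, window_hours: float) -> int:
    """Upper bound on back-to-back deploys through one serialized pipeline."""
    return math.floor(window_hours * 60 / deploy_minutes)


def queue_delay_minutes(position: int, deploy_minutes: float, failed_ahead: int = 0) -> float:
    """Minutes until the deploy at `position` (0-based) can start.

    Assumption: each failed deploy ahead in the queue burns one extra
    full cycle on the rollback before the queue moves again.
    """
    return (position + failed_ahead) * deploy_minutes


# 45-minute deploys in an assumed 8-hour working window.
slots = max_deploys_per_day(45, 8)

# A team fourth in line (position 3), with and without one failure ahead of it.
wait_clean = queue_delay_minutes(3, 45)
wait_after_failure = queue_delay_minutes(3, 45, failed_ahead=1)

print(slots, wait_clean, wait_after_failure)
```

Under these assumptions the pipeline offers roughly ten deploy slots per working day for the whole engineering organization, and one failure ahead of you adds another 45 minutes to your wait.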
I've seen this exact scenario at three different companies. The monolith isn't "too big" in some abstract sense. It becomes too big when the deploy pipeline is the primary constraint on engineering velocity, and no amount of CI optimization can fix it because the problem is architectural.
The triggering incident was a cascade failure in late 2017 where a payments-related change caused a brief outage in the search and listing pages. Payments, search, and listings had no logical dependency, but they shared the same deploy artifact. One bad import path in a payments module triggered an exception at startup that took down the entire Monorail process. That incident crystallized what engineers already knew: the blast radius of every change was "everything."
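To make the blast-radius point concrete, here is a toy model of the failure mode, not Airbnb's actual boot code. It assumes every domain has a startup initializer; `boot_monolith`, `boot_services`, and the domain names are hypothetical. In the shared artifact one startup exception aborts everything, while separate deployables contain the failure.

```python
from typing import Callable, Dict

Initializer = Callable[[], None]


def boot_monolith(initializers: Dict[str, Initializer]) -> Dict[str, bool]:
    """One process loads every domain: any startup exception aborts the boot."""
    try:
        for init in initializers.values():
            init()
    except Exception:
        # The process never comes up, so no domain is serving traffic.
        return {name: False for name in initializers}
    return {name: True for name in initializers}


def boot_services(initializers: Dict[str, Initializer]) -> Dict[str, bool]:
    """Each domain boots on its own (simulated): a failure stays local."""
    status = {}
    for name, init in initializers.items():
        try:
            init()
            status[name] = True
        except Exception:
            status[name] = False
    return status


def ok() -> None:
    pass


def broken_payments() -> None:
    raise ImportError("bad import path in payments module")


domains = {"search": ok, "listings": ok, "payments": broken_payments}
monolith_status = boot_monolith(domains)
service_status = boot_services(domains)

print(monolith_status)
print(service_status)
```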
The System Before
Monorail was a textbook Rails monolith. Every domain (listings, bookings, payments, messaging, search, reviews, user profiles, pricing, calendar) lived in a single Ruby on Rails application backed by a single MySQL database cluster.
This architecture worked from 2008 to roughly 2015. A single codebase meant every engineer could grep the entire system. Debugging was straightforward because everything ran in one process. Database transactions were simple because every table lived in the same MySQL instance.
The cracks showed as scale increased:
| Metric | 2012 | 2018 |
|---|---|---|
| Engineers on Monorail | ~50 | 1,000+ |
| Lines of Ruby | ~20K | 100K+ |
| Deploy time | ~5 min | 45+ min |
| Test suite runtime | ~3 min | 30+ min |
| DB tables | ~80 | 400+ |
| Deploy failures/month | Rare | Weekly |
The database was particularly tangled. Foreign key constraints spanned every domain boundary. The bookings table had FKs pointing to listings, users, payments, and calendar_events. The reviews table referenced bookings, users, and listings. No table was an island.
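One way to quantify this kind of tangle is to list every foreign key whose two ends belong to different domains. The sketch below uses only the relationships named above. The ownership map is an assumption for illustration (one owning domain per table), which is exactly the information the real schema lacked.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical ownership map: table name -> owning domain.
OWNER: Dict[str, str] = {
    "bookings": "bookings",
    "listings": "listings",
    "users": "users",
    "payments": "payments",
    "calendar_events": "calendar",
    "reviews": "reviews",
}

# (referencing table, referenced table) pairs described in the text.
FOREIGN_KEYS: List[Tuple[str, str]] = [
    ("bookings", "listings"),
    ("bookings", "users"),
    ("bookings", "payments"),
    ("bookings", "calendar_events"),
    ("reviews", "bookings"),
    ("reviews", "users"),
    ("reviews", "listings"),
]


def cross_domain_fks(fks: List[Tuple[str, str]], owner: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return the foreign keys whose two tables have different owners."""
    return [(child, parent) for child, parent in fks if owner[child] != owner[parent]]


crossing = cross_domain_fks(FOREIGN_KEYS, OWNER)

# Domains referenced most often from outside are the hardest to extract first.
inbound = Counter(OWNER[parent] for _, parent in crossing)

print(len(crossing), "of", len(FOREIGN_KEYS), "foreign keys cross a domain boundary")
print(inbound.most_common())
```

In this small sample every foreign key crosses a boundary, which matches the claim that no table was an island. Against a real MySQL schema the pairs could be read from the database's metadata rather than typed by hand.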
Teams couldn't deploy independently, couldn't scale independently, and couldn't fail independently. A memory leak in the reviews module increased latency for search. A slow migration on the payments table locked deploys for everyone.
The monolith that launched a $100B company had become its primary engineering bottleneck.
Why Not Just Rewrite Everything?
The obvious answer ("rewrite it as microservices") was the first idea on the table and the first to be rejected.
Airbnb's engineering leadership evaluated a full decomposition and identified three blockers that made a big-bang approach unacceptable.
Distributed transactions are genuinely hard. Booking a listing in the monolith was a single database transaction: check availability, create booking, charge payment, send confirmation, update calendar. All atomic, all in one MySQL transaction. Split across services, each step runs against its own database with its own transaction scope, so no single commit covers the whole booking. You need saga patterns or two-phase commit, and both add latency and failure modes that didn't exist before.
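A minimal in-process sketch of the saga idea, assuming each step has a compensating action that undoes it. The step names, the `run_saga` helper, and the in-memory `state` dict are hypothetical stand-ins for calls to separate services. This is not how Airbnb implemented bookings.

```python
from typing import Callable, Dict, List, Tuple

# (step name, action, compensating action)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    log: List[str] = []
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
            log.append(f"done:{name}")
            completed.append((name, compensate))
        except Exception:
            log.append(f"failed:{name}")
            # No shared transaction to roll back, so compensate explicitly.
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"undone:{done_name}")
            break
    return log


# In-memory stand-in for state that would live in separate service databases.
state: Dict[str, bool] = {"calendar_held": False, "booking_created": False}


def hold_calendar() -> None:
    state["calendar_held"] = True


def release_calendar() -> None:
    state["calendar_held"] = False


def create_booking() -> None:
    state["booking_created"] = True


def cancel_booking() -> None:
    state["booking_created"] = False


def charge_payment() -> None:
    raise RuntimeError("card declined")


def refund_payment() -> None:
    pass


log = run_saga([
    ("hold_calendar", hold_calendar, release_calendar),
    ("create_booking", create_booking, cancel_booking),
    ("charge_payment", charge_payment, refund_payment),
])
print(log)
```

The failure modes mentioned above show up even in this toy version: a compensation can itself fail, and between a step and its compensation other readers can observe the intermediate state, which a single MySQL transaction would have hidden.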
Data ownership was undefined. The 400+ table database had no documented ownership. The users table was referenced by every module. The listings table contained pricing, availability, and property data that multiple domains needed. Before you can split a database, you need to know who owns each table. That knowledge didn't exist in documentation; it lived in tribal knowledge spread across dozens of teams.
Teams couldn't pause feature work. Airbnb was in a competitive market with Booking.com and Vrbo. Telling 1,000 engineers to freeze features for a year while the architecture team rewrote the backend was not a realistic option. The migration had to happen alongside ongoing product development.
The 'big bang rewrite' trap
Every company at this scale considers a full rewrite. Almost none succeed. The ones that try typically end up maintaining two systems in parallel for years, with the rewrite perpetually "90% done." Airbnb's leadership recognized this pattern early and explicitly rejected it.
The strategic decision: migrate incrementally using the strangler fig pattern. New features ship as services from day one. Existing functionality gets extracted domain by domain, at natural seams, when a team has the bandwidth and the boundary is clear.
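The strangler fig pattern usually rests on a routing layer in front of the monolith: extracted paths go to the new service, everything else falls through to the old system. Here is a minimal sketch with handlers as plain functions. The `StranglerRouter` class, path prefixes, and service names are illustrative assumptions, not Airbnb's gateway.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


class StranglerRouter:
    """Routes extracted path prefixes to services; the monolith is the default."""

    def __init__(self, monolith: Handler) -> None:
        self._monolith = monolith
        self._extracted: Dict[str, Handler] = {}

    def extract(self, prefix: str, service: Handler) -> None:
        """Register a service as the new owner of a path prefix."""
        self._extracted[prefix] = service

    def handle(self, path: str) -> str:
        # Longest prefix first, so a nested extraction overrides a broader one.
        for prefix in sorted(self._extracted, key=len, reverse=True):
            if path == prefix or path.startswith(prefix + "/"):
                return self._extracted[prefix](path)
        return self._monolith(path)


router = StranglerRouter(lambda path: f"monorail:{path}")

# Extract one domain; everything else keeps hitting the monolith.
router.extract("/payments", lambda path: f"payments-service:{path}")

extracted_response = router.handle("/payments/charge")
legacy_response = router.handle("/listings/42")

print(extracted_response)
print(legacy_response)
```

Because routing is per prefix, each extraction is a small, reversible change: removing the registration sends traffic back to the monolith.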
The Decision
Airbnb chose a three-pronged strategy executed over roughly three years (2018-2021):