Airbnb: Breaking up the monolith

TL;DR

Airbnb ran a single Ruby on Rails monolith (nicknamed "Monorail") for nearly a decade, growing to 100K+ lines of Ruby with 1,000+ engineers contributing to one codebase.
By 2018, deploys took 45+ minutes, the test suite ran 30+ minutes, and a single bad commit could take down every product surface simultaneously.
They rejected a "big bang" rewrite in favor of the strangler fig pattern: new features as services, gradual extraction of existing domains at natural seams.
The hardest part was data extraction, not code extraction. The monolith's 400+ table database had foreign keys spanning every domain boundary.
Transferable lesson: invest in your service platform (mesh, gateway, registry) before extracting services. Without it, you get a distributed monolith.

In 2017, Airbnb's deploy pipeline had become the company's most expensive bottleneck. A single change to Monorail, no matter how small, required a full deploy cycle of 45+ minutes. Engineers batched their changes into weekly "release trains" to avoid burning half their day waiting on deploys.

The math was brutal. With 1,000+ engineers and a 45-minute deploy window, a failed deploy didn't just block one team. It blocked every team waiting in the queue behind it. Rollbacks happened frequently enough that the deploy pipeline had its own incident taxonomy.

I've seen this exact scenario at three different companies. The monolith isn't "too big" in some abstract sense. It becomes too big when the deploy pipeline is the primary constraint on engineering velocity, and no amount of CI optimization can fix it because the problem is architectural.

The triggering incident was a cascade failure in late 2017 where a payments-related change caused a brief outage in the search and listing pages. Payments, search, and listings had no logical dependency, but they shared the same deploy artifact. One bad import path in a payments module triggered an exception at startup that took down the entire Monorail process. That incident crystallized what engineers already knew: the blast radius of every change was "everything."

The System Before

Monorail was a textbook Rails monolith. Every domain (listings, bookings, payments, messaging, search, reviews, user profiles, pricing, calendar) lived in a single Ruby on Rails application backed by a single MySQL database cluster.

This architecture worked from 2008 to roughly 2015. A single codebase meant every engineer could grep the entire system. Debugging was straightforward because everything ran in one process. Database transactions were simple because every table lived in the same MySQL instance.

The cracks showed as scale increased:

Metric	2012	2018
Engineers on Monorail	~50	1,000+
Lines of Ruby	~20K	100K+
Deploy time	~5 min	45+ min
Test suite runtime	~3 min	30+ min
DB tables	~80	400+
Deploy failures/month	Rare	Weekly

The database was particularly tangled. Foreign key constraints spanned every domain boundary. The bookings table had FKs pointing to listings, users, payments, and calendar_events. The reviews table referenced bookings, users, and listings. No table was an island.

Teams couldn't deploy independently, couldn't scale independently, and couldn't fail independently. A memory leak in the reviews module increased latency for search. A slow migration on the payments table locked deploys for everyone.

The monolith that launched a $100B company had become its primary engineering bottleneck.

Why Not Just Rewrite Everything?

The obvious answer ("rewrite it as microservices") was the first idea on the table and the first to be rejected.

Airbnb's engineering leadership evaluated a full decomposition and identified three blockers that made a big-bang approach unacceptable.

Distributed transactions are genuinely hard. Booking a listing in the monolith was a single database transaction: check availability, create booking, charge payment, send confirmation, update calendar. All atomic, all in one MySQL transaction. As separate services, each step is a separate database with its own transaction scope. You need saga patterns or two-phase commit, and both add latency and failure modes that didn't exist before.

Data ownership was undefined. The 400+ table database had no documented ownership. The users table was referenced by every module. The listings table contained pricing, availability, and property data that multiple domains needed. Before you can split a database, you need to know who owns each table. That knowledge didn't exist in documentation; it lived in tribal knowledge spread across dozens of teams.

Teams couldn't pause feature work. Airbnb was in a competitive market with Booking.com and Vrbo. Telling 1,000 engineers to freeze features for a year while the architecture team rewrote the backend was not a realistic option. The migration had to happen alongside ongoing product development.

The 'big bang rewrite' trap

Every company at this scale considers a full rewrite. Almost none succeed. The ones that try typically end up maintaining two systems in parallel for years, with the rewrite perpetually "90% done." Airbnb's leadership recognized this pattern early and explicitly rejected it.

The strategic decision: migrate incrementally using the strangler fig pattern. New features ship as services from day one. Existing functionality gets extracted domain by domain, at natural seams, when a team has the bandwidth and the boundary is clear.

The Decision

Airbnb chose a three-pronged strategy executed over roughly three years (2018-2021):

TL;DR

Airbnb ran a single Ruby on Rails monolith (nicknamed "Monorail") for nearly a decade, growing to 100K+ lines of Ruby with 1,000+ engineers contributing to one codebase.
By 2018, deploys took 45+ minutes, the test suite ran 30+ minutes, and a single bad commit could take down every product surface simultaneously.
They rejected a "big bang" rewrite in favor of the strangler fig pattern: new features as services, gradual extraction of existing domains at natural seams.
The hardest part was data extraction, not code extraction. The monolith's 400+ table database had foreign keys spanning every domain boundary.
Transferable lesson: invest in your service platform (mesh, gateway, registry) before extracting services. Without it, you get a distributed monolith.

The Trigger

The System Before

The cracks showed as scale increased:

Metric	2012	2018
Engineers on Monorail	~50	1,000+
Lines of Ruby	~20K	100K+
Deploy time	~5 min	45+ min
Test suite runtime	~3 min	30+ min
DB tables	~80	400+
Deploy failures/month	Rare	Weekly

The monolith that launched a $100B company had become its primary engineering bottleneck.

Why Not Just Rewrite Everything?

The obvious answer ("rewrite it as microservices") was the first idea on the table and the first to be rejected.

Airbnb's engineering leadership evaluated a full decomposition and identified three blockers that made a big-bang approach unacceptable.

The 'big bang rewrite' trap

The Decision

Airbnb chose a three-pronged strategy executed over roughly three years (2018-2021):

Airbnb: Breaking up the monolith

TL;DR

The Trigger

The System Before

Why Not Just Rewrite Everything?

The Decision

Continue Reading with Premium

Comments

Airbnb: Breaking up the monolith

TL;DR

The Trigger

The System Before

Why Not Just Rewrite Everything?

The Decision

Continue Reading with Premium

Comments