GitHub: MySQL to Vitess migration

TL;DR

GitHub ran all of its core data (repositories, users, pull requests, issues) on a single MySQL primary with read replicas, hitting hard operational limits by 2018.
Failovers took 1-3 minutes with their orchestrator-based setup, causing elevated error rates across every GitHub feature during each promotion.
Vitess added a transparent proxy layer: VTGate handles connection multiplexing (100K+ app connections to ~2K MySQL connections), VTTablet manages sub-second failover via Raft consensus, and the topology service coordinates shard routing.
The migration ran over 2-3 years in four phases: shadow reads, writes through Vitess, failover validation, then table-by-table sharding.
Transferable lesson: a database proxy layer is often the right intermediate step between "we're running out of headroom" and "we need to shard everything."

By 2018, GitHub's MySQL infrastructure was showing cracks that tuning alone could not fix. The platform served tens of millions of developers, hosting over 100 million repositories, processing thousands of git operations per second. All of that metadata (repositories, users, organizations, pull requests, issues, commits) lived in a single MySQL primary with a fleet of read replicas.

The breaking point was not a single dramatic failure. It was the accumulation of three operational problems that got worse every quarter.

Failover time was measured in minutes, not seconds. GitHub used an orchestrator-based failover system to promote replicas when the primary went down. Each failover took 1-3 minutes. During that window, every write to the database failed.

Applications retried, queues backed up, and users saw error pages. For a platform where developers push code, open PRs, and merge changes continuously, even a 90-second write outage is painful. Multiply that by the frequency of planned maintenance failovers and unplanned hardware issues, and you get a steady background hum of degradation.

Connection limits created a hard ceiling. MySQL uses a thread-per-connection model. The practical maximum hovered around 1,500 concurrent connections before performance degraded sharply. GitHub's application tier had hundreds of servers, each opening multiple database connections. Connection pooling at the app layer helped, but it pushed complexity into every service that needed database access.

Schema changes on production were slow and risky. Altering a table with 200+ million rows required either a blocking DDL (which locks the table) or gh-ost, GitHub's own online schema migration tool. Even gh-ost took hours for large tables. During that window, the team held their breath, watching for replication lag or lock contention.

I've worked at companies where the "database is fine, we just need to optimize queries" conversation lasted two years too long. GitHub's team recognized the pattern early: each individual problem had a workaround, but the workarounds were compounding into operational debt that slowed down every team.

The total cost was not just engineering effort. It was developer velocity. When every deploy risks a connection spike, when every schema change needs a multi-day migration window, when every failover triggers a page, teams move slower. The database becomes the bottleneck for the entire organization, not just the infrastructure team.

The System Before

GitHub's pre-Vitess architecture was a well-understood MySQL primary-replica topology with orchestrator handling failover. The design served GitHub well for years, but it had a fundamental constraint: everything converged on a single write path.

What worked

The single-primary model is simple to reason about. There is one source of truth for writes, replication fans out to read replicas, and the application splits read/write traffic at the connection level. For GitHub's first decade, this was the right architecture.

Read replicas handled the majority of traffic. Most GitHub operations (viewing repos, browsing code, reading issues) are reads. The replicas absorbed that load effectively, and adding more replicas scaled read capacity linearly.

GitHub's gh-ost tool for online schema changes was itself a testament to how well the team understood MySQL internals. They had invested heavily in making the single-cluster model work. This was not a team that jumped to a new architecture out of ignorance.

Where it broke down

The write path had no horizontal scaling story. Every INSERT, UPDATE, and DELETE for repositories, pull requests, issues, and user data went through one MySQL process. When write throughput grew, the only option was a bigger machine (vertical scaling), and that has hard limits.

Orchestrator-based failover was built for rare events, not routine operations. Promoting a replica required detecting the failure, electing a new primary, reconfiguring replication topology, and updating the application's connection routing. Each step added seconds. In production, the total failover time ranged from 60 to 180 seconds.

Connection pooling lived in the wrong place. Each Rails instance managed its own connection pool to MySQL. There was no centralized multiplexing layer. If you had 400 app servers each holding 4 connections, that is 1,600 connections against a database that starts struggling above 1,500.

Why Not Just Scale MySQL Vertically?

The obvious first question: why not just buy a bigger MySQL server? GitHub did. They ran MySQL on increasingly powerful hardware. But vertical scaling hits three walls simultaneously.

Memory wall. MySQL's InnoDB buffer pool needs to hold the working set in RAM for acceptable read performance. As the dataset grew past what a single machine's RAM could cache, read latency increased and became unpredictable.

Connection wall. More application servers means more connections. Vertical scaling does not change MySQL's thread-per-connection model. A machine with 128 cores still struggles with 3,000 concurrent threads competing for locks. Each thread consumes memory for its stack, and mutex contention grows non-linearly with thread count.

Operational wall. A single massive MySQL instance is a single massive point of failure. Backups take longer. Restores take longer. Schema changes take longer. Every operational task scales with the data volume on that one machine.

The combination of all three walls hitting simultaneously is what makes vertical scaling a dead end at GitHub's scale. Any one wall alone might be solvable. All three together force an architectural change.

Why not just add ProxySQL?

ProxySQL (or similar connection poolers) solves the connection multiplexing problem but not the failover or sharding problems. GitHub needed all three. Vitess bundles connection pooling, automated failover, and transparent sharding into a single operational layer.

The team also considered building custom sharding into the Rails application. This would mean every query needs to know its shard key, every migration runs against N databases instead of one, and cross-shard queries require application-level JOINs. I've seen teams attempt application-level sharding and it works, but the operational complexity is enormous. It infects every layer of the codebase. GitHub chose to push that complexity into infrastructure instead.

The Decision

TL;DR

GitHub ran all of its core data (repositories, users, pull requests, issues) on a single MySQL primary with read replicas, hitting hard operational limits by 2018.
Failovers took 1-3 minutes with their orchestrator-based setup, causing elevated error rates across every GitHub feature during each promotion.
Vitess added a transparent proxy layer: VTGate handles connection multiplexing (100K+ app connections to ~2K MySQL connections), VTTablet manages sub-second failover via Raft consensus, and the topology service coordinates shard routing.
The migration ran over 2-3 years in four phases: shadow reads, writes through Vitess, failover validation, then table-by-table sharding.
Transferable lesson: a database proxy layer is often the right intermediate step between "we're running out of headroom" and "we need to shard everything."

GitHub: MySQL to Vitess migration

TL;DR

The Trigger

The System Before

What worked

Where it broke down

Why Not Just Scale MySQL Vertically?

The Decision

Continue Reading with Premium

Comments

GitHub: MySQL to Vitess migration

TL;DR

The Trigger

The System Before

What worked

Where it broke down

Why Not Just Scale MySQL Vertically?

The Decision

Continue Reading with Premium

Comments