GitHub: MySQL to Vitess migration
How GitHub moved from a single-primary MySQL cluster to Vitess, enabling horizontal sharding at scale while maintaining MySQL compatibility for application code.
TL;DR
- GitHub ran all of its core data (repositories, users, pull requests, issues) on a single MySQL primary with read replicas, hitting hard operational limits by 2018.
- Failovers took 1-3 minutes with their orchestrator-based setup, causing elevated error rates across every GitHub feature during each promotion.
- Vitess added a transparent proxy layer: VTGate handles connection multiplexing (100K+ app connections to ~2K MySQL connections), VTTablet manages each MySQL instance and coordinates sub-second failover, and a consensus-backed topology service coordinates shard routing.
- The migration ran over 2-3 years in four phases: shadow reads, writes through Vitess, failover validation, then table-by-table sharding.
- Transferable lesson: a database proxy layer is often the right intermediate step between "we're running out of headroom" and "we need to shard everything."
The Trigger
By 2018, GitHub's MySQL infrastructure was showing cracks that tuning alone could not fix. The platform served tens of millions of developers, hosting over 100 million repositories, processing thousands of git operations per second. All of that metadata (repositories, users, organizations, pull requests, issues, commits) lived in a single MySQL primary with a fleet of read replicas.
The breaking point was not a single dramatic failure. It was the accumulation of three operational problems that got worse every quarter.
Failover time was measured in minutes, not seconds. GitHub used an orchestrator-based failover system to promote replicas when the primary went down. Each failover took 1-3 minutes. During that window, every write to the database failed.
Applications retried, queues backed up, and users saw error pages. For a platform where developers push code, open PRs, and merge changes continuously, even a 90-second write outage is painful. Multiply that by the frequency of planned maintenance failovers and unplanned hardware issues, and you get a steady background hum of degradation.
Connection limits created a hard ceiling. MySQL uses a thread-per-connection model. The practical maximum hovered around 1,500 concurrent connections before performance degraded sharply. GitHub's application tier had hundreds of servers, each opening multiple database connections. Connection pooling at the app layer helped, but it pushed complexity into every service that needed database access.
Schema changes on production were slow and risky. Altering a table with 200+ million rows required either a blocking DDL (which locks the table) or gh-ost, GitHub's own online schema migration tool. Even gh-ost took hours for large tables. During that window, the team held their breath, watching for replication lag or lock contention.
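The online schema change pattern that tools like gh-ost implement can be summarized as: build a "ghost" copy of the table with the new schema, backfill it in chunks, replay the writes that arrived during the backfill (gh-ost reads these from the binlog), then atomically swap the tables. A minimal illustrative simulation of that flow, with hypothetical names and in-memory tables standing in for MySQL:

```python
# Illustrative sketch of the online schema change pattern used by tools
# like gh-ost: backfill a ghost table with the new schema, replay writes
# that arrived during the backfill, then cut over. All names and data
# structures here are hypothetical; this is not gh-ost's implementation.

def online_alter(table, add_column, default, pending_writes):
    # 1. Backfill: copy existing rows into the ghost table, adding the
    #    new column with its default value.
    ghost = [{**row, add_column: default} for row in table]

    # 2. Replay: apply writes that happened during the backfill window
    #    (gh-ost tails the binlog to capture these).
    for write in pending_writes:
        ghost.append({**write, add_column: default})

    # 3. Cutover: atomically replace the original (in MySQL, RENAME TABLE).
    return ghost

original = [{"id": 1, "title": "fix bug"}, {"id": 2, "title": "add test"}]
arrived_during_copy = [{"id": 3, "title": "update docs"}]
migrated = online_alter(original, "merged", False, arrived_during_copy)
print(len(migrated))          # 3: no rows lost across the migration
print(migrated[0]["merged"])  # False: new column backfilled with default
```

The key property is that the original table stays fully readable and writable throughout; only the final rename briefly locks.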
I've worked at companies where the "database is fine, we just need to optimize queries" conversation lasted two years too long. GitHub's team recognized the pattern early: each individual problem had a workaround, but the workarounds were compounding into operational debt that slowed down every team.
The total cost was not just engineering effort. It was developer velocity. When every deploy risks a connection spike, when every schema change needs a multi-day migration window, when every failover triggers a page, teams move slower. The database becomes the bottleneck for the entire organization, not just the infrastructure team.
The System Before
GitHub's pre-Vitess architecture was a well-understood MySQL primary-replica topology with orchestrator handling failover. The design served GitHub well for years, but it had a fundamental constraint: everything converged on a single write path.
What worked
The single-primary model is simple to reason about. There is one source of truth for writes, replication fans out to read replicas, and the application splits read/write traffic at the connection level. For GitHub's first decade, this was the right architecture.
Read replicas handled the majority of traffic. Most GitHub operations (viewing repos, browsing code, reading issues) are reads. The replicas absorbed that load effectively, and adding more replicas scaled read capacity linearly.
GitHub's gh-ost tool for online schema changes was itself a testament to how well the team understood MySQL internals. They had invested heavily in making the single-cluster model work. This was not a team that jumped to a new architecture out of ignorance.
Where it broke down
The write path had no horizontal scaling story. Every INSERT, UPDATE, and DELETE for repositories, pull requests, issues, and user data went through one MySQL process. When write throughput grew, the only option was a bigger machine (vertical scaling), and that has hard limits.
Orchestrator-based failover was built for rare events, not routine operations. Promoting a replica required detecting the failure, electing a new primary, reconfiguring replication topology, and updating the application's connection routing. Each step added seconds. In production, the total failover time ranged from 60 to 180 seconds.
Connection pooling lived in the wrong place. Each Rails instance managed its own connection pool to MySQL. There was no centralized multiplexing layer. If you had 400 app servers each holding 4 connections, that is 1,600 connections against a database that starts struggling above 1,500.
Why Not Just Scale MySQL Vertically?
The obvious first question: why not just buy a bigger MySQL server? GitHub did. They ran MySQL on increasingly powerful hardware. But vertical scaling hits three walls simultaneously.
Memory wall. MySQL's InnoDB buffer pool needs to hold the working set in RAM for acceptable read performance. As the dataset grew past what a single machine's RAM could cache, read latency increased and became unpredictable.
Connection wall. More application servers means more connections. Vertical scaling does not change MySQL's thread-per-connection model. A machine with 128 cores still struggles with 3,000 concurrent threads competing for locks. Each thread consumes memory for its stack, and mutex contention grows non-linearly with thread count.
Operational wall. A single massive MySQL instance is a single massive point of failure. Backups take longer. Restores take longer. Schema changes take longer. Every operational task scales with the data volume on that one machine.
The combination of all three walls hitting simultaneously is what makes vertical scaling a dead end at GitHub's scale. Any one wall alone might be solvable. All three together force an architectural change.
Why not just add ProxySQL?
ProxySQL (or similar connection poolers) solves the connection multiplexing problem but not the failover or sharding problems. GitHub needed all three. Vitess bundles connection pooling, automated failover, and transparent sharding into a single operational layer.
The team also considered building custom sharding into the Rails application. This would mean every query needs to know its shard key, every migration runs against N databases instead of one, and cross-shard queries require application-level JOINs. I've seen teams attempt application-level sharding and it works, but the operational complexity is enormous. It infects every layer of the codebase. GitHub chose to push that complexity into infrastructure instead.
The Decision
GitHub chose Vitess, a database clustering system originally built at YouTube to solve the same class of problems at Google-scale MySQL deployments. The decision came down to three properties that matched GitHub's constraints.
MySQL protocol compatibility. Vitess speaks the MySQL wire protocol. Applications connect to VTGate exactly as they would connect to MySQL. Prepared statements, transactions, and connection lifecycle all work the same way. This meant GitHub's massive Rails codebase did not need a rewrite.
For a team maintaining millions of lines of Ruby on Rails code with deeply embedded MySQL assumptions, protocol compatibility was not a nice-to-have. It was the gating requirement. Any solution that required changing query syntax or ORM configuration across thousands of files was off the table.
Operational primitives built in. Vitess did not just solve one problem. It solved failover (VTTablet-coordinated primary promotion), connection pooling (VTGate multiplexing), and sharding (topology-aware query routing) as a unified system. Adopting three separate tools (ProxySQL + orchestrator replacement + custom sharding) would have tripled the integration surface.
Proven at scale. YouTube had already run Vitess in production for years, handling traffic volumes comparable to GitHub's. This was not a research project or a startup's v1 product. The failure modes were documented and understood.
The alternative candidates (CockroachDB, TiDB, Spanner) would have required moving off MySQL entirely. For a Rails application with a decade of MySQL-specific queries, stored procedures, and operational tooling, that was a non-starter. Vitess offered the rarest combination in infrastructure: meaningful improvement with minimal application disruption.
Vitess is not free
Vitess adds operational complexity of its own. You now run VTGate proxies, VTTablet sidecars next to every MySQL instance, a topology service (etcd or ZooKeeper), and a schema management layer. The trade is worth it at GitHub's scale, but for a database handling 500 queries/second, Vitess is overkill. Know your threshold.
The Migration Path
GitHub's migration from bare MySQL to Vitess ran over approximately 2-3 years. The team did not flip a switch. They built confidence incrementally, with each phase validating the next.
Phase 1: Shadow reads
The team started by routing a percentage of read queries through Vitess while simultaneously sending them directly to MySQL. Both result sets were compared. Any discrepancy (whether in row count, column values, or query execution errors) was logged and investigated.
This phase caught subtle compatibility issues. Certain MySQL-specific query patterns, edge cases in prepared statement handling, and timeout behaviors differed slightly through VTGate. The team fixed these before moving any production traffic.
Shadow reads are the safest possible starting point. If Vitess returns a wrong result, the application still uses the direct MySQL result. Zero user impact.
The shadow read phase also provided an unexpected benefit: it gave the team detailed performance profiling of every query through VTGate. They identified slow queries, connection lifecycle issues, and prepared statement edge cases before any production traffic depended on Vitess.
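The Phase 1 check amounts to running each query down both paths, serving the trusted result, and diffing the two result sets. A minimal sketch of that comparator, with query execution stubbed out and all names hypothetical:

```python
# Sketch of a shadow-read comparator: send the query to both the direct
# MySQL path and the Vitess path, always serve the trusted result, and
# log any mismatch for investigation. Names are hypothetical.

def shadow_read(query, run_on_mysql, run_on_vitess, log):
    primary_rows = run_on_mysql(query)        # trusted result, served to user
    try:
        shadow_rows = run_on_vitess(query)    # candidate result, compared only
        if shadow_rows != primary_rows:
            log(f"mismatch for {query!r}: "
                f"{len(primary_rows)} vs {len(shadow_rows)} rows")
    except Exception as exc:                  # Vitess errors never reach users
        log(f"shadow error for {query!r}: {exc}")
    return primary_rows

mismatches = []
rows = shadow_read(
    "SELECT id FROM repos",
    run_on_mysql=lambda q: [{"id": 1}, {"id": 2}],
    run_on_vitess=lambda q: [{"id": 1}],      # simulated discrepancy
    log=mismatches.append,
)
print(rows)              # the direct MySQL result is what the caller sees
print(len(mismatches))   # 1: the row-count discrepancy was logged
```

Because the shadow path can only log, a wrong Vitess result or an outright VTGate error has zero user impact, which is what makes this the safe first phase.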
Phase 2: Writes through Vitess
Once reads were clean, the team routed all writes through VTGate. At this stage, Vitess was purely a proxy. The same single MySQL primary sat behind it. No sharding, no topology changes.
The purpose was to validate write latency, transaction isolation, and connection behavior under production load. VTGate's connection multiplexing kicked in here: hundreds of thousands of application-tier connections were funneled into roughly 2,000 actual MySQL connections.
I've done this exact proxy-insertion pattern at a previous company (with PgBouncer, not Vitess), and the connection multiplexing alone justified the migration. Going from "we're at 90% of our connection limit" to "we have headroom for 50x growth" changes the operational conversation entirely.
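The reason multiplexing works is that application connections are idle most of the time: the proxy leases a backend connection only for the duration of one query, so a small backend pool can serve a much larger client population. A toy sketch of the mechanism (hypothetical names; real proxies like VTGate or PgBouncer also handle transactions, prepared statements, and timeouts):

```python
# Sketch of proxy-side connection multiplexing: many client connections
# share a small pool of backend connections, each leased only for the
# duration of a single query.
from contextlib import contextmanager
from queue import Queue

class BackendPool:
    def __init__(self, size):
        self._idle = Queue()
        for conn_id in range(size):
            self._idle.put(conn_id)   # stand-ins for real MySQL connections

    @contextmanager
    def lease(self):
        conn = self._idle.get()       # blocks if every backend is busy
        try:
            yield conn
        finally:
            self._idle.put(conn)      # returned as soon as the query ends

pool = BackendPool(size=2)            # cf. ~2K backend conns in production

# Ten "client connections" each run a query, yet only two backend
# connections are ever opened.
used = set()
for client in range(10):
    with pool.lease() as conn:
        used.add(conn)
print(sorted(used))  # [0, 1]: all ten clients served by two backends
```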
Phase 3: Failover validation
With all traffic flowing through Vitess, the team triggered planned primary failovers to test VTTablet-coordinated promotion. In the old orchestrator model, a failover meant 1-3 minutes of write failures. With VTTablet coordinating the promotion, failover completed in under one second.
The team ran these failovers repeatedly, during business hours, under real production load. Each iteration built confidence. They monitored error rates, replication lag on new replicas, and application retry behavior.
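A failover drill of this kind is usually measured from the application's side: issue writes in a tight loop while the promotion happens, and record the window during which writes fail. A sketch of that measurement loop, with the outage simulated rather than triggered against a real primary (all names hypothetical):

```python
# Sketch of a failover drill: write in a loop while a promotion happens
# and measure the observed write-outage window. The unavailable primary
# is simulated here; in a real drill the writes hit production.
import time

def run_drill(write, duration_s=1.0, interval_s=0.01):
    window_start = window_end = None
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        now = time.monotonic()
        try:
            write()                   # one write attempt per iteration
        except ConnectionError:
            if window_start is None:
                window_start = now    # first failed write
            window_end = now          # most recent failed write
        time.sleep(interval_s)
    return 0.0 if window_start is None else window_end - window_start

# Simulate a primary that is unavailable for the first 0.2 s of the drill.
start = time.monotonic()
def write():
    if time.monotonic() - start < 0.2:
        raise ConnectionError("no primary available")

outage = run_drill(write)
print(round(outage, 2))  # ~0.2: observed write-outage window in seconds
```

Running this continuously during a planned failover is what turns "failover completed" from a control-plane claim into a user-visible measurement.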
Test failovers in production
If you cannot trigger a failover during business hours without fear, your failover mechanism is not production-ready. GitHub's team made planned failovers routine. By the time an unplanned failover happened, the system had already proven it could handle it.
Phase 4: Table-by-table sharding
Not every table needed sharding. Many of GitHub's tables fit comfortably on a single MySQL instance. But the highest-scale tables (repositories, commits, pull requests) were candidates for horizontal splitting.
For each table, the team chose a shard key (typically owner_id or repository_id), defined a VSchema (Vitess's sharding configuration), and split the data across multiple MySQL instances. Vitess handled the routing transparently. Application queries that targeted a single shard key worked without changes.
Cross-shard queries were the main casualty. JOINs across sharded tables do not work through Vitess. The team refactored those queries to either resolve at the application layer or restructure the data access pattern. This was the most labor-intensive part of the migration.
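Mechanically, Vitess routes a single-shard query by hashing the shard key to a keyspace ID, then finding the shard whose key range covers it. The sketch below uses Vitess's range notation for shard names, but substitutes SHA-256 for the keyed hash that Vitess's `hash` vindex actually uses; everything else is hypothetical:

```python
# Sketch of range-based shard routing in the Vitess style: hash the
# shard key to a keyspace ID, then find the shard whose range covers it.
# SHA-256 is a stand-in for Vitess's real hash vindex; shard names
# follow Vitess's keyspace-range notation.
import hashlib

SHARDS = {            # (range_start, range_end) over the first hash byte
    "-80": (0x00, 0x80),
    "80-": (0x80, 0x100),
}

def keyspace_id(shard_key: int) -> int:
    digest = hashlib.sha256(str(shard_key).encode()).digest()
    return digest[0]  # one byte is enough for a two-shard example

def route(shard_key: int) -> str:
    ksid = keyspace_id(shard_key)
    for shard, (lo, hi) in SHARDS.items():
        if lo <= ksid < hi:
            return shard
    raise RuntimeError("shard map does not cover the keyspace")

# Every repository_id deterministically routes to exactly one shard,
# and the key space spreads across both shards.
routes = {route(repo_id) for repo_id in range(1000)}
print(sorted(routes))  # ['-80', '80-']
```

A query that filters on the shard key touches exactly one shard and needs no application changes; a JOIN whose tables hash to different shards is precisely what this routing model cannot serve.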
The System After
The post-Vitess architecture kept MySQL as the storage engine but added a proxy layer that solved connection pooling, failover, and sharding in a unified system.
How the components work together
VTGate is the entry point. Every application connection lands here. VTGate parses the incoming MySQL query, consults the topology service to determine which shard owns the data, and routes the query to the correct VTTablet. It also handles connection multiplexing: those 100,000+ application connections are funneled into roughly 2,000 actual MySQL connections across all shards.
VTTablet runs as a sidecar next to each MySQL instance. It manages replication, handles health checks, and coordinates primary failover through the consensus-backed topology service. When a primary fails, a new primary is promoted from among the replicas in under one second. No external orchestrator needed.
Topology service (backed by etcd) stores the shard map: which keyspace ranges live on which MySQL instances, which instance is currently primary, and the health status of every tablet. VTGate consults this on startup and watches for changes.
What didn't change
The key insight: MySQL itself is still the storage engine. Applications still write SQL. Prepared statements still work. Transactions within a single shard still have full ACID guarantees. The Vitess layer is transparent for the vast majority of queries.
Application code changes were minimal. The Rails app switched its database connection string from pointing at MySQL directly to pointing at VTGate. Most queries worked without modification. The exceptions were cross-shard JOINs and certain MySQL-specific syntax that VTGate does not support.
This backward compatibility was the single most important property of the migration. GitHub's Rails monolith contained years of MySQL-specific queries. A migration that required rewriting those queries would have taken a decade, not three years.
The Results
| Metric | Before (Orchestrator + MySQL) | After (Vitess) |
|---|---|---|
| Primary failover time | 1-3 minutes | < 1 second |
| Max application connections | ~1,500 (MySQL limit) | 100,000+ (VTGate multiplexes) |
| Actual MySQL connections | ~1,500 | ~2,000 across all shards |
| Write scaling | Vertical only (single primary) | Horizontal (add shards) |
| Schema change duration (large table) | Hours (single 200M+ row table) | Minutes (smaller per-shard tables) |
| Schema change risk | High (single massive database) | Lower (per-shard, isolated blast radius) |
| Application code changes | N/A | Minimal (connection string + cross-shard JOINs) |
| Migration duration | N/A | ~2-3 years (phased rollout) |
The failover improvement alone justified the migration. Going from minutes of elevated error rates to sub-second recovery changed failover from a "wake up the on-call engineer" event to a routine, barely noticeable operation. The on-call burden dropped significantly because failovers no longer required human intervention.
Connection multiplexing removed what had been a hard scaling ceiling. The application tier could add servers freely without worrying about exhausting MySQL's connection limit. This unblocked horizontal scaling of the app tier independently from the database tier.
Schema changes became less terrifying. Altering a sharded table meant running the DDL against smaller per-shard MySQL instances instead of one massive database. The gh-ost migrations that previously took hours on a 200-million-row table now ran against shards with a fraction of that data. What was once a multi-hour, high-stress operation became routine.
Horizontal write scaling unlocked a growth path that vertical scaling could never provide. When a shard approaches capacity, the team splits it into two. No downtime, no application changes, no limits beyond the number of MySQL instances you are willing to operate.
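A shard split reuses the same range model: the shard's keyspace range is cut in half and rows are reassigned by keyspace ID. In production this is an online copy-then-cutover, not an in-place move; the sketch below (range notation from Vitess, everything else hypothetical) shows only the partitioning logic:

```python
# Sketch of splitting one shard's keyspace range in half, the operation
# Vitess performs during resharding. Rows are reassigned by keyspace ID;
# the real process streams data to the new shards before cutting over.

def split_range(lo: int, hi: int):
    mid = (lo + hi) // 2
    return (lo, mid), (mid, hi)

def split_shard(rows_by_ksid, lo=0x00, hi=0x100):
    (lo1, hi1), (lo2, hi2) = split_range(lo, hi)
    left = {k: v for k, v in rows_by_ksid.items() if lo1 <= k < hi1}
    right = {k: v for k, v in rows_by_ksid.items() if lo2 <= k < hi2}
    return left, right

# Rows keyed by keyspace ID on the full-range shard "-" before the split.
shard = {0x10: "repo-a", 0x7f: "repo-b", 0x80: "repo-c", 0xf0: "repo-d"}
left, right = split_shard(shard)
print(sorted(left.values()))   # ['repo-a', 'repo-b'] land in shard -80
print(sorted(right.values()))  # ['repo-c', 'repo-d'] land in shard 80-
```

Because routing already goes through the keyspace-ID ranges, the application never learns that one shard became two; only the topology map changes.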
Vitess is now GitHub's default
By 2022, Vitess was not just a migration target for the largest tables. It became the standard way GitHub runs MySQL. New services connect through VTGate by default. The team no longer maintains the old orchestrator-based failover infrastructure.
What They'd Do Differently
GitHub engineers have spoken publicly about lessons from the migration. A few recurring themes stand out.
Start the proxy layer earlier. The connection pooling and failover benefits of Vitess are valuable even without sharding. Running Vitess as a pure proxy (no sharding) for a year before starting the shard migration would have de-risked the project further. The proxy-only phase gave GitHub instant wins (connection pooling, fast failover) with zero sharding complexity.
In retrospect, the team wished they had started this phase sooner. The benefits were immediate and low-risk, while the sharding work took years.
Invest in cross-shard query patterns up front. The cross-shard JOIN limitation was the highest-friction part of the migration. Identifying all cross-shard query patterns early and building application-layer alternatives before sharding would have saved rework. Several teams discovered their queries broke after sharding was already live.
Build better observability into the proxy layer from day one. Debugging query latency through VTGate, VTTablet, and MySQL requires tracing across three layers. The team built this tooling incrementally, but having it from the start would have shortened debugging cycles significantly.
Document the cross-shard query workarounds as a cookbook. Several teams independently solved similar cross-shard query problems in different ways. A centralized cookbook of patterns (denormalization, application-layer joins, materialized views) would have prevented duplicated effort.
I've noticed this pattern in every large migration: the technical work is 60% of the effort, and the coordination across teams is the other 40%. GitHub's phased approach helped, but the cross-team communication around shard boundaries was still the hardest part.
Architecture Decision Guide
Use this decision guide when evaluating whether to add a database proxy layer (Vitess, ProxySQL, PgBouncer, or similar) to your MySQL or Postgres infrastructure:
- Are connections above ~70% of the database's practical limit? A proxy's connection multiplexing buys immediate headroom.
- Do failovers take minutes and require human intervention? A proxy layer that owns connection routing makes promotion fast enough to be routine.
- Is write throughput outgrowing a single primary? Add the proxy first, then shard only the tables that need it.
Transferable Lessons
1. The database scaling ladder has predictable rungs.
GitHub's journey followed a pattern that repeats across the industry: read replicas, then connection pooling, then query optimization, then caching, then a proxy layer, then sharding. Each step buys time for the next. Skipping rungs (jumping straight to sharding) creates unnecessary complexity. When you are designing a system, know which rung you are on and what the next one looks like.
2. Proxy layers decouple application scaling from database scaling.
Before Vitess, adding more app servers pushed GitHub closer to MySQL's connection limit. After Vitess, the app tier and database tier scale independently. This decoupling is the core value of a database proxy, even if you never shard. I recommend inserting a proxy layer the moment your connection count exceeds 70% of your database's practical limit.
3. Protocol compatibility is the migration enabler.
Vitess speaks MySQL wire protocol. This single property meant GitHub could migrate without rewriting their Rails application. When evaluating any infrastructure migration, the first question should be: "Can we swap the underlying system without changing the interface?" If yes, the migration is an operational project. If no, it is a rewrite.
4. Failover speed is a design requirement, not an operational afterthought.
Most teams treat failover as "something the DBA handles." GitHub's experience shows that failover time directly impacts user experience and engineering velocity. Sub-second failover means you can run planned failovers during business hours without fear. That changes everything about how confidently you operate your database.
5. Cross-shard queries are the tax on transparent sharding.
Vitess makes most queries shard-unaware, but JOINs across shards are the exception. Every team that adopts transparent sharding eventually hits this constraint. Design your data model with shard boundaries in mind from the start, even if you are not sharding yet.
The practical rule: if two tables are frequently JOINed together, they should share the same shard key. If that is impossible, one of them should not be sharded, or the JOIN should be replaced by an application-level lookup.
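When the tables cannot share a shard key, the cross-shard JOIN becomes a fetch-then-lookup: query one shard, collect the foreign keys, and resolve them with one batched query against the other keyspace. A sketch with stubbed-out queries and hypothetical table contents:

```python
# Sketch of replacing a cross-shard JOIN with an application-level
# lookup: fetch rows from one keyspace, collect the foreign keys, then
# resolve them in a single batched query against the other keyspace.
# Table contents and helper names are hypothetical.

def fetch_pull_requests(repo_id):
    # Single-shard query: pull_requests is sharded by repository_id.
    return [{"id": 1, "repo_id": repo_id, "author_id": 7},
            {"id": 2, "repo_id": repo_id, "author_id": 9}]

def fetch_users(user_ids):
    # One batched query (WHERE id IN (...)) against the users keyspace.
    users = {7: {"id": 7, "login": "alice"}, 9: {"id": 9, "login": "bob"}}
    return {uid: users[uid] for uid in user_ids}

def pull_requests_with_authors(repo_id):
    prs = fetch_pull_requests(repo_id)
    authors = fetch_users({pr["author_id"] for pr in prs})  # the "join"
    return [{**pr, "author": authors[pr["author_id"]]["login"]}
            for pr in prs]

result = pull_requests_with_authors(repo_id=42)
print([pr["author"] for pr in result])  # ['alice', 'bob']
```

The batched second query keeps this at two round trips regardless of row count, which is usually an acceptable price for keeping both tables on their natural shard keys.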
How This Shows Up in Interviews
When an interviewer asks you to design a system that stores relational data at scale (URL shortener, social media, e-commerce), the database scaling strategy is a critical part of your answer. Mentioning a proxy layer like Vitess as an intermediate step (after read replicas, before full sharding) demonstrates real-world operational awareness.
The sentence to drop: "Before jumping to sharding, I'd add a database proxy layer like Vitess for connection multiplexing and sub-second failover, which buys significant headroom without application code changes."
| Interviewer asks | Strong answer citing this case study |
|---|---|
| "How would you scale the database layer?" | "Follow the scaling ladder: read replicas first, then a proxy layer for connection pooling and failover (like GitHub did with Vitess), then shard only the tables that need it." |
| "What happens when your primary database fails over?" | "With a proxy layer like Vitess, failover completes in under 1 second via coordinated promotion. Without it, orchestrator-based failover takes 1-3 minutes, like GitHub experienced before Vitess." |
| "How do you handle 100K+ database connections?" | "Connection multiplexing at the proxy layer. VTGate funnels 100K app connections into ~2K actual MySQL connections. The application tier and database tier scale independently." |
| "When would you shard vs. scale vertically?" | "Shard when vertical scaling hits memory, connection, or operational walls simultaneously. But add a proxy layer first; it often buys 12-18 months of headroom." |
| "What are the downsides of transparent sharding?" | "Cross-shard JOINs break. You need to design your data model so that queries stay within shard boundaries, or resolve JOINs at the application layer." |
Quick Recap
- GitHub ran all core data on a single MySQL primary with read replicas, hitting operational limits around 2018: 1-3 minute failovers, ~1,500 max connections, and painful schema changes.
- Vitess is a MySQL-compatible proxy layer that bundles connection multiplexing, sub-second coordinated failover, and transparent sharding into one system.
- VTGate handles query routing and connection pooling, VTTablet manages each MySQL instance and coordinates failover, and the topology service (etcd) stores the shard map.
- The migration ran in four phases over 2-3 years: shadow reads, writes through Vitess, failover validation, then table-by-table sharding.
- Application code changes were minimal because Vitess speaks MySQL wire protocol. The main exception: cross-shard JOINs require application-layer refactoring.
- The database scaling ladder (read replicas, connection pooling, proxy layer, sharding) is a repeatable pattern. Know which rung you are on before jumping to the next.
Related Concepts
- Sharding explains the fundamentals of horizontal data partitioning, which is the capability Vitess enables transparently for GitHub's MySQL infrastructure.
- Database Fundamentals covers the MySQL storage engine internals (InnoDB, B+ trees, buffer pool) that Vitess leaves unchanged underneath its proxy layer.
- Replication details primary-replica replication strategies, which Vitess manages through VTTablet instead of manual orchestrator configuration.
- Load Balancing covers traffic distribution patterns at the network layer, analogous to how VTGate distributes query traffic across database shards.