Rewrite vs. refactor decision
The technical and organizational framework for deciding whether to rewrite a system from scratch or refactor it incrementally: what signals to look for and how to justify the decision.
The Joel Spolsky Rule
Joel Spolsky famously called rewriting from scratch "the single worst strategic mistake a software company can make." He published that in 2000. It's still right most of the time, and wrong in important edge cases. Understanding when it's wrong is the point.
The question isn't "is this code bad?" Code is always bad. The question is: "Can the current codebase be improved incrementally to meet our needs, or has it accumulated constraints that make incremental improvement impractical?"
When Refactoring Is Correct (the default)
Refactoring is the correct default because:
- Working software encodes discovered behavior that no design document captures. A rewrite will lose much of this.
- Users won't see "we rewrote it" as value. Every engineer-year spent on a rewrite is a year not spent on features.
- Rewrites routinely run far past their estimates; 3x longer is a common rule of thumb. Plan for it on every rewrite.
- During a rewrite, the old system still needs maintenance. You're paying double.
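To make the last two bullets concrete, here is a back-of-the-envelope sketch. The function name, the team size, and the single maintenance engineer are illustrative assumptions; only the 3x overrun factor comes from the text above.

```python
def rewrite_cost(estimate_months, team_size, overrun_factor=3.0, maintenance_engineers=1):
    """Engineer-months consumed before the rewrite can replace the old system.

    All inputs are hypothetical planning numbers, not measured data.
    """
    actual_months = estimate_months * overrun_factor
    build = actual_months * team_size                # effort on the new system
    maintain = actual_months * maintenance_engineers  # old system still needs care
    return {
        "calendar_months": actual_months,
        "build": build,
        "maintenance": maintain,
        "total": build + maintain,
    }


# A 4-person team estimates 6 months, i.e. 24 engineer-months on paper.
plan = rewrite_cost(estimate_months=6, team_size=4)
print(plan)
# {'calendar_months': 18.0, 'build': 72.0, 'maintenance': 18.0, 'total': 90.0}
```

Under these assumptions the 24 engineer-month estimate becomes 90 once the overrun and the parallel maintenance are counted.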
Signs that incremental refactoring can work:
- The core data model is sound; the problems are in the service layer
- You can add tests to the existing code and use them as a refactoring harness
- The new behavior you need is additive, not a replacement of fundamental assumptions
- The coupling is local, not systemic
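Using tests as a refactoring harness usually means characterization tests: record what the code does today, then check that the refactored version matches. A minimal sketch, assuming a hypothetical legacy function `legacy_shipping_cost` and a candidate replacement (neither comes from a real codebase):

```python
def legacy_shipping_cost(weight_kg, country):
    # Hypothetical legacy code with undocumented behavior baked in.
    cost = 5.0 + 1.2 * weight_kg
    if country != "US":
        cost += 10.0
    if weight_kg > 20:
        cost = cost * 0.9  # bulk discount that no design document mentions
    return round(cost, 2)


def refactored_shipping_cost(weight_kg, country):
    # Candidate replacement: same arithmetic, clearer structure.
    base = 5.0 + 1.2 * weight_kg
    surcharge = 0.0 if country == "US" else 10.0
    discount = 0.9 if weight_kg > 20 else 1.0
    return round((base + surcharge) * discount, 2)


def characterize(func, cases):
    # Record current outputs so they can serve as the expected values.
    return {case: func(*case) for case in cases}


def mismatches(func, golden):
    # Every input where the new code disagrees with the recorded behavior.
    return {
        case: (expected, func(*case))
        for case, expected in golden.items()
        if func(*case) != expected
    }


CASES = [(1, "US"), (1, "DE"), (20, "US"), (21, "US"), (30, "FR")]
GOLDEN = characterize(legacy_shipping_cost, CASES)

# An empty dict means the refactor preserved behavior on the recorded cases.
print(mismatches(refactored_shipping_cost, GOLDEN))
```

The harness only protects the cases it records, so the case list should deliberately include boundaries such as the 20 kg threshold here.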
When Rewriting Is Justified
Rewriting is justified when the existing system has one or more of these properties:
1. The data model is wrong at a fundamental level
If the data structure encoded in the database is wrong, every query, every business rule, every API contract is built on the wrong foundation. Adding features requires working around the wrong model, and the workarounds accumulate.
Example: A payments system built on a ledger model where debits and credits
are in separate tables and linked by a mutable "reconciled" flag. Adding
multi-currency support is impossible without a different data model;
you can't incrementally change the ledger structure.
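For contrast, one possible target shape is a single set of immutable, signed entries that carry their currency, with balance checked per transaction and per currency instead of tracked by a mutable flag. This is a sketch under assumptions: the `Entry` fields, account names, and amounts are hypothetical, and it is not the only model that would support multiple currencies.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: entries are never mutated after creation
class Entry:
    transaction_id: str
    account: str
    currency: str       # ISO 4217 code, e.g. "USD"
    amount_minor: int   # signed, in minor units (cents); debit > 0, credit < 0


def unbalanced_currencies(entries):
    """Return {(transaction_id, currency): net} for every net that is not zero."""
    totals = defaultdict(int)
    for e in entries:
        totals[(e.transaction_id, e.currency)] += e.amount_minor
    return {key: net for key, net in totals.items() if net != 0}


entries = [
    Entry("t1", "customer:alice", "USD", -5000),
    Entry("t1", "merchant:shop", "USD", 5000),
    # A conversion passes through a hypothetical FX clearing account,
    # so each currency still nets to zero within the transaction.
    Entry("t2", "customer:bob", "EUR", -1000),
    Entry("t2", "fx:clearing", "EUR", 1000),
    Entry("t2", "fx:clearing", "USD", -1080),
    Entry("t2", "merchant:shop", "USD", 1080),
]

print(unbalanced_currencies(entries))  # {}
```

Moving from two tables and a flag to a shape like this changes what every query and report reads, which is why the example treats it as a rewrite-level change.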
2. The runtime environment is being retired
A legacy language runtime (Python 2, Ruby 1.9, Java 6), framework (Rails 3), or platform (Heroku Cedar-14) approaching end-of-life forces a migration. Sometimes that migration is larger than the refactor that would have been needed anyway.
3. Incremental change is structurally blocked
Signals:
- Every PR takes 2+ weeks because the test suite takes 3 hours to run
- Adding a new feature requires changing 8 separate files due to coupling
- The deployment pipeline is so fragile that releases require 4 people
- You cannot run the system locally; developers test in staging only
When the development environment is so degraded that incremental change is more expensive than a rewrite, the cost-benefit shifts.
4. The team maintaining it can't understand it anymore
This is "tribal knowledge" debt. If the system can only be maintained by two specific engineers because no one else can reason about its behavior, the chance of those engineers leaving (a bus factor of two) is a legitimate business risk.
The Incremental Rewrite: The Safe Path
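When you do justify a rewrite, the safest approach is the strangler fig pattern. New code grows around the old system behind a shared entry point and takes over its responsibilities piece by piece.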
You migrate one feature at a time. At no point is there a cutover where "everything switches at once." The new system earns trust through production traffic before it handles the full load.
The full green-field rewrite deployed as a big-bang replacement has almost no successful examples at scale.
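The feature-by-feature migration can be sketched as a routing facade that sits in front of both systems. This is a minimal illustration, not a production router: the class name `StranglerFacade`, the handler dictionaries, and the feature names are all hypothetical.

```python
import random


class StranglerFacade:
    """Routes each feature to the legacy or the new implementation."""

    def __init__(self, legacy_handlers, new_handlers, rng=random.random):
        self.legacy = legacy_handlers
        self.new = new_handlers
        self.rollout = {}  # feature name -> fraction of traffic sent to the new system
        self.rng = rng     # injectable so routing can be made deterministic in tests

    def set_rollout(self, feature, fraction):
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("fraction must be between 0 and 1")
        if fraction > 0 and feature not in self.new:
            raise KeyError(f"no new implementation for {feature!r}")
        self.rollout[feature] = fraction

    def handle(self, feature, request):
        # Features with no rollout entry stay entirely on the legacy system.
        fraction = self.rollout.get(feature, 0.0)
        if self.rng() < fraction:
            return self.new[feature](request)
        return self.legacy[feature](request)


facade = StranglerFacade(
    legacy_handlers={
        "invoices": lambda req: ("legacy", req),
        "reports": lambda req: ("legacy", req),
    },
    new_handlers={"invoices": lambda req: ("new", req)},
)
facade.set_rollout("invoices", 1.0)  # invoices fully migrated; reports untouched

print(facade.handle("invoices", 42))  # ('new', 42)
print(facade.handle("reports", 42))   # ('legacy', 42)
```

Because the rollout is per feature and fractional, a misbehaving feature can be sent back to the legacy path by setting its fraction to 0 without touching the others.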
The Justification Document
If you're proposing a rewrite, you need a clear written case. The document should answer:
1. What is wrong with the current system, specifically?
(Not "it's messy." What failure modes does it create? What slowdowns?)
2. What have you tried incrementally, and why did it fail?
(Shows you didn't jump to a rewrite as a first resort)
3. What is the scope of the rewrite?
(What gets rewritten? What stays? What's out of scope?)
4. What is the migration path?
(Strangler fig? Shadow reads? Cutover? When do users notice anything?)
5. What does "done" look like?
(Concrete criteria, not "when it feels better than before")
6. What is the risk if the rewrite runs over?
(What's the plan if it takes 2x longer than estimated?)
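Of the migration options named in question 4, shadow reads are the least visible to users: the legacy system keeps answering, while the new system is queried on the side and disagreements are logged. A minimal sketch, assuming hypothetical `legacy_read` and `new_read` callables and an in-memory mismatch list standing in for real logging:

```python
def shadow_read(legacy_read, new_read, key, mismatches):
    """Serve the legacy result; query the new system on the side and record disagreements."""
    expected = legacy_read(key)
    try:
        actual = new_read(key)
        if actual != expected:
            mismatches.append((key, expected, actual))
    except Exception as exc:
        # A failure in the new system must never break the user-facing path.
        mismatches.append((key, expected, repr(exc)))
    return expected


# Hypothetical stores: the new system has one wrong value and one missing key.
legacy_store = {"a": 1, "b": 2, "c": 3}
new_store = {"a": 1, "b": 20}

log = []
results = [
    shadow_read(legacy_store.__getitem__, new_store.__getitem__, key, log)
    for key in ("a", "b", "c")
]

print(results)   # [1, 2, 3]
print(len(log))  # 2
```

The mismatch count over time also gives a concrete input for the "done" criteria in question 5.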
Story Structure
Context (30s): What was the system, what was the problem, what made it difficult?
Your analysis (2 min): What did you find? Why was refactoring insufficient? What data supported the rewrite case?
The decision process (1 min): How did you get alignment? Who was skeptical?
The execution (1 min): What approach did you use? What was harder than expected?
Outcome (30s): Where is the system now? What do you wish you'd done differently?
Quick Recap
- Refactoring is the correct default. Rewrites routinely take about 3x longer than estimated and lose encoded behavior.
- Rewriting is justified when: the data model is fundamentally wrong, the runtime is being retired, incremental change is structurally blocked, or no one understands the system anymore.
- When rewriting, use the strangler fig pattern: migrate one feature at a time, never a big-bang cutover.
- Write a justification document: what's wrong, what you tried, scope, migration path, done criteria, and risk if it overruns.
- The hardest part of a rewrite is maintaining the old system while building the new one; budget for double the engineering overhead.