Rewrite vs. refactor decision

The Joel Spolsky Rule

Joel Spolsky famously called rewriting from scratch "the single worst strategic mistake that any software company can make." He published that in 2000, in "Things You Should Never Do, Part I." It's still right most of the time, and wrong in important edge cases. Understanding when it's wrong is the point.

The question isn't "is this code bad?" Code is always bad. The question is: "Can the current codebase be improved incrementally to meet our needs, or has it accumulated constraints that make incremental improvement impractical?"

When Refactoring Is Correct (the default)

Refactoring is the correct default because:

- Working software encodes discovered behavior that no design document captures: bug fixes, edge cases, and workarounds learned in production. A rewrite will lose much of this.
- Users won't see "we rewrote it" as value. Every engineer-year spent on the rewrite is a year not spent on features.
- As a rule of thumb, rewrites take about 3x longer than estimated. Assume yours will too.
- During a rewrite, the old system still needs maintenance. You're paying double.

Signs that incremental refactoring can work:

- The core data model is sound; the problems are in the service layer
- You can add tests to the existing code and use them as a refactoring harness
- The new behavior you need is additive, not a replacement of fundamental assumptions
- The coupling is local, not systemic
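
A refactoring harness is usually built from characterization tests: tests that pin down what the code does today, quirks included, before anything is changed. The sketch below shows the idea in Python. `legacy_shipping_cost_cents` and its recorded values are hypothetical stand-ins for existing code, not taken from any real system.

```python
def legacy_shipping_cost_cents(weight_kg, country):
    """Hypothetical legacy function; imagine nobody remembers why it does this."""
    cost = 500 + 120 * weight_kg
    if country != "US":
        cost = cost * 3 // 2
    if weight_kg > 20:
        cost += 1000  # undocumented surcharge, discovered only by running the code
    return cost


# Golden values recorded by running the current code, not derived from a spec.
RECORDED_CASES = [
    ((1, "US"), 620),
    ((1, "DE"), 930),
    ((25, "US"), 4500),
    ((25, "DE"), 6250),
]


def find_regressions(func, cases=RECORDED_CASES):
    """Return (args, expected, actual) for every case whose output changed."""
    regressions = []
    for args, expected in cases:
        actual = func(*args)
        if actual != expected:
            regressions.append((args, expected, actual))
    return regressions
```

Run `find_regressions` against the function after each refactoring step. An empty list means the recorded behavior is intact; anything else shows exactly which input changed and how.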

When Rewriting Is Justified

Rewriting is justified when the existing system has one or more of these properties:

1. The data model is wrong at a fundamental level

If the data structure encoded in the database is wrong, every query, every business rule, every API contract is built on the wrong foundation. Adding features requires working around the wrong model, and the workarounds accumulate.

Example: A payments system built on a ledger model where debits and credits
are in separate tables and linked by a mutable "reconciled" flag. Adding
multi-currency support requires a different data model; the ledger
structure cannot be changed incrementally.
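
To make the contrast concrete, here is a sketch of what "a different data model" could look like: a single append-only set of entries, each carrying its own currency, with reconciliation computed from the data instead of stored in a mutable flag. The `Entry` type, its field names, and the sign convention are illustrative assumptions, not a description of any particular payments system.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Entry:
    """One immutable ledger line; debits and credits live in the same table."""
    transaction_id: str
    account: str
    amount: Decimal  # assumed convention: positive = debit, negative = credit
    currency: str    # ISO 4217 code such as "USD"


def unbalanced_transactions(entries):
    """Return ids of transactions whose entries do not sum to zero per currency."""
    totals = defaultdict(Decimal)  # Decimal() is zero
    for e in entries:
        totals[(e.transaction_id, e.currency)] += e.amount
    return sorted({tx for (tx, _), total in totals.items() if total != 0})


entries = [
    Entry("t1", "cash", Decimal("10.00"), "USD"),
    Entry("t1", "revenue", Decimal("-10.00"), "USD"),
    Entry("t2", "cash", Decimal("5.00"), "EUR"),  # no matching credit yet
]
print(unbalanced_transactions(entries))
```

Because currency is part of every entry and "reconciled" is derived, adding a currency is a data change, not a schema change. Getting from the two-table model to this one is the part that touches every query.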

2. The runtime environment is being retired

A legacy language runtime (Python 2, Ruby 1.9, Java 6), framework (Rails 3), or platform (Heroku Cedar-14) approaching end-of-life forces a migration. Sometimes that forced migration is bigger than the refactor you were putting off, and touches enough of the code that rewriting on the new runtime costs little more.

3. Incremental change is structurally blocked

Signals:
- Every PR takes 2+ weeks because the test suite takes 3 hours to run
- Adding a new feature requires changing 8 separate files due to coupling
- The deployment pipeline is so fragile that releases require 4 people
- You cannot run the system locally; developers test in staging only

When the development environment is so degraded that incremental change is more expensive than a rewrite, the cost-benefit shifts.

4. The team maintaining it can't understand it anymore

This is "tribal knowledge" debt. If the system can only be maintained by 2 specific engineers because no one else can reason about its behavior, the chance of those engineers leaving (a bus factor of 2) is a legitimate business risk.

The Incremental Rewrite: The Safe Path

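When a rewrite is justified, the safest approach is the strangler fig pattern: put a routing layer in front of the old system and move functionality behind it to the new system piece by piece.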

You migrate one feature at a time. At no point is there a cutover where "everything switches at once." The new system earns trust through production traffic before it handles the full load.
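
A minimal sketch of that routing idea, assuming each feature can be expressed as a named handler. `StranglerFacade` and the handler names are hypothetical; in a real system this role is often played by a reverse proxy or API gateway, not application code.

```python
class StranglerFacade:
    """Routes each feature to the legacy or the new implementation."""

    def __init__(self, legacy_handlers, new_handlers):
        self._legacy = legacy_handlers
        self._new = new_handlers
        self._migrated = set()

    def migrate(self, feature):
        """Start sending one feature's traffic to the new system."""
        if feature not in self._new:
            raise KeyError(f"no new implementation for {feature!r}")
        self._migrated.add(feature)

    def rollback(self, feature):
        """Send a feature's traffic back to the legacy system."""
        self._migrated.discard(feature)

    def handle(self, feature, *args, **kwargs):
        handlers = self._new if feature in self._migrated else self._legacy
        return handlers[feature](*args, **kwargs)


facade = StranglerFacade(
    legacy_handlers={
        "invoice": lambda n: f"legacy invoice {n}",
        "refund": lambda n: f"legacy refund {n}",
    },
    new_handlers={"invoice": lambda n: f"new invoice {n}"},
)
before = facade.handle("invoice", 1)   # served by the legacy system
facade.migrate("invoice")
after = facade.handle("invoice", 1)    # served by the new system
untouched = facade.handle("refund", 1) # still legacy; not migrated yet
```

The design choice that matters is `rollback`: because routing is per feature and reversible, a bad migration costs one feature's traffic for a short time, not the whole system.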

The full green-field rewrite deployed as a big-bang replacement has almost no successful examples at scale.

The Justification Document

If you're proposing a rewrite, you need a clear written case. The document should answer:

1. What is wrong with the current system, specifically?
   (Not "it's messy." What failure modes does it create? What slowdowns?)

2. What have you tried incrementally, and why did it fail?
   (Shows you didn't jump to a rewrite as a first resort)

3. What is the scope of the rewrite?
   (What gets rewritten? What stays? What's out of scope?)

4. What is the migration path?
   (Strangler fig? Shadow reads? Cutover? When do users notice anything?)

5. What does "done" look like?
   (Concrete criteria, not "when it feels better than before")

6. What is the risk if the rewrite runs over?
   (What's the plan if it takes 2x longer than estimated?)
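
Of the migration options named in question 4, shadow reads are the easiest to describe in code: the legacy system keeps serving every request while the new system is called on the side and its answers are compared. The sketch below assumes both systems expose a simple read function; `shadow_read` and its parameters are hypothetical names.

```python
def shadow_read(key, read_legacy, read_new, mismatches):
    """Serve the legacy result; run the new read on the side and record differences."""
    legacy_result = read_legacy(key)
    try:
        new_result = read_new(key)
    except Exception as exc:
        # A failure in the new system must never reach the user.
        mismatches.append((key, legacy_result, exc))
        return legacy_result
    if new_result != legacy_result:
        mismatches.append((key, legacy_result, new_result))
    return legacy_result


legacy_store = {"a": 1, "b": 2}
new_store = {"a": 1, "b": 3}  # "b" was migrated incorrectly
mismatches = []

result_a = shadow_read("a", legacy_store.get, new_store.get, mismatches)
result_b = shadow_read("b", legacy_store.get, new_store.get, mismatches)
```

Users only ever see legacy results here, so the answer to "when do users notice anything?" is "not during this phase." The mismatch list is also usable as a "done" criterion: for example, no mismatches over an agreed period of production traffic.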

Story Structure

Context (30s): What was the system, what was the problem, what made it difficult?

Your analysis (2 min): What did you find? Why was refactoring insufficient? What data supported the rewrite case?

The decision process (1 min): How did you get alignment? Who was skeptical?

The execution (1 min): What approach did you use? What was harder than expected?

Outcome (30s): Where is the system now? What do you wish you'd done differently?

Quick Recap

- Refactoring is the correct default. Rewrites take 3x longer than estimated and lose encoded behavior.
- Rewriting is justified when: the data model is fundamentally wrong, the runtime is being retired, incremental change is structurally blocked, or no one understands the system anymore.
- When rewriting, use the strangler fig pattern: migrate one feature at a time, never a big-bang cutover.
- Write a justification document: what's wrong, what you tried, scope, migration path, done criteria, and risk if it overruns.
- The hardest part of a rewrite is maintaining the old system while building the new one, so budget for double the engineering overhead.

