Make vs. buy framework for system design

TL;DR

Make vs. buy is the single most common judgment call staff+ engineers face, and it reveals more about your technical maturity than any whiteboard algorithm question.
Buy (managed service) wins when the problem is commodity, your team lacks operational expertise, or you are pre-PMF and speed matters more than cost optimization.
Build (custom) wins when the component is core differentiation, managed options hit hard limits, or your scale economics make a dedicated team cheaper than the vendor bill.
Open source (self-hosted) is not free. It is closer to "build" than "buy" in operational cost. You own the upgrades, the 3 AM pages, and the security patches.
The most common trap: building an abstraction layer "in case we switch vendors" that costs 6 months of engineering time for a migration that never happens.
In interviews, the one sentence that works: "I'd use managed X here because Y, unless we have a specific requirement that Z." Then move on.

Why This Decision Defines Staff-Level Judgment

A team I worked with once decided to build their own email delivery pipeline. The reasoning sounded solid: "We send 20 million emails a month, SES costs $2,000/month, and we can build it ourselves." Twelve months later, they had two engineers each spending about half their time on deliverability issues, IP warm-up, bounce handling, suppression lists, and DKIM rotation. At a fully-loaded cost of $200K per engineer, that half-time commitment came to $200,000 per year. They had replaced a $24,000/year SES bill with a $200,000/year internal project that delivered worse inbox placement rates.

This is the pattern. The build cost is almost never "just the initial engineering." It is the initial build plus 24 months of operational overhead, on-call burden, upgrade cycles, and opportunity cost of engineers who could be working on your actual product.

Make vs. buy decisions show up in every staff+ design review and every system design interview at that level. They reveal whether you can zoom out from the technical details and reason about total cost of ownership, organizational capacity, and strategic priorities. Senior engineers pick the right technology. Staff engineers decide whether to pick a technology at all, or just pay someone else to handle it.

Interview signal

When you proactively address make vs. buy in a system design interview (without being asked), interviewers notice. It signals that you think about engineering as a business function, not just a coding exercise.

The Three Options

Every infrastructure decision has three paths, and each has a cost model that looks very different from what teams initially estimate.

Build (custom software)

Definition: You write the software from scratch (or near-scratch), own the codebase, deploy it, and maintain it indefinitely.

True cost model: Engineering time to build (typically 2-6 months) + engineering time to operate (0.25-1.0 FTEs ongoing) + on-call cost + opportunity cost of what those engineers could have built instead.
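As a rough sketch, the ranges in that cost model can be turned into a 24-month band. The $200K fully-loaded engineer cost and the 24-month horizon are the figures this article uses elsewhere. Treating the build as one engineer's full-time work, and leaving on-call and opportunity cost out of the formula, are simplifying assumptions of this example.

```python
ANNUAL_ENGINEER_COST = 200_000  # fully-loaded figure used throughout the article
HORIZON_MONTHS = 24


def build_cost(build_months: float, ops_fte: float) -> float:
    """Initial build plus ongoing operation over the horizon.

    Assumes one engineer builds full-time, then `ops_fte` of an engineer
    operates the system for the remaining months. On-call burden and
    opportunity cost are not modeled.
    """
    monthly = ANNUAL_ENGINEER_COST / 12
    operate_months = HORIZON_MONTHS - build_months
    return build_months * monthly + operate_months * ops_fte * monthly


# Low end: 2-month build, 0.25 FTE to operate.
low = build_cost(build_months=2, ops_fte=0.25)
# High end: 6-month build, 1.0 FTE to operate.
high = build_cost(build_months=6, ops_fte=1.0)
print(f"24-month build cost: ${low:,.0f} to ${high:,.0f}")
```

Even the low end of that band is well into six figures before a single on-call page is counted.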

Hidden costs people miss:

On-call rotation. Someone pages at 3 AM when it breaks.
Security patches. You own every CVE in every dependency.
Scaling work. The thing that handles 1K QPS today needs rework at 50K QPS.
Knowledge concentration risk. When the engineer who built it leaves, you have a bus factor of zero.

When it is the right call: The component is core to your product differentiation, and no vendor or open-source project meets your specific requirements. Netflix building Open Connect (their custom CDN) makes sense because CDN performance directly affects their user experience, and they push 15%+ of global internet traffic.

When it is a trap: When engineers want to build it because it is technically interesting, not because it is strategically necessary. Building a custom CI/CD pipeline when GitHub Actions exists is almost never justified.

Open source (self-hosted)

Definition: You deploy an open-source project (Kafka, PostgreSQL, Prometheus, Redis, Elasticsearch) on your own infrastructure and maintain it.

True cost model: Integration time (1-4 weeks) + infrastructure cost + operational overhead (0.25-0.5 FTEs per major system) + upgrade treadmill.

Hidden costs people miss:

The upgrade treadmill. You are always 2-3 versions behind, and the gap grows.
Community support is not SLA-backed support. When you hit a production-breaking bug at 2 AM, a GitHub issue does not resolve your outage.
Configuration tuning. Default configs work at small scale. At production scale, you need expertise to tune JVM heap sizes, connection pools, replication settings, and compaction strategies.

When it is the right call: Your team has genuine operational expertise with this software, the marginal cost of adding another cluster is low, and you need configuration flexibility that managed services do not offer.

When it is a trap: When the team assumes "open source = free" and does not budget for the operational cost. I have seen teams adopt self-hosted Kafka thinking it would save money, only to discover they needed a dedicated half-time engineer just for cluster management.

Buy (managed service / SaaS)

Definition: You pay a vendor (AWS, Confluent, Datadog, Stripe, Auth0) to run the software for you. You consume it through an API or managed console.

True cost model: Monthly fee x 24 months + integration time (1-4 weeks) + potential lock-in cost (migration effort if you ever leave).

Hidden costs people miss:

Lock-in cost is real but frequently overestimated. Teams spend months building "vendor-agnostic" abstractions for a migration that has a less than 5% chance of happening.
Feature gaps. The managed service does 90% of what you need. That last 10% either does not exist or requires a workaround.
Price increases. Vendors raise prices. Budget for 10-20% annual increases.
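To see what a price increase does to the buy-side model, here is a minimal sketch. It assumes the increase lands once per 12-month renewal, which is a simplification, and the $2,500 monthly fee is a placeholder. Lock-in cost is left out on purpose.

```python
def subscription_cost(monthly_fee: float, months: int = 24,
                      annual_increase: float = 0.0) -> float:
    """Total fees over `months`, with the price stepping up once per year."""
    total = 0.0
    for month in range(months):
        years_elapsed = month // 12  # 0 in year one, 1 in year two, ...
        total += monthly_fee * (1 + annual_increase) ** years_elapsed
    return total


def buy_cost(monthly_fee: float, integration_cost: float, months: int = 24,
             annual_increase: float = 0.0) -> float:
    """Subscription fees plus one-off integration engineering."""
    return subscription_cost(monthly_fee, months, annual_increase) + integration_cost


flat = subscription_cost(2_500)                         # no increase
bumped = subscription_cost(2_500, annual_increase=0.10)  # 10% at renewal
print(f"flat: ${flat:,.0f}, with 10% renewal increase: ${bumped:,.0f}")
```

Over a 24-month window a 10% renewal increase adds about 5% to the total, so it matters, but it rarely flips a decision on its own.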

When it is the right call: The problem is commodity (auth, payments, email, observability, blob storage), your team does not have specialized expertise, and you value time-to-market over cost optimization.

When it is a trap: When you buy before understanding your requirements. I once saw a team sign a $180K/year Datadog contract, then discover six months later that 60% of their monitoring needs could have been handled by Prometheus plus Grafana at a fraction of the cost. Know what you need before you buy.

Head-to-head comparison

Dimension	Build (custom)	Open source (self-hosted)	Buy (managed)
Upfront cost	High (2-6 months eng)	Medium (1-4 weeks integration)	Low (days to weeks)
Ongoing cost	High (0.25-1.0 FTE)	Medium (0.25-0.5 FTE)	Predictable monthly fee
Time to production	Months	Weeks	Days to weeks
Control	Full	High (code access)	Low (API/config only)
Expertise required	Deep domain + ops	Ops + config tuning	Integration only
Lock-in risk	None	Low (standard formats)	Medium to high
Scaling effort	You handle it	You handle it	Vendor handles it
Example	Custom payment engine	Self-hosted Kafka	Confluent Cloud

The Decision Framework

Here is a structured six-step process that works in both real architecture decisions and interview discussions. Each step has concrete guidance, not vague advice.

Step 1: Define requirements specifically

"We need a cache" is not a requirement. A requirement sounds like this: "We need a read-through cache for our product catalog API, handling 50K reads/second with p99 latency under 5ms, storing up to 500K keys averaging 2KB each, with a staleness tolerance of 30 seconds."

The specificity matters because it determines whether a managed service can meet your needs. "We need Kafka" might actually mean "we need a durable message queue with at-least-once delivery and partition-level ordering for 10K messages/second." An SQS FIFO queue, which preserves order within a message group, may well cover that. If it does, you do not need Kafka.

Before evaluating options, write down:

Throughput requirements (QPS, messages/second, bandwidth)
Latency requirements (p50, p99)
Durability and consistency requirements
Data volume and retention period
Integration points (what talks to this component?)
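One way to keep requirements this specific is to write them down as data rather than prose. The sketch below encodes the product catalog cache example from this step. The class and field names are made up for illustration, and the size estimate counts only values, not key names or per-entry overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheRequirements:
    reads_per_second: int
    p99_latency_ms: float
    max_keys: int
    avg_value_bytes: int
    staleness_tolerance_s: int

    def working_set_bytes(self) -> int:
        """Raw payload size, before key names and per-entry overhead."""
        return self.max_keys * self.avg_value_bytes


# The product catalog cache described above.
catalog_cache = CacheRequirements(
    reads_per_second=50_000,
    p99_latency_ms=5,
    max_keys=500_000,
    avg_value_bytes=2 * 1024,
    staleness_tolerance_s=30,
)
print(f"{catalog_cache.working_set_bytes() / 1024**3:.2f} GiB of values")
```

Writing the numbers down makes the sizing question concrete: this working set is about 1 GB, which is a very different conversation from "we need a cache."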

Step 2: List the best managed options

For every common infrastructure need, there is a mature managed option. Evaluate at least two.

Need	Managed options	Quick evaluation
Message queue	SQS, Confluent Cloud, MSK, Google Pub/Sub	SQS for simple queuing, Confluent for streaming semantics
Cache	ElastiCache, MemoryDB, Momento	ElastiCache for standard Redis, Momento for serverless
Search	Elastic Cloud, Algolia, OpenSearch Service	Algolia for simple search UX, Elastic for complex queries
Auth	Auth0, Cognito, Clerk	Auth0 for flexibility, Clerk for developer experience
Payments	Stripe, Adyen, Braintree	Stripe for simplicity, Adyen for enterprise global
Observability	Datadog, New Relic, Grafana Cloud	Datadog for all-in-one, Grafana Cloud for open-source stack
Blob storage	S3, GCS, Azure Blob	S3 is the default. Almost never build this.

Step 3: Estimate buy cost (24-month TCO)

Monthly fee x 24 + integration engineering time + lock-in migration cost estimate.

Example: Confluent Cloud for a mid-size workload

100 MB/s throughput, 3 topics, 7-day retention
Confluent Cloud cost: roughly $2,500-$4,000/month depending on cluster type
24-month cost: $60,000-$96,000
Integration time: 2 engineers x 2 weeks = roughly $16,000 (at $200K/year fully-loaded, about $4,000 per engineer-week)
Lock-in migration cost: medium. Kafka wire protocol is standard, but Confluent-specific features (Schema Registry SaaS, ksqlDB) add switching cost.
Total 24-month estimate: $76,000-$112,000

Step 4: Estimate build/run cost (24-month TCO)

Engineering time x fully-loaded cost + infrastructure + 24 months operational overhead.

Example: Self-hosted Kafka for the same workload

Build/setup time: 1 engineer x 4 weeks = roughly $16,000 (at $200K/year fully-loaded, about $4,000 per engineer-week)
Infrastructure: 6 brokers (3 per AZ) on m5.xlarge = roughly $2,400/month
Operational overhead: 0.3 FTE dedicated to Kafka ops = $5,000/month (at $200K/year)
24-month infrastructure: $57,600
24-month ops: $120,000
Total 24-month estimate: $193,600

The self-hosted option costs roughly 2x the managed option for this workload. This is the math that most "let's just run it ourselves" conversations skip.
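The recurring part of that estimate is easy to reproduce. This is a small sketch using the infrastructure and ops figures from the example above. The function name is illustrative, and the one-off setup cost is added on top of what it returns.

```python
def run_cost(infra_monthly: float, ops_fte: float, months: int = 24,
             annual_engineer_cost: float = 200_000) -> dict:
    """Recurring cost of running a self-hosted system (setup not included)."""
    infra = infra_monthly * months
    ops = ops_fte * annual_engineer_cost / 12 * months
    return {"infrastructure": infra, "ops": ops, "recurring_total": infra + ops}


# Self-hosted Kafka example: $2,400/month of brokers, 0.3 FTE of ops.
kafka = run_cost(infra_monthly=2_400, ops_fte=0.3)
ops_share = kafka["ops"] / kafka["recurring_total"]
print(f"infrastructure: ${kafka['infrastructure']:,.0f}")
print(f"ops:            ${kafka['ops']:,.0f} ({ops_share:.0%} of recurring cost)")
```

Note where the money goes: about two thirds of the recurring cost is people, not hardware. Cheaper instances barely move the total. A smaller ops burden does.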

Step 5: Consider lock-in risk

Lock-in is real, but it is also the most commonly over-weighted factor in make vs. buy decisions. Ask these specific questions:

Is the data portable? Can you export your data in a standard format (CSV, Parquet, JSON, Avro)? If yes, lock-in risk is low regardless of the vendor.
Is the API proprietary? DynamoDB's API is proprietary. PostgreSQL on RDS uses standard SQL. The portability difference is enormous.
What is the realistic migration timeline? If migration takes 2 weeks of one engineer's time, the lock-in cost is about $8,000 (at $200K/year fully-loaded). If it takes 6 months, it is $100,000. Be specific.
What is the probability of migration? Be honest. Most vendor migrations never happen. If the probability is under 10%, weight the lock-in cost accordingly.

The lock-in overreaction

Teams routinely spend 3-6 months building "vendor-agnostic" abstraction layers to avoid lock-in. The abstraction itself becomes a maintenance burden, and the migration it was designed to enable almost never happens. Build for what you need today. If you migrate in 3 years, the rewrite will probably be warranted anyway.
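Weighting lock-in by probability makes the overreaction visible. The sketch below takes the six-month, $100,000 migration from the questions above and compares its expected cost with the low end of a 3-6 month abstraction effort. It generously assumes the abstraction would remove the entire migration cost, which is rarely true in practice.

```python
ANNUAL_ENGINEER_COST = 200_000  # fully-loaded figure used throughout the article


def expected_lock_in_cost(migration_cost: float, probability: float) -> float:
    """Migration cost weighted by the chance the migration ever happens."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return migration_cost * probability


migration = 0.5 * ANNUAL_ENGINEER_COST     # six engineer-months = $100,000
abstraction = 0.25 * ANNUAL_ENGINEER_COST  # three engineer-months = $50,000

expected = expected_lock_in_cost(migration, probability=0.10)
# Probability at which the abstraction merely breaks even.
break_even = abstraction / migration

print(f"expected lock-in cost: ${expected:,.0f}")
print(f"abstraction layer:     ${abstraction:,.0f}")
print(f"break-even migration probability: {break_even:.0%}")
```

Under these assumptions the abstraction only pays for itself if you are at least 50% sure you will migrate. At a 10% probability you are spending $50,000 to avoid an expected $10,000.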

Step 6: Make the call based on numbers

Compare the 24-month TCO of buy vs. build, factor in the qualitative dimensions (lock-in risk, team expertise, time-to-market pressure), and decide. If the numbers are within 20% of each other, let the qualitative factors tip the balance. If one option is 2x cheaper, the qualitative factors rarely overcome that gap.
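The rule of thumb above can be sketched as a small function. The 20% and 2x thresholds come from the text. The middle band between them ("lean toward the cheaper option") is an assumption of this example, as is reading "within 20%" as the pricier option costing at most 1.2x the cheaper one. The inputs are round illustrative numbers.

```python
def recommend(buy_tco: float, build_tco: float) -> str:
    """Turn two 24-month TCO estimates into a starting recommendation."""
    if buy_tco <= 0 or build_tco <= 0:
        raise ValueError("TCO estimates must be positive")
    (cheap_name, cheap), (_, pricey) = sorted(
        [("buy", buy_tco), ("build", build_tco)], key=lambda option: option[1]
    )
    ratio = pricey / cheap
    if ratio <= 1.20:
        return "close call: let qualitative factors decide"
    if ratio >= 2.0:
        return f"{cheap_name}: cost gap is decisive"
    return f"lean {cheap_name}: qualitative factors can still override"


print(recommend(buy_tco=100_000, build_tco=110_000))
print(recommend(buy_tco=90_000, build_tco=190_000))
print(recommend(buy_tco=100_000, build_tco=150_000))
```

The output is a starting position, not a verdict. Lock-in risk, team expertise, and time-to-market pressure still get weighed on top of it.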

The decision should look like this in a design review or interview: "For our message queue, I'd use Confluent Cloud. Its 24-month cost is roughly half that of self-hosted Kafka, our team doesn't have deep Kafka operational expertise, and the standard wire protocol gives us an exit path if we ever need to migrate."

When Managed Services Win (with Real Numbers)

1. Commodity problems

Auth, email delivery, payment processing, observability, blob storage, DNS, CDN. These are solved problems with mature vendors who have invested hundreds of millions of dollars in reliability.
