Make vs. buy framework for system design
A practical decision framework for choosing between building custom infrastructure, using open-source, or buying a managed service, with concrete criteria and common traps.
TL;DR
- Make vs. buy is the single most common judgment call staff+ engineers face, and it reveals more about your technical maturity than any whiteboard algorithm question.
- Buy (managed service) wins when the problem is commodity, your team lacks operational expertise, or you are pre-PMF and speed matters more than cost optimization.
- Build (custom) wins when the component is core differentiation, managed options hit hard limits, or your scale economics make a dedicated team cheaper than the vendor bill.
- Open source (self-hosted) is not free. It is closer to "build" than "buy" in operational cost. You own the upgrades, the 3 AM pages, and the security patches.
- The most common trap: building an abstraction layer "in case we switch vendors" that costs 6 months of engineering time for a migration that never happens.
- In interviews, the one sentence that works: "I'd use managed X here because Y, unless we have a specific requirement that Z." Then move on.
Why This Decision Defines Staff-Level Judgment
A team I worked with once decided to build their own email delivery pipeline. The reasoning sounded solid: "We send 20 million emails a month, SES costs $2,000/month, and we can build it ourselves." Twelve months later, they had two engineers spending roughly half of their time on deliverability issues, IP warm-up, bounce handling, suppression lists, and DKIM rotation. The fully-loaded cost of that time (two engineers at $200K each, half-time) was $200,000 per year. They had replaced a $24,000/year SES bill with a $200,000/year internal project that delivered worse inbox placement rates.
This is the pattern. The build cost is almost never "just the initial engineering." It is the initial build plus 24 months of operational overhead, on-call burden, upgrade cycles, and opportunity cost of engineers who could be working on your actual product.
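The arithmetic behind the email example can be sketched in a few lines. This is a minimal illustration, not a costing tool: the function names are hypothetical, the per-1,000-email price is an assumed rate chosen so that 20 million emails comes to the $2,000/month quoted above, and the staff figure uses the half-time allocation from the anecdote's cost calculation.

```python
# Annual cost comparison for the email example.
# ASSUMPTION: price_per_thousand is an illustrative rate that reproduces the
# $2,000/month figure from the anecdote; check your vendor's actual pricing.

def annual_vendor_cost(emails_per_month: int, price_per_thousand: float) -> float:
    """Yearly vendor bill for a flat per-1,000-message price."""
    return emails_per_month / 1000 * price_per_thousand * 12


def annual_staff_cost(engineers: int, loaded_salary: float, time_fraction: float) -> float:
    """Yearly cost of engineers spending a fraction of their time on a system."""
    return engineers * loaded_salary * time_fraction


vendor = annual_vendor_cost(20_000_000, 0.10)
staff = annual_staff_cost(engineers=2, loaded_salary=200_000, time_fraction=0.5)

print(f"vendor bill:    ${vendor:,.0f}/year")
print(f"in-house staff: ${staff:,.0f}/year")
print(f"in-house costs {staff / vendor:.1f}x the vendor bill")
```

With these inputs the in-house staff time alone comes to a little over eight times the vendor bill, before counting infrastructure or the worse inbox placement.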
Make vs. buy decisions show up in every staff+ design review and every system design interview at that level. They reveal whether you can zoom out from the technical details and reason about total cost of ownership, organizational capacity, and strategic priorities. Senior engineers pick the right technology. Staff engineers decide whether to pick a technology at all, or just pay someone else to handle it.
Interview signal
When you proactively address make vs. buy in a system design interview (without being asked), interviewers notice. It signals that you think about engineering as a business function, not just a coding exercise.
The Three Options
Every infrastructure decision has three paths, and each has a cost model that looks very different from what teams initially estimate.
Build (custom software)
Definition: You write the software from scratch (or near-scratch), own the codebase, deploy it, and maintain it indefinitely.
True cost model: Engineering time to build (typically 2-6 months) + engineering time to operate (0.25-1.0 FTEs ongoing) + on-call cost + opportunity cost of what those engineers could have built instead.
Hidden costs people miss:
- On-call rotation. Someone gets paged at 3 AM when it breaks.
- Security patches. You own every CVE in every dependency.
- Scaling work. The thing that handles 1K QPS today needs rework at 50K QPS.
- Knowledge concentration risk. If one engineer built it, your bus factor is one, and it drops to zero the day they leave.
When it is the right call: The component is core to your product differentiation, and no vendor or open-source project meets your specific requirements. Netflix building Open Connect (their custom CDN) makes sense because CDN performance directly affects their user experience, and they push 15%+ of global internet traffic.
When it is a trap: When engineers want to build it because it is technically interesting, not because it is strategically necessary. Building a custom CI/CD pipeline when GitHub Actions exists is almost never justified.
Open source (self-hosted)
Definition: You deploy an open-source project (Kafka, PostgreSQL, Prometheus, Redis, Elasticsearch) on your own infrastructure and maintain it.
True cost model: Integration time (1-4 weeks) + infrastructure cost + operational overhead (0.25-0.5 FTEs per major system) + upgrade treadmill.
Hidden costs people miss:
- The upgrade treadmill. You are always 2-3 versions behind, and the gap grows.
- Community support is not SLA-backed support. When you hit a production-breaking bug at 2 AM, a GitHub issue does not resolve your outage.
- Configuration tuning. Default configs work at small scale. At production scale, you need expertise to tune JVM heap sizes, connection pools, replication settings, and compaction strategies.
When it is the right call: Your team has genuine operational expertise with this software, the marginal cost of adding another cluster is low, and you need configuration flexibility that managed services do not offer.
When it is a trap: When the team assumes "open source = free" and does not budget for the operational cost. I have seen teams adopt self-hosted Kafka thinking it would save money, only to discover they needed a dedicated half-time engineer just for cluster management.
Buy (managed service / SaaS)
Definition: You pay a vendor (AWS, Confluent, Datadog, Stripe, Auth0) to run the software for you. You consume it through an API or managed console.
True cost model: Monthly fee x 24 months + integration time (1-4 weeks) + potential lock-in cost (migration effort if you ever leave).
Hidden costs people miss:
- Lock-in cost is real but frequently overestimated. Teams spend months building "vendor-agnostic" abstractions for a migration that has a less than 5% chance of happening.
- Feature gaps. The managed service does 90% of what you need. That last 10% either does not exist or requires a workaround.
- Price increases. Vendors raise prices. Budget for 10-20% annual increases.
When it is the right call: The problem is commodity (auth, payments, email, observability, blob storage), your team does not have specialized expertise, and you value time-to-market over cost optimization.
When it is a trap: When you buy before understanding your requirements. I once saw a team sign a $180K/year Datadog contract, then discover six months later that 60% of their monitoring needs could have been handled by Prometheus plus Grafana at a fraction of the cost. Know what you need before you buy.
Head-to-head comparison
| Dimension | Build (custom) | Open source (self-hosted) | Buy (managed) |
|---|---|---|---|
| Upfront cost | High (2-6 months eng) | Medium (1-4 weeks integration) | Low (days to weeks) |
| Ongoing cost | High (0.25-1.0 FTE) | Medium (0.25-0.5 FTE) | Predictable monthly fee |
| Time to production | Months | Weeks | Days to weeks |
| Control | Full | High (code access) | Low (API/config only) |
| Expertise required | Deep domain + ops | Ops + config tuning | Integration only |
| Lock-in risk | None | Low (standard formats) | Medium to high |
| Scaling effort | You handle it | You handle it | Vendor handles it |
| Example | Custom payment engine | Self-hosted Kafka | Confluent Cloud |
The Decision Framework
Here is a structured six-step process that works in both real architecture decisions and interview discussions. Each step has concrete guidance, not vague advice.
Step 1: Define requirements specifically
"We need a cache" is not a requirement. A requirement sounds like this: "We need a read-through cache for our product catalog API, handling 50K reads/second with p99 latency under 5ms, storing up to 500K keys averaging 2KB each, with a staleness tolerance of 30 seconds."
The specificity matters because it determines whether a managed service can meet your needs. "We need Kafka" might actually mean "we need a durable message queue with at-least-once delivery and partition-level ordering for 10K messages/second." An SQS FIFO queue, which preserves order within each message group, can handle that (check its throughput quotas against your peak rate first). You do not need Kafka.
Before evaluating options, write down:
- Throughput requirements (QPS, messages/second, bandwidth)
- Latency requirements (p50, p99)
- Durability and consistency requirements
- Data volume and retention period
- Integration points (what talks to this component?)
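One way to keep requirements this specific is to write them down as a small typed record and derive the capacity numbers from it. The sketch below encodes the product catalog cache example from this step; the class and field names are hypothetical, and it assumes 2KB means 2,000 bytes and ignores key size and per-entry overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheRequirements:
    """Hypothetical record for the read-through cache example."""
    reads_per_second: int
    p99_latency_ms: float
    max_keys: int
    avg_value_bytes: int
    max_staleness_seconds: int

    @property
    def working_set_bytes(self) -> int:
        # Values only; keys and per-entry overhead are not counted.
        return self.max_keys * self.avg_value_bytes

    @property
    def read_bandwidth_bytes_per_second(self) -> int:
        # Payload bytes served per second at the stated read rate.
        return self.reads_per_second * self.avg_value_bytes


catalog_cache = CacheRequirements(
    reads_per_second=50_000,
    p99_latency_ms=5.0,
    max_keys=500_000,
    avg_value_bytes=2_000,  # ASSUMPTION: 2KB taken as 2,000 bytes
    max_staleness_seconds=30,
)

print(f"working set:    {catalog_cache.working_set_bytes / 1e9:.1f} GB")
print(f"read bandwidth: {catalog_cache.read_bandwidth_bytes_per_second / 1e6:.0f} MB/s")
```

The derived figures (about 1 GB of values and about 100 MB/s of read traffic) are the numbers to compare against a managed service's instance sizes and limits.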
Step 2: List the best managed options
For every common infrastructure need, there is a mature managed option. Evaluate at least two.
| Need | Managed options | Quick evaluation |
|---|---|---|
| Message queue | SQS, Confluent Cloud, MSK, Google Pub/Sub | SQS for simple queuing, Confluent for streaming semantics |
| Cache | ElastiCache, MemoryDB, Momento | ElastiCache for standard Redis, Momento for serverless |
| Search | Elastic Cloud, Algolia, OpenSearch Service | Algolia for simple search UX, Elastic for complex queries |
| Auth | Auth0, Cognito, Clerk | Auth0 for flexibility, Clerk for developer experience |
| Payments | Stripe, Adyen, Braintree | Stripe for simplicity, Adyen for enterprise global |
| Observability | Datadog, New Relic, Grafana Cloud | Datadog for all-in-one, Grafana Cloud for open-source stack |
| Blob storage | S3, GCS, Azure Blob | S3 is the default. Almost never build this. |
Step 3: Estimate buy cost (24-month TCO)
Monthly fee x 24 + integration engineering time + lock-in migration cost estimate.
Example: Confluent Cloud for a mid-size workload
- 100 MB/s throughput, 3 topics, 7-day retention
- Confluent Cloud cost: roughly $2,500-$4,000/month depending on cluster type
- 24-month cost: $60,000-$96,000
- Integration time: 2 engineers x 2 weeks = 4 engineer-weeks, roughly $16,000 (at $200K/year fully-loaded, about $4,000 per engineer-week)
- Lock-in migration cost: medium. Kafka wire protocol is standard, but Confluent-specific features (Schema Registry SaaS, ksqlDB) add switching cost.
- Total 24-month estimate: roughly $76,000-$112,000, before any lock-in migration cost
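The buy-side formula from this step can be written as a small function. This is a sketch with hypothetical names; it reproduces only the subscription portion of the example, and leaves integration and lock-in estimates as a single one-time input you supply.

```python
def buy_tco(monthly_fee: float, months: int = 24, one_time_costs: float = 0.0) -> float:
    """Subscription fees over the horizon plus one-time costs.

    one_time_costs is where integration engineering time and any lock-in
    migration estimate go (hypothetical parameter, supplied by the caller).
    """
    return monthly_fee * months + one_time_costs


# Low and high ends of the monthly fee range in the example above.
for fee in (2_500, 4_000):
    print(f"${fee:,}/month -> ${buy_tco(fee):,.0f} in fees over 24 months")
```

Keeping the fee range as two separate calls, rather than averaging it, makes it harder to lose track of how wide the estimate really is.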
Step 4: Estimate build/run cost (24-month TCO)
Engineering time x fully-loaded cost + infrastructure + 24 months operational overhead.
Example: Self-hosted Kafka for the same workload
- Build/setup time: 1 engineer x 4 weeks = roughly $16,000 (at $200K/year fully-loaded, about $4,000 per engineer-week)
- Infrastructure: 6 brokers (3 per AZ) on m5.xlarge = roughly $2,400/month
- Operational overhead: 0.3 FTE dedicated to Kafka ops = $5,000/month (at $200K/year)
- 24-month infrastructure: $57,600
- 24-month ops: $120,000
- Total 24-month estimate: $193,600
The self-hosted option costs nearly 2x the managed option for this workload. This is the math that most "let's just run it ourselves" conversations skip.
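The recurring part of the self-hosted estimate is where most of the money goes, so it is worth computing explicitly. The sketch below uses the infrastructure and staffing figures from the example; the function names are hypothetical, and the one-time setup cost is left as a caller-supplied input.

```python
def monthly_ops_cost(fte_fraction: float, loaded_annual_cost: float) -> float:
    """Monthly cost of a fractional engineer at a fully-loaded annual cost."""
    return fte_fraction * loaded_annual_cost / 12


def run_tco(infra_monthly: float, ops_monthly: float,
            months: int = 24, one_time_setup: float = 0.0) -> float:
    """Infrastructure plus operations over the horizon, plus one-time setup."""
    return (infra_monthly + ops_monthly) * months + one_time_setup


ops = monthly_ops_cost(fte_fraction=0.3, loaded_annual_cost=200_000)
recurring = run_tco(infra_monthly=2_400, ops_monthly=ops)

print(f"ops staffing: ${ops:,.0f}/month")
print(f"24-month recurring cost (before setup): ${recurring:,.0f}")
```

In this example the staffing line is roughly twice the infrastructure line, which is why comparing only the server bill against the vendor bill gives the wrong answer.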
Step 5: Consider lock-in risk
Lock-in is real, but it is also the most commonly over-weighted factor in make vs. buy decisions. Ask these specific questions:
- Is the data portable? Can you export your data in a standard format (CSV, Parquet, JSON, Avro)? If yes, lock-in risk is low regardless of the vendor.
- Is the API proprietary? DynamoDB's API is proprietary. PostgreSQL on RDS uses standard SQL. The portability difference is enormous.
- What is the realistic migration timeline? If migration takes 2 weeks of engineering, the lock-in cost is roughly $8,000 (at $200K/year fully-loaded). If it takes 6 months, it is $100,000. Be specific.
- What is the probability of migration? Be honest. Most vendor migrations never happen. If the probability is under 10%, weight the lock-in cost accordingly.
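Weighting the lock-in cost by its probability can be treated as a simple expected-value calculation. This is one reasonable reading of "weight the lock-in cost accordingly", not a formula the framework prescribes, and the function name is hypothetical.

```python
def expected_lockin_cost(migration_cost: float, probability: float) -> float:
    """Migration cost discounted by the chance the migration ever happens."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return migration_cost * probability


# A six-month, $100,000 migration with a 10% chance of happening.
weighted = expected_lockin_cost(migration_cost=100_000, probability=0.10)
print(f"expected lock-in cost: ${weighted:,.0f}")
```

Under these inputs the figure to add to the buy-side TCO is about $10,000, not the full $100,000, which is usually far less than the cost of building an abstraction layer up front.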
The lock-in overreaction
Teams routinely spend 3-6 months building "vendor-agnostic" abstraction layers to avoid lock-in. The abstraction itself becomes a maintenance burden, and the migration it was designed to enable almost never happens. Build for what you need today. If you migrate in 3 years, the rewrite will probably be warranted anyway.
Step 6: Make the call based on numbers
Compare the 24-month TCO of buy vs. build, factor in the qualitative dimensions (lock-in risk, team expertise, time-to-market pressure), and decide. If the numbers are within 20% of each other, let the qualitative factors tip the balance. If one option costs half as much as the other, the qualitative factors rarely overcome that gap.
The decision should look like this in a design review or interview: "For our message queue, I'd use Confluent Cloud. The 24-month cost is roughly $80K versus $180K for self-hosted Kafka, our team doesn't have deep Kafka operational expertise, and the standard wire protocol gives us an exit path if we ever need to migrate."
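The rule of thumb in this step can be sketched as a function. The thresholds come from the text (within 20%, and a 2x gap); how "within 20%" is measured (here, the pricier option relative to the cheaper one) and what to say in the middle band are assumptions, and the function name is hypothetical.

```python
def recommend(buy_tco: float, run_tco: float,
              close_band: float = 0.20, decisive_ratio: float = 2.0) -> str:
    """Turn two 24-month TCO estimates into a first-pass recommendation."""
    if buy_tco <= 0 or run_tco <= 0:
        raise ValueError("TCO estimates must be positive")
    cheaper = "buy" if buy_tco <= run_tco else "build/run"
    ratio = max(buy_tco, run_tco) / min(buy_tco, run_tco)
    if ratio <= 1 + close_band:
        # ASSUMPTION: "within 20%" means pricier <= 1.2 x cheaper.
        return "close call: let qualitative factors decide"
    if ratio >= decisive_ratio:
        return f"{cheaper}: cost gap is decisive"
    # ASSUMPTION: the text does not define the middle band; lean on cost.
    return f"lean {cheaper}: confirm qualitative factors do not outweigh the gap"


# The rounded figures from the example statement above.
print(recommend(buy_tco=80_000, run_tco=180_000))
```

For the $80K versus $180K example the ratio is 2.25, so the function reports that the cost gap is decisive in favor of buying.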
When Managed Services Win (with Real Numbers)
1. Commodity problems
Auth, email delivery, payment processing, observability, blob storage, DNS, CDN. These are solved problems with mature vendors who have invested hundreds of millions of dollars in reliability.