Make vs. buy framework for system design
A practical decision framework for choosing between building custom infrastructure, using open-source, or buying a managed service, with concrete criteria and common traps.
TL;DR
- Make vs. buy is the single most common judgment call staff+ engineers face, and it reveals more about your technical maturity than any whiteboard algorithm question.
- Buy (managed service) wins when the problem is commodity, your team lacks operational expertise, or you are pre-PMF and speed matters more than cost optimization.
- Build (custom) wins when the component is core differentiation, managed options hit hard limits, or your scale economics make a dedicated team cheaper than the vendor bill.
- Open source (self-hosted) is not free. It is closer to "build" than "buy" in operational cost. You own the upgrades, the 3 AM pages, and the security patches.
- The most common trap: building an abstraction layer "in case we switch vendors" that costs 6 months of engineering time for a migration that never happens.
- In interviews, the one sentence that works: "I'd use managed X here because Y, unless we have a specific requirement that Z." Then move on.
Why This Decision Defines Staff-Level Judgment
A team I worked with once decided to build their own email delivery pipeline. The reasoning sounded solid: "We send 20 million emails a month, SES costs $2,000/month, and we can build it ourselves." Twelve months later, they had two engineers spending roughly half their time on deliverability issues, IP warm-up, bounce handling, suppression lists, and DKIM rotation. The fully-loaded cost of those two engineers (at $200K each, half-time) was $200,000 per year. They had replaced a $24,000/year SES bill with a $200,000/year internal project that delivered worse inbox placement rates.
This is the pattern. The build cost is almost never "just the initial engineering." It is the initial build plus 24 months of operational overhead, on-call burden, upgrade cycles, and opportunity cost of engineers who could be working on your actual product.
Make vs. buy decisions show up in every staff+ design review and every system design interview at that level. They reveal whether you can zoom out from the technical details and reason about total cost of ownership, organizational capacity, and strategic priorities. Senior engineers pick the right technology. Staff engineers decide whether to pick a technology at all, or just pay someone else to handle it.
Interview signal
When you proactively address make vs. buy in a system design interview (without being asked), interviewers notice. It signals that you think about engineering as a business function, not just a coding exercise.
The Three Options
Every infrastructure decision has three paths, and each has a cost model that looks very different from what teams initially estimate.
Build (custom software)
Definition: You write the software from scratch (or near-scratch), own the codebase, deploy it, and maintain it indefinitely.
True cost model: Engineering time to build (typically 2-6 months) + engineering time to operate (0.25-1.0 FTEs ongoing) + on-call cost + opportunity cost of what those engineers could have built instead.
Hidden costs people miss:
- On-call rotation. Someone pages at 3 AM when it breaks.
- Security patches. You own every CVE in every dependency.
- Scaling work. The thing that handles 1K QPS today needs rework at 50K QPS.
- Knowledge concentration risk. The system often has a bus factor of one; when the engineer who built it leaves, nobody can maintain it.
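To make the cost model above concrete, here is a back-of-envelope sketch (a hypothetical calculation, assuming the $200K/year fully-loaded rate used throughout this piece):

```python
# Rough 24-month cost of building custom, at an assumed
# $200K/year fully-loaded cost per engineer.
FULLY_LOADED_YEARLY = 200_000
MONTHLY_RATE = FULLY_LOADED_YEARLY / 12  # ~$16,667/month

def build_tco_24mo(build_months: float, builders: int, ongoing_fte: float) -> float:
    """Initial build effort plus 24 months of operational overhead."""
    initial = build_months * builders * MONTHLY_RATE
    ongoing = 24 * ongoing_fte * MONTHLY_RATE
    return initial + ongoing

# 3 months x 2 engineers to build, then 0.5 FTE to keep it running:
print(round(build_tco_24mo(3, 2, 0.5)))  # -> 300000
```

Note how even a modest ongoing FTE fraction ($200K of the $300K here) dominates the initial build over a 24-month window.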
When it is the right call: The component is core to your product differentiation, and no vendor or open-source project meets your specific requirements. Netflix building Open Connect (their custom CDN) makes sense because CDN performance directly affects their user experience, and they push 15%+ of global internet traffic.
When it is a trap: When engineers want to build it because it is technically interesting, not because it is strategically necessary. Building a custom CI/CD pipeline when GitHub Actions exists is almost never justified.
Open source (self-hosted)
Definition: You deploy an open-source project (Kafka, PostgreSQL, Prometheus, Redis, Elasticsearch) on your own infrastructure and maintain it.
True cost model: Integration time (1-4 weeks) + infrastructure cost + operational overhead (0.25-0.5 FTEs per major system) + upgrade treadmill.
Hidden costs people miss:
- The upgrade treadmill. You are always 2-3 versions behind, and the gap grows.
- Community support is not SLA-backed support. When you hit a production-breaking bug at 2 AM, a GitHub issue does not resolve your outage.
- Configuration tuning. Default configs work at small scale. At production scale, you need expertise to tune JVM heap sizes, connection pools, replication settings, and compaction strategies.
When it is the right call: Your team has genuine operational expertise with this software, the marginal cost of adding another cluster is low, and you need configuration flexibility that managed services do not offer.
When it is a trap: When the team assumes "open source = free" and does not budget for the operational cost. I have seen teams adopt self-hosted Kafka thinking it would save money, only to discover they needed a dedicated half-time engineer just for cluster management.
Buy (managed service / SaaS)
Definition: You pay a vendor (AWS, Confluent, Datadog, Stripe, Auth0) to run the software for you. You consume it through an API or managed console.
True cost model: Monthly fee x 24 months + integration time (1-4 weeks) + potential lock-in cost (migration effort if you ever leave).
Hidden costs people miss:
- Lock-in cost is real but frequently overestimated. Teams spend months building "vendor-agnostic" abstractions for a migration that has a less than 5% chance of happening.
- Feature gaps. The managed service does 90% of what you need. That last 10% either does not exist or requires a workaround.
- Price increases. Vendors raise prices. Budget for 10-20% annual increases.
When it is the right call: The problem is commodity (auth, payments, email, observability, blob storage), your team does not have specialized expertise, and you value time-to-market over cost optimization.
When it is a trap: When you buy before understanding your requirements. I once saw a team sign a $180K/year Datadog contract, then discover six months later that 60% of their monitoring needs could have been handled by Prometheus plus Grafana at a fraction of the cost. Know what you need before you buy.
Head-to-head comparison
| Dimension | Build (custom) | Open source (self-hosted) | Buy (managed) |
|---|---|---|---|
| Upfront cost | High (2-6 months eng) | Medium (1-4 weeks integration) | Low (days to weeks) |
| Ongoing cost | High (0.5-1.0 FTE) | Medium (0.25-0.5 FTE) | Predictable monthly fee |
| Time to production | Months | Weeks | Days to weeks |
| Control | Full | High (code access) | Low (API/config only) |
| Expertise required | Deep domain + ops | Ops + config tuning | Integration only |
| Lock-in risk | None | Low (standard formats) | Medium to high |
| Scaling effort | You handle it | You handle it | Vendor handles it |
| Example | Custom payment engine | Self-hosted Kafka | Confluent Cloud |
The Decision Framework
Here is a structured six-step process that works in both real architecture decisions and interview discussions. Each step has concrete guidance, not vague advice.
Step 1: Define requirements specifically
"We need a cache" is not a requirement. A requirement sounds like this: "We need a read-through cache for our product catalog API, handling 50K reads/second with p99 latency under 5ms, storing up to 500K keys averaging 2KB each, with a staleness tolerance of 30 seconds."
The specificity matters because it determines whether a managed service can meet your needs. "We need Kafka" might actually mean "we need a durable message queue with at-least-once delivery and partition-level ordering for 10K messages/second." SQS handles that. You do not need Kafka.
Before evaluating options, write down:
- Throughput requirements (QPS, messages/second, bandwidth)
- Latency requirements (p50, p99)
- Durability and consistency requirements
- Data volume and retention period
- Integration points (what talks to this component?)
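One way to enforce that specificity is to capture each requirement as a structured record instead of a sentence. The sketch below mirrors the checklist (field names are my own, not a standard schema):

```python
from dataclasses import dataclass, field

@dataclass
class Requirement:
    name: str
    peak_throughput: str           # e.g. "50K reads/sec"
    p99_latency_ms: float
    durability: str                # consistency / delivery guarantees
    data_volume: str               # volume and retention
    integration_points: list[str] = field(default_factory=list)

# The catalog-cache requirement from the text, as a record:
catalog_cache = Requirement(
    name="product catalog read-through cache",
    peak_throughput="50K reads/sec",
    p99_latency_ms=5.0,
    durability="30s staleness tolerated",
    data_volume="500K keys x 2KB, no long-term retention",
    integration_points=["product catalog API"],
)
print(catalog_cache.p99_latency_ms)  # -> 5.0
```

A record like this forces the conversation from "we need a cache" to numbers a vendor datasheet can be checked against.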
Step 2: List the best managed options
For every common infrastructure need, there is a mature managed option. Evaluate at least two.
| Need | Managed options | Quick evaluation |
|---|---|---|
| Message queue | SQS, Confluent Cloud, MSK, Google Pub/Sub | SQS for simple queuing, Confluent for streaming semantics |
| Cache | ElastiCache, MemoryDB, Momento | ElastiCache for standard Redis, Momento for serverless |
| Search | Elastic Cloud, Algolia, OpenSearch Service | Algolia for simple search UX, Elastic for complex queries |
| Auth | Auth0, Cognito, Clerk | Auth0 for flexibility, Clerk for developer experience |
| Payments | Stripe, Adyen, Braintree | Stripe for simplicity, Adyen for enterprise global |
| Observability | Datadog, New Relic, Grafana Cloud | Datadog for all-in-one, Grafana Cloud for open-source stack |
| Blob storage | S3, GCS, Azure Blob | S3 is the default. Almost never build this. |
Step 3: Estimate buy cost (24-month TCO)
Monthly fee x 24 + integration engineering time + lock-in migration cost estimate.
Example: Confluent Cloud for a mid-size workload
- 100 MB/s throughput, 3 topics, 7-day retention
- Confluent Cloud cost: roughly $2,500-$4,000/month depending on cluster type
- 24-month cost: $60,000-$96,000
- Integration time: 2 engineers x 2 weeks ≈ $15,000 (at $200K/year fully-loaded)
- Lock-in migration cost: medium. Kafka wire protocol is standard, but Confluent-specific features (Schema Registry SaaS, ksqlDB) add switching cost.
- Total 24-month estimate: $75,000-$111,000
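The formula above as a small helper (inputs below are illustrative, not the worked example's exact figures; the lock-in term is probability-weighted, anticipating Step 5):

```python
def buy_tco_24mo(monthly_fee: float, integration_cost: float,
                 migration_cost: float = 0.0, migration_prob: float = 0.0) -> float:
    """24-month managed-service TCO: fees + integration + expected lock-in cost."""
    return 24 * monthly_fee + integration_cost + migration_prob * migration_cost

# Illustrative: $3K/month fee, $10K integration, $50K migration at 10% likelihood:
print(buy_tco_24mo(3_000, 10_000, 50_000, 0.10))  # -> 87000.0
```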
Step 4: Estimate build/run cost (24-month TCO)
Engineering time x fully-loaded cost + infrastructure + 24 months operational overhead.
Example: Self-hosted Kafka for the same workload
- Build/setup time: 1 engineer x 4 weeks ≈ $15,000 (at $200K/year fully-loaded)
- Infrastructure: 6 brokers (2 per AZ across 3 AZs) on m5.xlarge = roughly $2,400/month
- Operational overhead: 0.3 FTE dedicated to Kafka ops = $5,000/month (at $200K/year)
- 24-month infrastructure: $57,600
- 24-month ops: $120,000
- Total 24-month estimate: $192,600
The self-hosted option costs roughly 2x the managed option for this workload. This is the math that most "let's just run it ourselves" conversations skip.
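And the self-hosted side of the same comparison, again as a sketch with illustrative inputs (the FTE rate assumes $200K/year fully-loaded):

```python
def selfhost_tco_24mo(setup_cost: float, monthly_infra: float, ops_fte: float,
                      fully_loaded_yearly: float = 200_000) -> float:
    """24-month self-hosted TCO: setup + infrastructure + ongoing ops time."""
    monthly_rate = fully_loaded_yearly / 12
    return setup_cost + 24 * (monthly_infra + ops_fte * monthly_rate)

# Illustrative: $12K setup, $2.5K/month infra, a quarter of an engineer on ops:
print(round(selfhost_tco_24mo(12_000, 2_500, 0.25)))  # -> 172000
```

Notice that the ops line, not the hardware, is usually the biggest term; that is the part teams forget to price.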
Step 5: Consider lock-in risk
Lock-in is real, but it is also the most commonly over-weighted factor in make vs. buy decisions. Ask these specific questions:
- Is the data portable? Can you export your data in a standard format (CSV, Parquet, JSON, Avro)? If yes, lock-in risk is low regardless of the vendor.
- Is the API proprietary? DynamoDB's API is proprietary. PostgreSQL on RDS uses standard SQL. The portability difference is enormous.
- What is the realistic migration timeline? If migration takes 2 weeks of engineering, the lock-in cost is roughly $8,000 (at $200K/year fully-loaded). If it takes 6 months, it is $100,000. Be specific.
- What is the probability of migration? Be honest. Most vendor migrations never happen. If the probability is under 10%, weight the lock-in cost accordingly.
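The probability weighting in the last two questions is just an expected-value calculation (numbers below are illustrative):

```python
def expected_lockin_cost(migration_cost: float, migration_prob: float) -> float:
    """Weight the one-time migration cost by how likely the migration is."""
    return migration_cost * migration_prob

# A scary-sounding $100K migration with a 5% chance of ever happening:
print(expected_lockin_cost(100_000, 0.05))  # -> 5000.0
```

Spending six engineer-months on an abstraction layer to hedge a $5K expected cost is exactly the overreaction described below.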
The lock-in overreaction
Teams routinely spend 3-6 months building "vendor-agnostic" abstraction layers to avoid lock-in. The abstraction itself becomes a maintenance burden, and the migration it was designed to enable almost never happens. Build for what you need today. If you migrate in 3 years, the rewrite will probably be warranted anyway.
Step 6: Make the call based on numbers
Compare the 24-month TCO of buy vs. build, factor in the qualitative dimensions (lock-in risk, team expertise, time-to-market pressure), and decide. If the numbers are within 20% of each other, let the qualitative factors tip the balance. If one option is 2x cheaper, the qualitative factors rarely overcome that gap.
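The 20% rule in this step can be written down directly. This is a sketch of the decision rule, not a substitute for judgment:

```python
def make_the_call(buy_tco: float, build_tco: float, threshold: float = 0.20) -> str:
    """Pick on numbers when the gap is large; defer to qualitative
    factors (expertise, lock-in, time-to-market) when it is not."""
    cheaper, pricier = sorted([buy_tco, build_tco])
    if pricier <= cheaper * (1 + threshold):
        return "close call: decide on qualitative factors"
    return "buy" if buy_tco < build_tco else "build"

print(make_the_call(80_000, 180_000))  # -> buy
```

A roughly 2x gap, as in the Kafka worked example, is far past the threshold, so the qualitative factors do not get a vote.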
The decision should look like this in a design review or interview: "For our message queue, I'd use Confluent Cloud. The 24-month cost is roughly $90K versus $190K for self-hosted Kafka, our team doesn't have deep Kafka operational expertise, and the standard wire protocol gives us an exit path if we ever need to migrate."
When Managed Services Win (with Real Numbers)
1. Commodity problems
Auth, email delivery, payment processing, observability, blob storage, DNS, CDN. These are solved problems with mature vendors who have invested hundreds of millions of dollars in reliability.
Example: Stripe vs. custom payment processing
- Stripe fee: 2.9% + $0.30 per transaction
- At $10M annual GMV: roughly $319,000/year in Stripe fees
- Custom payment processor: 2-4 engineers x 12 months to build, plus PCI DSS compliance audit ($50K-$200K/year), plus interchange fees (which you still pay), plus ongoing maintenance
- Stripe makes sense until your GMV exceeds roughly $50M-$100M/year and your team has deep payments domain expertise
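The Stripe figure depends on average order value, since the $0.30 is charged per transaction. A quick sketch (the ~$100 average order is an assumption, not a figure from the text):

```python
def stripe_fees(annual_gmv: float, avg_order_value: float,
                pct_fee: float = 0.029, per_txn_fee: float = 0.30) -> float:
    """Annual card fees at standard 2.9% + $0.30 pricing."""
    transactions = annual_gmv / avg_order_value
    return annual_gmv * pct_fee + transactions * per_txn_fee

# $10M GMV at a ~$100 average order:
print(round(stripe_fees(10_000_000, 100)))  # -> 320000
```

A smaller average basket pushes the per-transaction component up quickly, which is why low-ticket marketplaces feel these fees more acutely.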
Example: Auth0 vs. custom auth
- Auth0 cost: $23-$240/month for up to 10K users (B2B plans go higher)
- Custom auth: 1-2 engineers x 3-6 months to build, plus ongoing security patches, plus the risk of getting authentication wrong (which is a catastrophic security risk)
- Auth0 makes sense for almost every company under 1,000 employees
2. No specialized expertise on the team
Running Elasticsearch well requires deep expertise in JVM tuning, shard sizing, mapping design, and cluster operations. If nobody on your team has this, use Elastic Cloud or OpenSearch Service. A poorly-tuned self-hosted Elasticsearch cluster will give you worse performance than the managed version at higher cost.
This applies equally to Kafka, Cassandra, MongoDB, and any operationally complex system. The managed version comes with expertise baked in.
3. Large operational surface
Some systems have disproportionate operational overhead relative to their strategic importance.
Ops cost estimation method:
- Count the number of distinct operational tasks: upgrades, backups, monitoring, capacity planning, security patches, incident response, configuration tuning
- Estimate hours per month for each
- Multiply by your fully-loaded engineering hourly rate ($100-$150/hour for most companies)
Example: Self-hosted Kafka operational tasks
- Version upgrades: 8 hours/quarter = 2.7 hours/month
- Monitoring and alerting: 4 hours/month
- Capacity planning: 2 hours/month
- Security patches: 2 hours/month
- Incident response (averaged): 4 hours/month
- Configuration tuning: 2 hours/month
- Total: roughly 17 hours/month x $125/hour = $2,125/month in ops cost alone
Add that to infrastructure cost and compare against your managed service bill.
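The estimation method above as a small calculator, using the worked example's task hours. The result lands slightly below the text's $2,125 because the text rounds 16.7 hours up to 17 before multiplying:

```python
MONTHLY_OPS_HOURS = {
    "version upgrades": 8 / 3,        # 8 hours/quarter
    "monitoring and alerting": 4,
    "capacity planning": 2,
    "security patches": 2,
    "incident response (averaged)": 4,
    "configuration tuning": 2,
}

def monthly_ops_cost(task_hours: dict[str, float], hourly_rate: float = 125) -> float:
    """Sum of monthly hours across ops tasks times the fully-loaded hourly rate."""
    return sum(task_hours.values()) * hourly_rate

print(round(monthly_ops_cost(MONTHLY_OPS_HOURS)))  # -> 2083
```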
4. Time-to-market priority (pre-PMF vs. post-scale)
If you are pre-product-market-fit, every month you spend building infrastructure instead of product features is a month closer to running out of runway. Buy everything. Optimize later.
I've seen startups spend their first 6 months setting up self-hosted Kubernetes, Kafka, and Prometheus instead of building their product. Three of them ran out of money before launching. The managed-service bills they were trying to avoid would have been a rounding error compared to the funding they burned.
The rule: pre-PMF, buy everything. Post-PMF, optimize the components where the bill exceeds $10K/month.
5. Compliance and security requirements
SOC 2, HIPAA, PCI DSS, and GDPR compliance are dramatically easier with managed services because the vendor handles a large portion of the compliance surface area.
Example: HIPAA-compliant database
- Self-hosted PostgreSQL with HIPAA compliance: encryption at rest, encryption in transit, audit logging, access controls, BAA with your IaaS provider, annual audit of your infrastructure
- RDS for PostgreSQL with HIPAA: AWS provides BAA, encryption is a checkbox, audit logging is built-in, and AWS handles infrastructure-level compliance
- The managed option removes months of compliance engineering work
When Building Wins (with Real Numbers)
1. Core differentiation
If the component directly affects your product quality or competitive advantage, owning it makes sense.
Netflix Open Connect: Netflix built their own CDN because video delivery latency and quality are core to their product. They process over 15% of global internet traffic through 17,000+ servers deployed inside ISP networks. No third-party CDN could offer the same level of control over cache placement and bitrate optimization at that scale.
Shopify Memcached fork: Shopify forked and customized Memcached because their caching patterns were highly specific to e-commerce flash sales. The customizations (SSD-backed overflow storage, custom eviction behavior) could not be replicated on a managed service.
The test: would a competitor using the managed version of this component have a meaningfully worse product? If yes, consider building. If no, buy.
2. Managed options hit hard limits
Every managed service has limits. When you hit them, you are stuck.
| Managed service | Common hard limit | When you hit it |
|---|---|---|
| Elastic Cloud | Shard-per-node limits, cluster size caps | 50TB+ datasets with complex aggregations |
| MSK (managed Kafka) | Partition count limits, no custom plugins | 10K+ partitions, custom interceptors |
| RDS | Max instance size (db.r6g.16xlarge = 512GB RAM) | Datasets exceeding 10TB with complex queries |
| DynamoDB | 400KB item size, 10GB partition limit | Large document storage, hot partition issues |
| Lambda | 15-minute timeout, 10GB memory | Long-running batch jobs, ML inference |
When your requirements exceed these limits and there is no workaround, self-hosting or building custom is the only option.
3. Genuine lock-in risk (with proprietary data formats)
Lock-in matters most when:
- Your data is stored in a proprietary format that only one vendor can read
- Your application logic depends heavily on vendor-specific APIs with no standard equivalent
- Migration would require rewriting significant application code, not just changing a connection string
DynamoDB is a good example. The data model (partition key, sort key, GSI), the query API, and the pricing model are all proprietary. Moving from DynamoDB to another database requires rewriting your data access layer, not just pointing at a new endpoint.
RDS PostgreSQL is a counter-example. The data is standard PostgreSQL. You can pg_dump, move to any PostgreSQL host, and change the connection string. Lock-in risk is minimal.
4. Team has expertise and marginal cost is low
If your platform team already operates 10 PostgreSQL clusters, the marginal cost of cluster #11 is near zero. The operational playbooks, monitoring dashboards, backup scripts, and upgrade procedures already exist. In this case, self-hosted often wins on cost.
This logic breaks down when the new system is a different technology. "We run Postgres, so we can run Kafka" is a dangerous assumption. They require completely different operational expertise.
5. Scale economics: when self-hosting becomes cheaper
There is a crossover point where your managed service bill exceeds the cost of a dedicated team to self-host.
The crossover analysis:
- Managed Elasticsearch (Elastic Cloud): $15,000/month for a production cluster with 5TB of data
- Self-hosted Elasticsearch: $4,000/month infrastructure + 0.5 FTE ops engineer ($8,300/month) = $12,300/month
- Savings: $2,700/month, or $32,400/year
At $15K/month, the savings from self-hosting barely justify the risk and complexity. But at $40K/month, self-hosting saves $120K+/year, which easily funds a dedicated engineer with budget left over.
Rule of thumb: When your managed service bill for a single component exceeds $25K-$30K/month, run the self-hosting calculation. Below that, the operational risk almost always outweighs the savings.
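The crossover check takes three inputs; here it is as a sketch (the ops rate again assumes $200K/year fully-loaded):

```python
def monthly_savings_if_selfhosted(managed_bill: float, monthly_infra: float,
                                  ops_fte: float,
                                  fully_loaded_yearly: float = 200_000) -> float:
    """Managed bill minus what self-hosting would cost per month."""
    selfhost = monthly_infra + ops_fte * fully_loaded_yearly / 12
    return managed_bill - selfhost

# The Elasticsearch example: $15K/month managed vs $4K infra + 0.5 FTE:
print(round(monthly_savings_if_selfhosted(15_000, 4_000, 0.5)))  # -> 2667
```

That is close to the $2,700/month in the worked example; rerun the same function with a $40K bill and the answer flips decisively toward self-hosting.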
The Open Source Middle Ground
Open source deserves its own section because teams consistently miscategorize it. Open source is not "free." It is not "buy." It sits in between, and the operational reality is closer to "build" than most teams expect.
Operational reality: you own the 3 AM pages
When your self-hosted Redis cluster loses a node at 3 AM, you do not call Redis Labs support (unless you bought their enterprise product, which makes it "buy," not "open source"). You page your on-call engineer, who needs to understand Redis replication, failover, and recovery procedures well enough to fix it under pressure.
This is the fundamental difference between open source and managed. Managed services externalize operational risk. Open source internalizes it.
The upgrade treadmill
Every open-source project you self-host needs regular upgrades. Security patches, bug fixes, new features, and eventually end-of-life for your current version. Each upgrade requires testing, deployment planning, and potential downtime.
At scale, you might be managing upgrades across 10-20 different open-source systems. That is a significant engineering investment just to stay current, before you build a single product feature.
Community vs. enterprise support
Free community support (GitHub issues, Stack Overflow, Discord/Slack channels) is great for learning and non-urgent questions. It is not a substitute for production support with SLAs. If your self-hosted PostgreSQL has a performance regression that is costing $10K/hour in lost revenue, waiting 48 hours for a community response is not acceptable.
Enterprise support options exist for most major open-source projects (Red Hat, Confluent, Elastic, Percona, Timescale), but at that point you are effectively paying for a hybrid between open source and managed.
The "start managed, graduate to self-hosted" path
This is often the smartest approach. Start with the managed service to get to production quickly and validate your requirements. Once you understand the operational requirements, traffic patterns, and cost structure, evaluate whether self-hosting makes sense.
The path looks like this:
- Month 0-12: Use ElastiCache (managed Redis). Learn your access patterns, cache sizes, and hit rates.
- Month 12-18: Analyze costs. If your bill exceeds $10K/month and your team has Redis expertise, evaluate self-hosting.
- Month 18+: Migrate to self-hosted Redis with confidence because you understand your actual requirements, not your guesses.
Specific examples
| Component | Managed option | Open source self-hosted | When to graduate |
|---|---|---|---|
| Redis | ElastiCache ($200-$5,000+/month) | Redis OSS on EC2 | Bill exceeds $5K/month, team has Redis ops experience |
| PostgreSQL | RDS ($200-$10,000+/month) | PostgreSQL on EC2 | Rarely. RDS operational value is enormous. |
| Prometheus | Grafana Cloud ($0-$500+/month) | Self-hosted Prometheus | Need custom retention, federation, or cost exceeds $2K/month |
| Kafka | Confluent Cloud / MSK | Self-hosted Kafka | Bill exceeds $15K/month, dedicated platform team exists |
| Elasticsearch | Elastic Cloud ($500-$20,000+/month) | Self-hosted ES on EC2 | Bill exceeds $25K/month, team has JVM/ES tuning expertise |
The PostgreSQL exception
PostgreSQL on RDS is one of the rare cases where the managed option almost always wins regardless of scale. RDS handles backups, failover, patching, and monitoring at a cost that is hard to beat with self-hosting. The operational overhead of self-hosted PostgreSQL (especially HA failover) is enormous relative to the savings.
Common Traps
Six named traps that I see repeatedly in design reviews and interviews. Each follows the same pattern: what teams say, what actually happens, and what to do instead.
Trap 1: "The Abstraction Layer"
What they say: "Let's build a vendor-agnostic abstraction over Stripe so we can switch to Braintree if we need to."
What actually happens: The abstraction takes 2-3 months to build. It covers 70% of Stripe's API surface because that is all the team uses today. Six months later, a new feature needs Stripe-specific functionality (Connect, Billing, Radar), and the team either extends the abstraction (more months of work) or bypasses it (defeating its purpose). The migration to Braintree never happens. The abstraction becomes dead weight that every new engineer has to learn.
What to do instead: Use Stripe directly. If you migrate in 3 years (unlikely), the migration cost will be justified by then, and your requirements will have changed enough that the old abstraction would not have helped anyway. The time you save by not building the abstraction is time you can spend on your actual product.
Trap 2: "It's Just Redis"
What they say: "We'll just run Redis ourselves. It's simple. One binary, one config file."
What actually happens: Redis is simple to start. It is not simple to operate at scale. The team discovers they need: Sentinel or Cluster mode for HA, monitoring for memory fragmentation and eviction rates, backup strategies, security configuration (AUTH, TLS), connection pool tuning, and a runbook for split-brain scenarios. After 6 months, the "simple" Redis deployment consumes 8-10 hours of engineering time per month.
What to do instead: Use ElastiCache or MemoryDB. The managed service handles failover, patching, backups, and monitoring. For a typical workload, the cost difference between self-hosted and managed Redis is $200-$500/month. That does not justify the operational risk.
Trap 3: "The Build Trophy"
What they say: "We should build our own service mesh. We have unique networking requirements."
What actually happens: The team spends 6 months building a custom service mesh that handles 30% of what Istio or Linkerd provide out of the box. The engineers who built it are proud of it (it is genuinely good engineering), but it becomes a maintenance burden that only they understand. When they leave, nobody can maintain it. The company eventually migrates to Istio anyway, writing off the custom solution.
What to do instead: Ask the differentiation test: "Would a competitor using the off-the-shelf solution have a meaningfully worse product?" If the answer is no, use the off-the-shelf solution. Save the custom engineering for problems where the answer is yes.
Trap 4: "The Premature Purchase"
What they say: "Let's sign a Datadog contract. Observability is critical and we need the best."
What actually happens: The team signs a $180K/year enterprise Datadog contract before understanding their observability requirements. They use 40% of the features. They discover that their most important monitoring use case (custom business metrics with high-cardinality labels) is the most expensive part of the Datadog pricing model. The contract has a 12-month commitment. They are stuck paying for features they do not use while the features they need cost extra.
What to do instead: Start with a smaller plan or a trial. Understand your requirements first. What metrics do you need? What cardinality? What retention? How many hosts? Then evaluate whether the managed service cost matches the value. For many companies, Grafana Cloud or a self-hosted Prometheus stack handles 80% of their needs at 20% of the cost.
Trap 5: "The Sunk Cost Lock-in"
What they say: "We've already integrated with this vendor. Switching would take months. Let's just stay."
What actually happens: The team stays with a vendor whose product no longer fits their needs because they overestimate the migration cost. They build workarounds on top of workarounds. Over 2 years, the workaround cost exceeds what the migration would have cost. Meanwhile, the vendor keeps raising prices because they know their customers feel locked in.
What to do instead: Periodically (every 12-18 months) re-evaluate major vendor relationships. Estimate the actual migration cost in engineering weeks, not in emotional terms. If the migration cost is 4 weeks of engineering (roughly $15K at $200K/year fully-loaded) and the annual savings are $50K, the migration pays for itself in under four months. Do it.
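The payback arithmetic from that paragraph, made explicit (engineer-week rate derived from the $200K/year assumption used throughout):

```python
WEEK_RATE = 200_000 / 52  # ~$3,846 per engineer-week, fully loaded

def migration_payback_months(migration_weeks: float, annual_savings: float) -> float:
    """Months until the one-time migration cost is recovered by vendor savings."""
    one_time_cost = migration_weeks * WEEK_RATE
    return one_time_cost / (annual_savings / 12)

# 4 weeks of migration work against $50K/year in savings:
print(round(migration_payback_months(4, 50_000), 1))  # -> 3.7
```

Any payback under a year is usually worth doing; the workaround tax compounds while you wait.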
Trap 6: "The Open Source Fairy Tale"
What they say: "We'll use open-source Prometheus. It's free. Zero cost."
What actually happens: The team deploys Prometheus. Within 6 months they need: long-term storage (Prometheus default retention is 15 days), high availability (Prometheus is single-node by default), cross-cluster federation, alerting (Alertmanager setup and configuration), and Grafana dashboards. The "free" Prometheus stack now requires Thanos or Cortex for long-term storage, a complex HA setup, and ongoing operational investment. Total cost: $5,000/month in infrastructure plus 0.3 FTE in engineering time.
What to do instead: Compare the total cost of your "free" open-source stack against a managed alternative like Grafana Cloud. If Grafana Cloud costs $2,000/month and your self-hosted stack costs $5,000/month in infrastructure plus $5,000/month in engineering time, the "free" option is 5x more expensive than the paid one.
Make vs. Buy in System Design Interviews
Every staff+ interview includes at least one moment where you need to make a technology choice. The make vs. buy question is always implicit, and sometimes explicit.
When the interviewer expects you to acknowledge it
At staff level and above, interviewers expect you to briefly acknowledge the make vs. buy tradeoff for at least one major component. You do not need to do a full analysis for every component. Pick the one where the choice is most consequential and address it in 2-3 sentences.
The one sentence that works
"I'd use [managed service] here because [specific reason], unless we have a specific requirement that [what would change the decision]."
Examples:
- "I'd use SQS here because we need simple FIFO queuing with at-least-once delivery, and the throughput is well within SQS limits. If we needed streaming semantics or exactly-once processing, I'd switch to Kafka."
- "I'd use ElastiCache for Redis here because our caching needs are standard (key-value lookups, TTL-based expiration) and the ops overhead of self-hosting is not justified at our scale."
- "I'd use Algolia for search here because our search requirements are straightforward (full-text search with filtering) and time-to-market matters. If we needed custom ranking models or complex aggregations, Elasticsearch would be the better choice."
Component defaults table
For interviews, have a mental default for each common component. Know when to deviate.
| Component | Default choice (buy) | When to build/self-host |
|---|---|---|
| Message queue | SQS (simple) or MSK (streaming) | Custom ordering guarantees, extreme throughput (100K+ msg/sec) |
| Cache | ElastiCache (Redis) | Need custom eviction, data structures beyond Redis, or bill exceeds $30K/month |
| Search | Algolia or Elastic Cloud | Custom ML ranking, 50TB+ indexes, sub-10ms latency requirements |
| Auth | Auth0 or Cognito | Custom MFA flows, on-prem requirements, regulatory needs |
| CDN | CloudFront or Cloudflare | Media company with custom cache logic (Netflix-scale) |
| Object storage | S3 | Never. Just use S3. |
| Database | RDS (PostgreSQL) or DynamoDB | Need for custom storage engine, extreme scale beyond managed limits |
| Email | SES or SendGrid | Email deliverability is core product (email marketing SaaS) |
| Payments | Stripe | GMV above $50M/year with dedicated payments team |
| Monitoring | Datadog or Grafana Cloud | Need for custom instrumentation, 100K+ hosts, extreme cardinality |
Interview shortcut
Memorize the "default choice" column. In 90% of interview scenarios, the default is the right answer. Your job is to know the 10% where it is not, and to articulate why.
Scenario Walkthrough
Scenario: You are designing a real-time analytics dashboard for an e-commerce platform. The interviewer asks: "How would you handle the data pipeline from event ingestion to dashboard rendering?"
Here is how the make vs. buy analysis plays out in real-time, as you would think through it in an interview.
Step 1: Identify the components that need a make/buy decision.
The data pipeline has three major components: event ingestion, stream processing, and storage/query for the dashboard.
Step 2: Start with defaults and justify.
"For event ingestion, I'd use managed Kafka through MSK or Confluent Cloud. We need durable, ordered event streaming with partition-level guarantees, and our estimated volume of 50K events/second is well within managed Kafka limits. Self-hosting Kafka for this workload would cost roughly 2x the managed price when you factor in operational overhead."
"For stream processing, I'd use Amazon Kinesis Data Analytics (managed Flink) or Confluent's ksqlDB. We need windowed aggregations (count of orders per minute, revenue per product category in the last hour), and managed Flink handles this without us needing Flink operational expertise."
"For the analytics store, this is where it gets interesting. We need sub-second queries on large aggregated datasets. The options are:"
Step 3: Go deeper on the component where the choice matters most.
"For the analytics query layer, I see three options:"
- Buy: Amazon Timestream or ClickHouse Cloud ($500-$3,000/month depending on data volume). Fast to set up, handles our query patterns, but limited customization.
- Self-host: ClickHouse on EC2. More control over configuration, can tune MergeTree engines for our specific access patterns. But requires operational expertise our team may not have.
- Buy: Druid on Imply Cloud. Better for high-cardinality real-time OLAP but more expensive.
"I'd start with ClickHouse Cloud (managed). Our query patterns are standard time-series aggregations, and the managed service eliminates the operational overhead of running ClickHouse clusters. If our data volume grows past 10TB or our query patterns become more specialized, we can evaluate self-hosting at that point."
Step 4: Acknowledge the tradeoff explicitly.
"The tradeoff here is cost vs. flexibility. ClickHouse Cloud at this scale is roughly $1,500/month. Self-hosted would be around $800/month in infrastructure but would require 0.2 FTE in operational overhead, which at $200K/year fully-loaded is roughly $3,300/month. The managed option is cheaper and lets us focus engineering time on the dashboard features that actually differentiate our product."
This walkthrough takes about 2 minutes in an interview. It demonstrates that you think about total cost of ownership, you have practical experience with these tools, and you can make and justify a decision without overthinking it.
Common Mistakes
Mistake 1: Treating make vs. buy as a one-time decision
Teams make a build-or-buy choice at project start and never revisit it. Requirements change, team composition changes, scale changes. A decision to self-host Kafka that made sense when you had a platform team of 5 may not make sense after 3 of them leave. Revisit major infrastructure decisions every 12-18 months.
Mistake 2: Comparing build cost to monthly fee (ignoring operational cost)
The most common mathematical error in make vs. buy analysis: comparing the monthly managed-service fee against the one-time build cost, without factoring in 24 months of operational overhead. "Kafka costs $3,000/month and we can build it in 2 weeks" ignores the ongoing 0.3 FTE of operational work that self-hosting requires.
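A minimal sketch of the correct comparison, using the article's figures (the $3,000/month Kafka fee, 0.3 FTE of ongoing ops, a $200K fully-loaded salary) plus assumed values for integration effort and infrastructure:

```python
# 24-month total-cost-of-ownership comparison.
# Salary, fee, and effort figures are illustrative assumptions.

MONTHS = 24
FULLY_LOADED_SALARY = 200_000  # $/year per engineer, assumed

def tco_buy(monthly_fee: float, integration_weeks: float) -> float:
    """Managed: fee over the horizon plus one-time integration effort."""
    integration = integration_weeks * FULLY_LOADED_SALARY / 52
    return monthly_fee * MONTHS + integration

def tco_build(build_weeks: float, infra_monthly: float, ops_fte: float) -> float:
    """Self-hosted: initial build, infrastructure, and ongoing ops FTE."""
    build = build_weeks * FULLY_LOADED_SALARY / 52
    ops = ops_fte * FULLY_LOADED_SALARY / 12 * MONTHS
    return build + infra_monthly * MONTHS + ops

# "Kafka costs $3,000/month and we can build it in 2 weeks":
buy = tco_buy(monthly_fee=3_000, integration_weeks=1)
build = tco_build(build_weeks=2, infra_monthly=1_000, ops_fte=0.3)

print(f"buy:   ${buy:,.0f}")    # ~$75,846
print(f"build: ${build:,.0f}")  # ~$151,692
```

The 2-week build is a rounding error; the 0.3 FTE of ongoing ops ($120K over 24 months) is what flips the answer.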
Mistake 3: Letting engineering culture drive the decision
Some teams default to "build" because the engineering culture rewards building things. Some teams default to "buy" because leadership distrusts custom solutions. Both defaults are wrong. The answer is always "it depends on the numbers and the context." If your team's default is always the same regardless of the problem, you have a culture problem, not a technical strategy.
Mistake 4: Ignoring the opportunity cost of engineering time
Every engineer working on infrastructure is an engineer not working on product features. If you have 20 engineers and 3 of them are maintaining self-hosted infrastructure, that is 15% of your engineering capacity spent on undifferentiated work. For a startup, that might be the difference between shipping a feature that wins a customer and not shipping it.
Mistake 5: Making the decision in the interview without acknowledging the tradeoff
In interviews, candidates often say "I'd use Kafka here" without any analysis. At staff level, the interviewer wants to see that you considered whether Kafka is the right tool, whether managed or self-hosted, and what would change your decision. The absence of this analysis reads as a mid-level response.
How This Shows Up in Interviews
When it appears
Every staff+ system design interview has at least one make vs. buy moment. It is implicit in every technology choice you make. "I'd use Redis for caching" is a buy decision (ElastiCache) or a build decision (self-hosted Redis), and the interviewer wants to know which one you mean and why.
At senior level, saying "I'd use Redis" is sufficient. At staff level, the expected response is: "I'd use ElastiCache Redis here. Our caching needs are standard, and the operational cost of self-hosting is not justified at this scale. If we were processing 500K+ QPS and needed custom eviction policies, I'd evaluate self-hosting."
What interviewers listen for
- Awareness that the choice exists. Do you even acknowledge that there is a build/buy decision here?
- Specific reasoning. Not "managed is easier" but "managed is cheaper for this workload because our volume is under the threshold where self-hosting becomes cost-effective."
- Knowledge of limits. Knowing when managed services break down shows real operational experience.
- Willingness to make a call. "It depends" without a recommendation reads as indecisive. Make a call, then qualify it.
The difference between acknowledging and analyzing
Acknowledging (senior level): "I'd use Confluent Cloud for Kafka."
Analyzing (staff level): "I'd use Confluent Cloud here. Our throughput requirement is 50K messages/second with 7-day retention, which Confluent handles comfortably. Self-hosting at this volume would require 6 brokers and a half-time ops engineer, roughly doubling the cost. The Kafka wire protocol is standard, so we have an exit path if we ever outgrow Confluent's limits."
The second response takes 15 seconds longer but communicates significantly more judgment.
Common interviewer follow-ups
| Follow-up question | What they are testing | Good response pattern |
|---|---|---|
| "What if the vendor raises prices 3x?" | Lock-in risk awareness | "We'd evaluate migration cost. The Kafka wire protocol is portable, so the switching cost is operational, not architectural." |
| "Why not build this yourself?" | Cost reasoning | "At our scale, the ops cost of self-hosting exceeds the managed fee. Building makes sense when the bill exceeds $25K/month." |
| "What if you need a feature the vendor doesn't support?" | Pragmatism vs. purity | "I'd evaluate workarounds first. If the feature is critical and no workaround exists, that changes the calculus." |
| "How would you migrate off this if you needed to?" | Migration planning | "I'd design the integration with standard interfaces where possible, so the migration is a connection-string change, not a rewrite." |
Test Your Understanding
Quick Recap
- Make vs. buy is a structured decision: compare 24-month total cost of ownership (managed fee + integration + lock-in cost) against self-hosted cost (engineering time + infrastructure + operational overhead).
- Managed services win for commodity problems (auth, payments, email, observability), teams without specialized ops expertise, and pre-PMF startups where time-to-market dominates.
- Building wins when the component is core differentiation, managed options hit hard limits, or your scale makes a dedicated team cheaper than the vendor bill (typically above $25K-$30K/month).
- Open source is not free. It is operationally closer to "build" than "buy." Budget 0.25-0.5 FTEs per major self-hosted system.
- The most common trap is overestimating lock-in risk and building vendor abstractions that cost more than the hypothetical migration ever would.
- In interviews, show the analysis: name the managed default, give a specific reason, and state what would change your decision. This takes 15 seconds and signals staff-level judgment.
- Revisit major make vs. buy decisions every 12-18 months. Requirements, team composition, and vendor pricing all change.
Related Articles
- Senior vs. staff expectations explains the broader behavioral differences between senior and staff engineers, of which make vs. buy judgment is one specific dimension.
- Staff engineer design approach covers how staff engineers structure entire system design interviews, including when and how to address infrastructure choices.
- Monolith vs. microservices is a closely related architectural decision where similar cost-vs-flexibility tradeoffs apply at a different level of abstraction.