Multi-tenancy design
Learn how to serve multiple customers from shared infrastructure without data leakage, using silo, bridge, and pool isolation models with tenant-aware routing.
TL;DR
- Multi-tenancy means multiple customers (tenants) share the same system infrastructure while their data remains logically or physically isolated.
- Three isolation models: silo (one database per tenant, maximum isolation, highest cost), bridge (shared database, schema per tenant), pool (shared tables with a tenant_id column, minimum cost, maximum risk).
- Data leakage between tenants is the single most dangerous failure mode. One missed WHERE clause can expose customer data to the wrong tenant.
- Noisy neighbor is the operational risk: one tenant's heavy usage degrades performance for everyone sharing the same resources.
- Most production SaaS systems use a hybrid: small tenants in the pool model, enterprise tenants in silo, with automated migration tooling between tiers.
The Problem It Solves
Your SaaS application has 2,000 paying customers. During a routine database query, an engineer runs a report without a WHERE clause on tenant_id. The resulting CSV contains order data from all 2,000 tenants. It gets attached to an email thread. An enterprise customer's confidential pricing data is now exposed.
Or the less dramatic but equally painful version: one of your largest tenants runs an expensive analytics query that locks shared tables for 30 seconds. During that window, your 1,999 other tenants experience timeouts. Your support queue fills up, but the tenant running the query doesn't even notice because their workload completed successfully.
These are the two fundamental multi-tenancy risks: data leakage (one tenant sees another's data) and noisy neighbor (one tenant's workload degrades everyone else's performance). Both stem from sharing infrastructure between customers.
The single-tenant approach (deploy a completely separate stack per customer) avoids these risks but doesn't scale. Managing 2,000 independent deployments means 2,000 database upgrades, 2,000 schema migrations, and 2,000 monitoring dashboards. The operational cost makes it economically unviable for all but the largest enterprise contracts.
Multi-tenancy is the engineering discipline of sharing infrastructure safely. The question isn't whether to share but how much isolation each customer needs, and at what cost.
What Is It?
Multi-tenancy is an architecture where a single instance of software serves multiple customers (tenants), with mechanisms to ensure each tenant's data, performance, and configuration are isolated from every other tenant.
Analogy: Think of an apartment building. Each tenant has their own unit with a lock on the door (data isolation). They share the building's plumbing, electrical, and elevator (shared infrastructure). Some tenants pay extra for a penthouse with a private elevator (silo model). Most share the common elevator but have a maximum occupancy limit so one large family doesn't monopolize it (noisy neighbor controls). The building management company runs one maintenance team for the whole building, not one per unit (operational efficiency).
The isolation level you choose is the central architecture decision. There's a spectrum from full isolation (expensive, simple to secure) to full sharing (cheap, operationally risky).
No single model is correct. The right answer depends on your customer mix, regulatory requirements, and cost constraints. Most mature SaaS products use a hybrid.
How It Works
Let's trace a request through a multi-tenant system. A user at Acme Corp hits acme.myapp.com/api/orders. The system must: (1) identify the tenant, (2) route to the right data, (3) enforce isolation.
Tenant routing middleware
Every request passes through tenant resolution before reaching business logic:
    class TenantMiddleware:
        def process_request(self, request):
            # Strategy 1: Subdomain
            tenant_slug = request.host.split('.')[0]  # "acme" from acme.myapp.com
            # Strategy 2: JWT claim
            # tenant_id = request.auth.claims["tenant_id"]
            # Strategy 3: API key prefix
            # tenant_slug = request.headers["X-API-Key"].split("-")[1]

            # Look up the tenant record, checking the cache first
            tenant = cache.get(f"tenant:{tenant_slug}")
            if not tenant:
                tenant = db.query("SELECT * FROM tenants WHERE slug = %s", tenant_slug)
                cache.set(f"tenant:{tenant_slug}", tenant, ttl=3600)

            # Attach the tenant context that all downstream code relies on
            request.tenant = tenant
            request.db = TenantScopedDB(tenant.id, tenant.isolation_model)
The middleware caches tenant lookups (they're effectively read-only) and sets a tenant context that all downstream code uses. I often see teams skip the caching step, which adds a database round-trip to every single request for data that changes maybe once a month.
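The three resolution strategies in the middleware can be combined into one lookup with a fixed order of precedence. The following is a minimal sketch, not the article's implementation: the function name `resolve_tenant_key`, the `base_domain` parameter, the precedence order, and the `<prefix>-<tenant>-<secret>` API key layout are all assumptions made for illustration.

```python
from typing import Mapping, Optional, Tuple


def resolve_tenant_key(
    host: str,
    headers: Mapping[str, str],
    claims: Optional[Mapping[str, str]] = None,
    base_domain: str = "myapp.com",  # assumed product domain
) -> Optional[Tuple[str, str]]:
    """Return ("id", value) or ("slug", value), or None if nothing identifies a tenant."""
    # 1. A claim from an already-verified JWT takes precedence (assumed ordering).
    if claims and claims.get("tenant_id"):
        return ("id", claims["tenant_id"])

    # 2. API key, assumed to look like "<prefix>-<tenant>-<secret>".
    parts = headers.get("X-API-Key", "").split("-")
    if len(parts) >= 3 and parts[1]:
        return ("slug", parts[1])

    # 3. Subdomain: drop any port, then require exactly "<slug>.<base_domain>".
    hostname = host.split(":")[0].lower()
    suffix = "." + base_domain
    if hostname.endswith(suffix):
        slug = hostname[: -len(suffix)]
        if slug and "." not in slug:
            return ("slug", slug)

    # Fail closed: the caller should reject the request rather than guess.
    return None
```

Unlike a bare `split('.')[0]`, this version checks the host against the expected domain, so a request to the apex domain or to an unrelated host does not resolve to a tenant by accident.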
Data isolation by model
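The silo model routes the entire database connection to a dedicated instance, the bridge model switches to the tenant's schema on a shared database, and the pool model uses the shared database as-is: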
    class TenantScopedDB:
        def get_connection(self):
            if self.isolation_model == "silo":
                # Dedicated database instance for this tenant
                return connection_pool.get(self.tenant.dedicated_db_url)
            elif self.isolation_model == "bridge":
                # Shared database, one schema per tenant
                conn = connection_pool.get(shared_db_url)
                # search_path is session state: reset it before the connection is reused
                conn.execute(f"SET search_path TO tenant_{self.tenant.id}")
                return conn
            else:  # pool
                # Shared tables; all queries auto-filtered by tenant_id
                return connection_pool.get(shared_db_url)
Pool model enforces tenant_id at the query layer. This is the most critical code path in any multi-tenant system:
    class TenantScopedQuerySet:
        """All database access goes through this. No bypass allowed."""

        def __init__(self, tenant_id: UUID):
            self._tenant_id = tenant_id

        def orders(self) -> QuerySet:
            return Order.objects.filter(tenant_id=self._tenant_id)

        def users(self) -> QuerySet:
            return User.objects.filter(tenant_id=self._tenant_id)

    # Never expose the raw ORM to application code
    request.db = TenantScopedQuerySet(tenant_id=request.tenant.id)
The pool model's existential risk: missing WHERE clauses
In the pool model, a single query missing WHERE tenant_id = ... returns data from all tenants. This is not a theoretical risk. It's the most common multi-tenancy bug, and it happens when a developer writes a raw SQL query, uses an ORM method that bypasses the scoped wrapper, or forgets the filter in a background job. Enforce tenant scoping at the infrastructure layer (row-level security, query rewriting middleware), not just by convention.
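Background jobs are a common place for the filter to go missing because there is no request object carrying the tenant. One way to make that failure loud is a tenant context that fails closed. This is a rough sketch using Python's standard `contextvars` module; the names `tenant_scope`, `require_tenant`, `scoped_query`, and `MissingTenantContext` are hypothetical, and the `%s` placeholder assumes a DB-API driver with that parameter style.

```python
from contextlib import contextmanager
from contextvars import ContextVar
from typing import Iterator, Tuple

# Holds the tenant for the current request or job; None means "not set".
_current_tenant = ContextVar("current_tenant", default=None)

# Tables that carry a tenant_id column (assumed list for this sketch).
TENANT_TABLES = ("orders", "users")


class MissingTenantContext(RuntimeError):
    """Raised when data access is attempted with no tenant in scope."""


@contextmanager
def tenant_scope(tenant_id: str) -> Iterator[None]:
    """Set the tenant for the duration of a block, restoring the old value after."""
    token = _current_tenant.set(tenant_id)
    try:
        yield
    finally:
        _current_tenant.reset(token)


def require_tenant() -> str:
    """Return the tenant in scope, or refuse instead of running unscoped."""
    tenant_id = _current_tenant.get()
    if tenant_id is None:
        raise MissingTenantContext("no tenant in scope; refusing to query")
    return tenant_id


def scoped_query(table: str) -> Tuple[str, Tuple[str, ...]]:
    """Build a parameterised SELECT that always carries the tenant filter."""
    if table not in TENANT_TABLES:
        raise ValueError(f"unknown tenant table: {table}")
    return f"SELECT * FROM {table} WHERE tenant_id = %s", (require_tenant(),)
```

A job would wrap its work in `with tenant_scope(tenant_id):`. Code that forgets to do so raises an exception instead of silently reading every tenant's rows.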
Row-level security as a safety net
PostgreSQL's Row-Level Security (RLS) provides database-enforced isolation for the pool model:
    -- Enable RLS on the orders table
    ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
    -- Table owners bypass RLS by default; FORCE applies the policy to them too
    ALTER TABLE orders FORCE ROW LEVEL SECURITY;
    -- Create a policy: each session can only see rows matching its tenant
    CREATE POLICY tenant_isolation ON orders
        USING (tenant_id = current_setting('app.current_tenant_id')::uuid);
    -- Middleware sets the session variable before every query
    SET app.current_tenant_id = 'acme-tenant-uuid';
    SELECT * FROM orders; -- RLS automatically filters to Acme's rows only
Even if application code forgets the WHERE clause, the database itself enforces the filter. My recommendation: use RLS as a safety net alongside application-layer filtering, not as a replacement. RLS adds query overhead (the policy evaluation) and complicates EXPLAIN plans, but the data leakage prevention is worth it.
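On the application side, the tenant setting has to be applied on the same connection and in the same transaction as the query it protects. Here is a hedged sketch of what that could look like. It assumes a DB-API cursor with `%s` placeholders (as in psycopg) and an already-open transaction; `run_as_tenant` is a hypothetical helper name. PostgreSQL's `set_config(name, value, is_local)` with `is_local` set to true scopes the value to the current transaction, like `SET LOCAL`, and also lets the tenant ID be passed as a bound parameter.

```python
from typing import Any, List, Sequence


def run_as_tenant(
    cursor: Any,  # assumed: a DB-API cursor inside an open transaction
    tenant_id: str,
    sql: str,
    params: Sequence[Any] = (),
) -> List[Any]:
    """Set the RLS tenant for this transaction, then run one query."""
    # Transaction-scoped, so the value cannot linger on a pooled connection
    # and be picked up by the next tenant's request.
    cursor.execute(
        "SELECT set_config('app.current_tenant_id', %s, true)",
        (tenant_id,),
    )
    cursor.execute(sql, params)
    return cursor.fetchall()
```

With autocommit enabled, each statement runs in its own transaction, so the setting would not carry over to the query. That is why the sketch assumes an explicit transaction.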
For your interview: say "tenant resolution in middleware, scoped query layer in the app, and RLS in the database as defense in depth." That gives three layers, so a filter missed in application code is still caught by the database instead of leaking data.
Key Components
| Component | Role |
|---|---|
| Tenant resolver middleware | Extracts tenant identity from the request (subdomain, JWT, API key) and sets context |
| Tenant registry | Database table mapping tenant IDs to isolation model, connection strings, feature flags |
| Scoped query layer | Application-level wrapper that auto-filters all queries by tenant_id (pool model) |
| Row-level security (RLS) | Database-enforced policy that filters rows by tenant, safety net for pool model |
| Connection router | Routes DB connections to the correct instance (silo) or schema (bridge) per tenant |
| Noisy neighbor controls | Per-tenant rate limits, connection caps, query timeouts, and resource quotas |
| Tenant migration tooling | Moves a tenant between isolation tiers (pool to silo) with data copy, verification, and cutover |
| Tenant-scoped cache | Cache keys prefixed with tenant_id to prevent cross-tenant cache pollution |
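The tenant-scoped cache in the table above is the simplest component to illustrate. Below is a minimal sketch that wraps a plain dictionary. The class name `TenantScopedCache` and the `tenant:<id>:` key layout are assumptions, and a real deployment would wrap a shared cache client rather than a dict.

```python
from typing import Any, Dict, Optional


class TenantScopedCache:
    """Prefixes every key with the tenant ID so tenants cannot collide."""

    def __init__(self, backend: Dict[str, Any], tenant_id: str) -> None:
        # Reject empty IDs and the separator character, which would let
        # two different (tenant, key) pairs map to the same stored key.
        if not tenant_id or ":" in tenant_id:
            raise ValueError("tenant_id must be non-empty and contain no ':'")
        self._backend = backend
        self._prefix = f"tenant:{tenant_id}:"

    def get(self, key: str) -> Optional[Any]:
        return self._backend.get(self._prefix + key)

    def set(self, key: str, value: Any) -> None:
        self._backend[self._prefix + key] = value
```

Because the prefix is applied inside the wrapper, application code never builds cache keys by hand and cannot forget the tenant part.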
Types / Variations
| Dimension | Silo (DB per tenant) | Bridge (schema per tenant) | Pool (shared schema) |
|---|---|---|---|
| Isolation | Physical. Complete separation | Logical. Schema boundaries | Logical. Row-level only |
| Data leakage risk | Near zero | Low (schema misconfiguration) | High (missing WHERE clause) |
| Noisy neighbor risk | None (separate resources) | Medium (shared DB engine) | High (shared tables, indexes) |
| Cost per tenant | $50-200+/month (dedicated DB) | $5-20/month (shared cluster) | $0.50-5/month (shared everything) |
| Tenant provisioning | Minutes (create DB, run migrations) | Seconds (create schema) | Milliseconds (insert row) |
| Schema migrations | Run across N databases | Run across N schemas | Run once |
| Regulatory fit | HIPAA, SOC2, data residency | Moderate compliance | Requires RLS + audit controls |
| Practical tenant limit | Hundreds | Thousands | Millions |
| Best fit | Enterprise contracts, regulated sectors | Mid-market SaaS | Self-serve, high volume |
The rule of thumb: start with pool for your first 100 tenants, add bridge for mid-tier accounts, offer silo for enterprise deals that demand it.
Noisy Neighbor Mitigation
In the pool model, one tenant's heavy usage affects every other tenant sharing those resources. This is not a theoretical risk. I've seen a single tenant's reporting query (a full table scan with GROUP BY across 50M rows) bring response times from 20ms to 8 seconds for all 500 other tenants.
The most effective approach combines five layers. Rate limiting caps request volume. Connection pool limits prevent one tenant from exhausting shared connections. Query timeouts kill runaway queries. Tenant-aware sharding separates heavy tenants physically. And graduated isolation automatically migrates consistently heavy tenants to their own infrastructure.
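The first of these layers, per-tenant rate limiting, can be sketched with a token bucket kept per tenant. This is an in-memory illustration under stated assumptions: `TenantRateLimiter` and its parameters are hypothetical names, and a multi-process deployment would need shared state in an external store instead of a local dict.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class _Bucket:
    tokens: float
    updated_at: float


class TenantRateLimiter:
    """Token bucket per tenant: one tenant's burst cannot drain another's budget."""

    def __init__(
        self,
        rate_per_sec: float,
        burst: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._rate = rate_per_sec
        self._burst = burst
        self._clock = clock
        self._buckets: Dict[str, _Bucket] = {}

    def allow(self, tenant_id: str, cost: float = 1.0) -> bool:
        now = self._clock()
        # New tenants start with a full bucket.
        bucket = self._buckets.setdefault(tenant_id, _Bucket(self._burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = now - bucket.updated_at
        bucket.tokens = min(self._burst, bucket.tokens + elapsed * self._rate)
        bucket.updated_at = now
        if bucket.tokens >= cost:
            bucket.tokens -= cost
            return True
        return False
```

The `cost` argument is a design choice worth noting: charging an expensive report more tokens than a simple read limits by load and not only by request count.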
Hybrid Model
Most production SaaS systems don't pick one model. They use a hybrid based on customer tier:
| Tier | % of tenants | % of revenue | Isolation model | Reason |
|---|---|---|---|---|
| Free / self-serve | 90% | 20% | Pool | Cost efficiency at scale |
| Professional | 9% | 30% | Bridge | Schema isolation satisfies compliance |
| Enterprise | 1% | 50% | Silo | Contractual isolation, data residency |
New tenants start in the pool. When a customer upgrades to a higher tier (or signs an enterprise contract), your platform migrates their data to the appropriate isolation level. This migration tooling (copy data, verify integrity, switch connection routing, validate) is a core platform capability, not an afterthought.
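The four migration steps (copy, verify, switch routing, validate) can be illustrated with in-memory stand-ins for the databases and the tenant registry. This is a toy sketch rather than production tooling: `migrate_to_silo`, `checksum`, and the dictionary-based stores are hypothetical. It does not handle writes that arrive during the copy, which a real migration would have to pause or replay, and it leaves cleanup of the pool rows for a later step.

```python
import hashlib
import json
from typing import Dict, List

Row = Dict[str, object]


def checksum(rows: List[Row]) -> str:
    """Order-independent digest of a set of rows."""
    serialised = sorted(json.dumps(row, sort_keys=True) for row in rows)
    return hashlib.sha256("\n".join(serialised).encode("utf-8")).hexdigest()


def migrate_to_silo(
    tenant_id: str,
    pool_rows: List[Row],
    silo_stores: Dict[str, List[Row]],
    routing: Dict[str, str],
) -> None:
    # 1. Copy this tenant's rows out of the shared store.
    source = [row for row in pool_rows if row["tenant_id"] == tenant_id]
    silo_stores[tenant_id] = [dict(row) for row in source]

    # 2. Verify integrity before touching routing.
    if checksum(silo_stores[tenant_id]) != checksum(source):
        del silo_stores[tenant_id]
        raise RuntimeError("verification failed; routing unchanged")

    # 3. Switch connection routing in the tenant registry.
    previous = routing.get(tenant_id, "pool")
    routing[tenant_id] = "silo"

    # 4. Validate through the new route; roll back routing on mismatch.
    if len(silo_stores[tenant_id]) != len(source):
        routing[tenant_id] = previous
        raise RuntimeError("post-cutover validation failed; routing restored")
```

Keeping the routing switch as a single, reversible step is what makes the cutover safe to retry.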
Trade-offs
| Advantage | Disadvantage |
|---|---|
| Lower infrastructure cost (sharing resources) | Data leakage risk in pool model |
| Single codebase to maintain | Every code path must be tenant-aware |
| Faster feature delivery (one deployment) | Schema changes must be backward-compatible across all tenants |
| Easier operational monitoring (one system) | Noisy neighbor risk degrades shared performance |
| Centralized security patching | Compliance complexity (data residency, audit trails) |
| Efficient resource utilization | Tenant migration between models is complex engineering |
The fundamental tension is cost efficiency vs. isolation guarantees. Sharing infrastructure reduces cost per tenant dramatically, but every shared resource is a potential vector for data leakage or performance interference.
When to Use It / When to Avoid It
Use multi-tenancy when:
- You're building SaaS with more than ~10 customers on the same product
- Infrastructure cost per customer must be low (self-serve tiers, freemium models)
- You want a single deployment to maintain, patch, and upgrade
- Your customer base spans a wide range of sizes (many small, few large)
- Regulatory requirements can be satisfied with logical isolation (RLS, encryption)
Avoid multi-tenancy (use single-tenant) when:
- Every customer requires physically separate infrastructure by contract or regulation
- You have fewer than 10 customers, each paying enterprise prices
- Customers need fully independent upgrade schedules (different versions running simultaneously)
- Data residency requirements make shared infrastructure impossible across regions
- The engineering complexity of tenant isolation exceeds the operational cost of separate deployments
Ok, but here's the thing most people miss: multi-tenancy is not a binary decision. You don't "do multi-tenancy" or "not do it." You choose an isolation model per customer tier and build the platform to support migration between them.
Real-World Examples
Salesforce is the original multi-tenant SaaS platform, serving 150,000+ organizations from shared infrastructure since 1999. Their pool model uses a metadata-driven architecture where tenant customizations (fields, objects, workflows) are stored as rows in shared system tables rather than as schema changes. This lets them deploy a single codebase serving every customer from small businesses to Fortune 500 enterprises. Their key innovation: "virtual custom objects" that represent per-tenant schema extensions without actual DDL, enabling millions of tenant-specific fields on shared physical tables.
Slack uses workspace-level isolation with a hybrid approach. Each workspace is a tenant, but the isolation model varies by scale. Small workspaces share infrastructure (pool model). Slack's Enterprise Grid product gives large organizations (100K+ users) dedicated infrastructure with cross-workspace federation. Their largest operational challenge was noisy neighbor mitigation during viral adoption: when a company of 50,000 users onboards, their initial data import can generate 10x normal database write load.
AWS uses account-level silo isolation for its largest multi-tenancy challenge: the cloud itself. Each AWS account is a tenant that gets hard resource boundaries (VPC isolation, IAM policy enforcement, quota limits). For services like S3 (which stores trillions of objects across millions of accounts), internal data partitioning uses a pool-like model with extremely rigorous tenant-id enforcement. Their 2017 S3 outage (triggered by a single operator command) demonstrated why tenant-level blast radius isolation matters even within infrastructure providers.
How This Shows Up in Interviews
When to bring it up: Any SaaS design question ("Design Slack," "Design a CRM," "Design a project management tool") requires multi-tenancy. Also relevant when the interviewer asks about data isolation, compliance, or cost optimization for a platform serving multiple customers.
Depth expected at senior/staff level:
- Name all three isolation models and state cost/isolation tradeoffs of each
- Explain the hybrid model and tenant migration between tiers
- Describe the data leakage risk and how RLS + scoped queries provide defense-in-depth
- Know noisy neighbor mitigations: per-tenant rate limiting, connection caps, query timeouts
- Discuss tenant-aware routing strategies and where tenant context is resolved
- Address compliance (data residency, GDPR right-to-deletion) and how isolation model affects it
Interview shortcut: the multi-tenancy one-liner
When time is short, say: "Pool model with tenant_id column, RLS as safety net, per-tenant rate limiting for noisy neighbors, and the option to migrate enterprise tenants to silo for contractual isolation." That covers the architecture in 10 seconds and shows you know the real tradeoffs.
Follow-up Q&A:
| Interviewer asks | Strong answer |
|---|---|
| "How do you prevent data leakage in a shared database?" | "Three layers: (1) application-level scoped query wrapper that auto-filters by tenant_id, (2) PostgreSQL RLS policies as a database-enforced safety net, (3) automated testing that runs every query path with two different tenant contexts and asserts zero cross-tenant results." |
| "What about noisy neighbors?" | "Per-tenant rate limiting at the API gateway, per-tenant connection pool limits in PgBouncer, statement_timeout on all queries, and graduated isolation that auto-detects heavy tenants and migrates them to dedicated resources." |
| "Why not just give every tenant their own database?" | "Cost. At 10,000 tenants, that's 10,000 PostgreSQL instances to manage, patch, back up, and monitor. Pool model serves those same tenants from one cluster. Silo is reserved for the 1% of tenants who contractually require it and pay accordingly." |
| "How do you handle GDPR right-to-deletion?" | "In pool model, DELETE WHERE tenant_id = X across all tables, with a verification query confirming zero rows remain. In silo, drop the entire database. Migration tooling must handle both. Audit log retention is the complication: you often need to keep anonymized logs for compliance even after deleting the tenant's data." |
| "How do you handle schema migrations?" | "In pool model, one migration runs once. In silo, you need a migration runner that iterates across all tenant databases with rollback support, canary deployments (migrate 5%, verify, then migrate the rest), and alerting on migration failures." |
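The automated cross-tenant test mentioned in the first answer can be sketched with an in-memory SQLite database standing in for the shared store. The table layout, the `list_orders` query path, and the `leaked_rows` helper are assumptions for illustration. The idea is to seed two tenants, run each query path as each tenant, and assert that no returned row belongs to anyone else.

```python
import sqlite3
from typing import Callable, List, Tuple

OrderRow = Tuple[int, str, float]  # (id, tenant_id, total)


def make_db() -> sqlite3.Connection:
    """Create a shared orders table seeded with two tenants."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, tenant_id TEXT NOT NULL, total REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (tenant_id, total) VALUES (?, ?)",
        [("acme", 10.0), ("acme", 25.0), ("globex", 99.0)],
    )
    return conn


def list_orders(conn: sqlite3.Connection, tenant_id: str) -> List[OrderRow]:
    """A query path under test: correctly scoped to one tenant."""
    return conn.execute(
        "SELECT id, tenant_id, total FROM orders WHERE tenant_id = ? ORDER BY id",
        (tenant_id,),
    ).fetchall()


def leaked_rows(
    query_path: Callable[[sqlite3.Connection, str], List[OrderRow]],
    conn: sqlite3.Connection,
    tenant_id: str,
) -> List[OrderRow]:
    """Rows returned for tenant_id that belong to a different tenant."""
    return [row for row in query_path(conn, tenant_id) if row[1] != tenant_id]
```

In CI, every query path would be passed through `leaked_rows` for both seeded tenants, and the build would fail on any non-empty result.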
Quick Recap
- Multi-tenancy serves multiple customers from shared infrastructure, with the core architecture decision being how much isolation to provide at what cost.
- Three models: silo (DB per tenant, max isolation, max cost), bridge (schema per tenant, medium), pool (shared tables with tenant_id, min cost, highest risk).
- Data leakage is prevented through defense-in-depth: scoped query layers in application code, PostgreSQL RLS as a database safety net, and automated cross-tenant testing in CI.
- Noisy neighbor is controlled through per-tenant rate limiting, connection pool caps, query timeouts, and graduated isolation that auto-migrates heavy tenants.
- Most production SaaS systems use a hybrid: pool for free/self-serve, bridge for professional, silo for enterprise, with migration tooling as a core platform capability.
- Tenant-aware routing resolves the tenant from the request (subdomain, JWT claim, API key) in middleware, before any business logic executes.
- In interviews, say "pool with RLS, per-tenant rate limiting, and the ability to migrate enterprise tenants to silo" to show you understand the full tradeoff spectrum.
Related Concepts
- Sharding - The pool model with heavy tenants often evolves into tenant-aware sharding. Understanding partition strategies helps you explain how tenant data is physically distributed.
- Databases - Multi-tenancy is fundamentally a database architecture decision. Knowing PostgreSQL features (RLS, schemas, connection pooling) is prerequisite knowledge.
- Security - Data leakage prevention, tenant authentication, and compliance (GDPR, HIPAA, SOC2) are the security concerns that drive isolation model choices.
- Rate Limiting - Per-tenant rate limiting is the primary noisy neighbor control. Understanding token bucket and sliding window algorithms helps you design tenant-aware limits.