God service anti-pattern
Learn why a single service that everything depends on becomes a single point of failure disguised as modularity, and how to break it apart without a full rewrite.
TL;DR
- A god service is a single service that every other service depends on for core functionality (auth, user data, configuration, or any other cross-cutting concern handled in one place).
- It looks like clean modularity: one service owns "users." In production, it means one service going down takes everything with it.
- The god service becomes the hardest to change (any bug affects all consumers), the hardest to scale (it must handle traffic from every caller), and the highest-stakes deployment in the system.
- Break it apart by caching aggressively at the consumer level, embedding stable data in auth tokens, and decomposing by subdomain to reduce synchronous fan-in.
- When it's acceptable: early-stage systems with fewer than 5 services, or when every consumer has a circuit breaker with graceful degradation.
The Problem
"Every service calls the User Service." This sounds like good design, a single source of truth for user data. In reality, it means the User Service is in every request's critical path.
At 2:47 a.m. during your biggest flash sale, the User Service develops a memory leak. CPU climbs. Response times degrade from 5ms to 500ms. Every API endpoint in the system (product browsing, checkout, order history) slows down by 500ms, because every one of them calls the User Service to validate the session or fetch the user's locale.
At 500ms User Service latency, your checkout API degrades from 120ms to 620ms. Your product pages go from 80ms to 580ms. I've seen this exact scenario play out at two different companies. Both times the root cause was the same: a single service sitting in every request path.
You have a god service. Everything depends on it. When it has a bad day, everything has a bad day.
In the dependency diagram, every arrow into the god service is a synchronous dependency. When the god service is slow, every consumer is slow.
The cascade in action
Here's what the 2:47 a.m. incident actually looks like in a distributed trace:
Two calls to the god service, each one 100x slower than normal. The user waits over a second for what should take 140ms. Multiply this across every service that calls the User Service, and you have system-wide degradation triggered by a single memory leak in a single process.
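The degradation arithmetic can be sketched directly. The numbers below are taken from the incident above (130ms of local work plus two synchronous calls to the god service):

```typescript
// Sketch: end-to-end latency for a request that makes N synchronous
// calls to the god service. Numbers match the incident scenario above.
function requestLatencyMs(
  baseWorkMs: number, // work done outside the god service
  godServiceCalls: number, // synchronous god-service calls per request
  godServiceLatencyMs: number, // current god-service response time
): number {
  return baseWorkMs + godServiceCalls * godServiceLatencyMs;
}

// Healthy: 130ms of local work + 2 calls at 5ms each = 140ms
const healthy = requestLatencyMs(130, 2, 5);

// Degraded: the same request with the god service at 500ms = 1130ms
const degraded = requestLatencyMs(130, 2, 500);
```

The point the model makes visible: the consumer's own work never changed. The entire 8x slowdown comes from the two synchronous hops.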
Why restarts don't help
The natural instinct is to restart the User Service. But if the memory leak was triggered by traffic patterns (a flash sale generating unusual query patterns), the restarted instance hits the same conditions within minutes. Worse, during the restart window, all consumers that don't have fallbacks will fail hard instead of just being slow. Rolling restarts help, but only if you have multiple instances and consumers are load-balanced across them.
The bottom line: you need architectural resilience, not just operational playbooks. The god service pattern turns every performance problem into a company-wide incident.
Why It Happens
God services emerge from correct-sounding design decisions:
- "Auth should be centralized" so every service calls the Auth Service to validate tokens.
- "User data should be consistent" so every service calls the User Service to read user attributes.
- "Configuration should be centralized" so every service calls the Config Service on each request.
Each decision is individually reasonable. In aggregate, they produce a web of hard dependencies on a few services that are now on the critical path for everything.
The deeper driver is organizational. Early in a project, one team builds the User Service. Other teams need user data, so they call it. Nobody designs a caching or event strategy because the User Service is fast enough at low scale. By the time it's a problem, 15 services depend on it synchronously and unwinding that dependency graph is a multi-quarter effort.
There's also a knowledge gap. Junior engineers often conflate "single source of truth" with "single synchronous dependency." You can have authoritative data ownership without requiring every consumer to make a network call on every request. Tokens, caches, and events are all strategies for distributing reads without giving up write authority.
The fan-in scoring model
A practical way to measure god service risk is to score each service by its fan-in impact:
| Fan-in count | Risk level | Recommended action |
|---|---|---|
| 1-3 | Low | Normal service, no special treatment needed |
| 4-6 | Moderate | Add consumer-side caching, monitor for correlated failures |
| 7-9 | High | Decompose or add circuit breakers on all consumers |
| 10+ | Critical | This is a god service. Prioritize decomposition immediately |
The fan-in count alone doesn't tell the full story. A service called by 4 hot-path services is more dangerous than one called by 10 batch jobs. Weight by call frequency and whether the call is on the user-facing critical path.
Track fan-in as a metric over time. If it's growing, you're accumulating god-service risk even if the current count seems manageable.
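One way to operationalize that weighting is to score each caller by volume and hot-path status rather than counting edges. This is a sketch with illustrative weights, not an industry-standard formula:

```typescript
// Sketch: weight fan-in by call frequency and whether the caller is on
// the user-facing critical path. The 3x hot-path weight is illustrative.
interface Caller {
  name: string;
  callsPerSecond: number;
  onHotPath: boolean; // is this call on the user-facing critical path?
}

function fanInScore(callers: Caller[]): number {
  return callers.reduce((score, c) => {
    const pathWeight = c.onHotPath ? 3 : 1; // hot-path calls weigh 3x
    const volumeWeight = Math.log10(1 + c.callsPerSecond); // dampen raw QPS
    return score + pathWeight * volumeWeight;
  }, 0);
}

// A single high-volume hot-path caller outscores a low-volume batch job,
// matching the intuition that 4 hot-path callers beat 10 batch jobs.
const score = fanInScore([
  { name: "checkout", callsPerSecond: 1000, onHotPath: true },
  { name: "nightly-report", callsPerSecond: 1, onHotPath: false },
]);
```

Tracking this score over time, rather than the raw edge count, surfaces god-service risk before the fan-in table above flags it.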
How to Detect It
| Symptom | What It Means | How to Check |
|---|---|---|
| Fan-in count > 5 | Too many services call this one synchronously | Count inbound edges in your service graph (Datadog, Jaeger) |
| Correlated latency spikes | When service X is slow, 5+ other services spike too | Correlation analysis on p99 latency dashboards |
| Deployment fear | Engineers avoid deploying because "everything breaks" | Ask your team: "Which service scares you most to deploy?" |
| Unbounded responsibility | One service owns auth, preferences, billing, notifications | Count distinct domain concepts in one service's API surface |
| Connection pool exhaustion | Downstream services exhaust connections to the god service | Monitor connection pool usage on both sides |
If three or more of these symptoms match, you have a god service.
Quick diagnostic
Run this mental test: "If the User Service (or whatever service you suspect) goes down for 5 minutes, which user-facing features break?" If the answer is "most of them" or "all of them," that's a god service. A healthy service topology means that a single service failure degrades one feature, not everything.
You can also check your distributed traces. If more than 50% of all traces in your system include a span from the same service, that service has too much fan-in. Tools like Datadog's Service Map or Jaeger's dependency graph make this immediately visible.
// Quick check: count inbound callers to a service
// If this returns more than 5-6 unique callers, investigate
const callers = traceData
.filter((span) => span.downstream === "user-service")
.map((span) => span.upstream);
const uniqueCallers = new Set(callers);
console.log(`Fan-in: ${uniqueCallers.size} services call user-service`);
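The 50%-of-traces heuristic from the diagnostic above can be checked the same way. This sketch assumes each trace summary exposes the set of service names it touched, which is not a standard trace format:

```typescript
// Sketch: what fraction of traces include a span from a given service.
// Assumes a simplified trace summary listing the services it touched.
interface TraceSummary {
  services: string[];
}

function traceFraction(traces: TraceSummary[], service: string): number {
  if (traces.length === 0) return 0;
  const hits = traces.filter((t) => t.services.includes(service)).length;
  return hits / traces.length;
}

// If more than half of all traces touch the suspect service, flag it.
function isGodServiceCandidate(traces: TraceSummary[], service: string): boolean {
  return traceFraction(traces, service) > 0.5;
}
```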
The Fix
Fix 1: Cache aggressively at the consumer
If services call the User Service for data that rarely changes (user locale, plan tier, display name), cache it at the consumer with a short TTL. Most calls never reach the User Service.
async function getUserLocale(userId: string): Promise<string> {
const cached = await cache.get(`user:${userId}:locale`);
if (cached) return cached; // covers 99% of calls
const locale = await userServiceClient.getLocale(userId);
await cache.set(`user:${userId}:locale`, locale, { ttl: 300 }); // 5-min cache
return locale;
}
Trade-off: you accept up to 5 minutes of staleness for user attributes. For most read-heavy data, this is perfectly acceptable.
A crucial detail: the consumer must have a fallback for when the cache is empty AND the User Service is down. Without this, you've just added a cache that delays the failure by one TTL cycle.
// GOOD: cache-aside with graceful degradation
async function getUserLocale(userId: string): Promise<string> {
const cached = await cache.get(`user:${userId}:locale`);
if (cached) return cached;
try {
const locale = await userServiceClient.getLocale(userId);
await cache.set(`user:${userId}:locale`, locale, { ttl: 300 });
return locale;
} catch (err) {
// God service is down. Use a sensible default rather than failing.
console.warn(`User Service unavailable, using default locale for ${userId}`);
return "en-US";
}
}
Fix 2: Embed critical data in auth tokens
If you need user properties (role, tenant, plan) on every request, put them in the JWT payload at login time. Services verify the token locally without any downstream call. The User Service is only called to update data, not to read it on the hot path.
{
"sub": "user-123",
"role": "admin",
"tenant_id": "acme-corp",
"plan": "enterprise",
"exp": 1714000000
}
Trade-off: token data can be stale until the user refreshes their session. For attributes that change rarely (role, plan), this is a good trade-off. For attributes that change often (cart contents), it's not.
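The read side is then a purely local operation. Here's a minimal sketch using Node's Buffer with base64url decoding; signature verification is deliberately omitted for brevity, and production code must verify the token with a JWT library (e.g. jose or jsonwebtoken) before trusting any claim:

```typescript
// Sketch: read user attributes from a JWT payload without any call to
// the User Service. NOT production-ready: signature is not verified.
function decodeJwtPayload(token: string): Record<string, unknown> {
  const payloadPart = token.split(".")[1];
  const json = Buffer.from(payloadPart, "base64url").toString("utf8");
  return JSON.parse(json);
}

// Build a throwaway token carrying the payload shown above. The header
// and signature are placeholders since only the payload is decoded here.
const payload = { sub: "user-123", role: "admin", plan: "enterprise" };
const token = [
  Buffer.from(JSON.stringify({ alg: "HS256", typ: "JWT" })).toString("base64url"),
  Buffer.from(JSON.stringify(payload)).toString("base64url"),
  "sig-placeholder",
].join(".");

const claims = decodeJwtPayload(token);
// claims.role and claims.plan are now available with zero network calls
```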
Fix 3: Decompose by subdomain
If the User Service handles auth, preferences, billing, and notifications, split it. Auth becomes a dedicated Auth Service. Billing becomes a Billing Service. Each service depends on fewer others, and failures are scoped.
This is not "going back to the monolith." It is correctly drawing service boundaries around business domains rather than around a single entity.
Fix 4: Event-driven read path
For data that changes infrequently but needs to be available everywhere, publish change events. The User Service emits user.updated events. Each consumer subscribes and maintains a local read-model with the data it needs.
// User Service publishes on write
async function updateUserLocale(userId: string, locale: string) {
await db.query("UPDATE users SET locale = $1 WHERE id = $2", [locale, userId]);
await eventBus.publish("user.updated", {
userId,
changes: { locale },
timestamp: Date.now(),
});
}
// Product Service consumes and caches locally
eventBus.subscribe("user.updated", async (event) => {
if (event.changes.locale) {
await localStore.set(`user:${event.userId}:locale`, event.changes.locale);
}
});
Trade-off: eventual consistency. When a user changes their locale, there's a brief window (typically milliseconds to seconds) where other services still see the old value. For most use cases, this is perfectly acceptable. For auth-critical data (is this user banned?), prefer the JWT approach or a short-TTL cache with a circuit breaker.
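One subtlety the consumer example glosses over: events can arrive out of order, so a delayed event could overwrite a newer value with an older one. A common guard is last-write-wins using the event timestamp. This is a sketch using an in-memory Map in place of the example's localStore:

```typescript
// Sketch: last-write-wins guard so an out-of-order user.updated event
// cannot overwrite a newer value with an older one.
interface VersionedValue {
  value: string;
  updatedAt: number; // event timestamp, e.g. Date.now() at publish time
}

const store = new Map<string, VersionedValue>();

function applyUpdate(key: string, value: string, updatedAt: number): boolean {
  const current = store.get(key);
  if (current && current.updatedAt >= updatedAt) return false; // stale event, drop it
  store.set(key, { value, updatedAt });
  return true;
}
```

With this guard, consumers can process events concurrently or after a replay without corrupting their read models.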
Which fix to use when
| Situation | Best fix |
|---|---|
| Read-heavy data that changes rarely (locale, plan tier) | Fix 1: consumer-side cache |
| Attributes needed on every request (role, tenant) | Fix 2: embed in the auth token |
| One service owning many unrelated domains | Fix 3: decompose by subdomain |
| Slow-changing data needed locally by many consumers | Fix 4: event-driven read path |
Severity and Blast Radius
A god service is a high-severity anti-pattern because the blast radius is proportional to fan-in. If 12 services depend on it synchronously, a single degradation event affects all 12 simultaneously.
Recovery is not straightforward. You can't just restart the god service and move on. If it crashed due to load, restarting it under the same load causes the same crash. The immediate fix is typically shedding load (rate limiting callers), but the structural fix (decomposition or caching) takes weeks to months.
The worst case: a god service failure triggers cascading timeouts across all consumers, which exhaust their own connection pools, which causes their callers to fail. The entire system goes down because of one service.
| Impact dimension | God service with fan-in of 10 |
|---|---|
| Blast radius | 10 services degraded simultaneously |
| Recovery time (immediate) | Minutes (restart, shed load) |
| Recovery time (structural) | Weeks to months (decompose, add caching) |
| Deployment risk | Every deploy is high-risk; any bug affects all consumers |
| Scaling cost | Must scale to handle combined traffic of all consumers |
The most insidious aspect: the god service's failure mode is indistinguishable from a "the whole system is down" event. Incident responders waste time investigating every consumer service before realizing the root cause is a single upstream dependency.
When It's Actually OK
- Early-stage startups (< 5 services): If you have 3 services and one is the User Service, the fan-in is manageable. Over-decomposing at this stage wastes time. Ship the god service, and plan to break it apart when you hit 8+ consumers.
- Internal tooling with low traffic: Admin dashboards, reporting tools, or batch jobs that call a central service infrequently don't create hot-path dependency pressure. If the User Service goes down and the only impact is "the admin dashboard is unavailable for 10 minutes," that's acceptable risk.
- Read-only aggregators: If the "god service" is a read-only data aggregator (not a write-path dependency), the blast radius of its failure is limited to stale reads, not broken writes.
- Behind a circuit breaker with graceful degradation: If every consumer has a fallback for when the central service is unavailable, the god service pattern is tolerable. The question is whether your team will actually implement and test those fallbacks.
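A minimal circuit breaker with a fallback, as described in the last bullet, can be sketched in a few lines. The thresholds are illustrative; libraries such as opossum provide production-grade implementations with half-open states and metrics:

```typescript
// Sketch: a minimal circuit breaker. After maxFailures consecutive
// failures it "opens" and short-circuits to the fallback for cooldownMs.
class CircuitBreaker<T> {
  private failures = 0;
  private openUntil = 0;

  constructor(
    private maxFailures: number, // consecutive failures before opening
    private cooldownMs: number, // how long to stay open
  ) {}

  async call(fn: () => Promise<T>, fallback: () => T): Promise<T> {
    if (Date.now() < this.openUntil) return fallback(); // open: skip the call
    try {
      const result = await fn();
      this.failures = 0; // a success closes the breaker
      return result;
    } catch {
      if (++this.failures >= this.maxFailures) {
        this.openUntil = Date.now() + this.cooldownMs;
      }
      return fallback();
    }
  }
}

// Usage sketch: wrap the User Service call, degrade to a default locale.
const breaker = new CircuitBreaker<string>(3, 30_000);
// const locale = await breaker.call(
//   () => userServiceClient.getLocale(userId),
//   () => "en-US",
// );
```

The crucial property: once open, the breaker stops hammering the struggling god service entirely, which is what gives it room to recover.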
The key test
Ask yourself: "If I'm paged at 3 a.m. because this service is down, how many people will also be paged?" If the answer is "every on-call engineer in the company," the god service has already become a structural problem, not just a latency concern.
How This Shows Up in Interviews
When you have a design with a shared service (auth, user data, shared config), the interviewer will probe: "What happens when the User Service is slow?" The correct answer involves circuit breakers, consumer-side caching, embedded auth tokens, or graceful degradation that doesn't require the User Service on every request.
Strong answers include:
- Identifying which data belongs in a JWT token vs. cached at the consumer vs. fetched live
- Distinguishing "source of truth for writes" from "synchronous read dependency"
- Proposing event-driven fan-out for read-heavy, write-rare data
- Mentioning circuit breakers and graceful degradation as immediate mitigations
Single source of truth != synchronous dependency
You can have one authoritative service for user data without requiring every service to synchronously call it on every request. The key pattern: push data to consumers (tokens, caches, events) so each consumer can operate independently.
Quick Recap
- A god service has excessive fan-in: too many other services depend on it synchronously in their hot path.
- When it has a bad day (memory leak, slow query, deployment bug) every service that depends on it degrades simultaneously.
- The blast radius is proportional to fan-in. Ten consumers means ten services affected by a single degradation event.
- Fix 1: Cache aggressively at the consumer to absorb the majority of read traffic. A 5-minute TTL covers 99% of calls.
- Fix 2: Embed stable user attributes in JWT tokens so services can validate sessions without a network call.
- Fix 3: Decompose by subdomain: auth, preferences, and billing are separate concerns that should be owned by separate services.
- Fix 4: Use event-driven read paths to push updates to consumers asynchronously rather than requiring synchronous reads.
- The decision is not "centralize vs. distribute" but "synchronous dependency vs. async data push."