API gateway overload anti-pattern
Learn why cramming business logic, orchestration, authentication, and transformation into your API gateway creates a bottleneck and tight coupling that defeats the gateway's purpose.
TL;DR
- An API gateway should handle cross-cutting concerns: routing, TLS termination, rate limiting, auth token validation, and request logging. That's it.
- The anti-pattern: business logic, orchestration, data transformation, and response aggregation creep into the gateway. It starts knowing specific business rules, calling specific microservices, and holding domain knowledge.
- A logic-laden gateway becomes the slowest, most fragile point in your system. Every business change requires a gateway deployment, and one slow endpoint degrades all APIs.
- The fix: move business orchestration to a Backend-for-Frontend (BFF) service or a dedicated orchestrator that sits behind the gateway, not inside it.
- The test for whether logic belongs in the gateway: does it change when business requirements change? If yes, extract it.
The Problem
It's 11:15 p.m. on launch night. Your team just released the new checkout flow, and latency spikes from 120ms to 4.2 seconds. The on-call engineer checks the Order Service: healthy, 15ms p99. The Inventory Service: fine. The User Service: no issues. Every downstream service is green.
The problem is the API gateway. It's making 6 synchronous calls per checkout request: user tier lookup, feature flag evaluation, order creation, inventory check, shipping estimate, and a discount calculation. One slow response from the feature flag service (a 200ms timeout) stretches every checkout request, and because those requests hold connections and worker capacity shared by all routes, the slowdown cascades into every single API request.
I've seen this exact failure at two different companies. Both times, the team was shocked because "the gateway was just routing." It wasn't. It had quietly become the most complex service in the system.
Your API gateway started clean six months ago. It routed /api/orders/* to the Order Service and validated JWTs. Simple. Then the requirements trickled in:
- Calls the User Service to inject user tier into the request header
- Aggregates responses from Order, Inventory, and Shipping for the checkout endpoint
- Applies a 10% discount for "premium" tier users (business logic in infrastructure)
- Transforms date formats from ISO 8601 to Unix timestamps for legacy clients
- Queries the feature flag service to decide which checkout flow version to expose
The mistake I see most often: teams treat the gateway as "just a YAML config file" and assume adding logic there is cheaper than building a new service. It is cheaper, for exactly one sprint. After that, it's debt that compounds with every new feature.
Here's what the overloaded gateway looks like in practice:
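A minimal sketch of that handler follows. The service names and call order come from the incident above; the `Services` type and function names are invented for illustration:

```typescript
// Hypothetical shape of the downstream clients the gateway calls.
type Services = {
  userTier: (userId: string) => Promise<string>;
  featureFlag: (flag: string) => Promise<boolean>;
  createOrder: (body: unknown) => Promise<{ id: string; discount?: number }>;
  checkInventory: (orderId: string) => Promise<boolean>;
  estimateShipping: (orderId: string) => Promise<number>;
  calcDiscount: (tier: string) => Promise<number>;
};

// Six sequential awaits: total latency is the SUM of all six calls,
// so one slow dependency stalls every request on this path.
export async function overloadedCheckout(
  services: Services,
  userId: string,
  body: unknown
) {
  const tier = await services.userTier(userId);               // 1. domain call
  const useV2 = await services.featureFlag("checkout-v2");    // 2. domain call
  const order = await services.createOrder(body);             // 3. domain call
  const inStock = await services.checkInventory(order.id);    // 4. domain call
  const shipping = await services.estimateShipping(order.id); // 5. domain call
  order.discount = await services.calcDiscount(tier);         // 6. business logic!
  return { order, inStock, shipping, useV2 };
}
```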
Six sequential calls. Business logic in the gateway. One slow dependency tanks every API in the system, not just checkout.
The gateway is a shared resource. When it's busy doing orchestration for checkout, health check endpoints, search queries, and profile lookups all queue behind it. At 500 checkout requests per second, those 6 sequential calls generate 3,000 outbound calls per second, and every in-flight call holds a connection from the gateway's shared pool. That's connection pool exhaustion waiting to happen.
Every deployment of the checkout business logic now requires a gateway deployment. The gateway team becomes a bottleneck for every product feature. And because the gateway handles all APIs, a failed deployment rolls back everything, not just checkout.
The root cause: the gateway stopped being infrastructure and became a business service.
Here's the architecture from 10,000 feet. Notice how the gateway sits in the middle of everything:
Every ❌ node is business logic that has no place in a gateway. The gateway knows about user tiers, discount rules, inventory, and shipping. It's coupled to every domain in your system.
Compare this to what a clean gateway looks like:
The clean gateway has four responsibilities. The BFF layer owns all business orchestration. When the checkout flow changes, only the BFF deploys. The gateway stays untouched.
The visual difference is striking: the overloaded gateway connects to everything, while the clean gateway connects to exactly one downstream layer. If your gateway architecture diagram looks like a spider web, that's a red flag.
Why It Happens
Think of an API gateway like a security desk in a building lobby. It checks badges, logs entries, and directs visitors to the right floor. Now imagine the security desk also starts making coffee orders, handling mail sorting, resolving HR disputes, and calculating payroll. Everyone in the building depends on the security desk, and now it's doing six jobs instead of one.
Every decision that leads here sounds reasonable in isolation. That's what makes this anti-pattern so common. No single PR introduces the problem; it's a death by a thousand cuts.
"It's just one small call." The first piece of logic is always tiny. "We just need to check the user's tier before routing." A 5-line if-statement. No one creates a new service for 5 lines of code. So it goes in the gateway.
"The gateway already has the auth context." Since the gateway validates JWTs, it already knows the user ID. It feels wasteful to pass that downstream and have another service look it up again. So the gateway enriches the request. Now it has domain knowledge.
"We need response aggregation for mobile." The mobile team needs a single endpoint that combines data from three services. The gateway is the natural aggregation point. Except now it knows how to call specific services, merge their responses, and handle partial failures.
"We don't have time to build a BFF." Building a separate orchestration service feels like over-engineering when you have 3 microservices. The gateway is right there. I've made this exact argument myself, and it's valid for a while. The problem is that "a while" expires faster than you expect. What starts as one aggregation endpoint becomes five, then ten.
Each decision compounds. By the sixth or seventh addition, the gateway owns business logic, deployment cadence, and domain models. It's no longer infrastructure. It's the most important business service in your architecture, but nobody treats it that way.
"The gateway team is fast and reliable." Sometimes the gateway team is the most competent team in the org. Other teams start routing requests through the gateway "temporarily" because they trust gateway deployments more than their own. Before you know it, the gateway team is the de facto orchestration team owning checkout, search, and profile endpoints, none of which are gateway concerns.
Rolling it back requires extracting logic into new services, building new CI/CD pipelines, and migrating traffic. That's always harder than adding the logic was in the first place.
The cost of extraction grows with each new piece of business logic you add. A gateway with one aggregation endpoint takes a day to extract. A gateway with twenty orchestration paths takes months. This is why catching it early matters so much.
Here's the uncomfortable truth: the best time to extract is before you need to. The second best time is now. Every week you wait, the extraction gets harder and the blast radius of a gateway failure gets larger.
How to Detect It
The tricky part about this anti-pattern is that it doesn't announce itself. There's no alarm that says "your gateway now has too much business logic." It creeps in one PR at a time. By the time someone notices, the gateway is doing 15 things that aren't its job.
Here are the warning signs. If you see three or more, you're already deep in the anti-pattern:
| Symptom | What It Means | How to Check |
|---|---|---|
| Gateway has service-specific code paths | Business logic has leaked in | `grep -r "if.*path.*checkout" gateway/` |
| Gateway calls 2+ downstream services per request | Orchestration lives in the gateway | Trace span count per gateway request |
| Business team tickets require gateway deploys | Domain coupling | Check deploy history: gateway deploys per sprint |
| Gateway p99 latency > 5x downstream service p99 | Gateway is doing too much work | Compare gateway vs downstream latency dashboards |
| Gateway is the hardest service to deploy | Too many concerns in one deployable | Deployment frequency and rollback rate |
| Gateway config references domain entities | Domain knowledge leaked into infra | Search config for "discount", "tier", "checkout" |
The fastest check: look at your gateway's outbound connections. If it calls more than one downstream service for a single routing decision, it's doing orchestration. Gateways route. They don't orchestrate.
Here's a quick diagnostic you can run right now:
```shell
# Count unique downstream hosts the gateway connects to per request path.
# If any path connects to 3+ different services, that's orchestration.
# Field positions depend on your access-log format -- adjust $7 (request
# path) and $11 (upstream_addr) to match yours.
kubectl logs gateway-pod | grep "upstream_addr" | \
  awk '{print $7, $11}' | sort | uniq -c | sort -rn
```
Another useful check: count the number of domain-specific imports in your gateway code.
```typescript
// Run this check against your gateway codebase.
// If you see imports from domain services, you have coupling.
import { UserTier } from "@company/user-models";    // ❌ domain model in gateway
import { DiscountRule } from "@company/pricing";    // ❌ business logic in gateway
import { RateLimitConfig } from "@company/infra";   // ✅ infrastructure concern
import { JWTValidator } from "@company/auth-infra"; // ✅ infrastructure concern
```
If your gateway imports domain models or business logic packages, it has crossed the line. Gateway dependencies should be limited to infrastructure libraries: HTTP frameworks, auth utilities, observability SDKs, and rate limiting libraries. If you count more than 5 domain-specific imports, you have a serious case of this anti-pattern.
Here's a simple heuristic: count the number of distinct microservice URLs in your gateway config or code. A clean gateway has one URL per route (the downstream service it routes to). An overloaded gateway has multiple URLs per route because it orchestrates calls to several services before responding.
Another red flag: if your gateway has try-catch blocks with service-specific error handling. A clean gateway doesn't need to know what happens when the Inventory Service returns a 503. It just forwards the error. An overloaded gateway catches the error, falls back to cached inventory data, and returns a degraded response. That's business logic masquerading as error handling.
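The difference is easy to see side by side. Both functions below are hypothetical sketches of how a gateway might react to an upstream 503:

```typescript
// Clean gateway: forward the upstream status untouched.
// It has no opinion about what a 503 from Inventory "means".
export function forwardError(upstreamStatus: number): { status: number } {
  return { status: upstreamStatus };
}

// Overloaded gateway: a business decision hiding as error handling.
// Falling back to cached inventory is product behavior, not routing.
export function degradedFallback(
  upstreamStatus: number,
  cachedInventory: string[]
): { status: number; body?: string[] } {
  if (upstreamStatus === 503) {
    return { status: 200, body: cachedInventory }; // business logic!
  }
  return { status: upstreamStatus };
}
```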
The most dangerous version of this anti-pattern is invisible in metrics. If the gateway calls are fast (under 50ms each), latency looks fine. The real cost shows up as deployment coupling: every product change requires a gateway release. Check your deploy logs, not just your dashboards.
The Fix
The fix comes in three parts: clarify what belongs in the gateway, extract what doesn't, and add a BFF layer for orchestration.
Step 1: Draw the Line
This table is your decision framework. Print it and tape it to your monitor. Every time someone proposes adding logic to the gateway, run it through this table.
The distinction comes down to one question: does this concern change when business requirements change, or when infrastructure requirements change?
- Infrastructure concerns change when you switch cloud providers, update security policies, or adjust capacity. Rate limiting, TLS, JWT validation, routing.
- Business concerns change when the product manager writes a ticket. Discount logic, feature flags, response shaping, data enrichment.
If someone says "but it's only a 5-line change in the gateway," they're looking at the cost of the change, not the cost of the coupling. The coupling is what kills you over time.
Here's how I explain it in code reviews: "This PR is 5 lines. The next 20 PRs that follow this pattern are 5 lines each. Now we have 100 lines of business logic in the gateway and a 3-week extraction project on our hands."
| Concern | In Gateway? | Where Instead |
|---|---|---|
| TLS termination | ✅ Yes | Gateway |
| JWT signature verification | ✅ Yes | Gateway |
| Rate limiting by IP or user | ✅ Yes | Gateway |
| Request routing by path/host | ✅ Yes | Gateway |
| Response aggregation from N services | ❌ No | BFF service |
| Business rule application | ❌ No | Domain service |
| User data enrichment | ❌ No | BFF or the calling service |
| Feature flag evaluation | ❌ No | Application layer |
| Protocol translation (REST to gRPC) | ⚠️ Sometimes | Gateway or dedicated adapter |
The test: does this logic change when business requirements change? If yes, it doesn't belong in the gateway.
For your interview: memorize this table. It's the single most useful framework for gateway design discussions. When someone proposes putting feature flags in the gateway, you can point to this and say "that changes when business requirements change, so it goes in the application layer."
Step 2: Extract to a BFF
The Backend-for-Frontend (BFF) is a thin service per client type (mobile BFF, web BFF) that sits behind the gateway. It aggregates upstream service responses, applies client-specific transformations, and handles orchestration. The gateway treats the BFF as just another upstream.
Why per-client-type? Because different clients have different data needs. Your mobile app needs a trimmed payload with just the essential fields to save bandwidth. Your web dashboard needs the full response with nested details. Your partner API needs a completely different response shape. When these transformations live in the gateway, the gateway accumulates client-specific code paths. In a BFF, each client type gets its own service with its own deployment cycle.
The key insight: the BFF owns the "how do I compose a response for this client?" question. The gateway only owns "where does this request go?"
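Here's what per-client response shaping might look like inside the BFFs. The `FullOrder` type and field names are invented for illustration:

```typescript
// Hypothetical full response assembled from upstream services.
type FullOrder = {
  id: string;
  total: number;
  lineItems: { sku: string; qty: number; unitPrice: number }[];
  shipping: { carrier: string; etaDays: number };
};

// Mobile BFF: trim to the essentials to save bandwidth.
export function shapeForMobile(order: FullOrder) {
  return { id: order.id, total: order.total, etaDays: order.shipping.etaDays };
}

// Web BFF: the dashboard wants the full nested payload.
export function shapeForWeb(order: FullOrder): FullOrder {
  return order;
}
```

Each shaping function lives in its own BFF, so a mobile payload change never touches the web path.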
The gateway has no business knowledge. Each BFF owns exactly the orchestration logic for its client type. When the checkout flow changes, you deploy the BFF, not the gateway.
This also gives you independent scaling. If mobile traffic spikes during a sale, you scale the Mobile BFF without touching the gateway or the Web BFF. The gateway's resource usage stays flat because it's only doing routing.
Step 3: Migrate Incrementally
Don't try to extract everything at once. My recommendation: pick the single most complex gateway endpoint, build a BFF route for it, and redirect traffic. Then repeat.
Here's a practical migration checklist:
- Identify the worst offender. Which gateway endpoint has the most downstream calls? Start there.
- Build the BFF endpoint. Copy the logic verbatim. Don't refactor yet.
- Shadow traffic. Send a copy of production traffic to the BFF and compare responses.
- Switch routing. Update the gateway config to route to the BFF instead.
- Remove gateway code. Only after the BFF is serving production traffic successfully.
- Repeat. Move to the next most complex endpoint.
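The shadow-traffic step (step 3) can be sketched as a comparison wrapper. This is a simplified illustration with invented names; production setups usually mirror traffic at the proxy layer instead:

```typescript
// Serve the user from the primary path; run the shadow path on the side
// and only compare results. Shadow failures must never reach the user.
export async function shadowCompare(
  primary: () => Promise<unknown>, // current gateway logic (serves the user)
  shadow: () => Promise<unknown>   // new BFF endpoint (response is discarded)
): Promise<{ primaryResult: unknown; matched: boolean }> {
  const primaryResult = await primary();
  try {
    const shadowResult = await shadow();
    // Naive deep-compare for illustration; log mismatches, don't fail requests.
    const matched =
      JSON.stringify(primaryResult) === JSON.stringify(shadowResult);
    return { primaryResult, matched };
  } catch {
    return { primaryResult, matched: false };
  }
}
```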
```typescript
// BEFORE: gateway handles checkout orchestration
// gateway/routes/checkout.ts
app.post("/checkout", async (req, res) => {
  const user = await userService.getUser(req.userId);     // domain call
  const flags = await featureFlags.get("checkout-v2");    // domain call
  const order = await orderService.create(req.body);      // domain call
  const shipping = await shippingService.estimate(order); // domain call
  if (user.tier === "premium") order.discount = 0.10;     // business logic!
  res.json({ order, shipping, flags });
});
```
```typescript
// AFTER: gateway just routes, BFF handles orchestration
// gateway config (nginx/Kong/etc):
//   route: /api/checkout/* -> checkout-bff:3000

// checkout-bff/routes/checkout.ts
app.post("/checkout", async (req, res) => {
  const [user, flags] = await Promise.all([
    userService.getUser(req.userId),
    featureFlags.get("checkout-v2"),
  ]);
  const order = await orderService.create(req.body);
  const shipping = await shippingService.estimate(order);
  if (user.tier === "premium") order.discount = 0.10;
  res.json({ order, shipping, flags });
});
```
The logic is identical, but now it lives in a service that the product team owns and deploys independently from gateway infrastructure. The BFF can have its own CI/CD pipeline, its own scaling policy, and its own on-call rotation.
Notice the BFF version also parallelizes the user and feature flag lookups with Promise.all. When orchestration lives in a dedicated service, you can optimize the call pattern without worrying about affecting gateway throughput for unrelated endpoints.
The gateway config change is the key moment. The gateway goes from "I know how to build a checkout response" to "I know that /api/checkout goes to checkout-bff:3000." That's the transformation you want.
Common migration mistakes:
- Building the BFF as a monolith that aggregates everything. Keep BFFs focused per client type or per domain.
- Leaving dead code in the gateway after migration. Remove it. Dead gateway code creates confusion during incidents.
- Skipping the shadow traffic step. Comparing BFF responses against gateway responses catches subtle bugs before users do.
Severity and Blast Radius
Severity: High. The gateway is a shared chokepoint. Every API in your system flows through it. When the gateway is slow, everything is slow. There's no circuit breaker that helps here because the gateway is the entry point.
One overloaded checkout path cascades into every API. The load balancer sees health check timeouts and starts removing gateway instances, which increases load on remaining instances. This is a classic cascading failure.
In the worst case I've seen, a team lost 3 of 4 gateway instances in under 2 minutes because the health checks timed out. The remaining instance couldn't handle the full traffic load, and the entire platform went dark for 45 minutes.
| Dimension | Impact |
|---|---|
| Latency | All APIs degrade when gateway is busy |
| Availability | Gateway OOM or thread exhaustion = total outage |
| Deployment coupling | Business changes require infra team deploys |
| Blast radius | Total: auth, routing, rate limiting, all business endpoints |
| Recovery time | Weeks (2-6 sprints for full extraction) |
| Cascading failure risk | High: one slow downstream service blocks all requests |
Recovery is measured in weeks, not hours. You can't just "remove the logic." You need to build the BFF service, migrate endpoints one at a time, and verify each migration under production traffic. I've seen this extraction take 2-6 sprints depending on how deeply the business logic is embedded.
The worst part: when this anti-pattern causes an outage, the fix is "deploy less to the gateway," which means product features get delayed. The organizational cost often exceeds the technical cost.
If you're calculating the cost of this anti-pattern for a business case, add these numbers:
- Gateway-related incident count per quarter
- Average time-to-restore for gateway incidents
- Number of product features delayed by gateway deployment bottlenecks
- Engineering hours spent on gateway code reviews for business logic changes
That usually makes the case for extraction clear. When the CTO sees "12 product features were delayed by gateway deployment conflicts last quarter," the BFF project gets funded.
When It's Actually OK
Not every bit of logic in the gateway is a sin. Here's where the line gets blurry:
- Prototype or MVP with fewer than 3 services. If you have 2 microservices and one client, a little aggregation in the gateway won't kill you. Just know you're taking on debt that compounds fast.
- Simple header enrichment. Adding a request ID, correlation ID, or trace context in the gateway is fine. That's infrastructure, not business logic. The key test: this logic doesn't change when your product manager writes a ticket.
- A/B routing by header. Routing 10% of traffic to a canary based on a header value is a gateway concern. The gateway isn't making business decisions, it's making infrastructure decisions about traffic distribution.
- You have fewer than 5 engineers. The overhead of a separate BFF service (new repo, CI/CD pipeline, deployment target, monitoring) may not be worth it yet. Revisit when gateway deploys start blocking product work.
- Protocol translation at the edge. Converting REST to gRPC at the gateway boundary is acceptable if it's a uniform transformation applied to all traffic, not endpoint-specific logic that changes per business route.
The bright line: if the logic changes when business requirements change, extract it. If it changes when infrastructure requirements change, it can stay. When in doubt, ask: "Would the product manager care about this code?" If yes, it doesn't belong in the gateway.
Related anti-patterns: If your overloaded gateway is also the only way services communicate with each other (not just external clients), you may have the closely related "God Gateway" pattern, where internal service-to-service calls route through the gateway unnecessarily. Internal traffic should use direct service-to-service communication or a service mesh, not the API gateway.
How This Shows Up in Interviews
This comes up whenever you draw an API gateway in your architecture. The interviewer is testing whether you understand the boundary between infrastructure and business logic. I've seen candidates lose points by casually saying "the gateway aggregates responses from three services" without realizing they just created a single point of failure.
Most interviewers won't explicitly ask "what shouldn't go in the gateway?" Instead, they'll ask you to design the checkout flow or the search experience, and they'll listen for whether you instinctively separate infrastructure from business logic. The best candidates make this separation without being prompted.
A strong answer includes:
- Naming exactly what the gateway does: "TLS termination, JWT validation, rate limiting, and path-based routing."
- Explicitly saying what it doesn't do: "Orchestration and response aggregation live in a BFF layer behind the gateway."
- Explaining why: "Keeping the gateway thin means it deploys independently of business logic. Feature changes don't require gateway releases."
For your interview: say "the gateway handles cross-cutting infrastructure, and a BFF handles orchestration" then move on. Don't spend 5 minutes on gateway internals unless the question specifically asks about gateway design.
If the interviewer pushes back with "why not just put it in the gateway?", you have the perfect response: "Because the gateway is a shared resource. Business logic in the gateway means every product change requires an infrastructure deployment, and one slow endpoint degrades all APIs."
Say: "The gateway handles cross-cutting infrastructure concerns: routing, auth, rate limiting. Business orchestration lives in BFF services behind the gateway. This keeps the gateway thin, fast, and independently deployable."
Quick Recap
- An API gateway should only house cross-cutting concerns: routing, auth validation, rate limiting, and TLS termination. Nothing else.
- Business logic, response aggregation, and orchestration creep into the gateway because each addition seems small and the gateway already has the auth context.
- An overloaded gateway becomes a shared bottleneck: when it's slow, every API is slow. There's no isolation between endpoints. One bad path degrades all paths.
- The fix is a Backend-for-Frontend layer that sits behind the gateway and owns business orchestration per client type.
- Detect the anti-pattern by checking outbound calls per request, domain model imports in gateway code, and gateway deploy frequency relative to business changes.
- Migrate incrementally: extract one endpoint at a time to a BFF, shadow traffic to compare responses, verify under production load, then cut over.
- At very small scale (fewer than 5 engineers, fewer than 3 services), a little gateway logic is acceptable debt. Set a clear trigger for extraction and revisit regularly.
- In interviews, always separate gateway concerns (infrastructure) from BFF concerns (business). This distinction is one of the strongest signals of architectural maturity.