RPC vs. messaging
The architectural difference between synchronous RPC calls (gRPC, REST) and asynchronous messaging (Kafka, SQS): when each model fits, the failure modes, and why the choice affects your entire service topology.
TL;DR
| Dimension | Choose RPC | Choose Messaging |
|---|---|---|
| Coupling | Caller needs the result to proceed (checkout needs payment confirmation) | Caller doesn't need the result immediately (send notification, update analytics) |
| Failure mode | Prefer fast failure with immediate error feedback | Prefer buffered retry with eventual delivery |
| Latency | Need sub-50ms response, direct call to callee | Can tolerate seconds-to-minutes of processing delay |
| Fan-out | One caller, one callee, point-to-point | One event, many consumers (notifications + analytics + inventory) |
| Traffic shape | Steady, predictable throughput | Bursty traffic that needs buffering (flash sales, batch jobs) |
Default answer: RPC for the user-facing request/response path, messaging for side-effects and background work. Most production systems use both. The checkout call is RPC (you need to know if payment succeeded), but the post-purchase email, analytics event, and inventory update go through a message broker.
The Framing
A team I worked with built an order service that made synchronous RPC calls to five downstream services: payment, inventory, notifications, analytics, and fraud detection. Every order went through all five before the user saw "Order confirmed."
Then the analytics service deployed a bad query that added 800ms to every response. Suddenly the order endpoint went from 200ms to 1,000ms. Users started abandoning checkout. But the analytics data wasn't even user-visible. It was a background metric pipeline holding the checkout hostage.
The fix was straightforward: payment and inventory stayed as RPC calls (the order can't complete without them). Notifications, analytics, and fraud detection moved to a Kafka topic. The order service publishes an "order.placed" event and returns immediately after payment + inventory succeed. The three background services consume the event at their own pace.
Order latency dropped back to 200ms. When analytics deploys a bad query now, its consumer falls behind, messages queue up, and nobody notices until the dashboards are delayed. The checkout path is completely unaffected.
This pattern is the core of the tradeoff: RPC couples services in time (both must be available simultaneously), messaging decouples them (the broker absorbs timing differences). The question is which services belong on the critical path and which don't.
How Each Works
RPC: Synchronous Request/Response
RPC (Remote Procedure Call) makes a network call look like a local function call. Service A calls Service B, waits for the response, and continues. The caller blocks until the callee responds or times out.
```python
# gRPC client calling a payment service
import grpc

from payment_pb2 import ChargeRequest
from payment_pb2_grpc import PaymentServiceStub

channel = grpc.insecure_channel("payment-service:50051")
client = PaymentServiceStub(channel)

# Blocks until response or timeout
response = client.Charge(
    ChargeRequest(
        order_id="ord_abc123",
        amount_cents=4999,
        currency="USD",
        idempotency_key="idem_xyz789",
    ),
    timeout=5.0,  # 5 second timeout
)

if response.status == "SUCCESS":
    proceed_with_order()
else:
    handle_payment_failure(response.error)
```
The strength of RPC is immediate feedback. You know right now whether the payment succeeded. You can make a decision on the next line of code. The programming model is simple: call a function, get a result.
The weakness is temporal coupling. Both services must be up at the same time. If the callee is slow, the caller is slow. If the callee is down, the caller fails (or times out). In a chain of RPC calls (A calls B calls C calls D), slowness or failure at any point cascades back through every caller.
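A toy model makes the cascade concrete (hypothetical numbers; `call_chain` is an illustrative helper, not a real RPC stack):

```python
# Toy simulation: in a blocking chain A -> B -> C -> D, every caller
# inherits the latency and failures of everything below it.

def call_chain(services):
    """Each entry is (name, latency_ms, healthy). The first caller's
    latency is the sum of the whole chain; any failure propagates up."""
    total_ms = 0.0
    for name, latency_ms, healthy in services:
        total_ms += latency_ms
        if not healthy:
            return {"status": "error", "failed_at": name, "latency_ms": total_ms}
    return {"status": "ok", "latency_ms": total_ms}

# Healthy chain: four services at 50ms each -> the user waits 200ms
print(call_chain([("A", 50, True), ("B", 50, True), ("C", 50, True), ("D", 50, True)]))
# C degrades to 850ms -> every upstream caller now waits 1000ms
print(call_chain([("A", 50, True), ("B", 50, True), ("C", 850, True), ("D", 50, True)]))
```

One slow hop re-prices the entire chain, which is exactly what happened in the analytics story above.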
Two dominant RPC protocols exist:
| Feature | REST (HTTP/JSON) | gRPC (HTTP/2 + Protobuf) |
|---|---|---|
| Protocol | HTTP/1.1 (HTTP/2 optional) | HTTP/2 |
| Serialization | JSON, human-readable text | Protobuf, compact binary (~3-10x smaller) |
| Contract | OpenAPI spec (optional) | .proto file (required, code-generated) |
| Streaming | Not native (SSE or WebSockets as workarounds) | Bidirectional streaming built in |
| Latency | ~1-5ms serialization overhead | ~0.1-0.5ms serialization overhead |
| Browser support | Native | Requires grpc-web proxy |
| Tooling | curl, Postman, any HTTP client | Requires protoc, language-specific stubs |
My rule: REST for public APIs and browser-facing services. gRPC for internal service-to-service communication where latency and type safety matter.
Messaging: Asynchronous Fire-and-Forget
Messaging decouples the sender from the receiver with a broker in between. The sender publishes a message and moves on immediately. The broker stores the message durably. Consumers read messages at their own pace.
```python
# Kafka producer: publish order event
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka-1:9092", "kafka-2:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",  # Wait for all replicas to acknowledge
    retries=3,
)

# Fire and forget: returns immediately after broker ack
producer.send(
    topic="order.events",
    key=b"ord_abc123",  # Partition by order ID for ordering
    value={
        "event_type": "order.placed",
        "order_id": "ord_abc123",
        "user_id": "usr_456",
        "items": [{"sku": "WIDGET-1", "qty": 2}],
        "total_cents": 4999,
        "timestamp": "2025-03-15T14:30:00Z",
    },
)
# Producer continues immediately, doesn't wait for consumers
```
The strength is decoupling in time and space. The producer doesn't know or care which consumers exist, how many there are, or whether they're currently running. If a consumer crashes, messages queue up and are delivered when it recovers. If traffic spikes 10x, the broker absorbs the burst while consumers process at a sustainable rate.
The weakness is delayed feedback. The producer doesn't know if the consumer successfully processed the message. Failures are visible minutes later (dead letter queue, consumer lag alerts), not immediately. Debugging is harder because the request and processing are separated in time.
Head-to-Head Comparison
| Dimension | RPC | Messaging | Verdict |
|---|---|---|---|
| Coupling | Temporal + spatial: both services must be up simultaneously | Decoupled: producer and consumer work independently | Messaging for resilience |
| Latency | Direct call, sub-ms to low-ms (same datacenter) | Broker round-trip + consumer poll interval (10ms-1s) | RPC for latency-sensitive |
| Failure feedback | Immediate: caller knows within timeout | Delayed: minutes to hours via DLQ and lag monitoring | RPC for critical operations |
| Failure propagation | Cascading: slow callee slows all callers | Absorbed: slow consumer just builds queue depth | Messaging for isolation |
| Fan-out | Point-to-point: one caller, one callee per call | Pub/sub: one event, unlimited consumers | Messaging, decisively |
| Traffic buffering | None: caller must handle the full load | Broker absorbs spikes, consumers process at steady rate | Messaging for bursty traffic |
| Ordering | Natural: request-response preserves order | Complex: partition-level ordering only, rebalancing breaks it | RPC for strict ordering |
| Debugging | Trace a single request: call stack, logs, response | Trace across producer/broker/consumer: correlation IDs, lag | RPC, simpler |
| Transactions | 2PC or saga orchestration | Saga choreography with compensation events | RPC for coordinated ops |
| Contract management | OpenAPI / Protobuf schema, versioned APIs | Schema registry (Avro/Protobuf), topic naming conventions | Similar complexity |
The honest assessment: RPC is simpler to reason about because the call stack is linear and errors are immediate. Messaging is more resilient because it absorbs failures and decouples deployments. My approach in every design: start with RPC for the happy path, then identify which downstream calls can be moved to messaging without breaking the user experience.
When RPC Wins
RPC is the right model when the caller genuinely needs the result to continue.
Payment processing. The checkout page can't show "Order confirmed" without knowing if the charge succeeded. You need synchronous feedback. A message that says "payment will be processed eventually" doesn't work for a user staring at a checkout screen.
Data reads that drive UI. GET /user/123, GET /product/456. The frontend needs the data to render the page. There's no async alternative to "give me the profile, I'll display it."
Multi-step workflows where each step depends on the previous. Reserve inventory, then charge payment, then create shipment. Each step needs the result of the previous step. An RPC chain (with circuit breakers) is simpler than orchestrating this through message passing.
Low-latency requirements. Internal service calls in a microservices architecture where p99 latency must be under 50ms. Message broker round-trips add 10-50ms of overhead that you can't afford on latency-sensitive paths.
Strong consistency needs. When you need to know that multiple services agree on the outcome of an operation before committing. RPC with 2PC or saga orchestration gives you coordinated confirmation. Messaging gives you eventual consistency.
When Messaging Wins
Messaging is the right model when the caller doesn't need the result to continue, and especially when reliability under failure matters more than speed.
Side-effects after the main operation. Order placed: send confirmation email, update analytics, trigger fraud scan, adjust inventory projections. None of these need to complete before the user sees "order confirmed." If the email service is down, the email should be delivered later, not block the checkout.
Fan-out to multiple consumers. One event, many listeners. Kafka's consumer groups let you add new consumers without changing the producer. When the marketing team wants to subscribe to "order.placed" events for campaign targeting, they deploy their own consumer. No changes to the order service.
Traffic spike absorption. Black Friday spike at 50x normal order volume. RPC: if downstream services can't handle 50x throughput, they fail, and the checkout fails. Messaging: the broker absorbs 50x volume, consumers process at their sustainable rate (maybe 5x normal), and the queue drains over minutes. Orders are processed slower but none are lost.
Long-running processing. Video encoding (minutes), ML model training (hours), report generation (seconds to minutes). Holding an HTTP connection open for 10 minutes is fragile. Publish a job to a queue, return a job ID, let the client poll or subscribe for completion.
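The job-queue shape can be sketched with the standard library alone (an in-memory `queue.Queue` stands in for SQS/Kafka; `submit_encode_job` and the status dict are illustrative names, not a real API):

```python
import queue
import threading
import uuid

jobs: "queue.Queue" = queue.Queue()   # stand-in for a durable queue (SQS, Kafka)
results = {}                          # stand-in for a job-status store

def submit_encode_job(video_url: str) -> str:
    """Enqueue the work and return a job ID immediately;
    the client polls (or subscribes) for completion later."""
    job_id = str(uuid.uuid4())
    results[job_id] = "queued"
    jobs.put({"job_id": job_id, "video_url": video_url})
    return job_id

def worker() -> None:
    """Consumer: pulls jobs and processes them at its own pace."""
    while True:
        job = jobs.get()
        if job is None:                     # sentinel to stop the worker
            break
        results[job["job_id"]] = "done"     # real work (encoding) goes here
        jobs.task_done()

t = threading.Thread(target=worker)
t.start()
job_id = submit_encode_job("s3://bucket/raw.mp4")   # returns instantly
jobs.join()                                         # wait only for demo purposes
jobs.put(None)
t.join()
print(results[job_id])  # "done"
```

No HTTP connection is held open during processing; the client holds only a job ID.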
Cross-service data synchronization. CDC (Change Data Capture) from a database publishes change events. Search indexer, cache warmer, and analytics pipeline consume independently. Each consumer maintains its own projection of the data without coupling to the source service's API.
The Nuance
The Hybrid Pattern Is the Default
Almost every production microservices system uses both RPC and messaging. The split follows a simple rule:
The checkout path is RPC: charge the card, reserve inventory, return confirmation. The post-checkout side-effects are messaging: email, analytics, fraud, recommendations. The user sees the result of the RPC path; the messaging path runs in the background.
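A minimal sketch of that split, with stubbed services (all function names here are illustrative; in production the stubs would be gRPC/REST calls and a real producer):

```python
published_events = []   # stand-in for a Kafka topic

def charge_payment(order_id: str, amount_cents: int) -> bool:
    return True   # stub: a real synchronous RPC call goes here

def reserve_inventory(order_id: str) -> bool:
    return True   # stub

def publish(topic: str, event: dict) -> None:
    published_events.append((topic, event))   # stand-in for producer.send

def place_order(order_id: str, amount_cents: int) -> dict:
    # Critical path: synchronous RPC, the user is waiting
    if not charge_payment(order_id, amount_cents):
        return {"status": "payment_failed"}
    if not reserve_inventory(order_id):
        return {"status": "out_of_stock"}
    # Side-effects: fire-and-forget; email/analytics/fraud consume later
    publish("order.events", {"event_type": "order.placed", "order_id": order_id})
    return {"status": "confirmed"}

print(place_order("ord_abc123", 4999))  # {'status': 'confirmed'}
```

The user-visible latency is bounded by the two RPC calls; everything after the publish happens off the critical path.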
Messaging Failure Modes Are Insidious
People choose messaging for reliability but underestimate its failure modes:
Delayed failure visibility. A consumer bug silently drops every 100th message. With RPC, the caller would see an error immediately. With messaging, you don't notice until a customer complains they never got their confirmation email. Monitoring consumer lag and dead letter queue depth is essential.
Ordering complexity. Kafka guarantees order within a partition, but not across partitions. If you partition by user_id, all events for one user are ordered. But if user A sends a message that affects user B's data, the ordering across partition A and partition B is undefined. Handling this requires careful partition key design.
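The key-to-partition mapping can be sketched as follows (Kafka's default partitioner actually uses murmur2; this uses MD5 as a deterministic stand-in to show the property, not the real algorithm):

```python
import hashlib

NUM_PARTITIONS = 12

def partition_for(key: str) -> int:
    """Simplified stand-in for Kafka's key partitioner: hash the key,
    take it modulo the partition count. Same key -> same partition."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# All events keyed by one user land on one partition -> ordered for that user
assert partition_for("usr_456") == partition_for("usr_456")

# A different user may land on a different partition: no cross-user ordering
print(partition_for("usr_456"), partition_for("usr_789"))
```

This is why cross-key causality (user A's event affecting user B's data) needs careful key design: ordering only holds within a single partition.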
Dead letter queue management. Messages that fail repeatedly end up in a DLQ. Someone needs to investigate, fix, and replay them. In RPC, the caller handles the error inline. In messaging, DLQ management is a separate operational concern that teams underinvest in.
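The retry-then-park flow looks roughly like this (an in-memory list stands in for a real DLQ topic; the names and retry count are illustrative):

```python
MAX_ATTEMPTS = 3
dead_letter_queue = []   # stand-in for a real DLQ topic or queue

def process_with_retry(message: dict, handler) -> bool:
    """Try the handler up to MAX_ATTEMPTS times; park the message
    (with its last error) in the DLQ if it keeps failing."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            handler(message)
            return True
        except Exception as exc:
            last_error = str(exc)
            # a real consumer would sleep with exponential backoff here
    dead_letter_queue.append(
        {"message": message, "error": last_error, "attempts": MAX_ATTEMPTS}
    )
    return False

def flaky_handler(message):
    raise ValueError("downstream unavailable")

process_with_retry({"order_id": "ord_abc123"}, flaky_handler)
print(len(dead_letter_queue))  # 1 -- alert on DLQ depth > 0
```

The operational work starts where this sketch ends: someone has to watch that depth, diagnose the parked messages, and replay them.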
The Saga Pattern Bridges Both Worlds
Distributed transactions across services need coordination, and the saga pattern bridges the two models. The orchestrator coordinates the steps via messaging (durable, retryable), but the overall pattern has RPC-like semantics: a request enters, steps execute in order, and the outcome is either full success or full compensation. Temporal and Cadence formalize this pattern.
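The orchestration logic reduces to a small loop (a minimal sketch with in-memory steps; the step names are illustrative, and real orchestrators persist state between steps):

```python
def run_saga(steps):
    """Execute (action, compensation) pairs in order; on any failure,
    run the compensations for completed steps in reverse order."""
    completed = []
    for action, compensate in steps:
        try:
            action()
            completed.append(compensate)
        except Exception:
            for comp in reversed(completed):
                comp()
            return "compensated"
    return "committed"

def fail_shipment():
    raise RuntimeError("shipment service down")

log = []
steps = [
    (lambda: log.append("reserve_inventory"), lambda: log.append("release_inventory")),
    (lambda: log.append("charge_payment"),    lambda: log.append("refund_payment")),
    (fail_shipment,                           lambda: None),
]
outcome = run_saga(steps)
print(outcome, log)
# compensated ['reserve_inventory', 'charge_payment', 'refund_payment', 'release_inventory']
```

Note the reverse order of compensation: the refund runs before the inventory release, undoing the steps last-in, first-out.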
Real-World Examples
Uber: Uses both extensively. User-facing flows (ride request, fare estimation, ETA calculation) use gRPC for sub-100ms internal service calls. Background flows (trip analytics, driver payments, surge pricing calculations) use Kafka. Their architecture has 4,000+ microservices, and the rule is clear: if the user is waiting for the response, use gRPC. If not, use Kafka. They process millions of Kafka events per second for analytics alone.
Stripe: Payment processing is synchronous RPC (the API caller needs "charge succeeded" or "charge failed" immediately). But post-payment processing (receipt emails, webhook delivery, fraud model training, financial reconciliation) is all messaging. Their webhook delivery system uses a message queue with retry logic: if your server is down, Stripe retries the webhook for up to 72 hours with exponential backoff.
LinkedIn: Kafka was invented at LinkedIn to solve exactly this problem. Their original architecture used RPC between services, and cascade failures were a constant operational problem. They built Kafka to decouple producers from consumers, enabling 7+ trillion messages per day. Activity tracking, notifications, search indexing, and ad targeting are all driven by Kafka events, while user-facing API calls remain synchronous RPC.
How This Shows Up in Interviews
This tradeoff appears in every system design interview that involves multiple services. The interviewer wants to see that you know which communication model to use and why, not that you default to one or the other.
What they're testing: Can you draw the line between operations that need synchronous feedback and operations that can be deferred? Do you understand cascade failure risks? Can you articulate why you'd put something on a queue instead of making a direct call?
Depth expected at senior level:
- Know when each model applies (result needed vs. fire-and-forget)
- Explain cascade failure in RPC chains and name mitigations (circuit breakers, timeouts, bulkheads)
- Describe dead letter queue management for messaging failures
- Know the hybrid pattern: RPC for the critical path, messaging for side-effects
- Explain partition-level ordering in Kafka and its implications
| Interviewer asks | Strong answer |
|---|---|
| "How do services communicate in your design?" | "The checkout path uses gRPC: payment and inventory require synchronous responses. Post-checkout side-effects (email, analytics, fraud) publish to a Kafka topic. This keeps checkout latency at 200ms while background processing happens independently." |
| "What happens if the notification service is down?" | "If it's RPC, the order fails or degrades. If it's messaging, the event queues in Kafka until the notification service recovers. Since notifications aren't critical to order success, messaging is the right model. The user gets their confirmation immediately; the email arrives when the service is back." |
| "How do you handle failures in the message consumer?" | "Retry with exponential backoff (3 attempts). After max retries, the message moves to a dead letter queue. We alert on DLQ depth > 0 and have a replay tool to reprocess failed messages after fixing the bug. Consumer offsets don't advance past the failed message until it's handled." |
| "Why not use messaging for everything?" | "Messaging adds latency (broker round-trip + poll interval) and loses immediate feedback. For payment processing, I need to know right now if the charge succeeded. For a confirmation email, I just need it to arrive eventually. The communication model should match the operation's requirements." |
| "How do you prevent cascade failures with RPC?" | "Circuit breakers (open after 5 failures, half-open after 30s), timeouts on every call (3s for user-facing), bulkheaded connection pools per downstream service, and idempotent retries with exponential backoff. The most important one is the timeout: without it, a slow downstream holds the caller indefinitely." |
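The circuit-breaker answer in that last row can be sketched in a few lines (a minimal in-process breaker using the thresholds from the table; production systems typically use a library like resilience4j or Envoy's outlier detection instead):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after `threshold` consecutive
    failures, then allows a single probe call (half-open) once
    `reset_after` seconds have elapsed."""

    def __init__(self, threshold=5, reset_after=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.reset_after = reset_after
        self.clock = clock        # injectable for testing
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            # half-open: fall through and let one probe call proceed
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0         # success resets the breaker
        self.opened_at = None
        return result

breaker = CircuitBreaker()

def always_down():
    raise ConnectionError("payment-service unreachable")

for _ in range(5):                # 5 real failures trip the breaker
    try:
        breaker.call(always_down)
    except ConnectionError:
        pass

try:
    breaker.call(always_down)     # no network call is made now
except RuntimeError as e:
    print(e)                      # circuit open: failing fast
```

Once open, callers fail in microseconds instead of waiting out a timeout, which is what stops a slow downstream from consuming every upstream thread.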
Gotcha: don't say 'use Kafka for everything'
Kafka is great for event streams, but it adds operational complexity (ZooKeeper/KRaft, partition management, consumer group rebalancing) and latency compared to direct RPC. Using Kafka for a simple synchronous GET request is over-engineering. Show you know when the complexity is justified.
Quick Recap
- RPC is synchronous request-response: the caller blocks until the callee responds. Use it when the caller needs the result to proceed (payment confirmation, data reads, inventory checks).
- Messaging decouples sender and receiver with a broker. Use it when the caller doesn't need immediate feedback (notifications, analytics, background processing) and when you need fan-out, buffering, or failure isolation.
- RPC propagates failures: a slow or failing downstream service makes every upstream caller slow or failing. Circuit breakers, timeouts, and bulkheads mitigate but don't eliminate this risk.
- Messaging absorbs failures: a slow consumer just builds queue depth, not caller latency. But failure visibility is delayed (DLQ, lag monitoring) instead of immediate.
- The hybrid pattern is the industry standard: RPC for the user-facing critical path, messaging for side-effects and background work. Almost every production microservices system uses both.
- In interviews, draw the line explicitly: "Payment is RPC because I need the charge result. Email is messaging because the user doesn't wait for delivery." This shows architectural judgment.
Related Trade-offs
- Sync vs. async for the broader synchronous vs. asynchronous communication pattern
- Event-driven architecture for event sourcing, choreography, and the full event-driven model
- Message queues for deep dives into Kafka, SQS, RabbitMQ internals
- REST vs. GraphQL for the API-layer tradeoff when you've already chosen the RPC model
- Circuit breaker pattern for the primary defense against RPC cascade failures