Notification Service
Design a multi-channel notification service that delivers billions of push, email, and SMS notifications per day reliably, covering ingestion pipelines, fan-out strategies, deduplication, and guaranteed delivery.
What is a notification service?
A notification service delivers messages from your product to users across push, email, and SMS. Sending one notification is trivial. Sending a billion without losing any, without blowing APNs rate limits during a viral campaign, and without delivering the same notification twice because a worker crashed mid-send is where the real engineering lives.
I like opening with this system in interviews because it forces you to think about producer-consumer boundaries, external provider quirks, and idempotency in one design. A single answer has to cover message queues, fan-out strategies, and distributed delivery guarantees.
Functional Requirements
Core Requirements
- Send notifications through mobile push (APNs/FCM), email, and SMS channels.
- Support immediate and scheduled sends.
- Guarantee at-least-once delivery; deduplicate at the client where possible.
- Allow users to manage preferences and opt out of specific channels.
Below the Line (out of scope)
- In-app notification bell and inbox (badge counts, read/unread state)
- Transactional OTP flows with tight sub-3-second delivery requirements
- A/B testing of notification content and send-time optimization
- Rich push notifications with images and deep-link action buttons
The hardest part in scope: Fan-out at scale. When a platform pushes a new post to 50 million followers, those 50 million push notifications must land within minutes without saturating APNs, causing other notifications to queue behind the campaign for hours, or delivering duplicates if the fan-out worker crashes halfway through.
An in-app notification bell is out of scope because it requires a separate storage model (an inbox per user with read/unread state) and a real-time delivery mechanism. To add it, I would write notification records to a dedicated inbox table after successful delivery, expose a GET /users/{id}/notifications paginated endpoint, and push badge count updates via Server-Sent Events or WebSocket.
Transactional OTP flows share the SMS delivery path but require latency an order of magnitude tighter than general notifications (OTPs must arrive in under 3 seconds). To add them, I would route OTP events to a separate high-priority SMS queue that bypasses backpressure controls entirely and connects to a dedicated Twilio subaccount with reserved throughput.
A/B testing is a product layer on top of delivery. To add it, I would resolve the template variant in the notification service at enqueue time, assigning users to experiment arms via a feature flag SDK call before publishing to Kafka.
Rich push is out of scope because APNs and FCM have a 4KB payload size cap and separate documentation for media attachments. To add it, I would store media URLs in the notification payload and let the device SDK download assets asynchronously on receipt rather than bundling them in the push payload.
Non-Functional Requirements
Core Requirements
- Throughput: Handle 1M notifications per second at peak across all channels combined.
- Latency: Real-time notifications queued within 1 second of trigger; delivered within 10 seconds end to end for push.
- Availability: 99.99% uptime for the ingestion API. Delivery workers tolerate brief restarts as long as the queue persists.
- Durability: No notification is lost once accepted by the ingestion API.
- Scale: 1 billion registered devices, 500M DAU.
Below the Line
- Sub-second push delivery end to end (APNs and FCM add their own tail latency beyond our control)
- Real-time delivery receipts and per-user read confirmations
Write-heavy reality: This system is almost entirely writes. Every inbound event produces at least one dispatch per channel, and bulk campaigns produce millions. There is no hot read path comparable to a URL shortener; the challenge is absorbing enormous write throughput without data loss and without blowing through external provider rate limits every time a marketing team sends a campaign. Every design decision in this article traces back to that 1M peak writes per second constraint.
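To put the 1M per second peak in perspective, here is a quick back-of-the-envelope calculation. The daily volumes in the loop are assumptions for illustration only; the requirements say "billions per day" without fixing a number.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

PEAK_RATE_PER_SEC = 1_000_000  # from the throughput requirement above


def average_rate_per_sec(daily_volume: int) -> float:
    """Average send rate implied by a daily notification volume."""
    return daily_volume / SECONDS_PER_DAY


def peak_to_average_ratio(daily_volume: int) -> float:
    """How many times larger the stated peak is than the daily average."""
    return PEAK_RATE_PER_SEC / average_rate_per_sec(daily_volume)


# Hypothetical daily volumes, chosen only to show the shape of the math.
for daily in (1_000_000_000, 5_000_000_000, 10_000_000_000):
    avg = average_rate_per_sec(daily)
    ratio = peak_to_average_ratio(daily)
    print(f"{daily:>14,} per day -> {avg:>9,.0f}/s average, peak is {ratio:.1f}x average")
```

Sustaining 1M per second for a full day would mean 86.4 billion notifications, so under these assumptions the peak describes bursts well above the average rate. That gap is the argument for buffering in a durable queue rather than sizing every downstream component for the peak.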
I'd call out that the 99.99% availability target applies to the ingestion API only, not end-to-end delivery. Each external provider (APNs, FCM, Twilio, SES) carries its own SLA, and your system cannot exceed it. Design for at-least-once delivery and idempotent workers, not for real-time guarantees that depend on third-party uptime.
Core Entities
- Notification: A single delivery event carrying channel, recipient identifier, template reference, rendered payload, status, and optional scheduled delivery time.
- User: The recipient account, tied to a device token (push), email address, and phone number per channel.
- UserPreference: A per-user, per-channel opt-in flag with optional quiet-hours window configuration.
- NotificationTemplate: A reusable payload template with variable slots for personalization (order ID, username, amount, etc.).
- DeliveryLog: An append-only record of each delivery attempt: timestamp, outcome (success, transient failure, permanent failure), and provider response code.
Full schema, indexes, and column types are deferred to the data model deep dive. The entities above are enough to drive the API design and High-Level Design.
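As a rough sketch of how three of these entities might look in code, assuming Python dataclasses: every field name, the status default, and the quiet-hours representation are my assumptions for illustration, not the schema from the data model deep dive.

```python
from dataclasses import dataclass
from datetime import datetime, time
from enum import Enum
from typing import Optional


class Channel(str, Enum):
    PUSH = "push"
    EMAIL = "email"
    SMS = "sms"


class Outcome(str, Enum):
    SUCCESS = "success"
    TRANSIENT_FAILURE = "transient_failure"
    PERMANENT_FAILURE = "permanent_failure"


@dataclass
class Notification:
    notification_id: str
    channel: Channel
    recipient_id: str
    template_id: str
    rendered_payload: Optional[str] = None
    status: str = "pending"
    scheduled_at: Optional[datetime] = None  # None means "send now"


@dataclass
class UserPreference:
    user_id: str
    channel: Channel
    opted_in: bool = True
    quiet_start: Optional[time] = None  # hypothetical quiet-hours encoding
    quiet_end: Optional[time] = None

    def allows(self, at: time) -> bool:
        """True if the user opted in and `at` falls outside quiet hours."""
        if not self.opted_in:
            return False
        if self.quiet_start is None or self.quiet_end is None:
            return True
        if self.quiet_start <= self.quiet_end:
            in_quiet = self.quiet_start <= at < self.quiet_end
        else:  # window wraps past midnight, e.g. 22:00 to 07:00
            in_quiet = at >= self.quiet_start or at < self.quiet_end
        return not in_quiet


@dataclass(frozen=True)  # frozen: a delivery attempt is never edited after the fact
class DeliveryLog:
    notification_id: str
    attempted_at: datetime
    outcome: Outcome
    provider_response_code: str
```

For example, `UserPreference("u_123", Channel.PUSH, True, time(22, 0), time(7, 0)).allows(time(23, 30))` evaluates to `False`, because 23:30 falls inside the wrapped quiet window.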
API Design
FR 1 and FR 2: Send a notification:
# Accept a single notification and queue it for delivery
POST /v1/notifications
Body: {
  user_id: "u_123",
  channels: ["push", "email"],
  template_id: "order_confirmed",
  template_vars: { "order_id": "o_456" },
  scheduled_at?: "2026-03-29T15:00:00Z"
}
Response: { notification_id: "n_789", status: "queued" }
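Accepting channels as an array rather than a scalar lets the caller specify a primary channel with fallbacks in one request. Using template_id rather than a raw body keeps payloads auditable and narrows the injection surface (XSS in HTML email, for example), because callers supply only variables and never markup; the variables themselves still need escaping at render time. scheduled_at defaults to "now" when absent, so the same endpoint covers both immediate and scheduled sends.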
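The request handling described here can be sketched as a small validation function. This is a minimal sketch under stated assumptions: `parse_send_request`, the error messages, the treatment of timestamps without an offset as UTC, and the `scheduled`/`pending` internal status names are all illustrative.

```python
from datetime import datetime, timezone

ALLOWED_CHANNELS = {"push", "email", "sms"}


def parse_send_request(body: dict, now: datetime) -> dict:
    """Validate a POST /v1/notifications body and fill in defaults.

    `now` must be timezone-aware so it can be compared with scheduled_at.
    """
    user_id = body.get("user_id")
    if not user_id:
        raise ValueError("user_id is required")

    channels = body.get("channels") or []
    if not channels:
        raise ValueError("channels must be a non-empty list")
    unknown = set(channels) - ALLOWED_CHANNELS
    if unknown:
        raise ValueError(f"unknown channels: {sorted(unknown)}")

    template_id = body.get("template_id")
    if not template_id:
        raise ValueError("template_id is required")

    raw = body.get("scheduled_at")
    if raw is None:
        scheduled_at = now  # absent means "send now"
    else:
        # Older Python versions reject a trailing "Z", so normalise it first.
        scheduled_at = datetime.fromisoformat(raw.replace("Z", "+00:00"))
        if scheduled_at.tzinfo is None:
            # Assumption: timestamps without an offset are treated as UTC.
            scheduled_at = scheduled_at.replace(tzinfo=timezone.utc)

    return {
        "user_id": user_id,
        "channels": list(channels),
        "template_id": template_id,
        "template_vars": dict(body.get("template_vars", {})),
        "scheduled_at": scheduled_at,
        "status": "scheduled" if scheduled_at > now else "pending",
    }
```

Passing `now` in as a parameter rather than reading the clock inside the function keeps the immediate-versus-scheduled decision easy to test.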
FR 1: Bulk send to a user segment:
# Schedule a notification campaign to an entire user segment
POST /v1/notifications/bulk
Body: {
  segment_id: "new_users_march",
  channels: ["push"],
  template_id: "onboarding_day1",
  scheduled_at?: "2026-03-29T09:00:00Z"
}
Response: { batch_id: "b_999", estimated_recipients: 4200000, status: "scheduled" }
Do not accept a user_ids array in the request body. A 50M element array creates a request body that is impossible to parse and a timeout bomb for the ingestion service. Segment-based sends resolve the recipient list asynchronously inside the fan-out pipeline, returning immediately with a batch_id for status polling.
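One way to picture asynchronous segment resolution is a worker that pages through the segment with a cursor instead of materialising the whole recipient list. The sketch below is hypothetical: the `PageFetcher` signature, the cursor format, and the in-memory fetcher are stand-ins for whatever segment store is actually in use.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

# Hypothetical contract: (segment_id, cursor, limit) -> (user_ids, next_cursor).
# next_cursor is None once the segment is exhausted.
PageFetcher = Callable[[str, Optional[str], int], Tuple[List[str], Optional[str]]]
Publisher = Callable[[Dict[str, str]], None]


def iter_segment(segment_id: str, fetch_page: PageFetcher, page_size: int) -> Iterator[List[str]]:
    """Yield the segment one page at a time so memory use stays bounded."""
    cursor: Optional[str] = None
    while True:
        user_ids, cursor = fetch_page(segment_id, cursor, page_size)
        if user_ids:
            yield user_ids
        if cursor is None:
            return


def fan_out(batch_id: str, segment_id: str, fetch_page: PageFetcher,
            publish: Publisher, page_size: int = 1000) -> int:
    """Publish one event per recipient and return how many were published."""
    sent = 0
    for page in iter_segment(segment_id, fetch_page, page_size):
        for user_id in page:
            publish({"batch_id": batch_id, "user_id": user_id})
            sent += 1
    return sent


def make_in_memory_fetcher(members: List[str]) -> PageFetcher:
    """Test double that pages through a plain list using an offset cursor."""
    def fetch(segment_id: str, cursor: Optional[str], limit: int):
        start = int(cursor) if cursor else 0
        nxt = start + limit
        return members[start:nxt], (str(nxt) if nxt < len(members) else None)
    return fetch
```

This sketch does not checkpoint progress. A real fan-out worker would presumably persist the cursor per batch_id so that a restart resumes from the last completed page rather than from the beginning.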
FR 4: Manage user preferences:
# Read and replace the full channel opt-in/out preference set for a user
GET /v1/users/{user_id}/preferences
PUT /v1/users/{user_id}/preferences
Body: { push: true, email: false, sms: true }
Response: { user_id: "u_123", push: true, email: false, sms: true, updated_at: "..." }
Use PUT over PATCH because the preference object is small and always replaces the full set of channel flags. PATCH with partial updates adds merge-conflict complexity for no benefit at this schema size.
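A minimal sketch of the full-replace semantics, assuming a dict-backed store: `replace_preferences`, the strictness about missing or unknown keys, and the ISO-formatted `updated_at` are illustrative choices rather than part of the API contract above.

```python
from datetime import datetime, timezone

CHANNELS = ("push", "email", "sms")


def replace_preferences(store: dict, user_id: str, body: dict, now: datetime) -> dict:
    """PUT semantics: the body carries every channel flag and replaces the stored set."""
    missing = [c for c in CHANNELS if c not in body]
    extra = [k for k in body if k not in CHANNELS]
    if missing or extra:
        raise ValueError(f"expected exactly {list(CHANNELS)}; missing={missing}, extra={extra}")
    if not all(isinstance(body[c], bool) for c in CHANNELS):
        raise ValueError("channel flags must be booleans")

    record = {"user_id": user_id}
    record.update({c: body[c] for c in CHANNELS})
    record["updated_at"] = now.isoformat()
    store[user_id] = record  # overwrite: there is no merge step to get wrong
    return record
```

Rejecting a partial body outright is what keeps this a true replace: two concurrent writers can still race, but the last write wins as a whole object rather than producing a field-by-field mix.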
High-Level Design
1. Source systems submit a notification via the ingestion API
The ingestion path: source system calls REST API, notification service validates and persists, event published to Kafka for async processing.
Components:
- Source System: Any internal service (product, payments, auth) that needs to trigger a notification. Calls POST /v1/notifications with a template reference and recipient.
- Notification Service: Validates the payload, writes a notification record to the database with status pending, then publishes the event to Kafka.
- Notification DB: PostgreSQL. Stores the notification record as the source of truth for status tracking. Insert on receipt, update on delivery outcome.
- Kafka (ingestion topic): Receives the notification event after the successful DB write. All downstream processing happens from this topic, never from the source system directly.
Request walkthrough:
- Source system sends POST /v1/notifications with user_id, template_id, and optional scheduled_at.
- Notification Service validates: verifies the template exists, user_id is non-null, channel list is non-empty.
- Notification Service inserts a record into Notification DB. If scheduled_at is in the future, status is scheduled; otherwise pending.
- Notification Service publishes the event to the notifications.pending Kafka topic, keyed by user_id for ordered per-user processing.
- Notification Service returns { notification_id, status: "queued" } to the source system.
The write to Notification DB happens before the Kafka publish. If Kafka is temporarily unavailable, the record persists in the DB with status pending and a background sweeper re-publishes it. This write-before-publish ordering is the durability guarantee.
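Here is a minimal sketch of write-before-publish with a sweeper. It uses an in-memory SQLite table as a stand-in for PostgreSQL and a plain callable as a stand-in for the Kafka producer. The `published` column, the function names, and the table layout are assumptions; the text above only requires that unpublished pending rows can be found and re-published.

```python
import sqlite3
import uuid
from typing import Callable

# Hypothetical producer interface: (partition key, notification_id).
# It is expected to raise when the broker is unavailable.
Publisher = Callable[[str, str], None]


def init_db(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS notifications ("
        " notification_id TEXT PRIMARY KEY,"
        " user_id TEXT NOT NULL,"
        " status TEXT NOT NULL,"
        " published INTEGER NOT NULL DEFAULT 0)"
    )


def mark_published(conn: sqlite3.Connection, notification_id: str) -> None:
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE notifications SET published = 1 WHERE notification_id = ?",
            (notification_id,),
        )


def accept(conn: sqlite3.Connection, publish: Publisher, user_id: str) -> str:
    notification_id = f"n_{uuid.uuid4().hex}"
    # 1. Durable write first: the notification is accepted once this commits.
    with conn:
        conn.execute(
            "INSERT INTO notifications (notification_id, user_id, status)"
            " VALUES (?, ?, 'pending')",
            (notification_id, user_id),
        )
    # 2. Best-effort publish: a failure leaves published = 0 for the sweeper.
    try:
        publish(user_id, notification_id)
    except Exception:
        return notification_id
    mark_published(conn, notification_id)
    return notification_id


def sweep(conn: sqlite3.Connection, publish: Publisher, limit: int = 100) -> int:
    """Re-publish pending rows that never reached the broker; return the count."""
    rows = conn.execute(
        "SELECT notification_id, user_id FROM notifications"
        " WHERE status = 'pending' AND published = 0 LIMIT ?",
        (limit,),
    ).fetchall()
    republished = 0
    for notification_id, user_id in rows:
        try:
            publish(user_id, notification_id)
        except Exception:
            break  # broker still down; try again on the next sweep
        mark_published(conn, notification_id)
        republished += 1
    return republished
```

A crash between the publish and the `mark_published` call causes the same event to be published twice on the next sweep. That is consistent with at-least-once delivery and is exactly why downstream workers need to be idempotent.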
I always draw the DB write before the Kafka publish on the whiteboard and pause to let the interviewer notice it. That ordering decision is the single most important durability choice in the entire ingestion path, and calling it out early shows you understand exactly where data loss hides.
Why Kafka over a direct call to the router?
Source systems must not call delivery workers directly. A direct call ties ingestion latency to delivery latency and makes every source system aware of the notification topology. Kafka decouples them: the source system gets a fast ack as soon as the notification is persisted and enqueued, and the delivery pipeline processes at its own pace.
2. Channel routing and fan-out
The Router Worker consumes from Kafka, resolves which channels the user wants, and fans out to per-channel queues.
The naive approach here would be to route directly from the ingestion service by calling APNs, SES, and Twilio inline. At 1M notifications per second, synchronous provider calls from the ingestion path would saturate provider rate limits within seconds and lose all backpressure control. The fix is to decouple routing from delivery with one Kafka topic per channel.
Components:
- Router Worker: Consumes events from notifications.pending. For each event, resolves the user's active channels (checked against preferences), renders the template into a channel-specific payload, and publishes one message per active channel.
- Per-channel Kafka topics: notifications.push, notifications.email, notifications.sms. Each topic gives the corresponding delivery path its own independent backpressure boundary.
- Preferences Cache: Redis. The Router Worker queries this before routing. Full detail in HLD section 4 and Deep Dive 3.
Request walkthrough:
- Router Worker consumes an event from notifications.pending.
- Router Worker looks up user preferences: which channels are active for this user.
- For each active channel, Router Worker renders the template into a channel-specific payload (APNs JSON for push, HTML body for email, plain text for SMS).
- Router Worker publishes one message per active channel to the appropriate topic.
- Notification DB record updated to routing_complete.
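The routing steps above can be sketched as a single function. This is a simplified illustration: the `TEMPLATES` store, the `route` and `render` names, the choice to treat a missing preference as opted out, and the publisher callable standing in for a Kafka producer are all assumptions.

```python
import html
import json
from string import Template
from typing import Callable, Dict, List

# Hypothetical template store: one text body per template_id.
TEMPLATES = {"order_confirmed": Template("Order $order_id is confirmed")}

TOPIC_BY_CHANNEL = {
    "push": "notifications.push",
    "email": "notifications.email",
    "sms": "notifications.sms",
}

# Stand-in for a Kafka producer: (topic, key, payload).
Publisher = Callable[[str, str, str], None]


def render(channel: str, template_id: str, template_vars: Dict[str, str]) -> str:
    """Render one template into a channel-specific payload."""
    text = TEMPLATES[template_id].substitute(template_vars)
    if channel == "push":
        return json.dumps({"aps": {"alert": text}})  # minimal APNs-style body
    if channel == "email":
        return f"<p>{html.escape(text)}</p>"  # escape variables before embedding in HTML
    return text  # sms: plain text


def route(event: dict, preferences: Dict[str, bool], publish: Publisher) -> List[str]:
    """Publish one message per active channel; return the channels used."""
    active = [
        c for c in event["channels"]
        # Assumption: a channel with no stored preference counts as opted out.
        if c in TOPIC_BY_CHANNEL and preferences.get(c, False)
    ]
    for channel in active:
        payload = render(channel, event["template_id"], event["template_vars"])
        # Keying by user_id keeps each user's messages ordered within a topic.
        publish(TOPIC_BY_CHANNEL[channel], event["user_id"], payload)
    return active
```

Calling `route` with channels `["push", "email"]` and preferences `{push: true, email: false, sms: true}` publishes a single message to notifications.push, since email is opted out and SMS was never requested.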