Notification Service
Design a multi-channel notification service that delivers billions of push, email, and SMS notifications per day reliably, covering ingestion pipelines, fan-out strategies, deduplication, and guaranteed delivery.
What is a notification service?
A notification service delivers messages from your product to users across push, email, and SMS. Sending one notification is trivial. Sending a billion without losing any, without blowing APNs rate limits during a viral campaign, and without delivering the same notification twice because a worker crashed mid-send is where the real engineering lives.
I like opening with this system in interviews because it forces you to think about producer-consumer boundaries, external provider quirks, and idempotency all in one design. This question tests message queues, fan-out strategies, distributed delivery guarantees, and external provider integration all in the same answer.
Functional Requirements
Core Requirements
- Send notifications through mobile push (APNs/FCM), email, and SMS channels.
- Support immediate and scheduled sends.
- Guarantee at-least-once delivery; deduplicate at the client where possible.
- Allow users to manage preferences and opt out of specific channels.
Below the Line (out of scope)
- In-app notification bell and inbox (badge counts, read/unread state)
- Transactional OTP flows with tight sub-3-second delivery requirements
- A/B testing of notification content and send-time optimization
- Rich push notifications with images and deep-link action buttons
The hardest part in scope: Fan-out at scale. When a platform pushes a new post to 50 million followers, those 50 million push notifications must land within minutes without saturating APNs, causing other notifications to queue behind the campaign for hours, or delivering duplicates if the fan-out worker crashes halfway through.
An in-app notification bell is out of scope because it requires a separate storage model (an inbox per user with read/unread state) and a real-time delivery mechanism. To add it, I would write notification records to a dedicated inbox table after successful delivery, expose a GET /users/{id}/notifications paginated endpoint, and push badge count updates via Server-Sent Events or WebSocket.
Transactional OTP flows share the SMS delivery path but require latency an order of magnitude tighter than general notifications (OTPs must arrive in under 3 seconds). To add them, I would route OTP events to a separate high-priority SMS queue that bypasses backpressure controls entirely and connects to a dedicated Twilio subaccount with reserved throughput.
A/B testing is a product layer on top of delivery. To add it, I would resolve the template variant in the notification service at enqueue time, assigning users to experiment arms via a feature flag SDK call before publishing to Kafka.
Rich push is out of scope because APNs and FCM have a 4KB payload size cap and separate documentation for media attachments. To add it, I would store media URLs in the notification payload and let the device SDK download assets asynchronously on receipt rather than bundling them in the push payload.
Non-Functional Requirements
Core Requirements
- Throughput: Handle 1M notifications per second at peak across all channels combined.
- Latency: Real-time notifications queued within 1 second of trigger; delivered within 10 seconds end to end for push.
- Availability: 99.99% uptime for the ingestion API. Delivery workers tolerate brief restarts as long as the queue persists.
- Durability: No notification is lost once accepted by the ingestion API.
- Scale: 1 billion registered devices, 500M DAU.
Below the Line
- Sub-second push delivery end to end (APNs and FCM add their own tail latency beyond our control)
- Real-time delivery receipts and per-user read confirmations
Write-heavy reality: This system is almost entirely writes. Every inbound event produces at least one dispatch per channel, and bulk campaigns produce millions. There is no hot read path comparable to a URL shortener; the challenge is absorbing enormous write throughput without data loss and without blowing through external provider rate limits every time a marketing team sends a campaign. Every design decision in this article traces back to that 1M peak writes per second constraint.
I'd call out that the 99.99% availability target applies to the ingestion API only, not end-to-end delivery. Each external provider (APNs, FCM, Twilio, SES) carries its own SLA, and your system cannot exceed it. Design for at-least-once delivery and idempotent workers, not for real-time guarantees that depend on third-party uptime.
Core Entities
- Notification: A single delivery event carrying channel, recipient identifier, template reference, rendered payload, status, and optional scheduled delivery time.
- User: The recipient account, tied to a device token (push), email address, and phone number per channel.
- UserPreference: A per-user, per-channel opt-in flag with optional quiet-hours window configuration.
- NotificationTemplate: A reusable payload template with variable slots for personalization (order ID, username, amount, etc.).
- DeliveryLog: An append-only record of each delivery attempt: timestamp, outcome (success, transient failure, permanent failure), and provider response code.
Full schema, indexes, and column types are deferred to the data model deep dive. The entities above are enough to drive the API design and High-Level Design.
API Design
FR 1 and FR 2: Send a notification:
# Accept a single notification and queue it for delivery
POST /v1/notifications
Body: {
user_id: "u_123",
channels: ["push", "email"],
template_id: "order_confirmed",
template_vars: { "order_id": "o_456" },
scheduled_at?: "2026-03-29T15:00:00Z"
}
Response: { notification_id: "n_789", status: "queued" }
Accepting channels as an array rather than a scalar lets the caller specify a primary channel with fallbacks in one request. Using template_id rather than a raw body prevents XSS and keeps payloads auditable. scheduled_at defaults to "now" when absent, covering both immediate and scheduled sends in the same endpoint.
FR 1: Bulk send to a user segment:
# Schedule a notification campaign to an entire user segment
POST /v1/notifications/bulk
Body: {
segment_id: "new_users_march",
channels: ["push"],
template_id: "onboarding_day1",
scheduled_at?: "2026-03-29T09:00:00Z"
}
Response: { batch_id: "b_999", estimated_recipients: 4200000, status: "scheduled" }
Do not accept a user_ids array in the request body. A 50M-element array produces a request body that is impractical to parse and a timeout bomb for the ingestion service. Segment-based sends resolve the recipient list asynchronously inside the fan-out pipeline, returning immediately with a batch_id for status polling.
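Resolving the segment inside the fan-out pipeline typically means a keyset-paginated (cursor) scan rather than OFFSET queries. A minimal sketch, where fetch_page is a hypothetical helper standing in for the real segment query and the cursor is simply the last user_id seen:

```python
def scan_segment(fetch_page, batch_size=10_000):
    """Stream every user_id in a segment via keyset pagination.

    fetch_page(after_user_id, limit) must return user_ids ordered
    ascending, strictly greater than after_user_id. No OFFSET: each
    page resumes from the last id seen, so cost stays constant per page.
    """
    cursor = ""  # empty string sorts before any user_id
    while True:
        page = fetch_page(cursor, batch_size)
        if not page:
            return
        yield from page
        cursor = page[-1]  # resume point for the next page
```

Because the cursor is just the last id, the scan is resumable after a worker crash: persist the cursor alongside the batch_id and restart from it.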
FR 4: Manage user preferences:
# Read and replace the full channel opt-in/out preference set for a user
GET /v1/users/{user_id}/preferences
PUT /v1/users/{user_id}/preferences
Body: { push: true, email: false, sms: true }
Response: { user_id: "u_123", push: true, email: false, sms: true, updated_at: "..." }
Use PUT over PATCH because the preference object is small and always replaces the full set of channel flags. PATCH with partial updates adds merge-conflict complexity for no benefit at this schema size.
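The full-replace semantics can be sketched with a hypothetical handler that rejects partial bodies outright instead of merging them:

```python
REQUIRED_CHANNELS = {"push", "email", "sms"}

def put_preferences(body: dict) -> dict:
    """Validate a PUT body under full-replace semantics.

    PUT replaces the entire preference set, so a partial body is a
    client error, never a merge. This is what makes PUT simpler than
    PATCH at this schema size.
    """
    if set(body) != REQUIRED_CHANNELS:
        raise ValueError(f"body must contain exactly {sorted(REQUIRED_CHANNELS)}")
    if not all(isinstance(v, bool) for v in body.values()):
        raise ValueError("channel flags must be booleans")
    return body  # safe to write as the new full row
```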
High-Level Design
1. Source systems submit a notification via the ingestion API
The ingestion path: source system calls REST API, notification service validates and persists, event published to Kafka for async processing.
Components:
- Source System: Any internal service (product, payments, auth) that needs to trigger a notification. Calls POST /v1/notifications with a template reference and recipient.
- Notification Service: Validates the payload, writes a notification record to the database with status pending, then publishes the event to Kafka.
- Notification DB: PostgreSQL. Stores the notification record as the source of truth for status tracking. Insert on receipt, update on delivery outcome.
- Kafka (ingestion topic): Receives the notification event after the successful DB write. All downstream processing happens from this topic, never from the source system directly.
Request walkthrough:
- Source system sends POST /v1/notifications with user_id, template_id, and optional scheduled_at.
- Notification Service validates: verifies the template exists, user_id is non-null, channel list is non-empty.
- Notification Service inserts a record into Notification DB. If scheduled_at is in the future, status is scheduled; otherwise pending.
- Notification Service publishes the event to the notifications.pending Kafka topic, keyed by user_id for ordered per-user processing.
- Notification Service returns { notification_id, status: "queued" } to the source system.
The write to Notification DB happens before the Kafka publish. If Kafka is temporarily unavailable, the record persists in the DB with status pending and a background sweeper re-publishes it. This write-before-publish ordering is the durability guarantee.
I always draw the DB write before the Kafka publish on the whiteboard and pause to let the interviewer notice it. That ordering decision is the single most important durability choice in the entire ingestion path, and calling it out early shows you understand exactly where data loss hides.
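The write-before-publish ordering plus sweeper can be sketched as follows. In-memory dicts and lists stand in for PostgreSQL and Kafka, and accept_notification and sweeper are illustrative names, not the real service's API:

```python
import uuid

notification_db = {}   # notification_id -> record (PostgreSQL stand-in)
kafka_topic = []       # published events (Kafka stand-in)

def publish_to_kafka(event, *, kafka_up=True):
    if not kafka_up:
        raise ConnectionError("Kafka unavailable")
    kafka_topic.append(event)

def accept_notification(user_id, template_id, *, kafka_up=True):
    notification_id = f"n_{uuid.uuid4().hex[:8]}"
    # Step 1: durable DB write FIRST. Once this succeeds, the
    # notification cannot be lost, whatever happens next.
    notification_db[notification_id] = {
        "user_id": user_id, "template_id": template_id, "status": "pending",
    }
    # Step 2: best-effort Kafka publish. A failure here is recoverable
    # because the pending record survives in the DB.
    try:
        publish_to_kafka({"notification_id": notification_id}, kafka_up=kafka_up)
    except ConnectionError:
        pass  # sweeper will re-publish later
    return {"notification_id": notification_id, "status": "queued"}

def sweeper():
    """Background job: re-publish pending records missing from Kafka."""
    published = {e["notification_id"] for e in kafka_topic}
    for nid, rec in notification_db.items():
        if rec["status"] == "pending" and nid not in published:
            publish_to_kafka({"notification_id": nid})
```

Reversing the two steps (publish first, write second) loses the notification if the service crashes between them, with no record to sweep.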
Why Kafka over a direct call to the router?
Source systems must not call delivery workers directly. A direct call ties ingestion latency to delivery latency and makes every source system aware of the notification topology. Kafka decouples them: the source system gets a sub-millisecond ack and the delivery pipeline processes at its own pace.
2. Channel routing and fan-out
The Router Worker consumes from Kafka, resolves which channels the user wants, and fans out to per-channel queues.
The naive approach here would be to route directly from the ingestion service by calling APNs, SES, and Twilio inline. At 1M notifications per second, synchronous provider calls from the ingestion path would saturate provider rate limits within seconds and lose all backpressure control. The fix is to decouple routing from delivery with per-channel Kafka topics, one per channel.
Components:
- Router Worker: Consumes events from notifications.pending. For each event, resolves the user's active channels (checked against preferences), renders the template into a channel-specific payload, and publishes one message per active channel.
- Per-channel Kafka topics: notifications.push, notifications.email, notifications.sms. Each topic gives the corresponding delivery path its own independent backpressure boundary.
- Preferences Cache: Redis. The Router Worker queries this before routing. Full detail in HLD section 4 and Deep Dive 3.
Request walkthrough:
- Router Worker consumes an event from notifications.pending.
- Router Worker looks up user preferences: which channels are active for this user.
- For each active channel, Router Worker renders the template into a channel-specific payload (APNs JSON for push, HTML body for email, plain text for SMS).
- Router Worker publishes one message per active channel to the appropriate topic.
- Notification DB record updated to routing_complete.
Adding a new channel (WhatsApp, for example) requires one new Kafka topic, one new delivery worker, and one additional boolean flag in the UserPreference schema. The Router Worker gains a single if whatsapp_enabled check and a publish call. Nothing else changes.
I'd mention this extensibility point proactively in an interview. Interviewers love hearing "adding a new channel is a one-topic, one-worker change" because it proves the architecture handles future requirements without redesign.
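The routing steps above can be sketched roughly like this, with an in-memory preferences map and a publish callback standing in for Redis and the Kafka producer (all names here are illustrative):

```python
# Preference map stand-in: which channels each user has enabled.
preferences = {"u_123": {"push": True, "email": False, "sms": True}}

CHANNEL_TOPICS = {
    "push": "notifications.push",
    "email": "notifications.email",
    "sms": "notifications.sms",
}

def render(template_id, channel, template_vars):
    # Real rendering produces APNs JSON / HTML / plain text per channel;
    # a tagged string is enough to show the shape.
    return f"{template_id}:{channel}:{template_vars}"

def route(event, publish):
    """Fan one ingestion event out to per-channel topics."""
    prefs = preferences.get(event["user_id"], {c: True for c in CHANNEL_TOPICS})
    for channel in event["channels"]:
        if not prefs.get(channel, False):
            continue  # user opted out of this channel: skip silently
        payload = render(event["template_id"], channel, event["template_vars"])
        publish(CHANNEL_TOPICS[channel], {"user_id": event["user_id"], "payload": payload})
```

Note how the new-channel claim shows up directly in the code: adding WhatsApp is one new entry in CHANNEL_TOPICS plus one preference flag.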
3. Per-channel delivery workers
Each channel has a dedicated worker pool that reads from its Kafka topic and calls the appropriate external provider.
Channels are not interchangeable. Push delivery goes through APNs (Apple) and FCM (Google) using device tokens and HTTP/2. Email delivery goes through SES or SendGrid via REST. SMS delivery goes through Twilio via REST. I always build separate worker fleets per channel because their rate limits, error codes, and retry semantics diverge enough to make a single generic worker unmanageable in production.
Components:
- Push Worker: Reads from notifications.push. Calls APNs (iOS) or FCM (Android). Classifies errors: BadDeviceToken is permanent (invalidate the token); RateLimitExceeded is transient (back off and retry).
- Email Worker: Reads from notifications.email. Calls SES or SendGrid REST API. Listens to bounce and complaint webhooks to suppress future sends to hard-bounced addresses.
- SMS Worker: Reads from notifications.sms. Calls Twilio REST API. Polls or webhook-receives delivery status to update the DeliveryLog with the final outcome.
Request walkthrough (push notification):
- Push Worker consumes a message from notifications.push containing a device token and rendered payload.
- Push Worker calls APNs (iOS) or FCM (Android) with the payload.
- On success: DeliveryLog records delivered, Kafka offset committed.
- On transient error (rate limit, server 500): offset not committed, message redelivered after backoff.
- On permanent error (BadDeviceToken): device token invalidated in User DB, DeliveryLog records failed_permanent, offset committed.
Workers never call external providers synchronously from the ingestion path. All provider calls happen asynchronously from the channel queues. A full APNs outage does not affect ingestion or email delivery; the push queue absorbs the backlog until APNs recovers.
In every notification system I have worked on, the first production incident was a provider outage cascading into the ingestion path because someone shortcut the queue. Separate worker fleets per channel are not optional; they are the blast-radius boundary that keeps one provider's bad day from becoming your system-wide outage.
APNs and FCM are not the same provider
Treat APNs and FCM as separate providers inside the Push Worker. Device tokens for iOS go to APNs; tokens for Android go to FCM. They have different endpoints, different authentication schemes (APNs uses a provider-token JWT; FCM's HTTP v1 API uses OAuth 2.0 access tokens), and different error code enumerations. A single push worker handles both by maintaining two internal client objects, one per provider.
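The worker's error-classification decision can be sketched as a small lookup. The error-code sets below are a hand-picked illustration, not the full APNs/FCM enumerations:

```python
# Illustrative subsets of provider error codes (real enumerations are larger).
PERMANENT = {"BadDeviceToken", "Unregistered"}   # Unregistered is FCM's analogue
TRANSIENT = {"RateLimitExceeded", "InternalServerError", "ServiceUnavailable"}

def handle_provider_result(error_code):
    """Map a provider response to a worker action.

    - success          -> commit the Kafka offset
    - permanent error  -> invalidate the token, log, then commit
    - transient error  -> do NOT commit; redeliver after backoff
    """
    if error_code is None:
        return "commit"
    if error_code in PERMANENT:
        return "invalidate_and_commit"
    if error_code in TRANSIENT:
        return "retry"
    # Unknown codes default to retry; the dead-letter policy bounds the damage.
    return "retry"
```

The key asymmetry: permanent failures still commit the offset (retrying a dead token forever helps no one), while transient failures leave it uncommitted so Kafka redelivers.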
4. User preferences and opt-out
Before routing, the Router Worker checks a Redis-backed preferences cache. Users who have opted out of a channel are skipped for that channel without any DB read.
Components:
- Preferences API: Exposes GET and PUT /v1/users/{id}/preferences. Writes to Preferences DB and invalidates the Redis cache key on every update.
- Preferences DB: PostgreSQL table with one row per user, boolean columns per channel, and a quiet_hours JSON column for time-window suppression. This is the source of truth.
- Preferences Cache: Redis. Caches the full preference object per user with a 5-minute TTL. The Router Worker reads exclusively from cache; a miss falls back to the DB and repopulates the cache.
Request walkthrough:
- User opens notification settings and disables email.
- App sends PUT /v1/users/{user_id}/preferences with { push: true, email: false, sms: true }.
- Preferences API updates the Preferences DB row.
- Preferences API deletes the Redis cache key immediately (write-invalidate, not write-through).
- On the next routing event for this user, the Router Worker gets a cache miss, fetches fresh preferences from the DB, repopulates the cache, and skips the email channel.
Write-invalidate is simpler than write-through here because preference updates are rare and the 5-minute TTL provides a natural worst-case freshness bound. A user who opts out of email will stop receiving email within 5 minutes at worst. That is acceptable for a preference change.
I would call out the TTL tradeoff explicitly on the whiteboard: "5 minutes of stale preferences is acceptable for marketing emails, but unacceptable for regulatory opt-outs (CAN-SPAM, GDPR). If compliance requires real-time suppression, switch to write-through for the opt-out flag specifically." That kind of nuance separates a textbook answer from a production-tested one.
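The write-invalidate flow can be sketched with in-memory stand-ins for the Preferences DB and Redis (TTL expiry is omitted; the function names are illustrative):

```python
prefs_db = {}     # source of truth (PostgreSQL stand-in)
prefs_cache = {}  # Redis stand-in; production entries carry a 5-minute TTL

def update_preferences(user_id, prefs):
    """Write-invalidate: update the DB, then DELETE the cache key."""
    prefs_db[user_id] = prefs
    prefs_cache.pop(user_id, None)  # delete, don't overwrite

def get_preferences(user_id):
    """Router Worker read path: cache first, DB on miss, repopulate."""
    if user_id in prefs_cache:
        return prefs_cache[user_id]
    prefs = prefs_db.get(user_id, {"push": True, "email": True, "sms": True})
    prefs_cache[user_id] = prefs
    return prefs
```

Switching a compliance-critical flag to write-through is a one-line change: set prefs_cache[user_id] = prefs inside update_preferences instead of deleting the key.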
Potential Deep Dives
1. How do you handle bulk fan-out to 50 million users without overwhelming APNs and FCM?
A marketing team triggers a push campaign to all 50M active users at once. APNs limits each provider certificate to roughly 300 concurrent HTTP/2 connections, supporting around 15,000 notifications per second per connection pool. A naive approach saturates that limit immediately and blocks all transactional and OTP notifications behind the campaign for hours.
Three constraints drive the design: providers have hard rate limits per API key; bulk fan-out must complete within a reasonable window (under 30 minutes for most campaigns); and bulk traffic must not starve real-time transactional notifications.
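A token-bucket limiter in front of each provider's worker pool is one way to respect those hard rate limits while absorbing bounded bursts. A minimal single-process sketch; a production version would share the bucket across workers (e.g. in Redis):

```python
import time

class TokenBucket:
    """Token-bucket limiter sized to a provider's documented rate limit."""

    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec      # sustained sends allowed per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, n=1):
        """Non-blocking: True if n sends may proceed now, else False."""
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

A worker that gets False simply stops pulling from its Kafka topic until tokens refill, which is exactly the backpressure boundary the per-channel topics exist to provide.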
2. How do you guarantee at-least-once delivery and handle failures?
Every external provider call can fail. APNs returns 500s under load. Twilio returns 429 when your message rate exceeds your account tier. SES bounces email without warning. The system must distinguish transient failures (retry) from permanent failures (log and skip) and must never silently drop a notification that was accepted and acknowledged to the source system.
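One common shape for this is a delayed-retry topic with exponential backoff and a dead-letter queue after a bounded number of attempts. A sketch of that policy, with the 5-attempt cap and 30-second base delay as illustrative choices:

```python
MAX_ATTEMPTS = 5     # after this, stop retrying and dead-letter
BASE_DELAY_S = 30    # first retry delay; doubles each attempt

def next_action(attempt):
    """Decide what to do after a transient failure.

    attempt is the 1-based count of failed delivery attempts so far.
    Returns ("retry", delay_seconds) or ("dead_letter", None).
    """
    if attempt >= MAX_ATTEMPTS:
        return ("dead_letter", None)
    delay = BASE_DELAY_S * (2 ** (attempt - 1))  # 30s, 60s, 120s, 240s
    return ("retry", delay)
```

Publishing the retry with a future-dated delay (rather than sleeping in the consumer) keeps the Kafka partition flowing for every other notification behind it.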
3. How do you manage user preferences and channel routing at scale?
With 500M DAU, the Router Worker fleet processes on the order of 50,000 events per second in steady state, with bursts far higher. A direct PostgreSQL preference read per event at 10ms per query means each Router Worker instance handles only 100 events per second. Sustaining 50,000 events per second with direct DB reads requires 500 worker instances all hammering the Preferences DB simultaneously, creating a read bottleneck that grows with every scaling step.
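One answer is a negative-lookup fast path: keep only opted-out users in a Bloom filter, so the common all-defaults case needs no Redis or DB read at all. A toy sketch; production would use something like RedisBloom, and BloomFilter and resolve_preferences are illustrative names:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter (no false negatives; rare false positives)."""

    def __init__(self, size_bits=1 << 20, hashes=4):
        self.size = size_bits
        self.hashes = hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def resolve_preferences(user_id, opted_out_filter, fetch_from_cache_or_db):
    # ~95% of users are never in the filter: sub-millisecond fast path,
    # zero Redis GETs, zero DB reads.
    if not opted_out_filter.might_contain(user_id):
        return {"push": True, "email": True, "sms": True}
    # Only users who have ever opted out pay for the slow path.
    return fetch_from_cache_or_db(user_id)
```

A false positive merely sends one never-opted-out user down the slow path, where the cache/DB read returns their (default) preferences correctly; no notification is ever wrongly suppressed.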
Final Architecture
The most important architectural insight: every layer is a separate fleet consuming from a durable Kafka topic, so any layer can fail, restart, or scale independently without losing a single notification once accepted. The Kafka topic is the contract between layers; the consumer is replaceable.
Interview Cheat Sheet
- Channels supported: Mobile push (APNs for iOS, FCM for Android), email (SES or SendGrid), and SMS (Twilio). Each channel gets a dedicated Kafka topic and a dedicated worker fleet. Adding a new channel requires one new Kafka topic, one new worker service, and one boolean column addition to the UserPreference schema.
- Ingestion model: Source systems call POST /v1/notifications. The Notification Service persists the record to PostgreSQL with status pending before publishing to Kafka. The DB write happens before the Kafka publish. If Kafka is down at publish time, a background sweeper re-publishes pending records, ensuring no accepted notification is ever discarded silently.
- Buffering strategy: Each channel has a dedicated Kafka topic (notifications.push, notifications.email, notifications.sms). The Router Worker publishes to these topics after resolving preferences. Per-channel topics give each delivery path its own independent backpressure boundary; a Twilio outage fills the SMS topic without affecting push or email delivery.
- Delivery differences by channel: Push uses APNs HTTP/2 or FCM REST with a device token. Email uses SES/SendGrid REST with MIME encoding and separate bounce-and-complaint webhooks. SMS uses Twilio REST with delivery receipt callbacks. Error classification differs per provider: APNs BadDeviceToken is permanent (invalidate the token); Twilio 429 is transient (back off and retry).
- Bulk fan-out: A 50M-user campaign uses a dedicated Fan-out Service that cursor-scans the user segment (no OFFSET queries) and publishes to 16 sharded Kafka topics. Separate consumer groups and APNs connection pools for bulk versus real-time traffic ensure campaigns cannot starve transactional or OTP notifications regardless of campaign size.
- Provider rate limiting: Use a token-bucket rate limiter per worker pool, sized to the provider's documented limit. APNs supports roughly 300 concurrent HTTP/2 connections per certificate; FCM allows 600K messages per minute per project. Keep the worker connection count below these limits with headroom for retry burst traffic.
- At-least-once delivery: Workers do not commit Kafka offsets until the provider call succeeds and the DeliveryLog is updated. Failed messages are published to a notifications.retry topic with a delay_until header rather than sleeping and blocking the partition. After 5 failed attempts, the event goes to the dead-letter queue.
- Idempotency: Each delivery attempt carries an idempotency_key of notification_id:channel:attempt. Workers check the DeliveryLog for an existing delivered status before calling the provider. Pass the key as the apns-id (APNs) or message_id (FCM) header to deduplicate at the provider level as well during outage recovery replays.
- User preferences at scale: A Redis Bloom filter holds the user_id of every user who has ever opted out of at least one channel. Roughly 95% of users are never in the filter, so their preference lookup is a single sub-millisecond Bloom filter check returning all-channels-enabled defaults with zero DB or cache reads. Only opted-out users incur a Redis GET or DB fallback.
- Preference freshness: The Preferences API uses write-invalidate (delete the Redis key on update). Maximum staleness equals the Redis TTL, set to 5 to 10 minutes. For opted-out users the API also performs a write-through into the Redis preference cache immediately, so the next Router Worker read is already fresh without a DB round-trip.
- Scheduled sends: Notifications with scheduled_at in the future are stored with status scheduled rather than pending. A Scheduler Worker polls every 30 seconds for rows where scheduled_at <= NOW() AND status = 'scheduled', using SELECT FOR UPDATE SKIP LOCKED to prevent double-processing across multiple instances, then publishes matching rows to Kafka.
- Throughput math: 1M notifications per second peak, with each notification routing to 1 to 3 active channels, produces up to 3M Kafka messages per second across all channel topics. A 6-broker Kafka cluster on modern hardware handles this comfortably at 500K messages per second per broker with replication factor 3.
- Deduplication in bulk campaigns: A Redis Bloom filter keyed by (batch_id, user_id) in the Fan-out Service prevents the same user from receiving the same campaign notification twice due to belonging to overlapping segments. False positives result in a skipped notification for an edge-case user, which is acceptable. False negatives (duplicates) are impossible with a Bloom filter.