Design a Stripe-like payment processor from scratch: charge flows, idempotency to prevent double-charges, handling unknown states after timeouts, and scaling to 10K transactions per second during a flash sale.
36 min read · 2026-04-02 · hard · payments · system-design · databases · idempotency · distributed-systems
A payment processor sits between a merchant's checkout page and the card networks (Visa, Mastercard, Amex). It routes authorization requests to the right network, records the result, and handles the inevitable failures: timeouts, partial captures, refunds, and retries. I open every payment system interview by saying "the happy path is boring; everything interesting happens when the network drops your response," because that framing immediately shows the interviewer you understand what makes this problem hard.
The interesting engineering challenges live off the happy path: ensuring a charge fires exactly once when the client retries a timed-out request, handling the limbo state where the card network accepted a charge but your service never received the response, and scaling the synchronous authorization pipeline to 10K TPS during a flash sale. This design builds a direct card network integration, not a wrapper around Stripe or Braintree; the interesting distributed systems problems only surface when you own the network call yourself.
PCI-DSS card data storage. We do not store raw card numbers; a tokenization vault (like Stripe's or Braintree's) converts card details to an opaque token before they reach our system.
Fraud detection ML model internals.
End-to-end settlement and ledger reconciliation.
Dispute and chargeback management.
The hardest part in scope: Exactly-once charge execution. A client that retries after a network timeout must not trigger a second charge. A charge response that gets dropped in transit must not leave the payment in an ambiguous permanent state. The idempotency key mechanism and the payment state machine together solve both problems, and each gets a full deep dive.
PCI-DSS storage is below the line because storing raw card numbers expands the compliance scope far beyond the distributed systems challenge. In production I would integrate a tokenization vault so our system never sees the actual card number. The opaque card token we receive is useless without the vault.
Fraud detection is below the line because it runs as a scoring service, not a core payment-path component. To add it, I would call a fraud score API synchronously before sending the auth request to the card network and reject any charge above a configured risk threshold.
Settlement and reconciliation are below the line because they run as a daily batch pipeline against completed transactions, not a real-time flow. They do not affect the charge or refund paths we are designing.
Dispute and chargeback management is below the line because it is a human-assisted process triggered by customer disputes through their bank, not something our payment API initiates directly.
Exactly-once delivery: A charge must complete exactly once regardless of how many times the client retries due to network failures or timeouts.
Strong consistency: The payment state stored in our database must agree with what the card network recorded. Stale state is not acceptable for financial data.
Availability: 99.99% uptime for the charge and refund endpoints, roughly 52 minutes of downtime budget per year.
Latency: Charge API returns in under 2 seconds p99. Card network authorization adds 100-300ms of unavoidable latency; our infrastructure must not contribute more than an additional 200ms on top of that.
Scale: 5M active merchants globally, 50M active cardholders. Peak 10K transactions per second during flash sales (Black Friday, holiday surges). Sustained at that peak, 10K × 86,400 seconds is about 864M transactions per day, call it 1B.
Auditability: Every payment state transition is written to an immutable event log and must be queryable indefinitely.
Sub-100ms charge response time (card networks impose unavoidable latency)
Multi-region active-active with synchronous cross-region consistency guarantees
Read/write ratio: Payments are write-skewed at the state machine level. Each transaction produces 3-4 state transitions (PENDING to AUTHORIZING to AUTHORIZED to CAPTURED), and each transition produces an immutable PaymentEvent record. The write-to-read ratio on the payments table is roughly 4:1. The strong consistency requirement means we cannot serve reads from an eventually-consistent read replica; every status query must reflect the current authoritative state. We do not apply aggressive caching to payment records.
The 2-second p99 latency target defines our timeout strategy. In practice, I set the card network timeout at 1 second rather than 1.5 seconds to give the error path enough budget to write UNKNOWN and return before breaching the 2-second SLA. That is why the "unknown state" problem in deep dive 2 exists: the hard timeout is non-negotiable.
The 99.99% availability on a synchronous path with an external dependency (the card network) means we cannot let card network slowness cascade to our uptime SLA. Bulkheads and fallback logic must isolate card network outages from the payment recording path.
Payment: The transaction record. Carries the payment ID, merchant ID, amount, currency, card token, current state, idempotency key, network reference ID (set before calling the card network), and timestamps.
PaymentEvent: An immutable record of a single state transition on a Payment. Every time a payment changes state, we append one PaymentEvent. This is the audit trail and the source of truth for what happened and when.
Refund: A credit linked to a captured Payment. Carries a refund ID, parent payment ID, amount, reason, and current state (PENDING, SUCCEEDED, FAILED).
The full schema, indexes, and constraints are deferred to the data model deep dive. I deliberately keep the Payment entity flat here; the audit trail lives in PaymentEvents, not in versioned columns on the payment row. The three entities above are sufficient to drive the API and the High-Level Design.
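The three entities can be sketched as Python dataclasses. Field names follow the descriptions above; the types, the minor-units convention for amounts, and the default values are my assumptions, not a committed schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Payment:
    payment_id: str
    merchant_id: str
    amount: int            # minor units (cents) to avoid float rounding
    currency: str
    card_token: str        # opaque token from the vault, never a raw PAN
    state: str             # PENDING / AUTHORIZING / AUTHORIZED / CAPTURED / FAILED / UNKNOWN
    idempotency_key: str
    network_ref_id: Optional[str] = None  # set before calling the card network
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class PaymentEvent:
    """Immutable record of one state transition; append-only audit trail."""
    payment_id: str
    from_state: str
    to_state: str
    occurred_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class Refund:
    refund_id: str
    payment_id: str        # parent CAPTURED payment
    amount: int
    reason: Optional[str]
    state: str = "PENDING"  # PENDING / SUCCEEDED / FAILED
```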
POST /payments
Headers: Idempotency-Key: <client-generated uuid>
Body: { card_token, amount, currency, description? }
Response: { payment_id, status }
The Idempotency-Key header is required, not optional. Without it, retrying a timed-out request creates a duplicate charge. The key is a UUID the client generates before sending; any retry uses the exact same key. This is the pattern Stripe uses and the one we replicate here.
POST /payments/{payment_id}/refunds
Headers: Idempotency-Key: <client-generated uuid>
Body: { amount?, reason? }
Response: { refund_id, status }
I always call this out proactively in interviews: POST /payments/{payment_id}/refunds uses POST rather than DELETE because a refund is a new financial operation, not a reversal of the charge at the resource level. A full refund does not delete the payment; it creates a Refund entity linked to the original. Partial refunds make this unambiguous: you cannot DELETE /payments/{id} by 30%.
The amount field is optional. If omitted, the system defaults to a full refund of the captured amount. The service validates that the sum of all prior refunds plus this refund does not exceed the original captured amount before calling the card network.
If authentication were in scope, I would add a merchant_id claim to the request context from the auth token and scope all payment lookups to that merchant. I would not add a merchant_id field to the request body because that is a privilege escalation risk.
The simplest starting point: record the intent to charge before touching any external system. The client sends a charge request; the Payment Service writes the payment to the database in PENDING state and returns a payment_id. No card network interaction yet.
Components:
Merchant App: Sends the charge request with a card token, amount, and Idempotency-Key header.
Payment Service: Validates the request, checks the idempotency key, writes the payment record in PENDING state.
Payments DB (PostgreSQL): Stores the payment with a UNIQUE constraint on idempotency_key.
Request walkthrough:
Client sends POST /payments with card token, amount, currency, and Idempotency-Key.
Payment Service validates the request (positive amount, supported currency, non-empty card token).
Payment Service executes INSERT ... ON CONFLICT DO NOTHING on the idempotency key; if the key already exists, it reads back the existing payment instead of creating a new one.
Payment Service returns { payment_id, status: "PENDING" }.
This diagram covers only the intent-recording step. The card network is not involved yet; we just know a merchant wants to charge a customer.
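The idempotent insert in the walkthrough can be demonstrated with SQLite standing in for PostgreSQL (SQLite supports the same ON CONFLICT clause); the table layout and function name are illustrative, not the production schema.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE payments (
        payment_id      TEXT PRIMARY KEY,
        idempotency_key TEXT NOT NULL UNIQUE,
        amount          INTEGER NOT NULL,
        state           TEXT NOT NULL DEFAULT 'PENDING'
    )
""")


def create_payment(idempotency_key: str, amount: int) -> str:
    """Insert the payment if the key is new; otherwise return the existing row.

    ON CONFLICT DO NOTHING makes this safe under concurrent retries: the
    database's unique index, not application code, arbitrates which insert
    wins. Check-then-insert would race."""
    conn.execute(
        "INSERT INTO payments (payment_id, idempotency_key, amount) "
        "VALUES (?, ?, ?) ON CONFLICT (idempotency_key) DO NOTHING",
        (str(uuid.uuid4()), idempotency_key, amount),
    )
    conn.commit()
    # Whether we inserted or lost the race, exactly one canonical row exists.
    row = conn.execute(
        "SELECT payment_id FROM payments WHERE idempotency_key = ?",
        (idempotency_key,),
    ).fetchone()
    return row[0]


first = create_payment("key-123", 1999)
retry = create_payment("key-123", 1999)   # client retry with the same key
assert first == retry                     # same payment, no double charge
```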
Now we extend the write path with the actual authorization. The Payment Service transitions the payment to AUTHORIZING (written to DB before making the external call), then calls the card network synchronously, then records the result.
Writing AUTHORIZING before the network call is deliberate: if the service crashes mid-call, the stuck AUTHORIZING record is visible to the reconciliation job introduced in deep dive 2 rather than silently lost. I draw this two-write sequence on the whiteboard every time, because interviewers consistently ask "what if you crash between the write and the network call?" Having the answer already on the board earns immediate credibility.
Components:
Payment Service: Transitions payment PENDING to AUTHORIZING, calls the card network, writes the result.
Card Network API: Visa, Mastercard, or Amex network endpoint. Returns an authorization code (approved) or a decline code in approximately 200ms.
Payments DB: Stores the state transitions and appends PaymentEvents for the audit trail.
Request walkthrough:
Payment Service transitions payment to AUTHORIZING and appends a PaymentEvent.
Payment Service calls the card network with the card token, amount, and currency.
Card network responds in ~200ms: AUTHORIZED with an auth code, or DECLINED with a decline code.
Payment Service records the result, transitions the payment state, and appends a final PaymentEvent.
Payment Service returns the updated status to the client.
The two arrows from the Payment Service to the DB represent two separate moments within the same request: the first update happens before the card network call (recording in-flight state), the second happens after (recording the outcome).
The timeout problem: The card network can fail to respond within our 1-second budget. When that happens, we cannot tell whether the charge succeeded (the network dropped our response) or failed (the network never processed the request). Marking this FAILED and allowing a retry risks charging the customer twice. This is the central correctness problem in payment processing, covered fully in deep dive 2.
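The two-write sequence and the timeout path can be sketched as follows. In-memory dicts stand in for the payments table and event log, and `network_call` is a hypothetical card network client; none of these names are a real API.

```python
import uuid


class NetworkTimeout(Exception):
    """Raised when the card network fails to respond within the 1s budget."""


def authorize(db: dict, events: list, network_call, payment_id: str) -> str:
    """Two-write authorization: record in-flight state BEFORE the external call."""
    payment = db[payment_id]

    # Write 1: mark in-flight, and assign a network reference ID up front so a
    # later status query can locate the transaction even if the response is lost.
    payment["state"] = "AUTHORIZING"
    payment["network_ref_id"] = str(uuid.uuid4())
    events.append((payment_id, "PENDING", "AUTHORIZING"))

    try:
        result = network_call(payment["network_ref_id"])  # ~100-300ms, 1s timeout
    except NetworkTimeout:
        # Timeout is an information gap, not a decline: the charge may have
        # landed. Never re-authorize here; reconciliation resolves UNKNOWN.
        payment["state"] = "UNKNOWN"
        events.append((payment_id, "AUTHORIZING", "UNKNOWN"))
        return "UNKNOWN"

    # Write 2: record the definitive outcome.
    payment["state"] = "AUTHORIZED" if result == "approved" else "FAILED"
    events.append((payment_id, "AUTHORIZING", payment["state"]))
    return payment["state"]
```

If the process crashes between write 1 and the network response, the row is left in AUTHORIZING rather than silently lost, which is exactly what a reconciliation sweep looks for.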
A card network authorization does two things: it confirms the card is valid and funds are available, and it places a temporary hold on those funds. Capturing converts that hold into an actual fund movement. For most e-commerce use cases, authorization and capture happen immediately in sequence (auto-capture). Hotel and car rental bookings typically authorize upfront and capture the final amount at checkout, but we defer delayed capture to a future deep dive.
This design uses auto-capture. After a successful AUTHORIZED response, the Payment Service immediately calls the card network's capture endpoint, transitions the payment to CAPTURED, and appends a final PaymentEvent.
Request walkthrough:
Payment Service receives AUTHORIZED response from card network.
Payment Service immediately calls the card network capture endpoint with the authorization code.
Card network confirms the funds movement; Payment Service transitions payment to CAPTURED.
Payment Service appends a PaymentEvent with the capture timestamp and returns CAPTURED to the client.
The CAPTURED state is the final success state for a charge. A refund can only be issued against a CAPTURED payment; the authorization code required by the card network credit API is written to the record at this step.
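The transitions described so far can be made explicit as a transition table that rejects anything not in the machine. The UNKNOWN state and its resolution edges are my reading of the timeout discussion, not a spec from the text.

```python
# Allowed transitions for the charge state machine. CAPTURED and FAILED are
# terminal; UNKNOWN is resolved only by reconciliation (an assumption here).
TRANSITIONS = {
    "PENDING":     {"AUTHORIZING"},
    "AUTHORIZING": {"AUTHORIZED", "FAILED", "UNKNOWN"},
    "AUTHORIZED":  {"CAPTURED", "FAILED"},
    "UNKNOWN":     {"AUTHORIZED", "FAILED"},
    "CAPTURED":    set(),
    "FAILED":      set(),
}


def transition(current: str, target: str) -> str:
    """Raise on any transition not in the machine; never silently succeed."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```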
A refund is a credit operation on a previously CAPTURED payment. The Refund Service creates a Refund record in PENDING state, calls the card network credit API, then updates the refund state based on the result.
Components:
Refund Service: Validates the refund amount against the refundable balance, creates the Refund record, calls the card network credit API.
Card Network API: Accepts a credit request tied to the original authorization code. Responds in ~200ms.
Payments DB: Stores the Refund record. The refundable_amount check uses the sum of prior refunds on the same payment.
Request walkthrough:
Client sends POST /payments/{id}/refunds with amount and Idempotency-Key.
Refund Service fetches the payment, verifies it is in CAPTURED state, and checks refundable balance.
Refund Service inserts a Refund row in PENDING state under the idempotency key.
Refund Service calls the card network credit API with the original authorization code and refund amount.
Card network responds; Refund Service updates the refund to SUCCEEDED or FAILED.
The refund path is deliberately separate from the charge path. A failed refund does not affect the original payment's CAPTURED state. A partial refund does not create a new payment; it creates a new Refund entity with its own state machine.
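The refundable-balance check can be sketched as below. Dicts stand in for the DB rows; defaulting an omitted amount to the remaining balance (rather than the full captured amount) when partial refunds already exist is my assumption.

```python
from typing import Optional


def validate_refund(payment: dict, prior_refunds: list,
                    requested: Optional[int]) -> int:
    """Return the amount (minor units) to send to the card network, or raise."""
    if payment["state"] != "CAPTURED":
        raise ValueError("refunds only apply to CAPTURED payments")

    # Count everything that has not definitively failed against the balance.
    already_refunded = sum(
        r["amount"] for r in prior_refunds if r["state"] != "FAILED"
    )
    # Omitted amount defaults to a full refund of the remaining balance.
    amount = requested if requested is not None \
        else payment["amount"] - already_refunded

    if amount <= 0 or already_refunded + amount > payment["amount"]:
        raise ValueError("refund exceeds refundable balance")
    return amount
```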
This is a straightforward primary-database read. Payment state must be current (strong consistency requirement), and payment status queries are infrequent relative to write activity. Serve directly from the primary database. Return the payment record plus the events array from PaymentEvents sorted ascending by timestamp.
I would not add a Redis cache in front of payment status reads. The consistency requirement means a cached stale state is a liability, not a performance win. The volume of status queries does not justify the coherence overhead.
The card network times out. The client has no receipt. It retries with a new POST /payments request. If we treat every request as a fresh charge, we bill the customer twice. This is the most critical correctness problem in payment processing.
Constraints to design against:
Retries with the same intent must produce exactly one charge.
The solution must work when multiple retry requests arrive concurrently (a buggy client fires three retries simultaneously).
The idempotency key is client-supplied; we cannot trust the client to always send one.
This is the ghost-in-the-wire problem. Our service sent the authorization request. The card network processed it and notified the customer's bank. But the TCP response was dropped before reaching us. We do not know whether to treat this as a success or a failure.
Constraints:
Do not re-authorize (that risks charging the customer twice).
Do not leave the payment in UNKNOWN permanently.
Resolution must be automatic; manual support tickets are not a system.
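The automatic resolution loop can be sketched as follows, assuming the card network exposes a transaction status API keyed by our network reference ID. The `query_network_status` signature and its return values are hypothetical.

```python
def reconcile_unknown(db: dict, events: list, query_network_status) -> None:
    """Poll UNKNOWN payments and resolve them from the network's own record.

    Never re-authorize here -- query the network's view, then record it.
    `db` is an in-memory stand-in for the payments table."""
    for payment_id, payment in db.items():
        if payment["state"] != "UNKNOWN":
            continue
        status = query_network_status(payment["network_ref_id"])
        if status == "approved":
            payment["state"] = "AUTHORIZED"   # the charge did land
        elif status in ("declined", "not_found"):
            payment["state"] = "FAILED"       # safe for the client to retry
        else:
            continue                          # still ambiguous; next pass
        events.append((payment_id, "UNKNOWN", payment["state"]))
```

Run on a short interval (the text suggests 30 seconds), this resolves most UNKNOWN payments within a minute; anything older than an alerting threshold goes to a human.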
At 10K TPS, each payment involves a synchronous card network call (100-300ms), a DB write before the call, and a DB write after. The card network call dominates. With a single synchronous Payment Service, the maximum throughput is bounded by how many concurrent card network connections we can maintain and how fast the card network responds.
Constraints:
Do not drop requests under burst load.
Maintain the 2-second p99 SLA.
Degrade gracefully: reduce throughput rather than returning 500 errors.
A merchant in Germany wants to accept SEPA direct debit. A customer in Japan pays in JPY. Card network authorization covers credit and debit cards; everything else requires different integration paths, different settlement timelines, and different edge cases.
Constraints:
Support multi-currency display and settlement.
Support at least 3 regional payment methods (SEPA, ACH, PIX) without rebuilding the core charge flow.
FX conversion must not silently alter the amount the customer approved at checkout.
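The no-silent-repricing constraint can be expressed as a tolerance check at charge time. The function name, parameters, and the 1% threshold are illustrative choices, not values from the text.

```python
def charge_amount_in_settlement_currency(quoted_amount: int,
                                         quoted_rate: float,
                                         current_rate: float,
                                         tolerance: float = 0.01) -> int:
    """Recompute the charge at the current FX rate; abort if it drifted too far.

    Amounts are in minor units; `tolerance` is the maximum relative rate
    drift we silently absorb (1% here)."""
    drift = abs(current_rate - quoted_rate) / quoted_rate
    if drift > tolerance:
        # Surface a price-changed error instead of silently charging a
        # different amount than the customer approved at checkout.
        raise ValueError("exchange rate moved beyond tolerance; re-quote")
    return round(quoted_amount * current_rate)
```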
The async queue is the central architectural decision, and I draw it as the literal spine of the diagram on the whiteboard. The Payment Service accepts a charge in under 50ms (validate + PENDING + enqueue) and returns immediately. Charge Workers process the card network call at a controlled rate the network can sustain. The merchant's user experience is decoupled from the card network's variable latency, and burst traffic queues rather than dropping.
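The accept/process split can be sketched with an in-process queue standing in for the real message broker; all names are illustrative, and a dict keyed by idempotency key stands in for the payments table.

```python
import queue
import uuid

jobs: "queue.Queue[str]" = queue.Queue()
db: dict = {}


def accept_charge(idempotency_key: str, amount: int) -> dict:
    """Accept phase: validate, write PENDING, enqueue. No card network call,
    so this returns in milliseconds regardless of network latency."""
    if idempotency_key in db:                 # idempotent replay of a retry
        return db[idempotency_key]
    payment = {"payment_id": str(uuid.uuid4()), "amount": amount,
               "state": "PENDING"}
    db[idempotency_key] = payment
    jobs.put(idempotency_key)                 # hand off to the charge workers
    return payment


def charge_worker(network_call) -> None:
    """Process phase: drain the queue at whatever rate the network sustains.
    Worker count, not ingress, is the throughput knob; bursts queue up
    instead of dropping."""
    while not jobs.empty():
        key = jobs.get()
        payment = db[key]
        approved = network_call(payment) == "approved"
        payment["state"] = "CAPTURED" if approved else "FAILED"
```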
Start by clarifying scope: are you calling the card network directly or wrapping a provider like Stripe? The interesting engineering problems only appear in the direct-call model.
The Idempotency-Key header on POST /payments is non-negotiable. Make it required and return HTTP 400 if absent. Soft-optional idempotency keys defeat the entire purpose.
Use INSERT ... ON CONFLICT DO NOTHING to make payment creation idempotent at the database layer. The check-then-insert pattern has a race condition under concurrent retries.
The UNKNOWN state is not a failure; it is a genuine information gap. Timeout does not equal decline. Never conflate the two.
The card network may have charged the customer even when your request timed out. Never re-authorize on a timeout. Query the network's transaction status API first.
Write AUTHORIZING to the database before calling the card network, not after. If the service crashes mid-call, the stuck AUTHORIZING record is visible to the reconciliation job instead of silently lost.
A background reconciliation job polling UNKNOWN payments every 30 seconds resolves most within a minute. Alert if any payment stays UNKNOWN more than 10 minutes.
Split the accept phase (validate + PENDING + enqueue, under 50ms) from the process phase (card network call, up to 2 seconds). This is the pattern behind Stripe's async charge model.
Use a dead-letter queue for charge jobs that fail after N retries. Silently dropped failed charges are a compliance and customer support disaster.
The payment state machine is the contract. Any transition not in the machine (e.g., FAILED to AUTHORIZED) must throw, not silently succeed.
For multi-currency: record quoted_amount, charged_amount, and exchange_rate separately. If the rate shifts beyond a tolerance threshold between quote time and charge time, abort with a price-changed error rather than silently charging a different amount.
SEPA and ACH settle in 1-4 days rather than 200ms. Model them as PaymentProcessor implementations that return AUTHORIZING immediately; the reconciliation layer resolves them asynchronously when the bank confirms.
For burst traffic (flash sales): queue-based processing provides natural backpressure. Requests enqueue rather than drop. Worker throughput is the rate limit knob, not the ingress service.