Payment Processing
Design a Stripe-like payment processor from scratch: charge flows, idempotency to prevent double-charges, handling unknown states after timeouts, and scaling to 10K transactions per second during a flash sale.
What is a payment processing system?
A payment processor sits between a merchant's checkout page and the card networks (Visa, Mastercard, Amex). It routes authorization requests to the right network, records the result, and handles the inevitable failures: timeouts, partial captures, refunds, and retries. I open every payment system interview by saying "the happy path is boring; everything interesting happens when the network drops your response," because that framing immediately shows the interviewer you understand what makes this problem hard.
Three problems off the happy path drive this design: ensuring a charge fires exactly once when the client retries a timed-out request, handling the limbo state where the card network accepted a charge but your service never received the response, and scaling the synchronous authorization pipeline to 10K TPS during a flash sale. This design builds a direct card network integration, not a wrapper around Stripe or Braintree; these distributed systems problems only surface when you own the network call yourself.
Functional Requirements
Core Requirements
- Merchants can initiate a charge with a card token, amount, and currency.
- The system authorizes and captures the charge via the card network.
- Merchants can issue full or partial refunds on a captured payment.
- Merchants can query the current status of any payment.
Below the Line (out of scope)
- PCI-DSS card data storage. We do not store raw card numbers; a tokenization vault (like Stripe's vault or Braintree) converts card details to an opaque token before they reach our system.
- Fraud detection ML model internals.
- End-to-end settlement and ledger reconciliation.
- Dispute and chargeback management.
The hardest part in scope: Exactly-once charge execution. A client that retries after a network timeout must not trigger a second charge. A charge response that gets dropped in transit must not leave the payment in an ambiguous permanent state. The idempotency key mechanism and the payment state machine together solve both problems, and each gets a full deep dive.
PCI-DSS storage is below the line because storing raw card numbers expands the compliance scope far beyond the distributed systems challenge. In production I would integrate a tokenization vault so our system never sees the actual card number. The opaque card token we receive is useless without the vault.
Fraud detection is below the line because it runs as a scoring service, not a core payment-path component. To add it, I would call a fraud score API synchronously before sending the auth request to the card network and reject any charge above a configured risk threshold.
Settlement and reconciliation are below the line because they run as a daily batch pipeline against completed transactions, not a real-time flow. They do not affect the charge or refund paths we are designing.
Dispute and chargeback management is below the line because it is a human-assisted process triggered by customer disputes through their bank, not something our payment API initiates directly.
Non-Functional Requirements
Core Requirements
- Exactly-once delivery: A charge must complete exactly once regardless of how many times the client retries due to network failures or timeouts.
- Strong consistency: The payment state stored in our database must agree with what the card network recorded. Stale state is not acceptable for financial data.
- Availability: 99.99% uptime for the charge and refund endpoints, which allows about 52.6 minutes of downtime per year.
- Latency: Charge API returns in under 2 seconds p99. Card network authorization adds 100-300ms of unavoidable latency; our infrastructure must not contribute more than an additional 200ms on top of that.
- Scale: 5M active merchants globally, 50M active cardholders. Peak 10K transactions per second during flash sales (Black Friday, holiday surges). Sustained for a full day, that peak would amount to roughly 864M transactions (10,000 TPS × 86,400 seconds), so plan for on the order of 1B transactions per day during burst periods.
- Auditability: Every payment state transition is written to an immutable event log and must be queryable indefinitely.
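The availability and scale figures above are simple arithmetic, and it is worth being able to reproduce them on the spot. A minimal sketch, assuming a 365-day year; the function names are mine, chosen for illustration:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 (leap years ignored)
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400


def downtime_minutes_per_year(availability: float) -> float:
    """Maximum yearly downtime permitted by an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def transactions_per_day(tps: int) -> int:
    """Daily volume if the given rate were sustained for 24 hours."""
    return tps * SECONDS_PER_DAY


if __name__ == "__main__":
    print(round(downtime_minutes_per_year(0.9999), 2))  # 52.56
    print(transactions_per_day(10_000))                 # 864000000
```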
Below the Line
- Sub-100ms charge response time (card networks impose unavoidable latency)
- Multi-region active-active with synchronous cross-region consistency guarantees
Read/write ratio: Payments are write-skewed at the state machine level. Each transaction produces 3-4 state transitions (PENDING to AUTHORIZING to AUTHORIZED to CAPTURED), and each transition produces an immutable PaymentEvent record. The write-to-read ratio on the payments table is roughly 4:1. The strong consistency requirement means we cannot serve reads from an eventually-consistent read replica; every status query must reflect the current authoritative state. We do not apply aggressive caching to payment records.
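To make the "one transition, one PaymentEvent" rule concrete, here is a minimal in-memory sketch of the state machine. The transition table is my assumption: the PENDING, AUTHORIZING, AUTHORIZED, and CAPTURED path comes from the text above, while the DECLINED and UNKNOWN edges are illustrative and the real table belongs to the deep dives.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class PaymentState(Enum):
    PENDING = "PENDING"
    AUTHORIZING = "AUTHORIZING"
    AUTHORIZED = "AUTHORIZED"
    DECLINED = "DECLINED"
    UNKNOWN = "UNKNOWN"
    CAPTURED = "CAPTURED"


# Assumed transition table. UNKNOWN is resolved later by reconciliation.
ALLOWED = {
    PaymentState.PENDING: {PaymentState.AUTHORIZING},
    PaymentState.AUTHORIZING: {
        PaymentState.AUTHORIZED,
        PaymentState.DECLINED,
        PaymentState.UNKNOWN,
    },
    PaymentState.UNKNOWN: {PaymentState.AUTHORIZED, PaymentState.DECLINED},
    PaymentState.AUTHORIZED: {PaymentState.CAPTURED},
    PaymentState.DECLINED: set(),
    PaymentState.CAPTURED: set(),
}


@dataclass(frozen=True)
class PaymentEvent:
    """Immutable record of a single state transition."""
    payment_id: str
    from_state: PaymentState
    to_state: PaymentState
    occurred_at: datetime


@dataclass
class Payment:
    payment_id: str
    state: PaymentState = PaymentState.PENDING
    events: list = field(default_factory=list)

    def transition(self, to_state: PaymentState) -> None:
        """Move to to_state and append exactly one PaymentEvent."""
        if to_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {to_state.name}")
        self.events.append(
            PaymentEvent(self.payment_id, self.state, to_state, datetime.now(timezone.utc))
        )
        self.state = to_state
```

Walking a payment through AUTHORIZING, AUTHORIZED, and CAPTURED leaves three events in the log, one per transition, and any attempt to skip a step raises instead of silently corrupting the state.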
The 2-second p99 latency target defines our timeout strategy. In practice, I set the card network timeout at 1 second rather than 1.5 seconds to give the error path enough budget to write UNKNOWN and return before breaching the 2-second SLA. That is why the "unknown state" problem in deep dive 2 exists: the hard timeout is non-negotiable.
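The timeout budget can be sketched as follows. This is an illustration under assumptions, not the production client: `call_network` is a hypothetical callable standing in for the card network request, and a thread pool is used only because it makes the deadline easy to show. A real client would set the timeout on the socket or HTTP call itself.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

SLA_S = 2.0              # p99 target for the charge API
NETWORK_TIMEOUT_S = 1.0  # hard deadline on the card network call
ERROR_PATH_BUDGET_S = SLA_S - NETWORK_TIMEOUT_S  # time left to write UNKNOWN and respond

_executor = ThreadPoolExecutor(max_workers=4)


def authorize_with_deadline(call_network, timeout_s: float = NETWORK_TIMEOUT_S) -> str:
    """Return AUTHORIZED or DECLINED, or UNKNOWN if the deadline passes.

    A timeout is never mapped to a failure state: the network may have
    processed the charge even though we did not see the response.
    """
    future = _executor.submit(call_network)
    try:
        approved = future.result(timeout=timeout_s)
    except FutureTimeout:
        # The worker thread keeps running; we only stop waiting for it.
        return "UNKNOWN"
    return "AUTHORIZED" if approved else "DECLINED"
```

The design choice to notice is the return value on timeout: the caller gets UNKNOWN with a full second of budget left to persist it, rather than a guess in either direction.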
The 99.99% availability on a synchronous path with an external dependency (the card network) means we cannot let card network slowness cascade to our uptime SLA. Bulkheads and fallback logic must isolate card network outages from the payment recording path.
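One common way to build such a bulkhead is a bounded semaphore around the card network call, so a slow network can occupy only a fixed number of workers. A minimal sketch; the class name, the limit, and the fail-fast policy are my assumptions rather than something the design above prescribes:

```python
import threading


class BulkheadFull(Exception):
    """Raised when every slot for the protected dependency is in use."""


class Bulkhead:
    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn):
        """Run fn if a slot is free; otherwise fail fast instead of queueing."""
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull("card network bulkhead is saturated")
        try:
            return fn()
        finally:
            self._slots.release()
```

Failing fast keeps request threads available for work that does not touch the card network, such as status queries, even while authorization calls are piling up.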
Core Entities
- Payment: The transaction record. Carries the payment ID, merchant ID, amount, currency, card token, current state, idempotency key, network reference ID (set before calling the card network), and timestamps.
- PaymentEvent: An immutable record of a single state transition on a Payment. Every time a payment changes state, we append one PaymentEvent. This is the audit trail and the source of truth for what happened and when.
- Refund: A credit linked to a captured Payment. Carries a refund ID, parent payment ID, amount, reason, and current state (PENDING, SUCCEEDED, FAILED).
The full schema, indexes, and constraints are deferred to the data model deep dive. I deliberately keep the Payment entity flat here; the audit trail lives in PaymentEvents, not in versioned columns on the payment row. The three entities above are sufficient to drive the API and the High-Level Design.
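As a rough picture of the three entities, here is a sketch using plain dataclasses. Field names follow the descriptions above; the types, the use of integer minor units for amounts, and the extra `event_id`, `from_state`, and `updated_at` fields are my assumptions, and the real schema is deferred to the data model deep dive.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Payment:
    payment_id: str
    merchant_id: str
    amount: int                          # minor units, e.g. cents (assumption)
    currency: str
    card_token: str                      # opaque token from the vault, never a card number
    state: str
    idempotency_key: str
    network_reference_id: Optional[str]  # set before calling the card network
    created_at: datetime
    updated_at: datetime


@dataclass(frozen=True)
class PaymentEvent:
    """Append-only audit record; frozen so instances cannot be mutated."""
    event_id: str
    payment_id: str
    from_state: Optional[str]            # None for the initial PENDING insert (assumption)
    to_state: str
    occurred_at: datetime


@dataclass
class Refund:
    refund_id: str
    payment_id: str                      # parent payment
    amount: int
    reason: Optional[str]
    state: str                           # PENDING, SUCCEEDED, or FAILED
```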
API Design
FR 1 - Initiate a charge:
POST /payments
Headers: Idempotency-Key: <client-generated uuid>
Body: { card_token, amount, currency, description? }
Response: { payment_id, status }
The Idempotency-Key header is required, not optional. Without it, retrying a timed-out request creates a duplicate charge. The key is a UUID the client generates before sending; any retry uses the exact same key. This is the pattern Stripe uses and the one we replicate here.
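The client-side half of this contract fits in a few lines: generate the key once, outside the retry loop. In the sketch below, `FakePaymentAPI` and `DropsFirstResponse` are hypothetical in-memory stand-ins for the real endpoint, used only to simulate a response lost in transit.

```python
import uuid


class FakePaymentAPI:
    """In-memory stand-in for POST /payments (illustration only)."""

    def __init__(self):
        self._by_key = {}
        self.charges_created = 0

    def post_payment(self, idempotency_key: str, body: dict) -> dict:
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]  # replay the stored response
        self.charges_created += 1
        response = {"payment_id": f"pay_{self.charges_created}", "status": "PENDING"}
        self._by_key[idempotency_key] = response
        return response


class DropsFirstResponse(FakePaymentAPI):
    """Processes the first request but loses its response, like a network drop."""

    def __init__(self):
        super().__init__()
        self._dropped = False

    def post_payment(self, idempotency_key: str, body: dict) -> dict:
        response = super().post_payment(idempotency_key, body)
        if not self._dropped:
            self._dropped = True
            raise TimeoutError("response lost in transit")
        return response


def charge_with_retries(api, body: dict, attempts: int = 3) -> dict:
    key = str(uuid.uuid4())  # generated once; every retry reuses it
    last_error = None
    for _ in range(attempts):
        try:
            return api.post_payment(key, body)
        except TimeoutError as exc:
            last_error = exc
    raise last_error
```

Against `DropsFirstResponse`, the first attempt times out after the server has already recorded the charge, and the retry returns the stored response instead of creating a second one. Moving the `uuid.uuid4()` call inside the loop is the bug to avoid: each retry would then look like a brand-new charge.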
FR 2 - Get payment status:
GET /payments/{payment_id}
Response: { payment_id, status, amount, currency, created_at, events: [...] }
FR 3 - Issue a refund:
POST /payments/{payment_id}/refunds
Headers: Idempotency-Key: <client-generated uuid>
Body: { amount?, reason? }
Response: { refund_id, status }
I always call this out proactively in interviews: POST /payments/{payment_id}/refunds uses POST rather than DELETE because a refund is a new financial operation, not a reversal of the charge at the resource level. A full refund does not delete the payment; it creates a Refund entity linked to the original. Partial refunds make this unambiguous: you cannot DELETE /payments/{id} by 30%.
The amount field is optional. If omitted, the system defaults to a full refund of the captured amount. The service validates that the sum of all prior refunds plus this refund does not exceed the original captured amount before calling the card network.
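The refund validation rule reads naturally as a small pure function. A sketch, assuming amounts are integers in minor units; the function name and error messages are illustrative:

```python
from typing import Optional, Sequence


def resolve_refund_amount(
    captured_amount: int,
    prior_refunds: Sequence[int],
    requested: Optional[int],
) -> int:
    """Return the amount to refund, or raise if it would over-refund.

    An omitted amount defaults to the full captured amount, as described
    above, so it is rejected when partial refunds already exist.
    """
    amount = captured_amount if requested is None else requested
    if amount <= 0:
        raise ValueError("refund amount must be positive")
    if sum(prior_refunds) + amount > captured_amount:
        raise ValueError("refunds would exceed the captured amount")
    return amount
```

For a 1000-unit capture with one prior 300-unit refund, a request for 700 passes and a request for 800 is rejected before the card network is ever called. In a real service this check and the insert of the new Refund row would need to run in the same transaction, otherwise two concurrent refunds could each pass validation.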
If authentication were in scope, I would add a merchant_id claim to the request context from the auth token and scope all payment lookups to that merchant. I would not add a merchant_id field to the request body because that is a privilege escalation risk.
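A sketch of what that scoping could look like, assuming the auth layer has already verified the token and handed us its claims as a dict. The `auth_claims` shape and the in-memory `payments` store are hypothetical:

```python
def get_payment_for_merchant(payments: dict, payment_id: str, auth_claims: dict) -> dict:
    """Look up a payment, scoped to the merchant identified by the auth token.

    merchant_id comes only from verified claims, never from the request body.
    """
    merchant_id = auth_claims["merchant_id"]
    payment = payments.get(payment_id)
    # Same error for "missing" and "belongs to another merchant", so the
    # response does not reveal whether the payment ID exists.
    if payment is None or payment["merchant_id"] != merchant_id:
        raise LookupError("payment not found")
    return payment
```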
High-Level Design
FR 1 - Accept a charge and record it
The simplest starting point: record the intent to charge before touching any external system. The client sends a charge request; the Payment Service writes the payment to the database in PENDING state and returns a payment_id. No card network interaction yet.
Components:
- Merchant App: Sends the charge request with a card token, amount, and Idempotency-Key header.
- Payment Service: Validates the request, checks the idempotency key, writes the payment record in PENDING state.
- Payments DB (PostgreSQL): Stores the payment with a UNIQUE constraint on idempotency_key.
Request walkthrough:
- Client sends POST /payments with card token, amount, currency, and Idempotency-Key.
- Payment Service validates the request (positive amount, supported currency, non-empty card token).
- Payment Service executes INSERT ... ON CONFLICT DO NOTHING on the idempotency key.
- Payment Service returns { payment_id, status: "PENDING" }.
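The insert-or-ignore step in the walkthrough above can be sketched end to end. I use Python's built-in sqlite3 here purely so the example runs anywhere; the design itself uses PostgreSQL, and the table layout and function names are my assumptions. The `ON CONFLICT (idempotency_key) DO NOTHING` clause is valid in both databases (SQLite needs version 3.24 or newer).

```python
import sqlite3
import uuid


def open_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the Payments DB."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE payments (
            payment_id      TEXT PRIMARY KEY,
            idempotency_key TEXT NOT NULL UNIQUE,
            amount          INTEGER NOT NULL,
            currency        TEXT NOT NULL,
            card_token      TEXT NOT NULL,
            state           TEXT NOT NULL
        )
        """
    )
    return conn


def record_intent(conn, idempotency_key: str, amount: int, currency: str, card_token: str) -> dict:
    """Insert a PENDING payment, or return the existing one for a retried key."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO payments "
            "(payment_id, idempotency_key, amount, currency, card_token, state) "
            "VALUES (?, ?, ?, ?, ?, 'PENDING') "
            "ON CONFLICT (idempotency_key) DO NOTHING",
            (f"pay_{uuid.uuid4().hex}", idempotency_key, amount, currency, card_token),
        )
        # Whether the insert happened or was skipped, the row for this key is
        # the authoritative answer.
        row = conn.execute(
            "SELECT payment_id, state FROM payments WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()
    return {"payment_id": row[0], "status": row[1]}
```

Calling `record_intent` twice with the same key returns the same payment_id both times and leaves a single row in the table. The UNIQUE constraint does the real work: two concurrent requests with the same key cannot both insert, regardless of timing.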
This diagram covers only the intent-recording step. The card network is not involved yet; we just know a merchant wants to charge a customer.
FR 1 (continued) - Authorize the charge with the card network
Now we extend the write path with the actual authorization. The Payment Service transitions the payment to AUTHORIZING (written to DB before making the external call), then calls the card network synchronously, then records the result.
Writing AUTHORIZING before the network call is deliberate: if the service crashes mid-call, the stuck AUTHORIZING record is visible to the reconciliation job introduced in deep dive 2 rather than silently lost. I draw this two-write sequence on the whiteboard every time, because interviewers consistently ask "what if you crash between the write and the network call?" Having the answer already on the board earns immediate credibility.
Components:
- Payment Service: Transitions payment PENDING to AUTHORIZING, calls the card network, writes the result.
- Card Network API: Visa, Mastercard, or Amex network endpoint. Returns an authorization code (approved) or a decline code in approximately 200ms.
- Payments DB: Stores the state transitions and appends PaymentEvents for the audit trail.
Request walkthrough:
- Payment Service transitions payment to AUTHORIZING and appends a PaymentEvent.
- Payment Service calls the card network with the card token, amount, and currency.
- Card network responds in ~200ms: AUTHORIZED with an auth code, or DECLINED with a decline code.
- Payment Service records the result, transitions the payment state, and appends a final PaymentEvent.
- Payment Service returns the updated status to the client.
The two arrows from PS to DB represent two separate moments within the same request: the first update happens before the card network call (recording in-flight state), the second happens after (recording the outcome).
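Here is the two-write sequence as a minimal sketch. `InMemoryPaymentsDB` and the `call_network` callable are hypothetical stand-ins; the point is only the ordering of the writes around the external call.

```python
class InMemoryPaymentsDB:
    """Stand-in for the Payments DB: current state plus an append-only event list."""

    def __init__(self):
        self.state = {}
        self.events = []

    def transition(self, payment_id: str, to_state: str) -> None:
        self.events.append((payment_id, self.state.get(payment_id), to_state))
        self.state[payment_id] = to_state


def authorize(db: InMemoryPaymentsDB, payment_id: str, call_network) -> str:
    # Write 1: record the in-flight state before touching the card network.
    db.transition(payment_id, "AUTHORIZING")

    # External call. If the process dies here, the AUTHORIZING row stays
    # behind for a reconciliation job to find.
    approved = call_network(payment_id)

    # Write 2: record the outcome.
    outcome = "AUTHORIZED" if approved else "DECLINED"
    db.transition(payment_id, outcome)
    return outcome
```

If `call_network` raises partway through, standing in for a crash, the payment is left in AUTHORIZING with one event recorded, which is exactly the evidence the recovery path needs.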
The timeout problem: The card network can fail to respond within our 1-second budget. When that happens, we cannot tell whether the charge succeeded (the network dropped our response) or failed (the network never processed the request). Marking this FAILED and allowing a retry risks charging the customer twice. This is the central correctness problem in payment processing, covered fully in deep dive 2.