Every Backend Engineer needs to know how to deal with payments.
A simple introduction to payment retries.
Any system that handles payments needs to be reliable and fault tolerant.
Payments can fail for many reasons: a network issue, a temporary bank glitch, or an expired card. A good retry strategy handles these failures smoothly while avoiding duplicate charges.
Let me show you how.
Keeping Track of Payment Status
A key part of a solid payment retry system is tracking the payment status. Every transaction should have a status that helps decide whether to retry, refund, or escalate.
Typical payment statuses might include:
Initiated: The payment request has been received but not yet processed.
Processing: The payment is currently being processed by the payment gateway.
Succeeded: The payment has been completed.
Failed: The payment attempt has failed.
Pending Retry: The payment failed but is scheduled for a retry attempt.
Refunded: The payment was successful but has been refunded.
Cancelled: The payment was cancelled before processing.
Payment statuses should be stored in an append-only database table to maintain accuracy and allow for troubleshooting.
An append-only database, also known as an immutable database, only allows new data to be added. Existing data is never modified or deleted.
Here's a simple example of how a payment status tracking table might look:
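-- One row per status change; rows are only ever inserted.
CREATE TABLE payment_status (
    id SERIAL PRIMARY KEY,
    payment_id UUID NOT NULL,          -- the payment this change belongs to
    status VARCHAR(50) NOT NULL,       -- e.g. 'Initiated', 'Failed', 'Pending Retry'
    timestamp TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    metadata JSONB                     -- error messages, transaction IDs, etc.
);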
Each time a payment's status changes, you insert a new row into this table. The metadata column can store additional information relevant to each status change, such as error messages for failed attempts or transaction IDs for successful payments.
This approach offers several benefits:
Audit Trail: It provides a complete history of all status changes, which is crucial for compliance and debugging.
Data Integrity: Since existing records are never modified, a faulty update cannot overwrite or corrupt earlier history.
Concurrency: It simplifies handling of concurrent operations, as you're only ever adding new data, not updating existing data.
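To make the insert-only pattern concrete, here is a minimal sketch in Python. It assumes an in-memory SQLite database as a stand-in for the PostgreSQL table above (TEXT columns replace UUID and JSONB), and the helper names record_status, current_status, and status_history are made up for illustration.

```python
import json
import sqlite3
import uuid

# Assumption: SQLite stands in for PostgreSQL so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE payment_status (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        payment_id TEXT NOT NULL,
        status TEXT NOT NULL,
        timestamp TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        metadata TEXT
    )
    """
)


def record_status(payment_id, status, metadata=None):
    """Append a new status row; existing rows are never updated or deleted."""
    conn.execute(
        "INSERT INTO payment_status (payment_id, status, metadata) VALUES (?, ?, ?)",
        (payment_id, status, json.dumps(metadata) if metadata is not None else None),
    )
    conn.commit()


def current_status(payment_id):
    """The current status is the most recently inserted row for the payment."""
    row = conn.execute(
        "SELECT status FROM payment_status WHERE payment_id = ? "
        "ORDER BY id DESC LIMIT 1",
        (payment_id,),
    ).fetchone()
    return row[0] if row else None


def status_history(payment_id):
    """The full audit trail, oldest change first."""
    rows = conn.execute(
        "SELECT status FROM payment_status WHERE payment_id = ? ORDER BY id",
        (payment_id,),
    ).fetchall()
    return [r[0] for r in rows]


payment_id = str(uuid.uuid4())
record_status(payment_id, "Initiated")
record_status(payment_id, "Processing")
record_status(payment_id, "Failed", {"error": "network_timeout"})
record_status(payment_id, "Pending Retry")
```

Note that the code only follows the append-only convention; nothing in the schema enforces it. In a real database you would typically also restrict UPDATE and DELETE permissions on the table.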
Key Parts of a Payment Retry System
Two components that can help are a Retry Queue and a Dead Letter Queue.
1. Retry Queue
Handles temporary issues like network timeouts or processing delays.
Failed payments are re-queued and retried after a short wait.
2. Dead Letter Queue (DLQ)
Stores persistent failures or transactions that have exceeded retry limits.
Helps debug and isolate problematic transactions instead of endlessly retrying them.
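As a rough sketch of how these two queues could fit together, the Python below keeps everything in memory. The retry limit of 3, the 30-second base delay, and the doubling of the wait between attempts (exponential backoff) are assumptions for illustration, as are all class and function names; a production system would normally use a durable message broker instead.

```python
import heapq
import itertools

MAX_ATTEMPTS = 3          # assumed retry limit
BASE_DELAY_SECONDS = 30   # assumed "short wait" before the first retry


class RetryQueue:
    """In-memory retry queue ordered by the time each payment becomes due."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal due times

    def schedule(self, payment_id, failed_attempts, now):
        # Assumption: the wait doubles after every failed attempt.
        delay = BASE_DELAY_SECONDS * 2 ** (failed_attempts - 1)
        entry = (now + delay, next(self._counter), payment_id, failed_attempts + 1)
        heapq.heappush(self._heap, entry)
        return delay

    def pop_ready(self, now):
        """Return (payment_id, next_attempt_number) for every payment that is due."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            _, _, payment_id, next_attempt = heapq.heappop(self._heap)
            ready.append((payment_id, next_attempt))
        return ready


dead_letter_queue = []  # parked failures awaiting manual inspection


def handle_transient_failure(queue, payment_id, failed_attempts, now, reason):
    """Re-queue a failed payment, or park it in the DLQ once the limit is hit."""
    if failed_attempts >= MAX_ATTEMPTS:
        dead_letter_queue.append(
            {"payment_id": payment_id, "attempts": failed_attempts, "reason": reason}
        )
        return "dead_lettered"
    queue.schedule(payment_id, failed_attempts, now)
    return "scheduled"
```

Times are passed in as plain numbers (seconds) rather than read from the clock, which keeps the sketch deterministic and easy to test.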
Retry or Not Retry: That Is the Question
Deciding whether to retry a failed payment is critical. A good retry strategy should distinguish between transient errors that may succeed upon retry and permanent failures that require alternative handling.
When to Retry
Network timeouts – A retry might work if the issue is temporary.
Bank system unavailability – Some bank systems may experience brief downtime.
Rate limits – A delayed retry can allow time for limits to reset.
Service outages – If the payment gateway is momentarily down, retrying after a short period can work.
When Not to Retry
Insufficient funds – Repeating the charge won't change the customer's balance.
Card expired – The card details must be updated before retrying.
Invalid card details – A retry with the same incorrect information will always fail.
Fraud detection blocks – Some transactions are permanently blocked due to security reasons.
If the error is a fundamental problem with your input or the system, then "No Retry" is the better choice.
A well-designed system should assess the type of failure before automatically retrying a transaction, which keeps retries efficient and avoids unnecessary processing.
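One way to encode the two lists above is a small classifier. This is only a sketch: the error code strings are hypothetical and would need to be mapped from whatever your payment gateway actually returns. Treating unknown codes as non-retryable is an assumption chosen as the conservative default.

```python
from enum import Enum


class FailureKind(Enum):
    TRANSIENT = "transient"   # may succeed on retry
    PERMANENT = "permanent"   # retrying will not help


# Hypothetical error codes mirroring the "When to Retry" list.
TRANSIENT_ERRORS = {
    "network_timeout",
    "bank_unavailable",
    "rate_limited",
    "gateway_unavailable",
}

# Hypothetical error codes mirroring the "When Not to Retry" list.
PERMANENT_ERRORS = {
    "insufficient_funds",
    "card_expired",
    "invalid_card_details",
    "fraud_blocked",
}


def classify_failure(error_code):
    """Map an error code to a failure kind."""
    if error_code in TRANSIENT_ERRORS:
        return FailureKind.TRANSIENT
    if error_code in PERMANENT_ERRORS:
        return FailureKind.PERMANENT
    # Assumption: unknown errors are not retried automatically.
    return FailureKind.PERMANENT


def is_retryable(error_code):
    return classify_failure(error_code) is FailureKind.TRANSIENT
```

For example, is_retryable("rate_limited") is True, while is_retryable("card_expired") is False.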
How Payment Retries Work
The following steps illustrate how a robust payment retry process works:
1. Failure Occurs – A payment attempt fails due to a temporary or permanent issue.
2. Retryable Check – The system evaluates whether the failure is transient (e.g., network timeout) or non-retryable (e.g., insufficient funds).
   If retryable, the transaction moves to the Retry Queue for reprocessing.
   If not retryable, it is logged or stored in a database for further analysis.
3. Retry Process – Payments in the Retry Queue are picked up by the Payment Service for another attempt.
4. Second Failure Check – If the payment still fails, the system re-evaluates its retryability.
   If retryable, the system retries up to a limit (e.g., 3 attempts).
   If not retryable, or once the retry limit is reached, the failed transaction is moved to the Dead Letter Queue (DLQ) for manual inspection and debugging.
5. Dead Letter Queue Handling – Transactions in the DLQ require investigation to determine if they need manual intervention, customer notification, or a system fix.
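The flow above can be sketched end to end in a few lines of Python. Everything here is illustrative: the charge callback, the error codes, and the payment IDs are hypothetical, the limit of 3 is read as three attempts in total, and the wait between retries is left out to keep the example short.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed: total attempts, including the first one
RETRYABLE_ERRORS = {"network_timeout", "gateway_unavailable"}  # hypothetical codes


def process_payments(payment_ids, charge):
    """charge(payment_id, attempt) returns None on success or an error code."""
    succeeded, failure_log, dead_letter_queue = [], [], []
    retry_queue = deque()

    # Failure Occurs + Retryable Check: first attempt for every payment.
    for payment_id in payment_ids:
        error = charge(payment_id, 1)
        if error is None:
            succeeded.append(payment_id)
        elif error in RETRYABLE_ERRORS:
            retry_queue.append((payment_id, 2))  # 2 = next attempt number
        else:
            failure_log.append((payment_id, error))  # stored for analysis

    # Retry Process + Second Failure Check: drain the retry queue.
    while retry_queue:
        payment_id, attempt = retry_queue.popleft()
        error = charge(payment_id, attempt)
        if error is None:
            succeeded.append(payment_id)
        elif error in RETRYABLE_ERRORS and attempt < MAX_ATTEMPTS:
            retry_queue.append((payment_id, attempt + 1))
        else:
            # Non-retryable on retry, or retry limit reached: park in the DLQ.
            dead_letter_queue.append((payment_id, error, attempt))

    return succeeded, failure_log, dead_letter_queue


def fake_charge(payment_id, attempt):
    """Scripted outcomes per attempt, standing in for a real gateway call."""
    outcomes = {
        "pay_ok": [None],
        "pay_flaky": ["network_timeout", None],
        "pay_broke": ["insufficient_funds"],
        "pay_down": ["gateway_unavailable"] * 3,
        "pay_mixed": ["network_timeout", "card_expired"],
    }
    return outcomes[payment_id][attempt - 1]


succeeded, failure_log, dlq = process_payments(
    ["pay_ok", "pay_flaky", "pay_broke", "pay_down", "pay_mixed"], fake_charge
)
```

In this run, pay_flaky recovers on its second attempt, pay_broke is logged without any retry, and both pay_down (limit reached) and pay_mixed (non-retryable on retry) end up in the DLQ.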
Avoiding Duplicate Charges with Exactly-Once Delivery
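One major challenge in payments is ensuring retries don’t result in duplicate charges. This is where exactly-once delivery comes in: the goal is that a payment is processed only once, even if the request is retried multiple times. In practice, this usually means making retried requests idempotent, so that delivering the same request twice has the same effect as delivering it once.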
Ways to achieve this include:
Idempotency Keys – Ensuring repeated attempts reference the same transaction.
Distributed Transaction Handling – Using event sourcing or locks to prevent double processing.
But that’s a topic for another post I’m cooking ;-)
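As a small preview of the idempotency key idea, here is a toy sketch. The PaymentGateway class, its fields, and the key format are invented for illustration and do not represent any real gateway's API. It is also single-threaded and in-memory; a real system would store keys durably and make the check-and-insert atomic (for example, with a unique constraint).

```python
class PaymentGateway:
    """Toy in-memory gateway that deduplicates charges by idempotency key."""

    def __init__(self):
        self._results = {}  # idempotency_key -> result of the first attempt
        self.charges = []   # charges that were actually executed

    def charge(self, idempotency_key, customer_id, amount_cents):
        # A repeated key means this is a retry: return the stored result
        # instead of charging the customer again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]

        charge_id = f"ch_{len(self.charges) + 1}"
        self.charges.append((charge_id, customer_id, amount_cents))
        result = {"charge_id": charge_id, "status": "Succeeded"}
        self._results[idempotency_key] = result
        return result


gateway = PaymentGateway()
first = gateway.charge("order-1001-payment", "cust_42", 2500)
retry = gateway.charge("order-1001-payment", "cust_42", 2500)  # same key: no new charge
```

The retry returns the original result, and only one charge is recorded.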
Final Thoughts
Building a solid payment retry system is essential to handle failures effectively while avoiding lost revenue and customer frustration. Using payment status tracking, retry queues, and dead letter queues can help create a more resilient system.
Sometimes, reliability is just a series of well-managed retries.
Thank you for reading System Design Classroom. If you like this post, share it with your friends!