Idempotency and Exactly-Once Settlement
Why idempotency keys are necessary but not sufficient for exactly-once settlement, and the patterns that close the gap across retries and partial failures.
When a payment retries, the first thing most engineers reach for is an idempotency key. That instinct is correct. It just doesn't finish the job.
An idempotency key tells your service: "if you've already processed this request, don't process it again." That gives the caller a stable response. What it doesn't give you is agreement between the event-sourced ledger, the external payment rail, and the counterparty about what actually happened. When those three views drift apart in settlement, that's the kind of thing a regulator eventually writes you a letter about.
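As a minimal sketch of what the key buys you, assuming a single-process service and an in-memory SQLite table (the `idempotency` table, `handle_payment`, and the `process` callback are hypothetical names for illustration): a replayed key returns the stored response and the processing step runs once.

```python
import sqlite3

def open_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE idempotency (key TEXT PRIMARY KEY, response TEXT NOT NULL)"
    )
    return conn

def handle_payment(conn, key, amount_minor, process):
    """Return the stored response for `key`; call `process` only on first sight."""
    row = conn.execute(
        "SELECT response FROM idempotency WHERE key = ?", (key,)
    ).fetchone()
    if row is not None:
        return row[0]  # replay: same response, no second processing
    response = process(amount_minor)
    with conn:  # commits on success, rolls back if the insert raises
        conn.execute(
            "INSERT INTO idempotency (key, response) VALUES (?, ?)",
            (key, response),
        )
    return response

calls = []

def process(amount_minor):
    # Stand-in for the real work; records each invocation.
    calls.append(amount_minor)
    return f"accepted:{amount_minor}"

conn = open_db()
first = handle_payment(conn, "req-123", 1500, process)
second = handle_payment(conn, "req-123", 1500, process)  # retry with the same key
```

Note what the sketch does not touch: nothing here talks to the rail or the counterparty. It stabilises the caller's view and nothing more.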
What "exactly once" actually means in settlement
There is no exactly-once delivery in distributed systems, and pretending otherwise is how teams build themselves into a corner. What you can build is exactly-once effect: the outcome a customer or counterparty observes, sitting on top of at-least-once delivery underneath. The job of a settlement system is to absorb retries and partial failures cleanly enough that the observable effect lands one time.
For that to be true, three conditions have to hold simultaneously:
- The ledger reflects the transaction once, no matter how many times the request was retried.
- The external rail (Faster Payments, SEPA Inst, SWIFT, a card network) sees one instruction rather than several.
- Reconciliation against counterparty statements confirms the two views agree.
An idempotency key handles the first one reliably. It only helps with the second if the rail itself accepts and honours an idempotency token, which most rails don't. For the third, it's irrelevant.
The patterns that close the gap
Most teams get there by combining an outbox, an inbox, and reconciliation that's deterministic enough to be repeatable.
The outbox exists because the ledger write and the rail instruction are not the same operation, and trying to make them atomic across two systems is the source of most of the production pain in this space. The fix is to write the rail instruction as a row in the same database transaction as the ledger entry, and have a separate worker drain that table out to the rail. Crashes on either side stay safe: the worker retries, the rail's own deduplication (where it has one) drops repeated instructions, and your inbox drops the repeated confirmations that come back.
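The outbox can be sketched roughly as follows, assuming SQLite and a hypothetical `send` callable standing in for the rail client. Table names, column names, and the `PAY ...` payload format are illustrative, not taken from any particular system.

```python
import sqlite3

def open_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE ledger (txn_id TEXT PRIMARY KEY, amount_minor INTEGER NOT NULL);
        CREATE TABLE outbox (
            txn_id  TEXT PRIMARY KEY,
            payload TEXT NOT NULL,
            sent    INTEGER NOT NULL DEFAULT 0
        );
    """)
    return conn

def record_payment(conn, txn_id, amount_minor):
    # One transaction: the ledger entry and the rail instruction commit together.
    with conn:
        conn.execute("INSERT INTO ledger VALUES (?, ?)", (txn_id, amount_minor))
        conn.execute(
            "INSERT INTO outbox (txn_id, payload) VALUES (?, ?)",
            (txn_id, f"PAY {amount_minor}"),
        )

def drain_outbox(conn, send):
    # Worker loop body: deliver unsent rows, marking each only after send succeeds.
    rows = conn.execute(
        "SELECT txn_id, payload FROM outbox WHERE sent = 0 ORDER BY txn_id"
    ).fetchall()
    for txn_id, payload in rows:
        send(txn_id, payload)  # may raise; the row stays unsent and is retried
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE txn_id = ?", (txn_id,))

sent = []
attempts = {"n": 0}

def flaky_send(txn_id, payload):
    # Hypothetical rail client that times out on its first call.
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise TimeoutError("rail timed out")
    sent.append((txn_id, payload))

conn = open_db()
record_payment(conn, "txn-1", 1500)
try:
    drain_outbox(conn, flaky_send)  # first attempt fails, row stays unsent
except TimeoutError:
    pass
drain_outbox(conn, flaky_send)      # retry delivers the instruction
drain_outbox(conn, flaky_send)      # nothing left to send
```

The worker marks a row as sent only after `send` returns, so a crash between the two steps produces a resend rather than a lost instruction. That is at-least-once delivery by design, which is why something downstream still has to deal with duplicates.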
The inbox is the mirror image. Every callback or webhook from the rail (settlement confirmation, return, reversal) is written to an inbox table keyed by the rail's reference before any business logic runs against it. Replays from the rail become no-ops by construction, which means you can stop reasoning about whether the rail will ever resend something, because the answer is "yes, eventually."
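A minimal inbox sketch, assuming SQLite and assuming the rail gives each callback a unique reference (here `rail_ref`; the `apply` callback and the reference format are hypothetical). The insert is the deduplication step: if the row already exists, the business logic never runs.

```python
import sqlite3

def open_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inbox (rail_ref TEXT PRIMARY KEY, event_type TEXT NOT NULL)"
    )
    return conn

def receive_callback(conn, rail_ref, event_type, apply):
    """Record the callback first; run `apply` only if it was not seen before."""
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO inbox (rail_ref, event_type) VALUES (?, ?)",
            (rail_ref, event_type),
        )
        if cur.rowcount == 0:
            return False  # replay from the rail: no-op
        # If apply raises, the inbox row rolls back and a replay is processed again.
        apply(rail_ref, event_type)
    return True

applied = []

def apply(rail_ref, event_type):
    # Stand-in for business logic such as marking an instruction settled.
    applied.append((rail_ref, event_type))

conn = open_db()
first = receive_callback(conn, "RAIL-789", "settlement_confirmed", apply)
replay = receive_callback(conn, "RAIL-789", "settlement_confirmed", apply)
```

Running `apply` inside the same transaction as the inbox insert is a design choice in this sketch: it only gives atomicity if the business logic writes to the same database.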
Then there's reconciliation, which is the layer that catches what the first two miss. End-of-day and intraday runs go against independent counterparty statements and nostro balances. Breaks are surfaced with full lineage from the original instruction through the outbox row, the rail confirmation, and the ledger entry. The reason this is worth the effort is that the cases where outbox plus inbox quietly disagree with the counterparty are exactly the cases nobody notices until the next monthly close.
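To make "deterministic enough to be repeatable" concrete, here is a small sketch, assuming both sides can be reduced to `(reference, amount_minor)` pairs. The data shapes and the `reconcile` function are illustrative assumptions; a real run would also carry dates, currencies, and lineage identifiers.

```python
from collections import defaultdict

def group(entries):
    """Group (reference, amount_minor) pairs into reference -> sorted amounts."""
    grouped = defaultdict(list)
    for ref, amount_minor in entries:
        grouped[ref].append(amount_minor)
    return {ref: sorted(amounts) for ref, amounts in grouped.items()}

def reconcile(ledger_entries, statement_entries):
    """Return breaks as (reference, our_amounts, their_amounts), sorted by reference."""
    ours, theirs = group(ledger_entries), group(statement_entries)
    breaks = []
    for ref in sorted(ours.keys() | theirs.keys()):
        a, b = ours.get(ref, []), theirs.get(ref, [])
        if a != b:
            breaks.append((ref, a, b))
    return breaks

ledger = [("txn-1", 1500), ("txn-2", 700), ("txn-3", 250)]
# The counterparty saw txn-2 twice (a duplicated instruction) and never saw txn-3.
statement = [("txn-1", 1500), ("txn-2", 700), ("txn-2", 700)]

breaks = reconcile(ledger, statement)
# breaks == [("txn-2", [700], [700, 700]), ("txn-3", [250], [])]
```

Sorting both the references and the amounts means the same inputs produce the same breaks in the same order, whatever order the entries arrived in. That is what lets a rerun of yesterday's reconciliation be compared line by line with the original.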
Where this gets tested
The cases that exercise all of this aren't the happy path. They look like a 504 from the rail after the rail has already accepted the instruction; a webhook arriving before the synchronous response does; a reversal arriving days later for an already-settled instruction; and the awkward one, which is your own service crashing somewhere between writing the ledger entry and writing the outbox row. The first three are handled by what's described above: the outbox worker retries and deduplication or reconciliation catches any duplicate, the inbox records the webhook whenever it arrives, and reconciliation ties the late reversal back to the original instruction. The last one is handled by treating the ledger write and the outbox write as a single transaction, never two.
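The last case is easy to check in a test. The sketch below assumes SQLite and uses a raised exception as a stand-in for a process crash (a real crash leaves an uncommitted transaction, which the database discards in the same way). The `crash_between` flag and the table layout are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ledger (txn_id TEXT PRIMARY KEY, amount_minor INTEGER NOT NULL);
    CREATE TABLE outbox (txn_id TEXT PRIMARY KEY, payload TEXT NOT NULL);
""")

def record_payment(conn, txn_id, amount_minor, crash_between=False):
    # Both inserts share one transaction; an exception rolls back both.
    with conn:
        conn.execute("INSERT INTO ledger VALUES (?, ?)", (txn_id, amount_minor))
        if crash_between:
            raise RuntimeError("simulated crash between ledger and outbox writes")
        conn.execute(
            "INSERT INTO outbox VALUES (?, ?)", (txn_id, f"PAY {amount_minor}")
        )

def counts(conn):
    ledger = conn.execute("SELECT COUNT(*) FROM ledger").fetchone()[0]
    outbox = conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
    return ledger, outbox

try:
    record_payment(conn, "txn-1", 1500, crash_between=True)
except RuntimeError:
    pass
after_crash = counts(conn)   # neither row survived the rollback

record_payment(conn, "txn-1", 1500)  # the retry succeeds cleanly
after_retry = counts(conn)   # one ledger row, one outbox row
```

The property worth asserting is that the two counts never differ: there is no state in which the ledger has an entry the outbox does not know about.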
None of this is exotic engineering. The reason it doesn't show up by default is that it costs more to build than the optimistic path, and the failures it prevents only appear in production at low rates. Regulated transaction and settlement systems treat it as table stakes because those low rates compound, and the audit trail you'd need to explain any one of them lives in the same journal as everything else you'd want to defend.