Idempotency and Exactly-Once Settlement
Why idempotency keys are necessary but not sufficient for exactly-once settlement, and the patterns that close the gap across retries and partial failures.
The first thing engineers reach for when a payment retries is an idempotency key. It is the right reflex. It is also, on its own, not enough.
An idempotency key tells your service: "If you have already processed this request, do not process it again." That guarantees the API caller a stable response. It does not guarantee that the underlying event-sourced ledger, the external payment rail, and the counterparty all agreed on what happened. In settlement, those three views diverging is the bug that ends up in a regulator's letter.
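As a minimal sketch of what the key does guarantee, the handler below stores the first response per key and replays it on retry. The names `IdempotentHandler` and `settle` are hypothetical, and the store is an in-memory dict purely for illustration; a real service would persist it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class IdempotentHandler:
    """Caches the first response per idempotency key (in memory, for illustration only)."""
    process: Callable[[dict], dict]
    _responses: Dict[str, dict] = field(default_factory=dict)

    def handle(self, idempotency_key: str, request: dict) -> dict:
        # A retry with a known key gets the stored response; nothing is reprocessed.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        response = self.process(request)
        self._responses[idempotency_key] = response
        return response


calls = []


def settle(request: dict) -> dict:
    # Hypothetical stand-in for the real processing step.
    calls.append(request)
    return {"status": "accepted", "amount": request["amount"]}


handler = IdempotentHandler(process=settle)
first = handler.handle("key-123", {"amount": 100})
retry = handler.handle("key-123", {"amount": 100})
```

The caller sees the same response twice and `settle` runs once. Nothing here says anything about what the rail or the counterparty recorded.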
What "exactly once" actually means in settlement
There is no exactly-once delivery in distributed systems. There is exactly-once effect — the outcome a customer or counterparty observes — built on top of at-least-once delivery. The job of a settlement system is to absorb retries and partial failures so the effect lands once.
That means three things have to be true at the same time:
- The ledger reflects the transaction once, regardless of how many times the request was retried.
- The external rail (Faster Payments, SEPA Inst, SWIFT, card network) sees one instruction, not several.
- Reconciliation against counterparty statements confirms the two views match.
Idempotency keys solve the first reliably. They help with the second only if the rail itself accepts and honours an idempotency token, which most rails do not. They do nothing for the third.
The patterns that close the gap
The first pattern is the outbox. The ledger write and the rail instruction are not the same operation, and trying to make them atomic across two systems is what causes most of the production pain. Instead, write the rail instruction as a record in the same database transaction as the ledger entry, and have a separate worker drain that outbox to the rail. Crashes on either side become safe: the worker retries any row it has not marked as sent, and the duplicates that retrying can produce are caught by the rail's own deduplication or, for the resulting callbacks, by your inbox.
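The outbox can be sketched with SQLite from the Python standard library. The table layout, the `record_payment` and `drain_outbox` names, and the `send_to_rail` callable are all assumptions made for illustration, not a prescribed schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ledger (instruction_id TEXT PRIMARY KEY, amount_minor INTEGER NOT NULL);
    CREATE TABLE outbox (
        instruction_id TEXT PRIMARY KEY,
        payload TEXT NOT NULL,
        sent INTEGER NOT NULL DEFAULT 0
    );
""")


def record_payment(conn, instruction_id, amount_minor):
    # One transaction: the ledger entry and the rail instruction commit together or not at all.
    with conn:
        conn.execute("INSERT INTO ledger VALUES (?, ?)", (instruction_id, amount_minor))
        conn.execute(
            "INSERT INTO outbox (instruction_id, payload) VALUES (?, ?)",
            (instruction_id, f"PAY {amount_minor}"),
        )


def drain_outbox(conn, send_to_rail):
    # Delivery is at-least-once: a crash after the send but before the UPDATE causes a resend.
    rows = conn.execute(
        "SELECT instruction_id, payload FROM outbox WHERE sent = 0"
    ).fetchall()
    for instruction_id, payload in rows:
        send_to_rail(instruction_id, payload)  # if this raises, the row stays unsent
        with conn:
            conn.execute(
                "UPDATE outbox SET sent = 1 WHERE instruction_id = ?", (instruction_id,)
            )


sent = []
record_payment(conn, "instr-1", 2500)
drain_outbox(conn, lambda ref, payload: sent.append((ref, payload)))
drain_outbox(conn, lambda ref, payload: sent.append((ref, payload)))  # nothing left to send
```

The worker sends under the same `instruction_id` on every attempt, so a rail that deduplicates on that reference sees one instruction even when the worker resends.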
The second is the inbox. Every callback or webhook from the rail — settlement confirmation, return, reversal — is written to an inbox table keyed by the rail's reference, before any business logic runs. Replays from the rail are no-ops by construction.
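A sketch of the inbox, again using SQLite. It assumes each callback carries a reference that uniquely identifies that event; the `receive_callback` name, the column layout, and the `apply` callable are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inbox ("
    "rail_reference TEXT PRIMARY KEY, event_type TEXT NOT NULL, body TEXT NOT NULL)"
)


def receive_callback(conn, rail_reference, event_type, body, apply):
    """Record the callback first; run business logic only if it is new."""
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO inbox VALUES (?, ?, ?)",
            (rail_reference, event_type, body),
        )
        if cur.rowcount == 0:
            # The primary key already exists: this is a replay, so do nothing.
            return False
        # If apply raises, the inbox row rolls back and the rail's next replay is processed.
        apply(rail_reference, event_type, body)
    return True


applied = []


def on_event(ref, event_type, body):
    # Hypothetical stand-in for the real business logic.
    applied.append((ref, event_type))


first = receive_callback(conn, "rail-ref-77", "settled", "{}", on_event)
replay = receive_callback(conn, "rail-ref-77", "settled", "{}", on_event)
```

The primary key does the deduplication, so the handler needs no separate "have I seen this?" lookup that could race with a concurrent delivery.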
The third is deterministic reconciliation. End-of-day and intraday reconciliation runs against independent counterparty statements and nostro balances. Breaks are surfaced with full lineage from the original instruction through the outbox, the rail confirmation, and the ledger entry. This is the layer that catches the cases the first two miss.
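The core comparison might look like the sketch below. It assumes both sides can be reduced to a mapping from instruction reference to amount in minor units; the `Break` type and the reason labels are hypothetical. Lineage lookup is left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Break:
    reference: str
    ledger_amount: Optional[int]
    statement_amount: Optional[int]
    reason: str


def reconcile(ledger: Dict[str, int], statement: Dict[str, int]) -> List[Break]:
    """Compare our ledger with a counterparty statement and list every break."""
    breaks = []
    # Sorting the references makes the output identical from run to run.
    for ref in sorted(ledger.keys() | statement.keys()):
        ours, theirs = ledger.get(ref), statement.get(ref)
        if ours is None:
            breaks.append(Break(ref, None, theirs, "missing_in_ledger"))
        elif theirs is None:
            breaks.append(Break(ref, ours, None, "missing_in_statement"))
        elif ours != theirs:
            breaks.append(Break(ref, ours, theirs, "amount_mismatch"))
    return breaks


ledger = {"instr-1": 2500, "instr-2": 1000, "instr-3": 700}
statement = {"instr-1": 2500, "instr-2": 1200, "instr-4": 50}
breaks = reconcile(ledger, statement)
```

In this example `instr-1` matches and the other three references each produce a break, one of each kind.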
Where this gets tested
The cases that exercise all three patterns are not the happy path. They are: a 504 from the rail after the rail has already accepted; a webhook arriving before the synchronous response; a reversal arriving days later for a settled instruction; and the awkward one — your own service crashing between writing the ledger entry and writing the outbox row. The first three are handled by the patterns above. The last is handled by treating the ledger write and the outbox write as one transaction, never two.
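The last case can be demonstrated in a few lines. In this sketch an exception stands in for the process dying between the two writes; with a real crash, the database discards the uncommitted transaction on recovery, with the same result. The schema and the `SimulatedCrash` name are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ledger (instruction_id TEXT PRIMARY KEY, amount_minor INTEGER NOT NULL);
    CREATE TABLE outbox (instruction_id TEXT PRIMARY KEY, payload TEXT NOT NULL);
""")


class SimulatedCrash(Exception):
    """Stands in for the service dying partway through the transaction."""


def write_with_crash(conn, instruction_id, amount_minor):
    with conn:  # commits on success, rolls back if the block raises
        conn.execute("INSERT INTO ledger VALUES (?, ?)", (instruction_id, amount_minor))
        # The outbox INSERT would follow here; the "crash" happens before it runs.
        raise SimulatedCrash("died before the outbox row was written")


try:
    write_with_crash(conn, "instr-9", 400)
except SimulatedCrash:
    pass

ledger_rows = conn.execute("SELECT COUNT(*) FROM ledger").fetchone()[0]
outbox_rows = conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
```

Both counts are zero after the failure: the ledger never holds an entry whose rail instruction was lost, and the caller's retry starts from a clean state.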
None of this is exotic. The reason it does not show up by default is that it costs more to build than the optimistic path, and the failures it prevents only appear in production at low rates. The reason regulated transaction and settlement systems treat it as table stakes is that those low rates compound, and the audit trail you need to prove what happened in any one of them lives in the same journal as everything else.