Saga Pattern · System Internals

The big picture

TL;DR: the 30-second version

A saga splits one cross-service transaction into a sequence of local transactions, each committed independently in its own service's database — there is no global lock and no global commit.
Every forward step Tᵢ has a compensating action Cᵢ that semantically undoes it (refund, release, cancel). If a step fails, the saga runs the compensations for already-completed steps in reverse order — backward recovery.
Two coordination styles: orchestration (a central coordinator commands each step) or choreography (services react to each other's events). Both run the same forward/compensate logic.
You gain availability, loose coupling, and no distributed locks; you give up isolation (intermediate states are visible) and atomicity-by-rollback — so steps need idempotency, semantic locks, and correct compensations.
It's the standard way to do distributed transactions in microservices. Engines that run sagas for you: Temporal/Cadence, AWS Step Functions, Camunda/Zeebe, Axon, Netflix Conductor.

Everything below expands on these points. Read the core sections top to bottom for the full mental model; the collapsible "Go deeper" boxes hold the advanced internals (isolation anomalies, recovery semantics, comparison with TCC) you can skip on a first pass and return to later.

Diagram (success): T1 create order → T2 reserve stock → T3 take payment → T4 ship. The forward chain runs to completion.

Diagram (failure at T3): T1 order created → T2 stock reserved → T3 payment declined ✗. Compensate backward, in reverse order.

Start here: the problem it solves

An order touches several services — Order, Inventory, Payment, Shipping — each owning its own database. You want the whole thing to either complete or have no effect. But there's no shared transaction across those databases, and the textbook answer (two-phase commit) makes every service hold locks and block until a coordinator says commit — fragile and slow at scale, and a coordinator crash leaves everyone stuck.

Why is one ACID transaction off the table? Because each microservice owns a private database — that's the whole point of the pattern (independent deploys, independent scaling, no shared schema). A single transaction would have to span four different database engines, possibly different vendors, possibly across a network partition. There is no BEGIN/COMMIT that reaches across all of them.

Two-phase commit (2PC) can technically coordinate multiple resource managers, but it pays for atomicity with blocking. In the prepare phase every participant locks the rows it touched and promises to commit; it then holds those locks until the coordinator's decision arrives. If the coordinator crashes after participants vote 'yes' but before it broadcasts 'commit', the participants are stuck holding locks with no safe way to proceed — the classic blocking problem. Across services on a network, that window is long and common.

A saga gives up global isolation and instead embraces 'do, and be ready to undo'. Each service commits its own local transaction immediately and releases its locks right away; if a later step fails, the saga runs each completed step's compensating action to walk the system back to a consistent state.

The trade-off in one line: no distributed locks and no blocking coordinator, which buys high availability and loose coupling. The cost: no isolation (other clients can observe the half-done intermediate state), and you must design a correct compensating action for every compensatable step. A saga provides atomicity, consistency, and durability but not isolation: ACD, not ACID.

Forward steps, then reverse compensations

A saga is an ordered list of steps. Each step is a local transaction (T₁, T₂, …) paired with a compensating transaction (C₁, C₂, …) that semantically undoes it. The saga runs the forward steps in order. Each Tᵢ commits in its own service the moment it succeeds. If every step succeeds, the saga is complete. If step k fails, the saga runs Cₖ₋₁, Cₖ₋₂, … C₁ — the compensations for the already-completed steps, in reverse order — and ends in the aborted state. This reverse unwinding is called backward recovery.

Run T₁ (create order), T₂ (reserve inventory), T₃ (charge payment), T₄ (schedule shipping) — each committing locally as it goes.
If, say, T₃ fails: the already-committed steps are T₁ and T₂.
Run C₂ (release inventory), then C₁ (cancel order) — reverse order of how they ran.
Outcome: either all steps committed (success), or all completed steps compensated (clean abort). Never a permanent half-done state.
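The four steps above can be sketched as a tiny in-process executor. This is a minimal illustration, not a production orchestrator: `Step`, `run_saga`, and the step names are hypothetical, the "services" are just functions appending to a list, and nothing here is durable.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]       # forward local transaction T_i
    compensate: Callable[[], None]   # compensating transaction C_i


def run_saga(steps: List[Step]) -> str:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # The failed step never committed, so only earlier steps are undone.
            for done in reversed(completed):
                done.compensate()
            return "ABORTED"
        completed.append(step)
    return "COMPLETED"


log: List[str] = []


def decline_payment() -> None:
    raise RuntimeError("payment declined")


order_saga = [
    Step("order", lambda: log.append("T1 create order"),
         lambda: log.append("C1 cancel order")),
    Step("stock", lambda: log.append("T2 reserve inventory"),
         lambda: log.append("C2 release inventory")),
    Step("payment", decline_payment,
         lambda: log.append("C3 refund payment")),
    Step("shipping", lambda: log.append("T4 schedule shipping"),
         lambda: log.append("C4 cancel shipping")),
]

result = run_saga(order_saga)
print(result, log)
```

With T₃ failing, the log shows T₁ and T₂ followed by C₂ and then C₁. C₃ never runs, because the failed step has nothing committed to undo.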

Compensation is semantic, not a rollback: You can't 'roll back' a committed local transaction, because it is durable and visible the instant it commits. So a compensation is a brand-new transaction that counteracts the original's business effect: refund the charge, release the stock, cancel the booking. It does not restore the exact prior bytes; it restores the prior business meaning. It must be idempotent and must eventually succeed (retry until it does), because there is no compensation for a failed compensation.

Steps come in three flavors, and naming them is the key design skill. Compensatable steps can be undone by a later compensation (reserve inventory → release it). A pivot step is the point of no return: once it commits the saga is guaranteed to complete forward (e.g. 'charge the card' or 'send the shipment' — often you arrange the pivot so everything after it is retriable). Retriable steps come after the pivot and must succeed eventually; they have no compensation because the saga will never go backward past the pivot.

Predict: A saga has steps T₁, T₂, T₃, T₄. T₃ commits successfully, then T₄ fails. T₄ has no compensation defined. What should happen?

Hint: Where is the pivot? What kind of step has no compensation?

A step with no compensation is a retriable step, which means it sits after the pivot — so the saga must NOT go backward. The correct behavior is forward recovery: keep retrying T₄ (with backoff, idempotently) until it succeeds. Compensating T₃, T₂, T₁ would be wrong, because committing past the pivot was the promise that the saga would complete. If T₄ genuinely cannot succeed, that's an operational alert for manual intervention, not an automatic rollback.
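Forward recovery for a post-pivot step might look like the sketch below. The helper name `run_retriable`, the retry budget, and the `ManualInterventionRequired` exception are all assumptions made for illustration; the only behavior taken from the text is "retry with backoff, and escalate instead of compensating".

```python
import time
from typing import Callable


class ManualInterventionRequired(Exception):
    """Raised when a retriable step exhausts its retry budget."""


def run_retriable(step: Callable[[], None], max_attempts: int = 5,
                  base_delay: float = 0.01,
                  sleep: Callable[[float], None] = time.sleep) -> int:
    """Retry a post-pivot step with exponential backoff; never compensate.

    The step must be idempotent, because a retry may repeat work that
    partially succeeded. Returns the number of attempts used.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            step()
            return attempt
        except Exception:
            if attempt == max_attempts:
                # Escalate to a human; do NOT unwind past the pivot.
                raise ManualInterventionRequired("step kept failing")
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("max_attempts must be at least 1")


calls = {"n": 0}


def schedule_shipping() -> None:
    """Hypothetical T4 that fails twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("shipping service unavailable")


attempts_used = run_retriable(schedule_shipping, sleep=lambda _: None)
print(attempts_used)
```

The `sleep` parameter is injected only so the example runs instantly; a real caller would leave the default.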

Orchestration vs. choreography

There are two ways to drive a saga. Orchestration (shown in the simulator) uses a central orchestrator that tells each service what to do next and triggers compensations on failure — the workflow lives in one place, easy to follow, easy to reason about, and easy to add timeouts and observability to; the cost is that the orchestrator is a stateful component you must build, run, and keep available. Choreography has no central brain: each service emits domain events and others subscribe and react, so the workflow is implicit in the chain of events — maximally decoupled, but the end-to-end flow exists only in your head and gets hard to understand, test, and debug as steps multiply.

Diagram (orchestration): the orchestrator decides the next step and holds the saga state; commands flow down to the Order, Stock, and Pay services and replies flow back up. The services just obey commands. A central coordinator drives every step.

Diagram (choreography): Order emits an event → Stock reacts and emits an event → Pay reacts. Each service reacts to the previous one's event.

Go deeper: which one to pick, and the hidden costs

Choreography shines for short sagas (2–4 steps) with stable, simple flows: it adds no central component and keeps services maximally autonomous. Its failure mode is emergent complexity — cyclic event dependencies, no single place to see 'where is this order?', and compensation logic smeared across every service. There is also a risk of services becoming implicitly coupled through the events they expect from each other.
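To make that failure mode concrete, here is a toy choreographed flow. Everything in it is hypothetical: the `EventBus` class, the event names, and the "decline anything over 100" rule are invented for the sketch, and a real system would use a message broker rather than synchronous in-process calls.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict], None]


class EventBus:
    """Toy synchronous in-memory bus standing in for a message broker."""

    def __init__(self) -> None:
        self.handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.history: List[str] = []

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self.handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: Dict) -> None:
        self.history.append(event_type)
        for handler in list(self.handlers[event_type]):
            handler(payload)


bus = EventBus()
stock = {"widget": 1}


def on_order_created(evt: Dict) -> None:      # Inventory service: forward step
    stock[evt["sku"]] -= 1
    bus.publish("StockReserved", evt)


def on_stock_reserved(evt: Dict) -> None:     # Payment service: forward step
    if evt["amount"] > 100:
        bus.publish("PaymentDeclined", evt)
    else:
        bus.publish("PaymentTaken", evt)


def on_payment_declined(evt: Dict) -> None:   # Inventory service: compensation
    stock[evt["sku"]] += 1
    bus.publish("StockReleased", evt)


bus.subscribe("OrderCreated", on_order_created)
bus.subscribe("StockReserved", on_stock_reserved)
bus.subscribe("PaymentDeclined", on_payment_declined)

bus.publish("OrderCreated", {"order_id": "o-1", "sku": "widget", "amount": 250})
print(bus.history, stock)
```

No function in this sketch describes the whole saga. The flow only emerges from the subscriptions, and the compensation lives inside the Inventory service's own handler, which is the "smeared" logic the paragraph above warns about.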

Orchestration shines as flows grow or branch. The orchestrator is the single source of truth for saga state, which makes timeouts, retries, compensation ordering, and 'why did order 123 abort?' tractable. Its risks: the orchestrator can accrete business logic that belongs in the services (turning into a god component), and it must itself be made durable — if it crashes mid-saga it has to resume exactly where it left off.

That durability requirement is why workflow engines exist. Temporal/Cadence persist every step and every decision as an event-sourced history, so a crashed orchestrator replays the history and continues deterministically. AWS Step Functions encodes the saga as a state machine (Amazon States Language) with built-in retry/catch and a managed durable executor. Either way the orchestrator's own state must survive crashes, or a mid-saga failure becomes an orphaned, half-applied transaction.
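The idea of resuming from persisted progress can be shown with a deliberately naive sketch. It is not how Temporal or Step Functions work internally; `run_with_log`, the JSON file format, and the simulated crash are assumptions chosen to keep the example small.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List, Tuple


def run_with_log(saga_id: str, steps: List[Tuple[str, Callable[[], None]]],
                 log_path: str) -> List[str]:
    """Run steps, recording each completed one; skip steps already logged."""
    done: Dict[str, List[str]] = {}
    if os.path.exists(log_path):
        with open(log_path) as f:
            done = json.load(f)
    completed = done.setdefault(saga_id, [])
    executed_now: List[str] = []
    for name, action in steps:
        if name in completed:
            continue                       # finished before the crash
        action()
        completed.append(name)
        with open(log_path, "w") as f:     # persist progress after every step
            json.dump(done, f)
        executed_now.append(name)
    return executed_now


effects: List[str] = []
crash = {"armed": True}


def take_payment() -> None:
    if crash["armed"]:                     # simulate the orchestrator dying here
        crash["armed"] = False
        raise RuntimeError("orchestrator crashed")
    effects.append("payment")


steps = [
    ("create_order", lambda: effects.append("order")),
    ("reserve_stock", lambda: effects.append("stock")),
    ("take_payment", take_payment),
]
log_path = os.path.join(tempfile.mkdtemp(), "saga_log.json")

try:
    run_with_log("saga-1", steps, log_path)
except RuntimeError:
    pass                                   # process "restarts" below

resumed = run_with_log("saga-1", steps, log_path)
print(resumed, effects)
```

Note the gap between running `action()` and writing the log: a crash in that window makes the step run again on resume, which is one reason every step has to be idempotent.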

The cost of no isolation: locks, idempotency, retries

The defining property of a saga is that it has no isolation. Because each Tᵢ commits immediately, the intermediate state — order created but not paid, inventory reserved but order not confirmed — is visible to every other client and saga the moment it lands. Two-phase locking hides intermediate state behind locks; a saga deliberately does not. Almost all of a saga's implementation complexity comes from managing this missing isolation.

Semantic locks — instead of a database lock, mark the affected records with an application-level state flag (order = PENDING, account = AUTH_HELD). Other operations check the flag and refuse, queue, or treat the resource as tentatively committed. The saga clears the flag when it completes or compensates. This is a countermeasure for the lack of isolation, implemented in your data model rather than the database.
Idempotency — every step and every compensation may be delivered or retried more than once (messaging is at-least-once; crashes cause re-sends). Each handler must produce the same result whether it runs once or five times — usually via a unique saga/request id stored and de-duplicated on the receiving side.
Commutative / retriable design — order compensations so they commute where possible, and make steps after the pivot retriable so the saga can always drive forward. Operations that can't be safely retried or reordered (sending an email, calling an external irreversible API) need special handling or must be the pivot.
Durable saga log — the saga's progress (which steps committed, which compensated) must itself be persisted, so a crash of the orchestrator or coordinator resumes instead of leaving a transaction half-applied. This is the saga equivalent of a write-ahead log.
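A semantic lock from the list above can be as small as a state field that other operations check. The `Order` class, its states, and the `change_address` operation are hypothetical; in a real service the flag would be a column updated in the same local transaction as the step.

```python
from enum import Enum


class OrderState(Enum):
    PENDING = "PENDING"        # semantic lock: a saga is still working on this
    CONFIRMED = "CONFIRMED"
    CANCELLED = "CANCELLED"


class OrderLocked(Exception):
    """Raised when an operation hits an order that is mid-saga."""


class Order:
    def __init__(self, order_id: str, address: str) -> None:
        self.order_id = order_id
        self.address = address
        self.state = OrderState.PENDING    # set by T1 when the order is created

    def change_address(self, new_address: str) -> None:
        # Unrelated operations respect the flag and refuse while it is set.
        if self.state is OrderState.PENDING:
            raise OrderLocked(f"{self.order_id} is mid-saga")
        self.address = new_address

    def confirm(self) -> None:             # saga completed: clear the lock
        self.state = OrderState.CONFIRMED

    def cancel(self) -> None:              # compensation ran: clear the lock
        self.state = OrderState.CANCELLED


order = Order("o-1", "1 Old Street")
try:
    order.change_address("2 New Street")
    refused = False
except OrderLocked:
    refused = True

order.confirm()
order.change_address("2 New Street")
print(refused, order.address)
```

Refusing is only one policy. The same check could queue the request or let it through as tentative, depending on what the business can tolerate.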

Why idempotency is non-negotiable: once saga steps are driven by messaging, you inherit at-least-once delivery, so the same 'charge payment' command can arrive twice if an ack is lost. Without idempotency that's a double charge. The standard fix pairs the transactional outbox (write the event in the same local transaction as the data, so they can't diverge) with a consumed-message table on the receiver that drops duplicates by id.
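Here is one way the consumed-message table and the outbox could fit together on the receiving side, sketched with Python's built-in `sqlite3`. The table names, column names, and `handle_charge` function are illustrative assumptions, and the relay that would later publish rows from `outbox` is left out.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE consumed_messages (message_id TEXT PRIMARY KEY);
    CREATE TABLE charges (order_id TEXT, amount_cents INTEGER);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, event TEXT);
""")


def handle_charge(message_id: str, order_id: str, amount_cents: int) -> bool:
    """Apply a 'charge payment' command once; return False for duplicates."""
    try:
        with db:  # one local transaction: dedupe row + charge + outbox event
            db.execute("INSERT INTO consumed_messages VALUES (?)",
                       (message_id,))
            db.execute("INSERT INTO charges VALUES (?, ?)",
                       (order_id, amount_cents))
            db.execute("INSERT INTO outbox (event) VALUES (?)",
                       (f"PaymentCharged:{order_id}",))
        return True
    except sqlite3.IntegrityError:
        # Primary-key clash: this message id was already processed.
        return False


first = handle_charge("msg-42", "order-1", 1999)
second = handle_charge("msg-42", "order-1", 1999)  # redelivery after lost ack
charge_count = db.execute("SELECT COUNT(*) FROM charges").fetchone()[0]
outbox_count = db.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
print(first, second, charge_count, outbox_count)
```

Because the dedupe row, the charge, and the outbox row commit or roll back together, a duplicate delivery changes nothing and an event is never recorded without its data.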

Go deeper: countermeasures for the anomalies isolation would have prevented

Garcia-Molina & Salem's original paper, and Chris Richardson's later treatment, catalog the anomalies and their countermeasures. The three core anomalies: lost updates (a saga overwrites a change another saga made into its intermediate state), dirty reads (another transaction reads a value the saga will later compensate away), and fuzzy/non-repeatable reads (a saga reads twice and sees different values because another saga's steps interleaved).

Semantic lock — the PENDING-flag approach above; the workhorse countermeasure.
Commutative updates — design updates so order doesn't matter (e.g. increment/decrement a balance rather than set it), eliminating lost updates.
Pessimistic view — reorder the saga so the step that would expose a risky intermediate state runs as late as possible, shrinking the dirty-read window.
Reread value — before updating, re-read and verify the record hasn't changed since you read it (optimistic check), aborting if it has.
Version file / by-value — record operations and reorder or interpret them so out-of-order or duplicate delivery still yields a correct result.
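Two of the countermeasures above, commutative updates and reread value, are easy to show side by side. The `Account` class and its version counter are assumptions for the sketch; a database-backed service would typically do the version check in the UPDATE's WHERE clause.

```python
from dataclasses import dataclass


@dataclass
class Account:
    balance: int
    version: int = 0


class StaleRead(Exception):
    """Raised when the record changed after it was read."""


def set_balance(acct: Account, new_balance: int, expected_version: int) -> None:
    """Reread value: write only if nobody changed the record since the read."""
    if acct.version != expected_version:
        raise StaleRead("record changed since it was read")
    acct.balance = new_balance
    acct.version += 1


def adjust_balance(acct: Account, delta: int) -> None:
    """Commutative update: deltas give the same result in any order."""
    acct.balance += delta
    acct.version += 1


acct = Account(balance=100)
seen_version = acct.version        # saga A reads the record
adjust_balance(acct, -30)          # saga B's step interleaves
try:
    set_balance(acct, 150, seen_version)
    stale_detected = False
except StaleRead:
    stale_detected = True          # saga A must abort or re-read

a, b = Account(100), Account(100)
adjust_balance(a, 50)
adjust_balance(a, -30)
adjust_balance(b, -30)
adjust_balance(b, 50)
print(stale_detected, acct.balance, a.balance, b.balance)
```

The version check prevents saga A from silently overwriting saga B's change, while the delta-based updates never needed the check in the first place.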

When to reach for a saga (and when not to)

A saga is the right call when a business transaction genuinely spans multiple services with private databases, you need high availability, and you can tolerate eventual consistency with brief, visible intermediate states. It's the wrong call when a single service (and a single local ACID transaction) could own the whole operation, or when the business truly cannot tolerate any window where the intermediate state is observable.

Good fit: long-lived, multi-service business workflows (order fulfillment, travel booking, account onboarding) where blocking 2PC would kill availability and the steps have natural compensations (refund, release, cancel).
Weaker fit: operations that fit inside one service/database — just use a local ACID transaction; a saga adds enormous accidental complexity for nothing.
Watch out: workflows with steps that have no meaningful compensation (you can't un-send a physical package or un-email a customer). These force you to design around a pivot and accept forward-only completion past it.

What you actually trade: you buy availability, loose coupling, and lock-free progress. You pay with eventual consistency, the absence of isolation and atomic rollback, and the engineering burden of writing a correct, idempotent compensation for every reversible step, plus the semantic-lock machinery to keep other clients from acting on half-done state. The compensations are real code with real edge cases, and they are the part teams most often get wrong.

Failure modes and how they bite

Sagas trade one big failure mode (a blocking coordinator) for several smaller, subtler ones. Knowing them is the difference between a saga that's robust and one that silently corrupts business state.

Compensation itself fails — there is no compensation-for-a-compensation. A Cᵢ that fails must be retried until it succeeds (with backoff); if it truly cannot, the saga escalates to an alert for manual intervention. This is why compensations should be simple, idempotent, and as close to guaranteed-to-succeed as you can make them.
Dirty reads / lost updates — without isolation, another client or saga can read or overwrite the intermediate state. Mitigated with semantic locks, commutative updates, and reread checks, never with a database lock.
Non-idempotent retries — at-least-once delivery means a step can run twice. A non-idempotent step (charge, ship, increment) then double-applies. Mitigated with idempotency keys and a consumed-message table.
Pivot / irreversible steps — once you've crossed the pivot (or done something physically irreversible), backward recovery is impossible; you must go forward. Misclassifying a step as compensatable when it isn't leads to attempts to 'undo' things that can't be undone.
Lost orchestrator / coordinator state — if the saga's own progress log isn't durable, a crash leaves a transaction half-applied with nothing tracking it. The saga log must be persisted and recovered, which is exactly what workflow engines provide.

Saga vs 2PC vs TCC vs eventual consistency

Property	Saga	2PC (XA)	TCC	Plain eventual consistency
Atomicity	Business-level, via compensation	True atomic commit	Business-level, via Confirm/Cancel	None guaranteed
Isolation	None (needs semantic locks)	Full (locks held to commit)	Partial (reservation hides intent)	None
Locking	No distributed locks	Distributed locks until commit	Short reservation, no long lock	None
Coupling	Loose	Tight (all-or-nothing coordinator)	Medium (3 ops per service)	Loose
Blocking on coordinator crash	No	Yes — participants stuck	No	No
Best for	Long multi-service business flows	Few resources, short txns, strong consistency	Reservable resources (seats, stock, funds)	Independent updates, no all-or-nothing need

Go deeper: how TCC differs from a saga

TCC (Try-Confirm/Cancel) is a close cousin. Each service exposes three operations: Try reserves resources tentatively (hold the seat, authorize the funds) without making the effect final; Confirm makes all reservations final once every Try succeeds; Cancel releases the reservations if any Try fails. It's a two-phase shape like 2PC, but without database-level distributed locks — the 'lock' is an application-level reservation.

The key contrast: in a saga, every Tᵢ commits its real effect immediately and a compensation undoes a fully-applied change — so intermediate state is fully visible (the charge really happened, then gets refunded). In TCC, the Try only reserves; the real effect isn't applied until Confirm, so there's no 'real charge then refund' — just 'authorization hold, then capture or release.' TCC therefore offers better isolation (the held reservation hides intent) at the price of every service implementing three coordinated operations and holding reservations until Confirm.
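The Try/Confirm/Cancel shape can be sketched with a single participant type. `Reservable`, `run_tcc`, and the flight/hotel numbers are hypothetical, and the sketch ignores timeouts and durability; it only shows that nothing is final until Confirm.

```python
from typing import Dict, List, Tuple


class Reservable:
    """Hypothetical TCC participant: Try holds units, Confirm/Cancel settle."""

    def __init__(self, available: int) -> None:
        self.available = available
        self.sold = 0
        self.holds: Dict[str, int] = {}

    def try_reserve(self, tx_id: str, units: int) -> None:
        if units > self.available:
            raise ValueError("not enough units")
        self.available -= units
        self.holds[tx_id] = units                  # tentative, not yet sold

    def confirm(self, tx_id: str) -> None:
        self.sold += self.holds.pop(tx_id, 0)      # default 0: safe to repeat

    def cancel(self, tx_id: str) -> None:
        self.available += self.holds.pop(tx_id, 0)


def run_tcc(tx_id: str, requests: List[Tuple[Reservable, int]]) -> str:
    """Try every participant; confirm all on success, cancel the held ones."""
    tried: List[Reservable] = []
    try:
        for participant, units in requests:
            participant.try_reserve(tx_id, units)
            tried.append(participant)
    except ValueError:
        for participant in tried:
            participant.cancel(tx_id)
        return "CANCELLED"
    for participant, _ in requests:
        participant.confirm(tx_id)
    return "CONFIRMED"


flight = Reservable(available=2)
full_hotel = Reservable(available=0)
open_hotel = Reservable(available=1)

first = run_tcc("tx-1", [(flight, 1), (full_hotel, 1)])
after_cancel = (flight.available, flight.sold)
second = run_tcc("tx-2", [(flight, 1), (open_hotel, 1)])
print(first, after_cancel, second, flight.available, flight.sold)
```

In the cancelled run the flight seat was only ever held, never sold, so there is nothing to refund. That is the contrast with a saga, where the equivalent step would have committed a real sale and then compensated it.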

Where sagas run in the wild

The canonical example is e-commerce order fulfillment: create order → reserve inventory → take payment → arrange shipping, each in its own service, with compensations to refund the payment, release the inventory, and cancel the order if a later step fails. The same shape appears in travel booking (flight + hotel + car), account/loan onboarding, and any multi-step business process across services.

Temporal / Cadence — durable workflow engines (Cadence originated at Uber; Temporal is its successor). You write the saga as ordinary code; the engine persists every step as event-sourced history and resumes deterministically after a crash, with first-class support for compensations.
AWS Step Functions — managed orchestrator; the saga is a state machine (Amazon States Language) with built-in Retry/Catch, so a failed step triggers the compensation branch. AWS documents the saga pattern explicitly as a sample.
Camunda / Zeebe (BPMN) — model the saga as a business process with compensation boundary events; the engine drives forward and runs compensation handlers in reverse on failure.
Axon Framework — saga support in the JVM/CQRS+event-sourcing world: a saga component subscribes to events and dispatches commands/compensations.
Netflix Conductor — a microservice orchestration engine for stitching service calls into workflows with compensation/retry semantics.

You rarely hand-roll the plumbing: the forward/compensate logic is yours to write (it's your business domain), but the durable saga log, retries, timeouts, and crash-resume are exactly what these engines provide. Reaching for one is usually wiser than building a bespoke orchestrator and re-discovering every recovery edge case the hard way.

Common misconceptions & gotchas

Saga vs 2PC — when would I still use 2PC?

2PC gives true atomic commit and full isolation, but it holds distributed locks until commit and blocks if the coordinator crashes mid-decision. Use it for a small number of tightly-coupled resources in a short transaction where strong consistency is worth the availability hit (e.g. a single bank's internal databases). For long-lived flows across independent microservices, the blocking and lock-holding make 2PC impractical — that's exactly the gap sagas fill.

Orchestration vs choreography — which should I choose?

Choreography (services react to events) for short, stable flows where you value maximal decoupling and have no central component. Orchestration (a coordinator commands each step) once the flow has more than a few steps, branches, or needs timeouts and end-to-end visibility — the orchestrator becomes the single source of truth for saga state. Most non-trivial sagas end up orchestrated, often via a workflow engine.

What if a compensation itself fails?

There's no compensation for a compensation, so the answer is retry until it succeeds — compensations are designed to be idempotent and near-guaranteed. If it genuinely cannot succeed (a downstream is permanently broken), the saga escalates to manual intervention / an alert. This is why compensations should be kept simple and resilient.

How do you handle the lack of isolation?

Application-level countermeasures, not database locks: semantic locks (a PENDING/AUTH_HELD flag other operations respect), commutative updates (increment/decrement instead of set), reread-before-write checks, and ordering risky steps as late as possible. The intermediate state is genuinely visible — you manage it explicitly rather than hiding it.

Is a saga the same as eventual consistency?

A saga produces eventual consistency, but adds a guarantee plain eventual consistency lacks: business-level atomicity. Either all steps commit, or all completed steps are compensated — there's no permanent partial outcome. Plain eventual consistency makes no such all-or-nothing promise.

Quiz: An order saga reserves inventory (T₂), then payment (T₃) fails. The compensation C₂ 'release inventory' is sent but its ack is lost, so it's redelivered and runs twice. What property saves you from releasing the same stock twice?

Two-phase commit on the inventory service
Idempotency of the compensation (dedupe by saga id)
A distributed lock held across the saga
Strict serializable isolation between sagas

Answer: Idempotency of the compensation (dedupe by saga id). Messaging is at-least-once, so any step or compensation can be delivered more than once. The defense is idempotency: the inventory service records the saga/request id it has already processed and treats the duplicate C₂ as a no-op, so the stock is released exactly once in effect. Sagas deliberately avoid distributed locks and global isolation, so the other options aren't how this is solved.

In an interview

Lead with the contrast: across microservices you can't do one ACID transaction, and 2PC blocks while holding distributed locks; a saga instead runs a sequence of local transactions with compensating actions, undoing completed steps in reverse on failure (backward recovery). You trade isolation for availability and loose coupling — ACD, not ACID. Mention orchestration vs. choreography as the two coordination styles.

Be ready for the hard parts: compensations are semantic (a new transaction, not a byte-level rollback), must be idempotent, and must near-always succeed because there's no compensation for a compensation; intermediate states are visible, so use semantic locks or status flags ('pending') to stop others acting on half-done state; tie steps to messaging with the transactional outbox + at-least-once delivery so a crash mid-saga can resume. Know the pivot step and forward vs backward recovery. A common follow-up is the comparison with 2PC and TCC.

Then open the simulator: START the saga and STEP through the local transactions committing one by one; arm FAIL to make a step fail and watch the orchestrator compensate the completed steps in reverse, aborting cleanly with no lock ever held.

References & further reading

Garcia-Molina & Salem — Sagas (SIGMOD 1987) — the original paper that defined sagas and compensating transactions
Chris Richardson — Saga pattern (microservices.io) — the canonical modern treatment: orchestration/choreography, isolation countermeasures
AWS Step Functions — Saga pattern sample — saga as a state machine with built-in compensation branches
Temporal — The Saga pattern made easy — durable orchestration and compensations in a workflow engine
Microsoft — Saga design pattern (Azure Architecture Center) — clear diagrams of orchestration vs choreography and failure handling

