The big picture
TL;DR: the 30-second version
- A saga splits one cross-service transaction into a sequence of local transactions, each committed independently in its own service's database — there is no global lock and no global commit.
- Every forward step Tᵢ has a compensating action Cᵢ that semantically undoes it (refund, release, cancel). If a step fails, the saga runs the compensations for already-completed steps in reverse order — backward recovery.
- Two coordination styles: orchestration (a central coordinator commands each step) or choreography (services react to each other's events). Both run the same forward/compensate logic.
- You gain availability, loose coupling, and no distributed locks; you give up isolation (intermediate states are visible) and atomicity-by-rollback — so steps need idempotency, semantic locks, and correct compensations.
- It's the standard way to do distributed transactions in microservices. Engines that run sagas for you: Temporal/Cadence, AWS Step Functions, Camunda/Zeebe, Axon, Netflix Conductor.
Everything below expands on these points. Read the core sections top to bottom for the full mental model; the collapsible "Go deeper" boxes hold the advanced internals (isolation anomalies, recovery semantics, comparison with TCC) you can skip on a first pass and return to later.
Start here: the problem it solves
An order touches several services — Order, Inventory, Payment, Shipping — each owning its own database. You want the whole thing to either complete or have no effect. But there's no shared transaction across those databases, and the textbook answer (two-phase commit) makes every service hold locks and block until a coordinator says commit — fragile and slow at scale, and a coordinator crash leaves everyone stuck.
Why is one ACID transaction off the table? Because each microservice owns a private database — that's the whole point of the pattern (independent deploys, independent scaling, no shared schema). A single transaction would have to span four different database engines, possibly different vendors, possibly across a network partition. There is no BEGIN/COMMIT that reaches across all of them.
Two-phase commit (2PC) can technically coordinate multiple resource managers, but it pays for atomicity with blocking. In the prepare phase every participant locks the rows it touched and promises to commit; it then holds those locks until the coordinator's decision arrives. If the coordinator crashes after participants vote 'yes' but before it broadcasts 'commit', the participants are stuck holding locks with no safe way to proceed — the classic blocking problem. Across services on a network, that window is long and common.
A saga gives up global isolation and instead embraces 'do, and be ready to undo'. Each service commits its own local transaction immediately and releases its locks right away; if a later step fails, the saga runs each completed step's compensating action to walk the system back to a consistent state.
Forward steps, then reverse compensations
A saga is an ordered list of steps. Each step is a local transaction (T₁, T₂, …) paired with a compensating transaction (C₁, C₂, …) that semantically undoes it. The saga runs the forward steps in order. Each Tᵢ commits in its own service the moment it succeeds. If every step succeeds, the saga is complete. If step k fails, the saga runs Cₖ₋₁, Cₖ₋₂, … C₁ — the compensations for the already-completed steps, in reverse order — and ends in the aborted state. This reverse unwinding is called backward recovery.
- Run T₁ (create order), T₂ (reserve inventory), T₃ (charge payment), T₄ (schedule shipping) — each committing locally as it goes.
- If, say, T₃ fails: the already-committed steps are T₁ and T₂.
- Run C₂ (release inventory), then C₁ (cancel order) — reverse order of how they ran.
- Outcome: either all steps committed (success), or all completed steps compensated (clean abort). Never a permanent half-done state.
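The forward-then-compensate flow above can be sketched in a few lines of Python. This is only an illustrative in-memory orchestrator, not a production design: `Step`, `run_saga`, and the step names are hypothetical, and it assumes a failed local transaction rolls itself back so that only previously committed steps need compensating.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    name: str
    action: Callable[[], None]                          # forward local transaction T_i
    compensation: Optional[Callable[[], None]] = None   # C_i, or None if nothing to undo


def run_saga(steps: List[Step]) -> str:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()              # commits locally in its own service
            completed.append(step)
        except Exception:
            # Backward recovery: the failed step is assumed to have rolled
            # itself back, so only already-committed steps are compensated.
            for done in reversed(completed):
                if done.compensation is not None:
                    done.compensation()
            return "ABORTED"
    return "COMPLETED"


log: List[str] = []


def charge_payment() -> None:
    raise RuntimeError("card declined")   # simulate T3 failing


steps = [
    Step("create_order", lambda: log.append("T1"), lambda: log.append("C1")),
    Step("reserve_inventory", lambda: log.append("T2"), lambda: log.append("C2")),
    Step("charge_payment", charge_payment, lambda: log.append("C3")),
    Step("schedule_shipping", lambda: log.append("T4"), lambda: log.append("C4")),
]
outcome = run_saga(steps)
# log is now ["T1", "T2", "C2", "C1"] and outcome is "ABORTED"
```

Note that C₃ never runs: the step that failed did not commit, so there is nothing of its own to undo.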
Steps come in three flavors, and naming them is the key design skill. Compensatable steps can be undone by a later compensation (reserve inventory → release it). A pivot step is the point of no return: once it commits the saga is guaranteed to complete forward (e.g. 'charge the card' or 'send the shipment' — often you arrange the pivot so everything after it is retriable). Retriable steps come after the pivot and must succeed eventually; they have no compensation because the saga will never go backward past the pivot.
Predict: A saga has steps T₁, T₂, T₃, T₄. T₃ commits successfully, then T₄ fails. T₄ has no compensation defined. What should happen?
Hint: Where is the pivot? What kind of step has no compensation?
A step with no compensation is a retriable step, which means it sits after the pivot — so the saga must NOT go backward. The correct behavior is forward recovery: keep retrying T₄ (with backoff, idempotently) until it succeeds. Compensating T₃, T₂, T₁ would be wrong, because committing past the pivot was the promise that the saga would complete. If T₄ genuinely cannot succeed, that's an operational alert for manual intervention, not an automatic rollback.
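Forward recovery for a retriable step might look like the sketch below. It assumes the step is idempotent and that exponential backoff is an acceptable retry policy; `run_retriable`, its parameters, and the flaky `schedule_shipping` stand-in are hypothetical names for illustration. The `sleep` argument is injectable so the example runs without actually waiting.

```python
import time
from typing import Callable, Dict, List


def run_retriable(step: Callable[[], None], max_attempts: int = 5,
                  base_delay: float = 0.01,
                  sleep: Callable[[float], None] = time.sleep) -> int:
    """Retry a post-pivot step with exponential backoff; return attempts used.

    Re-raises the last error once max_attempts is exhausted, so the caller
    can alert an operator instead of rolling back past the pivot.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 0
    while True:
        attempt += 1
        try:
            step()          # must be idempotent: it may run more than once
            return attempt
        except Exception:
            if attempt >= max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


calls: Dict[str, int] = {"n": 0}


def schedule_shipping() -> None:
    # Hypothetical T4 that fails twice before succeeding.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("shipping service unavailable")


delays: List[float] = []
attempts = run_retriable(schedule_shipping, sleep=delays.append)
```

Here the step succeeds on the third attempt after two backoff delays. A real orchestrator would usually persist the attempt count so retries survive a restart.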
Orchestration vs. choreography
There are two ways to drive a saga. Orchestration (shown in the simulator) uses a central orchestrator that tells each service what to do next and triggers compensations on failure — the workflow lives in one place, easy to follow, easy to reason about, and easy to add timeouts and observability to; the cost is that the orchestrator is a stateful component you must build, run, and keep available. Choreography has no central brain: each service emits domain events and others subscribe and react, so the workflow is implicit in the chain of events — maximally decoupled, but the end-to-end flow exists only in your head and gets hard to understand, test, and debug as steps multiply.
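To make the choreographed style concrete, here is a rough sketch in which no component knows the whole flow. Each handler reacts to one event and emits the next. The `EventBus` class is a toy synchronous stand-in for a real message broker, and the event names and the rule that amounts over 100 are declined are assumptions made up for the example.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict], None]


class EventBus:
    """Toy synchronous pub/sub standing in for a real message broker."""

    def __init__(self) -> None:
        self._subs: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.history: List[str] = []

    def subscribe(self, event: str, handler: Handler) -> None:
        self._subs[event].append(handler)

    def publish(self, event: str, payload: Dict) -> None:
        self.history.append(event)
        for handler in list(self._subs[event]):
            handler(payload)


bus = EventBus()
stock = {"widget": 1}


def inventory_on_order_created(order: Dict) -> None:
    # Inventory service: reserve stock, then announce it.
    stock[order["sku"]] -= 1
    bus.publish("InventoryReserved", order)


def payment_on_inventory_reserved(order: Dict) -> None:
    # Payment service: this toy version declines any amount over 100.
    if order["amount"] > 100:
        bus.publish("PaymentFailed", order)
    else:
        bus.publish("PaymentCharged", order)


def inventory_on_payment_failed(order: Dict) -> None:
    # The compensation lives in the service that owns the data.
    stock[order["sku"]] += 1
    bus.publish("InventoryReleased", order)


bus.subscribe("OrderCreated", inventory_on_order_created)
bus.subscribe("InventoryReserved", payment_on_inventory_reserved)
bus.subscribe("PaymentFailed", inventory_on_payment_failed)

bus.publish("OrderCreated", {"order_id": "o-1", "sku": "widget", "amount": 250})
```

The end-to-end flow is visible only by reading `bus.history` after the fact, which is the debugging cost the paragraph above describes.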
Go deeper: which one to pick, and the hidden costs
Choreography shines for short sagas (2–4 steps) with stable, simple flows: it adds no central component and keeps services maximally autonomous. Its failure mode is emergent complexity — cyclic event dependencies, no single place to see 'where is this order?', and compensation logic smeared across every service. There is also a risk of services becoming implicitly coupled through the events they expect from each other.
Orchestration shines as flows grow or branch. The orchestrator is the single source of truth for saga state, which makes timeouts, retries, compensation ordering, and 'why did order 123 abort?' tractable. Its risks: the orchestrator can accrete business logic that belongs in the services (turning into a god component), and it must itself be made durable — if it crashes mid-saga it has to resume exactly where it left off.
That durability requirement is why workflow engines exist. Temporal/Cadence persist every step and every decision as an event-sourced history, so a crashed orchestrator replays the history and continues deterministically. AWS Step Functions encodes the saga as a state machine (Amazon States Language) with built-in retry/catch and a managed durable executor. Either way the orchestrator's own state must survive crashes, or a mid-saga failure becomes an orphaned, half-applied transaction.
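The idea of surviving a crash by persisting progress can be illustrated with a much-simplified sketch. This is not how Temporal or Step Functions are implemented; it only shows the principle of consulting a log before running a step. `resume_saga`, `SagaLog`, and the simulated crash are hypothetical, and a plain Python list stands in for durable storage.

```python
from typing import Callable, Dict, List, Tuple

# (saga_id, step_name) records; a plain list stands in for a durable table.
SagaLog = List[Tuple[str, str]]
StepList = List[Tuple[str, Callable[[], None]]]


def resume_saga(saga_id: str, steps: StepList, log: SagaLog) -> None:
    """Run every step the log does not yet record as done for this saga."""
    done = {name for sid, name in log if sid == saga_id}
    for name, action in steps:
        if name in done:
            continue                  # committed before the crash; skip it
        action()
        # A crash between action() and this append replays the step on
        # restart, which is why steps still need to be idempotent.
        log.append((saga_id, name))


executed: List[str] = []
crash: Dict[str, bool] = {"armed": True}


def reserve_inventory() -> None:
    if crash["armed"]:
        crash["armed"] = False
        raise RuntimeError("simulated orchestrator crash")
    executed.append("T2")


steps: StepList = [
    ("create_order", lambda: executed.append("T1")),
    ("reserve_inventory", reserve_inventory),
    ("charge_payment", lambda: executed.append("T3")),
]
log: SagaLog = []

try:
    resume_saga("saga-1", steps, log)   # first run dies after T1
except RuntimeError:
    pass
resume_saga("saga-1", steps, log)       # restart: skips T1, runs T2 and T3
```

After the restart, T₁ has run exactly once even though the saga was started twice, because the second run found it in the log.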
The cost of no isolation: locks, idempotency, retries
The defining property of a saga is that it has no isolation. Because each Tᵢ commits immediately, the intermediate state — order created but not paid, inventory reserved but order not confirmed — is visible to every other client and saga the moment it lands. Two-phase locking hides intermediate state behind locks; a saga deliberately does not. Almost all of a saga's implementation complexity comes from managing this missing isolation.
- Semantic locks — instead of a database lock, mark the affected records with an application-level state flag (order = PENDING, account = AUTH_HELD). Other operations check the flag and refuse, queue, or treat the resource as tentatively committed. The saga clears the flag when it completes or compensates. This is a countermeasure for the lack of isolation, implemented in your data model rather than the database.
- Idempotency — every step and every compensation may be delivered or retried more than once (messaging is at-least-once; crashes cause re-sends). Each handler must produce the same result whether it runs once or five times — usually via a unique saga/request id stored and de-duplicated on the receiving side.
- Commutative / retriable design — order compensations so they commute where possible, and make steps after the pivot retriable so the saga can always drive forward. Operations that can't be safely retried or reordered (sending an email, calling an external irreversible API) need special handling or must be the pivot.
- Durable saga log — the saga's progress (which steps committed, which compensated) must itself be persisted, so a crash of the orchestrator or coordinator resumes instead of leaving a transaction half-applied. This is the saga equivalent of a write-ahead log.
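As one example from the list above, idempotency by de-duplication could be sketched as follows. `InventoryService`, `release`, and the message id format are hypothetical, and the in-memory set stands in for a processed-message table that a real service would update in the same local transaction as the stock change.

```python
from typing import Dict, Set


class InventoryService:
    """Toy in-memory service; a real one would keep both fields in its database."""

    def __init__(self, stock: Dict[str, int]) -> None:
        self.stock = dict(stock)
        self.processed: Set[str] = set()   # ids of messages already applied

    def release(self, message_id: str, sku: str, qty: int) -> bool:
        """Apply a release once per message_id; return False for duplicates."""
        if message_id in self.processed:
            return False                   # duplicate delivery: treat as a no-op
        # In a real database these two writes would share one local transaction.
        self.stock[sku] += qty
        self.processed.add(message_id)
        return True


inventory = InventoryService({"widget": 4})
first = inventory.release("saga-42:C2", "widget", 1)
second = inventory.release("saga-42:C2", "widget", 1)   # redelivered duplicate
```

The second call changes nothing, so the handler produces the same end state whether the message arrives once or several times.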
Go deeper: countermeasures for the anomalies isolation would have prevented
Chris Richardson's treatment of sagas, building on Garcia-Molina & Salem's original paper and on later multidatabase-transaction research, catalogs the isolation anomalies and their countermeasures. The three core anomalies: lost updates (one saga overwrites a change made by another saga without reading it first), dirty reads (another transaction reads a value the saga will later compensate away), and fuzzy/non-repeatable reads (a saga reads twice and sees different values because another saga's steps interleaved).
- Semantic lock: the PENDING-flag approach above; the workhorse countermeasure.
- Commutative updates: design updates so order doesn't matter (e.g. increment/decrement a balance rather than set it), eliminating lost updates.
- Pessimistic view: reorder the saga so the step that would expose a risky intermediate state runs as late as possible, shrinking the dirty-read window.
- Reread value: before updating, re-read and verify the record hasn't changed since you read it (optimistic check), aborting if it has.
- Version file: record the operations performed on a record as they arrive, so they can be reordered and out-of-order delivery still yields a correct result.
- By value: pick the concurrency mechanism per request according to business risk, e.g. a saga for low-risk requests and a stricter distributed transaction for high-risk ones.
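Two of these countermeasures are small enough to show directly. The sketch below is illustrative only: `Account`, `deposit`, `set_credit_limit`, and `StaleReadError` are made-up names, and a version counter is just one common way to implement the reread check.

```python
from dataclasses import dataclass


class StaleReadError(Exception):
    """Raised when the record changed after the caller read it."""


@dataclass
class Account:
    balance: int = 0
    credit_limit: int = 0
    limit_version: int = 0   # bumped on every credit-limit write


def deposit(account: Account, amount: int) -> None:
    # Commutative update: apply a delta instead of setting an absolute
    # value, so interleaved sagas cannot overwrite each other's change.
    account.balance += amount


def set_credit_limit(account: Account, new_limit: int, version_read: int) -> None:
    # Reread value: verify nothing changed since the saga read the record.
    if account.limit_version != version_read:
        raise StaleReadError("credit limit changed since it was read")
    account.credit_limit = new_limit
    account.limit_version += 1


# Two sagas read the same version; only the first write is accepted.
acct = Account()
seen_by_saga_a = acct.limit_version
seen_by_saga_b = acct.limit_version
set_credit_limit(acct, 500, seen_by_saga_a)
try:
    set_credit_limit(acct, 900, seen_by_saga_b)
    stale_rejected = False
except StaleReadError:
    stale_rejected = True

# Deltas applied in either order give the same balance.
x, y = Account(), Account()
deposit(x, 30)
deposit(x, -10)
deposit(y, -10)
deposit(y, 30)
```

The rejected saga would normally abort and compensate, or re-read and try again, depending on the business rule.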
When to reach for a saga (and when not to)
A saga is the right call when a business transaction genuinely spans multiple services with private databases, you need high availability, and you can tolerate eventual consistency with brief, visible intermediate states. It's the wrong call when a single service (and a single local ACID transaction) could own the whole operation, or when the business truly cannot tolerate any window where the intermediate state is observable.
- Good fit: long-lived, multi-service business workflows (order fulfillment, travel booking, account onboarding) where blocking 2PC would kill availability and the steps have natural compensations (refund, release, cancel).
- Weaker fit: operations that fit inside one service/database — just use a local ACID transaction; a saga adds enormous accidental complexity for nothing.
- Watch out: workflows with steps that have no meaningful compensation (you can't un-send a physical package or un-email a customer). These force you to design around a pivot and accept forward-only completion past it.
Failure modes and how they bite
Sagas trade one big failure mode (a blocking coordinator) for several smaller, subtler ones. Knowing them is the difference between a saga that's robust and one that silently corrupts business state.
- Compensation itself fails — there is no compensation-for-a-compensation. A Cᵢ that fails must be retried until it succeeds (with backoff); if it truly cannot, the saga escalates to an alert for manual intervention. This is why compensations should be simple, idempotent, and as close to guaranteed-to-succeed as you can make them.
- Dirty reads / lost updates — without isolation, another client or saga can read or overwrite the intermediate state. Mitigated with semantic locks, commutative updates, and reread checks, never with a database lock.
- Non-idempotent retries — at-least-once delivery means a step can run twice. A non-idempotent step (charge, ship, increment) then double-applies. Mitigated with idempotency keys and a consumed-message table.
- Pivot / irreversible steps — once you've crossed the pivot (or done something physically irreversible), backward recovery is impossible; you must go forward. Misclassifying a step as compensatable when it isn't leads to attempts to 'undo' things that can't be undone.
- Lost orchestrator / coordinator state — if the saga's own progress log isn't durable, a crash leaves a transaction half-applied with nothing tracking it. The saga log must be persisted and recovered, which is exactly what workflow engines provide.
Saga vs 2PC vs TCC vs eventual consistency
| | Saga | 2PC (XA) | TCC | Plain eventual consistency |
|---|---|---|---|---|
| Atomicity | Business-level, via compensation | True atomic commit | Business-level, via Confirm/Cancel | None guaranteed |
| Isolation | None (needs semantic locks) | Full (locks held to commit) | Partial (reservation hides intent) | None |
| Locking | No distributed locks | Distributed locks until commit | Short reservation, no long lock | None |
| Coupling | Loose | Tight (all-or-nothing coordinator) | Medium (3 ops per service) | Loose |
| Blocking on coordinator crash | No | Yes — participants stuck | No | No |
| Best for | Long multi-service business flows | Few resources, short txns, strong consistency | Reservable resources (seats, stock, funds) | Independent updates, no all-or-nothing need |
Go deeper: how TCC differs from a saga
TCC (Try-Confirm/Cancel) is a close cousin. Each service exposes three operations: Try reserves resources tentatively (hold the seat, authorize the funds) without making the effect final; Confirm makes all reservations final once every Try succeeds; Cancel releases the reservations if any Try fails. It's a two-phase shape like 2PC, but without database-level distributed locks — the 'lock' is an application-level reservation.
The key contrast: in a saga, every Tᵢ commits its real effect immediately and a compensation undoes a fully-applied change — so intermediate state is fully visible (the charge really happened, then gets refunded). In TCC, the Try only reserves; the real effect isn't applied until Confirm, so there's no 'real charge then refund' — just 'authorization hold, then capture or release.' TCC therefore offers better isolation (the held reservation hides intent) at the price of every service implementing three coordinated operations and holding reservations until Confirm.
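A minimal sketch of the Try/Confirm/Cancel shape, assuming in-memory participants and a trivial coordinator. `Reservable`, `book`, and the transaction ids are hypothetical names, and a real implementation would also need timeouts that expire abandoned reservations.

```python
from typing import Dict, List, Tuple


class Reservable:
    """Toy TCC participant: Try reserves, Confirm finalizes, Cancel releases."""

    def __init__(self, capacity: int) -> None:
        self.available = capacity
        self.held: Dict[str, int] = {}   # tx_id -> units reserved but not final
        self.sold = 0

    def try_reserve(self, tx_id: str, n: int) -> bool:
        if n > self.available:
            return False
        self.available -= n
        self.held[tx_id] = n
        return True

    def confirm(self, tx_id: str) -> None:
        # pop with a default makes a repeated Confirm a harmless no-op
        self.sold += self.held.pop(tx_id, 0)

    def cancel(self, tx_id: str) -> None:
        self.available += self.held.pop(tx_id, 0)


def book(tx_id: str, requests: List[Tuple[Reservable, int]]) -> bool:
    """Try every participant; Confirm all if every Try succeeds, else Cancel."""
    tried: List[Reservable] = []
    for participant, n in requests:
        if participant.try_reserve(tx_id, n):
            tried.append(participant)
        else:
            for p in tried:
                p.cancel(tx_id)
            return False
    for p in tried:
        p.confirm(tx_id)
    return True


flight = Reservable(capacity=2)
hotel_full = Reservable(capacity=0)     # sold out, so its Try fails
failed = book("tx-1", [(flight, 1), (hotel_full, 1)])

hotel = Reservable(capacity=1)
succeeded = book("tx-2", [(flight, 1), (hotel, 1)])
```

In the failed booking the flight seat was only ever held, never sold, so Cancel simply returns it. Nothing had to be refunded.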
Where sagas run in the wild
The canonical example is e-commerce order fulfillment: create order → reserve inventory → take payment → arrange shipping, each in its own service, with compensations to refund the payment, release the inventory, and cancel the order if a later step fails. The same shape appears in travel booking (flight + hotel + car), account/loan onboarding, and any multi-step business process across services.
- Temporal / Cadence — durable workflow engines (Cadence originated at Uber; Temporal is its successor). You write the saga as ordinary code; the engine persists every step as event-sourced history and resumes deterministically after a crash, with first-class support for compensations.
- AWS Step Functions — managed orchestrator; the saga is a state machine (Amazon States Language) with built-in Retry/Catch, so a failed step triggers the compensation branch. AWS documents the saga pattern explicitly as a sample.
- Camunda / Zeebe (BPMN) — model the saga as a business process with compensation boundary events; the engine drives forward and runs compensation handlers in reverse on failure.
- Axon Framework — saga support in the JVM/CQRS+event-sourcing world: a saga component subscribes to events and dispatches commands/compensations.
- Netflix Conductor — a microservice orchestration engine for stitching service calls into workflows with compensation/retry semantics.
Common misconceptions & gotchas
Saga vs 2PC — when would I still use 2PC?
2PC gives true atomic commit and full isolation, but it holds distributed locks until commit and blocks if the coordinator crashes mid-decision. Use it for a small number of tightly-coupled resources in a short transaction where strong consistency is worth the availability hit (e.g. a single bank's internal databases). For long-lived flows across independent microservices, the blocking and lock-holding make 2PC impractical — that's exactly the gap sagas fill.
Orchestration vs choreography — which should I choose?
Choreography (services react to events) for short, stable flows where you value maximal decoupling and have no central component. Orchestration (a coordinator commands each step) once the flow has more than a few steps, branches, or needs timeouts and end-to-end visibility — the orchestrator becomes the single source of truth for saga state. Most non-trivial sagas end up orchestrated, often via a workflow engine.
What if a compensation itself fails?
There's no compensation for a compensation, so the answer is retry until it succeeds — compensations are designed to be idempotent and near-guaranteed. If it genuinely cannot succeed (a downstream is permanently broken), the saga escalates to manual intervention / an alert. This is why compensations should be kept simple and resilient.
How do you handle the lack of isolation?
Application-level countermeasures, not database locks: semantic locks (a PENDING/AUTH_HELD flag other operations respect), commutative updates (increment/decrement instead of set), reread-before-write checks, and ordering risky steps as late as possible. The intermediate state is genuinely visible — you manage it explicitly rather than hiding it.
Is a saga the same as eventual consistency?
A saga produces eventual consistency, but adds a guarantee plain eventual consistency lacks: business-level atomicity. Either all steps commit, or all completed steps are compensated — there's no permanent partial outcome. Plain eventual consistency makes no such all-or-nothing promise.
Quiz: An order saga reserves inventory (T₂), then payment (T₃) fails. The compensation C₂ 'release inventory' is sent but its ack is lost, so it's redelivered and runs twice. What property saves you from releasing the same stock twice?
- Two-phase commit on the inventory service
- Idempotency of the compensation (dedupe by saga id)
- A distributed lock held across the saga
- Strict serializable isolation between sagas
Answer: Idempotency of the compensation (dedupe by saga id). Messaging is at-least-once, so any step or compensation can be delivered more than once. The defense is idempotency: the inventory service records the saga/request id it has already processed and treats the duplicate C₂ as a no-op, so the stock is released exactly once in effect. Sagas deliberately avoid distributed locks and global isolation, so the other options aren't how this is solved.
In an interview
Lead with the contrast: across microservices you can't do one ACID transaction, and 2PC blocks while holding distributed locks; a saga instead runs a sequence of local transactions with compensating actions, undoing completed steps in reverse on failure (backward recovery). You trade isolation for availability and loose coupling — ACD, not ACID. Mention orchestration vs. choreography as the two coordination styles.
Be ready for the hard parts: compensations are semantic (a new transaction, not a byte-level rollback), must be idempotent, and must near-always succeed because there's no compensation for a compensation; intermediate states are visible, so use semantic locks or status flags ('pending') to stop others acting on half-done state; tie steps to messaging with the transactional outbox + at-least-once delivery so a crash mid-saga can resume. Know the pivot step and forward vs backward recovery. A common follow-up is the comparison with 2PC and TCC.
Then open the simulator: START the saga and STEP through the local transactions committing one by one; arm FAIL to make a step fail and watch the orchestrator compensate the completed steps in reverse, aborting cleanly with no lock ever held.
References & further reading
- Garcia-Molina & Salem — Sagas (SIGMOD 1987) — the original paper that defined sagas and compensating transactions
- Chris Richardson — Saga pattern (microservices.io) — the canonical modern treatment: orchestration/choreography, isolation countermeasures
- AWS Step Functions — Saga pattern sample — saga as a state machine with built-in compensation branches
- Temporal — The Saga pattern made easy — durable orchestration and compensations in a workflow engine
- Microsoft — Saga design pattern (Azure Architecture Center) — clear diagrams of orchestration vs choreography and failure handling