Two-Phase Commit · System Internals

The big picture#

TL;DRthe 30-second version

2PC commits ONE transaction across multiple nodes atomically: all commit or all abort. A single coordinator drives two round trips to a set of participants.
Phase 1 (prepare/vote): each participant does the work, force-writes a 'prepared' record to its WAL, and votes yes/no. A yes is a binding promise — it must hold its locks until told the outcome.
Phase 2 (commit/abort): if every vote was yes, the coordinator force-logs COMMIT (the commit point) and broadcasts it; any no ⇒ ABORT. Participants apply, release locks, and ack.
Cost: 2 round trips, O(n) messages per phase, latency bounded by the slowest participant, plus forced log flushes on both sides.
Fatal flaw: 2PC is a BLOCKING protocol. A coordinator crash after votes are in but before the decision ships leaves prepared participants in-doubt — stuck holding locks until it recovers. 3PC, Paxos Commit, and Sagas are the responses.

Everything below expands on these points. Read the core sections top to bottom for the full mental model; the collapsible "Go deeper" boxes hold the advanced internals (presumed-abort bookkeeping, 3PC, the FLP/CAP framing) you can skip on a first pass.

Coordinatordrives the txnParticipantseach shard / DB

Phase 1 — prepare / vote

PREPARE

each force-logs “prepared”, then holds its locks

vote YES (a binding promise)

all YES ⇒ coordinator force-logs the COMMIT POINT

Phase 2 — commit / abort

COMMIT

apply + release locks

ACK

Message flow and the in-doubt window

Start here: the problem it solves#

Suppose transferring money means debiting an account in database A and crediting one in database B. These are separate systems with separate transaction logs and separate locks. If A commits but B fails, money vanishes; if B commits but A fails, money is created. You need both to commit together or neither to — atomicity across machines, not just within one. The same shape appears whenever one logical transaction touches multiple shards, multiple microservice databases, or a database plus a message broker.

The naive approach — just send commit to A, then commit to B — is broken. After A commits there is no going back, but B can still fail, crash, or reject the write. There is no moment at which you can be sure both will succeed before either of them has made its change permanent. You are committing on hope.

The core idea: separate the promise from the act2PC splits 'commit' into two steps. First, every participant promises it CAN commit (and makes that promise durable) without yet committing. Only once everyone has promised does anyone actually commit. The promise is reversible up to the commit point; the act is not. Nobody crosses the point of no return until everyone has reached the edge of it.

The hard part is that each database can independently fail or be slow, and the network can drop or delay messages, at any moment. 2PC coordinates them so they reach a single, agreed outcome despite that — by never letting anyone commit until everyone has durably promised they can.

The mechanism: two phases#

There is one coordinator (often the node that began the transaction) and a set of participants (each an independent resource manager — a database, a queue). The coordinator drives the protocol; participants only ever respond.

Phase 1 — prepare/vote. The coordinator sends PREPARE to every participant. Each participant does all the work needed to be able to commit — acquires locks, validates constraints, writes its changes to its log — then force-writes a 'prepared' record durably to its WAL and replies YES, or replies NO if it cannot. A participant that has voted YES is now prepared: it has surrendered its autonomy. It may no longer commit or abort on its own; it has promised to do whatever the coordinator later says, and it must hold its locks until it hears that outcome.

Phase 2 — commit/abort. The coordinator collects the votes. If every vote was YES, it force-writes a COMMIT record to its own log — this is the commit point, the instant the outcome becomes final and irrevocable — and then sends COMMIT to everyone. If even one participant voted NO (or timed out), it force-logs ABORT and sends ABORT. Each participant applies the decision, releases its locks, and acknowledges. When all acks are in, the coordinator can forget the transaction.

Coordinator → all participants: PREPARE.
Each participant does the work, force-logs 'prepared', and replies YES (now holding locks, in-doubt) or NO.
Coordinator collects votes. It force-logs the decision: COMMIT iff every vote was YES, otherwise ABORT. (Logging COMMIT is the commit point.)
Coordinator → all participants: COMMIT or ABORT.
Each participant applies the decision, releases its locks, force-logs the outcome, and acks.
Coordinator collects acks, then writes an end record and forgets the transaction.

PredictAt exactly which message does the transaction's fate become irreversible — the moment after which it WILL commit no matter what crashes next?

Hint: It is not when a participant votes YES, and not when a participant receives COMMIT.

When the coordinator force-writes the COMMIT record to its own log — the commit point — before it sends a single COMMIT message. After that record is durable, the outcome is decided: even if the coordinator crashes immediately, recovery will read that record and drive every participant to commit. A participant voting YES only makes the outcome reachable; the coordinator's durable decision makes it certain.

Why everyone forces a log first#

2PC's correctness rests entirely on forced (fsync'd) write-ahead log records written before messages are sent. The rule on both sides is the same: make your intention durable, then act. This is what lets a crashed node, on restart, rejoin a transaction that was already in flight.

A participant force-logs 'prepared' BEFORE voting YES. If it crashes and restarts, it reads that record, knows it is bound, re-acquires the locks, and asks the coordinator for the outcome instead of guessing.
The coordinator force-logs the COMMIT/ABORT decision BEFORE sending it. If it crashes mid-broadcast, recovery replays the log and re-sends the same decision to everyone — so no participant can ever see a different answer.
Each participant force-logs the final outcome before acking, so a second crash doesn't undo an applied commit.

Go deeperGo deeper: presumed-abort and presumed-commit

Naively, the coordinator would have to keep a log record for every transaction until all acks arrive, and remember aborts forever in case a confused participant asks. Presumed-abort (the common default, and what XA uses) optimizes this: if a recovering coordinator finds NO decision record for a transaction, it presumes ABORT. This means the coordinator need not force-log anything for a transaction it decides to abort, and need not remember aborts at all — a participant that asks about an unknown transaction is simply told 'abort'. It saves log forces and messages on the abort path.

Presumed-commit is the mirror optimization for commit-heavy workloads, but it is subtler: the coordinator must force-log the LIST of participants before voting begins, so that after a crash an unknown transaction can be safely presumed committed without losing track of who still needs telling. Presumed-abort is more popular because its bookkeeping is cheaper and commit already requires a forced record anyway.

Cost: messages, round trips, latency, durability#

For n participants, a successful commit costs two round trips and a predictable amount of I/O. None of it is free, and all of it sits on the critical path of the transaction.

Resource	Cost for n participants	Why
Round trips	2 (prepare→vote, then commit→ack)	One per phase; each is a full request/response to every participant
Messages	~4n (2n requests + 2n replies)	PREPARE+vote in phase 1, COMMIT/ABORT+ack in phase 2 — O(n) per phase
Latency	≈ 2 × (slowest participant round trip)	Each phase waits for the LAST reply; one slow node stalls the whole transaction
Forced log writes	1 (coordinator) + 2 per participant	fsync on the decision, and on each participant's 'prepared' and 'outcome' records
Lock hold time	From local work until phase-2 message arrives	Locks span both round trips — far longer than a local transaction

The hidden cost is lock hold time, not message countMessages are cheap; the expensive part is that every participant holds its locks for the entire duration of both phases plus all the forced log flushes. That window is long compared to a single-node commit, so 2PC sharply reduces concurrency on contended rows and makes the slowest participant everyone's bottleneck. This is why high-throughput systems avoid distributed transactions on hot data.

Failure modes: the in-doubt window and recovery#

The headline failure is a coordinator crash during the in-doubt window. Suppose every participant has voted YES and is prepared, and then the coordinator crashes before sending the decision. A prepared participant cannot commit on its own (maybe someone else voted NO and it should abort) and cannot abort on its own (maybe everyone voted YES and the coordinator already logged COMMIT). It is genuinely stuck — holding locks, blocking every other transaction that needs those rows — until the coordinator recovers and tells it the answer. This is what 'blocking' means: the protocol cannot make progress, and there is no safe timeout-based escape.

Participant crashes before voting: no harm. It hasn't promised anything; the coordinator times out the missing vote and aborts. On restart the participant finds no 'prepared' record and rolls back its partial work.
Participant crashes after voting YES: on restart it reads its 'prepared' record, re-acquires locks, and asks the coordinator for the outcome (the recovery/'redo' path). It must NOT decide on its own.
Coordinator crashes BEFORE the commit point: presumed-abort kicks in — recovery finds no decision, aborts, and participants that ask are told to abort.
Coordinator crashes AFTER the commit point: recovery reads the durable COMMIT record and re-broadcasts COMMIT until every participant acks. The outcome was already sealed.
Coordinator crashes DURING the in-doubt window with no decision yet logged AND stays down: the prepared participants block indefinitely. This is the unavoidable case.

Go deeperGo deeper: why blocking is fundamental (FLP & CAP)

When a prepared participant loses contact with the coordinator, it cannot tell apart 'coordinator crashed' from 'coordinator is slow' from 'network is partitioned' — in an asynchronous network these are indistinguishable. The FLP impossibility result says no deterministic protocol can guarantee both safety and liveness with even one crash failure in a fully asynchronous system. 2PC chooses safety: it would rather block forever than risk two participants reaching different outcomes. So when the coordinator is unreachable, progress stops by design.

In CAP terms, 2PC is CP: under a partition that isolates the coordinator, it sacrifices availability (participants block) to preserve consistency (atomicity). You cannot timeout your way out safely — a participant that 'gives up and aborts' might do so just as the coordinator commits elsewhere, splitting the transaction. The only real fix is to remove the single point of failure: replicate the coordinator's decision with a consensus group (Paxos/Raft) so the decision survives the loss of any one node.

Trade-offs: safety bought with liveness#

2PC's central bargain is atomicity at the cost of availability. It guarantees that all participants agree on commit-or-abort (safety) but cannot guarantee that they always make progress (liveness). When the coordinator is healthy it is simple and correct; when the coordinator fails at the wrong instant, the whole transaction — and everyone's locks — freezes.

Pro: true atomic commit across heterogeneous systems, with a simple, well-understood protocol and broad standard support (XA/JTA).
Pro: strong consistency — no partial commits, ever, even across crashes, thanks to forced logging.
Con: blocking — a coordinator failure in the in-doubt window stalls prepared participants holding locks.
Con: the coordinator is a single point of failure and a bottleneck; latency tracks the slowest participant.
Con: long lock hold times kill concurrency on contended data; it scales poorly as participant count grows.

When to reach for it — and when not toUse 2PC when you genuinely need atomicity across a small, stable set of resources and can tolerate the blocking risk (e.g. a few databases inside one trusted data center, or Spanner committing across its own Paxos groups). Avoid it across many services, over the open internet, or on hot contended data — there, a Saga with compensating actions trades atomicity for availability and is usually the better fit.

Variants: 3PC, Paxos Commit, and optimizations#

Every variant is an attempt to soften 2PC's blocking weakness or its message cost.

Presumed-abort / presumed-commit: not new phases, just bookkeeping optimizations (above) that cut forced log writes and messages on one of the two outcome paths. Presumed-abort is the de facto standard.
Three-phase commit (3PC): inserts a pre-commit phase between vote and commit so that a recovering participant can use a timeout to decide safely, making the protocol non-blocking — BUT only under the assumption of a synchronous network with no partitions. Under a real network partition, 3PC can split-brain and produce an inconsistent outcome. It also adds a third round trip. It is famous in textbooks and almost never used in production.
Paxos/Raft Commit (Gray & Lamport): replace the single coordinator with a consensus group. Each participant's vote is logged via consensus, and the decision is a consensus value, so the loss of any single node — including the coordinator — no longer blocks the transaction. This is the modern, correct answer and is essentially what Spanner does.

Go deeperGo deeper: why 3PC fails under partitions

3PC's non-blocking guarantee relies on a bounded message delay so that a timeout reliably means 'the other side is dead', not 'the other side is slow'. A network partition breaks that assumption. Imagine the participants split into two groups during the pre-commit phase: one group has received pre-commit and times out into COMMIT, while the other group never did and times out into ABORT. Now half the transaction commits and half aborts — atomicity is violated. 2PC blocks rather than risk this; 3PC trades that safety for liveness and loses the bet under partitions. Because real datacenter networks DO partition, practitioners prefer consensus-replicated 2PC over 3PC.

2PC vs 3PC vs Paxos Commit vs Saga#

Protocol	Atomicity	Blocking?	Round trips	Partition-safe?	Used in practice
2PC	Yes (all-or-nothing)	Blocks on coordinator crash	2	Safe but unavailable (CP)	XA/JTA, MSDTC, Postgres prepared txns
3PC	Yes (no partitions)	Non-blocking (synchronous net only)	3	No — can split-brain	Almost never (textbook)
Paxos/Raft Commit	Yes	Non-blocking (survives node loss)	2 + consensus	Safe and available	Spanner, CockroachDB, modern systems
Saga	No — eventual via compensation	Non-blocking	n local txns + compensations	Available (AP)	Long-lived microservice workflows

The key distinction: 2PC and 3PC and Paxos Commit all deliver true atomicity (one transaction, one outcome). A Saga does NOT — it runs each step as an independent local commit and undoes earlier steps with compensating transactions if a later one fails, so other observers can briefly see partial state. You pick a Saga precisely when you cannot afford to block and can tolerate that eventual, compensated atomicity.

In the wild#

2PC is everywhere distributed transactions exist — usually under a standard interface so applications never see the protocol directly.

X/Open XA + JTA: the dominant standard. XA defines the interface between a transaction manager and resource managers (databases, JMS brokers); Java's JTA exposes it to applications. xa_prepare / xa_commit / xa_rollback are literally the 2PC phases.
PostgreSQL & MySQL prepared transactions: PostgreSQL's PREPARE TRANSACTION durably writes the transaction in the prepared (voted-YES) state so a later COMMIT PREPARED or ROLLBACK PREPARED finishes it — exactly a participant's role in 2PC. (Postgres warns that orphaned prepared transactions hold locks — the blocking problem made operational.)
Microsoft MSDTC: the Distributed Transaction Coordinator on Windows, coordinating 2PC across SQL Server, MSMQ, and other resource managers.
Google Spanner: runs 2PC across its shards, but each participant (and the coordinator) is itself a Paxos group, so the coordinator's decision is consensus-replicated and a single machine failure no longer blocks. This is the 'Paxos Commit' idea in production.
Apache Kafka transactions: the transaction coordinator uses a two-phase commit to atomically write to multiple partitions and the consumer-offsets topic, giving exactly-once processing across a read-process-write loop.

Common misconceptions & gotchas#

What happens if the coordinator dies right after everyone votes YES?

The prepared participants are stuck in-doubt: they have promised to commit and are holding locks, but they don't know the decision and may not invent one. They block until the coordinator recovers. On recovery the coordinator consults its log — if it had force-logged COMMIT it re-sends COMMIT; if not, presumed-abort makes it abort. The participants ask, get the answer, finish, and release locks. The window of pain is exactly 'coordinator down AND decision not yet learned'.

Is 2PC the same as consensus (Paxos/Raft)?

No. Consensus gets a group to agree on a value while tolerating a minority of failures — it stays available as long as a majority is up. 2PC requires UNANIMITY (every participant must vote YES to commit) and tolerates no coordinator failure without blocking. They solve different problems, which is why modern systems use consensus to REPLICATE the 2PC coordinator's decision rather than replacing 2PC outright.

2PC vs Saga — when do I use which?

Use 2PC when you need genuine atomicity over a small, stable set of resources you control and can accept the blocking risk (databases in one datacenter). Use a Saga when you have a long-lived, multi-service workflow where blocking is unacceptable: each step commits locally and failures are undone by compensating transactions, accepting that observers may briefly see partial results.

Why is 2PC called a 'blocking' protocol?

Because a prepared participant that loses the coordinator cannot safely make progress — it can neither commit nor abort on its own, since either choice might contradict the coordinator's actual decision. So it blocks (waits, holding locks) until contact is restored. By the FLP result this is unavoidable for any safe atomic-commit protocol in an asynchronous network with crash failures; 2PC chooses safety over liveness.

QuizAll participants have voted YES. The coordinator force-logs COMMIT, sends COMMIT to participant A only, then crashes before reaching B. What happens to B?

B times out and aborts, so the transaction is now half-committed and inconsistent.
B commits, because A committed and participants gossip the decision among themselves.
B blocks in-doubt until the coordinator recovers, reads its COMMIT record, and re-sends COMMIT to B.
B aborts immediately, and the coordinator rolls A back on recovery.

Show answer

B blocks in-doubt until the coordinator recovers, reads its COMMIT record, and re-sends COMMIT to B. — The commit point was the durable COMMIT record, so the outcome is already COMMIT and cannot change. B is prepared and in-doubt: it must NOT decide on its own (aborting would split the transaction with A, who already committed). It blocks until the coordinator recovers, replays its log, finds COMMIT, and re-broadcasts it. Participants do not gossip the decision in standard 2PC — only the coordinator is authoritative.

In an interview#

Lead with the guarantee and the cost: 2PC gives atomic commit across multiple participants, but it is blocking — a coordinator failure between the phases leaves prepared participants stuck in-doubt holding locks. Name the durability detail: both sides force-log before acting, and the coordinator's COMMIT record is the commit point.

Be ready to walk the two phases, to identify the commit point precisely, and to explain why a prepared participant cannot decide unilaterally. State the cost (2 round trips, O(n) messages, latency = slowest participant, long lock holds). Then name the mitigations: replicate the coordinator's decision with a consensus protocol (Raft/Paxos Commit) so it doesn't single-point-fail; know that 3PC is non-blocking only on a synchronous network and breaks under partitions; and know when a Saga is the better tool.

A frequent follow-up is '2PC vs consensus': 2PC needs unanimity and blocks on coordinator failure; consensus needs only a majority and stays available. Another is '2PC vs Saga': atomic-but-blocking versus available-but-eventually-consistent-via-compensation. Drop a real system to show depth: Spanner runs 2PC across Paxos groups; PostgreSQL exposes the participant role via PREPARE TRANSACTION; Kafka uses 2PC for exactly-once.

Then open the simulator: PREPARE and COMPLETE for a clean commit; force a participant to vote NO to watch the whole transaction abort; or PREPARE, CRASH the coordinator, and see the participants block in-doubt — then RECOVER and watch it replay its log and drive everyone to the same outcome.

References & further reading#

References

Gray & Lamport — Consensus on Transaction Commit (2006) — frames atomic commit as consensus; introduces Paxos Commit to remove the coordinator as a single point of failure
Bernstein, Hadzilacos & Goodman — Concurrency Control and Recovery in Database Systems — the classic textbook; Ch. 7 covers 2PC, presumed-abort/commit, and recovery (free PDF)
The Open Group — Distributed Transaction Processing: The XA Specification — the standard interface (xa_prepare/xa_commit/xa_rollback) behind XA and JTA
PostgreSQL — PREPARE TRANSACTION — a participant's prepared state, exposed as SQL; note the warning about held locks
Corbett et al. — Spanner: Google's Globally-Distributed Database (OSDI 2012) — 2PC across Paxos groups so the coordinator decision is consensus-replicated
Confluent — Transactions in Apache Kafka — the transaction coordinator's two-phase commit for exactly-once
Martin Kleppmann — Designing Data-Intensive Applications, Ch. 9 — the clearest book-length treatment of 2PC, its blocking flaw, and the alternatives

Ready to try it?

The simulator is a real, deterministic implementation — pick a scenario and step through it, scrubbing the timeline forward and backward through every change.

Try these in the simulator

Coordinator crash & recovery →Everyone votes to prepare, then the coordinator crashes mid-commit — participants block, holding locks, until it recovers from its log and drives the commit home.A clean commit →No failures: every participant votes yes, the coordinator says commit, and all of them commit together — the all-or-nothing guarantee, the easy way.Blocked on a lost coordinator →Participants vote yes, then the coordinator vanishes and never comes back — they're stuck holding locks, unable to commit or abort on their own. This is why 2PC blocks.

Open the Two-Phase Commit simulator →

Up nextSaga Pattern

← Back to the learning path