Raft Consensus · System Internals

The big picture#

TL;DRthe 30-second version

Consensus = getting N machines to agree on one ordered sequence of operations (a replicated log) despite crashes and dropped messages. Apply the same log in the same order everywhere and every node is an identical replica.
Raft elects exactly one leader per term. All client writes go through the leader, which appends to its log and replicates via AppendEntries. One leader means one order — that is the whole trick.
An entry is committed once a majority (quorum) has it. A cluster of 2f+1 nodes tolerates f failures; majorities of the same cluster always overlap, so no two leaders can commit conflicting entries.
Two safety rules carry the weight: the election restriction (you only vote for a candidate whose log is at least as up-to-date as yours) and the current-term commit rule (a leader only counts entries from its own term toward commit).
It is the consensus layer behind etcd, Consul, CockroachDB, TiKV, and Kafka's KRaft mode. The classic alternative, Multi-Paxos, is equivalent in power but harder to reason about and implement.

Everything below expands on these points. Read the core sections top to bottom for the full mental model; the collapsible "Go deeper" boxes hold the advanced internals (membership changes, snapshots, lease reads, the exact safety argument) you can skip on a first pass and return to later.

followerthe default state

election timeout →

candidateafter an election timeout

majority votes →

leaderwon a majority of votes

The three roles — and how a node moves between them

index → entry, each tagged with the term it was created in · highlighted = committed

logx=11 · t1y=22 · t1x=33 · t2z=94 · t3y=75 · t3

The replicated log on the leader

Start here: why does a cluster need a leader?#

A single database server is a single point of failure — if it dies, every client request fails until it's replaced. The fix is to replicate the same data onto several machines. But replication immediately creates a new problem: if two clients write to two different replicas at the same time, which write wins, and in what order do all replicas end up agreeing to apply them?

The classic framing is the replicated state machine. If every node starts in the same state and applies the exact same sequence of commands in the exact same order, every node ends in the same state — deterministically. Replication then reduces to one thing: get all nodes to agree on a single ordered log of commands. That agreement, in the presence of crashes and an unreliable network, is consensus.

Raft's answer is to elect exactly one leader at a time. Every write goes through the leader, which appends it to its own log and then replicates that log, in order, to the other nodes (followers). As long as there's a single leader, there's a single order — the hard problem of 'what happened in what order' reduces to 'whatever order the leader appended things in.'

What Paxos got right, and why Raft existsPaxos (Lamport, 1998) proved consensus is achievable and is correct under the crash-failure, asynchronous-network model. But single-decree Paxos solves agreement on one value; turning it into Multi-Paxos for a continuous log, and doing so correctly, is notoriously subtle — Ongaro and Ousterhout observed that even experts struggle with it. Raft was engineered for understandability: it decomposes the problem into leader election, log replication, and safety, and adds strong structure (a single leader, an append-only log, terms as a logical clock) so there are fewer moving parts to get wrong.

Consensus, not just replicationCopying bytes to other machines is easy. The hard part is agreeing on a single answer to 'what is the current leader, and what is in the log' even when machines crash mid-operation or the network drops messages — without ever letting two nodes both believe they're the legitimate leader and accept conflicting writes. That guarantee is what 'consensus' means here.

Terms: Raft's logical clock#

Raft divides time into terms, numbered consecutively. Each term has at most one leader. A node always carries a currentTerm number, and that number only ever goes up. Any message — a vote request, a replication request — carries the sender's term, and a node that sees a higher term immediately adopts it and demotes itself to follower.

This is the mechanism that prevents stale leaders from doing damage: a leader that's been cut off by a network partition keeps believing it's in charge, but the moment it's reconnected and hears about a higher term, it learns it's obsolete and steps down — even though, locally, nothing told it to stop being leader.

Leader election: from follower to candidate to leader#

Every follower runs an election timer. Each time it hears from the current leader (a heartbeat), it resets the timer. If the timer ever expires — meaning no leader has been heard from in a while — the follower assumes there's no functioning leader and starts an election: it becomes a candidate, increments its term, votes for itself, and sends RequestVote to every other node.

A follower's election timer expires with no heartbeat. It becomes a candidate for term T+1 and votes for itself.
It sends RequestVote(term=T+1) to every peer.
Each peer grants the vote if: the candidate's term is at least as new as its own, it hasn't already voted for someone else this term, AND the candidate's log is at least as up-to-date as its own (compared by last entry's term, then index).
If the candidate collects votes from a majority (including its own), it becomes leader for that term and immediately sends a heartbeat to assert its leadership before anyone else's timer expires.
If no candidate gets a majority (a split vote), every candidate's timer eventually expires again and they retry at an even higher term — staggered timeouts make this converge quickly in practice.

The log-recency check is the safety netTerm alone isn't enough to decide who deserves to be leader — a node isolated by a partition can keep incrementing its term forever without doing any work. The log-comparison rule (only vote for a candidate whose log is at least as current as yours) means a node that missed recent commits can win an election on term number alone, but can't actually get votes, because the nodes that hold the committed data won't endorse it. The paper calls this the election restriction, and it is what guarantees the Leader Completeness property: a newly elected leader always already holds every committed entry.

PredictFive nodes; three of them have committed log entry #10. A sixth-term candidate has term 6 (highest term in the cluster) but its log ends at entry #7. Can it win the election?

Hint: A vote needs more than a high term — what else does each voter check?

No. Granting a vote requires the candidate's log to be at least as up-to-date as the voter's (compare last-entry term, then index). The three nodes holding entry #10 will all reject this candidate because its log is shorter/older, so it can never reach a majority of 3 — at most it gets the two stale nodes plus itself, which isn't enough. A high term lets you START an election; the election restriction decides who can WIN one. This is exactly why committed data can never be lost.

Log replication: AppendEntries and the consistency check#

Once elected, the leader accepts client commands by appending them to its own log first (uncommitted), then pushes them out via AppendEntries — the same RPC it uses as a heartbeat, just carrying entries instead of being empty. Each AppendEntries includes prevLogIndex/prevLogTerm: 'here is the entry I believe directly precedes what I'm sending you — confirm you have it.'

If the follower's log doesn't have a matching entry at that position (it's missing entries, or has different ones from a previous, now-abandoned leader), it rejects the request. The leader responds by decrementing nextIndex for that follower and retrying one position earlier next round, until it finds a point both logs agree on — then it overwrites the follower's divergent suffix with its own. Entries are only ever appended or overwritten in bulk from a known-good point; they're never reordered.

Why overwrite instead of mergeMerging two divergent logs is exactly the ambiguity Raft is designed to avoid — there's no principled way to interleave two different sequences of writes. Once the leader is established, its log is authoritative by definition; followers simply converge to match it. This is what 'a single leader gives you a single order' buys you.

Go deeperThe Log Matching Property — why prevLogIndex/prevLogTerm is enough

Raft guarantees two things about logs across the cluster: (1) if two logs contain an entry with the same index and term, they store the same command; and (2) if two logs contain an entry with the same index and term, then the logs are identical in all preceding entries. Together these are the Log Matching Property.

Property (1) holds because a leader creates at most one entry per index in a given term, and entries never change position once created. Property (2) is enforced inductively by the AppendEntries consistency check: every AppendEntries carries prevLogIndex and prevLogTerm, and a follower refuses the request unless it has a matching entry there. So an accepted AppendEntries proves the follower already agreed up to prevLogIndex — and by induction, all the way back to the start. That single check is why the leader can repair a divergent follower by walking nextIndex backward until the logs match, then overwriting the rest.

Optimization: naively decrementing nextIndex by one per round is O(divergence) round-trips. Real implementations have the follower return a conflicting term and the first index of that term, letting the leader skip an entire term's worth of entries per rejection — turning many round-trips into a handful.

The commit rule: majority replication, with one subtlety#

An entry is committed once it's been replicated to a majority of nodes — at that point it's durable even if the leader immediately crashes, because any future leader (also elected by majority) is guaranteed to overlap with at least one node that has it. The leader tracks matchIndex per follower (the highest index it has confirmed that follower holds) and advances its own commitIndex to the highest index a majority agrees on.

You can't commit a previous leader's entry by count aloneHere's the subtlety interviewers love to probe: suppose an old leader replicated an entry to a majority but crashed before committing it, and a new leader's log happens to contain that same entry (inherited via the consistency check) alongside its own newer entries. Raft forbids the new leader from declaring that old entry committed purely because a majority now has it — it only commits entries from its own current term directly. Once it commits a current-term entry, every entry before it (including the inherited old one) is committed too, transitively, via the log-matching property. Committing old-term entries by count alone can be undone by a future leader that never saw them; this rule closes that gap.

Failure modes: partitions, split votes, and split brain#

If the leader is cut off from a majority of the cluster, two things happen in parallel: the isolated leader keeps accepting client writes into its own log (since nothing has told it otherwise) but can never replicate them to a majority, so they can never commit — those writes are effectively stuck. Meanwhile, the majority side notices it's heard nothing from the leader, times out, and elects a new leader among themselves (since they still have enough nodes for a majority).

When the partition heals, the old leader receives a message from the new leader carrying a higher term, recognizes it's stale, and steps down to follower — then catches its log up to the new leader's via the ordinary AppendEntries consistency check. Its uncommitted, never-replicated writes are simply discarded by being overwritten.

Split vote — two candidates start elections at the same term and each grabs part of the cluster, so neither reaches a majority. Raft fixes this with randomized election timeouts (e.g. 150–300 ms): the node that times out first usually wins outright before others even start, so split votes are rare and self-correct on the next, re-randomized timeout.
Leader crash — followers stop hearing heartbeats, an election timer fires, and a new leader is elected, typically within an election timeout. Committed entries survive because any new leader already holds them (election restriction).
Network partition / split brain — the cluster splits into groups. Because a leader must reach a majority to commit, at most one group can make progress; the minority side cannot elect a leader or commit anything. This is the structural guarantee against two leaders committing conflicting data.
Stale leader still serving reads — a partitioned-off old leader may not yet know it was deposed and could answer a read with stale data. Preventing this needs an explicit read protocol (see Variants → read-only optimizations); replication alone does not make reads linearizable.

This is why Raft needs a majority, not unanimityRequiring a strict majority (not 'all nodes') is what lets the cluster keep making progress during a partition — the majority side doesn't need to wait for the minority side to come back. It's also what prevents split-brain: two disjoint groups can't both contain a majority of the same cluster, so at most one side can ever elect a leader and commit new entries at a time.

QuizA 5-node Raft cluster splits 3-2 by a network partition, with the old leader on the 2-node side. What happens?

Both sides elect a leader and later merge their logs
The 3-node side elects a new leader and keeps committing; the 2-node side (with the old leader) cannot commit anything
The whole cluster halts until the partition heals
The 2-node side keeps committing because it still has the original leader

Show answer

The 3-node side elects a new leader and keeps committing; the 2-node side (with the old leader) cannot commit anything — Commit requires a majority — here, 3 of 5. The 3-node side has a majority, so it elects a new leader (higher term) and continues committing. The 2-node side, even though it holds the old leader, can never reach 3 nodes, so its new writes stay uncommitted forever. On heal, the old leader sees the higher term, steps down, and rolls back its uncommitted suffix. Logs are never 'merged' — the minority converges to the majority's log.

Quorums, fault tolerance, and message cost#

Raft's safety rests on a counting argument: any two majorities of the same set must overlap in at least one member. So if an entry is on a majority, and a future leader was elected by a majority, those two majorities share a node — and that node, by the election restriction, forced the new leader to already hold the entry. Overlap is the whole reason a majority quorum is safe.

Fault tolerance: a cluster of N = 2f+1 nodes tolerates f crash failures and still has a majority (f+1) available. 3 nodes tolerate 1 failure; 5 tolerate 2; 7 tolerate 3.
Why odd sizes: 4 nodes need a majority of 3 — same fault tolerance as 3 nodes (1 failure) but more nodes to coordinate and slower quorums. Even sizes buy you nothing and cost latency, so clusters are almost always 3, 5, or 7.
Commit latency: one client write needs a single round-trip from the leader to the fastest majority of followers — roughly one network RTT plus a disk fsync on each. The slowest node in the majority does not gate you; the (f+1)-th fastest does.
Message cost: each AppendEntries round is O(N) messages (leader to each follower) and an election is O(N) RequestVotes plus O(N) replies — both linear in cluster size, which is why Raft clusters stay small (3–7) and scale reads, not the consensus group, horizontally.

Latency floorBecause every committed write needs a majority round-trip, geo-distributed Raft groups pay the inter-region RTT on every write. A cluster spanning two coasts (~70 ms RTT) cannot commit faster than ~70 ms per write no matter how fast the disks are. This is why systems place all voting members in nearby zones and use witness/non-voting replicas elsewhere.

Beyond the core: membership changes, snapshots, fast reads#

The election-plus-replication core is the exam answer, but production Raft needs three more pieces to actually run for years. Each is bundled into the paper or Ongaro's thesis.

Cluster membership changes — you can't just swap the node set atomically, because nodes apply the change at different moments and could briefly form two disjoint majorities (two leaders). The paper's solution is joint consensus: a transitional configuration C(old,new) in which decisions require majorities of BOTH the old and the new sets, so overlap is preserved throughout the switch. The thesis later simplified this to single-server changes: add or remove one node at a time, where old and new majorities always overlap automatically — this is what most implementations use.
Log compaction / snapshots — the log can't grow forever. Each node periodically writes a snapshot of its state machine and discards the log prefix the snapshot covers. A leader that has already discarded entries a lagging follower needs sends an InstallSnapshot RPC instead of replaying ancient AppendEntries.
Read-only optimizations — a naive read goes through the log like a write (correct but slow). To serve linearizable reads cheaply, a leader confirms it is still leader (exchange heartbeats with a majority) before answering, and records the commit index at read time. Leader leases go further: a leader holding a time-bounded lease can serve reads locally without a round-trip, trading a clock-bound assumption for latency.
Pre-vote — a partitioned node keeps timing out and bumping its term; when it rejoins, its inflated term forces an unnecessary leader step-down (a 'disruptive server'). The pre-vote extension makes a node first ask 'would you vote for me?' WITHOUT incrementing its term, and only starts a real election if a majority says yes — preventing needless disruption.

Go deeperWhy joint consensus needs majorities of BOTH configurations

Suppose you switch a 3-node cluster {A,B,C} to {C,D,E} by letting each node adopt the new config as soon as it hears about it. For a window, A and B might still think the cluster is {A,B,C} (majority = 2: A,B) while D and E think it is {C,D,E} (majority = 2: D,E). Now {A,B} could elect one leader and {D,E} another — two leaders in the same term, with no overlap. Split brain.

Joint consensus closes the window. During the transition the cluster runs in C(old,new): every election and every commit must win a majority of the OLD set AND a majority of the NEW set simultaneously. No decision can be made by either set alone, so the two-leader scenario above is impossible. Once C(old,new) is committed, the leader proposes C(new) alone, and the transition completes. The single-server-change shortcut avoids the joint phase entirely by observing that if you only ever add or remove one node, any new-majority and old-majority necessarily share a member — so overlap is free.

Trade-offs: what consensus costs you#

Raft buys you a single, linearizable, durable log — and you pay for it in latency, throughput ceiling, and availability under partition. It is firmly a CP system in CAP terms: when a partition prevents a majority, the minority side sacrifices availability (it refuses writes) to preserve consistency.

Availability vs consistency — under a partition, only a majority side stays writable; a cluster with no majority side is fully unavailable for writes. Raft never sacrifices consistency for availability, by design.
Latency — every committed write pays a majority round-trip plus per-node fsync. You cannot beat one network RTT to your quorum, which dominates in geo-distributed deployments.
Leader bottleneck — all writes (and, for strong reads, the read confirmation) funnel through one node. The leader's CPU, disk, and uplink cap write throughput; the cluster does not scale writes by adding voters. Followers can offload only stale or lease-bounded reads.
Small clusters only — because every round is O(N) and quorums get slower with more members, Raft groups stay at 3–7 nodes. Scaling data past that means sharding into many independent Raft groups (the CockroachDB / TiKV model), not growing one group.

When Raft is the wrong toolIf you don't need linearizability — say, a cache, analytics rollups, or any workload happy with eventual consistency — Raft's majority round-trip is pure overhead, and a leaderless / gossip / Dynamo-style design will give you higher availability and write throughput. Reach for Raft when you need a strongly consistent, ordered source of truth (metadata, configuration, locks, transaction status), not for bulk data where staleness is acceptable.

Raft vs Multi-Paxos vs ZAB#

	Raft	Multi-Paxos	ZAB (ZooKeeper)
Leadership	Strong single leader; all entries flow leader→follower	Optional 'distinguished proposer'; leadership is a convention, not core	Single leader (primary) elected per epoch
Log model	Append-only log, never has holes; follower mirrors leader	Per-slot agreement; logs can fill out of order with gaps	Primary-order broadcast of transactions (zxid = epoch+counter)
Primary goal	Understandability + practical implementation	Minimal, general consensus theory	Total-order broadcast for a primary-backup store
Membership change	Joint consensus / single-server change	Not specified by the core protocol	Dynamic reconfiguration (added later)
Used by	etcd, Consul, CockroachDB, TiKV, Kafka KRaft	Google Chubby/Spanner, Megastore (Multi-Paxos variants)	Apache ZooKeeper, and systems built on it (HBase, Kafka pre-KRaft)

The headline: Raft and Multi-Paxos are equivalent in power and efficiency — Raft is essentially Multi-Paxos with a strong-leader discipline and a no-holes log bolted on to make it teachable and buildable. ZAB is close to Raft in spirit (single leader, ordered broadcast) because both were built for real primary-backup systems; the differences are mostly in how leadership/epochs and recovery are framed.

Where Raft runs in the wild#

Raft became the default consensus layer for new infrastructure after 2014, largely because it is realistic to implement correctly from the paper. Most uses follow one of two patterns: a single Raft group holding cluster metadata, or many Raft groups (one per data shard) for horizontal scale.

etcd — the canonical Raft library and the key-value store behind Kubernetes; the entire cluster state of Kubernetes lives in a single etcd Raft group.
HashiCorp Consul / Nomad — service discovery and scheduling; use the same battle-tested hashicorp/raft library for their server quorum.
CockroachDB — shards ("ranges") of the keyspace are each their own Raft group, with thousands of groups per cluster, giving SQL-level horizontal scale on top of per-range consensus.
TiKV / TiDB — the same per-region multi-Raft model; TiKV's Raft implementation is a direct port of etcd's Raft to Rust.
Kafka KRaft — Kafka's replacement for its ZooKeeper dependency: the controller quorum now runs a Raft-based metadata log internally, removing the separate ZAB-based ZooKeeper ensemble.

Single group vs many groupsetcd and Consul put all coordination state in one Raft group — simple, and fine because metadata is small. Databases that must scale data (CockroachDB, TiKV) can't fit everything in one group (remember: writes funnel through one leader), so they shard the keyspace into thousands of independent Raft groups, each with its own leader and quorum. 'Multi-Raft' is how Raft scales past a single leader's ceiling.

Common misconceptions & gotchas#

Can two nodes ever both be leader and commit conflicting entries?

Two nodes can briefly both BELIEVE they are leader (e.g. a partitioned old leader that hasn't heard about the new term), but they cannot both COMMIT. Commit requires a majority, terms only increase, and a node grants at most one vote per term — so two leaders can't exist in the same term, and a higher-term leader's majority necessarily excludes enough nodes that the stale leader can never reach its own majority. The stale leader's writes stay uncommitted and are later overwritten.

Does a candidate with a shorter (less up-to-date) log ever win an election?

No. The election restriction makes every voter reject a candidate whose last log entry is older (lower term, or same term but lower index) than its own. Since committed entries are on a majority, any candidate missing them is rejected by that majority and cannot win. A high term lets you start an election but never overrides log up-to-dateness.

Are reads automatically linearizable just because writes go through Raft?

No — this is the most common Raft mistake. A node that thinks it's leader might be a deposed stale leader and serve old data. For linearizable reads the leader must confirm it still leads (a heartbeat round-trip to a majority, or a valid lease) and read at or after its committed index. Reading from a follower, or from an unconfirmed leader, can return stale results.

Why must clusters be odd-sized?

They don't strictly have to be, but even sizes are wasteful: a 4-node cluster needs a majority of 3 and tolerates only 1 failure — identical fault tolerance to a 3-node cluster but with an extra node to coordinate and a larger quorum to wait on. Odd sizes (3, 5, 7) maximize fault tolerance per node and avoid more tie-prone configurations.

Does Raft tolerate Byzantine (malicious/buggy) nodes?

No. Raft assumes crash-stop failures and a non-malicious network (messages may be lost, delayed, reordered, or duplicated, but not forged). A node that lies about its log or term can break safety. Tolerating Byzantine faults requires a different class of protocol (PBFT, Tendermint).

In an interview#

Lead with the shape of the problem: replicate a log, not arbitrary state, by funneling all writes through a single elected leader per term. That single decision is what turns 'how do N machines agree' into 'how do we safely elect and replace a leader' — a much smaller problem.

Know the two RPCs by name (RequestVote, AppendEntries) and the two safety rules that make them correct: the log-recency check during voting (so an out-of-date node can't get elected even with a higher term), and the current-term-only commit rule (so a majority count alone never resurrects an uncommitted entry from a dead leader's term). Be ready to explain what happens during a network partition — an isolated leader keeps accepting writes that can never commit, while the majority side elects a new leader and keeps going.

Then try it in the simulator: TICK until a leader is elected, CLIENT_APPEND a couple of commands and TICK until they commit on every node, then PARTITION the leader and watch the majority side re-elect — followed by HEAL to watch the old leader discover the higher term, step down, and catch its log up.

References & further reading#

References

Ongaro & Ousterhout — In Search of an Understandable Consensus Algorithm (Extended Version) — the canonical Raft paper; USENIX ATC '14 Best Paper. Read this first.
raft.github.io — The Raft homepage — interactive visualization, implementations list, and links to all resources
Diego Ongaro — Consensus: Bridging Theory and Practice (PhD dissertation, 2014) — the long version: single-server membership changes, snapshots, lease reads, pre-vote, TLA+ spec
The Secret Lives of Data — Raft, an animated walkthrough — the gentlest visual introduction to election and replication
etcd — Raft library and design docs — the most widely used production Raft implementation (powers Kubernetes)
Lamport — Paxos Made Simple (2001) — the consensus protocol Raft was designed to be more understandable than

Ready to try it?

The simulator is a real, deterministic implementation — pick a scenario and step through it, scrubbing the timeline forward and backward through every change.

Try these in the simulator

Election, replication, partition & heal →The whole story: a leader is elected, replicates commands, gets partitioned, and a new leader takes over.Leader election →Watch a follower's election timeout fire, request votes, win a majority, and become leader.Log replication →The leader appends two client commands and replicates them to a majority before committing.Partition & heal →Isolate the leader: a new one is elected on the majority side, and healing rejoins the old leader as a follower.

Open the Raft Consensus simulator →

Up nextTwo-Phase Commit

← Back to the learning path