Consistent Hashing · System Internals

The big picture#

TL;DRthe 30-second version

Naive hash(key) % N breaks catastrophically on resize: change N and almost every key remaps, triggering a cluster-wide data migration (a 'rehash storm').
Consistent hashing places both servers and keys on the same circular hash space (a 'ring'). A key is owned by the first server found walking clockwise from the key.
Adding or removing one server only moves keys between ring neighbors — about K/N of the K keys move, not all of them. Lookups are an O(log N) binary search on a sorted array.
One point per server gives lumpy, unbalanced load, so each server is scattered across the ring as many virtual nodes (vnodes); this smooths each server's share toward 1/N.
It is the partitioning layer behind Dynamo/DynamoDB, Cassandra, Riak, ketama (memcached), Discord, and CDNs. Rendezvous (HRW) hashing and jump consistent hash are common alternatives.

Everything below expands on these points. Read the core sections top to bottom for the full mental model; the collapsible "Go deeper" boxes hold the advanced internals (the math on load variance, bounded-load consistent hashing, the alternative algorithms) that you can skip on a first pass and return to later.

owner = first node clockwise (to the right; wrap at the end)

N3N1N202^32 ↻ wraps to 0k_ak_bk_ck_d

The hash ring, laid flat (the right end wraps back to the left)

Start here: why naive mod-N hashing is catastrophic#

Suppose you have 3 cache servers and you pick which one stores a key with: server = hash(key) % 3. It works perfectly — every key has a home, lookups are O(1), and the load is even. The trouble starts the moment the cluster size changes.

Add a 4th server and the formula becomes hash(key) % 4. The modulus changed, so almost every key now computes a different server. A key whose hash is 100 went from server 1 (100 % 3) to server 0 (100 % 4). It is not just that key — on average only 1/4 of keys happen to land on the same server by coincidence, so roughly 75% of all keys must physically move. The same disaster runs in reverse when a server dies: you drop to hash(key) % 2 and, again, the vast majority of keys are suddenly in the wrong place.

The rehash stormOn a resize, mod-N doesn't just relocate a few keys — it relocates almost all of them at once. For a cache, that means a near-total cache miss storm: every relocated key misses on its new server and hammers the backing database simultaneously. For a database, it means terabytes of data streaming between nodes while live traffic is being served. A single node failure can take the whole cluster down with it.

The root cause: in mod-N, the server assignment of a key depends on N. Change N and you change the function for every key. What we want instead is a scheme where a key's home depends only on the key and the set of servers that exist — so that adding or removing one server perturbs only a thin slice of keys near that server, and leaves everyone else alone.

The goal, stated preciselyWhen you add the (N+1)-th server, only about 1/(N+1) of keys should move — and they should move ONTO the new server, not get reshuffled among the old ones. When you remove a server, only the keys it held should move, onto its neighbors. Every other key stays exactly where it is.

The mechanism: wrap the hash space into a ring#

The key insight is to hash servers and keys into the SAME space. Take a fixed hash output range — say all 32-bit integers, 0 to about 4.29 billion — and imagine it bent into a circle so that the largest value wraps back around to 0. Now hash each server's name to a position on this circle, and hash each key to a position too. Servers and keys live on one shared ring.

Routing rule: a key is owned by the first server you reach walking clockwise from the key's position. That server is the key's 'successor' on the ring. Each server is therefore responsible for the arc of the ring that ends at it — every key whose position falls between the previous server (exclusive) and this server (inclusive).

Hash each server name to a ring position: serverPos = hash(serverName).
Hash each key to a ring position: keyPos = hash(key).
The key's owner is the first server clockwise from keyPos (wrapping past the top of the ring back to the smallest server position if needed).
Crucially, this assignment never mentions N. A key's home depends only on its own hash and the set of server positions — so changing the set perturbs only nearby keys.

Because a server's hash position does not depend on how many other servers exist, adding or removing a server leaves every other server's position fixed. That single property is the whole trick: it localizes change to one arc of the ring instead of redefining the function globally the way mod-N does.

Adding and removing servers: change stays local#

When you add a new server, it lands at one position on the ring. The only keys that need to move are the ones in the arc that now ends at the new server — i.e., keys that used to walk clockwise past this spot to the next server, and now stop at the newcomer instead. Those keys are handed off from exactly one existing server (the new node's clockwise successor) to the new node. Every other key still routes to the same server as before.

When you remove a server, all of its keys — and only its keys — move to its clockwise successor, which simply inherits the departing node's arc. Once again, every other key is untouched. No key ever moves between two servers that both survived the change; movement only ever happens at the boundary of the node that joined or left.

PredictA cluster has 4 servers and you remove one. Roughly what fraction of keys move, and where do they go?

Hint: Whose arc does the departing node's arc merge into?

About 1/4 of keys move — specifically the keys the dead node owned — and they all go to a single server: the dead node's clockwise successor, which absorbs its arc. The other ~3/4 of keys, spread across the three surviving nodes, do not move at all. (With virtual nodes the dead node's load is split across many successors instead of dumped on one, which is exactly why vnodes matter — see below.)

Why this beats mod-NAdding 1 server to a 3-server cluster reshuffles ~75% of keys under mod-N, versus only ~25% under consistent hashing — and that 25% moves cleanly onto the new node rather than churning among the old ones. The surviving nodes see zero disruption, so there is no cluster-wide cache-miss storm and no all-to-all data shuffle.

The numbers: how many keys move, and how fast is lookup#

Three quantities matter: how many keys move on a membership change, how expensive a lookup is, and how evenly load is balanced across servers.

Keys moved on add/remove ≈ K/N. With K keys spread uniformly over N servers, each server owns about K/N keys. Adding a server claims one new arc (~K/N keys); removing one sheds its arc (~K/N keys). This is the headline result: membership changes touch a 1/N slice, not the whole dataset.
Lookup cost: O(log N). The ring is stored as a sorted array of positions; finding a key's successor is a lower-bound binary search (with wrap-around). With V virtual nodes per server it is O(log(N·V)) — still logarithmic.
Load balance variance is the catch. With a single random point per server, arc lengths follow roughly an exponential distribution: the most-loaded server can own on the order of log N times the average. You do NOT get a clean 1/N share for free — smoothing it out is what virtual nodes are for.

Go deeperThe load-variance math, and why √V shows up

Drop N points uniformly at random on a unit circle. The N arcs between consecutive points are not equal — they are approximately exponentially distributed with mean 1/N. The largest arc is about (ln N)/N, so the busiest server can carry roughly ln N times the average load. For N = 100 that is a ~4–5× imbalance from pure chance, with nothing wrong with your hash function.

Give each physical server V virtual points instead of one. A server's total share is now the sum of V independent arcs, so by the law of large numbers it concentrates around 1/N. The relative standard deviation of a server's load scales like 1/√V: going from 1 to 100 vnodes cuts the spread by ~10×, and from 1 to 256 by ~16×. This is the precise reason Cassandra historically defaulted to 256 vnodes per node — enough copies to keep every node within a few percent of its fair share. Karger's original paper makes this rigorous: with Θ(log N) virtual copies per node, every node's load is within a constant factor of the average with high probability.

The cost of large V is metadata: the ring array, the gossiped token list, and (in a database) the number of distinct ranges each node must stream during repair or bootstrap all grow with N·V. That is why newer Cassandra versions lowered the default to 16 vnodes paired with a smarter token-allocation algorithm — you do not need hundreds of random tokens if you place a handful of them deliberately.

Under the hood: the ring is just a sorted array#

The 'ring' is a metaphor; the data structure is a sorted array of (position, serverId) pairs. Finding the owner of a key is a standard lower-bound binary search — find the first entry whose position is ≥ keyHash. If none exists (the key's hash is past the last position), wrap around to the first entry. That wrap is the only thing that makes it a 'ring' rather than a line.

Inserting a server: compute its V positions, binary-search each into the array, splice them in, then re-route only the keys in the newly claimed arcs. Removing a server: drop its entries and re-route its keys to their new successors. The ring update is O(V log(N·V)); the re-routing is O(affected keys) ≈ O(K/N).

The hash function mattersConsistent hashing assumes positions are spread uniformly. A weak or biased hash clusters keys and overloads some servers no matter how many vnodes you add. Production systems use well-distributed non-cryptographic hashes such as MurmurHash3 or xxHash (Cassandra uses Murmur3 for its partitioner). The simulator uses FNV-1a 32-bit — fast, and uniform enough to show the distribution story.

Virtual nodes, weighting, and replication#

With only one position per server, distribution is lumpy. If three servers land at 0.1, 0.2, and 0.8, the server at 0.8 owns the arc from 0.2 all the way around through 0 — about 60% of the ring, and therefore 60% of the traffic. That is not a bad hash; it is the inherent variance of placing a few random points.

Virtual nodes fix it. Each physical server is given V positions — computed by hashing server#0, server#1, …, server#(V-1) — so it appears at V scattered spots around the ring. With many small arcs per server instead of one big one, each server's total share converges to ≈ 1/N, and when a server dies its load is split across many successors rather than dumped entirely on one neighbor.

with a single token each, the gaps between nodes are wildly uneven

N1N2N30wrap

One point per node: uneven arcs → hot spots

each node gets many tokens, scattered around the ring

N1N2N3N2N1N3N1N2N3N1N20wrap

V points per node: interleaved → even

Weighted nodes: a bigger machine can be given proportionally more vnodes (e.g. 2× the tokens) so it owns ~2× the keys — a clean way to run a heterogeneous cluster.
Replication: to keep R copies of each key, walk clockwise from the key past its primary and pick the next R DISTINCT physical nodes (skipping additional vnodes of nodes already chosen). This is Dynamo's 'preference list'. Skipping vnodes of the same physical node is essential — otherwise all R replicas could land on one machine.
Rack/zone awareness: replica selection often also skips nodes in the same rack or availability zone so the R copies survive a correlated failure.
Vnode count is a tuning knob: more vnodes = smoother load and finer-grained rebalancing, but a larger ring, more tokens to gossip, and more ranges to stream during repair.

Alternatives: rendezvous (HRW), jump hash, bounded loads#

The ring is the classic design, but it is not the only way to get the 'minimal movement on resize' property. Three alternatives come up constantly in interviews and production.

Rendezvous hashing (Highest Random Weight, HRW): for a key, compute hash(key, server) for every server and pick the server with the highest score. Adding or removing a server only changes the winner for keys whose top choice was that server — so movement is still ~K/N — and you get even load with no vnodes or ring to store. The cost is O(N) per lookup (you score every server), which is fine for modest N. It generalizes cleanly to picking the top-R servers for replication.
Jump consistent hash (Lamping & Veach, Google, 2014): a ~5-line function mapping a key to a bucket in [0, N) using zero memory, faster than a ring, with near-perfect balance and exactly the optimal K/N movement when N grows. Its limitation: buckets must be numbered 0..N-1 and you can only add/remove the highest-numbered bucket cleanly — great for sharding a resizable storage pool, awkward when arbitrary named nodes come and go.
Bounded-load consistent hashing (Mirrokni, Thorup & Zadimoghaddam, Google, 2017): a ring variant with a hard cap on any server's load (at most (1+ε) times the average). If a key's natural owner is already full, it overflows to the next node clockwise. This eliminates hot spots from skewed key popularity, at the price of moving an extra O(1/ε²) keys per update. Used in Google Cloud Pub/Sub and available in Envoy and HAProxy.

Trade-offs and when to reach for it#

Consistent hashing is the right tool when membership is dynamic and data movement is expensive: distributed caches, sharded key-value and wide-column stores, and stateful load balancing where you want session affinity to survive scale events. The core trade is balance vs. metadata, tuned through the vnode count.

More vnodes → smoother load and gentler failover (a dead node's keys spread over many successors), but a larger ring, more tokens to gossip, and more ranges to stream on repair/bootstrap.
Fewer vnodes → cheaper metadata and faster lookups, but lumpier load and a harder failover (one neighbor inherits a whole arc).
Ring (vnodes) vs. rendezvous: rendezvous needs no ring state and balances perfectly, but is O(N) per lookup; the ring is O(log N) but needs the token table and vnode tuning.
Plain consistent hashing balances by KEY COUNT, not by popularity. If a few keys are red-hot, the owning node is overloaded regardless of vnodes — that is when you reach for bounded-load consistent hashing or add a caching/replication layer for hot keys.

The one-line summaryConsistent hashing trades a little routing complexity (a sorted ring of virtual tokens) for an enormous win: membership changes move ~K/N of the data instead of nearly all of it.

Failure modes#

Most consistent-hashing incidents trace back to two situations: too little smoothing, or load that the algorithm cannot see.

Hot spots without vnodes: with one token per node, random arc-length variance alone can leave one node owning several times the average. The fix is virtual nodes (or deliberate token placement).
Cascading load on node loss: when a node dies, its entire arc lands on a single successor, which can then exceed capacity and fall over too — propagating around the ring. Vnodes spread the dead node's load over many successors, turning a potential cascade into a small uniform bump.
Key skew (popularity hotspots): consistent hashing balances how many distinct keys each node owns, not how often they are accessed. One viral key overloads its owner no matter how perfect the ring. Mitigate with bounded-load consistent hashing, per-key replication, or a front cache.
Independent ring views: if nodes/clients disagree on the ring (stale gossip, mismatched vnode config, or different hash functions across client libraries), the same key routes to different nodes from different callers — silent inconsistency. Everyone must agree on the hash function and the token map.

Comparison: the four schemes side by side#

	mod-N	Consistent (ring+vnodes)	Rendezvous (HRW)	Jump hash
Keys moved on resize	~all (catastrophic)	~K/N	~K/N	~K/N (optimal)
Lookup cost	O(1)	O(log N)	O(N)	O(log N), no memory
Memory / state	none	ring of N·V tokens	none	none
Load balance	even (until resize)	even with enough vnodes	even, no tuning	near-perfect
Arbitrary named nodes	yes	yes	yes	no — buckets 0..N-1 only
Easy top-R for replicas	no	yes (walk clockwise)	yes (top-R scores)	no

Rule of thumb: ring + vnodes for large dynamic clusters with replication (Dynamo/Cassandra); rendezvous when N is small and you want zero ring state and perfect balance; jump hash for sharding a resizable, sequentially numbered storage pool.

Where consistent hashing runs in the wild#

Consistent hashing is the partitioning layer under a huge share of distributed storage and routing. The differences between systems are mostly about vnode counts, replication strategy, and whether the ring is gossiped peer-to-peer or held by a coordinator.

Amazon Dynamo / DynamoDB: the paper that popularized the technique. A gossiped ring with virtual nodes; each key is replicated to the next R distinct nodes clockwise (the 'preference list'). Cassandra and Riak are direct descendants of this design.
Apache Cassandra: token ring with virtual nodes (historically 256 per node, newer defaults ~16 with smarter allocation). The ring drives both data placement and replica selection.
memcached clients / ketama: client-side consistent hashing (the ketama algorithm) so that adding or removing a cache server invalidates only its share of keys instead of the whole cache — the original motivating use case.
CDNs: route a URL to a cache server via the ring, so adding a cache only re-homes (and cold-misses) its slice of URLs rather than reshuffling the entire edge.
Discord: uses rendezvous (HRW) hashing to assign which service node owns a given guild/channel, getting even placement without a ring to maintain.
Envoy & Maglev: Envoy offers ring-hash and Maglev load-balancing policies for sticky routing; Google's Maglev uses its own consistent-hashing scheme to keep connections pinned to the same backend even as the backend set changes.

Common misconceptions & gotchas#

Doesn't hashing already balance load? Why is this special?

A good hash does spread keys evenly across a FIXED number of buckets — that is what mod-N gives you. The special problem consistent hashing solves is what happens when the number of buckets CHANGES. Under mod-N a resize remaps almost every key; under consistent hashing it remaps only ~K/N. The win is stability across membership changes, not balance at a fixed size.

What do virtual nodes actually fix?

Two things. First, variance: with one random point per server, arc lengths are very uneven, so some servers naturally own far more of the ring than others. Spreading each server over many points makes every server's share converge to ≈ 1/N. Second, failover: with one point, a dead server's entire arc lands on a single neighbor; with many points, its load is split across many neighbors, avoiding a cascade.

How is replication done on the ring?

Walk clockwise from the key past its primary owner and select the next R DISTINCT physical nodes, skipping any further virtual nodes that belong to a node you already picked (and often skipping same-rack/zone nodes too). Those R nodes form the key's replica set — Dynamo calls it the preference list. Skipping same-physical-node vnodes is what stops all R copies landing on one machine.

Does consistent hashing fix hot keys?

No. It balances how many distinct keys each node owns, not how often they are accessed. A single viral key overloads its owner regardless of the ring. For access skew you need bounded-load consistent hashing, replication of hot keys, or a caching tier.

QuizYou run a 10-node cache with one token per node and no virtual nodes. One node fails at peak traffic and a second node falls over moments later. What most likely happened, and what prevents it?

The hash function was weak; switch to a cryptographic hash
The dead node's entire arc dumped onto its single successor, overloading it — virtual nodes would have spread that load across many nodes
mod-N rehashed everything; that is expected on failure
Nothing prevents it; node failures always cascade on a ring

Show answer

The dead node's entire arc dumped onto its single successor, overloading it — virtual nodes would have spread that load across many nodes — With one token per node, a failed node's whole key range (and traffic) lands on exactly one clockwise successor, which can then exceed capacity and fail too — a cascade. Virtual nodes give each physical node many small arcs, so a dead node's load is split across many successors, turning a dangerous spike into a small uniform bump. A stronger hash does not help, and consistent hashing specifically avoids the mod-N global reshuffle.

In an interview#

Lead with the problem mod-N can't solve: adding or removing a server reshuffles nearly all keys, causing a rehash storm. Then state the fix in one sentence: hash servers and keys into the same circular space and route each key to its first clockwise server, so a membership change moves only ~K/N keys instead of all of them.

Then go a level deeper, unprompted, because that is what separates a strong answer: virtual nodes smooth the otherwise-lumpy load and split a dead node's load across many successors; replication walks clockwise to the next R distinct nodes; the whole thing is a sorted array with an O(log N) binary search, not magic. Quote the K/N movement and the O(log N) lookup. Name the systems (Dynamo, Cassandra, Riak, ketama) and at least one alternative (rendezvous/HRW or jump hash), and the one weakness: it balances key count, not key popularity, so hot keys still need bounded loads or replication.

Then open the simulator: add a few nodes and watch the ring fill, put keys and trace each one clockwise to its owner, then remove a node and confirm only its keys remapped while everything else stayed put — and that vnodes keep the per-node load even.

References & further reading#

References

Karger, Lehman, Leighton, Panigrahy, Levine & Lewin — Consistent Hashing and Random Trees (STOC 1997) — the paper that introduced consistent hashing and the load-balance bounds
DeCandia et al. — Dynamo: Amazon's Highly Available Key-value Store (SOSP 2007) — ring + virtual nodes + clockwise preference-list replication in production
Lamping & Veach — A Fast, Minimal Memory, Consistent Hash Algorithm (Jump Hash, 2014) — ~5 lines, zero memory, optimal K/N movement for numbered buckets
Mirrokni, Thorup & Zadimoghaddam — Consistent Hashing with Bounded Loads (2016) — caps any node's load at (1+ε)× average; the fix for popularity hot spots
Eisenbud et al. — Maglev: A Fast and Reliable Software Network Load Balancer (NSDI 2016) — Google's consistent-hashing-based L4 load balancer with connection stickiness
Rendezvous (Highest Random Weight) hashing — overview — the ringless alternative: score every node, pick the highest

Ready to try it?

The simulator is a real, deterministic implementation — pick a scenario and step through it, scrubbing the timeline forward and backward through every change.

Try these in the simulator

Build the ring, then remove a node →Place three nodes on the hash ring, hash eight keys to their clockwise owner, then remove a node and watch only its keys move — everyone else stays put.Adding a node moves ~1/N keys →With keys already placed across three nodes, add a fourth — only the keys that now fall between it and its predecessor move to it; the rest don't budge.A key finds its node clockwise →Hash a key to a point on the ring, then walk clockwise to the first node — that's its owner. The same hash always lands on the same node.

Open the Consistent Hashing simulator →

Up nextKademlia DHT

← Back to the learning path