Skip List · System Internals

The big picture#

TL;DRthe 30-second version

A skip list is a sorted linked list plus randomized 'express lanes' — each node gets a tower of forward pointers whose height is decided by repeated coin flips (promote with probability p, usually 1/2).
Search starts at the top-left sentinel, moves right while the next key is smaller than the target, and drops a level when it would overshoot. The geometric thinning of upper levels makes this expected O(log n).
Insert and delete run the same search to find each level's predecessor, then splice or unlink one tower — no rotations, no global rebalancing. That makes the code short and easy to make lock-free.
It powers Redis sorted sets (ZSET), the LevelDB/RocksDB memtable, and java.util.concurrent.ConcurrentSkipListMap. The classic alternative — a balanced BST (red-black/AVL) — gives a hard worst-case bound but is fiddlier and harder to make concurrent.

Everything below expands on these points. Read the core sections top to bottom for the full mental model; the collapsible "Go deeper" boxes hold the probability math and edge cases you can skip on a first pass and return to later.

every key lives at L0; taller towers are rarer — a box means the key reaches that level

H36912172125303340NILL3H25NILL2H925NILL1H69172533NILL0H36912172125303340NIL

Three express lanes over a sorted base list (p = 1/2)

Start here: the problem it solves#

A sorted linked list is easy to insert into, but searching it is O(n): you have to walk one node at a time because there is no way to jump ahead. An array fixes search with binary search at O(log n), but then every insert or delete shifts elements — O(n) again — so neither plain structure gives you fast ordered search and fast updates at the same time.

The textbook answer is a balanced binary search tree — red-black or AVL — which guarantees O(log n) search, insert, and delete. The catch is the machinery that keeps it balanced: rotations, recoloring, and case analysis that are famously fiddly to implement correctly. Worse, those rotations restructure pointers far from the point of change, which makes a balanced tree hard to update from multiple threads without taking coarse locks.

A skip list gets the same expected O(log n) operations as a balanced tree, but stays balanced on average through randomness instead of rotations. There is no rebalancing step at all — an insert or delete only touches the tower of the affected node and the predecessors that point at it. That locality is what makes the code short and what makes concurrent, even lock-free, implementations practical.

The trade-offSkip lists give probabilistic O(log n) search/insert/delete with simple, rotation-free code that is easy to make concurrent. The cost is extra forward pointers (about two per node on average) and bounds that hold in expectation and with high probability, not as a hard worst case. For the overwhelming majority of systems that is a trade worth making.

The mechanism: stacked linked lists and coin-flip towers#

Picture several linked lists stacked on top of each other. The bottom level (level 0) contains every node in sorted order. Each level above holds a sparse subset of the level below — the express lanes. Every node carries a 'tower' of forward pointers, one per level it participates in, so a tall tower is reachable from several lanes and a height-1 node only from the base list.

Heights are chosen by chance, not by position. When you insert a node you flip a coin to decide its tower height: it is always at level 0; with probability p (typically 1/2) it is also promoted to level 1; with probability p again to level 2; and so on, stopping at the first tails. This makes roughly half the nodes reach level 1, a quarter level 2, an eighth level 3 — the geometric thinning that makes the upper lanes act like a coarse index over the lanes beneath them.

Allocate the new node and run a search to record the predecessor at every level (the 'update' array).
Flip coins to pick the tower height h: start at 1, and while a fair coin says heads and h is below the cap, increment h.
If h exceeds the list's current top level, raise the head sentinel's level so the new tower has somewhere to attach.
For each level 0..h-1, splice the node in: new.next = pred.next; pred.next = new — exactly like a singly linked list insert, repeated per level.

Why randomness instead of balancing?A balanced tree carefully chooses structure to guarantee depth. A skip list lets the coin flips produce a good-enough structure on average — no existing node ever has to move to keep things balanced. Crucially, a node's height is decided locally at insert time and never changes, so updates stay local and the analysis is the same no matter what order keys arrive in.

Go deeperSentinels, the level cap, and back pointers

Real implementations use a head sentinel whose tower is as tall as the current list, and often a NIL/tail sentinel with a +∞ key so the 'move right' loop needs no null check. The maximum level is capped — Pugh suggests MaxLevel = log_{1/p}(N) for the expected size N (16 levels covers ~65k keys at p=1/2; Redis uses 32, covering ~2^64). Capping bounds the pointer overhead and is harmless because exceeding it is astronomically unlikely.

Plain skip lists store only forward pointers, so they are singly linked per level. Implementations that need to walk backwards — Redis ZSETs support ZREVRANGE — add a single back pointer at level 0, turning the base list into a doubly linked list while leaving the express lanes one-directional.

Searching: skim high, drop low#

A search starts at the head on the highest occupied level and moves right as long as the next node's key is still smaller than the target. When the next node would overshoot (its key is ≥ the target, or there is none), the search drops down one level and continues from the same node. It keeps skimming and dropping until it falls off the bottom of level 0, where the next node is either the target (found) or larger (absent).

Start top-left at the head sentinel, on the highest occupied level.
Move right while the next key is < the target (each high-level hop skips many base nodes at once).
When the next key is ≥ the target, drop down one level and repeat from the current node.
After dropping below level 0, the next node is either the target (found) or larger (absent).

skim right while the next key is < 30, then drop a level — repeat until L0

H36912172125303340NILL3H25NILL2H925NILL1H69172533NILL0H36912172125303340NIL

Search path for key 30 (highlighted = traversed)

Insert and delete reuse this exact search first — recording the predecessor at each level in an 'update' array — then splice in the new tower or unlink the target's tower at those recorded points. Because the express lanes thin out geometrically, the expected number of hops, and therefore the cost of all three operations, is O(log n).

PredictYou search for a key that sits near the end of a 1,000,000-key skip list. Roughly how many nodes does the search visit, and why isn't it ~1,000,000?

Hint: How many express-lane hops does each level save, and how many levels are there?

About 30–40 node visits, not a million. With p = 1/2 the list has ~log2(1,000,000) ≈ 20 levels, and the expected work is ~log_{1/p}(n)/p ≈ 2·log2(n) ≈ 40 steps. Each high-level hop skips, on average, all the lower-level nodes between two express stops, so the search covers exponential distance per level and only walks node-by-node in the last short stretch near the target.

Complexity: why the coin flips give O(log n)#

All three operations — search, insert, delete — are expected O(log n) time. Space is O(n): every node stores its key/value plus its tower of forward pointers, and the towers add only a constant factor on average.

Expected levels: a node reaches level k with probability p^k, so the tallest tower in n nodes is ~log_{1/p}(n) levels. At p = 1/2 that is about log2(n) levels.
Expected pointers per node: a node's height is a geometric random variable with mean 1/(1−p). At p = 1/2 that is exactly 2 forward pointers per node on average, so total space is ~2n pointers.
Expected search cost: ~(1/p)·log_{1/p}(n) comparisons. At p = 1/2 that is ~2·log2(n) — the same Θ(log n) shape as a balanced tree, with a small constant.

Go deeperThe backward-climb argument and high-probability bounds

Pugh analyzes the search path by walking it backwards from the target. At each step the path either climbs up a level (probability p, because the current node has a taller tower) or steps left (probability 1−p). The expected number of leftward steps needed to traverse one level is 1/p, and the number of levels is log_{1/p}(n) with high probability, giving expected cost (1/p)·log_{1/p}(n). Minimizing this over p gives an optimum near p ≈ 1/e ≈ 0.37, but p = 1/2 is the standard choice because it makes the coin flip a single bit and costs almost nothing extra.

The bound is not just 'expected' — it holds with high probability. The probability that search cost exceeds its mean by a constant factor falls off exponentially in n (a Chernoff-style tail). Concretely, the chance the structure degenerates badly shrinks faster than any polynomial, which is why in practice a skip list never visibly misbehaves even though, in principle, an unlucky run of coin flips could.

Tuning pp controls the search-speed/space trade. Lower p (e.g. 1/4) means taller-but-fewer express lanes: fewer pointers per node (1/(1−p) ≈ 1.33) and less memory, but more horizontal steps per level. Higher p (e.g. 1/2) means more pointers per node (2) and faster searches. Redis uses p = 1/4 to save memory; many textbook and in-memory implementations use p = 1/2 for speed.

Variants worth knowing#

Deterministic / 1-2-3 skip lists — replace the coin flips with a balance rule (between 1 and 3 nodes of one height between any two of the next height up), giving a guaranteed O(log n) worst case at the cost of more bookkeeping. They are the skip-list analogue of a 2-3-4 tree.
Indexable skip lists — store a 'span' count on each forward pointer (how many base nodes it jumps). Summing spans along the search path yields the rank of a key in O(log n), enabling order-statistic queries like 'the 1000th smallest' or ZRANK. Redis ZSETs use exactly this.
Concurrent / lock-free skip lists — because updates are local pointer splices, they map naturally onto compare-and-swap (CAS). Java's ConcurrentSkipListMap is a non-blocking design; deletions mark a node logically before physically unlinking it, so readers never see a torn structure. This concurrency-friendliness is a major reason skip lists are chosen over balanced trees in multithreaded code.

Go deeperHow lock-free deletion avoids a half-removed node

Naively CAS-ing a node out level by level can let another thread observe it present at one level and gone at another. Lock-free skip lists (Pugh's concurrent design; Java's ConcurrentSkipListMap) solve this with logical deletion: the node's value is first marked (e.g. set to a sentinel or its next pointer 'marked') with a single CAS, which atomically makes it logically absent. Physical unlinking from each level then happens lazily, and any thread that trips over a marked node helps finish the unlink before proceeding. Readers therefore always see a consistent linearizable view without locks.

Trade-offs and when to reach for one#

A skip list is the right call when you want an ordered map with O(log n) operations and you value simple, concurrency-friendly code over a hard worst-case guarantee. It is a weaker fit when you need a strict bound, the tightest possible memory, or cache-optimal scans over huge on-disk datasets.

For: in-memory ordered maps/sets, ranking and range queries, and anything multithreaded — the local, rotation-free updates make lock-free implementations tractable where a balanced tree would need coarse locking.
Against (hard real-time): the O(log n) bound is probabilistic. A safety-critical system that cannot tolerate even an astronomically rare slow operation should prefer a deterministic balanced tree or a deterministic skip list.
Against (memory-tight or on-disk): towers cost extra pointers and the nodes are scattered in memory, so cache locality is worse than a packed array or a B-tree. For disk-resident data, a B-tree's high fan-out and page alignment win decisively — which is why databases index on disk with B-trees, not skip lists.

Simplicity is the headline featurePugh's original motivation was not raw speed — a skip list and a balanced tree have the same asymptotics — but that the skip list is dramatically easier to implement, reason about, and parallelize. In an interview, 'simpler than a balanced tree and naturally concurrent' is the answer to 'why would anyone use this instead of a red-black tree?'

Failure modes: the unlucky worst case#

The pathological case is real but exponentially unlikely: if every coin flip comes up tails, every node lands at level 0 and the skip list degenerates into a plain linked list with O(n) search. Less dramatically, an unlucky distribution of heights can make some searches longer than the mean.

Two properties keep this from mattering. First, the bad cases are vanishingly rare — the probability of degradation by a constant factor falls off exponentially in n, so a million-key list effectively never misbehaves. Second, the randomness comes from the structure's own coin flips, not from the keys, so there is no adversarial input ordering that forces the worst case the way sorted insertions degrade a naive BST. (If an attacker could observe or control the RNG, they could in principle bias heights — so security-sensitive deployments seed the RNG unpredictably.)

The p tuning trade, restated as a failure tradePush p too high and you spend memory on pointers for little search gain; push p too low and searches do more horizontal walking per level, widening the tail of slow operations. p = 1/2 (speed) and p = 1/4 (memory, Redis's choice) bracket the sweet spot for almost all workloads.

Skip list vs balanced BST vs B-tree vs hash table#

	Skip list	Balanced BST (red-black)	B-tree	Hash table
Search / insert / delete	O(log n) expected	O(log n) worst case	O(log n) worst case	O(1) average
Ordered ops (range, rank, min/max)	Yes — base list + spans	Yes — in-order traversal	Yes — sorted leaves	No — unordered
Worst-case guarantee	Probabilistic	Deterministic	Deterministic	Probabilistic (O(n) on collisions)
Implementation simplicity	Simple, no rotations	Fiddly rotations/recoloring	Complex (splits/merges)	Simple
Concurrency	Easy — local CAS, lock-free exists	Hard — rotations touch far nodes	Hard — node splits lock subtrees	Moderate — bucket/stripe locks
Memory locality	Poor — scattered towers	Poor — scattered nodes	Excellent — packed pages	Good — contiguous buckets
Best home	In-memory ordered + concurrent	In-memory ordered, strict bound	On-disk / large ordered indexes	Unordered key lookup

Read this table as: hash tables win when you do not need order; B-trees win on disk because of fan-out and locality; balanced BSTs win when you need a hard bound in memory; and skip lists win when you want ordered, in-memory operations with simple code that is easy to make concurrent.

Where skip lists run in the wild#

Redis sorted sets (ZSET) — the canonical example. A ZSET is a skip list ordered by score (with a back pointer for reverse range, and span counts for ZRANK) paired with a hash map from member to score. Small ZSETs use a compact listpack; large ones switch to the skip list. This is why ZADD/ZRANGE/ZRANK are all O(log n).
LevelDB & RocksDB memtables — the in-memory write buffer in front of an LSM tree is a skip list. It must support concurrent sorted inserts and ordered iteration during a flush; the skip list's lock-friendly inserts fit perfectly. (The LSM explainer's memtable is exactly this.)
Java's java.util.concurrent.ConcurrentSkipListMap / ConcurrentSkipListSet — the standard library's concurrent sorted map, a lock-free skip list. It is the go-to when you need a thread-safe NavigableMap.
Apache Lucene & HBase — Lucene has used skip lists within postings to skip over document IDs during query evaluation; HBase's in-memory store (MemStore) is skip-list backed (Java's ConcurrentSkipListMap), mirroring the LSM memtable pattern.

A recurring shapeNotice the pattern: skip lists show up wherever you need an ordered, in-memory index that several threads touch at once — Redis ZSETs, LSM memtables, concurrent maps, search postings. That is precisely the niche where 'ordered + simple + concurrent' beats a balanced tree.

Common misconceptions & gotchas#

Why use randomness instead of a balanced tree?

Same asymptotics, far less code. A skip list reaches O(log n) without rotations, recoloring, or the case analysis a red-black/AVL tree needs — and because updates are local pointer splices, it is much easier to make concurrent (even lock-free). The price is a probabilistic rather than deterministic bound.

Is the O(log n) guaranteed?

No — it is expected and holds with high probability, not as a hard worst case. With catastrophically unlucky coin flips the list degenerates toward O(n), but the probability of degrading by any constant factor falls off exponentially in n, so a large list effectively never misbehaves. The randomness is internal, so there is no adversarial key ordering that forces the worst case.

Why p = 1/2?

It makes the height a single coin flip and balances search speed against memory: ~2 pointers per node and ~2·log2(n) search steps. The theoretical optimum for search time is around p = 1/e ≈ 0.37, and memory-sensitive systems like Redis use p = 1/4 to cut pointers to ~1.33 per node at the cost of slightly longer walks per level.

How do concurrent skip lists work?

Updates are local, so they map onto compare-and-swap. Lock-free designs (Pugh's, Java's ConcurrentSkipListMap) delete in two phases: a single CAS logically marks the node absent, then physical unlinking from each level happens lazily, with passing threads helping finish it. Readers always see a consistent, linearizable structure without taking locks.

QuizRedis chose p = 1/4 (rather than 1/2) for its sorted-set skip list. What is it primarily optimizing, and what does it give up?

Optimizing search speed; giving up worst-case guarantees
Optimizing memory (fewer pointers per node); giving up some search speed
Optimizing concurrency; giving up ordered iteration
Optimizing insert speed; giving up delete speed

Show answer

Optimizing memory (fewer pointers per node); giving up some search speed — Average pointers per node is 1/(1−p): 2.0 at p=1/2 but only ~1.33 at p=1/4. Lowering p makes towers taller-but-rarer, cutting per-node pointer memory — important when a server holds millions of sorted-set members. The cost is more horizontal steps per level, i.e. slightly slower searches. Concurrency and ordered iteration are unaffected by p.

In an interview#

Lead with the pitch: a skip list gives balanced-tree performance (expected O(log n) search/insert/delete) with much simpler, rotation-free code that is easy to make concurrent — at the cost of expected rather than worst-case bounds. Name the uses: Redis sorted sets (ZSET), LevelDB/RocksDB memtables, and Java's ConcurrentSkipListMap.

Be ready to explain the coin-flip promotion (geometric level distribution, ~2 pointers/node and ~log2(n) levels at p=1/2), the skim-high/drop-low search, and why there is no rebalancing. Expect the follow-up about the worst case — with terrible luck every node lands at level 0 and search degrades to O(n), which is vanishingly unlikely, which is why the bound is probabilistic. A strong candidate also mentions span counts for O(log n) rank queries and the two-phase logical-delete trick that makes lock-free versions safe.

Then open the simulator: INSERT a spread of keys and watch towers of different heights appear, SEARCH a key and follow the path skipping across the express lanes, then DELETE a key and see it unlinked from every level of its tower.

References & further reading#

References

William Pugh — Skip Lists: A Probabilistic Alternative to Balanced Trees (CACM, 1990) — the original paper: the structure, the search, and the expected-O(log n) analysis
Java — ConcurrentSkipListMap (JDK API docs) — the standard-library lock-free concurrent sorted map
Redis — Sorted Sets (data type docs) — ZSET semantics; skip-list-backed for large sets
Redis source — t_zset.c (skip list implementation) — real ZSKIPLIST with spans and back pointers; ZSKIPLIST_P = 0.25
LevelDB source — skiplist.h (memtable skip list) — the in-memory write buffer that fronts the LSM tree
Aref et al. — The Ubiquitous Skiplist: A Survey (2024) — modern survey: variants, concurrency, and systems usage

Ready to try it?

The simulator is a real, deterministic implementation — pick a scenario and step through it, scrubbing the timeline forward and backward through every change.

Try these in the simulator

Insert, search, delete →Build a skip list, then search using the express lanes (skip ahead, drop down) and delete a key — the structure stays sorted and balanced without rotations.The express lane in action →Watch a search for a far-right key ride the high lanes past most nodes, dropping down only as it closes in — versus a missing key that falls off the end.Deletes keep it sorted →Remove keys from several lanes at once and the list stays perfectly ordered — no rebalancing, just unlinking a node from each lane it sits on.

Open the Skip List simulator →

Up nextB+ Tree

← Back to the learning path