The big picture#
TL;DRthe 30-second version
- A skip list is a sorted linked list plus randomized 'express lanes' — each node gets a tower of forward pointers whose height is decided by repeated coin flips (promote with probability p, usually 1/2).
- Search starts at the top-left sentinel, moves right while the next key is smaller than the target, and drops a level when it would overshoot. The geometric thinning of upper levels makes this expected O(log n).
- Insert and delete run the same search to find each level's predecessor, then splice or unlink one tower — no rotations, no global rebalancing. That makes the code short and easy to make lock-free.
- It powers Redis sorted sets (ZSET), the LevelDB/RocksDB memtable, and java.util.concurrent.ConcurrentSkipListMap. The classic alternative — a balanced BST (red-black/AVL) — gives a hard worst-case bound but is fiddlier and harder to make concurrent.
Everything below expands on these points. Read the core sections top to bottom for the full mental model; the collapsible "Go deeper" boxes hold the probability math and edge cases you can skip on a first pass and return to later.
every key lives at L0; taller towers are rarer — a box means the key reaches that level
Start here: the problem it solves#
A sorted linked list is easy to insert into, but searching it is O(n): you have to walk one node at a time because there is no way to jump ahead. An array fixes search with binary search at O(log n), but then every insert or delete shifts elements — O(n) again — so neither plain structure gives you fast ordered search and fast updates at the same time.
The textbook answer is a balanced binary search tree — red-black or AVL — which guarantees O(log n) search, insert, and delete. The catch is the machinery that keeps it balanced: rotations, recoloring, and case analysis that are famously fiddly to implement correctly. Worse, those rotations restructure pointers far from the point of change, which makes a balanced tree hard to update from multiple threads without taking coarse locks.
A skip list gets the same expected O(log n) operations as a balanced tree, but stays balanced on average through randomness instead of rotations. There is no rebalancing step at all — an insert or delete only touches the tower of the affected node and the predecessors that point at it. That locality is what makes the code short and what makes concurrent, even lock-free, implementations practical.
The mechanism: stacked linked lists and coin-flip towers#
Picture several linked lists stacked on top of each other. The bottom level (level 0) contains every node in sorted order. Each level above holds a sparse subset of the level below — the express lanes. Every node carries a 'tower' of forward pointers, one per level it participates in, so a tall tower is reachable from several lanes and a height-1 node only from the base list.
Heights are chosen by chance, not by position. When you insert a node you flip a coin to decide its tower height: it is always at level 0; with probability p (typically 1/2) it is also promoted to level 1; with probability p again to level 2; and so on, stopping at the first tails. This makes roughly half the nodes reach level 1, a quarter level 2, an eighth level 3 — the geometric thinning that makes the upper lanes act like a coarse index over the lanes beneath them.
- Allocate the new node and run a search to record the predecessor at every level (the 'update' array).
- Flip coins to pick the tower height h: start at 1, and while a fair coin says heads and h is below the cap, increment h.
- If h exceeds the list's current top level, raise the head sentinel's level so the new tower has somewhere to attach.
- For each level 0..h-1, splice the node in: new.next = pred.next; pred.next = new — exactly like a singly linked list insert, repeated per level.
Go deeperSentinels, the level cap, and back pointers
Real implementations use a head sentinel whose tower is as tall as the current list, and often a NIL/tail sentinel with a +∞ key so the 'move right' loop needs no null check. The maximum level is capped — Pugh suggests MaxLevel = log_{1/p}(N) for the expected size N (16 levels covers ~65k keys at p=1/2; Redis uses 32, covering ~2^64). Capping bounds the pointer overhead and is harmless because exceeding it is astronomically unlikely.
Plain skip lists store only forward pointers, so they are singly linked per level. Implementations that need to walk backwards — Redis ZSETs support ZREVRANGE — add a single back pointer at level 0, turning the base list into a doubly linked list while leaving the express lanes one-directional.
Searching: skim high, drop low#
A search starts at the head on the highest occupied level and moves right as long as the next node's key is still smaller than the target. When the next node would overshoot (its key is ≥ the target, or there is none), the search drops down one level and continues from the same node. It keeps skimming and dropping until it falls off the bottom of level 0, where the next node is either the target (found) or larger (absent).
- Start top-left at the head sentinel, on the highest occupied level.
- Move right while the next key is < the target (each high-level hop skips many base nodes at once).
- When the next key is ≥ the target, drop down one level and repeat from the current node.
- After dropping below level 0, the next node is either the target (found) or larger (absent).
skim right while the next key is < 30, then drop a level — repeat until L0
Insert and delete reuse this exact search first — recording the predecessor at each level in an 'update' array — then splice in the new tower or unlink the target's tower at those recorded points. Because the express lanes thin out geometrically, the expected number of hops, and therefore the cost of all three operations, is O(log n).
PredictYou search for a key that sits near the end of a 1,000,000-key skip list. Roughly how many nodes does the search visit, and why isn't it ~1,000,000?
Hint: How many express-lane hops does each level save, and how many levels are there?
About 30–40 node visits, not a million. With p = 1/2 the list has ~log2(1,000,000) ≈ 20 levels, and the expected work is ~log_{1/p}(n)/p ≈ 2·log2(n) ≈ 40 steps. Each high-level hop skips, on average, all the lower-level nodes between two express stops, so the search covers exponential distance per level and only walks node-by-node in the last short stretch near the target.
Complexity: why the coin flips give O(log n)#
All three operations — search, insert, delete — are expected O(log n) time. Space is O(n): every node stores its key/value plus its tower of forward pointers, and the towers add only a constant factor on average.
- Expected levels: a node reaches level k with probability p^k, so the tallest tower in n nodes is ~log_{1/p}(n) levels. At p = 1/2 that is about log2(n) levels.
- Expected pointers per node: a node's height is a geometric random variable with mean 1/(1−p). At p = 1/2 that is exactly 2 forward pointers per node on average, so total space is ~2n pointers.
- Expected search cost: ~(1/p)·log_{1/p}(n) comparisons. At p = 1/2 that is ~2·log2(n) — the same Θ(log n) shape as a balanced tree, with a small constant.
Go deeperThe backward-climb argument and high-probability bounds
Pugh analyzes the search path by walking it backwards from the target. At each step the path either climbs up a level (probability p, because the current node has a taller tower) or steps left (probability 1−p). The expected number of leftward steps needed to traverse one level is 1/p, and the number of levels is log_{1/p}(n) with high probability, giving expected cost (1/p)·log_{1/p}(n). Minimizing this over p gives an optimum near p ≈ 1/e ≈ 0.37, but p = 1/2 is the standard choice because it makes the coin flip a single bit and costs almost nothing extra.
The bound is not just 'expected' — it holds with high probability. The probability that search cost exceeds its mean by a constant factor falls off exponentially in n (a Chernoff-style tail). Concretely, the chance the structure degenerates badly shrinks faster than any polynomial, which is why in practice a skip list never visibly misbehaves even though, in principle, an unlucky run of coin flips could.
Variants worth knowing#
- Deterministic / 1-2-3 skip lists — replace the coin flips with a balance rule (between 1 and 3 nodes of one height between any two of the next height up), giving a guaranteed O(log n) worst case at the cost of more bookkeeping. They are the skip-list analogue of a 2-3-4 tree.
- Indexable skip lists — store a 'span' count on each forward pointer (how many base nodes it jumps). Summing spans along the search path yields the rank of a key in O(log n), enabling order-statistic queries like 'the 1000th smallest' or ZRANK. Redis ZSETs use exactly this.
- Concurrent / lock-free skip lists — because updates are local pointer splices, they map naturally onto compare-and-swap (CAS). Java's ConcurrentSkipListMap is a non-blocking design; deletions mark a node logically before physically unlinking it, so readers never see a torn structure. This concurrency-friendliness is a major reason skip lists are chosen over balanced trees in multithreaded code.
Go deeperHow lock-free deletion avoids a half-removed node
Naively CAS-ing a node out level by level can let another thread observe it present at one level and gone at another. Lock-free skip lists (Pugh's concurrent design; Java's ConcurrentSkipListMap) solve this with logical deletion: the node's value is first marked (e.g. set to a sentinel or its next pointer 'marked') with a single CAS, which atomically makes it logically absent. Physical unlinking from each level then happens lazily, and any thread that trips over a marked node helps finish the unlink before proceeding. Readers therefore always see a consistent linearizable view without locks.
Trade-offs and when to reach for one#
A skip list is the right call when you want an ordered map with O(log n) operations and you value simple, concurrency-friendly code over a hard worst-case guarantee. It is a weaker fit when you need a strict bound, the tightest possible memory, or cache-optimal scans over huge on-disk datasets.
- For: in-memory ordered maps/sets, ranking and range queries, and anything multithreaded — the local, rotation-free updates make lock-free implementations tractable where a balanced tree would need coarse locking.
- Against (hard real-time): the O(log n) bound is probabilistic. A safety-critical system that cannot tolerate even an astronomically rare slow operation should prefer a deterministic balanced tree or a deterministic skip list.
- Against (memory-tight or on-disk): towers cost extra pointers and the nodes are scattered in memory, so cache locality is worse than a packed array or a B-tree. For disk-resident data, a B-tree's high fan-out and page alignment win decisively — which is why databases index on disk with B-trees, not skip lists.
Failure modes: the unlucky worst case#
The pathological case is real but exponentially unlikely: if every coin flip comes up tails, every node lands at level 0 and the skip list degenerates into a plain linked list with O(n) search. Less dramatically, an unlucky distribution of heights can make some searches longer than the mean.
Two properties keep this from mattering. First, the bad cases are vanishingly rare — the probability of degradation by a constant factor falls off exponentially in n, so a million-key list effectively never misbehaves. Second, the randomness comes from the structure's own coin flips, not from the keys, so there is no adversarial input ordering that forces the worst case the way sorted insertions degrade a naive BST. (If an attacker could observe or control the RNG, they could in principle bias heights — so security-sensitive deployments seed the RNG unpredictably.)
Skip list vs balanced BST vs B-tree vs hash table#
| Skip list | Balanced BST (red-black) | B-tree | Hash table | |
|---|---|---|---|---|
| Search / insert / delete | O(log n) expected | O(log n) worst case | O(log n) worst case | O(1) average |
| Ordered ops (range, rank, min/max) | Yes — base list + spans | Yes — in-order traversal | Yes — sorted leaves | No — unordered |
| Worst-case guarantee | Probabilistic | Deterministic | Deterministic | Probabilistic (O(n) on collisions) |
| Implementation simplicity | Simple, no rotations | Fiddly rotations/recoloring | Complex (splits/merges) | Simple |
| Concurrency | Easy — local CAS, lock-free exists | Hard — rotations touch far nodes | Hard — node splits lock subtrees | Moderate — bucket/stripe locks |
| Memory locality | Poor — scattered towers | Poor — scattered nodes | Excellent — packed pages | Good — contiguous buckets |
| Best home | In-memory ordered + concurrent | In-memory ordered, strict bound | On-disk / large ordered indexes | Unordered key lookup |
Read this table as: hash tables win when you do not need order; B-trees win on disk because of fan-out and locality; balanced BSTs win when you need a hard bound in memory; and skip lists win when you want ordered, in-memory operations with simple code that is easy to make concurrent.
Where skip lists run in the wild#
- Redis sorted sets (ZSET) — the canonical example. A ZSET is a skip list ordered by score (with a back pointer for reverse range, and span counts for ZRANK) paired with a hash map from member to score. Small ZSETs use a compact listpack; large ones switch to the skip list. This is why ZADD/ZRANGE/ZRANK are all O(log n).
- LevelDB & RocksDB memtables — the in-memory write buffer in front of an LSM tree is a skip list. It must support concurrent sorted inserts and ordered iteration during a flush; the skip list's lock-friendly inserts fit perfectly. (The LSM explainer's memtable is exactly this.)
- Java's java.util.concurrent.ConcurrentSkipListMap / ConcurrentSkipListSet — the standard library's concurrent sorted map, a lock-free skip list. It is the go-to when you need a thread-safe NavigableMap.
- Apache Lucene & HBase — Lucene has used skip lists within postings to skip over document IDs during query evaluation; HBase's in-memory store (MemStore) is skip-list backed (Java's ConcurrentSkipListMap), mirroring the LSM memtable pattern.
Common misconceptions & gotchas#
Why use randomness instead of a balanced tree?
Same asymptotics, far less code. A skip list reaches O(log n) without rotations, recoloring, or the case analysis a red-black/AVL tree needs — and because updates are local pointer splices, it is much easier to make concurrent (even lock-free). The price is a probabilistic rather than deterministic bound.
Is the O(log n) guaranteed?
No — it is expected and holds with high probability, not as a hard worst case. With catastrophically unlucky coin flips the list degenerates toward O(n), but the probability of degrading by any constant factor falls off exponentially in n, so a large list effectively never misbehaves. The randomness is internal, so there is no adversarial key ordering that forces the worst case.
Why p = 1/2?
It makes the height a single coin flip and balances search speed against memory: ~2 pointers per node and ~2·log2(n) search steps. The theoretical optimum for search time is around p = 1/e ≈ 0.37, and memory-sensitive systems like Redis use p = 1/4 to cut pointers to ~1.33 per node at the cost of slightly longer walks per level.
How do concurrent skip lists work?
Updates are local, so they map onto compare-and-swap. Lock-free designs (Pugh's, Java's ConcurrentSkipListMap) delete in two phases: a single CAS logically marks the node absent, then physical unlinking from each level happens lazily, with passing threads helping finish it. Readers always see a consistent, linearizable structure without taking locks.
QuizRedis chose p = 1/4 (rather than 1/2) for its sorted-set skip list. What is it primarily optimizing, and what does it give up?
- Optimizing search speed; giving up worst-case guarantees
- Optimizing memory (fewer pointers per node); giving up some search speed
- Optimizing concurrency; giving up ordered iteration
- Optimizing insert speed; giving up delete speed
Show answer
Optimizing memory (fewer pointers per node); giving up some search speed — Average pointers per node is 1/(1−p): 2.0 at p=1/2 but only ~1.33 at p=1/4. Lowering p makes towers taller-but-rarer, cutting per-node pointer memory — important when a server holds millions of sorted-set members. The cost is more horizontal steps per level, i.e. slightly slower searches. Concurrency and ordered iteration are unaffected by p.
In an interview#
Lead with the pitch: a skip list gives balanced-tree performance (expected O(log n) search/insert/delete) with much simpler, rotation-free code that is easy to make concurrent — at the cost of expected rather than worst-case bounds. Name the uses: Redis sorted sets (ZSET), LevelDB/RocksDB memtables, and Java's ConcurrentSkipListMap.
Be ready to explain the coin-flip promotion (geometric level distribution, ~2 pointers/node and ~log2(n) levels at p=1/2), the skim-high/drop-low search, and why there is no rebalancing. Expect the follow-up about the worst case — with terrible luck every node lands at level 0 and search degrades to O(n), which is vanishingly unlikely, which is why the bound is probabilistic. A strong candidate also mentions span counts for O(log n) rank queries and the two-phase logical-delete trick that makes lock-free versions safe.
Then open the simulator: INSERT a spread of keys and watch towers of different heights appear, SEARCH a key and follow the path skipping across the express lanes, then DELETE a key and see it unlinked from every level of its tower.
References & further reading#
- William Pugh — Skip Lists: A Probabilistic Alternative to Balanced Trees (CACM, 1990) — the original paper: the structure, the search, and the expected-O(log n) analysis
- Java — ConcurrentSkipListMap (JDK API docs) — the standard-library lock-free concurrent sorted map
- Redis — Sorted Sets (data type docs) — ZSET semantics; skip-list-backed for large sets
- Redis source — t_zset.c (skip list implementation) — real ZSKIPLIST with spans and back pointers; ZSKIPLIST_P = 0.25
- LevelDB source — skiplist.h (memtable skip list) — the in-memory write buffer that fronts the LSM tree
- Aref et al. — The Ubiquitous Skiplist: A Survey (2024) — modern survey: variants, concurrency, and systems usage
Ready to try it?
The simulator is a real, deterministic implementation — pick a scenario and step through it, scrubbing the timeline forward and backward through every change.