B+ Tree · System Internals

The problem: find one row among billions, fast#

TL;DRthe 30-second version

Every node is one disk page. Internal nodes hold only keys + child pointers (a routing map); leaf nodes hold the real key→value data, sorted.
Pack hundreds of keys into each page, so the tree is wide and shallow: a fanout of ~500 means a 4-level tree indexes ~62 billion entries — any lookup is 3–4 page reads, and the top levels stay in RAM.
All values live in the leaves, chained in a linked list — so a range scan finds the first leaf, then just walks the chain.
An insert goes to a leaf; a full leaf splits and pushes a separator key up, keeping every leaf at the same depth. A delete can borrow or merge to keep nodes at least half full.
Read-optimized and updated in place — great for reads and ranges, but random inserts cause page splits and random writes (the weakness LSM trees attack). It powers PostgreSQL, MySQL, SQLite, and most file systems.

Imagine a library with one million books, shelved randomly. Finding a specific book means checking every shelf — that's an O(n) search, hopeless at scale. Now imagine the books are sorted alphabetically on shelves, and there's an index card catalog at the front that says which shelf range each letter lives on. You check the catalog (a few comparisons), walk straight to the right shelf, and scan just a few books. That catalog is a B+ tree index.

Databases face the same problem. A table can have billions of rows stored across thousands of disk pages. Reading every page just to find one row is too slow. A B+ tree index stores the key values in sorted order across a small number of pages, so any lookup needs to read at most O(log n) pages — typically 3 or 4 even for tables with millions of rows.

What is a page?A page (also called a block) is the unit the database reads from and writes to disk — typically 4 KB to 16 KB. Reading a single byte from disk costs almost the same as reading a full page, so databases think in pages, not bytes. A B+ tree is designed to minimize the number of pages you have to read.

So why not a binary search tree? A balanced BST also gives O(log n) lookups — but log base 2. With a billion keys, log2(10^9) is about 30, so a lookup touches ~30 nodes. If each node is a separate disk page, that's ~30 random disk reads per lookup. Disk seeks, not comparisons, are the bottleneck: the cost of an on-disk index is dominated by how many pages it must fetch, and a binary tree fetches far too many.

The key idea: trade comparisons for page readsA B+ tree packs hundreds of keys into each page instead of one. That raises the fanout (branching factor) from 2 to several hundred, which drops the base of the logarithm from 2 to ~500. log2 of a billion is ~30; log500 of a billion is ~3.3. The tree goes from 30 levels deep to 4 — turning ~30 disk reads into ~4. Wide-and-shallow beats tall-and-thin whenever a node is a page.

The tree structure: two kinds of nodes#

A B+ tree has two types of nodes, each stored as one disk page.

Internal nodes (index pages) — hold only keys, no values. They act as a roadmap: each key k is a separator, and the pointers to the left and right tell you 'keys less than k go left, keys ≥ k go right'. You never need to read an internal node's values — they exist only to route you to the right leaf.
Leaf nodes (data pages) — hold actual key-value pairs, sorted by key. They also have a next-page pointer, linking all leaves in sorted order like a doubly linked list. This linked list is what makes range scans fast: once you find the starting leaf, you just follow the chain.

Why separate index and data pages?Internal nodes only store keys (not values), so they're small and more of them fit in the cache. In a typical database, the top 2–3 levels of the tree fit permanently in memory, meaning almost every lookup only reads one or two pages from disk — the leaf pages.

Internal nodes hold separators only · leaves hold the real key→value data

[30]separator
- [10 · 20]< 30
  - [5 8]
  - [10 15]
  - [20 25]
- [40 · 50]≥ 30
  - [30 35]
  - [40 45]
  - [50 60]

A small B+ tree: keys route at the top, data + a linked list at the leaves

Finding a key: O(log n) every time#

Searching for a key works like the library card catalog. Start at the root and compare your key against each separator. If your key is less than the separator, go left; otherwise go right. Repeat at every level until you reach a leaf. At the leaf, do a linear scan of the (small) sorted array of keys.

Compare search key against root's separators to choose a child pointer.
Repeat at each internal node until you reach a leaf node.
Scan the leaf's keys (just a handful) for an exact match.
If found, return the value. If not, the key does not exist — no false positives.

The height of a B+ tree with n keys and branching factor b is ⌈log_b(n)⌉. With b=100 (a realistic internal node size) and 100 million rows, the tree is only 4 levels tall. So any search reads at most 4 pages — an enormous improvement over a full table scan.

PredictA B+ tree index has 4 levels (root, two internal levels, leaves). A point query is one of the most common operations in the database. How many of those 4 pages are actually read from disk on a typical lookup?

Hint: Which pages get touched on every single query, regardless of the key?

Usually just one — the leaf. The root is read on every query, so it's permanently in the buffer pool (cache); the one or two internal levels below it are read so often they almost always stay cached too. Only the leaf page, which holds the specific key, is likely to incur an actual disk read. This is why a shallow, high-fanout tree is so effective: the small upper levels live in RAM and absorb most of the work.

Inserting: what happens when a page is full?#

Inserting a new key always goes into a leaf. Walk down from root to the correct leaf (same path as search), then add the key in sorted order.

But pages have a fixed size — they hold at most 'order' keys. When a leaf overflows (gets one too many keys), it splits into two leaves of roughly equal size, and the first key of the right leaf is pushed up into the parent as a new separator.

Walk root → leaf, following separators just like search.
Insert the key into the leaf in sorted position.
If the leaf now has too many keys, split it: left half keeps the lower keys, right half gets the upper keys.
Push the first key of the right half up to the parent as a new separator.
If the parent also overflows, split it too. Keep propagating up.
If even the root splits, create a new root — this is the only way the tree grows taller.

The tree only grows taller from the rootSplits propagate upward, but new leaves are added at the bottom. This keeps all leaf nodes at the same depth, guaranteeing that every search path is exactly the same length — O(log n) always, never O(n) in the worst case.

Range scans: why the linked list matters#

A range query like 'give me all orders from January to March' is where B+ trees shine over B-trees. Because all data lives in leaf pages and the leaves are linked in sorted order, a range scan takes O(log n + k) time: O(log n) to find the starting leaf, then O(k) to walk the chain collecting keys.

A regular B-tree (without the leaf linked list) would need to traverse back up to the parent at every level boundary, making range scans 2–3× slower. B+ trees avoid that entirely — once you reach the first leaf, you follow a flat chain.

This is why SQL databases use B+ treesSQL queries frequently ask for sorted ranges (ORDER BY, BETWEEN, >, <). The B+ tree leaf linked list makes these operations essentially free once the starting point is found.

Deleting: underflow and the two fixes#

Deleting a key removes it from the leaf. But if the leaf then has too few keys — called underflow — the tree must rebalance to stay efficient. There are exactly two ways to fix underflow.

Borrow from a sibling: if a neighboring leaf has more than the minimum number of keys, move one key over. Update the parent separator to reflect the new boundary. This is the preferred fix — no pages are removed.
Merge with a sibling: if the sibling is at the minimum too, merge the two leaves into one, and remove the separator from the parent. The parent may now underflow too — repeat the check moving upward.

Just as a root split makes the tree taller, repeated merges can make it shorter — the tree shrinks by one level when the root ends up with just one child and gets collapsed.

In practice, many engines skip the mergeTextbook B+ trees rebalance on every delete to keep nodes ≥50% full. Real systems often don't: PostgreSQL and others leave a sparsely-filled leaf alone and only reclaim a page once it becomes completely empty, because eager merging adds write and locking overhead and the freed space is usually reused by later inserts. The cost is that a table churned by many deletes can leave the index bloated (low page fill) until it is reindexed or vacuumed.

Complexity: why a 4-level tree indexes billions#

Let f be the fanout — the number of children per internal node, bounded by how many keys fit in a page. A B+ tree of n entries has height h ≈ ⌈log_f(n)⌉. Every operation walks one root-to-leaf path, so search, insert, and delete are all O(log_f n) page accesses. A range scan that returns k results is O(log_f n + k/b): one descent to the first leaf, then a walk along the linked leaves, b entries per page.

PredictA page holds ~500 keys (fanout 500), and the tree is 4 levels deep. Roughly how many rows can it index — and how many page reads to find one?

Hint: Each level multiplies the reach by the fanout; a lookup reads one page per level.

500⁴ ≈ 62 billion rows — and a lookup reads just 4 pages, one per level (often 1–2, since the top levels stay cached in RAM). That's the whole trick: a wide, shallow tree turns billions of rows into a handful of reads.

Operation	Cost	Disk reads (typical)
Point lookup	O(log_f n)	1–2 (upper levels cached)
Insert / update	O(log_f n)	1 read + 1 write, more on a split
Delete	O(log_f n)	1 read + 1 write, more on a merge
Range scan (k rows)	O(log_f n + k/b)	1 descent + ⌈k/b⌉ leaf pages

The magic is in the base of the logarithm. Fanout is large because internal nodes store only keys and child pointers, not values — so hundreds fit in a single page. The height grows only with the logarithm of n in that large base, so the tree stays almost flat even at enormous scale.

Go deeperWorking the fanout math: how a few levels reach billions

Take a 16 KB page (MySQL InnoDB's default) and an internal entry of ~16 bytes (an 8-byte key plus a ~6-byte child pointer, with overhead). That's roughly 16384 / 16 ≈ 1000 children per node in the best case; with larger keys and partial fill, a realistic effective fanout is ~300–500. The number of leaf-level slots is f raised to the height.

Fanout f	Height 3 (f³)	Height 4 (f⁴)
100	1,000,000	100,000,000
300	27,000,000	8,100,000,000
500	125,000,000	62,500,000,000

So with a fanout of ~500, a tree just four levels deep addresses ~62 billion entries. Because the root and the next one or two levels are tiny relative to the dataset and are hit on every query, the database keeps them resident in its buffer pool. The effective disk cost of a point lookup is therefore one to two reads — essentially the leaf, and sometimes the level above it — no matter how large the table grows. This is the property that makes B+ trees the default on-disk index for fifty years running.

Variants: B-tree, B+ tree, and the copy-on-write family

"B-tree" is often used loosely to mean the whole family. The distinctions matter in an interview.

B-tree (the original, Bayer & McCreight 1972) — stores keys AND their values in every node, internal and leaf alike. A lookup can stop early at an internal node if the key is there. The downside: values in internal nodes lower the fanout (fewer keys per page) and there is no leaf linked list, so range scans must walk up and down the tree.
B+ tree — values live only in the leaves; internal nodes hold keys purely as separators, and the leaves are linked. Higher fanout (more keys per internal page) plus O(k) range scans. This is what virtually every modern database actually uses, even when the docs say "B-tree."
B*-tree — a tweak that delays splits by first redistributing keys to a sibling, splitting two full nodes into three. It keeps nodes ~⅔ full instead of ½, improving space utilization at the cost of more complex inserts.

Go deeperBulk loading, prefix compression, and write-optimized B-trees

Bulk loading: building an index by inserting keys one at a time causes endless splits and random I/O. Instead, engines sort all keys first, then pack leaves left-to-right to a target fill factor and build internal levels bottom-up. This is how CREATE INDEX and PostgreSQL's index build are fast — sequential writes, no splits, and you can leave free space per page for future inserts.

Prefix and suffix compression: in a sorted index, adjacent keys share long common prefixes (think 'user:000001', 'user:000002'). Storing each key as a delta from its neighbor, and truncating separator keys in internal nodes to the shortest string that still distinguishes the two subtrees, raises the effective fanout and shrinks the tree further. Graefe's survey catalogs dozens of such techniques.

Copy-on-write (COW) B+ trees: instead of updating a page in place, write a new copy of every page on the root-to-leaf path and atomically swap a new root pointer. LMDB is the canonical example — it keeps two B+ trees (data and freelist) in a memory-mapped file, and because old pages are never overwritten, readers see a consistent snapshot with no locks while a single writer proceeds. This is also how MVCC and crash-atomic on-disk structures are built (ZFS and Btrfs use the same COW-tree idea for filesystems).

Write-optimized B-trees (Bε-trees / fractal trees): attach a small message buffer to each internal node. Inserts and deletes are written as messages into the root's buffer and flushed down in batches only when a buffer fills, so many updates are amortized into one page write. This narrows the B-tree's write-amplification gap with LSM trees while preserving B-tree-style range scans. TokuDB / PerconaFT (Bε-tree) and the academic fractal-tree work are the notable implementations.

Strengths and limits

Predictable reads — search always takes O(log_f n) pages. No bloom filter, no level probing — just follow the separators. Latency is tight and uniform, which OLTP loves.
Range scans are free — the leaf linked list makes ORDER BY, BETWEEN, and prefix scans fast with no extra work once the start leaf is found.
In-place updates — overwriting a value modifies one leaf page. No tombstones, no background compaction, low space amplification.
Write amplification on inserts — a single insert can split a page (and cascade splits upward), and each modified page is eventually written back in full. Under heavy random-insert load this multiplies disk writes.
Poor sequential-write behavior — unlike LSM trees, B+ trees write in place. A stream of random-key inserts scatters writes across the tree, causing random I/O that is slow on spinning disks and wears SSDs through write amplification.

Go deeperConcurrency: latch coupling and the right-link trick

Many threads read and write one shared tree, so pages need short-term locks called latches. The classic technique is latch coupling ("crabbing"): descending the tree, you latch a child before releasing the parent, like a crab moving hand-over-hand. Reads take shared latches; a writer that might split takes exclusive latches and can only release ancestors once it knows the child won't split (i.e., the child has room).

The root and upper levels are a contention hot spot because every operation passes through them. The Blink-tree (Lehman & Yao, 1981) reduces this: each node keeps a right-link to its successor and a high key, so a reader that arrives during a concurrent split can simply follow the right-link to find the moved keys instead of holding latches up the whole path. PostgreSQL's nbtree is a Blink-tree, which is why it can run highly concurrent index access with minimal latching.

Failure modes: where B+ trees hurt

Random-insert write amplification — inserting random (e.g. UUID) primary keys dirties pages all over the tree and triggers frequent splits, turning each logical write into several random page writes. Using a monotonically increasing key (an auto-increment id, or UUIDv7) makes inserts append to the rightmost leaf, avoiding most splits and random I/O.
Low page fill / fragmentation — after many deletes and out-of-order inserts, leaves can sit half-empty. The index occupies far more pages than the data needs, so scans read more pages and the cache holds less useful data. The fix is REINDEX / VACUUM (Postgres) or OPTIMIZE TABLE (MySQL) to repack the leaves.
Hot-page contention — a workload that hammers one key range (e.g. all inserts at the right edge with a sequential key, or a counter row) concentrates latch contention on a few pages and their shared ancestors, capping write concurrency even though the tree as a whole is huge.
Wide rows in the leaves — in a clustered index (InnoDB), the whole row lives in the leaf. Very wide rows mean fewer rows per leaf page, a taller tree, and more I/O; large columns are pushed off-page to keep fanout up.

B+ tree vs LSM, hash index, and skip list

	B+ tree	LSM tree	Hash index	Skip list
Point read	O(log_f n), 1–2 reads	O(log n), may probe levels	O(1) average	O(log n)
Write	In-place; splits on insert	Sequential, batched — fastest	O(1), in-place	O(log n), in-memory
Range scan	Excellent — linked leaves	OK — merge across SSTables	None — unordered	Good — sorted links
Space	Compact; some page slack	Bloat until compaction	Compact + load factor	Pointer overhead
Lives on	Disk	Disk	Memory or disk	Memory (e.g. memtables)

A hash index beats a B+ tree on raw point-lookup speed but cannot do ranges or ordered scans at all — fatal for SQL's ORDER BY / BETWEEN, which is why B+ trees are the default and hash indexes are a niche add-on. A skip list is a probabilistically balanced ordered structure with the same O(log n) bounds and far simpler concurrent code, but its scattered pointers make it a poor fit for paged disk storage — so it shows up as an in-memory structure (Redis sorted sets, LSM memtables), not as the on-disk index.

Against the LSM tree, the trade is symmetric: B+ trees win reads and range scans and keep space tight; LSM trees win write throughput by turning random writes into sequential ones. Many modern systems — CockroachDB, TiKV, MyRocks — layer an LSM (RocksDB) underneath a sorted, range-scannable key-value interface, taking LSM write throughput while serving ordered reads.

Where B+ trees run in the wild

The B+ tree is the workhorse on-disk index. When a database says it has a "B-tree index," it is almost always a B+ tree.

PostgreSQL — the default index type (nbtree) is a Blink-tree B+ tree; it backs primary keys, unique constraints, and most secondary indexes. Table rows live in a separate heap, so a secondary index entry points to a heap tuple.
MySQL / InnoDB — the table itself IS a B+ tree: the clustered index stores full rows in the leaves, ordered by primary key. Secondary indexes are separate B+ trees whose leaves hold the primary key, so a secondary lookup does two B+ tree descents.
SQLite — every table and index is a B-tree; the file format is literally a set of B-tree pages. Table B-trees store rows by rowid; index B-trees store keys.
Oracle, SQL Server, Db2 — all use B+ trees as the default index, with their own twists (e.g. SQL Server's clustered vs nonclustered indexes mirror InnoDB's clustered/secondary split).
MongoDB / WiredTiger — WiredTiger's row store is a B+ tree with per-page in-memory update structures; it also supports an LSM option, but B+ tree is the default.
Filesystems — NTFS (directory & MFT indexes), HFS+ / APFS, ext4's HTree directories, XFS, and ReFS use B+ tree (or B-tree) structures to index file metadata and extents.

Common misconceptions & gotchas

B-tree vs B+ tree — what's the actual difference?

In a B-tree, every node (internal and leaf) stores keys together with their values, and there is no leaf linked list. In a B+ tree, values live only in the leaves, internal nodes hold keys purely as routing separators, and the leaves are chained together. That gives the B+ tree higher fanout (a shallower tree) and O(k) range scans. Nearly every production database labelled "B-tree" is really a B+ tree.

Why not just use a binary search tree on disk?

A BST has fanout 2, so its height is log2(n) ≈ 30 for a billion keys — and if each node is a disk page, that's ~30 random disk reads per lookup. Disk seeks dominate the cost. A B+ tree packs hundreds of keys per page, raising fanout to several hundred and dropping the height to 3–4. Fewer, fatter nodes mean far fewer page reads.

Why are random inserts slow on a B+ tree?

An in-place update has to read the target leaf, modify it, and write it back; random keys scatter those targets across the whole tree, producing random I/O. Worse, a full leaf must split — allocating a new page and updating the parent, sometimes cascading splits up the tree — so one logical insert becomes several physical page writes. Sequential keys (auto-increment, UUIDv7) sidestep this by always appending to the rightmost leaf.

Does deleting a row shrink the index file?

Usually not right away. Most engines don't merge underfull pages on every delete; they leave the slack to be reused by future inserts and only reclaim a page once it's fully empty. A delete-heavy table can leave the index bloated until you REINDEX/VACUUM (Postgres) or OPTIMIZE/rebuild (MySQL).

QuizYou're designing a high-ingest table and choose a random UUIDv4 as the primary key. Inserts get slower and the index bloats. Why, and what's the fix?

B+ trees can't store UUIDs; switch to a hash index
Random keys scatter inserts across leaves, causing splits and random I/O; use a time-ordered key like auto-increment or UUIDv7 so inserts append to the rightmost leaf
The tree got too tall; reduce the page size to make it shorter
Nothing is wrong; UUIDs are optimal for B+ trees

Show answer

Random keys scatter inserts across leaves, causing splits and random I/O; use a time-ordered key like auto-increment or UUIDv7 so inserts append to the rightmost leaf — A random UUIDv4 lands in a random leaf on every insert, so the engine dirties pages all over the tree and triggers frequent page splits — random write I/O plus write amplification. A monotonically increasing key (auto-increment id, or time-ordered UUIDv7) makes every insert hit the current rightmost leaf, so pages fill sequentially with minimal splitting. This is a classic InnoDB clustered-index gotcha.

In an interview

Lead with the purpose: a B+ tree keeps data sorted across disk pages, giving O(log_f n) point lookups and O(log_f n + k) range scans — and the guarantee holds for every query, not just the average case. The crux is fanout: nodes are pages holding hundreds of keys, so the tree is only 3–4 levels deep and the upper levels stay cached. Contrast it with the LSM tree (B+ tree wins reads and range scans; LSM wins write throughput).

Know the components: internal nodes route, leaf nodes store, the leaf linked list enables range scans. Know what splits and merges look like — a root split is the only way the tree grows taller, repeated merges can shrink it. Be ready for the B-tree vs B+ tree distinction, why a BST loses on disk, and why random-key inserts are slow. Name real systems: PostgreSQL nbtree, MySQL InnoDB's clustered index, SQLite, WiredTiger, and most filesystems.

Then try it in the simulator: insert until you see a leaf split push a separator up, run a range scan and watch it walk the leaf list, and delete until an underflow triggers a borrow or a merge.

References & further reading

References

Bayer & McCreight — Organization and Maintenance of Large Ordered Indexes (1972) — the original B-tree paper, Acta Informatica 1
Douglas Comer — The Ubiquitous B-Tree (ACM Computing Surveys, 1979) — the classic survey; defines B-tree vs B+ tree vs B*-tree
Goetz Graefe — Modern B-Tree Techniques (Foundations and Trends in Databases, 2011) — encyclopedic: splits, compression, concurrency, recovery
Lehman & Yao — Efficient Locking for Concurrent Operations on B-Trees (1981) — the Blink-tree right-link design PostgreSQL uses
PostgreSQL Documentation — B-Tree Indexes — the nbtree access method: structure, deduplication, behavior
LMDB — Lightning Memory-Mapped Database — copy-on-write B+ tree with lock-free readers

Ready to try it?

The simulator is a real, deterministic implementation — pick a scenario and step through it, scrubbing the timeline forward and backward through every change.

Try these in the simulator

Build, search, range, delete →Insert nine keys (watching leaves split and the tree grow up), then a point search, a range scan across leaves, and two deletes.Sequential inserts cascade splits →Insert keys in order and watch leaves fill, split, and push separators upward — the tree growing a level when the root itself splits.Range scan walks the leaves →B+ trees keep all values in the leaves, linked in order — so a range query finds the start, then walks leaf to leaf without revisiting the upper levels.

Open the B+ Tree simulator →

Up nextBloom Filter

← Back to the learning path