The problem: writes land at random spots on disk#
TL;DRthe 30-second version
- Don't update data where it lives on disk β that's a slow random write. Buffer writes in a sorted table in memory (the memtable), then write the whole batch to disk in one sequential sweep.
- Log each write first (the WAL) so a crash can't lose it. When the memtable fills, write it out as an immutable sorted file (an SSTable).
- A read checks newest to oldest, and a Bloom filter lets it skip files that can't hold the key. A background merge (compaction) keeps the file count and stale data in check.
- The bargain: very fast writes, paid for by rewriting data as it merges downward. This is the engine inside RocksDB, Cassandra, LevelDB, and ScyllaDB.
Picture a librarian shelving 10,000 returned books an hour, in strict sorted order. Each book means walking to its exact shelf and sliding the others over to make room. The librarian spends the whole hour walking to random shelves β and it only gets worse as the pile grows.
A normal database index works the same way: to write a record it finds the exact disk page that record belongs on and updates it in place. A stream of writes to different keys becomes a stream of random disk jumps, and random writes are the slowest thing a disk does β a typical SSD sustains maybe 50β100 MB/s of small random writes, but 500+ MB/s writing sequentially. A 5β10Γ gap.
So here's the whole idea: stop hunting for the right spot. Never update data where it lives. Buffer writes in memory, and once you've gathered a batch, write them all to disk in one sequential sweep. That turns a storm of random writes into a few big sequential ones β the biggest speedup storage has to offer. Now let's build it up, one piece at a time.
Step 1 β the memtable: a sorted buffer in memory#
Every write first lands in the memtable β an in-memory sorted data structure (typically a red-black tree or skip list). Think of it as a scratchpad on your desk. Writes arrive at memory speed, and the memtable keeps all entries sorted by key so that when it eventually gets written to disk, the output is already in order with no extra sorting work needed.
Reads check the memtable first, because it holds the most recent version of every key that has been written recently. A memtable lookup is O(log n) with no disk I/O β very fast.
Step 2 β the WAL: crash safety before anything else#
There is a problem with the memtable: it lives in RAM. If the server crashes β power cut, OS panic, hardware failure β everything in RAM vanishes instantly. All recent writes that had not yet been flushed to disk are gone.
The Write-Ahead Log (WAL) is the safety net. Before any write enters the memtable, it is first appended to the WAL β a simple sequential file on disk. Appending to a sequential file is fast. The WAL's only job is to record what happened so that a crash can be recovered.
- Write arrives: PUT 'user:42' = 'Alice'
- Append the write to the WAL on disk β fast, sequential, takes a few microseconds
- Insert the write into the in-memory memtable β even faster, no disk I/O
- Acknowledge success to the caller
- On crash recovery: replay the WAL from the last checkpoint to rebuild whatever was in the memtable
Step 3 β SSTables: immutable sorted files on disk#
When the memtable fills up, it is written to disk as an SSTable β Sorted String Table. The name describes the format exactly: the key-value pairs are sorted by key, packed into a file. Once written, an SSTable is never modified. Immutability is what allows the whole system to avoid random writes.
An SSTable is more than a flat list of records. It has internal structure to make lookups fast:
- Data blocks β key-value pairs grouped into fixed-size blocks (typically 4β64 KB each). Keys are sorted within and across blocks.
- Block index β a small table at the end of the file listing the first key of each data block and its byte offset. A lookup binary-searches the index to jump directly to the right block, skipping all others.
- Bloom filter β a compact bit array (a few kilobytes) stored alongside each SSTable. It answers 'is this key definitely NOT in this file?' with no false negatives. If it says no, the entire file is skipped without a single byte of disk I/O.
- Footer β a few bytes at the very end pointing to where the index and bloom filter start in the file, so they can be loaded quickly on open.
The read path: newest to oldest, bloom-filtered#
To look up a key, the engine searches from newest to oldest and returns the first match:
- Check the active memtable β most recent writes live here. O(log n), no disk I/O.
- Check L0 SSTables newest-first β recently flushed files. L0 files can have overlapping key ranges, so all must be checked. Use the bloom filter on each to skip files that definitely do not contain the key.
- Check L1 files β files at this level have non-overlapping key ranges, so at most one file per lookup. Use the bloom filter, then the block index to read one block.
- Repeat at L2, L3, and deeper levels until found or all levels exhausted.
The first match wins β the newest version of the key is returned and the search stops. Most lookups are resolved in the memtable or L0; only keys that haven't been written recently require probing deeper levels.
PredictYou GET a key that was never written. How many SSTable data blocks does a well-tuned LSM actually read from disk?
Hint: What is the Bloom filter for?
Ideally zero. The Bloom filter on each SSTable answers 'definitely not here' for a key that was never added, so the engine skips every file without reading a data block. The cost is just the in-memory Bloom checks (plus rare false positives). A missing-key read is the case Bloom filters were invented to make cheap.
Deletes are writes: tombstones#
SSTables are immutable β you cannot go back and erase a key from an existing file. So how does deletion work?
A DELETE writes a tombstone: a special record with the key and a 'deleted' marker. This tombstone flows through the system exactly like any other write β into the WAL, into the memtable, eventually flushed into an SSTable. During reads, a tombstone shadows any older value for the same key, making the key appear deleted.
The old data is still physically present in an older SSTable. It gets removed during compaction, when the tombstone and the older copy it shadows are processed together and the tombstone wins β both are dropped from the output.
Compaction: merging the layers back together#
Without intervention, the number of SSTable files keeps growing as the memtable flushes repeatedly. More files means more files to check on every read, and more wasted space from stale versions and uncleared tombstones. Compaction is the background process that reverses this.
Compaction picks a set of SSTable files, reads them all, merges their entries keeping only the newest value per key, drops tombstoned entries that have no older copies to shadow, and writes a fresh smaller set of output files. The input files are then deleted.
- Leveled compaction β each level L has a maximum total size, typically 10x the level above. When L0 overflows, its files are compacted into L1. L1+ files have non-overlapping key ranges per level, so a lookup at each level needs at most one file read. This gives low read and space amplification at the cost of more write amplification β data is rewritten many times as it moves down levels. Used by RocksDB's default mode, LevelDB, and Cassandra's LCS (its size-tiered STCS and time-windowed TWCS are separate, non-leveled strategies).
- Tiered (size-tiered) compaction β files of similar size are grouped into tiers. When a tier has enough files (typically 4), they are merged into one larger file in the next tier. Data is rewritten fewer times β lower write amplification β but a level can hold many overlapping files, so reads and space usage are worse. Better for write-heavy workloads like time-series ingestion.
Newest at the top Β· each level about 10Γ larger
Go deeperWhy leveled compaction costs more writes than tiered
With leveled compaction and a level multiplier T (typically 10), each level is ~TΓ the size of the one above. To push a level's data down, the engine must rewrite it against the overlapping range of the next level β on average ~T bytes rewritten per byte promoted. Summed over the ~log_T(N) levels, total write amplification is roughly T Γ log_T(N): often 10β30Γ in practice.
Size-tiered compaction instead merges a whole tier of similar-sized files at once, so each byte is rewritten roughly once per tier β about log_T(N) times total, with no per-level T factor. That's far less write work, which is why ingest-heavy stores (Cassandra time-series, logs) prefer it. The cost shows up as space and read amplification: several same-level files can hold the same key, so reads probe more files and a major compaction can transiently need ~2Γ the dataset's disk space.
RocksDB also offers a hybrid 'leveled with L0 tiered' and a 'universal' (tiered) mode; Cassandra exposes STCS, LCS, and TWCS (time-window, for TTL'd time-series). The knob is always the same trade: rewrite data more now (leveled) for cheaper reads later, or rewrite less now (tiered) and pay on reads/space.
The three amplifications: write, read, space#
Every storage engine trades off three costs. You can reduce one, but only by accepting more of the others. LSM trees make a clear choice: minimize write amplification (extremely fast writes) at the cost of some read and space amplification, both of which compaction keeps bounded.
PredictWith leveled compaction, each level is about 10Γ the one above and there are ~7 levels. Roughly how many times does one byte get rewritten by the time it settles at the bottom?
Hint: It's rewritten about once per level as it merges downward.
Very roughly 10β30Γ in practice (about the level multiplier Γ the number of levels). So one byte you write can become 10β30 bytes actually written to disk. That's write amplification β the price an LSM pays to turn random writes into sequential ones and keep reads cheap.
- Write amplification β the ratio of bytes written to disk versus bytes logically written. Compaction rewrites data multiple times as it descends levels. A single byte ingested might be written 10β30x before reaching the bottom. More aggressive compaction lowers read and space amplification but raises write amplification.
- Read amplification β the number of disk reads a single lookup requires. Without bloom filters you would read every SSTable. With them and with leveled compaction, most lookups touch just one or two files.
- Space amplification β how much extra disk space the database uses beyond the minimum needed for the data. Pending tombstones, stale overwritten versions waiting for compaction, and overlapping SSTable files all contribute.
When to reach for an LSM (and when not to)
An LSM tree is the right call when writes dominate, are append-heavy, or arrive in bursts you can't afford to block on β and when you can tolerate slightly slower, less predictable reads in exchange. It's a poor fit when your workload is read-mostly with strict point-read latency targets, or when you lean heavily on large range scans and in-place updates.
- Good fit: write-heavy ingestion (metrics, logs, events, IoT), key-value and wide-column stores, time-series, and anything where ingest throughput is the bottleneck.
- Weaker fit: read-mostly OLTP needing single-digit-millisecond point reads with tight tail latency, or heavy range/scan-and-update patterns β a B-tree's in-place, single-path layout wins there.
- Watch out: delete- or update-heavy workloads create tombstones and stale versions that compaction must keep up with, or read latency and disk usage creep up.
LSM tree vs B-tree
| LSM tree | B-tree | |
|---|---|---|
| Write performance | Excellent β all writes are sequential appends | Good for light load; degrades under high write volume |
| Read performance | Good β bloom-filtered, but may probe multiple files | Excellent β predictable single-path lookup |
| Range scans | Reasonable β must merge across sorted SSTable files | Excellent β leaf linked list enables O(k) scan |
| Space usage | Temporary bloat until compaction catches up | Compact β in-place updates, no extra copies |
| Best for | Write-heavy: event logs, time-series, Cassandra, RocksDB | Read-heavy OLTP, range queries: PostgreSQL, MySQL, SQLite |
In the wild, and why it fits
The LSM design is everywhere storage has to absorb heavy writes. The differences between systems are mostly choices about compaction, the WAL, and what sits on top.
- LevelDB (Google) β the compact reference implementation; introduced the leveled-compaction design that most others build on. Embedded, single-process.
- RocksDB (Meta) β a heavily extended fork of LevelDB: multi-threaded compaction, column families, leveled/universal/FIFO compaction, tunable Bloom filters. The storage engine inside MySQL (MyRocks), CockroachDB, TiKV, Kafka Streams, and more.
- Apache Cassandra / ScyllaDB β wide-column stores; SSTables per table, pluggable compaction (STCS, LCS, TWCS). ScyllaDB reimplements Cassandra in C++ on a shard-per-core architecture for higher throughput.
- HBase / Bigtable β the original Bigtable paper's design: memtable + WAL + SSTables on a distributed filesystem (HDFS/Colossus).
Common misconceptions & gotchas
Does a delete immediately free disk space?
No. A delete writes a tombstone β an extra record β so disk usage briefly goes UP. The space is only reclaimed later, during the compaction that merges the tombstone with the older value it shadows. Delete-heavy workloads can pile up tombstones faster than compaction clears them.
Are LSM reads always slower than a B-tree's?
Not necessarily. Hot keys live in the memtable or L0 and are very fast, and Bloom filters make missing-key reads nearly free. The weakness is the worst case β a key in a deep level, or a range scan that must merge across many SSTables β and less predictable tail latency, not average reads.
Is the WAL redundant once data is in the memtable?
No β the memtable is in RAM and vanishes on crash. The WAL is the only durable copy of recent writes until the memtable is flushed to an SSTable, at which point that WAL segment can be discarded.
Does more compaction always make things better?
It lowers read and space amplification but raises write amplification (more rewrites, more disk wear, more CPU). Over-aggressive compaction can starve foreground writes. It's a balance, not a 'more is better' knob.
QuizA workload does millions of overwrites to a small set of hot keys. Which amplification suffers most, and which compaction strategy helps?
- Read amplification; switch to size-tiered compaction
- Space amplification; leveled compaction keeps stale versions bounded
- Write amplification; disable compaction entirely
- None; overwrites are free in an LSM
Show answer
Space amplification; leveled compaction keeps stale versions bounded β Repeated overwrites leave many stale versions of the same keys scattered across files (space amplification). Leveled compaction keeps each level non-overlapping and merges frequently, so old versions are collapsed quickly and space stays bounded β at the cost of higher write amplification. Disabling compaction would let stale data and files grow without limit.
In an interview
Lead with the trade: LSM trees convert random writes into sequential ones by buffering in the memtable and flushing to immutable SSTables. This buys massive write throughput at the cost of read amplification (multiple files to probe) and write amplification (compaction rewrites data as it descends levels). Bloom filters and leveled compaction keep both costs bounded.
Name the components in order: WAL (durability β log before you act), memtable (sorted in-memory buffer), SSTable (immutable sorted file with block index and bloom filter), compaction (background merge β keeps files manageable). Know leveled vs tiered compaction. Know tombstones. Name the real systems: RocksDB, Cassandra, LevelDB, ScyllaDB.
Then open the simulator: watch a PUT travel from WAL to memtable to SSTable on flush, run a GET that the bloom filter rejects without reading the file, delete a key and see the tombstone, then compact and watch the tombstone disappear as the data is physically reclaimed.
References & further reading
- O'Neil, Cheng, Gawlick & O'Neil β The Log-Structured Merge-Tree (1996) β the original paper that named the structure
- RocksDB Wiki β Compaction β leveled, universal, and FIFO strategies in production
- LevelDB β implementation notes β the compact reference design
- Apache Cassandra β how data is written & compaction strategies β STCS / LCS / TWCS in a wide-column store
- Martin Kleppmann β Designing Data-Intensive Applications, Ch. 3 β the clearest book-length treatment of LSM vs B-tree
Ready to try it?
The simulator is a real, deterministic implementation β pick a scenario and step through it, scrubbing the timeline forward and backward through every change.