System Internals
Open the simulator β†’
Storage engine

LSM Tree

How write-heavy databases turn slow random writes into fast sequential ones.

You're taking 10,000 writes a second, each landing at a random spot in a huge file on disk. Random writes are the slowest thing a disk does, so this crawls. How do you keep up without hunting for the right spot every time? Work through that and you'll have rebuilt the LSM tree β€” the engine inside RocksDB, Cassandra, and LevelDB β€” one decision at a time.

The problem: writes land at random spots on disk#

TL;DRthe 30-second version
  • Don't update data where it lives on disk β€” that's a slow random write. Buffer writes in a sorted table in memory (the memtable), then write the whole batch to disk in one sequential sweep.
  • Log each write first (the WAL) so a crash can't lose it. When the memtable fills, write it out as an immutable sorted file (an SSTable).
  • A read checks newest to oldest, and a Bloom filter lets it skip files that can't hold the key. A background merge (compaction) keeps the file count and stale data in check.
  • The bargain: very fast writes, paid for by rewriting data as it merges downward. This is the engine inside RocksDB, Cassandra, LevelDB, and ScyllaDB.

Picture a librarian shelving 10,000 returned books an hour, in strict sorted order. Each book means walking to its exact shelf and sliding the others over to make room. The librarian spends the whole hour walking to random shelves β€” and it only gets worse as the pile grows.

A normal database index works the same way: to write a record it finds the exact disk page that record belongs on and updates it in place. A stream of writes to different keys becomes a stream of random disk jumps, and random writes are the slowest thing a disk does β€” a typical SSD sustains maybe 50–100 MB/s of small random writes, but 500+ MB/s writing sequentially. A 5–10Γ— gap.

So here's the whole idea: stop hunting for the right spot. Never update data where it lives. Buffer writes in memory, and once you've gathered a batch, write them all to disk in one sequential sweep. That turns a storm of random writes into a few big sequential ones β€” the biggest speedup storage has to offer. Now let's build it up, one piece at a time.

PUT user:1=alicea write comes in
append firstso a crash can't lose it
WALappended to a log on disk
then insertthe write is now readable
memtablekept sorted, in memory
Every write lands in memory β€” safely

Step 1 β€” the memtable: a sorted buffer in memory#

Every write first lands in the memtable β€” an in-memory sorted data structure (typically a red-black tree or skip list). Think of it as a scratchpad on your desk. Writes arrive at memory speed, and the memtable keeps all entries sorted by key so that when it eventually gets written to disk, the output is already in order with no extra sorting work needed.

Reads check the memtable first, because it holds the most recent version of every key that has been written recently. A memtable lookup is O(log n) with no disk I/O β€” very fast.

Why sorted?Keeping the memtable sorted means the flush to disk produces a single sequential sorted file rather than a chaotic jumble. It also makes the memtable itself cheap to search. When the memtable reaches its size limit β€” typically 64 to 256 MB β€” it is frozen, a new empty memtable takes its place, and the frozen one is written to disk.

Step 2 β€” the WAL: crash safety before anything else#

There is a problem with the memtable: it lives in RAM. If the server crashes β€” power cut, OS panic, hardware failure β€” everything in RAM vanishes instantly. All recent writes that had not yet been flushed to disk are gone.

The Write-Ahead Log (WAL) is the safety net. Before any write enters the memtable, it is first appended to the WAL β€” a simple sequential file on disk. Appending to a sequential file is fast. The WAL's only job is to record what happened so that a crash can be recovered.

  1. Write arrives: PUT 'user:42' = 'Alice'
  2. Append the write to the WAL on disk β€” fast, sequential, takes a few microseconds
  3. Insert the write into the in-memory memtable β€” even faster, no disk I/O
  4. Acknowledge success to the caller
  5. On crash recovery: replay the WAL from the last checkpoint to rebuild whatever was in the memtable
Why 'write-ahead'?The log is written before the memtable is updated β€” the intention is recorded first. If the server dies between logging and the memtable update, the WAL replay on restart will redo the write. Nothing is lost. Once the memtable is eventually flushed to disk as an SSTable, the corresponding WAL segment is deleted β€” the SSTable is now the durable copy.

Step 3 β€” SSTables: immutable sorted files on disk#

When the memtable fills up, it is written to disk as an SSTable β€” Sorted String Table. The name describes the format exactly: the key-value pairs are sorted by key, packed into a file. Once written, an SSTable is never modified. Immutability is what allows the whole system to avoid random writes.

An SSTable is more than a flat list of records. It has internal structure to make lookups fast:

  • Data blocks β€” key-value pairs grouped into fixed-size blocks (typically 4–64 KB each). Keys are sorted within and across blocks.
  • Block index β€” a small table at the end of the file listing the first key of each data block and its byte offset. A lookup binary-searches the index to jump directly to the right block, skipping all others.
  • Bloom filter β€” a compact bit array (a few kilobytes) stored alongside each SSTable. It answers 'is this key definitely NOT in this file?' with no false negatives. If it says no, the entire file is skipped without a single byte of disk I/O.
  • Footer β€” a few bytes at the very end pointing to where the index and bloom filter start in the file, so they can be loaded quickly on open.

The read path: newest to oldest, bloom-filtered#

To look up a key, the engine searches from newest to oldest and returns the first match:

Memtablein-memory Β· newest writes
not herea bloom filter saying 'definitely not here' skips the whole file
L0 SSTablesnewest-first Β· ranges overlap Β· bloom-checked
not here
L1non-overlapping Β· ≀ 1 file Β· bloom + block index
not here
L2 … deepersame, until found or all levels exhausted
GET key: probe newest β†’ oldest, stop at the first match
  1. Check the active memtable β€” most recent writes live here. O(log n), no disk I/O.
  2. Check L0 SSTables newest-first β€” recently flushed files. L0 files can have overlapping key ranges, so all must be checked. Use the bloom filter on each to skip files that definitely do not contain the key.
  3. Check L1 files β€” files at this level have non-overlapping key ranges, so at most one file per lookup. Use the bloom filter, then the block index to read one block.
  4. Repeat at L2, L3, and deeper levels until found or all levels exhausted.

The first match wins β€” the newest version of the key is returned and the search stops. Most lookups are resolved in the memtable or L0; only keys that haven't been written recently require probing deeper levels.

PredictYou GET a key that was never written. How many SSTable data blocks does a well-tuned LSM actually read from disk?

Hint: What is the Bloom filter for?

Ideally zero. The Bloom filter on each SSTable answers 'definitely not here' for a key that was never added, so the engine skips every file without reading a data block. The cost is just the in-memory Bloom checks (plus rare false positives). A missing-key read is the case Bloom filters were invented to make cheap.

Read amplificationA single read might touch the memtable plus several SSTable files across multiple levels. Each file touched is extra work beyond the minimum. This is read amplification. Bloom filters eliminate most of it β€” a bloom filter saying 'definitely not here' avoids even opening the file. Compaction (covered next) reduces the number of files over time.

Deletes are writes: tombstones#

SSTables are immutable β€” you cannot go back and erase a key from an existing file. So how does deletion work?

A DELETE writes a tombstone: a special record with the key and a 'deleted' marker. This tombstone flows through the system exactly like any other write β€” into the WAL, into the memtable, eventually flushed into an SSTable. During reads, a tombstone shadows any older value for the same key, making the key appear deleted.

The old data is still physically present in an older SSTable. It gets removed during compaction, when the tombstone and the older copy it shadows are processed together and the tombstone wins β€” both are dropped from the output.

Tombstone accumulationUntil compaction clears them, tombstones occupy space and must be checked on reads. A delete-heavy workload can accumulate many tombstones, slowing reads and bloating disk usage. This is one reason compaction must run regularly β€” Cassandra users sometimes hit 'tombstone storms' when deletes outpace compaction.

Compaction: merging the layers back together#

Without intervention, the number of SSTable files keeps growing as the memtable flushes repeatedly. More files means more files to check on every read, and more wasted space from stale versions and uncleared tombstones. Compaction is the background process that reverses this.

Compaction picks a set of SSTable files, reads them all, merges their entries keeping only the newest value per key, drops tombstoned entries that have no older copies to shadow, and writes a fresh smaller set of output files. The input files are then deleted.

  • Leveled compaction β€” each level L has a maximum total size, typically 10x the level above. When L0 overflows, its files are compacted into L1. L1+ files have non-overlapping key ranges per level, so a lookup at each level needs at most one file read. This gives low read and space amplification at the cost of more write amplification β€” data is rewritten many times as it moves down levels. Used by RocksDB's default mode, LevelDB, and Cassandra's LCS (its size-tiered STCS and time-windowed TWCS are separate, non-leveled strategies).
  • Tiered (size-tiered) compaction β€” files of similar size are grouped into tiers. When a tier has enough files (typically 4), they are merged into one larger file in the next tier. Data is rewritten fewer times β€” lower write amplification β€” but a level can hold many overlapping files, so reads and space usage are worse. Better for write-heavy workloads like time-series ingestion.

Newest at the top Β· each level about 10Γ— larger

L0sstsstsstnewest β€” key ranges can overlap
L1sorted sstsorted sstnon-overlapping, sorted
L2sstsstsstsstabout 10Γ— larger again
The level structure compaction builds on disk
Go deeperWhy leveled compaction costs more writes than tiered

With leveled compaction and a level multiplier T (typically 10), each level is ~TΓ— the size of the one above. To push a level's data down, the engine must rewrite it against the overlapping range of the next level β€” on average ~T bytes rewritten per byte promoted. Summed over the ~log_T(N) levels, total write amplification is roughly T Γ— log_T(N): often 10–30Γ— in practice.

Size-tiered compaction instead merges a whole tier of similar-sized files at once, so each byte is rewritten roughly once per tier β€” about log_T(N) times total, with no per-level T factor. That's far less write work, which is why ingest-heavy stores (Cassandra time-series, logs) prefer it. The cost shows up as space and read amplification: several same-level files can hold the same key, so reads probe more files and a major compaction can transiently need ~2Γ— the dataset's disk space.

RocksDB also offers a hybrid 'leveled with L0 tiered' and a 'universal' (tiered) mode; Cassandra exposes STCS, LCS, and TWCS (time-window, for TTL'd time-series). The knob is always the same trade: rewrite data more now (leveled) for cheaper reads later, or rewrite less now (tiered) and pay on reads/space.

The three amplifications: write, read, space#

Every storage engine trades off three costs. You can reduce one, but only by accepting more of the others. LSM trees make a clear choice: minimize write amplification (extremely fast writes) at the cost of some read and space amplification, both of which compaction keeps bounded.

PredictWith leveled compaction, each level is about 10Γ— the one above and there are ~7 levels. Roughly how many times does one byte get rewritten by the time it settles at the bottom?

Hint: It's rewritten about once per level as it merges downward.

Very roughly 10–30Γ— in practice (about the level multiplier Γ— the number of levels). So one byte you write can become 10–30 bytes actually written to disk. That's write amplification β€” the price an LSM pays to turn random writes into sequential ones and keep reads cheap.

  • Write amplification β€” the ratio of bytes written to disk versus bytes logically written. Compaction rewrites data multiple times as it descends levels. A single byte ingested might be written 10–30x before reaching the bottom. More aggressive compaction lowers read and space amplification but raises write amplification.
  • Read amplification β€” the number of disk reads a single lookup requires. Without bloom filters you would read every SSTable. With them and with leveled compaction, most lookups touch just one or two files.
  • Space amplification β€” how much extra disk space the database uses beyond the minimum needed for the data. Pending tombstones, stale overwritten versions waiting for compaction, and overlapping SSTable files all contribute.
The tuning knobsMemtable size: larger means fewer, bigger flushes (less write amplification) but more RAM needed and longer crash recovery. Bloom bits per key: more bits mean fewer false positives and less read amplification, at a small memory cost. Compaction strategy: leveled for read-heavy workloads, tiered for write-heavy. Block size: larger blocks improve compression and sequential scans; smaller blocks make point reads cheaper.
When to reach for an LSM (and when not to)

An LSM tree is the right call when writes dominate, are append-heavy, or arrive in bursts you can't afford to block on β€” and when you can tolerate slightly slower, less predictable reads in exchange. It's a poor fit when your workload is read-mostly with strict point-read latency targets, or when you lean heavily on large range scans and in-place updates.

  • Good fit: write-heavy ingestion (metrics, logs, events, IoT), key-value and wide-column stores, time-series, and anything where ingest throughput is the bottleneck.
  • Weaker fit: read-mostly OLTP needing single-digit-millisecond point reads with tight tail latency, or heavy range/scan-and-update patterns β€” a B-tree's in-place, single-path layout wins there.
  • Watch out: delete- or update-heavy workloads create tombstones and stale versions that compaction must keep up with, or read latency and disk usage creep up.
The tuning knobsMemtable size: larger means fewer, bigger flushes (less write amplification) but more RAM and longer crash recovery. Bloom bits per key: more bits mean fewer false positives and less read amplification, at a small memory cost (~10 bits/key β‰ˆ 1% false positives). Compaction strategy: leveled for read-heavy, tiered for write-heavy. Block size: larger blocks compress better and speed scans; smaller blocks make point reads cheaper.
LSM tree vs B-tree
LSM treeB-tree
Write performanceExcellent β€” all writes are sequential appendsGood for light load; degrades under high write volume
Read performanceGood β€” bloom-filtered, but may probe multiple filesExcellent β€” predictable single-path lookup
Range scansReasonable β€” must merge across sorted SSTable filesExcellent β€” leaf linked list enables O(k) scan
Space usageTemporary bloat until compaction catches upCompact β€” in-place updates, no extra copies
Best forWrite-heavy: event logs, time-series, Cassandra, RocksDBRead-heavy OLTP, range queries: PostgreSQL, MySQL, SQLite
In the wild, and why it fits

The LSM design is everywhere storage has to absorb heavy writes. The differences between systems are mostly choices about compaction, the WAL, and what sits on top.

  • LevelDB (Google) β€” the compact reference implementation; introduced the leveled-compaction design that most others build on. Embedded, single-process.
  • RocksDB (Meta) β€” a heavily extended fork of LevelDB: multi-threaded compaction, column families, leveled/universal/FIFO compaction, tunable Bloom filters. The storage engine inside MySQL (MyRocks), CockroachDB, TiKV, Kafka Streams, and more.
  • Apache Cassandra / ScyllaDB β€” wide-column stores; SSTables per table, pluggable compaction (STCS, LCS, TWCS). ScyllaDB reimplements Cassandra in C++ on a shard-per-core architecture for higher throughput.
  • HBase / Bigtable β€” the original Bigtable paper's design: memtable + WAL + SSTables on a distributed filesystem (HDFS/Colossus).
It's not just databasesThe same shape shows up in embedded stores (BadgerDB, Pebble β€” CockroachDB's Go LSM), message systems, and search indexes (Lucene segments are essentially immutable, merge-compacted SSTables for an inverted index). Once you recognize 'buffer in memory, flush immutable sorted files, merge in the background,' you see it constantly.
Common misconceptions & gotchas
Does a delete immediately free disk space?

No. A delete writes a tombstone β€” an extra record β€” so disk usage briefly goes UP. The space is only reclaimed later, during the compaction that merges the tombstone with the older value it shadows. Delete-heavy workloads can pile up tombstones faster than compaction clears them.

Are LSM reads always slower than a B-tree's?

Not necessarily. Hot keys live in the memtable or L0 and are very fast, and Bloom filters make missing-key reads nearly free. The weakness is the worst case β€” a key in a deep level, or a range scan that must merge across many SSTables β€” and less predictable tail latency, not average reads.

Is the WAL redundant once data is in the memtable?

No β€” the memtable is in RAM and vanishes on crash. The WAL is the only durable copy of recent writes until the memtable is flushed to an SSTable, at which point that WAL segment can be discarded.

Does more compaction always make things better?

It lowers read and space amplification but raises write amplification (more rewrites, more disk wear, more CPU). Over-aggressive compaction can starve foreground writes. It's a balance, not a 'more is better' knob.

QuizA workload does millions of overwrites to a small set of hot keys. Which amplification suffers most, and which compaction strategy helps?

  1. Read amplification; switch to size-tiered compaction
  2. Space amplification; leveled compaction keeps stale versions bounded
  3. Write amplification; disable compaction entirely
  4. None; overwrites are free in an LSM
Show answer

Space amplification; leveled compaction keeps stale versions bounded β€” Repeated overwrites leave many stale versions of the same keys scattered across files (space amplification). Leveled compaction keeps each level non-overlapping and merges frequently, so old versions are collapsed quickly and space stays bounded β€” at the cost of higher write amplification. Disabling compaction would let stale data and files grow without limit.

In an interview

Lead with the trade: LSM trees convert random writes into sequential ones by buffering in the memtable and flushing to immutable SSTables. This buys massive write throughput at the cost of read amplification (multiple files to probe) and write amplification (compaction rewrites data as it descends levels). Bloom filters and leveled compaction keep both costs bounded.

Name the components in order: WAL (durability β€” log before you act), memtable (sorted in-memory buffer), SSTable (immutable sorted file with block index and bloom filter), compaction (background merge β€” keeps files manageable). Know leveled vs tiered compaction. Know tombstones. Name the real systems: RocksDB, Cassandra, LevelDB, ScyllaDB.

Then open the simulator: watch a PUT travel from WAL to memtable to SSTable on flush, run a GET that the bloom filter rejects without reading the file, delete a key and see the tombstone, then compact and watch the tombstone disappear as the data is physically reclaimed.

References & further reading
References