Bitcask · System Internals

The problem: find one value in a 50 GB file#

TL;DRthe 30-second version

Keep every value in a file on disk, and never change bytes you've already written — just append new ones to the end.
Keep a small map in memory — the keydir — from each key to where its value sits in the file.
A read is one map lookup plus one jump to that spot on disk. A write is one append. A delete appends a 'gone' marker called a tombstone.
The catch: the map holds every key, so all your keys must fit in RAM. The values live on disk and don't count.
That's Bitcask — the storage engine behind Riak.

Start with the simplest store that could work: one file, and to save a pair you add it to the end. Writing is trivial. Reading is the trouble — to find the value for user:1 you'd read the file from the top, checking keys, until you hit it. On a 50 GB file that's up to 50 GB read to answer one lookup. Unusable.

The obvious fix — keep the file sorted by key so you can binary-search it — quietly brings back the thing we're trying to avoid. Keeping a file sorted means shifting bytes on disk every time you insert in the middle, and that random shuffling is slow. So the real question gets sharper: how do you find where a key's value lives without reading the file, and without keeping it sorted?

Build it, one decision at a time#

Here's the move. Keep a small map in memory that answers exactly one question: for each key, where in the file is its value? So the map holds key → (which file, which byte offset, how many bytes). Now a read is two cheap steps — look the key up in the map (instant, it's in RAM), then jump straight to that offset on disk and read just those bytes. One lookup, one jump, no scanning, no matter how big the file grows. This map is the keydir.

PUT user:1=alicea write comes in

append the recordbytes already written are never touched again

data filethe record is added to the end

point the map at it →

keydirmap in RAM: user:1 → where the value is

A write appends to the file; the in-memory keydir points at where it landed

keydir key	file	offset	size
user:1	data.2	42	5
user:2	data.1	0	3
B:user:5	data.1	14	9

That fixes reads. But values change — someone overwrites user:1 with a new value. You could go back and rewrite the old bytes in place, but that's a random write into the middle of the file, and if the power cuts mid-write you're left with a value that's half old and half new: corrupt. So don't rewrite. Append the new value to the end, exactly like a fresh write, and repoint the keydir at the new spot. The old value is still sitting on disk, but nothing points to it anymore — it's dead weight. This is the append-only rule: never change bytes you've already written.

Deleting has the same snag — you can't cut bytes out of the middle of an append-only file. So you append again: a tiny record called a tombstone that means 'this key is gone.' A read that lands on a tombstone reports not-found, and the key is dropped from the keydir.

Append-only has one obvious cost: every overwrite and every delete leaves dead bytes behind, so the file only grows. Fix it with a background job called merge (or compaction): it reads the old files, copies just the values the keydir still points to into a fresh file, and deletes the originals. Dead records vanish; live data stays. Merge is the only thing that ever frees space.

One loose end. The keydir lives only in RAM, so a crash wipes it. You could rebuild it by scanning every file top to bottom — but that reads all the values back just to relearn their offsets, which is slow. So when merge writes a compacted file, it also writes a small hint file beside it: just the keys and their offsets, no values. On restart you read the hints and rebuild the keydir in a fraction of the time.

You just built BitcaskAn append-only log of records on disk, an in-memory keydir mapping key → location, tombstones for deletes, a background merge to reclaim space, and hint files for fast restart. That's the whole engine — and it's the default storage backend Basho built for Riak.

PUT user:1=alice — append a record to the end of the file, then set keydir[user:1] → that spot. Done.
PUT user:1=alicia (overwrite) — append a new record at the end; repoint keydir[user:1] to it. The old alice record is now dead but still on disk.
GET user:1 — look up keydir[user:1], jump to that offset, read the bytes: alicia. One lookup, one jump.
DELETE user:1 — append a tombstone, then drop user:1 from the keydir. Later GETs miss and return not-found.
Restart — the keydir is gone with RAM; rebuild it from the hint files, and the store is ready.

PredictYou overwrite the same hot key one million times, then read it once. How many records for that key are on disk, and how many does the read touch?

Hint: What does the keydir point to, and what happens to all the older versions?

On disk: up to a million records — every overwrite appended a new one and left the last as dead weight (until a merge clears them). The read touches exactly one: the keydir only ever holds the newest location, so it's one lookup and one jump straight to the latest value. The stale versions cost disk space, never read time.

Go deeperGo deeper: what one record looks like on disk

recordcrc4 Btimestamp4 Bkey size2–4 Bvalue size4 Bkeythe keyvaluethe value

One record — self-contained, appended at the end of the file

The keydir doesn't store the key's bytes twice — it already hashed the key to find the entry — so each entry is only the value's location and size. That's why the keydir is far smaller than the data on disk: pointers, not payloads.

The one number that decides everything#

Everything hinges on one number. The keydir keeps an entry for every live key, and the whole thing has to fit in RAM. Notice what's not in it: the values. An entry is just the key plus a small fixed pointer — which file, the offset, the size, a timestamp. So the memory you need scales with how many keys you have, not how much data. Ten-byte keys or ten-megabyte values cost the keydir the same.

PredictYou have 10 million keys, and each keydir entry costs about 64 bytes. How much RAM does the index need — and does it fit on a 16 GB machine?

Hint: Multiply keys × bytes-per-entry, then compare to 16 GB.

10,000,000 × 64 bytes ≈ 640 MB. It fits easily. Now try 10 billion keys: ~640 GB — no ordinary machine holds that, even if the values would. The values could be 50 GB or 50 TB and it wouldn't change this number. Key count is the ceiling.

The number to rememberRAM needed ≈ live keys × a few tens of bytes. Values live on disk and don't count. Capacity planning for Bitcask is really just key-count planning.

Read: one map lookup plus at most one jump to disk — the same cost whether you hold a thousand keys or a billion records. That's why its latency is low and steady.
Write: one append to the end of a file, plus one map update. No random disk writes.
Memory: grows with the number of live keys, not the size of the data. This is the ceiling.
Range scans (give me every key from B:user:1 to B:user:9): not supported — the file isn't sorted, so there's no order to scan.

Go deeperGo deeper: why the read is one jump — and often zero

Because the keydir stores the exact offset and length, the engine reads precisely those bytes in one positioned read — it never scans a block index, checks a second file, or re-reads to pick a version. And the jump is often zero real disk I/O: a value read or written recently is still in the operating system's file cache, so the read is served from memory. The 'one jump' is the worst case, not the typical one, which is why the latency is both low and predictable.

Where it breaks

Crash mid-write: the active file may end with a torn, partial record. On the next open, the CRC check fails for that record and it is discarded; all earlier records are intact because they were never mutated. Worst case is losing the last few unsynced writes.
Keydir rebuild after crash: the in-memory index is always lost on restart and must be rebuilt — fast from hint files, slow from a full data-file scan. Startup time scales with how much of the dataset must be read, so a large store with no hint files can have a long, write-blocking recovery.
RAM exhaustion: the keydir grows with the live key count. Add too many distinct keys and the process runs out of memory — there is no graceful spill-to-disk. This is the failure that most often forces teams off Bitcask.
Merge competing with foreground traffic: merge reads and rewrites large amounts of data, consuming disk bandwidth and I/O the live workload also wants. Run it during peak load and read/write latency spikes; defer it too long and dead records bloat disk and lengthen the next recovery.

Recovery is a scan, not a replayThere is no separate write-ahead log to replay — the data files are the log. Recovery simply rebuilds the keydir from hint files (or data files). That is the whole upside of "the log is the database": fewer moving parts, one source of truth, and a recovery path you can reason about in a sentence.

When to reach for Bitcask (and when not to)

Bitcask is the right call when you want predictable low-latency point reads and high write throughput with the simplest possible operational story, and when your key set comfortably fits in RAM. It is the wrong call the moment key count outgrows memory or the access pattern needs ordered scans.

Good fit: bounded key sets with large or small values — session/profile stores, device and feature state, counters, caches with durability; workloads that value tight tail-latency SLAs and trivial crash recovery.
Weaker fit: very large or unbounded key counts (the keydir won't fit), or anything that leans on range scans, prefix scans, or ordered iteration.
Watch out: overwrite- or delete-heavy traffic generates dead records fast, so merge must be scheduled and resourced; and startup time grows with dataset size unless hint files are present.

The defining bargainBitcask trades two capabilities — unbounded key counts and ordered range queries — for two guarantees: radically simple design and operations, and single-seek, low-variance reads. If you need those guarantees and can live without those capabilities, few engines are easier to run.

Bitcask vs LSM tree vs B-tree vs in-memory

	Bitcask	LSM tree	B-tree	Pure in-memory
Read cost	1 lookup + ≤1 seek	memtable + several files (Bloom-gated)	tree descent, ~log(n) pages	1 lookup, no disk
Write cost	1 sequential append	append + later compaction (write-amp)	in-place random page update	1 in-memory write
Range scans	Not supported (unordered log)	Good — merge sorted SSTables	Excellent — ordered leaves	Depends on structure
Memory need	All keys in RAM	Indexes/filters; data on disk	Cache hot pages; data on disk	All keys AND values in RAM
Durability	Append log + CRC, easy recovery	WAL + immutable SSTables	In-place, careful crash handling	Needs external snapshot/AOF
Best for	Bounded keys, predictable reads	Write-heavy, large datasets	Read-heavy OLTP, range queries	Hot data that fits in RAM

Bitcask and LSM are cousins — both log-structured and write-optimized — but they diverge on ordering. An LSM keeps data sorted in SSTables (enabling range scans) and uses a memtable, leveled compaction, and Bloom filters, accepting read amplification to support far larger key sets on disk. Bitcask keeps everything unordered and indexes it entirely in RAM, buying the single-seek read at the cost of the keys-in-RAM ceiling and no scans.

In the wild, and why it fits

Bitcask was built at Basho as a pluggable storage backend for Riak, the distributed key-value database. Riak shards keys across nodes with consistent hashing; each node runs a local storage engine per partition, and Bitcask was the default — chosen precisely because its predictable latency and trivial recovery suited a system that already handled distribution, replication, and failure at the cluster layer.

Riak KV — Bitcask as the per-vnode backend: low, predictable latency and fast crash recovery, with the documented caveat that all keys must fit in memory. Riak also offers LevelDB (an LSM) for workloads with too many keys or a need for ordered/secondary-index queries.
The Sheehy & Smith paper formalized the design and made "in-memory hash index over an append-only log" a widely cited reference point.
The pattern — append-only data files plus an in-memory key→offset map and periodic compaction — recurs in many later log-structured and embedded key-value stores and write-ahead-log designs.

Why a distributed DB wanted a dead-simple local engineRiak's hard problems (replication, conflict resolution, membership) lived above the storage layer. At the bottom it wanted an engine whose behavior was boringly predictable and whose recovery was a single log scan — so the local layer never became the source of latency surprises or data-loss puzzles. Bitcask is that engine.

Common questions & gotchas

Isn't this just an in-memory database like Redis?

No — and the difference is the whole point. Only the keys and small pointers live in RAM (the keydir); the values sit on disk. A pure in-memory store has to hold every value in RAM too, so it's bounded by total data size. Bitcask is bounded only by key count, so it serves a dataset far larger than memory while still reading in a single seek.

When would I pick an LSM tree over Bitcask?

When the key set won't fit in RAM, or you need ordered/range scans. An LSM keeps data sorted on disk and scales past memory, at the cost of read amplification (several files, Bloom-gated). Bitcask trades those away for a single-seek read and an engine you can operate without thinking. Same log-structured family, opposite bet on ordering.

Does a merge pause reads and writes?

No — merge runs in the background against the immutable files while the active file keeps taking writes. But it reads and rewrites a lot of data, so it competes for disk I/O with live traffic. Run it at peak and latency spikes; defer it forever and dead records bloat disk and slow the next recovery. Merge cadence is a scheduling decision, not fire-and-forget.

Big values are fine, so how many keys is too many?

Value size barely matters — values live on disk. Key count is the ceiling: budget the hashed key plus a fixed pointer (tens of bytes) per live key. Tens of millions of keys is a gigabyte-ish of RAM and totally fine; tens of billions is not. Out-of-memory errors with plenty of free disk is the signature of hitting that limit.

How much can I actually lose in a crash?

By default, only the last few writes that hadn't been flushed to disk yet — Bitcask relies on the OS to flush, so acknowledged-but-unsynced writes are the exposure. Everything already on disk survives (it was never mutated) and a torn tail is caught by its CRC. Need a tighter bound? Set the sync policy to fsync per write or on a timer, trading throughput for durability.

QuizYour Bitcask store starts rejecting writes with out-of-memory errors, even though disk is mostly free. What is the most likely cause?

The active data file grew too large and needs a bigger threshold
The keydir has too many entries — the live key count exceeded available RAM
Reads are too slow because Bloom filters are misconfigured
Merge deleted the hint files, corrupting the index

Show answer

The keydir has too many entries — the live key count exceeded available RAM — Bitcask keeps an in-memory keydir entry for every live key, so RAM use scales with key count, independent of how much disk the values occupy. Plenty of free disk with out-of-memory errors is the classic signature of the keys-in-RAM ceiling. (Bitcask has no Bloom filters; the data-file threshold affects file rolling, not index memory; and hint files only speed startup.)

In an interview

Lead with the split in one breath: append-only data files on disk, an in-memory keydir mapping key → (file, offset, size, timestamp). Everything else follows from it — writes are sequential appends, reads are one hash lookup plus one seek, deletes are tombstones, a background merge compacts dead records and writes hint files, and recovery just rebuilds the keydir from those hints. Name it: Bitcask, the original Riak backend.

Then show judgment by naming the limits before you're asked: all keys must fit in RAM, and there are no range scans because the log is unordered. If they push on scale or ordering, contrast an LSM tree — same log-structured family, but sorted on disk with leveled compaction and Bloom filters, buying range scans and beyond-RAM capacity at the price of read amplification. The one-liner to land: Bitcask trades unbounded keys and range queries for radical simplicity and single-seek, low-variance reads.

References & further reading

References

Sheehy & Smith — Bitcask: A Log-Structured Hash Table for Fast Key/Value Data (Basho, 2010) — the original design paper
Internet Archive — Bitcask paper (mirror) — stable backup of the paper
Riak KV docs — Bitcask backend — configuration, tuning, and the keys-in-memory caveat
basho/bitcask — reference implementation (Erlang/C) — the source: record format, merge, hint files
Martin Kleppmann — Designing Data-Intensive Applications, Ch. 3 — uses Bitcask to introduce log-structured storage and hash indexes

Ready to try it?

The simulator is a real, deterministic implementation — pick a scenario and step through it, scrubbing the timeline forward and backward through every change.

Try these in the simulator

Write and read →Write two keys, then read one back.Overwrite and delete →Two keys already exist — overwrite one (its old copy goes dead) and delete the other (a tombstone).Merge reclaims space →The file already holds dead copies — merge to reclaim the space.

Open the Bitcask simulator →

Up nextLSM Tree

← Back to the learning path