System Internals
Open the simulator →
Probabilistic data structure

Bloom Filter

A bit array that says 'definitely not here' — saving you from expensive lookups.

A Bloom filter solves a specific, very common problem: you need to check whether something is in a large set, but the set is too big to keep in memory and each check costs an expensive disk read or network call. A Bloom filter is a tiny in-memory structure — often just kilobytes — that can instantly rule out most of those checks before they ever happen. Invented by Burton Bloom in 1970, it is the quintessential probabilistic data structure: it trades a small, controllable chance of being wrong for an enormous reduction in space.

The big picture#

TL;DRthe 30-second version
  • A Bloom filter is a bit array plus k hash functions. To ADD a key, set the k bits it hashes to. To QUERY a key, check those k bits: any 0 means definitely absent; all 1s means probably present.
  • It can be wrong in exactly one direction: false positives (says 'probably present' for a key never added) are possible; false negatives (says 'absent' for a key that was added) are impossible. Bits are only ever set, never cleared.
  • It is astonishingly compact — about 10 bits per key for a ~1% false-positive rate, versus the hundreds of bits a hash set needs to store full keys. It cannot list its contents, count exactly, or (in standard form) delete.
  • The math has two knobs: m/n (bits per key) and k (number of hashes). The optimal k = (m/n) x ln 2 sets about half the bits, and the false-positive rate is then ~0.6185^(m/n).
  • It is the membership pre-check in front of expensive lookups everywhere: LSM/SSTable engines (RocksDB, Cassandra, HBase), CDN caches, Chrome Safe Browsing (historically), and Bitcoin SPV clients.

Everything below expands on these points. Read the core sections top to bottom for the full mental model; the collapsible "Go deeper" boxes hold the probability math and tuning you can skip on a first pass and return to later.

key“alice.com”
hash ×3 →
k = 3 hashesh1, h2, h3
ADD →
set bits 2, 5, 9to 1
ADD: hash the key with k functions, set those k bits to 1

highlighted = the 3 bits alice set

bits00011203041506070819010011
The bit array after adding “alice.com”
QUERYBits checkedVerdict
carol2 = 1 ✓, 4 = 0 ✗Definitely NOT present — one 0 ends it, skip the disk read
mallory2 = 1 ✓, 5 = 1 ✓, 9 = 1 ✓Probably present — but it could be a false positive

Start here: the problem it solves#

Imagine a web crawler that has visited 100 million URLs. Every time it discovers a new link, it needs to ask: have I already crawled this? Storing all 100 million URLs in a hash set would take roughly 8–10 GB of RAM — more than most machines can spare for a single lookup table. And the crawler needs to do this check billions of times.

The general shape of the problem: a cheap membership test guards an expensive operation. The expensive thing might be a disk seek to an on-disk file, a round trip to a database or remote cache, a re-download, or a duplicate write. Most of the time the answer is 'no, not present' — the link is new, the key isn't in this file, the item hasn't been seen. If you could answer the easy 'no' cases in memory, you would skip the expensive lookup almost every time.

Here is the key observation: the crawler does not need a perfect answer. It can tolerate occasionally revisiting a page it already crawled (a false positive — the filter wrongly said 'maybe seen'). What it cannot tolerate is missing a new page because the filter wrongly said 'already seen' (a false negative). One direction of error is acceptable; the other is not. A Bloom filter is built to provide exactly that asymmetry.

The trade-offA Bloom filter uses about 10 bits per element — roughly 125 MB for 100 million URLs — instead of 8 GB for a hash set. It guarantees zero false negatives (never says 'absent' for something that was added) and has a small, tunable false-positive rate (occasionally says 'possibly present' for something that was not). That ~64x memory saving is often worth the occasional false positive — especially since a false positive here just means one wasted lookup, not a wrong final answer.

The data structure: an array of bits#

A bit is the smallest unit of storage a computer has: it is either 0 or 1. A Bloom filter is nothing more than a large array of bits, all initialized to 0. For 100 million elements at 10 bits each, the array has 1 billion bits — 125 MB total.

Along with the bit array, you need k hash functions. A hash function takes any input — a URL, a username, a number — and maps it to a position in the bit array. With m bits in the array, each hash function returns a number in [0, m-1]. You pick k to be small — typically between 3 and 10 — and each function gives an independent, different position for the same input.

What is a hash function here?Think of it as a black box that deterministically scrambles an input into a position number. Give it 'alice.com' and it always returns the same number — say, 42. Give it 'bob.com' and it returns 101. Each of the k hash functions uses different internal math, so the same input usually maps to k different positions. They must be fast and spread inputs uniformly across the array — cryptographic strength is not required.
Go deeperYou don't actually need k independent hash functions

Computing k separate strong hashes per key would be slow. In practice, implementations use double hashing: compute two independent hashes h1 and h2 once, then derive the i-th position as g_i = h1 + i x h2 (mod m), for i = 0..k-1. Kirsch and Mitzenmacher showed this introduces no meaningful increase in false-positive rate versus k truly independent hashes.

This is why production filters (RocksDB, Cassandra) can use a fast non-cryptographic hash like MurmurHash or xxHash, split its bits into two halves, and synthesize all k probe positions cheaply. The hash just needs good distribution, not collision resistance.

Adding a key: set k bits to 1#

To add a key to the filter, run it through all k hash functions. Each function gives you one position in the bit array. Set all k of those positions to 1.

  1. ADD 'alice.com' with k=3: hash functions return positions [42, 1891, 503]
  2. Set bit 42 to 1, bit 1891 to 1, bit 503 to 1
  3. ADD 'bob.com': hash functions return positions [101, 42, 7]
  4. Set bit 101 to 1 (new), bit 42 is already 1 — no change, bit 7 to 1 (new)

Notice that bit 42 was already set by 'alice.com' and 'bob.com' maps to it too. This is a collision — two keys share a bit. Collisions during adds are harmless. They matter only during queries, where they are the source of false positives.

Querying: what a 0 means vs what all 1s mean#

To check whether a key is in the set, run it through the same k hash functions and check the resulting positions in the array.

  • If ANY of the k bits is 0: the key was definitely never added. A key that was added would have set all k of its positions to 1. Finding even one 0 proves this key was not responsible for those positions — it is absent. This answer is guaranteed correct: no false negatives ever.
  • If ALL k bits are 1: the key was possibly added. Every position happens to be 1, but they might have been set by other keys rather than this one. This is a 'maybe' answer — could be a genuine match (true positive) or a coincidence (false positive).
Concrete exampleQUERY 'carol.com' with k=3: hash functions return [42, 500, 999]. Check bit 42: it is 1 (set earlier by alice.com and bob.com). Check bit 500: it is 0 — never been set. Found a 0 — 'carol.com' is DEFINITELY NOT in the set. Skip the expensive lookup entirely. No disk read, no network call.
PredictA query can short-circuit. The moment it checks a bit that is 0, it can stop and answer 'absent'. So for a key that is genuinely absent, roughly how many of the k bits does it inspect on average?

Hint: About half the bits in a well-tuned filter are 1. What's the chance the first probed bit is already 0?

Usually just one or two. At the optimal operating point about half the bits are 0, so there's roughly a 50% chance the very first probe lands on a 0 and the query returns 'absent' immediately. On average a true-negative lookup inspects under two bits before short-circuiting — that's why a Bloom check is essentially free compared to the disk read it prevents. Only the rare false positive (and every true positive) has to inspect all k bits.

Why false negatives are impossible#

The no-false-negatives guarantee follows from one rule: bits are never cleared, only set. Once a bit flips to 1, it stays 1 for the lifetime of the filter.

If you added a key, you set all k of its bit positions to 1 at the time of insertion. Since those bits never go back to 0, querying that same key later will always find all k positions set — always returning 'possibly present'. It is physically impossible for the filter to say 'absent' about a key that was added.

The filter can only be wrong in the other direction: saying 'possibly present' for a key that was never added. Never the reverse.

False positives: when bits lie by coincidence#

As you add more keys, more bits in the array flip to 1. After enough additions, a key that was never added might hash to k positions that are all already 1 — set by different earlier keys as coincidental collisions. Querying that key returns 'possibly present' even though it was never inserted. That is a false positive.

The more full the array, the more likely any random set of k positions is already all 1s. A nearly empty array rarely produces false positives; a nearly full one produces them constantly.

  • At 10% fill (1 in 10 bits set): false positives are very rare — most queries find at least one 0 and correctly return 'absent'.
  • At 50% fill (half the bits are 1): this is the optimal operating point for many configurations — a good balance of memory use and accuracy.
  • At 90% fill (9 in 10 bits set): almost every query returns 'possibly present' whether or not the key was added — the filter has become nearly useless.
The false-positive rate formulaThe expected false-positive probability is approximately (fill ratio)^k — the chance that all k checked bits happen to be 1 already. With 50% fill and k=7 hash functions: 0.5^7 = 0.78%. With 50% fill and k=3: 0.5^3 = 12.5%. More hash functions mean a lower false-positive rate at the same fill level — up to a point. Add too many hashes and each insert sets more bits, so the array fills faster and the fill ratio climbs, eventually making things worse.
Go deeperDeriving the false-positive probability from scratch

Let m be the number of bits, n the number of inserted keys, and k the number of hash functions. Assume hashes are uniform and independent. Each insert sets k bits. The probability that one specific bit is NOT set by one specific hash of one specific key is (1 - 1/m). Across all k x n hash operations, the probability a given bit is still 0 is (1 - 1/m)^(kn), which for large m is very close to e^(-kn/m).

So the probability that a given bit IS set is p_set = 1 - e^(-kn/m). A false positive happens when a never-added key hashes to k positions that are all already 1. Treating those k bits as roughly independent, the false-positive probability is approximately P = (1 - e^(-kn/m))^k. This is the canonical Bloom filter formula. (The bits aren't perfectly independent, so this slightly understates the true rate; Bose et al. give the exact analysis, but the formula is excellent in practice.)

Notice the 'fill ratio' shorthand used above: e^(-kn/m) is the fraction of bits expected to be 0, so (1 - e^(-kn/m)) is exactly the fill ratio, and P = (fill ratio)^k. The two descriptions are the same equation.

To minimize P, take the derivative with respect to k (holding m and n fixed). The minimum is at k* = (m/n) x ln 2 ≈ 0.693 x (m/n). Substituting back, at the optimal k exactly half the bits are set (fill ratio = 1/2) and the false-positive rate becomes P = (1/2)^k* = (1/2)^((m/n) ln 2) ≈ 0.6185^(m/n). Each additional bit per key therefore multiplies the rate by ~0.6185.

Inverting that for sizing: to hit a target false-positive probability p, you need m/n = -ln(p) / (ln 2)^2 ≈ -1.44 x log2(p) bits per key. For p = 1% that's ~9.6 bits/key; for p = 0.1%, ~14.4 bits/key; for p = 0.01%, ~19.2 bits/key. The relationship is the heart of every Bloom-filter sizing calculator.

Choosing size: bits per key and number of hash functions#

Two numbers control everything. m/n is the number of bits allocated per key (total bits m divided by expected keys n). k is the number of hash functions. Your goal is to minimize the false-positive rate for a given memory budget.

Bits/key (m/n)Optimal kFalse-positive rate
64~5%
86~2%
107~1% (the common default)
1410~0.1%
2014~0.007%

The optimal k for a given m/n is k = (m/n) x ln 2, which equals approximately 0.69 x bits-per-key. At this k value, exactly half the bits are set on average — the sweet spot where the false-positive rate is minimized for the memory you have. Use fewer hashes and the array is under-used; use more and it fills up too fast.

Rule of thumbEach extra bit per key multiplies the false-positive rate by about 0.6 (roughly a 40% drop), so every ~1.4 extra bits per key halves it. ~10 bits/key buys ~1%; aim for the standard 10 bits unless you have a specific accuracy target. Crucially, the bits-per-key figure is independent of how many keys you store — a filter for a billion keys at 1% needs the same 10 bits each as one for a thousand keys.

Time complexity is the other half of the story and it is excellent: ADD and QUERY are both O(k) — a fixed, small number of hash computations and bit accesses, completely independent of how many keys n the filter holds. There is no tree to descend and no list to scan. Space is O(m) = O(n) bits, with the constant (bits per key) set by your target false-positive rate.

Variants: deletes, growth, and the cuckoo alternative#

The standard Bloom filter is the baseline; several variants relax its limitations at a cost.

  • Counting Bloom filter — replaces each bit with a small integer counter (typically 4 bits). ADD increments the k counters; DELETE decrements them; QUERY treats any non-zero counter as 'set'. This is the standard way to support deletion, at roughly 4x the memory of a plain filter. Counters can overflow under heavy load, and deleting a key never added can corrupt the filter.
  • Scalable Bloom filter — chains a sequence of filters with geometrically tightening false-positive rates. When the active filter fills past its target, a new, larger one is started. QUERY checks all filters in the chain. This removes the need to know n in advance, at the cost of slower queries and slightly more space.
  • Partitioned / blocked Bloom filter — splits the bit array into k equal slices (one per hash) or into cache-line-sized blocks so all k probes for a key hit the same cache line. Blocked filters trade a slightly higher false-positive rate for a large speedup from cache locality, which is why RocksDB uses them.
  • Cuckoo filter — a different structure (a cuckoo hash table of short fingerprints) that supports deletion natively, often uses less space than a Bloom filter for false-positive rates below ~3%, and has better cache locality. The trade-off: inserts can fail when the table gets too full, and it has a hard capacity limit.
Go deeperWhy standard Bloom filters can't delete

Deletion would mean clearing a key's k bits back to 0. But bits are shared: a single bit may have been set by several different keys. Clearing it to remove one key would also 'remove' every other key that depended on that bit — and now a query for one of those keys could find a 0 and wrongly answer 'absent'. That is a false negative, which breaks the filter's one hard guarantee.

The counting variant fixes this by tracking how many keys set each position: decrementing a counter from 3 to 2 doesn't erase the other two keys' contributions. The price is the extra bits per slot — which is exactly why deletion isn't free.

Trade-offs: what you give up for the space#

A Bloom filter's superpower — storing membership in ~10 bits per key instead of hundreds — comes from throwing away the keys themselves. Once you internalize that it keeps no keys, only a fog of bits, every limitation follows.

  • Space vs accuracy — the central dial. Fewer bits per key saves memory but raises the false-positive rate; the relationship is fixed (~0.6 multiplier per bit). You choose the point on that curve; you cannot beat it.
  • No exact answers on 'present' — a positive result is always 'probably', never 'certainly'. If a wrong 'yes' is catastrophic (not merely a wasted lookup), a Bloom filter alone is the wrong tool.
  • No deletes in standard form — membership is monotonic; you can only add. Workloads that churn keys need a counting or cuckoo filter, or periodic rebuild from the source of truth.
  • Can't enumerate or count — it answers one question (is x a member?) and nothing else. No iteration, no exact cardinality (use a HyperLogLog for counting).
  • Sized up front — you must estimate n to pick m. Overshoot and you waste memory; undershoot and the false-positive rate silently degrades as you overfill.
The mental model that ties it togetherA Bloom filter is a lossy, one-way fingerprint of a set. You can ask 'could this be in the set?' and get a fast, mostly-right answer that is conservatively biased toward 'yes'. You cannot ask 'what is in the set?', 'how many?', or 'take this out.' Reach for it whenever a cheap, occasionally-wrong 'probably not' lets you skip an expensive 'definitely'.

Failure modes: saturation and silent degradation#

Bloom filters fail quietly, not loudly. They never throw an error or return a wrong 'absent' — they just get less useful as the false-positive rate climbs. The dangerous cases are the ones you don't notice.

  • Saturation — if you insert far more than the planned n keys, the array fills past 50% and the false-positive rate rises steeply. Past ~80–90% fill almost every query returns 'probably present' and the filter stops saving any lookups. Because there is no error, this can go unnoticed until your 'cheap pre-check' is silently doing nothing.
  • Over-hashing — choosing k much larger than (m/n) ln 2 sets too many bits per insert, saturating the array faster and raising the false-positive rate. More hashes is not always better.
  • Bad or correlated hashes — if hash functions aren't well-distributed (or are correlated), keys cluster into fewer bits than expected, inflating the false-positive rate above the formula's prediction.
  • Mistaken deletes — clearing bits in a standard filter (or decrementing a counter for a key that was never added in a counting filter) introduces false negatives, breaking the core guarantee. This is the one way to make a Bloom filter actually lie about absence.
Operational guardTrack the actual fill ratio (fraction of bits set) in production, not just the inserted count. When it drifts toward 50%+, it's time to rebuild a larger filter or roll a fresh one. Many systems rebuild the filter when the underlying data is rewritten anyway — an SSTable's Bloom filter is built once at flush time for a known key count, which sidesteps saturation entirely.

Bloom filter vs hash set vs cuckoo filter#

Bloom filterHash setCuckoo filter
Memory per key~10 bits (configurable)~100–400 bits (stores full keys)~7–12 bits at low FPP
'Absent' answerCertain — no false negativesCertainCertain — no false negatives
'Present' answerProbable — tunable FPPCertainProbable — tunable FPP
DeletesNo (counting variant adds it)YesYes, natively
List / countNoYesNo
Lookup workO(k) hashes, scatteredO(1) averageO(1), ≤2 cache lines
CapacityDegrades gracefully when overfullGrows freelyInserts can fail when full
Best forCheap pre-check before an expensive lookupExact membership, small/medium setLike Bloom but needs deletes / lower FPP

Rule of thumb: use a hash set when the set fits in memory and you need exactness, deletion, or enumeration. Use a Bloom filter when the set is huge, you only need membership, and a small false-positive rate is acceptable. Reach for a cuckoo filter when you want Bloom's space savings but also need deletes or a very low false-positive rate.

Where Bloom filters run in the wild#

The pattern is always the same: a tiny in-memory filter guards an expensive operation and answers the common 'definitely not present' case for free.

  • LSM-tree storage engines — RocksDB, LevelDB, Cassandra, HBase, and ScyllaDB attach a Bloom filter to each on-disk SSTable. A point lookup checks the filter first; if it says 'not here', the entire file is skipped with zero disk I/O. This is the single most important use of Bloom filters in databases (see the LSM explainer's read path).
  • Google Bigtable — the original Bigtable paper describes optional per-SSTable Bloom filters to avoid disk seeks for non-existent rows; this is where many engineers first met the idea.
  • CDN and web caches — Akamai and others use Bloom filters to detect 'one-hit wonders' (objects requested exactly once). Only cache an object the second time it's seen, so the cache isn't polluted by URLs that will never be requested again.
  • Chrome Safe Browsing (historically) — early versions shipped a Bloom filter of known-malicious URL prefixes to the browser so most safe URLs were cleared locally without a server round trip. Google later moved to a different prefix-set structure, but it's a classic example.
  • Bitcoin SPV clients (BIP 37) — lightweight wallets sent a Bloom filter of their addresses to full nodes so the node could return only relevant transactions without learning exactly which addresses belonged to the wallet. (Later deprecated over privacy and DoS concerns, but a textbook use of the structure.)
  • Databases and dedup pipelines — PostgreSQL ships a bloom index extension; data pipelines use Bloom filters to skip already-processed records and to short-circuit expensive joins.

Common misconceptions & gotchas#

Do more hash functions always reduce false positives?

No — there is an optimum. More hashes mean more bits checked on a query (good), but also more bits set on every insert, which fills the array faster (bad). The sweet spot is k = (m/n) x ln 2, where about half the bits are set. Below it the array is under-used; above it the array saturates and the false-positive rate climbs again.

Can a Bloom filter ever produce a false negative?

Not a standard one. If a key was added, all k of its bits were set to 1 and bits are never cleared, so a later query for that key always sees all 1s and returns 'probably present'. The only way to create a false negative is to break the rules — deleting bits in a plain filter, or decrementing a counter for a key that was never inserted in a counting filter.

Can I delete from a Bloom filter?

Not from a standard one. Bits are shared between keys, so clearing one key's bits would corrupt others and create false negatives. Use a Counting Bloom filter (counters instead of bits, ~4x memory) or a cuckoo filter, which both support deletion. The common production alternative is to simply rebuild the filter from the source of truth.

Does a bigger array fix a high false-positive rate?

Yes, but only if you also re-tune k. The false-positive rate depends on bits-per-key (m/n) and k together. Doubling m without raising k leaves k below optimal; doubling m and recomputing k = (m/n) ln 2 drops the rate sharply (each extra bit/key multiplies it by ~0.6).

Is the false-positive rate fixed forever?

No — it rises as you insert more keys. The advertised rate assumes you stay at or below the planned capacity n. Overfill the filter and the rate degrades silently, with no error to warn you. Monitor the fill ratio, not just the count.

QuizYou provisioned a Bloom filter for 1 million keys at a 1% false-positive rate (~10 bits/key, k=7), but traffic grew and you've now inserted 4 million keys without resizing. What happened to the false-positive rate?

  1. It stayed at 1% — the filter self-adjusts
  2. It rose sharply, because the array is now ~4x overfilled and far more than half the bits are set
  3. It dropped, because more keys means more information
  4. Queries started returning false negatives
Show answer

It rose sharply, because the array is now ~4x overfilled and far more than half the bits are setBits-per-key is now ~2.5 (m fixed, n quadrupled), so the array is heavily saturated — well past the 50% fill that k=7 was tuned for. The false-positive rate climbs steeply (toward double digits), and queries increasingly return 'probably present' for absent keys. False negatives never occur (bits are only set). The fix is to rebuild a larger filter; the filter cannot self-adjust.

In an interview#

Lead with the use case: a Bloom filter sits in front of something expensive — a disk read, a network call, a database query — and cheaply rules out items that are definitely absent. It trades a small, tunable false-positive rate for massive memory savings versus a real set.

Explain the mechanism in two sentences: k hash functions map the key to k bit positions; ADD sets those bits to 1, QUERY checks them — a single 0 means definitely absent, all 1s means possibly present. State the guarantee and why it holds (bits never cleared). Know the sizing rule: ~10 bits/key, k = 0.69 x bits-per-key, ~1% false-positive rate at 50% fill. State the headline limitation: no deletes — and name the Counting Bloom filter as the fix.

Name real uses: LSM tree SSTables skip an entire file with one bloom check (no disk read needed — exactly what you see in the LSM simulator), CDN origin checks, database index skip scans, deduplication pipelines at scale. Then open the simulator: add a handful of keys, query one that is definitely absent (watch it exit on a 0 bit), find a false positive (all k bits happen to be set), and watch the fill ratio and false-positive probability update live.

If pressed on alternatives, contrast it with a cuckoo filter (adds deletes and lower false-positive rates) and a HyperLogLog (different job — approximate distinct-count, not membership). Mention double hashing as the practical trick that avoids computing k independent hashes. Those three details signal you've used the structure, not just read about it.

References & further reading#

References