HyperLogLog · System Internals

The big picture#

TL;DRthe 30-second version

HyperLogLog (HLL) estimates the number of distinct elements in a stream using a fixed, tiny amount of memory — about 12 KB for billions of items — instead of the O(n) memory an exact hash-set would need.
It hashes each item to uniform bits and watches for a rare pattern: a hash with k leading zeros suggests you have seen on the order of 2^k distinct items. Duplicates re-hash to the same bits, so they never inflate the estimate.
To cut the noise, it splits items across m registers by their first p bits, keeps the maximum leading-zero rank in each, and combines them with a harmonic mean. Standard error is 1.04/√m — about 0.81% for m=16,384.
HLLs are mergeable: union is the element-wise max of two register arrays, which makes them ideal for distributed and parallel counting. They cannot do intersection directly.
Real systems: Redis PFADD/PFCOUNT/PFMERGE, Presto/Trino and Redshift approx_distinct, BigQuery APPROX_COUNT_DISTINCT, and Apache Druid / DataSketches.

Everything below expands on these points. Read the core sections top to bottom for the full mental model; the collapsible "Go deeper" boxes hold the advanced internals (the math, the bias corrections, the encodings) you can skip on a first pass and return to later.

itemalice@example.com

hash →

hash()uniform 64-bit

remaining bits → rank 40001… = 3 leading zeros, +1

Hash an item, then split the hash: first bits pick the register, the rest set the rank

m = 2^p registers, ~6 bits each · highlighted = the one this item touched

regs2r00r15r21r30r43r54r62r7r6 ← max(old 4, new 4) = 4

Each register keeps the MAX rank it has ever seen

Start here: the problem it solves#

Imagine you run a website and want to know how many unique visitors you had today. The exact way is to keep a set of every visitor ID you have seen and count it. With 50 million unique visitors and a 16-byte ID, that set is roughly 800 MB of RAM — for a single counter. Run that per page, per hour, per country, and the memory cost explodes.

The exact approach is fundamentally O(n) in memory: to know whether the next item is new, you must remember every item you have already seen. There is no way around that if you demand an exact answer. Counting distinct IPs hitting a network, distinct search terms, distinct users across a fleet of servers — all hit the same wall.

Here is the key observation: you usually do not need the exact count. 'About 49.7 million' is just as useful as '49,732,104' for capacity planning, dashboards, billing, or abuse detection. If you can tolerate a small error — typically around 1% — you can replace that 800 MB set with about 12 KB.

The trade-offHyperLogLog uses a fixed array of m registers (about 12 KB total for the common configuration) and answers distinct-count queries with a standard error of 1.04/√m — about 0.81% for m=16,384 — regardless of whether you counted a thousand items or ten billion. You give up exactness; you gain memory that does not grow with the data.

The intuition: rare patterns reveal scale#

Think about flipping a coin and recording the longest run of heads you ever see. If the best anyone managed was 2 heads in a row, you probably did not flip many times. If someone hit 20 heads in a row, you must have flipped an enormous number of times, because that streak has a roughly 1-in-a-million chance. The rarest event you have observed is a fingerprint of how many trials happened.

HyperLogLog applies this to hashes. Hash each item to a stream of random-looking bits. A hash that starts with many leading zeros is rare — a hash beginning with k zeros happens about once every 2^k distinct items. So if the longest run of leading zeros you have ever seen is k, a good guess is that you have seen roughly 2^k distinct items. Duplicates hash to the same bits, so they never push the maximum higher — only genuinely new items can.

Why hashing mattersThe hash must scramble inputs into uniformly random bits, so that leading-zero runs follow the clean 1-in-2^k probability. The same input always hashes to the same bits, which is exactly why a repeated item is automatically ignored — it lands on a register that already records its rank. A weak hash with structure or collisions skews the leading-zero statistics and biases the estimate.

The mechanism: from one estimate to many registers#

A single longest-zero-run is a wildly noisy estimator — one freakishly lucky item could make it say 'a billion' after ten inserts. The fix is stochastic averaging: split the stream into many independent sub-streams and average their estimates. HyperLogLog does this with the hash itself: the first p bits choose one of m = 2^p registers; the remaining bits decide the rank (leading zeros + 1). Each register independently tracks the maximum rank of the items that landed in it.

Hash the item to a wide bit string (modern HLL uses a 64-bit hash).
Take the top p bits as a register index (e.g. p=14 → m = 16,384 registers).
In the remaining bits, count the leading zeros before the first 1, then add 1 — that is the rank ρ.
If ρ is greater than what that register currently holds, raise the register to ρ. Otherwise do nothing (this is why duplicates and out-of-order inserts don't matter — the operation is an idempotent max).

Combining the registers is not a plain arithmetic mean — a single large register would still dominate and inflate the answer. HyperLogLog uses the harmonic mean of 2^register across all m registers, which strongly damps outliers, multiplied by a bias-correction constant α_m and by m². The harmonic mean is the key idea that separates HyperLogLog from its predecessor LogLog and gives it near-optimal accuracy.

Go deeperThe exact estimator and the constant α_m

The raw estimate is E = α_m · m² / Σ_j 2^(−M[j]), where M[j] is the value in register j and the sum runs over all m registers. The Σ 2^(−M[j]) term is the reciprocal of a harmonic mean of the 2^M[j] values; multiplying by m² rescales it into a cardinality.

α_m corrects a systematic multiplicative bias in that raw formula. The paper gives α_16 = 0.673, α_32 = 0.697, α_64 = 0.709, and for m ≥ 128 the approximation α_m ≈ 0.7213 / (1 + 1.079/m). These constants come from integrating the estimator's expected value and are what make the estimate asymptotically unbiased.

The standard error of this estimator is 1.04/√m. That single relationship drives every sizing decision: pick the error you can tolerate, solve for m, and that fixes your memory. m must be a power of two because it is addressed by p whole bits of the hash.

A worked example#

Take a small sketch with p=4, so m=16 registers, all starting at 0. We add three items.

Add 'apple' → hash 0110 1000… Register index = 0110 = 6. Remaining bits 1000… start with a 1, so 0 leading zeros, rank ρ = 1. Register 6 goes 0 → 1.
Add 'banana' → hash 0110 0010… Register index = 6 again. Remaining bits 0010… have 2 leading zeros, rank ρ = 3. Register 6 goes 1 → 3 (a new item raised the max).
Add 'apple' again → identical hash to step 1. Register index 6, rank 1. max(3, 1) = 3, so nothing changes. The duplicate is silently absorbed.

After these inserts, fourteen registers are still 0 and two hold small ranks. Because the sketch is so sparse, HyperLogLog would not trust the harmonic-mean formula here — it would use linear counting (described next): with z = 14 of m = 16 registers still zero, m·ln(m/z) = 16·ln(16/14) ≈ 2.1, very close to the true 2. As you pour in thousands of distinct items, the registers fill up, the harmonic-mean estimator takes over, and the answer tracks the real cardinality to within about 1.04/√m.

PredictYou add the same 1,000 distinct items to an HLL in a different random order, twice. How do the two final register arrays compare, and what does that tell you?

Hint: Each register stores a max, and a hash doesn't depend on arrival order.

They are identical, bit for bit. Each register holds the maximum rank among items that mapped to it, and an item's hash (hence its register and rank) doesn't depend on when it arrives. max is commutative and idempotent, so order and duplicates are irrelevant. That order-independence is exactly why two HLLs can be merged by taking element-wise maxes — the union sketch is the same one you'd get by adding every element to a single HLL.

Small-range and large-range corrections#

The harmonic-mean formula is biased when very few distinct items have been added and most registers are still 0. In that regime HyperLogLog switches to linear counting: if z of the m registers are still zero, the estimate is m·ln(m/z). This is accurate precisely when the sketch is sparse, which is why small counts in the simulator come out almost exactly right.

The original 2007 algorithm also defined a large-range correction for the regime near 2^32, where a 32-bit hash starts running out of distinct values and collisions bias the estimate downward. The practical fix, introduced by HyperLogLog++, is simply to use a 64-bit hash so the saturation point moves out past any realistic cardinality, and to replace the original small-range correction with an empirically measured bias table.

Accuracy is constant, not relative to data sizeThe standard error depends only on m (the number of registers), not on the cardinality. With m=16,384 the error is about 0.81% whether you count a hundred items or ten billion. To halve the error you quadruple m — and memory — because error scales as 1/√m.

Memory and accuracy#

HyperLogLog's headline property is that memory is O(m) and completely independent of n, the number of items. Adding an item is O(1): one hash, one register index, one max. Querying the estimate is O(m): scan the registers and compute the harmonic mean.

Memory is O(m) registers. Each register only needs to hold a small leading-zero rank — with a 64-bit hash the rank fits in 6 bits, so m=16,384 registers pack into 16,384 × 6 bits ≈ 12 KB. That is the standard Redis configuration.
Standard error is 1.04/√m. For m=16,384 (p=14) that is ≈ 0.81%; for m=4,096 (p=12) ≈ 1.6%; for m=1,024 (p=10) ≈ 3.25%. Halving the error means 4× the registers.
Add is O(1), independent of current cardinality. Estimate is O(m). Merge of two sketches is O(m).
The error is a standard deviation, not a hard bound: roughly 65% of estimates fall within ±1 standard error and ~95% within ±2, so an occasional query can be off by a couple of percent even at p=14.

Why ~6 bits per register is enoughA register stores the largest leading-zero rank seen among the items that landed in it. With a 64-bit hash and p=14 bits used for the index, 50 bits remain, so the rank can be at most ~51 — which fits comfortably in 6 bits (max 63). You never need a full counter per register; you need just enough bits to record a max rank. This is the whole reason a few KB can summarize billions of items.

The family: LogLog → HyperLogLog → HLL++#

HyperLogLog is the most refined member of a lineage of probabilistic counters, each improving the accuracy you get per bit of memory.

Flajolet–Martin (1985) / LogLog (2003) — the original rank-based counters. LogLog averages the register values arithmetically, which gives a standard error of about 1.30/√m.
SuperLogLog (2003) — discards the largest register values before averaging to cut the influence of outliers, improving the error to about 1.05/√m.
HyperLogLog (2007) — replaces the arithmetic mean with the harmonic mean, reaching 1.04/√m with a simpler, cleaner analysis. This is the version everyone means by 'HLL'.
HyperLogLog++ (Google, 2013) — the production-grade variant: a 64-bit hash (removing the large-range correction), an empirical bias correction for the mid-range, and a sparse representation that stores only the non-zero registers for small cardinalities to get near-exact small counts in less memory.

Go deeperSparse vs dense encodings (Redis and HLL++)

When cardinality is small, most registers are zero, so materializing the full m-register array wastes space. HLL++ and Redis both keep a sparse encoding for low counts: a compact list of (register index, rank) pairs for just the non-zero registers. This uses far less than 12 KB while a sketch is small, and gives essentially exact counts in that regime.

Once enough registers are populated that the sparse list would exceed the dense array's size, the implementation promotes the sketch to the dense encoding: the fixed packed array of m 6-bit registers (~12 KB in Redis). The switch is transparent to the caller — PFADD and PFCOUNT behave the same; only the internal representation changes.

Strengths, limits, and mergeability#

HyperLogLog makes one clear bargain: give up the exact answer to gain memory that does not grow with the data. Whether that is a good deal depends on what you do with the count.

Approximate, not exact — never use HLL where the count must be precise (billing that must reconcile to the unit, fraud cases that hinge on an exact number). Use it for dashboards, capacity planning, trends, and 'roughly how many'.
Mergeable / unionable — the union of two HLLs is the element-wise max of their register arrays, and it equals the HLL you'd get by adding every element to one sketch. Each shard, server, or time-window keeps its own HLL; you merge them with no loss of accuracy. This is the property that makes HLL a staple of distributed analytics.
No direct intersection — there is no max-style trick for |A ∩ B|. You can estimate it via inclusion–exclusion (|A| + |B| − |A ∪ B|), but the errors of the three estimates add up, so intersections of similar-sized sets can be very inaccurate.
Tunable error — p (and thus m) is a dial: smaller sketches for cheap, fuzzy counts; larger sketches when you need tighter bounds. The cost is purely memory, not speed.

Mergeability is the killer featureBecause union is a per-register max, HLLs commute and associate freely: you can pre-aggregate one HLL per server per minute, then roll those up to per-hour or per-day or per-fleet totals after the fact, all without re-scanning raw events. A daily unique-visitor count is just the merge of 24 hourly sketches. Exact sets cannot do this without shipping and de-duplicating every element.

HyperLogLog vs the alternatives#

	Exact hash-set	Linear Counting	KMV / MinCount	HyperLogLog
Memory	O(n) — grows with data	O(max n) bitmap	O(k) — sample of k mins	O(m) — fixed, ~12 KB
Accuracy	Exact	Good only if sized ≥ n	Tunable via k	1.04/√m (~0.81% at p=14)
Best range	Any (if memory allows)	Small / bounded n	Small–medium n	Tiny to billions
Mergeable?	Yes, but expensive (union sets)	Yes (bitwise OR)	Yes (keep smallest k)	Yes (element-wise max)
Used for	When exactness required	Low-cardinality counts	Distinct + similarity	Web-scale distinct counts

HyperLogLog wins when cardinality can be huge and unknown in advance and a ~1% error is acceptable. Linear counting is actually more accurate for genuinely small sets — which is exactly why HLL borrows it for its small-range correction. KMV (k-minimum-values) keeps the k smallest hash values and can also estimate set similarity, but uses more memory than HLL for the same distinct-count accuracy.

Where HyperLogLog runs in the wild#

Approximate distinct counting is a default feature of most analytics engines, and HyperLogLog (usually the ++ variant) is the implementation underneath.

Redis — PFADD, PFCOUNT, and PFMERGE are HLL operations (the PF honors Philippe Flajolet). Redis uses up to 12 KB per key with a 0.81% standard error, with sparse and dense encodings under the hood.
Presto / Trino and Amazon Redshift — approx_distinct() / APPROXIMATE COUNT(DISTINCT) compute distinct counts via HLL; Trino also exposes the HyperLogLog type directly (approx_set, merge) so you can store and combine sketches.
Google BigQuery — APPROX_COUNT_DISTINCT plus an HLL_COUNT.* family (HLL_COUNT.INIT / MERGE / EXTRACT) for storing and re-aggregating sketches across queries.
Apache Druid and the Apache DataSketches library — HLL (and the related Theta sketch, which also supports set intersection) power distinct-count columns and roll-ups in OLAP workloads.

The shape to recognizeAnywhere you see a sub-second 'unique users / unique devices / unique IPs' number over billions of events, across many shards, with a small ± next to it, there is almost certainly a HyperLogLog (or a close cousin) doing the counting. The mergeability is what lets these systems pre-aggregate and roll up.

Common misconceptions & gotchas#

How can a few kilobytes possibly count billions of items?

Because it never stores the items — only a summary of how rare the rarest hash patterns were. Each register keeps a single small number (a max leading-zero rank, ~6 bits). The estimate is reconstructed from those m numbers, and the rarity of long zero-runs encodes the scale. Memory depends on m (your chosen accuracy), not on n.

Is the count exact?

No. It is a statistical estimate with a standard error of 1.04/√m — about 0.81% at the common p=14 setting. That is a standard deviation, so individual queries can occasionally be off by a couple of percent. Never use HLL where an exact count is required.

Can I merge two HyperLogLogs?

Yes, losslessly. Take the element-wise maximum of the two register arrays. The merged sketch is identical to one built by adding every element of both streams to a single HLL, so a union has the same accuracy as a fresh sketch. This requires both HLLs use the same hash and the same p (m). You cannot, however, directly compute an intersection.

Why does it struggle at small counts, and how is that fixed?

When most registers are still 0, the harmonic-mean formula is biased. HLL detects this sparse regime and switches to linear counting (m·ln(m/z) where z registers are zero), which is accurate for small sets. HyperLogLog++ goes further with a sparse representation and an empirical bias-correction table for the mid-range.

QuizYou run one HyperLogLog per server and merge them nightly into a fleet-wide unique-visitor count. A visitor who hit three different servers is counted how many times in the merged estimate?

Three times — once per server's sketch
Once — the same hash maps to the same register, and merge takes the max
Zero — merging cancels duplicates out entirely
It depends on the order the sketches are merged

Show answer

Once — the same hash maps to the same register, and merge takes the max — The visitor's ID hashes to the same register index and rank on every server, so each server's HLL records that one register identically. Merging takes the element-wise max, so the shared register contributes exactly once — the visitor is de-duplicated across servers for free, with no list of IDs ever shipped. Merge is commutative and associative, so order is irrelevant.

In an interview#

Lead with the win: HyperLogLog turns COUNT(DISTINCT) from O(n) memory into O(m) memory — a fixed ~12 KB — at the cost of a small, tunable error (1.04/√m, about 0.81% at p=14). Name where it is used: Redis PFADD/PFCOUNT, Presto/Trino and BigQuery approximate distinct counts, and unique-visitor analytics.

Be ready to explain the leading-zeros intuition (a rare pattern implies many trials), why you need many registers (one estimator is too noisy — split by the first p bits and combine with the harmonic mean), and the headline property: error depends on the register count, not the data size. A very common follow-up is mergeability — two HLLs combine by taking the element-wise max of their registers, which is what makes them perfect for distributed counting. Mention that intersections are not directly supported.

If pushed on edge cases, mention the small-range correction (linear counting when the sketch is sparse) and HLL++ (64-bit hash, sparse encoding, bias table). Then open the simulator: ADD a few values and watch each one hash into a register and set its rank, add many more to build cardinality, then ESTIMATE and see the harmonic mean land within a standard error of the true count — using only m bytes.

References & further reading#

References

Flajolet, Fusy, Gandouet & Meunier — HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm (2007) — the original paper; derives the harmonic-mean estimator, α_m, and 1.04/√m error
Heule, Nunkesser & Hall — HyperLogLog in Practice (EDBT 2013) — Google's HLL++: 64-bit hash, sparse representation, empirical bias correction
Redis — HyperLogLog data type (PFADD / PFCOUNT / PFMERGE) — the 12 KB / 0.81% standard error production implementation
Trino — aggregate & HyperLogLog functions (approx_distinct, approx_set, merge) — exposes the HyperLogLog type directly for storing and combining sketches
Google BigQuery — approximate aggregate & HLL_COUNT functions — APPROX_COUNT_DISTINCT and the HLL_COUNT.* sketch family

Ready to try it?

The simulator is a real, deterministic implementation — pick a scenario and step through it, scrubbing the timeline forward and backward through every change.

Try these in the simulator

Count 300 uniques in a few registers →Add 300 distinct values and estimate the count — landing within a few percent of the truth using only 64 tiny registers, not a 300-entry set.Duplicates don't inflate the count →Add the same handful of values over and over plus a few new ones — the estimate tracks distinct items, not how many times each was seen.Same memory at 10× the data →Pour in 1,000 distinct values — the estimate stays accurate and the register array never grows. That fixed footprint is the whole point.

Open the HyperLogLog simulator →

← Back to the learning path