The Numbers · System Internals

The big picture#

TL;DR: the 30-second version

The latency hierarchy spans about eight orders of magnitude, from ~1 ns (L1 cache) to ~150 ms (cross-continent round trip), with RAM, SSD, and disk seeks in between. Knowing the rough gaps tells you at a glance where a design will be slow.
Latency is how long one operation takes; throughput (often QPS — queries per second) is how many you do per second. They're different axes — you can have high throughput and bad latency, or vice versa.
Little's Law, L = λ·W, ties them together: the number of requests in flight equals the arrival rate times the time each spends in the system. It's how you size thread pools and connection limits.
Availability is quoted in nines: 99.9% (three nines) ≈ 8.8 hours of downtime a year; each extra nine cuts that ~10× and costs far more than 10× to reach. The figure is promised to customers in an SLA and targeted internally by a stricter SLO.

Everything below expands on these. Read the core sections top to bottom; the "Go deeper" section holds a full worked back-of-the-envelope capacity estimate you can return to when you need it.

Operation	Real latency	If 1 ns = 1 second
L1 cache	1 ns	1 second
RAM	100 ns	~1.7 minutes
SSD random read	16 µs	~4.4 hours
Datacenter round trip	0.5 ms	~5.8 days
Disk (HDD) seek	10 ms	~4 months
California → Netherlands RTT	150 ms	~4.75 years

The gaps are the whole point: each step down the table is roughly 15× to 160× slower than the one above it, and the steps compound to about eight orders of magnitude overall. Scaling so 1 ns = 1 second turns those abstract units into human time: a disk seek isn't "10 milliseconds," it's four months.
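The scaling trick is easy to reproduce. The sketch below is illustrative only: the function names are made up for this example, and it assumes a 30-day month and a 365-day year when converting to human units.

```python
# Latencies from the table above, in nanoseconds.
LATENCIES_NS = {
    "L1 cache": 1,
    "RAM": 100,
    "SSD random read": 16_000,
    "Datacenter round trip": 500_000,
    "Disk (HDD) seek": 10_000_000,
    "California -> Netherlands RTT": 150_000_000,
}

# Unit sizes in seconds, largest first.
# Assumptions: 365-day year, 30-day month.
UNITS = [
    ("years", 365 * 86_400),
    ("months", 30 * 86_400),
    ("days", 86_400),
    ("h", 3_600),
    ("min", 60),
]


def humanize(seconds: float) -> str:
    """Render a duration in the largest unit that keeps the value >= 1."""
    for name, size in UNITS:
        if seconds >= size:
            return f"{seconds / size:.1f} {name}"
    return f"{seconds:.0f} s"


def scaled(latency_ns: float) -> str:
    """Pretend 1 ns of machine time lasts 1 s of human time."""
    return humanize(float(latency_ns))


if __name__ == "__main__":
    for operation, ns in LATENCIES_NS.items():
        print(f"{operation:32s} {scaled(ns)}")
```

With these unit choices the disk seek comes out a little under four months, which is why the table rounds it.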

Start here: why estimate at all#

Imagine someone proposes: 'for each web request, we'll read 1,000 small records from disk, one at a time.' Is that fine, or a disaster? You don't need to build it to know. A random disk seek is on the order of 10 ms; a thousand of them in series is ~10 seconds. Per request. That design is dead on arrival, and you can say so in one sentence — because you know the numbers.

That is the entire value of this toolkit. Most architecture mistakes are caught not by clever insight but by an order-of-magnitude estimate that someone did (or failed to do) early. The figures don't have to be precise — being right to within 10× is usually enough to tell a workable design from a doomed one. The goal is a calculation you can do on a napkin, or in your head, in under a minute.
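The disk-read proposal above can be checked in a few lines. This is a sketch using the page's rough per-operation figures; `serial_cost_s` is a name invented for the example, and it assumes every read waits for the previous one to finish.

```python
# Rough per-operation latencies in seconds (figures from this page).
TIER_LATENCY_S = {
    "RAM": 100e-9,
    "SSD random read": 16e-6,
    "Disk (HDD) seek": 10e-3,
}


def serial_cost_s(operations: int, latency_s: float) -> float:
    """Total time when each operation waits for the previous one."""
    return operations * latency_s


if __name__ == "__main__":
    for tier, latency in TIER_LATENCY_S.items():
        total_ms = serial_cost_s(1_000, latency) * 1e3
        print(f"1,000 serial reads from {tier}: {total_ms:,.1f} ms")
```

The same thousand reads cost about a tenth of a millisecond from RAM, tens of milliseconds from SSD, and about ten seconds from a spinning disk, so the storage tier matters more than the request count.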

Order of magnitude is the unit that matters. Nobody cares whether RAM access is 80 ns or 120 ns. What matters is that it's ~100× faster than SSD and ~100,000× faster than a disk seek. Estimation works because the gaps between tiers are enormous, so even a rough guess lands in the right tier, and the right tier is what determines whether a design flies.

The latency hierarchy#

Computers work across timescales no human can picture: billionths of a second to tenths of a second, eight orders of magnitude apart. The trick that makes them intuitive is to scale everything so that 1 nanosecond becomes 1 second of human time. Suddenly the hierarchy reads like a calendar.

L1 cache (~1 ns → 1 second): data the core just touched. Effectively free.
RAM (~100 ns → ~1.7 minutes): main memory. ~100× slower than cache — which is why a cache miss is a real event, not a rounding error.
SSD random read (~16 µs → ~4.4 hours): flash storage. Fast for a disk, but ~150× slower than RAM. This gap is the entire reason in-memory caches like Redis exist.
Datacenter round trip (~0.5 ms → ~5.8 days): a network hop to another machine in the same building. Cheap by network standards, but a microservice call crossing several of these adds up.
Disk seek (~10 ms → ~4 months): a spinning platter physically moving its head. Catastrophic to do per-request in a loop — the reason databases work hard to turn random I/O into sequential I/O.
Cross-continent round trip (~150 ms → ~4.75 years): bounded by the speed of light, which no amount of money can speed up. The reason for CDNs, edge caching, and regional deployments.

The one rule of thumb to memorize: each tier down is anywhere from ~15× to a few hundred times slower than the one above it (cache → RAM → SSD/network → disk → another continent), with ~100× a reasonable default to carry in your head. If your hot path touches a slower tier than it needs to (a disk read where a cache would do, a cross-region call where a local one would do), that's almost always where the latency went.
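The rule of thumb is deliberately coarse. As a quick check, this sketch computes the actual step-to-step slowdown from the figures listed above; `adjacent_ratios` is a helper invented for the example.

```python
# Tiers in order of increasing latency, in nanoseconds (figures from this page).
TIERS_NS = [
    ("L1 cache", 1),
    ("RAM", 100),
    ("SSD random read", 16_000),
    ("Datacenter round trip", 500_000),
    ("Disk (HDD) seek", 10_000_000),
    ("Cross-continent round trip", 150_000_000),
]


def adjacent_ratios(tiers):
    """Return (faster, slower, slowdown factor) for each neighbouring pair."""
    return [
        (fast, slow, slow_ns / fast_ns)
        for (fast, fast_ns), (slow, slow_ns) in zip(tiers, tiers[1:])
    ]


if __name__ == "__main__":
    for fast, slow, ratio in adjacent_ratios(TIERS_NS):
        print(f"{fast} -> {slow}: ~{ratio:.0f}x slower")
```

The individual steps vary quite a bit, but every one is large enough that landing in the wrong tier dominates everything else in the estimate.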

Throughput, QPS, and Little's Law#

Latency is one axis; throughput is the other, and they're independent. Latency is how long a single operation takes. Throughput is how many you complete per second — for a server, usually measured as QPS (queries per second) or RPS (requests per second). A system can have great throughput and terrible latency (a big batch pipeline) or great latency and modest throughput (a single fast machine). You have to reason about both.

The bridge between them is Little's Law: L = λ·W. The average number of requests inside a system (L) equals the arrival rate (λ, in requests per second) times the average time each request spends inside (W, in seconds). It's astonishingly general: for a stable system observed over the long run, it holds regardless of the arrival pattern or the service-time distribution, and it's the formula you actually use to size things.

Sizing concurrency: a service handling λ = 2,000 req/s, each taking W = 50 ms (0.05 s), has L = 2,000 × 0.05 = 100 requests in flight at any instant — so it needs ~100 threads, connections, or async slots to keep up. Provision fewer and a queue builds; requests wait, W rises, and it spirals.
The leverage of latency: because L = λ·W, cutting per-request latency cuts the concurrency you must provision in direct proportion. Halve W and you need half the threads for the same throughput — often cheaper than buying more machines.
Capacity from the other side: if you know you can run L = 100 in flight and each takes W = 50 ms, your max throughput is λ = L/W = 2,000 req/s. Past that, the system saturates and latency climbs.
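The three uses above are the same formula read in different directions. A minimal sketch, with function names chosen for this example:

```python
def in_flight(arrival_rate_per_s: float, time_in_system_s: float) -> float:
    """Little's Law: L = lambda * W."""
    return arrival_rate_per_s * time_in_system_s


def max_throughput(concurrency: float, time_in_system_s: float) -> float:
    """Little's Law rearranged: lambda = L / W."""
    return concurrency / time_in_system_s


if __name__ == "__main__":
    # Sizing concurrency: 2,000 req/s at 50 ms each.
    print(f"in flight:          {in_flight(2_000, 0.050):.0f}")
    # The leverage of latency: halve W, keep the same arrival rate.
    print(f"in flight, W/2:     {in_flight(2_000, 0.025):.0f}")
    # Capacity from the other side: 100 slots at 50 ms each.
    print(f"max throughput/s:   {max_throughput(100, 0.050):.0f}")
```

All inputs are averages, so the results are averages too; treat them as a starting point for provisioning rather than a hard limit.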

Predict: A service receives 5,000 requests per second, and each request takes 20 ms to handle. Roughly how many requests are being processed at the same time, and what does that tell you to provision?

Hint: Use L = λ·W. Watch the units — convert 20 ms to seconds.

L = λ·W = 5,000 × 0.020 = 100 requests in flight at once. So you need on the order of 100 concurrent workers — threads, connections, or async slots — to keep up without a queue forming. If you only provisioned 50, requests would start waiting, their effective time-in-system W would climb, and by Little's Law the backlog grows until something gives. Notice you computed a capacity requirement with no load test and no code — just one multiplication.
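To see what "requests would start waiting" means in numbers, here is a rough sketch of backlog growth when the pool is too small. It assumes perfectly steady arrivals and a fixed 20 ms service time, which real traffic never has; the function names are invented for the example.

```python
def capacity_per_s(workers: int, service_time_s: float) -> float:
    """Most requests per second that `workers` can finish (lambda = L / W)."""
    return workers / service_time_s


def backlog_after(arrival_rate_per_s: float, workers: int,
                  service_time_s: float, seconds: float) -> float:
    """Requests queued after `seconds`.

    Simplifying assumptions: constant arrival rate, fixed service time,
    and a queue that starts empty.
    """
    surplus = arrival_rate_per_s - capacity_per_s(workers, service_time_s)
    return max(0.0, surplus * seconds)


if __name__ == "__main__":
    for workers in (50, 100):
        queued = backlog_after(5_000, workers, 0.020, seconds=10)
        print(f"{workers} workers: ~{queued:,.0f} requests queued after 10 s")
```

With 50 workers the service can finish only about 2,500 requests per second, so under these assumptions the queue grows by roughly 2,500 requests every second until something sheds load.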

Back-of-the-envelope estimation#

Put the numbers together and you can size a whole system on a napkin. The method is always the same: start from a usage figure, convert to a per-second rate, then check it against the latency and storage budgets. A few constants make the arithmetic fast.

Seconds in a day ≈ 86,400, which is conveniently close to 100,000 (10⁵). So '1 million events per day' is ~12 per second; '1 billion per day' is ~12,000 per second. Round aggressively.
Read/write ratio: most systems are read-heavy (often 10:1 or more). Peak traffic is usually 2–3× the average. Size for the peak, not the average.
Storage: bytes per item × items. A billion 1 KB records is ~1 TB. Add replication (×3 is common) and overhead.
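These constants translate directly into a few helper functions. The sketch below is illustrative; the names and the default 3× peak factor are assumptions made for the example, and 1 KB is treated as 1,000 bytes.

```python
SECONDS_PER_DAY = 86_400


def per_second(events_per_day: float) -> float:
    """Convert a daily volume into an average per-second rate."""
    return events_per_day / SECONDS_PER_DAY


def peak_rate(average_per_s: float, peak_factor: float = 3.0) -> float:
    """Scale an average rate up to the peak you should size for."""
    return average_per_s * peak_factor


def storage_bytes(items: int, bytes_per_item: int, replication: int = 1) -> int:
    """Raw storage: bytes per item x items x replication factor."""
    return items * bytes_per_item * replication


if __name__ == "__main__":
    print(f"1M/day  = ~{per_second(1_000_000):.0f} per second")
    print(f"1B/day  = ~{per_second(1_000_000_000):,.0f} per second")
    tb = storage_bytes(1_000_000_000, 1_000) / 1e12
    print(f"1B x 1 KB records = ~{tb:.0f} TB before replication")
```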

Go deeper: a worked capacity estimate

Say you're designing a URL shortener expected to handle 100 million new links per day, read 10× as often as written. New writes: 100M / 86,400 ≈ 1,160 writes/second average; reads at 10× ≈ 11,600 reads/second; size for a ~3× peak, so plan for ~3,500 writes/s and ~35,000 reads/s.

Storage: at ~500 bytes per record (URL, short code, metadata), 100M/day × 365 days × 5 years ≈ 1.8 × 10¹¹ records × 500 bytes ≈ 90 TB before replication, ~270 TB with ×3. That immediately tells you this won't fit in RAM on one box — it needs partitioning and disk-backed storage, with a cache in front for the hot reads. You've sketched the architecture's shape from two numbers and a few multiplications.

Then sanity-check latency: 35,000 reads/s served from a cache (RAM, ~100 ns plus network) is comfortable; 35,000 reads/s each doing a disk seek (~10 ms) would need L = 35,000 × 0.01 = 350 disk operations in flight. At ~10 ms per seek, one spinning disk completes only about 100 seeks per second, so that load is roughly 350 disks' worth of random I/O, confirming you need either lots of disks or, better, that cache. Every step is just the numbers from this page.
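The whole worked estimate fits in one short function. This is a sketch of the arithmetic above, not a sizing tool: every parameter is an assumption taken from the example (100M links/day, 10:1 reads, 3× peak, 500 bytes, 5 years, ×3 replication), and the function name is made up.

```python
SECONDS_PER_DAY = 86_400


def estimate_url_shortener(
    new_links_per_day: int = 100_000_000,
    read_write_ratio: int = 10,
    peak_factor: int = 3,
    bytes_per_record: int = 500,
    retention_years: int = 5,
    replication: int = 3,
    disk_seek_s: float = 0.010,
) -> dict:
    """Reproduce the back-of-the-envelope numbers for the URL shortener."""
    avg_writes = new_links_per_day / SECONDS_PER_DAY
    avg_reads = avg_writes * read_write_ratio
    peak_reads = avg_reads * peak_factor

    records = new_links_per_day * 365 * retention_years
    raw_tb = records * bytes_per_record / 1e12

    return {
        "avg_writes_per_s": avg_writes,
        "avg_reads_per_s": avg_reads,
        "peak_writes_per_s": avg_writes * peak_factor,
        "peak_reads_per_s": peak_reads,
        "records": records,
        "raw_tb": raw_tb,
        "replicated_tb": raw_tb * replication,
        # Little's Law on the uncached path: seeks in flight = rate x seek time.
        "disk_seeks_in_flight": peak_reads * disk_seek_s,
    }


if __name__ == "__main__":
    for name, value in estimate_url_shortener().items():
        print(f"{name:22s} {value:,.1f}")
```

The unrounded results land slightly below the prose figures (for example about 347 seeks in flight rather than 350), which is the expected effect of rounding aggressively along the way.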

What estimation buys, and its limits#

Estimation is cheap insurance: minutes of arithmetic that can save weeks of building the wrong thing. But it's a filter, not a guarantee.

Strength — speed: you reject doomed designs in seconds and focus effort on plausible ones. The disk-seek-in-a-loop design dies before anyone writes it.
Strength — shared language: 'that's ~10 ms per call, ×1,000 calls = 10 s' is a conversation anyone on the team can check. Estimates make trade-offs explicit.
Limit — averages hide tails: Little's Law and these figures describe the average. Real systems are judged on the tail (p99, p999) — the slowest 1% — which can be far worse than the mean and is what users actually feel. Estimate first, then measure the tail.
Limit — garbage in, garbage out: an estimate is only as good as its assumptions. Over-provision from a too-pessimistic guess and you waste money; under-provision from a too-rosy one and you fall over at peak. Revisit estimates with real data.
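The "averages hide tails" limit is easy to demonstrate with made-up data. In this sketch the latency sample is synthetic (98 fast requests and 2 slow ones, chosen purely for illustration) and `percentile` uses the simple nearest-rank definition; production monitoring tools may compute percentiles differently.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n)."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Synthetic sample: 98 requests at 10 ms, 2 requests at 1,000 ms.
latencies_ms = [10] * 98 + [1_000] * 2

if __name__ == "__main__":
    print(f"mean:   {statistics.mean(latencies_ms):.1f} ms")
    print(f"median: {statistics.median(latencies_ms):.1f} ms")
    print(f"p99:    {percentile(latencies_ms, 99)} ms")
```

In this sample the mean looks healthy at under 30 ms while the p99 is a full second, which is the gap an average-based estimate cannot show.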

Estimate, then measure: The numbers tell you what should happen and roughly what to provision; they don't replace measurement. Use them to choose a design and a starting capacity, then load-test and watch the tail latencies. Estimation narrows the search space from infinite to a handful; measurement picks the winner.

Comparisons at a glance#

Two pairs worth keeping straight: latency versus throughput (different axes, often confused), and what each level of nines actually means.

	Latency	Throughput (QPS)
Measures	Time for ONE operation	Operations completed per second
Units	ns / µs / ms	requests/sec (QPS, RPS)
Improved by	Faster path: cache, fewer hops, closer servers	More parallelism: more cores, machines, shards
User feels	How snappy each action is	Whether the system keeps up under load
Tied together by	Little's Law: L = λ·W (in flight = rate × latency)	Little's Law: L = λ·W

Nines	Availability	Downtime / year	Downtime / day
Two	99%	~3.65 days	~14 minutes
Three	99.9%	~8.8 hours	~1.4 minutes
Four	99.99%	~52.6 minutes	~8.6 seconds
Five	99.999%	~5.3 minutes	~0.86 seconds
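The downtime columns follow from one multiplication: the unavailable fraction times the length of the period. A small sketch, assuming a 365-day year and using a function name invented for the example:

```python
DAY_S = 86_400
YEAR_S = 365 * DAY_S  # assumption: 365-day year


def downtime_seconds(availability_pct: float, period_s: float) -> float:
    """Allowed downtime in a period for a given availability percentage."""
    return (1 - availability_pct / 100) * period_s


if __name__ == "__main__":
    for label, pct in [("Two", 99.0), ("Three", 99.9),
                       ("Four", 99.99), ("Five", 99.999)]:
        per_year_h = downtime_seconds(pct, YEAR_S) / 3_600
        per_day_s = downtime_seconds(pct, DAY_S)
        print(f"{label:5s} {pct:>7}%  {per_year_h:8.2f} h/year  {per_day_s:8.2f} s/day")
```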

Where the numbers show up#

These figures aren't trivia — they're the daily reasoning behind real decisions.

Caching exists because of the RAM-vs-disk gap. The whole case for Redis or Memcached is one number: RAM is ~100,000× faster than a disk seek, so keeping hot data in memory turns a 10 ms read into a 100 ns one.
CDNs and edge computing exist because of the speed of light. A cross-continent round trip is ~150 ms no matter how fast your server is, so you push content to a server near the user. Physics, not engineering, sets that floor.
Thread pools and connection limits are sized with Little's Law. Database connection pools, web-server worker counts, and async concurrency limits are all 'L = λ·W' in disguise — provision for the requests in flight, not the requests per second.
SLAs and on-call are priced in nines. The jump from three to four nines (8.8 hours → 53 minutes of yearly downtime) is what justifies redundancy, automated failover, and a paid on-call rotation. Each nine is a budget line, captured as an SLA you promise customers and a tighter SLO you operate against.

Why interviewers ask for estimates: 'Design a system for 100 million users' is really 'show me you can turn that into requests/second, storage, and a latency budget.' Strong candidates start by estimating QPS and data size out loud; it's the fastest signal that someone can reason about scale rather than just name technologies.

Common questions & gotchas#

Latency vs throughput — what's the actual difference?

Latency is how long one operation takes (a time, like 50 ms). Throughput is how many operations complete per second (a rate, like 2,000 QPS). They measure different things. Picture a highway: latency is how long your car takes to cross it; throughput is how many cars cross per minute. Adding lanes (parallelism) raises throughput without making any single car faster. Most confusion in design discussions comes from mixing these up.

Do I have to memorize the exact latency numbers?

No. Memorize the orders of magnitude and the rough size of the gaps between tiers (tens to hundreds of times): cache ~1 ns, RAM ~100 ns, SSD ~tens of µs, datacenter hop ~0.5 ms, disk seek ~10 ms, cross-continent ~150 ms. Being right to within 10× is enough to put a design in the right tier, and the tier is what decides whether it works.

What does Little's Law actually let me compute?

It connects in-flight count, arrival rate, and time-in-system: L = λ·W. Given any two you get the third. The common uses: size a thread/connection pool (L) from your traffic (λ) and latency (W); or find max throughput (λ = L/W) from a concurrency limit and per-request time. Beyond requiring a stable system measured over the long run, it makes no assumptions about the workload, which is why it's so reliable.

Why is each extra nine so expensive?

Because downtime shrinks ~10× per nine while the causes get harder to remove. Three nines (8.8 h/year) you can hit with careful ops; four nines (53 min/year) means almost no manual recovery time, so you need automated failover and redundancy; five nines (5 min/year) means the system must self-heal faster than humans can react. Each nine removes a whole class of acceptable failure, and that costs far more than 10×.

Quiz: A proposed design serves each web request by making 200 sequential calls to a service 1 ms away (a datacenter round trip each). Before building it, what's your estimate and verdict?

~200 ms per request — likely too slow; batch the calls or fetch in parallel
~1 ms per request — fine, the calls are cheap
Can't estimate without load-testing it first
~200 µs per request — excellent

Answer:

~200 ms per request — likely too slow; batch the calls or fetch in parallel — 200 sequential calls × 1 ms each ≈ 200 ms of pure round-trip time per request, before any actual work — almost certainly too slow for a web request, and you know it without writing a line. The fix is structural: batch the 200 calls into one, or issue them in parallel so you pay ~1 ms of travel once instead of 200 times (the same 'parallelize round-trips' lesson from client-server). This is exactly the kind of design estimation rejects in seconds.
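The difference between the sequential and parallel versions can be sketched as below. It is an idealized model: it counts only round-trip time, assumes every call takes exactly the same time, and the `max_concurrency` limit and function names are assumptions made for the example.

```python
import math


def sequential_ms(calls: int, rtt_ms: float) -> float:
    """Each call waits for the previous one, so round trips add up."""
    return calls * rtt_ms


def parallel_ms(calls: int, rtt_ms: float, max_concurrency: int) -> float:
    """Calls go out in waves of at most `max_concurrency`; each wave
    costs one round trip (idealized: no queueing or server work time)."""
    waves = math.ceil(calls / max_concurrency)
    return waves * rtt_ms


if __name__ == "__main__":
    print(f"sequential:         {sequential_ms(200, 1.0):.0f} ms")
    print(f"parallel, 50 wide:  {parallel_ms(200, 1.0, 50):.0f} ms")
    print(f"parallel, 200 wide: {parallel_ms(200, 1.0, 200):.0f} ms")
```

Batching the 200 calls into a single request has the same effect as the fully parallel case in this model: one round trip instead of two hundred.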

In an interview#

When given a scale ('100 million users', '1 billion events a day'), immediately turn it into numbers out loud: convert to requests per second (remember ~86,400 ≈ 10⁵ seconds per day), estimate storage (bytes × items × replication), and state a latency budget. That single move signals you reason about scale instead of just naming technologies.

Lean on the latency hierarchy to justify choices: 'reads must be fast, and RAM is ~100,000× faster than a disk seek, so we cache the hot set.' Use Little's Law to size concurrency: 'at 5,000 req/s and 20 ms each, that's 100 in flight, so ~100 connections.' Mention peak vs average (2–3×) and the read/write ratio so your capacity number is the one that matters.

Close on reliability in nines: state what target the system needs, translate it to a downtime budget, and note the cost — three nines is careful ops, four-plus needs automated failover and redundancy. Acknowledge that estimates describe the average and you'd confirm with measurement, watching tail (p99) latency. Estimate to choose; measure to verify.
