System Internals
Open the simulator →
Traffic control

Rate Limiting

How an API decides, request by request, who gets served and who gets a 429 — token bucket, leaky bucket, and fixed/sliding window counters, then how to make it work across a fleet.

Any service with finite capacity needs a rule for what happens when traffic exceeds it. Rate limiting answers that with a simple contract: each client gets to make at most N requests in some span of time, and the server enforces it cheaply, in memory, on every request. The interesting part isn't the contract — it's the bookkeeping. This page (and the simulator) walks through five ways to implement it in increasing order of accuracy and cost, then tackles the part that actually breaks in production: enforcing one limit across a whole fleet of servers.

The big picture#

TL;DRthe 30-second version
  • A rate limiter decides per request — accept or reject — capping each client (by API key, IP, or user id) to a budget so one bad actor can't starve everyone. Over-budget requests get a fast HTTP 429, not an unbounded queue.
  • Five classic algorithms trade accuracy against cost: token bucket (burst-tolerant, O(1)), leaky bucket (smooths output, O(1)), fixed window (one counter, but bursts at boundaries), sliding-window log (exact, O(limit) memory), sliding-window counter (approximate, O(1) — what Cloudflare ships).
  • Token and leaky bucket shape WHEN a budget is spent (all at once vs. smoothed); the window algorithms measure HOW MUCH was spent over a time span. Different axes — interviewers love when you separate them.
  • The hard part is distributed enforcement: per-node limits don't compose, so production limiters keep counters in a shared store (Redis INCR+EXPIRE, sorted-set sliding windows, or GCRA via redis-cell) and choose between exact-but-central and approximate-but-local.

Everything below expands on these points. Read the core sections top to bottom for the full mental model; the collapsible "Go deeper" box on distributed limiting holds the advanced internals (Redis data structures, GCRA, Lua) you can skip on a first pass.

capacity = 5 tokens · refill = 1 token / second · ★ = a token, · = empty slot

t=0sfull — an idle client banked 5; a burst of 5 is allowed
t=0s·····empty — the 6th request is DENIED → 429
t=1s····+1 token refilled
t=2s···now limited to ~1 req/sec
t=5sback to full if it stayed idle
Token bucket: idle time banks a burst, then it's a steady drip

Start here: why count requests at all?#

A server has finite CPU, memory, and downstream capacity (a database, a third-party API). Without a limit, one misbehaving client — a retry loop, a scraper, a bug — can starve everyone else. This is the noisy-neighbor problem, and retry storms make it worse: a downstream blip makes every client retry at once, and those retries pile onto the very system that's already struggling. Rate limiting caps each client (by API key, IP, or user id) to a budget, and rejects requests over budget so the rest of the fleet stays healthy.

Limiting vs. queuing vs. load-sheddingRate limiting rejects per-client over a budget (a fast, cheap 429). Queuing buffers excess work and processes it later — fine until the queue is unbounded and latency climbs until the system falls over. Load-shedding is the whole-system version: when the server itself is overloaded, drop a fraction of ALL traffic (often by priority) regardless of who sent it. They're complementary tools — a leaky bucket is really a rate limiter and a small bounded queue in one — but the accept/reject decision is what the simulator focuses on.

Token bucket: a budget that refills continuously#

A bucket holds up to `capacity` tokens. Tokens refill continuously at `refillRate` tokens/second (not in discrete steps). Every request costs one token: if the bucket has at least one, it's allowed and a token is removed; otherwise the request is denied. Because refill is continuous, a client that's been idle accumulates tokens up to the cap and can spend them all in a single burst — that burst tolerance is the defining trait of a token bucket.

  1. On each request, compute elapsed time since the last refill and add elapsed × refillRate tokens, capped at capacity.
  2. If tokens ≥ 1, subtract 1 and allow.
  3. Otherwise deny — and tell the client roughly how long until the next token (1/refillRate seconds away from a full token).

Notice there's no background timer adding tokens on a schedule. Refill is computed lazily, on demand, from the elapsed time since the last touch — so the entire state for a client is just two numbers: a token count and a last-updated timestamp. That O(1) footprint is why token bucket scales to millions of keys.

This is what most public APIs actually useStripe, GitHub, and AWS all expose token-bucket-shaped limits (a steady rate plus burst headroom) because it's cheap — O(1) state per client — and matches how real traffic behaves: bursty, with idle gaps in between. The two parameters map cleanly to a published policy: refillRate is the sustained rate, capacity is the burst allowance.

Leaky bucket: smoothing into a steady output rate#

Where a token bucket controls the input rate (how fast you can spend tokens), a leaky bucket controls the output rate: requests fill a fixed-size queue, which drains at a constant `leakRate`. If the queue isn't full, the request is queued (allowed) and the level rises by one; if it's full, the request overflows (denied). The level falls continuously between requests as the queue drains.

The practical difference from a token bucket: a token bucket lets a client who's been idle spend a full burst instantly. A leaky bucket forces every request through at (at most) the leak rate, queue or no — it actively smooths bursts into a steady trickle, which is exactly what you want in front of a downstream system that can't handle spikes at all.

Fixed window counter: the simplest implementation, with a real flaw#

Divide time into fixed windows of `windowMs` (e.g. one window per second). Keep one counter per window: increment on each request, reset to 0 whenever the clock crosses into a new window. Allow while the counter is under `limit`. That's the entire algorithm — one integer, one comparison.

The boundary burst problemThe flaw: nothing stops a client from sending `limit` requests in the last millisecond of one window and another `limit` in the first millisecond of the next. From the client's perspective that's `2 × limit` requests in a couple of milliseconds — technically compliant with 'limit per window,' but far from the intended steady rate. This is exactly the gap the two sliding-window variants below close.
PredictLimit is 100 requests per minute with a fixed window. A client fires 100 requests at 11:00:59 and 100 more at 11:01:00. Did the limiter do its job?

Hint: Count per fixed window vs. count over any trailing 60 seconds.

Each burst lands in a separate window, so the fixed-window limiter allows both — yet that's 200 requests in about one second, double the intended rate. This boundary burst is the whole reason sliding windows exist.

Sliding window log: exact, at the cost of memory#

Keep a timestamp for every request, instead of just a count. On each new request, drop every timestamp older than `now - windowMs`, then allow only if fewer than `limit` timestamps remain — and if allowed, record the new timestamp. Because the window is checked relative to *now*, not a fixed grid, there's no boundary to burst across: the count over any trailing windowMs-wide span is always exact.

The cost is storage: O(limit) timestamps per client instead of a single integer, plus the work of pruning on every request. For a client with a small limit this is fine; for millions of clients with high limits it adds up.

Sliding window counter: approximating the log with two integers#

A compromise: keep two fixed-window counters — the current window's count and the immediately previous window's count — instead of every timestamp. Estimate the trailing-window count as `prevCount × overlapWeight + currCount`, where `overlapWeight` is how much of the previous window still falls inside the trailing windowMs-wide span (1.0 right at the window boundary, decaying to 0 by the next boundary).

limit = 100 / 60s · now = 15s into the current window

prev100 reqsthe full 60s window× overlapWeight 0.75
curr40 reqsthe first 15s so farcounted in full
Weighting the previous window by how much it still overlaps

This assumes requests are spread evenly across the previous window, which isn't always true — so it's an approximation, not exact like the log. In exchange it costs O(1) state per client (two counters) instead of O(limit), which is why it's the choice production rate limiters (e.g. Cloudflare's) actually ship. Cloudflare reported that on 400 million requests from 270,000 distinct sources, only about 0.003% were wrongly allowed or limited versus an exact sliding log — a tiny error for a huge memory saving.

The cost that decides everything: per-client state#

A rate limiter doesn't keep one number — it keeps state for every active key, and there can be millions of them (one per API key, per IP, per user, sometimes per route). So the question that picks the algorithm in practice is: how many bytes, and how much work, per key per request?

AlgorithmState per keyWork per request
Token bucketO(1) — count + timestampO(1)
Leaky bucketO(1) — level + timestampO(1)
Fixed windowO(1) — one counter (+ window id)O(1)
Sliding window counterO(1) — two countersO(1)
Sliding window logO(limit) — a timestamp per request in windowO(limit) to prune

Only the sliding log scales with the limit itself. Picture 10 million keys, each limited to 1,000 requests/minute. The four O(1) algorithms need a couple of words per key — call it ~16 bytes, ~160 MB total, comfortably in RAM. The sliding log can hold up to 1,000 timestamps per key — gigabytes, plus the pruning cost on every single request. That memory cliff is the whole reason the approximate sliding-window counter exists, and why exact limiting at scale is rare.

Don't forget expiryIdle keys must be reclaimed or memory grows without bound. In-memory limiters use an LRU or TTL sweep; Redis-based ones lean on EXPIRE so a key for a client that stops sending traffic simply evaporates. A limiter that never forgets keys is a memory leak with extra steps.

Distributed rate limiting: one limit across many servers#

Everything so far assumed a single process holding the counter. Real APIs run behind a load balancer across dozens or hundreds of nodes, and any node might handle any client's request. Now the problem is hard, because per-node limits don't compose.

Why per-node limits don't composeSay the policy is 100 req/min and you have 10 nodes. If you naively give each node a local limit of 10/min, a client whose traffic happens to spread evenly gets exactly 100 — but a client pinned to one node gets only 10, and a burst that the LB fans out across all 10 nodes can briefly pass up to 100 even though no single node saw more than its share. Local counters can't see the global total, so the global limit is either too strict, too loose, or both at different moments.

Two families of fixes, and the choice is the recurring theme of this page — exactness versus latency:

  • Shared store (central, exact): every node reads and updates one counter in a fast shared store, almost always Redis. Exact global enforcement, at the cost of a network round-trip per request and a dependency that must stay up.
  • Local + async reconciliation (distributed, approximate): each node keeps a local counter and periodically gossips or syncs totals (e.g. divides the global budget and rebalances). No per-request round-trip, but the global count lags, so the limit is approximate near boundaries.
  • Sticky routing: route each client consistently to the same node (by hashing the key at the LB), so its counter lives in exactly one place and a plain in-memory limiter is globally correct for that client. Simple and fast, but it fights load balancing and breaks when nodes are added, removed, or fail.
Go deeperImplementing it in Redis — INCR, sorted sets, GCRA, and Lua

Fixed window in Redis is two commands: INCR the key for the current window, and on the first increment set EXPIRE to the window length so the key self-cleans. If the returned count exceeds the limit, reject. It's O(1) and trivially shardable, but it inherits the fixed-window boundary-burst flaw, and the INCR-then-EXPIRE pair has a subtle bug: if the process dies between them the key never expires and leaks. Doing both atomically (a MULTI transaction, a Lua script, or SET ... EX semantics) is the fix.

Exact sliding-window log in Redis uses a sorted set (ZSET) per key, with the request timestamp as both member and score. Each request: ZREMRANGEBYSCORE to drop entries older than now − window, ZCARD to count what remains, and if under the limit ZADD the new timestamp (plus an EXPIRE). Exact, but O(limit) memory per key and several commands per request — wrap them in a Lua script (or MULTI) so the read-modify-write is atomic and one round-trip.

Token bucket in Redis is usually a small Lua script that reads the stored {tokens, lastRefill}, lazily refills by elapsed × rate, decrements if a token is available, and writes the new state back — all atomically inside Redis so two nodes can't both spend the last token. Lua matters here precisely because the operation is a read-modify-write that must not interleave.

GCRA (Generic Cell Rate Algorithm) is the elegant one. Instead of counting tokens, it stores a single value — the Theoretical Arrival Time (TAT), the earliest instant the next request is allowed. On each request it compares now to TAT: if now ≥ TAT (minus a burst tolerance) the request is compliant and TAT is pushed forward by the fixed inter-request interval; otherwise it's rejected, and the gap to TAT is exactly the Retry-After. One key, O(1) state, no separate counter to expire, and it's a leaky-bucket-equivalent that smooths rate precisely. The redis-cell module (a Rust Redis module) ships GCRA as a single atomic command, CL.THROTTLE, returning allowed/limit/remaining/retry-after in one call.

Approximate vs. exact at scaleEven with a shared store, very high-throughput limiters often go approximate on purpose: batch local decrements and flush to Redis periodically, or let each node hold a slice of the budget and rebalance. You trade a little accuracy near the limit for far fewer round-trips. The exact ZSET log is reserved for low-volume, high-stakes limits (login attempts, payment endpoints) where every request truly must be counted.

Failure modes: what breaks at the edges#

  • Boundary burst (fixed window): up to 2× the limit across a window boundary, covered above — the reason sliding windows exist.
  • Thundering herd at window reset: with a fixed window, every client's budget refreshes at the same instant (e.g. the top of the minute). Clients that were throttled all retry together the moment the window flips, producing a synchronized spike. Token bucket and GCRA avoid this because each client's budget refills on its own continuous schedule, not a shared grid.
  • Clock skew across nodes: window and timestamp math depends on time. If two nodes disagree on 'now' by even tens of milliseconds, they'll prune and bucket differently, and a request can be double-counted or missed. Centralize time at the store (use Redis's clock, or its TIME command) rather than each node's local clock.
  • Limiter store down: if Redis is unreachable, every request is suddenly undecidable — and you must have chosen a policy in advance.
Fail-open vs. fail-closedWhen the limiter store is down, fail-open means allow all traffic (favor availability — your API stays up but is briefly unprotected, risking overload). Fail-closed means reject all traffic (favor protection — nothing gets through, so a limiter outage becomes a full outage). Most public APIs fail open with a short local fallback limit, so a Redis blip degrades to approximate per-node limiting instead of a hard 503. Security-critical limits (login, payments) often fail closed. Decide deliberately; don't let the default be an accident.

The trade-offs, in one place#

Three axes run through every decision on this page. Pick where you sit on each and the algorithm mostly falls out.

  • Accuracy vs. memory: the sliding log is exact but O(limit) per key; the sliding counter is approximate but O(1). At millions of keys, approximate almost always wins.
  • Local (approximate) vs. central (exact): a per-node in-memory limiter is fast and has no dependency but can't enforce a global limit; a shared Redis counter is globally exact but adds a round-trip and a hard dependency on every request.
  • Burst tolerance vs. smoothness: token bucket forgives idle clients a burst; leaky bucket / GCRA force a steady rate. Choose by what's downstream — a cache-friendly API can absorb bursts, a fragile legacy backend cannot.
The pragmatic defaultFor most public APIs: token bucket (or GCRA) per API key, counters in Redis with EXPIRE, fail-open to a small local limit if Redis is unavailable, and return 429 + Retry-After + the RateLimit headers so well-behaved clients self-throttle before you have to reject them.

The five algorithms side by side#

AlgorithmAccuracyMemory / keyBurst behaviorBest for
Token bucketGoodO(1)Allows bursts up to capacityPublic APIs, burst-friendly clients
Leaky bucketGoodO(1)Smooths to a steady output rateProtecting a spike-intolerant backend
Fixed windowPoor (boundary burst)O(1)Up to 2× at boundariesSimple internal limits, coarse caps
Sliding window logExactO(limit)No boundary burstLow-volume, high-stakes limits
Sliding window counterNear-exact (~0.003% err)O(1)No meaningful boundary burstHigh-scale limiting (Cloudflare)

Where rate limiting runs in the wild#

  • Stripe — token bucket as the core, with four layered strategies (a request-rate limiter, a concurrent-request limiter, a fleet-usage load shedder, and a worker-utilization limiter) so one customer's surge can't take down a payments API.
  • GitHub — token-bucket-shaped REST limits (e.g. 5,000 req/hour authenticated) advertised via X-RateLimit-Limit / -Remaining / -Reset headers and a 403 (primary limit) or 429 (secondary limit) with Retry-After when exhausted.
  • Cloudflare — the approximate sliding-window counter described above, running at the edge across millions of domains; chosen for O(1) memory and ~0.003% error over a pure sliding log.
  • nginx limit_req — the classic leaky bucket (its docs say so explicitly): requests queue and drain at a fixed rate, with an optional burst parameter and nodelay mode.
  • AWS API Gateway — token bucket with a steady rate and a burst capacity per stage/route, returning 429 on exceed; Envoy offers both a local token-bucket filter and a global gRPC rate-limit service backed by a shared store.
  • Redis cell / GCRA — redis-cell's CL.THROTTLE brings exact, single-key GCRA limiting to anything that can talk to Redis, a common building block for homegrown limiters.
The response contract: 429, Retry-After, RateLimit headersOver budget returns HTTP 429 Too Many Requests, ideally with Retry-After (seconds, or an HTTP date) so the client knows when to come back. The IETF's RateLimit header fields draft standardizes proactive signaling — RateLimit (remaining quota + reset) and RateLimit-Policy (the limits in force) — so clients can slow down before they're rejected, not just after. Signaling the limit is as much a part of the design as enforcing it.

Common questions & gotchas#

What is a 429?

HTTP status code 429, 'Too Many Requests' — the response a server returns when you're over your rate limit. It often carries a Retry-After header telling the client how long to wait before trying again, and increasingly the standardized RateLimit headers describing remaining quota.

Token bucket or leaky bucket — which one allows bursts?

Token bucket. A client that's been idle banks tokens up to the cap and can spend them all at once. A leaky bucket drains at a fixed rate no matter what, smoothing bursts into a steady trickle — better in front of something that can't handle spikes.

Why reject over-budget requests instead of just queuing them?

An unbounded queue hides the overload: latency climbs until the system falls over. A fast 429 fails cheaply and hands the decision back to the client (back off, retry later). Limiting and queuing are different tools.

How do you rate-limit across many servers?

Per-node local counters don't compose — they can't see the global total. Production limiters keep the counter in a shared store (usually Redis: INCR+EXPIRE for fixed window, a sorted set for an exact sliding log, or GCRA via redis-cell), accepting a round-trip per request for exact global enforcement. High-scale systems often go approximate instead — local counters with periodic reconciliation, or sticky-routing each client to one node — trading a little accuracy near the limit for far fewer round-trips.

QuizYou run a fixed-window limiter of 100/min, identical local counters on 10 nodes behind a round-robin load balancer. A client sends a tight burst of 100 requests. What's the worst case?

  1. Exactly 100 are allowed — the limit holds globally
  2. At most 10 are allowed — each node caps at its 1/10 share
  3. Up to ~190 can be allowed across a window boundary, and even within one window the global view is fuzzy
  4. All 100 are rejected — no node has seen this client before
Show answer

Up to ~190 can be allowed across a window boundary, and even within one window the global view is fuzzyLocal counters can't see the global total, so the LB spreading the burst across nodes lets more through than a single global counter would; and the fixed-window boundary-burst flaw still applies, so straddling a reset can approach 2× the limit. This is exactly why per-node limits don't compose and production uses a shared store or sticky routing.

In an interview#

Lead with the trade-off, not the list: every algorithm here is trading accuracy against memory. Fixed window is O(1) but bursts at boundaries; the sliding log is exact but O(limit); the sliding counter approximates the log in O(1) by assuming even distribution within a window. Token bucket and leaky bucket are a different axis entirely — they shape *when* a budget can be spent (all at once vs. smoothed) rather than just how it's measured.

Be ready to say which one you'd pick and why: token bucket for a public API that should tolerate bursts, leaky bucket in front of something that genuinely can't handle spikes, sliding window counter when you need window-boundary correctness at scale. Then expect the follow-up that separates juniors from seniors: 'now make it work across 50 servers.' Answer with the shared-store-vs-local trade, name Redis (INCR+EXPIRE, ZSET log, or GCRA/redis-cell), and bring up fail-open vs. fail-closed when that store is down.

Then try it in the simulator: pick fixed window, BURST past the limit right before a window boundary, ADVANCE_TIME just past it, and BURST again — watch the count momentarily allow close to 2× the limit. Switch to sliding window log or counter and repeat the same sequence to see the boundary burst close up.

References & further reading#

References