Load Balancer · System Internals

The big picture#

TL;DRthe 30-second version

A load balancer fronts a pool of interchangeable backends and answers one question per request: which server handles this? The answer determines how evenly load spreads and how gracefully the system degrades when a backend is slow or dead.
Algorithms range from load-blind (round-robin, weighted round-robin) to load-aware (least-connections, least-response-time/EWMA) to affinity-preserving (consistent hashing, Maglev). Each trades balance quality against the state and coordination it needs.
Power-of-two-choices (P2C) is the sweet spot: sample two backends at random, pick the less-loaded one. It inspects O(1) servers yet drives the worst-case load down exponentially versus a single random pick — near-optimal balance, almost free.
Production balancers do more than choose: health checks eject dead backends, retries and outlier detection route around brownouts, and the balancer itself is made highly available (HA pairs, anycast, DNS) so it isn't a single point of failure.
Two big architectural axes: L4 (route by IP/port, fast, protocol-agnostic) vs L7 (understand HTTP, route by path/header/cookie, terminate TLS); and centralized middlebox (nginx, HAProxy, ALB) vs client-side (gRPC, Finagle) load balancing.

Everything below expands on these points. Read the core sections top to bottom for the full mental model; the collapsible "Go deeper" boxes hold the math and edge cases you can skip on a first pass and return to later.

clientsmany, concurrent

requests →

LOAD BALANCERpick a backend per request

route by algorithm+ health checks

backend poolS1 ok · S2 ok · S3 ok · S4 SLOW

The balancer sits in front of the pool and chooses per request

Start here: the problem it solves#

One server can only handle so much traffic — a fixed budget of CPU, memory, connections, and bandwidth. Past that ceiling, latency climbs and requests start failing. The fix is horizontal scale: run many identical servers and put a balancer in front to spread work across them. That alone turns a single ceiling into a pool you can grow by adding machines.

But requests aren't uniform and servers aren't always equal. One backend might be a bigger machine, one might be garbage-collecting, one might be stuck on a slow downstream dependency, and one might have just crashed. A naive 'just rotate through them' scheme ignores all of that — it will keep handing requests to a struggling server while healthy ones sit idle, and keep sending traffic to a dead one until something notices.

So the balancer's job is really two jobs. First, a routing-around job: detect and avoid backends that are slow or failed, so a single bad server doesn't drag down the whole service. Second, a choosing job: for each request, pick the server that keeps overall load even and latency low — using only cheap, fast-to-compute information, because this decision runs on every single request at full traffic.

The core trade-offSmarter algorithms balance better but need more state (tracking in-flight counts per backend) or coordination (sharing that state across multiple balancer instances). The art is getting near-optimal balance with minimal overhead — which is exactly why power-of-two-choices is so popular.

How it works: one decision per request#

Mechanically, a load balancer maintains two things: a list of backends in the pool and, for load-aware algorithms, a small amount of live state about each (in-flight request count, recent latency, health status). For every incoming request it runs a selection function over that state, forwards the request to the chosen backend, and updates the state when the request completes.

A request arrives at the balancer (which owns the service's public IP / DNS name).
Filter the pool to backends currently marked healthy by the health-check subsystem.
Run the selection algorithm over the healthy set (round-robin pointer, least in-flight count, two random samples, hash of a key, …).
Forward the request to the chosen backend; increment its in-flight counter (for load-aware policies).
On response — or on failure/timeout — decrement the counter, record latency, and optionally retry on another backend if the request is idempotent.

Two structural choices shape everything else. The first is which layer the balancer operates at. The second is where the balancing logic lives — in a dedicated middlebox every request passes through, or inside each client.

Layer 4 (transport): the balancer forwards TCP/UDP by IP and port without parsing the payload. It's fast, cheap, and protocol-agnostic, but it can only balance whole connections, not individual HTTP requests, and can't route by URL or header. AWS NLB and IPVS work here.
Layer 7 (application): the balancer terminates the connection, parses HTTP, and can route by path, header, cookie, or method, multiplex many requests over pooled backend connections, terminate TLS, and retry at the request level. More capable, a bit more CPU per request. nginx, HAProxy, Envoy, and AWS ALB work here.
Centralized vs client-side: a middlebox (nginx/HAProxy/ALB) is one place to configure and observe, but it's an extra network hop and a thing to scale and keep available. Client-side balancing (gRPC, Twitter's Finagle, Netflix Ribbon) pushes the pick into each client using a service-discovery feed — no extra hop, but every client needs the logic and a fresh view of the pool.

The algorithms#

The selection function is where load balancers earn their name. The policies form a ladder from stateless-and-load-blind to stateful-and-load-aware to hash-based-for-affinity:

Round-robin: hand each request to the next server in rotation. Dead simple and stateless, but load-blind — a slow server keeps getting its turn and backs up.
Weighted round-robin (WRR): rotate in proportion to each server's capacity (a 2× server gets 2× the requests). Handles heterogeneous hardware, still ignores live load.
Least-connections: send each request to the backend with the fewest in-flight requests. Naturally steers away from slow/busy servers because their counts stay high — but requires tracking every server's load and scanning the whole pool per pick.
Power-of-two-choices (P2C): pick two backends at random and send to the less-loaded one. Almost as balanced as least-connections, but you only inspect two servers, not all of them — and it avoids the stampede that 'always pick the global minimum' causes across many balancers.
Least-response-time / EWMA: track an exponentially weighted moving average of each backend's latency and route to the fastest. Captures slowness that in-flight counts miss (e.g. a backend that answers but slowly), at the cost of more state and tuning. Envoy and Finagle use EWMA-based policies.
Consistent hashing / Maglev: hash a request key (client IP, session id, cache key) to a backend so the same key consistently lands on the same server — essential for cache affinity and sticky sessions. Adding or removing a backend remaps only a small fraction of keys. Maglev is Google's consistent-hashing scheme tuned for near-perfect even spread plus minimal disruption.

Why power-of-two-choices worksPicking the best of two random choices avoids the herd behavior of 'always pick the globally least-loaded' (where many balancers, each acting on the same slightly-stale view, stampede the same server) and improves the worst-case load exponentially versus a single random pick — for the cost of one extra comparison. The math is in the deep-dive below.

PredictYou run 4 servers; S4 is slow and its in-flight requests are piling up. Under plain round-robin, does S4 stop getting new requests until it catches up?

Hint: Round-robin counts turns, not load.

No. Round-robin is load-blind — it keeps handing S4 its turn regardless of how backed up it is, so the queue keeps growing. Least-connections or power-of-two-choices look at live load and route around S4.

Go deeperThe math behind power-of-two-choices

Model requests as balls and backends as bins. Throw n balls into n bins, each into a uniformly random bin. The fullest bin (the most overloaded server) ends up with about ln n / ln ln n balls with high probability — a meaningful imbalance that grows with the pool.

Now change one thing: for each ball, sample two random bins and put it in the less-full one. The maximum load collapses to about ln ln n / ln 2 + O(1). That's an exponential improvement — log n / log log n down to log log n — from a single extra sample and comparison. This is Azar, Broder, Karlin & Upfal's result, generalized and applied to load balancing in Mitzenmacher's thesis.

The crucial detail: the big jump is from one choice to two. Sampling d bins instead of 2 only changes the constant factor (max load ≈ ln ln n / ln d), so going from two to three or more buys very little. Two choices captures essentially all the benefit — which is why real systems (Envoy's default, Nginx Plus, many service meshes) sample exactly two.

Why not just always pick the global minimum? At scale you have many balancer instances (or many client-side balancers), each working from a slightly stale snapshot of load. If they all deterministically pick 'the least-loaded backend,' they all pick the same one and stampede it — converting an imbalance problem into a herd problem. The randomness in P2C breaks that synchronization while the 'best of two' still steers away from hotspots.

Cost per decision and balance quality#

Because the selection function runs on every request, its per-pick cost is a real budget. The policies differ both in that cost and in the state they must maintain:

Round-robin / weighted round-robin: O(1) per pick (advance a pointer), essentially no per-backend state. Load-blind, so balance quality is poor under heterogeneous or bursty load.
Least-connections: O(N) per pick to scan the pool for the minimum (or O(log N) with a heap), plus an in-flight counter per backend. Excellent balance, but the scan and the shared counters get expensive with large pools and multiple balancer instances.
Power-of-two-choices: O(1) per pick — two random indexes and one comparison — with the same per-backend counter as least-connections. Near-least-connections balance at constant cost, which is why it scales.
Least-response-time / EWMA: O(1)–O(N) depending on implementation, plus a running latency average per backend that must be updated on every completion.
Consistent hashing / Maglev: O(1) lookup into a precomputed hash ring or lookup table; the cost moves to building and updating the table when the pool changes. Optimizes for affinity and minimal remapping, not for live load balance.

Why P2C is the scaling winnerLeast-connections gives great balance but pays O(N) per request and needs a globally consistent view of in-flight counts — hard across many independent balancers. P2C gets within a constant factor of that balance using O(1) work and only two backends' state per pick, and its randomness tolerates stale, per-instance views. Near-optimal balance at constant cost is the whole reason it's everywhere.

Trade-offs: balance vs state vs coordination#

Every step up the algorithm ladder buys better balance with more state and more coordination. Round-robin needs nothing shared; least-connections needs an accurate, ideally global, count of in-flight requests per backend — which is exactly the thing that's hard to keep consistent when ten balancer instances are each making picks independently. P2C is popular precisely because it relaxes that requirement: approximate, per-instance state is good enough.

Smarter balancing ↔ more state: load-aware policies must track and update per-backend metrics on every request completion. That's memory, contention on shared counters, and a feedback loop that can oscillate if it reacts too fast.
Global optimum ↔ herd behavior: deterministically picking the single least-loaded backend looks ideal but synchronizes many balancers onto the same target. Randomized sampling (P2C) trades a little theoretical optimality for stability.
Affinity ↔ even load: hashing a key to a backend (sticky sessions, cache locality) means you no longer get to balance freely — a few hot keys can overload their backends regardless of the rest of the pool.
L4 simplicity ↔ L7 capability: L4 is cheap and protocol-agnostic but can't see requests; L7 can route, retry, and terminate TLS intelligently but costs more CPU and is a richer attack/parsing surface.

L4 vs L7A layer-4 balancer routes by IP/port (fast, protocol-agnostic, balances whole connections). A layer-7 balancer understands HTTP, so it can route by URL path, header, or cookie, retry individual requests, and terminate TLS — more capable, a bit more work per request. Many large deployments use both: an L4 layer for raw connection spreading and DDoS absorption, fronting an L7 layer for smart request routing.

Failure modes: the balancer and the pool#

A balancer concentrates traffic, so its failure modes are about what happens when something — a backend, the health-check loop, or the balancer itself — misbehaves. Each has a standard mitigation.

The balancer as a single point of failure: every request flows through it, so if it dies the whole service dies. Mitigation: run multiple balancer instances and fail traffic over between them — active/passive HA pairs sharing a virtual IP (VRRP/keepalived), an anycast IP advertised from many locations, or DNS returning several balancer addresses. The balancing tier must itself be redundant.
Health-check flapping: a backend hovering at the edge of its check threshold flips healthy→unhealthy→healthy repeatedly, and the pool churns with every flip. Mitigation: hysteresis — require several consecutive failures to eject and several successes to re-add, plus slow-start/ramp-up so a freshly re-added backend isn't instantly flooded.
Retry storms / thundering herd: when a backend slows, retries pile on top of the original load and can knock over the rest of the pool in a cascade. Mitigation: cap retries, use a retry budget (retries as a small percentage of total traffic), add jittered backoff, and shed load rather than queue it unboundedly.
Sticky-session imbalance: pinning clients to backends (for session or cache affinity) means load follows the keys, not the algorithm — a few heavy sessions or hot keys overload their backend while others idle, and removing that backend dumps all its sessions at once. Mitigation: prefer stateless backends with shared session storage; if you must pin, bound it and use consistent hashing so removals remap minimally.
Stale or partitioned view: a balancer acting on an out-of-date pool view can send traffic to a dead backend or, with many instances, herd onto one. Mitigation: fast health propagation, randomized policies (P2C), and outlier detection that ejects backends returning errors even if they pass the basic health check.

Active vs passive health checksActive checks probe each backend on a schedule (e.g. GET /healthz every few seconds) — clear signal, but extra traffic and a few seconds of detection lag. Passive checks (outlier detection) watch real request results and eject a backend that starts erroring or timing out — instant and free, but only sees backends currently receiving traffic. Production systems run both.

Algorithms compared#

Algorithm	Balance quality	Per-pick cost	State needed	Affinity
Round-robin	Poor (load-blind)	O(1)	None (a pointer)	No
Weighted round-robin	Fair (static weights)	O(1)	Per-backend weight	No
Least-connections	Excellent	O(N) scan	In-flight count per backend	No
Power-of-two-choices	Near-excellent	O(1)	In-flight count per backend	No
Least-response-time (EWMA)	Excellent (latency-aware)	O(1)–O(N)	Latency average per backend	No
Consistent hashing / Maglev	Fair (key-dependent)	O(1) lookup	Precomputed hash table	Yes

Read the table as a ladder: balance quality and affinity are bought with state and per-pick cost. P2C is the standout because it reaches near-least-connections balance while staying O(1) and tolerating approximate, per-instance state.

Where load balancers run in the wild#

Load balancing shows up at every tier — hardware appliances, software proxies, cloud services, and inside application clients. The vocabulary (L4/L7, the algorithm names, health checks) is shared; the implementations differ in where they run and what they optimize.

nginx / Nginx Plus — ubiquitous L7 reverse proxy; round-robin, least_conn, ip_hash, and (in Plus) least_time and a P2C 'random two' method. Often the first balancer engineers meet.
HAProxy — high-performance L4/L7 balancer; rich 'balance' algorithms (roundrobin, leastconn, source/uri hashing, random with draws=2 for P2C), detailed health checks and observability. A staple in front of web and database tiers.
Envoy — modern L7 proxy and the data plane of service meshes (Istio, Consul); weighted least-request is its O(1) P2C default, plus ring-hash, Maglev, and EWMA policies, with active health checks and passive outlier detection built in.
AWS ALB / NLB — managed cloud balancers: ALB is L7 (path/host routing, TLS, least-outstanding-requests), NLB is L4 (ultra-high throughput, static IPs). Both scale and self-heal as a service.
Google Maglev / Cloud Load Balancing — Maglev is Google's software L4 balancer using consistent hashing for connection affinity with minimal disruption on changes; it underpins Google Cloud's global load balancing (GCLB) with a single anycast VIP.
Client-side load balancing — gRPC, Twitter/X's Finagle, and Netflix Ribbon push the pick into the client using a service-discovery feed, removing the extra middlebox hop at the cost of putting balancing logic (and a fresh pool view) in every client.

Layers stackA real request often passes through several balancers: an anycast L4 layer (e.g. Maglev/NLB) absorbing connections and DDoS, an L7 layer (Envoy/ALB) doing smart request routing and TLS, and sometimes client-side balancing inside the mesh. Each layer balances at a different granularity.

Common misconceptions & gotchas#

What do L4 and L7 mean?

They're layers of the network stack. An L4 balancer routes by IP address and port — fast, protocol-agnostic, and it balances whole connections. An L7 balancer understands HTTP, so it can route by URL path, header, or cookie, retry individual requests, and terminate TLS — more capable, a bit more work per request.

What does power-of-two-choices actually buy you?

Sampling two random backends and picking the less-loaded one drops the worst-case (most overloaded) server from about log n / log log n requests down to about log log n — an exponential improvement over a single random pick — for one extra comparison. It gets near-least-connections balance at O(1) cost, and because it's randomized it avoids the herd effect where every balancer stampedes the same 'least-loaded' server.

Why not just always send to the single least-loaded server?

Two reasons. Computing the global minimum means scanning every backend (O(N)) and needing an accurate, shared view of all their loads — hard with many balancer instances. And it synchronizes them: each acts on a slightly stale snapshot, all pick the same server, and stampede it. P2C's randomness avoids both while still steering away from hotspots.

How is the load balancer itself made highly available?

By making the balancing tier redundant. Common patterns: an active/passive HA pair sharing a virtual IP that fails over (VRRP/keepalived), an anycast IP advertised from many machines so routing picks a live one, or DNS handing out several balancer addresses. The point is that no single balancer box is a single point of failure.

What makes servers 'heterogeneous'?

They differ in capacity — a bigger machine, more CPU, or some simply busier than others right now. Weighted round-robin handles fixed capacity differences; load-aware policies (least-connections, P2C, EWMA) handle live ones.

Do sticky sessions break load balancing?

They constrain it. Pinning a client to one backend means load follows the keys, so a few heavy sessions can overload their server while others idle, and losing that backend drops all its sessions at once. Prefer stateless backends with shared session storage; if you must pin, use consistent hashing so backend changes remap as few sessions as possible.

QuizYou run 50 balancer instances in front of 200 backends. Switching from 'pick the single least-loaded backend' to power-of-two-choices improves tail latency. Why?

P2C inspects all 200 backends, so it finds a better minimum
Each balancer acts on a stale load view, so all 50 deterministically stampede the same 'least-loaded' backend; P2C's randomness breaks that synchronization
P2C maintains a globally consistent in-flight count, eliminating staleness
Least-loaded selection is O(1) while P2C is O(N), so P2C does more work and balances better

Show answer

Each balancer acts on a stale load view, so all 50 deterministically stampede the same 'least-loaded' backend; P2C's randomness breaks that synchronization — With many balancers each working from a slightly stale snapshot, 'always pick the global minimum' makes them all choose the same backend and herd onto it — creating the very hotspot they were avoiding. P2C samples two backends at random and takes the better one: it still steers away from loaded servers, but the randomness desynchronizes the instances. It's also O(1), not O(N), and uses only approximate per-instance state.

In an interview#

Name the algorithms and their trade-offs: round-robin (simple, load-blind), weighted (for heterogeneous capacity), least-connections (load-aware, needs O(N) scan and shared state), power-of-two-choices (near-least-connections balance at O(1) sampling), least-response-time/EWMA (latency-aware), and consistent hashing/Maglev (for affinity). Stress that round-robin's weakness is a slow or struggling backend, which load-aware policies route around.

If you can land one crisp fact, make it P2C: sampling two random backends and taking the less-loaded one cuts the worst-case load exponentially (log n / log log n → log log n) at O(1) cost, and its randomness avoids the herd effect that plagues 'always pick the global minimum.' That single point shows you understand both the math and the systems reality of many independent balancers.

Be ready for the operational follow-ups: health checks (active probes + passive outlier detection) and ejecting bad nodes, sticky sessions vs stateless backends, L4 vs L7, retry budgets to avoid retry storms, and — almost always asked — how the balancer itself is made highly available (HA pairs, anycast, DNS). Consistent hashing as a balancing strategy for cache affinity is a common bridge to its own topic.

Then open the simulator: send REQUEST/BURST traffic and TICK to drain it under round-robin, watch the slow backend (S4) accumulate in-flight work, then switch the algorithm to least-connections or power-of-two-choices and watch the load spread even out.

References & further reading#

References

Mitzenmacher — The Power of Two Choices in Randomized Load Balancing (IEEE TPDS 2001) — the canonical analysis: two random samples cut max load to log log n
Eisenbud et al. — Maglev: A Fast and Reliable Software Network Load Balancer (NSDI 2016) — Google's consistent-hashing L4 balancer — even spread + minimal disruption
Envoy — Supported load balancers — weighted least-request (O(1) P2C), ring-hash, Maglev, EWMA in production
HAProxy — Configuration Manual (the 'balance' directive) — roundrobin, leastconn, source/uri hashing, random (draws=2) and health checks
nginx — HTTP Load Balancing guide — round-robin, least_conn, ip_hash with config examples

Ready to try it?

The simulator is a real, deterministic implementation — pick a scenario and step through it, scrubbing the timeline forward and backward through every change.

Try these in the simulator

Round-robin is load-blind →Round-robin hands every backend an equal share by count — including the slow one. So the slow backend's in-flight queue keeps growing even as the fast servers sit idle.Least-connections fixes it →The same traffic, but routing to whichever backend has the fewest in-flight requests — so the slow server simply stops being picked while it's busy, and load stays even.

Open the Load Balancer simulator →

Up nextClock Synchronization

← Back to the learning path