System Internals
Fault Tolerance

Circuit Breaker

How a service stops piling up timeouts against a failing dependency and fails fast instead — Closed, Open, and Half-Open, with canary requests probing for recovery.

Calling a slow or broken downstream dependency doesn't just fail that one request — every thread or connection waiting on a timeout is one your service can't use for anything else. Enough of those at once and a single failing dependency takes the whole caller down with it. A circuit breaker wraps the call: once it's seen enough failures, it stops calling the dependency at all and fails immediately, then periodically lets a trickle of traffic through to check if it's safe to resume.

The big picture

TL;DR: the 30-second version
  • Named after the household electrical breaker that trips on a current spike to save the wiring: the software version trips when a dependency's failure rate spikes, so the caller's threads and connections don't burn waiting on doomed calls.
  • Three states: CLOSED (calls pass, failures counted) → OPEN (trip past a threshold; reject instantly, 'fail fast') → HALF-OPEN (after a cooldown, allow a few trial calls) → success closes it, failure reopens it.
  • The point is not to fix the dependency — it's to stop a slow/broken dependency from exhausting the caller's resources and cascading the outage upstream. It fails fast instead of failing slow.
  • It's one of a family of resilience patterns. It pairs with timeouts (bound each call), retries with backoff + jitter (handle blips), and bulkheads (isolate pools) — and is built into resilience4j, Polly, Envoy/Istio, and (historically) Netflix Hystrix.

Everything below expands on these points. Read the core sections top to bottom for the full mental model; the collapsible "Go deeper" boxes hold the tuning math and edge cases you can skip on a first pass.

[Figure: the state machine, a three-state cycle. CLOSED (requests pass through) → failures ≥ threshold over the window → OPEN (requests rejected fast) → cooldown elapsed → HALF-OPEN (let 1..k trial requests in).]

Start here: why retrying isn't enough

A naive client just calls the dependency every time and waits for a response or a timeout. If the dependency is down, every caller now blocks for the full timeout on every request — and if timeouts are long (they often are, to tolerate normal slowness), that's a lot of held threads, connections, and queued work for a call that was never going to succeed.

Worse than plain errors is slowness. A dependency that returns an error in 1 ms lets the caller move on. A dependency that hangs until a 10-second timeout holds a thread (and a connection) for those 10 seconds doing nothing. At even modest request rates, the caller's fixed-size thread pool fills with threads parked on doomed calls, and there are none left to serve healthy requests.
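A back-of-the-envelope sketch of that arithmetic, using Little's law (in-flight calls ≈ arrival rate × time each call is held). The function names and the numbers (50 requests/second, a 200-thread pool) are illustrative assumptions, not measurements from any real system.

```python
def threads_parked(requests_per_second: float, timeout_seconds: float) -> float:
    """Steady-state threads blocked when every call hangs until its timeout.

    Little's law: in-flight = arrival rate x time each call is held.
    """
    return requests_per_second * timeout_seconds


def seconds_until_exhausted(pool_size: int, requests_per_second: float) -> float:
    """Time for hanging calls to occupy every thread in a fixed-size pool.

    Only meaningful when the timeout is longer than the returned value,
    i.e. no thread is released before the pool fills.
    """
    return pool_size / requests_per_second


# Illustrative numbers only: 50 req/s against a dependency that hangs for 10 s.
demand = threads_parked(50, 10)
time_left = seconds_until_exhausted(200, 50)
print(f"hung calls want {demand:.0f} threads; a 200-thread pool is full after {time_left:.0f} s")
```

Under these assumed numbers the hung calls would need 500 threads, so a 200-thread pool is saturated within about 4 seconds of the dependency starting to hang.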

This is how outages cascade: a slow downstream service doesn't just degrade itself. It can exhaust the caller's thread pool or connection pool, which makes the caller slow for *everyone*, including requests that have nothing to do with the failing dependency. One unhealthy service can take down services that depend on it, and then services that depend on those. Retries make it worse: a struggling dependency that gets every failed call retried 3× is now handling several times its normal load at its worst possible moment.

The circuit breaker's insight: once you have strong evidence the dependency is unhealthy, the most useful thing you can do is stop calling it. Failing instantly returns the thread immediately, frees the connection, and gives the dependency room to recover instead of piling on. You trade a few requests that *might* have succeeded for the survival of the whole caller.

The three states

A circuit breaker sits in front of every call to the dependency and is always in exactly one of three states.

| State | Behavior | What gets it out of this state |
|---|---|---|
| Closed | Calls pass through normally. The breaker watches a rolling window of recent outcomes. | Failure rate in the window crosses the configured threshold → trips to Open. |
| Open | Every call is rejected immediately (fail fast); the dependency is not called at all. | A cooldown timer elapses → the next call attempt is treated as a Half-Open trial. |
| Half-Open | A small number of canary calls are let through to test the dependency for real. | Enough consecutive successes → Closed. A single failure → back to Open, cooldown restarts. |

  1. CLOSED: every call counts toward a sliding window of the last N outcomes. Once the window has enough calls to be statistically meaningful and the failure rate within it reaches the threshold, the breaker trips OPEN.
  2. OPEN: every call is rejected instantly with no network call at all. This is the 'fail fast' behavior the pattern is known for. A cooldown timer starts the moment the breaker opens.
  3. HALF-OPEN: once the cooldown elapses, the breaker (in the lazy design described here) doesn't proactively probe on its own. The very next call attempt is treated as a canary trial and is actually allowed to reach the dependency.
  4. If that canary call (or the configured number of consecutive canary calls) succeeds, the breaker assumes the dependency has recovered: it closes and resets its failure window from scratch. If any canary call fails, the dependency is still unhealthy, so the breaker goes back to OPEN and the cooldown timer restarts.

Why a failure rate over a window, not just 'the last call failed': tripping on a single failure would make the breaker wildly oversensitive to one-off blips. Requiring both a minimum number of calls and a failure-rate threshold means a couple of unlucky failures in a mostly-healthy window don't trip the breaker, but a dependency that's actually degrading does, and quickly.
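The four steps above can be sketched as a small state machine. This is a minimal, single-threaded illustration under stated assumptions, not any library's implementation: the class and parameter names are hypothetical, every exception raised by the wrapped function counts as a failure, and Half-Open trials are not capped for concurrency.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the dependency."""


class CircuitBreaker:
    def __init__(self, failure_rate=0.5, window=10, min_calls=5,
                 cooldown=5.0, trials_to_close=2, clock=time.monotonic):
        self.failure_rate = failure_rate
        self.min_calls = min_calls
        self.cooldown = cooldown
        self.trials_to_close = trials_to_close
        self.clock = clock                       # injectable so tests can fake time
        self.outcomes = deque(maxlen=window)     # True means the call failed
        self.state = "CLOSED"
        self.opened_at = 0.0
        self.trial_successes = 0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.cooldown:
                raise CircuitOpenError("failing fast")
            # Lazy transition: the first call after the cooldown is the canary.
            self.state = "HALF-OPEN"
            self.trial_successes = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state == "HALF-OPEN":
            self.trial_successes += 1
            if self.trial_successes >= self.trials_to_close:
                self.state = "CLOSED"
                self.outcomes.clear()            # start the window from scratch
        else:
            self.outcomes.append(False)

    def _on_failure(self):
        if self.state == "HALF-OPEN":
            self._trip()                         # one failed canary reopens
            return
        self.outcomes.append(True)
        enough_calls = len(self.outcomes) >= self.min_calls
        if enough_calls and sum(self.outcomes) / len(self.outcomes) >= self.failure_rate:
            self._trip()

    def _trip(self):
        self.state = "OPEN"
        self.opened_at = self.clock()            # cooldown restarts here
```

A caller would wrap each dependency call as `breaker.call(fetch_profile, user_id)` (a hypothetical function) and treat `CircuitOpenError` as the signal to use a fallback.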

One subtlety worth internalizing: a breaker does not need to understand *why* the dependency is unhealthy. It only observes outcomes — success, failure, slow-call — at its own edge. That's what makes it composable: the same breaker logic wraps an HTTP client, a database driver, a gRPC stub, or a message publisher without knowing anything about them.

Why Half-Open exists at all

Without Half-Open, a breaker would have only two options once cooled down: stay Open forever (the dependency could have recovered minutes ago and nothing would know), or jump straight back to Closed (letting the full, possibly-still-unhealthy traffic volume slam the dependency the instant the cooldown ends — the exact pile-up the breaker exists to prevent).

Half-Open is the compromise: a small, controlled number of canary requests test reality before committing either way. It's deliberately asymmetric — one failure is enough to reopen, but recovery requires multiple consecutive successes — because the cost of being wrong about 'recovered' (slamming a still-broken dependency) is higher than the cost of being wrong about 'still broken' (one more cooldown cycle).

Predict: the cooldown just elapsed and 500 requests arrive in the same instant. How many should actually reach the still-unproven dependency?

Hint: What is Half-Open trying to learn, and what happens if all 500 go through?

Only the configured trial budget — typically 1 to a handful — should be permitted; the other ~495 are still rejected as if Open. Half-Open only needs a tiny sample to decide 'recovered or not.' If all 500 went through, a dependency that's still fragile would be hit with the very surge the breaker exists to prevent (a 'thundering herd' on Half-Open), likely re-breaking it instantly. Good implementations cap concurrent Half-Open trials (resilience4j's permittedNumberOfCallsInHalfOpenState) precisely to avoid this.
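A minimal sketch of that trial budget, assuming a budget of 3. `HalfOpenGate` is a hypothetical name used only for illustration; it is not how resilience4j implements `permittedNumberOfCallsInHalfOpenState`, just the same idea in a few lines.

```python
import threading


class HalfOpenGate:
    """Admits at most `permitted` trial calls; every other caller is rejected."""

    def __init__(self, permitted: int = 3):
        self._remaining = permitted
        self._lock = threading.Lock()   # makes the check-and-decrement atomic

    def try_acquire(self) -> bool:
        with self._lock:
            if self._remaining > 0:
                self._remaining -= 1
                return True
            return False


# 500 requests arrive at once; only the trial budget gets through.
gate = HalfOpenGate(permitted=3)
admitted = sum(gate.try_acquire() for _ in range(500))
rejected = 500 - admitted
print(admitted, rejected)   # 3 497
```

The lock matters because the 500 requests would normally arrive on different threads; without it, two callers could both see a remaining permit and both proceed.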

How it tracks failures: counts, windows, and timers

Under the hood a breaker is cheap — a small amount of state updated on every call. The interesting design choice is how it remembers recent outcomes, because that determines how sensitive and how fair the tripping decision is.

  • Consecutive-count: keep a single counter of failures in a row; trip at, say, 5 consecutive failures; any success resets it to 0. Trivial (O(1) state) but blind to mixed traffic — 50% failures interleaved with successes never trip it.
  • Count-based sliding window: keep the last N outcomes (e.g. a ring buffer of 100 booleans) and trip when the failure rate over those N crosses a threshold. O(N) memory, O(1) update. Reacts to rate, not just streaks.
  • Time-based sliding window: bucket outcomes into the last T seconds (e.g. ten 1-second buckets) and trip on the failure rate within that window. Bounds how stale the evidence can be, independent of request rate — better for bursty or low-traffic services.
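To make the difference between the first two strategies concrete, here is a small sketch with hypothetical class names. It feeds both trackers the same alternating fail/success traffic (a 50% failure rate); the time-based variant is omitted for brevity.

```python
from collections import deque


class ConsecutiveCounter:
    """Trips after `limit` failures in a row; any success resets the streak."""

    def __init__(self, limit: int = 5):
        self.limit = limit
        self.streak = 0

    def record(self, failed: bool) -> bool:
        self.streak = self.streak + 1 if failed else 0
        return self.streak >= self.limit


class CountWindow:
    """Trips when the failure rate over the last `size` outcomes hits `threshold`."""

    def __init__(self, size: int = 100, threshold: float = 0.5, min_calls: int = 20):
        self.outcomes = deque(maxlen=size)   # ring buffer of booleans
        self.failures = 0                    # running count keeps updates O(1)
        self.threshold = threshold
        self.min_calls = min_calls

    def record(self, failed: bool) -> bool:
        if len(self.outcomes) == self.outcomes.maxlen:
            self.failures -= self.outcomes[0]    # oldest outcome is about to be evicted
        self.outcomes.append(failed)
        self.failures += failed
        n = len(self.outcomes)
        return n >= self.min_calls and self.failures / n >= self.threshold


# Alternating fail, success, fail, success...: a 50% failure rate with no streaks.
traffic = [i % 2 == 0 for i in range(100)]
streak, window = ConsecutiveCounter(), CountWindow()
streak_tripped = any([streak.record(f) for f in traffic])
window_tripped = any([window.record(f) for f in traffic])
print(streak_tripped, window_tripped)   # False True
```

The consecutive counter never sees more than one failure in a row, so it stays quiet; the window trips as soon as it holds the minimum 20 calls, half of which failed.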

Most production breakers also track slow calls: a call that returns successfully but took longer than a slow-call-duration threshold counts against the breaker for tripping purposes (either recorded as a failure or, as in resilience4j, measured against a separate slow-call rate threshold). This is what catches the 'not erroring, just hanging' dependency, the most dangerous case, before its slowness exhausts the pool.
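A minimal sketch of slow-call classification. The function names and the 2-second threshold are assumptions for illustration, and the sketch takes the simplest policy of treating a slow call like a failure.

```python
import time


def timed_outcome(fn, slow_after: float = 2.0, clock=time.monotonic):
    """Run fn and classify the call as 'success', 'slow', or 'failure'."""
    start = clock()
    try:
        value = fn()
    except Exception:
        return "failure", None
    elapsed = clock() - start
    # The call returned a value, but too slowly to be considered healthy.
    return ("slow" if elapsed > slow_after else "success"), value


def counts_against_breaker(outcome: str) -> bool:
    """Assumed policy: slow calls count toward tripping just like failures."""
    return outcome in ("failure", "slow")
```

The clock is injectable so the classification can be tested without actually sleeping.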

Go deeper: the tuning knobs and their interactions

Four numbers define a breaker's personality, and they trade off against each other:

  • Failure-rate threshold (e.g. 50%): lower trips sooner and protects the caller more aggressively, but rejects more borderline-healthy traffic. Higher tolerates more failure before acting.
  • Minimum throughput / minimum number of calls: the window must contain at least this many calls before a rate can trip the breaker. Without it, the first 1-of-1 failure reads as 100% and trips instantly — disastrous on a low-traffic endpoint.
  • Window size (N calls or T seconds): bigger windows are smoother and less twitchy but slower to react; smaller windows react fast but flap. Time-based windows decouple reactivity from request rate.
  • Cooldown / wait duration in Open: too short re-probes a dependency that hasn't had time to recover (and risks flapping); too long keeps rejecting traffic long after recovery. Often paired with exponential backoff so repeated failures lengthen the cooldown.

Rule of thumb: start with values that need real evidence to trip and recover conservatively: e.g. 50% failure rate over a window of ≥20 calls, a 5–30s cooldown, and 1–3 Half-Open trials. Then tune against your actual latency and traffic. The failure to avoid is a breaker so twitchy it trips on normal variance; that turns a resilience tool into an availability bug.
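The interaction between the threshold, the minimum-calls guard, and a backed-off cooldown can be sketched as follows. The names and the doubling schedule are assumptions for illustration; the defaults mirror the rule-of-thumb numbers above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakerConfig:
    failure_rate_threshold: float = 0.5   # trip at >= 50% failures
    minimum_calls: int = 20               # evidence required before any trip
    base_cooldown_s: float = 5.0
    max_cooldown_s: float = 30.0


def should_trip(cfg: BreakerConfig, failures: int, total: int) -> bool:
    """Apply the minimum-calls guard before looking at the failure rate."""
    if total == 0 or total < cfg.minimum_calls:
        return False
    return failures / total >= cfg.failure_rate_threshold


def cooldown_after(cfg: BreakerConfig, consecutive_trips: int) -> float:
    """Exponential backoff on the cooldown: base, 2x, 4x, ... up to the cap.

    `consecutive_trips` starts at 1 for the first trip.
    """
    return min(cfg.max_cooldown_s, cfg.base_cooldown_s * 2 ** (consecutive_trips - 1))


cfg = BreakerConfig()
print(should_trip(cfg, failures=1, total=1))     # False: 100% rate, but only one call
print(should_trip(cfg, failures=10, total=20))   # True: enough calls, rate at threshold
print([cooldown_after(cfg, n) for n in (1, 2, 3, 4)])   # [5.0, 10.0, 20.0, 30.0]
```

Without the `minimum_calls` check, the first line would report a trip on a single unlucky call, which is the low-traffic failure mode described above.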

Variants and the patterns it pairs with

A circuit breaker is rarely deployed alone. It's the decision-maker in a small stack of resilience patterns, each handling a different part of the problem:

  • Timeout: bounds how long any single call may block. This is the prerequisite — a breaker can't protect you from a slow dependency if individual calls never time out. Set per-call timeouts first, then wrap a breaker around them.
  • Retry with exponential backoff + jitter: handles transient blips (one dropped packet, one GC pause). Backoff spaces attempts out; jitter (randomized delay) prevents synchronized retry storms. Critically, retries belong inside the breaker — a tripped breaker should suppress retries, because retrying against a known-dead dependency is exactly the pile-on you're avoiding.
  • Bulkhead: isolates resources (separate thread pools or concurrency limits per dependency) so a flood toward one dependency can't drain the pool shared with others. Named after a ship's watertight compartments. The breaker decides *whether* to call; the bulkhead bounds *how many* concurrent calls can exist.
  • Fallback: what you return when the breaker is open or a call fails — a cached value, a default, a degraded response, or a fast error. The breaker creates the opportunity to fail gracefully; the fallback decides what 'gracefully' means.
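One way these pieces might compose, as a hedged sketch: retries use exponential backoff with full jitter, the breaker is consulted before every attempt, and the fallback is returned when the breaker is open or retries run out. The breaker is represented only by a hypothetical `breaker_is_open` callable; a real breaker would also record each outcome. All names and default values are assumptions.

```python
import random
import time


def backoff_delays(base=0.1, cap=5.0, attempts=3, rng=random.random):
    """Yield 'full jitter' delays: a random fraction of min(cap, base * 2**n)."""
    for n in range(attempts):
        yield rng() * min(cap, base * 2 ** n)


def call_with_retries(fn, breaker_is_open, fallback, retries=3,
                      sleep=time.sleep, rng=random.random):
    delays = backoff_delays(attempts=retries, rng=rng)
    for attempt in range(retries + 1):
        if breaker_is_open():
            # An open breaker suppresses the call and any remaining retries.
            return fallback()
        try:
            return fn()
        except (TimeoutError, ConnectionError):
            if attempt == retries:
                return fallback()       # retry budget exhausted
            sleep(next(delays))         # wait before the next attempt
```

Here `fn` is expected to enforce its own per-call timeout (raising `TimeoutError`), since the loop cannot interrupt a call that never returns.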

On the tripping logic itself, the main split is consecutive-count vs rate-based windows (covered above). The other axis is where the breaker lives:

  • In-process library: the breaker runs inside your application (resilience4j on the JVM, Polly on .NET, historically Hystrix). It sees real call outcomes and can run per-instance or per-dependency.
  • Out-of-process / mesh: a sidecar proxy trips on observed outcomes (Envoy's outlier detection ejects unhealthy upstream hosts; Istio configures this declaratively). No code change in the app, and it works across languages — at the cost of seeing only what crosses the proxy.

The trade-offs

A circuit breaker is fundamentally a bet that recent failures predict near-future failures. When that bet is right, it saves the system. When it's wrong, it rejects requests that would have succeeded. Every design choice is about managing that bet.

| You gain | You pay |
|---|---|
| Fail fast: held threads/connections are released immediately instead of parking on timeouts. | False rejections: while Open, you reject requests that might have succeeded if the dependency was only partially degraded. |
| Protect the dependency: a struggling service gets breathing room instead of a retry pile-on. | Tuning burden: thresholds, window size, and cooldown must match real traffic and latency, or the breaker mis-fires. |
| Contain blast radius: one unhealthy dependency can't cascade into the whole caller. | Reduced visibility: a breaker that trips silently can mask an ongoing outage unless its state changes are monitored and alerted. |
| Graceful degradation: an Open breaker is a clean hook for a fallback (cache, default, degraded mode). | Complexity: another stateful component with its own failure modes (flapping, herds) to reason about and test. |

The core tension: availability of the caller vs. availability of the feature. An Open breaker keeps the *caller* healthy by sacrificing the *feature* that depends on the broken service. That's the right call when the feature is non-essential or has a fallback, and the wrong call when the dependency is critical and a few more attempts might have gotten through. There's no universal threshold; it's a per-dependency product decision.

Where it goes wrong

  • Flapping: a breaker that closes too eagerly trips, recovers on one lucky trial, gets slammed, and trips again, oscillating Open↔Closed instead of settling. Caused by too-short cooldowns or requiring too few Half-Open successes. Fix: require multiple consecutive trial successes and use exponential backoff on the cooldown.
  • Thundering herd on Half-Open: if Half-Open admits all waiting traffic instead of a small trial budget, the first instant after cooldown re-floods a still-fragile dependency and re-breaks it. Fix: cap concurrent Half-Open trials to a handful.
  • Mis-tuned thresholds: a threshold that's too low trips on normal variance (turning resilience into self-inflicted unavailability); too high never trips until the caller has already drowned. Both come from setting numbers without measuring real failure rates and latency.
  • Tripping on transient blips: counting one-off network hiccups as breaker-worthy failures opens the circuit for problems that retries would have absorbed. Distinguish 'retryable transient' from 'breaker-worthy sustained' — let retries handle the former.
  • Wrong failure classification: counting business errors (a 404, a validation 400) as breaker failures trips the circuit when the dependency is perfectly healthy. Only count failures that indicate the dependency itself is unhealthy (5xx, timeouts, connection errors).
  • Shared vs per-instance state confusion: an in-process breaker is per-instance, so 10 caller instances each learn independently and trip at different times. That's usually fine, but don't assume a single global view unless the breaker state is genuinely shared.
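The failure-classification rule from the list above can be sketched as a single predicate. The function name is hypothetical, and the mapping to Python's built-in `TimeoutError` and `ConnectionError` is an assumption; a real HTTP client raises its own exception types that would need mapping.

```python
from typing import Optional


def counts_as_breaker_failure(status_code: Optional[int] = None,
                              error: Optional[BaseException] = None) -> bool:
    """Return True only for outcomes that suggest the dependency is unhealthy."""
    if error is not None:
        # Timeouts and connection-level errors point at the dependency itself.
        return isinstance(error, (TimeoutError, ConnectionError))
    # 5xx means the server failed; 4xx is a correct answer from a healthy server.
    return status_code is not None and 500 <= status_code <= 599


print(counts_as_breaker_failure(status_code=404))                       # False
print(counts_as_breaker_failure(status_code=503))                       # True
print(counts_as_breaker_failure(error=ConnectionRefusedError("down")))  # True
```

Feeding only these outcomes into the breaker's window keeps a burst of 404s or validation errors from opening the circuit on a healthy dependency.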

Circuit breaker vs the other resilience patterns

These are complementary, not alternatives — a robust client uses several together. The table shows what each one is actually for.

| Pattern | Problem it solves | When it acts | Relationship to the breaker |
|---|---|---|---|
| Circuit breaker | A sustained-unhealthy dependency exhausting the caller and cascading | After a threshold of failures, for a cooldown | The decision-maker: whether to even attempt the call |
| Timeout | A single call hanging indefinitely | On every call, per attempt | Prerequisite: gives the breaker bounded failures to count |
| Retry (backoff + jitter) | Transient, self-healing blips | After a single failed attempt, following a backoff delay | Handles short faults; should be suppressed when the breaker is Open |
| Bulkhead | One dependency draining a shared resource pool | Always, as a concurrency cap | Bounds how many calls can be in flight; complements the whether-to-call decision |
| Rate limiting | Too much traffic overwhelming a service (often the callee protecting itself) | Always, against a configured rate | Limits inbound load; the breaker limits outbound calls to a sick dependency |

The one-line distinction: Retry asks 'should I try again?'. Timeout asks 'how long do I wait?'. Bulkhead asks 'how many at once?'. Rate limiter asks 'how fast is too fast?'. Circuit breaker asks 'should I even bother right now?', and answers no when the evidence says the dependency is down.
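Since the bulkhead's 'how many at once?' is the question most easily confused with the breaker's, here is a minimal semaphore-based sketch of it. `Bulkhead` and `BulkheadFullError` are hypothetical names, and rejecting immediately rather than queueing is an assumed policy.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when the concurrency cap for this dependency is already reached."""


class Bulkhead:
    """Caps concurrent calls to one dependency, independent of breaker state."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn, *args, **kwargs):
        # Non-blocking acquire: reject at once instead of queueing behind slow calls.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("too many calls in flight")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()
```

A breaker and a bulkhead would typically wrap the same call: the breaker decides whether the attempt is allowed at all, and the bulkhead bounds how many allowed attempts can be in flight together.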

Where circuit breakers run in the wild

The pattern is standard equipment in most mature microservice stacks. The implementations differ mainly in where they run and how they track failures.

  • Netflix Hystrix: the implementation that popularized the pattern at scale (each command ran on its own bulkheaded thread pool with a breaker). Netflix put it into maintenance mode in 2018, ending active development in favor of lighter-weight adaptive approaches (such as its concurrency-limits library) and pointing users toward resilience4j. Still everywhere in older codebases and interview questions.
  • resilience4j: the de facto JVM successor, a lightweight, functional, modular library (separate CircuitBreaker, Retry, RateLimiter, Bulkhead, TimeLimiter modules) with count-based and time-based sliding windows and slow-call detection. A common reference implementation to learn from today.
  • Envoy / Istio outlier detection: the service-mesh approach. The sidecar proxy observes upstream host health and ejects (temporarily removes) hosts that return too many consecutive 5xx or gateway errors, re-admitting them after a base-ejection time. Language-agnostic, no app code.
  • Polly (.NET): the standard resilience library for .NET, with circuit-breaker, retry, timeout, bulkhead, and fallback policies that compose into resilience pipelines; integrated into Microsoft.Extensions.Http.Resilience.
  • AWS SDKs / cloud clients: retry behavior with backoff and (in newer SDKs) adaptive retry and circuit-breaker-style token buckets that stop retrying when a dependency is shedding load. The same idea baked into the client library.

Library vs mesh: in-process libraries (resilience4j, Polly) see true call semantics (exceptions, business vs transport errors, slow calls) and can run per-dependency. Mesh-level breakers (Envoy outlier detection) need zero code and work across languages but only see what crosses the proxy (mostly HTTP status). Many large systems use both: the mesh ejects sick hosts, the library handles per-call fallbacks.

Common misconceptions & gotchas

What problem does a circuit breaker actually solve?

Resource exhaustion and cascading failure — not the dependency's errors themselves. Its value is protecting the *caller*: by failing fast once a dependency is clearly unhealthy, it returns threads/connections immediately instead of letting them pile up on doomed, slow calls. It also gives the sick dependency room to recover by stopping the pile-on. It does not fix or heal the dependency.

What is Half-Open for?

It's the controlled recovery probe. After the cooldown, the breaker can't know whether the dependency has recovered without testing, but it can't risk sending full traffic. Half-Open lets a tiny trial budget through: success (a few in a row) closes the breaker, any failure reopens it. It avoids both staying Open forever and slamming a still-fragile dependency with the full load the instant the cooldown ends.

Circuit breaker vs retry — aren't they the same idea?

Opposite directions. A retry calls *more* (try again after a failure) and targets transient blips. A breaker calls *less* (stop calling once failures are sustained) and targets a dependency that's genuinely down. They compose: retry the occasional blip, but when retries keep failing, the breaker trips and suppresses further attempts so you stop hammering a dead service.

Consecutive-count or rate-based threshold: which should I use?

Rate-based (failure percentage over a sliding window) is the safer default for most services: it reacts to how *unhealthy* the dependency is rather than to a raw streak, and it's robust to mixed traffic. Pure consecutive-count is simple and fine for low-traffic or strictly serial calls, but it can miss a 50%-failing dependency whose failures are interleaved with successes. Time-based windows are best for bursty or low-volume endpoints because they bound staleness independent of request rate.

Should a breaker count a 404 or validation error as a failure?

No. Only count failures that mean the dependency itself is unhealthy: 5xx, timeouts, connection refused. Business errors (404, 400, 422) are correct responses from a healthy service — counting them trips the breaker for the wrong reason and breaks healthy traffic.

Quiz: your breaker trips Open correctly during an outage, but as soon as the dependency recovers it keeps oscillating Open↔Closed for several minutes before stabilizing. What's the most likely cause?

  1. The failure-rate threshold is too high
  2. The cooldown is too short and/or Half-Open closes after too few successes, so a fragile-but-recovering dependency gets re-slammed and re-trips
  3. The breaker is counting 404s as failures
  4. The timeout on each call is too long

Answer: option 2. The cooldown is too short and/or Half-Open closes after too few successes, so a fragile-but-recovering dependency gets re-slammed and re-trips.

That oscillation is flapping. Right after recovery the dependency is fragile; if the cooldown is short and Half-Open closes on a single (or too few) successes, the breaker reopens the gates too fast, the still-weak dependency buckles, and it trips again, repeatedly. The fixes are requiring several consecutive Half-Open successes before closing, capping Half-Open trial concurrency, and lengthening the cooldown (often with exponential backoff) so each recovery attempt gets more breathing room.

In an interview

Lead with the failure mode this prevents: resource exhaustion from cascading timeouts, not just 'the dependency returns errors.' A circuit breaker's value is in how fast it fails once it's open — it protects the caller's own resources, not just the user-facing error rate.

Be ready to name the three states and the specific trigger between each one. Interviewers often probe whether you understand that Half-Open is typically reached lazily (on the next call attempt after the cooldown) rather than via a background timer that proactively flips the state, although some libraries offer a timer-driven transition as an option.

Score points by placing it in the resilience family: timeouts bound each call, retries with backoff+jitter absorb transient blips, bulkheads isolate pools, and the breaker decides whether to attempt at all — and suppresses retries while Open. Mention rate-based vs consecutive-count windows and slow-call detection, and name a real implementation (resilience4j, Polly, or Envoy outlier detection; Hystrix as the historical original, now in maintenance mode).

Then try it in the simulator: send a burst of healthy calls (stays Closed), then a burst of failures (trips Open), watch requests get rejected instantly, advance time past the cooldown, and send a failing canary followed by a healthy one to see the full Open → Half-Open → Open → Half-Open → Closed cycle.
