Message Queue (Kafka-style) · System Internals

The big picture#

TL;DRthe 30-second version

A log-based message queue is a durable, append-only log: producers append records to the end, consumers read forward at their own pace, and nothing is deleted when it is read.
A topic is split into partitions for parallelism. Each record gets a monotonically increasing offset within its partition, and ordering is guaranteed only inside a single partition — never across partitions.
A producer routes a record to a partition by hashing its key (same key → same partition → stays ordered); keyless records spread round-robin. Consumer groups share the work, with each partition owned by exactly one consumer in the group.
Consumers track a committed offset per partition; lag = log-end offset − committed offset. Default delivery is at-least-once, so consumers should be idempotent. Kafka, Pulsar, and Kinesis are log-based; RabbitMQ and SQS are traditional broker queues that delete on ack.

Everything below expands on these points. Read the core sections top to bottom for the full model; the collapsible "Go deeper" boxes hold the advanced internals (exactly-once, zero-copy, rebalancing protocols) you can skip on a first pass.

Topic “orders” · key → hash → partition · highlighted = last committed offset

P0012345log-end 6 · committed 4 · lag 2

P1012log-end 3 · committed 3 · lag 0

P201234log-end 5 · committed 2 · lag 3

A topic with 3 partitions, offsets, and one consumer group

Start here: the problem it solves#

Services need to hand work to each other without being tightly coupled or losing messages when a consumer is slow or down. A direct RPC call fails if the receiver is offline; an in-memory queue loses data on restart. You want a buffer that is durable, replayable, and lets producers and consumers run at completely different speeds.

Why not just call the downstream service directly? Synchronous calls couple availability — if the receiver is down or slow, the caller blocks, retries, and the failure propagates back up the chain. They couple capacity — a burst of orders that the email service can't keep up with either drops requests or topples it. And they couple topology — every new consumer of an event (fraud, analytics, search indexing) means another call the producer has to know about and wait on. The producer ends up owning the reliability and scaling problems of everyone downstream.

A log-based queue solves this by writing every message to a durable, ordered log. Producers just append; consumers read from wherever they left off, as fast or slow as they like, and can even rewind and reprocess. The log is the source of truth, not a transient pipe. Four properties fall out of that one decision:

Decoupling — the producer only knows the log. Add a new consumer (analytics, search, fraud) without touching the producer; it just starts reading the log.
Burst absorption / backpressure — a traffic spike piles up as lag in the log instead of overwhelming consumers. Consumers drain at their own rate; the log is the shock absorber.
Async work — the producer appends and returns immediately; slow processing (sending email, charging a card) happens off the request path.
Durability + replay — because messages aren't deleted on read, a consumer can reset its offset to reprocess history (reindex, backfill a new feature, recover from a bad deploy).

The trade-offYou get durability, replay, high throughput (sequential appends), and independent consumers — at the cost of only-within-partition ordering and the operational weight of partitions, offsets, and consumer-group rebalancing.

The mechanism: a partitioned, append-only log#

A topic is divided into partitions. Each partition is an independent append-only log, and every record appended to it gets a monotonically increasing offset — 0, 1, 2, 3, … — that uniquely identifies its position. The offset is the heart of the design: it is a stable address a consumer can bookmark, rewind to, or fast-forward past. Records are never modified in place; producers only ever append to the tail.

Order is guaranteed only within a partition — there is no global order across partitions. That is the central trade-off: more partitions means more parallel throughput but a weaker ordering guarantee. If you need a total order over everything, you are limited to a single partition (and thus a single consumer).

A producer chooses a record's partition by hashing its key. All records with the same key land on the same partition, so they stay in order relative to each other (e.g. all events for user:42 are ordered). Records with no key are spread round-robin (or by sticky batching) for balance, giving up ordering between them. This is the single most important design lever you control.

produce()key=user:42, value=…

route by keykey=null → round-robin instead

partitionerhash(key) % numPartitions

append →

P1 tailappend at offset 7

Producer append path: key routes to a partition, then appends

Pick the key for the ordering you needIf you need per-user ordering, key by user id. If you only care about throughput, go keyless. The partition key is how you tell the system which records must stay in order together. Note: changing the partition count later re-maps hashes, so existing keys can move partitions and break ordering for in-flight data — choose partition count with growth in mind.

Consumer groups and offsets#

Consumers read by remembering an offset per partition — their bookmark into the log. Crucially, the broker does not delete a message when it is read; the consumer just advances its committed offset. That is what makes replay possible and lets many independent consumers read the same log. The committed offset is stored durably (in Kafka, in an internal __consumer_offsets topic), so it survives consumer restarts.

A consumer group is a set of consumers that share the work of a topic. Each partition is assigned to exactly one consumer in the group, so the group processes every record once, in parallel. Because a partition has a single owner, the number of partitions caps how many consumers in a group can do useful work — extra consumers sit idle. Different groups have their own offsets and consume the same log completely independently — this is how the same event stream feeds billing, analytics, and search at once.

Each partition is owned by one consumer in the group.
A consumer reads from its partitions and advances the group's committed offset.
Lag = log-end offset − committed offset: how far behind the group is.
More consumers than partitions ⇒ the extras are idle.
Multiple groups ⇒ each reads the full log on its own offsets (pub/sub fan-out).

Lag is the metric that mattersLag — the gap between the newest offset and what the group has committed — is the single best health signal. Flat low lag means consumers keep up. Steadily rising lag means consumers are falling behind and latency is growing; it is the first thing to alert on.

Rebalancing: surviving consumer churn#

When a consumer joins or leaves (or crashes, detected by a missed heartbeat), the group rebalances: partitions are reassigned across the remaining consumers. Because committed offsets are stored per group per partition — not on the consumer — a new owner simply resumes from the last committed offset. No data is lost, though there may be a brief pause during the reassignment (a 'stop-the-world' moment in the classic eager protocol, mitigated by newer cooperative/incremental rebalancing).

At-least-once by defaultIf a consumer processes a record but crashes before committing the offset, the new owner reprocesses it — so delivery is at-least-once, and consumers should be idempotent. Committing before processing would instead risk losing records (at-most-once). The choice of when you commit relative to when you process is exactly what selects your delivery semantics.

Go deeperEager vs. cooperative (incremental) rebalancing

The original Kafka rebalance protocol is 'eager': on any membership change, every consumer revokes all of its partitions, the group leader computes a new full assignment, and everyone resumes. While this happens, the whole group stops consuming. With many consumers and frequent churn this causes 'rebalance storms' — the group spends more time rebalancing than working.

Cooperative (incremental) rebalancing, the modern default, only revokes the partitions that actually need to move. Consumers keep processing the partitions they retain, so the disruption is proportional to the change rather than total. Tuning session.timeout.ms and heartbeat.interval.ms, and using static group membership (group.instance.id) so a quick restart doesn't trigger a rebalance at all, are the standard mitigations.

Throughput, ordering, and retention#

The performance model is simple to reason about because the unit of parallelism is the partition. Producer and consumer throughput both scale roughly linearly with partition count, up to broker and network limits — want 2× the consumer parallelism, add partitions. The ceiling on a consumer group's parallelism is the partition count: P partitions means at most P working consumers.

Throughput ∝ partitions — each partition is an independent lane for appends and reads. Total parallelism across a group ≤ number of partitions.
Ordering scope = one partition — strict per-partition order, no cross-partition order. Total order requires a single partition (no parallelism).
Append/read cost — O(1) append at the tail; consumers read sequentially from an offset. No global index to update, unlike a B-tree-backed queue.
Retention — records persist by time (e.g. 7 days) or size, independent of whether they've been consumed. Log compaction is an alternative that keeps only the latest value per key.

PredictA topic has 6 partitions and a consumer group with 10 consumers. How many consumers are actually doing work, and what happens if you want more parallelism?

Hint: How many consumers can own a single partition?

Only 6 consumers do work — each owns one partition — and the other 4 sit idle, because a partition has exactly one owner within a group. To get more parallelism you must increase the partition count (e.g. to 10). Partition count is the hard cap on a group's consumer parallelism, so it's a capacity-planning decision you make up front.

Go deeperWhy a log is so fast: sequential I/O, the page cache, and zero-copy

Appending to a log is sequential disk I/O, which is an order of magnitude faster than the random I/O an in-place index incurs — a spinning disk or SSD sustains far higher sequential throughput. Because writes and reads are both sequential, the OS page cache does most of the work: recently produced records are usually still in memory when consumers read them, so reads rarely touch disk.

Kafka also uses zero-copy (the sendfile syscall): to ship a chunk of log to a consumer, the broker tells the kernel to copy bytes straight from the page cache to the network socket, skipping a round trip through user-space application buffers. Combined with batching and compression of records, this is how a single broker pushes hundreds of MB/s. The takeaway: the append-only log isn't just a nice abstraction, it's what makes the hardware fast.

Variants: logs vs. queues, and delivery semantics#

"Message queue" covers two genuinely different designs, and conflating them causes most of the confusion in this space.

Log-based (Kafka, Pulsar, Kinesis, Redpanda) — messages live in a partitioned, retained log. Reading does not delete; consumers track offsets and can replay. Optimised for high-throughput streaming and many independent consumers.
Traditional broker queues (RabbitMQ, AWS SQS, ActiveMQ) — the broker holds messages and deletes each one once a consumer acknowledges it. There is no replay and usually no per-key offset; the broker pushes messages to whichever worker is free. Optimised for task distribution and per-message routing.

Two more axes cut across both: pub/sub vs. work-queue, and push vs. pull. In pub/sub, every subscriber gets every message (Kafka's multiple consumer groups; RabbitMQ fanout exchanges). In a work-queue, a message goes to exactly one of a pool of workers (Kafka's single consumer group; an SQS queue). Push systems (RabbitMQ) have the broker shove messages at consumers; pull systems (Kafka) have consumers fetch on their own schedule, which gives consumers natural backpressure control.

Delivery semantics describe what happens around failures, and they come from where you commit the offset relative to processing:

At-most-once — commit the offset before processing. If you crash mid-process, the record is skipped on restart. Fast, lossy. Fine for disposable metrics.
At-least-once — process, then commit. A crash before commit means the record is redelivered and reprocessed. No loss, but duplicates — the common default. Make consumers idempotent.
Exactly-once — every record affects the result once, even across failures. Achieved with an idempotent producer (sequence numbers de-dupe retried appends) plus transactions that atomically tie the output writes and the offset commit together (read-process-write). Strongest guarantee, highest cost.

Go deeperHow exactly-once is actually achieved (and its limits)

Two mechanisms combine. (1) The idempotent producer tags each batch with a producer id and a per-partition sequence number; the broker rejects a duplicate sequence, so a producer retry after a network hiccup doesn't append the record twice. (2) Transactions let a producer write to multiple partitions and commit consumer offsets atomically — either all the outputs and the offset advance become visible together, or none do. Consumers reading with isolation.level=read_committed never see records from aborted or in-flight transactions.

The crucial caveat: exactly-once is end-to-end only within the Kafka boundary (Kafka-to-Kafka, as in Kafka Streams). The moment your consumer writes to an external system (charges a card, emails a user) that isn't part of the transaction, you're back to needing an idempotent side effect. In practice most teams pick at-least-once plus an idempotent consumer (dedupe on a business key) — it gets the same outcome with far less machinery.

The core trade-offs#

Every knob here is a balance between ordering, parallelism, durability, and delivery strength. There is no free lunch — you decide which property the workload needs most.

Ordering vs. parallelism — strict order needs a single partition (no parallelism); throughput needs many partitions (no cross-partition order). The partition key is where you draw the line: order only within a key.
Durability/replayability vs. storage — retaining the log enables replay and new consumers, but you pay for the disk. Time/size retention or log compaction bounds it.
Delivery guarantee vs. throughput — at-most-once is fastest, exactly-once is safest and slowest (extra coordination, transactions, lower batching efficiency). At-least-once sits in the pragmatic middle.
The common pragmatic choice — at-least-once delivery plus an idempotent consumer. You accept occasional duplicates and design processing so a duplicate is a no-op (upsert by key, dedupe table, conditional write). It captures most of exactly-once's safety at a fraction of the complexity.

Rule of thumbReach for a log-based queue when you need durability, replay, multiple independent consumers, or high streaming throughput. Reach for a traditional broker queue when you need per-message routing, priorities, or simple task distribution with the broker doing the load balancing and you don't need replay.

Failure modes & operational hazards#

Consumer lag — consumers fall behind producers and the gap grows without bound, inflating end-to-end latency and risking data aging out of retention before it's processed. Alert on lag; scale consumers (up to partition count) or add partitions.
Rebalance storms — frequent membership changes make the group spend its time reassigning partitions instead of consuming. Caused by flapping consumers, GC pauses, or short session timeouts. Mitigate with cooperative rebalancing and static membership.
Duplicate delivery — under at-least-once, a crash between processing and committing redelivers records. Expected behaviour, not a bug: handle it with idempotent processing.
Poison messages — a malformed record that throws every time it's processed blocks progress on its partition forever. Route repeated failures to a dead-letter queue (DLQ) after N retries so the partition can move on.
Head-of-line blocking — because a partition is strictly ordered and single-owned, one slow or stuck record holds up everything behind it in that partition. More partitions or a DLQ for stragglers helps.
Hot partitions — a skewed key (one celebrity user, one giant tenant) sends a disproportionate share of traffic to one partition, which becomes the bottleneck while others idle. Fix with a better key (add a sub-key/salt) or a custom partitioner.

Dead-letter queues are not optionalAny consumer that can encounter input it cannot process needs a DLQ (or a 'parking lot' topic). Without one, a single poison message stalls a partition indefinitely and silently grows your lag. With one, the bad record is set aside for inspection and the partition keeps flowing.

Kafka vs. RabbitMQ vs. SQS vs. Pulsar#

	Kafka	RabbitMQ	AWS SQS	Pulsar
Model	Partitioned log	Broker queue (AMQP)	Managed broker queue	Partitioned log + queues
Ordering	Per partition	Per queue (best-effort)	FIFO queues only	Per partition / key-shared
Replay	Yes — offsets, retention	No — deleted on ack	No — deleted on delete	Yes — cursors, retention
Delivery	At-least / exactly-once	At-least-once (+ acks)	At-least-once (FIFO: exactly-once-ish)	At-least / effectively-once
Best for	High-throughput streaming, many consumers	Flexible routing, RPC, task queues	Zero-ops cloud task queue	Streaming + queuing, multi-tenant

The split is fundamentally log vs. broker-queue. Kafka and Pulsar retain a replayable log and scale by partition; RabbitMQ and SQS are queues that delete on acknowledgement and excel at task distribution and routing. Pulsar notably offers both a streaming (log) and a queuing (shared-subscription) model on one platform, with storage separated from serving via Apache BookKeeper.

Where these run in the wild#

The log-and-queue family powers most modern event-driven and streaming architectures. The systems differ mainly in storage model, delivery options, and how much you operate yourself.

Apache Kafka — the de facto standard log. Partitioned topics, consumer groups, exactly-once via transactions, Kafka Streams / ksqlDB for stream processing. Runs LinkedIn, Netflix, Uber-scale pipelines.
Apache Pulsar — log + queue on one platform; compute (brokers) separated from storage (BookKeeper) for elastic scaling and geo-replication. Strong multi-tenancy.
RabbitMQ — the classic AMQP broker: exchanges, bindings, and queues give rich routing (direct, topic, fanout, headers). Great for task queues and request/response, not for replay.
AWS SQS / Kinesis — SQS is a fully managed broker queue (standard = at-least-once unordered, FIFO = ordered, dedup); Kinesis is a managed Kafka-like sharded log/stream. Zero servers to run.
Redis Streams — a lightweight append-only log inside Redis with consumer groups; good when you already run Redis and want modest streaming without standing up Kafka.
NATS JetStream — persistence and at-least-once on top of the very fast NATS messaging core; popular in cloud-native / edge stacks.
Google Cloud Pub/Sub — globally managed pub/sub with at-least-once delivery, push or pull, and large fan-out; the GCP analogue to SQS+Kinesis.

Common misconceptions & gotchas#

Is global ordering guaranteed across a topic?

No. Ordering is guaranteed only within a single partition. Across partitions there is no order at all. If you truly need a total order over every record, you must use a single partition — which means a single consumer and no parallelism. Usually you don't need global order; you need order within a key (per user, per account), which the partition key gives you for free.

At-least-once vs. exactly-once — which should I use?

Default to at-least-once with idempotent consumers. It's simple and fast: you accept that a record can be redelivered after a crash and design processing so a duplicate is harmless (upsert by id, dedupe table, conditional write). Exactly-once is real in Kafka (idempotent producer + transactions) but only end-to-end inside Kafka; as soon as a side effect hits an external system you need idempotency anyway. Most teams pick at-least-once and rarely regret it.

What exactly is consumer lag?

Lag = log-end offset − committed offset for a partition: how many records have been produced that the group hasn't processed yet. Low, flat lag means consumers keep up. Rising lag means they're falling behind — latency grows and, if it exceeds retention, data can be lost before it's read. It's the number-one metric to monitor and alert on.

Queue vs. log — what's the actual difference?

A traditional queue (RabbitMQ, SQS) deletes each message once a consumer acknowledges it — there's one logical reader and no history. A log (Kafka, Pulsar) keeps messages for a retention period regardless of consumption; consumers track offsets, can replay, and many independent groups can read the same data. Queues are for task distribution; logs are for durable, replayable event streams with multiple consumers.

QuizOrders for a few huge merchants dominate traffic, and you keyed the topic by merchant_id. One partition is now saturated while others idle. What's happening, and what's the fix?

Consumer lag; commit offsets more often
A hot partition from key skew; change the key (e.g. add a sub-key) or use a custom partitioner
A rebalance storm; increase the session timeout
Poison messages; add a dead-letter queue

Show answer

A hot partition from key skew; change the key (e.g. add a sub-key) or use a custom partitioner — Keying by merchant_id sends all of a giant merchant's records to one partition (hash of the key is fixed), so that partition becomes a hot spot while the rest sit idle — classic key skew. The fix is a higher-cardinality or salted key (e.g. merchant_id + order bucket) or a custom partitioner that spreads heavy keys, so load balances across partitions. Committing more often, timeouts, and DLQs address different problems.

In an interview#

Lead with the model: a durable, append-only, partitioned log where consumers track offsets. Partitions give parallelism but only per-partition ordering; key-based routing keeps related records ordered. Consumer groups share work, one owner per partition, with offsets stored per group. Name it: Kafka (also Pulsar, Kinesis, Redpanda).

Be ready to discuss: how to choose the partition key for the ordering you need, why partition count caps consumer parallelism, what lag means and how to monitor it, rebalancing on consumer churn, and delivery semantics (at-least-once by default, idempotent consumers, exactly-once via transactions). A common follow-up is retention — time/size-based, or log compaction that keeps the latest value per key. Know the queue-vs-log distinction cold, and have a story for hot partitions and dead-letter queues.

Then open the simulator: PRODUCE keyed records and watch them route to partitions, add CONSUMERs to a group and POLL to see offsets advance and lag fall, add another consumer to trigger a rebalance, and start a second group to see it consume the same log independently from the beginning.

References & further reading#

References

Jay Kreps — The Log: What every software engineer should know about real-time data's unifying abstraction — the foundational essay on the log abstraction (archived copy)
Apache Kafka — Documentation: Design & Implementation — ordering per partition, the log, and delivery semantics, from the source
Kreps, Narkhede & Rao — Kafka: a Distributed Messaging System for Log Processing (NetDB 2011) — the original Kafka paper
AWS — Amazon SQS Developer Guide — a traditional managed broker queue: standard vs. FIFO, visibility timeout, DLQs
Apache Pulsar — Messaging concepts — log + queue model, subscriptions, and cursors
RabbitMQ — AMQP 0-9-1 model concepts — exchanges, bindings, and queues — the broker-queue alternative

Ready to try it?

The simulator is a real, deterministic implementation — pick a scenario and step through it, scrubbing the timeline forward and backward through every change.

Try these in the simulator

Produce, consume, and rebalance →Produce keyed records across partitions, then watch a consumer group grow — partitions rebalance across consumers, and a second group reads the same log independently.Keys pin to partitions →Same-key records always land on the same partition (so they stay ordered); keyless records spread round-robin for balance.Two groups, independent offsets →Two consumer groups read the very same log without interfering — each tracks its own committed offset, so the second group re-reads from the start.

Open the Message Queue (Kafka-style) simulator →

Up nextHyperLogLog

← Back to the learning path