Concurrency vs Parallelism

Why one core taking turns is not the same as many cores working at once — and why more cores so often disappoints.

These two words get used interchangeably, and that confusion quietly breaks people's mental model of every system that does more than one thing at a time. Concurrency is about structure: dealing with many tasks by interleaving them so they all make progress. Parallelism is about execution: actually running tasks at the same instant, which needs more than one core. A single core can be highly concurrent and never once be parallel. Get this distinction straight and scheduling, async I/O, threads, and the limits of 'just add more cores' all fall into place.

The big picture

TL;DR: the 30-second version
  • Concurrency = dealing with many tasks over the same period by interleaving them (taking turns). Parallelism = literally running tasks at the same instant. Concurrency is a structure; parallelism is a hardware capability.
  • One core can only run one instruction stream at a time, so it creates the illusion of many tasks by switching between them fast — a context switch. That's concurrency without any parallelism.
  • Parallelism needs multiple cores. With N cores you can truly overlap N tasks, so wall-clock time drops toward the length of the longest single task instead of the sum of all of them.
  • More cores rarely means proportional speedup: work that can't be split, uneven task sizes, and time wasted blocked on I/O all leave cores idle. The part that can't be split sets a hard ceiling (Amdahl's law), and the other two push real speedup further below it, which is why the sim's 'speedup' almost never equals the core count.

Everything below expands on these four points. Read the core sections top to bottom for the full mental model; the collapsible "Go deeper" boxes hold the advanced bits (threads vs async, race conditions, the math of speedup) you can skip on a first pass and come back to later.

Figure (one core, concurrency): the two tasks take turns. Each cell is one time slice; only one task runs at any instant.
  Core 1: A B A B A B (finishes at tick 6, the SUM)

Figure (two cores, parallelism): the two tasks really overlap. Both cores run in the same instant.
  Core 1: A A A
  Core 2: B B B (finishes at tick 3, the MAX)

Start here: why the distinction matters

A computer is always asked to do more than one thing at a time: serve many users, download a file while the UI stays responsive, run a background job while answering requests. The naive picture is 'it does them all at once.' But a single processor core is fundamentally sequential — it fetches one instruction, runs it, fetches the next. It cannot literally do two things in the same instant. So how does one core appear to juggle hundreds of tasks?

The answer is that it doesn't run them simultaneously — it interleaves them. It runs a little of task A, saves where it was, switches to task B, runs a little of that, switches back. The switches happen thousands of times a second, far faster than a human notices, so it looks simultaneous. That illusion is concurrency. Whether any of it is actually parallel — happening in the same instant — depends entirely on how many cores you have.

The one-sentence version
Concurrency is about how a program is structured to handle many things; parallelism is about how many of them physically run at once. Rob Pike put it well: concurrency is dealing with lots of things at once; parallelism is doing lots of things at once. You can have either without the other.

How a single core takes turns

The component that decides who runs next is the scheduler. It keeps a set of ready tasks and hands the core to one of them for a short slice of time called a quantum (often a few milliseconds). When the quantum expires — or the task voluntarily gives up the core — the scheduler performs a context switch and gives the core to the next ready task.

  1. A task runs on the core until its quantum runs out, it finishes, or it has to wait for something (like disk or network).
  2. The scheduler saves that task's state — its registers, its place in the code — so it can be resumed exactly where it left off.
  3. It picks the next ready task (round-robin scheduling just rotates through them fairly so none is starved) and loads its saved state.
  4. The core resumes that task. Repeat thousands of times a second. To a human it looks like everything runs at once.

Context switches aren't free
Saving one task's state and loading another's takes real time, and it trashes the CPU's caches (the new task's data isn't there yet). A few switches are negligible; millions of them (say, ten thousand threads all fighting over a few cores) burn a noticeable share of the CPU on pure bookkeeping instead of useful work. This overhead is one reason async/event-loop designs (which switch cheaply, in user space) often beat thread-per-task designs under heavy load.

In the simulator's one-core scenario you can watch this directly: tasks A and B alternate, one tick each, and the schedule fills A B A B A B. Only one runs at any instant — that's the defining property of concurrency on a single core. Add a second core and the picture changes completely.
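
The turn-taking described above can be sketched in a few lines. This is a toy model, not the simulator's actual code: the function name `round_robin`, the `quantum` parameter, and the task sizes are all assumptions chosen to reproduce the A B A B A B schedule.

```python
from collections import deque

def round_robin(tasks, quantum=1):
    """Simulate ONE core running `tasks` round-robin (hypothetical helper).

    `tasks` maps a task name to the CPU ticks it still needs.
    Returns the timeline: which task ran during each tick.
    """
    ready = deque(tasks.items())   # queue of (name, remaining ticks)
    timeline = []
    while ready:
        name, remaining = ready.popleft()
        ran = min(quantum, remaining)
        timeline.extend([name] * ran)        # only this task runs during these ticks
        remaining -= ran
        if remaining > 0:
            ready.append((name, remaining))  # "context switch": back of the queue
    return timeline

schedule = round_robin({"A": 3, "B": 3})
print("".join(schedule))  # ABABAB
print(len(schedule))      # 6 ticks: the sum of both tasks
```

The model ignores the cost of each switch, so it shows the ordering only. A larger `quantum` means fewer switches but longer waits before a task gets its next turn.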

The cost model: speedup, utilization, and Amdahl's law

Two numbers describe how well a schedule uses its cores. Serial time is how long the work would take on a single core — the sum of all the CPU work. Wall-clock time is how long the schedule actually takes from the first tick to the last. Speedup is serial ÷ wall-clock: how many times faster you finished by spreading the work out.

  • On one core running pure CPU work, wall-clock equals serial, so speedup is exactly 1×: interleaving rearranges the work but can't make it finish sooner. For CPU-bound work, concurrency alone buys responsiveness, not throughput.
  • Add cores and wall-clock drops toward the length of the longest single task — never below it. Two equal tasks on two cores finish in half the time (2× speedup); that's parallelism paying off.
  • Utilization is the fraction of available core-time actually spent doing useful work. Idle cores (no ready work) and blocked cores (stuck waiting on I/O) both drag it down — and low utilization is exactly why speedup falls short of the core count.

Amdahl's law: the hard ceiling on 'just add cores'
The speedup from more cores is capped by the part of the work that can't be parallelized. If 10% of a job is inherently serial, then even with infinite cores you can never go faster than 10×, because that serial 10% always has to run alone. In practice you hit the wall much sooner: uneven task sizes and coordination leave cores idle long before the theoretical limit. This is why doubling your cores almost never doubles your speed.
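
Written as a formula, the ceiling looks like this. The symbols are notation chosen for this sketch, not taken from the text: p is the fraction of the work that can be parallelized, N is the number of cores, and S(N) is the speedup.

```latex
\[
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}
\]
% With p = 0.9 (10% serial), the limit is 1 / 0.1 = 10.
```

The serial term (1 - p) does not shrink as N grows, so it dominates once p/N becomes small.
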

Predict
You run 5 tasks that each take 1 tick of CPU on a 4-core machine. Perfect parallelism would predict 5 ÷ 4 = 1.25 ticks. How long does it actually take, and what's the real speedup?

Hint: Can a single 1-tick task be split across two cores? What happens to the 5th task?

It takes 2 ticks, for a speedup of 2.5× (5 ticks of serial work ÷ 2 ticks of wall-clock), not 4×. Four tasks run together on tick 1, but the 5th can't be split, so it runs alone on tick 2 while the other three cores sit idle. The work didn't divide evenly across the cores, so utilization drops to 62.5% (5 busy core-ticks out of 8 available) and you get nowhere near the 4× the core count suggests. This uneven-split penalty is a close relative of Amdahl's law: an indivisible task behaves like serial work that extra cores can't touch. It's one of the most common reasons adding cores disappoints.
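
The arithmetic can be checked with a small scheduling sketch. It assumes indivisible, CPU-only tasks and a simple greedy placement (each task goes to whichever core frees up first); `schedule_stats` is a hypothetical helper, and the greedy rule is not guaranteed to be optimal for arbitrary task sizes.

```python
def schedule_stats(task_ticks, cores):
    """Place indivisible tasks on `cores` cores and report the cost model.

    Returns (wall_clock, speedup, utilization).
    """
    finish = [0] * cores                       # tick at which each core becomes free
    for ticks in sorted(task_ticks, reverse=True):
        i = finish.index(min(finish))          # core that frees up first
        finish[i] += ticks
    serial = sum(task_ticks)                   # time on a single core
    wall = max(finish)                         # last core to finish sets wall-clock
    speedup = serial / wall
    utilization = serial / (cores * wall)      # busy core-ticks / available core-ticks
    return wall, speedup, utilization

print(schedule_stats([1, 1, 1, 1, 1], cores=4))  # (2, 2.5, 0.625)
print(schedule_stats([3, 3], cores=2))           # (3, 2.0, 1.0)
print(schedule_stats([3, 3], cores=1))           # (6, 1.0, 1.0)
```

The last two calls correspond to the two-core and one-core pictures: two equal tasks reach 2× on two cores, and exactly 1× on one.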

Ways to actually achieve concurrency

There's more than one way to structure a program to handle many tasks. They differ in who does the switching, how expensive a switch is, and whether they can use multiple cores. Treat this as a tour; the deep dive has the trade-offs.

  • Threads — the OS scheduler gives each thread time slices and switches between them (preemptively). Threads can run on different cores, so they give you real parallelism — but each one costs memory, and switching between thousands of them is expensive.
  • Async / event loop — a single thread juggles many tasks cooperatively: each task runs until it hits an I/O wait, then voluntarily yields so another can run. Switches are cheap (no OS involvement), so one thread can handle tens of thousands of connections — but it's concurrency on one core, not parallelism. This is Redis and Node.js.
  • Multiprocessing — separate processes, each with its own memory, often one per core. True parallelism with strong isolation (a crash in one doesn't take down the others), at the cost of pricier communication between them.

Go deeper: blocking vs non-blocking I/O, and why the event loop won

The blocking-vs-async scenario in the simulator is the crux. When a task makes a blocking I/O call — read from disk, wait on the network — a synchronous, single-threaded program is stuck: the one thread sits in that syscall doing nothing useful, holding the core hostage, and no other task can run. The sim shows this as 'blocked' core-ticks: time wasted holding a core while waiting.

Non-blocking (async) I/O fixes this. Instead of waiting, the task hands the I/O off and says 'wake me when it's done.' The single thread is now free to run other ready tasks during the wait. When the I/O completes, the event loop resumes the original task. The core stays busy with useful work the whole time — same one core, dramatically more throughput. That's the entire reason event-loop servers exist, and why the sim's async core finishes the identical workload sooner than the blocking one.
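
A minimal sketch of that hand-off using Python's asyncio, where `asyncio.sleep` stands in for a real network or disk wait. The `fetch` coroutine, the task names, and the delays are illustrative assumptions.

```python
import asyncio

async def fetch(name, delay, log):
    """Pretend I/O task: start, wait without holding the thread, finish."""
    log.append(f"{name} start")
    await asyncio.sleep(delay)   # yields to the event loop instead of blocking
    log.append(f"{name} done")
    return name

async def main():
    log = []
    # Both tasks run on ONE thread; the loop switches at each await.
    results = await asyncio.gather(
        fetch("A", 0.02, log),
        fetch("B", 0.01, log),
    )
    return results, log

results, log = asyncio.run(main())
print(results)  # ['A', 'B']
print(log)      # ['A start', 'B start', 'B done', 'A done']
```

B starts while A is still waiting, and B finishes first because its wait is shorter. With blocking calls in place of the `await`, the log would read A start, A done, B start, B done, and the total time would be the sum of the waits rather than the longest one.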

The catch: async is concurrency, not parallelism. One event-loop thread still uses one core. To use all your cores you run one event loop per core (Node's cluster mode, Nginx workers). The modern high-performance recipe is both: async I/O to avoid wasting a core on waiting, times N processes to actually use all N cores.

What concurrency buys, and what it costs

Concurrency buys responsiveness and, when tasks spend time waiting on I/O, throughput; parallelism buys raw speed on divisible work. But both add a cost that sequential code never has: the tasks can interfere with each other.

  • Strength — responsiveness: interleaving means no single slow task freezes everything. The UI stays alive while a download runs; a slow request doesn't block the other thousand.
  • Strength — throughput and speed: with real parallelism, divisible work finishes faster in proportion to the cores you can keep busy.
  • Cost — shared state is dangerous: when two tasks touch the same data, the order of their interleaving matters. A race condition is a bug that appears only on certain timings; a deadlock is two tasks each waiting forever for a lock the other holds. These are notoriously hard to reproduce because they depend on the scheduler.
  • Cost — overhead and diminishing returns: context switches, coordination (locks), and the serial fraction all eat into the speedup. Past a point, adding concurrency makes things slower, not faster.
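
One common guard for shared state is a lock around the read-modify-write. The sketch below uses Python's `threading.Lock`; the `Counter` class, the `hammer` function, and the thread and iteration counts are illustrative assumptions, not something the text prescribes.

```python
import threading

class Counter:
    """A shared counter whose increment is protected by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:       # only one thread at a time may read-modify-write
            self.value += 1

def hammer(counter, times):
    for _ in range(times):
        counter.increment()

counter = Counter()
threads = [threading.Thread(target=hammer, args=(counter, 10_000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.value)  # 40000
```

The lock is also an example of the coordination cost: threads that want the counter at the same moment have to wait their turn, so that part of the work is effectively serial.
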

The mental rule of thumb
Use concurrency to stay responsive and to overlap waiting (I/O). Use parallelism to crunch divisible CPU-bound work faster. If your bottleneck is waiting on the network or disk, async concurrency on one core is often the win; if it's heavy computation, you need real parallelism across cores, and even then, only up to what Amdahl's law allows.

Comparisons at a glance

Two comparisons worth holding in your head: concurrency versus parallelism as concepts, and the two main ways to implement concurrency (threads versus async).

| | Concurrency | Parallelism |
| --- | --- | --- |
| What it is | Dealing with many tasks by interleaving | Running many tasks in the same instant |
| Needs | Just one core (a scheduler taking turns) | Multiple cores (or machines) |
| Buys you | Responsiveness, overlapping I/O waits | Raw speed on divisible work |
| Speedup on 1 core | 1× (rearranges, doesn't shorten) | N/A (needs more cores) |
| Analogy | One cook juggling several dishes | Several cooks, one dish each |

| | Threads | Async / event loop |
| --- | --- | --- |
| Who switches | The OS, preemptively (time slices) | The program, cooperatively (at I/O waits) |
| Switch cost | Higher (kernel, cache effects) | Very low (user space) |
| Uses many cores? | Yes (real parallelism) | No (one core per loop) |
| Best at | CPU-bound parallel work | Massive I/O-bound concurrency |
| Examples | Java/Go threads, OS processes | Node.js, Redis, Nginx, Python asyncio |

Where this shows up

Once you can tell concurrency from parallelism, a lot of systems design vocabulary stops being mysterious.

  • Redis is single-threaded — and fast. It's concurrent (an event loop juggling thousands of clients) but not parallel for command execution. Because it never blocks and never context-switches between OS threads, one core handles enormous load. This is the async story in production.
  • Node.js uses one event loop per process; to use all cores you run cluster mode — N processes, one per core. Async concurrency times multiprocessing parallelism.
  • Web servers (Nginx, Go services) combine both: a small pool of worker threads/processes (parallelism across cores) each handling many connections concurrently (so a slow client doesn't tie up a whole core).
  • Data processing (MapReduce, Spark, GPU work) is the parallelism-heavy end: split a big divisible job across many cores or machines. Here Amdahl's law is the daily reality — the serial setup/shuffle steps cap how much the parallel part can help.

Why interviewers love this
Claiming 'we'll just add more servers/cores to go faster' without mentioning the serial fraction or coordination cost is a classic red flag. Showing you know that speedup is bounded by Amdahl's law, and that I/O-bound and CPU-bound problems need different tools (async vs parallelism), signals real understanding.

Common questions & gotchas

Is concurrency just slower parallelism?

No — they're different ideas. Concurrency is a way of structuring a program to handle many tasks by interleaving them; it works on a single core and buys responsiveness, not speed. Parallelism is physically running tasks at the same instant, which needs multiple cores and buys speed on divisible work. A single-core program can be highly concurrent (an event loop with 10,000 connections) and zero percent parallel.

If I have 8 cores, do I get an 8× speedup?

Almost never. Amdahl's law caps speedup by the serial (un-parallelizable) fraction of the work, and in practice uneven task sizes, coordination/locking, and I/O waits leave cores idle well before that. A realistic speedup might be 4–6× on 8 cores for a well-suited workload, and 1× for one that's inherently serial.
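
To put rough numbers on this, the sketch below evaluates Amdahl's law for 8 cores at a few parallel fractions. The function name and the sample fractions are assumptions chosen for illustration; the result is the theoretical ceiling, before uneven task sizes, locking, or I/O waits take their share.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when `parallel_fraction` of the work scales with cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

for p in (0.5, 0.9, 0.95, 1.0):
    print(p, round(amdahl_speedup(p, 8), 2))
# 0.5 1.78
# 0.9 4.71
# 0.95 5.93
# 1.0 8.0
```

Only a perfectly parallel job (p = 1.0) reaches 8×. At 90% parallel the ceiling on 8 cores is already below 5×.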

Why is single-threaded Redis so fast if it can't use multiple cores?

Because its bottleneck is I/O and coordination, not CPU. By being a single-threaded event loop it never pays for thread context switches or locks, and it never blocks — so one core stays almost fully utilized doing useful work. For a workload dominated by waiting on the network, avoiding overhead beats adding cores. (For CPU-heavy work, Redis would lose to a parallel design — different bottleneck, different tool.)

What's a race condition, in one line?

A bug whose outcome depends on the timing of how concurrent tasks interleave — e.g. two threads both read a counter as 5, both add 1, both write 6, and one increment is lost. Because it depends on the scheduler, it may appear only rarely and under load, which is what makes it so hard to debug. Locks, atomic operations, or not sharing state at all are the fixes.
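
The lost increment can be reproduced deterministically by modelling each task as a generator and choosing the interleaving by hand. This is a simulation of the scheduler's choices, not real threads; `increment`, `run`, and the task names T1 and T2 are hypothetical.

```python
def increment(shared):
    """One task's read-modify-write, with a switch point in the middle."""
    seen = shared["counter"]       # read
    yield                          # the scheduler may switch tasks here
    shared["counter"] = seen + 1   # write back a possibly stale value

def run(schedule):
    """Step two increment tasks in the order given by `schedule`."""
    shared = {"counter": 5}
    tasks = {"T1": increment(shared), "T2": increment(shared)}
    for name in schedule:
        next(tasks[name], None)    # advance that task by one step
    return shared["counter"]

print(run(["T1", "T1", "T2", "T2"]))  # 7: each task finishes before the other starts
print(run(["T1", "T2", "T1", "T2"]))  # 6: both read 5, so one increment is lost
```

The code is identical in both runs; only the order of the steps changes, which is why the bug depends on timing.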

Quiz
A program spends 90% of its time on work that can be parallelized and 10% on an inherently serial step. With infinitely many cores, what is the maximum possible speedup?

  1. Unlimited — more cores always means more speed
  2. 10× — the serial 10% can never be sped up, so it dominates the limit
  3. 90× — proportional to the parallel fraction
  4. 2× — there's always a fixed cap of 2×

Answer: option 2, 10×. The serial 10% can never be sped up, so it dominates the limit.

This is Amdahl's law. If 10% of the work is serial, that portion always takes the same time no matter how many cores you add. In the best case the parallel 90% shrinks to nearly zero, leaving just the serial 10%, so the whole job can at most run 1 ÷ 0.10 = 10× faster. The serial fraction, not the core count, sets the ceiling. This is why reducing the serial part often matters more than adding hardware.

In an interview

Lead with the definition that most people get wrong: concurrency is dealing with many tasks by interleaving them (a structure that works on a single core); parallelism is running them in the same instant (which needs multiple cores). State plainly that one core can be concurrent without ever being parallel.

Then show the cost model. Speedup is serial ÷ wall-clock, and it's bounded by Amdahl's law — the serial fraction caps how much more cores can help, and uneven work plus coordination make real speedup worse than the theoretical limit. Mentioning utilization (idle and blocked cores) shows you understand why adding cores so often underdelivers.

Close by matching the tool to the bottleneck: I/O-bound work wants async concurrency (one core, never blocking — Redis, Node) so no core is wasted waiting; CPU-bound divisible work wants real parallelism across cores or machines. The strongest answers note that high-performance servers do both: async I/O per core, times one process per core.
