The big picture
TL;DR: the 30-second version
- Concurrency = dealing with many tasks over the same period by interleaving them (taking turns). Parallelism = literally running tasks at the same instant. Concurrency is a structure; parallelism is a hardware capability.
- One core can only run one instruction stream at a time, so it creates the illusion of many tasks by switching between them fast — a context switch. That's concurrency without any parallelism.
- Parallelism needs multiple cores. With N cores you can truly overlap N tasks, so wall-clock time drops toward the length of the longest single task instead of the sum of all of them.
- More cores rarely mean proportional speedup: work that can't be split, uneven task sizes, and time wasted blocked on I/O all leave cores idle. Amdahl's law formalizes the ceiling set by the part that can't be split, and together these effects are why the sim's 'speedup' almost never equals the core count.
Everything below expands on these four points. Read the core sections top to bottom for the full mental model; the collapsible "Go deeper" boxes hold the advanced bits (threads vs async, race conditions, the math of speedup) you can skip on a first pass and come back to later.
[Figure: one-core timeline. Each cell is one time slice; only one task runs at any instant.]
[Figure: two-core timeline. Both cores run in the same instant.]
Start here: why the distinction matters
A computer is always asked to do more than one thing at a time: serve many users, download a file while the UI stays responsive, run a background job while answering requests. The naive picture is 'it does them all at once.' But a single processor core is fundamentally sequential — it fetches one instruction, runs it, fetches the next. It cannot literally do two things in the same instant. So how does one core appear to juggle hundreds of tasks?
The answer is that it doesn't run them simultaneously — it interleaves them. It runs a little of task A, saves where it was, switches to task B, runs a little of that, switches back. The switches happen thousands of times a second, far faster than a human notices, so it looks simultaneous. That illusion is concurrency. Whether any of it is actually parallel — happening in the same instant — depends entirely on how many cores you have.
How a single core takes turns
The component that decides who runs next is the scheduler. It keeps a set of ready tasks and hands the core to one of them for a short slice of time called a quantum (often a few milliseconds). When the quantum expires — or the task voluntarily gives up the core — the scheduler performs a context switch and gives the core to the next ready task.
- A task runs on the core until its quantum runs out, it finishes, or it has to wait for something (like disk or network).
- The scheduler saves that task's state — its registers, its place in the code — so it can be resumed exactly where it left off.
- It picks the next ready task (round-robin scheduling just rotates through them fairly so none is starved) and loads its saved state.
- The core resumes that task. Repeat thousands of times a second. To a human it looks like everything runs at once.
In the simulator's one-core scenario you can watch this directly: tasks A and B alternate, one tick each, and the schedule fills A B A B A B. Only one runs at any instant — that's the defining property of concurrency on a single core. Add a second core and the picture changes completely.
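The turn-taking loop described above can be sketched in a few lines. This is only an illustration, not the simulator's code: the `round_robin` function, its `quantum` parameter, and the task sizes are hypothetical, and "saving state" is reduced to remembering how many ticks each task still needs.

```python
from collections import deque

def round_robin(tasks, quantum=1):
    """Simulate one core time-slicing tasks; return the schedule as a list of names.

    `tasks` maps a task name to the ticks of CPU work it still needs.
    """
    ready = deque(tasks.items())            # the ready queue, in arrival order
    timeline = []
    while ready:
        name, remaining = ready.popleft()   # pick the next ready task
        run = min(quantum, remaining)       # run until the quantum expires or the task finishes
        timeline.extend([name] * run)
        if remaining > run:
            # "Save state" and put the unfinished task at the back of the queue.
            ready.append((name, remaining - run))
    return timeline

schedule = round_robin({"A": 3, "B": 3})
print(" ".join(schedule))  # A B A B A B
```

Only one name occupies each slot of the timeline, which is the single-core property in code form: the tasks overlap in time span, never in any one tick.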
The cost model: speedup, utilization, and Amdahl's law
Two measurements underpin how well a schedule uses its cores. Serial time is how long the work would take on a single core: the sum of all the CPU work. Wall-clock time is how long the schedule actually takes from the first tick to the last. Speedup is their ratio, serial ÷ wall-clock: how many times faster you finished by spreading the work out.
- On one core, wall-clock equals serial for pure CPU work, so speedup is exactly 1×: interleaving rearranges the work but can't make it finish sooner. For CPU-bound work, concurrency alone buys responsiveness, not throughput.
- Add cores and wall-clock drops toward the length of the longest single task — never below it. Two equal tasks on two cores finish in half the time (2× speedup); that's parallelism paying off.
- Utilization is the fraction of available core-time actually spent doing useful work. Idle cores (no ready work) and blocked cores (stuck waiting on I/O) both drag it down — and low utilization is exactly why speedup falls short of the core count.
Predict: You run 5 tasks that each take 1 tick of CPU on a 4-core machine. Perfect parallelism would predict 5 ÷ 4 = 1.25 ticks. How long does it actually take, and what's the real speedup?
Hint: Can a single 1-tick task be split across two cores? What happens to the 5th task?
It takes 2 ticks, for a speedup of 2.5×, not 4×. Four tasks run together on tick 1, but the 5th can't be split, so it runs alone on tick 2 while the other three cores sit idle. The work didn't divide evenly across the cores, so utilization drops to 62.5% (5 useful core-ticks out of 8 available) and you get nowhere near the 4× the core count suggests. This uneven-split penalty is a close cousin of Amdahl's law: in both cases, work that can't be divided any further sets a floor on wall-clock time. It's one of the most common reasons adding cores disappoints.
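The arithmetic in this prediction can be checked with a small sketch. It assumes indivisible tasks, no I/O waits, and a greedy longest-first placement onto the least-loaded core. The function name `schedule_stats` and that placement policy are illustrative choices, not something the simulator is known to use. It also assumes at least one task with non-zero work.

```python
import heapq

def schedule_stats(task_ticks, cores):
    """Place indivisible tasks on `cores` cores (longest first) and report the cost model."""
    loads = [0] * cores                      # busy ticks assigned to each core
    heapq.heapify(loads)
    for ticks in sorted(task_ticks, reverse=True):
        lightest = heapq.heappop(loads)      # the core that frees up first
        heapq.heappush(loads, lightest + ticks)
    serial = sum(task_ticks)                 # time the work would take on one core
    wall = max(loads)                        # the busiest core finishes last
    return {
        "wall": wall,
        "speedup": serial / wall,
        "utilization": serial / (cores * wall),
    }

print(schedule_stats([1, 1, 1, 1, 1], cores=4))
# {'wall': 2, 'speedup': 2.5, 'utilization': 0.625}
```

Changing the input to four 1-tick tasks gives a wall-clock of 1 and a utilization of 1.0, which shows that the loss comes entirely from the leftover fifth task.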
Ways to actually achieve concurrency
There's more than one way to structure a program to handle many tasks. They differ in who does the switching, how expensive a switch is, and whether they can use multiple cores. Treat this as a tour; the deep dive has the trade-offs.
- Threads: the OS scheduler gives each thread time slices and preemptively switches between them. Threads can run on different cores, so they give you real parallelism, but each one costs memory, and switching between thousands of them is expensive.
- Async / event loop: a single thread juggles many tasks cooperatively. Each task runs until it hits an I/O wait, then voluntarily yields so another can run. Switches are cheap (no OS involvement), so one thread can handle tens of thousands of connections, but it's concurrency on one core, not parallelism. This is the model Redis and Node.js use.
- Multiprocessing: separate processes, each with its own memory, often one per core. True parallelism with strong isolation (a crash in one doesn't take down the others), at the cost of pricier communication between them.
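As a small illustration of the thread-based option, here is a sketch using Python's standard `ThreadPoolExecutor`. The `task` function and the barrier are illustrative devices: the barrier only opens once all four tasks have reached it, so the program can finish only if the pool really has the tasks in flight together. It shows overlap, not multi-core execution. In standard CPython builds the global interpreter lock lets only one thread run Python bytecode at a time, so threads there mainly overlap waiting rather than CPU work.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

N = 4
barrier = threading.Barrier(N)

def task(i):
    # wait() returns only once all N tasks have reached this line, so getting past
    # it shows the tasks were in flight together. The timeout turns a hang into an error.
    barrier.wait(timeout=5)
    return i * i

with ThreadPoolExecutor(max_workers=N) as pool:
    results = list(pool.map(task, range(N)))   # map() returns results in input order

print(results)  # [0, 1, 4, 9]
```

With `max_workers=1` the same program would fail at the barrier, because a single worker thread can only hold one task at a time.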
Go deeper: blocking vs non-blocking I/O, and why the event loop won
The blocking-vs-async scenario in the simulator is the crux. When a task makes a blocking I/O call — read from disk, wait on the network — a synchronous, single-threaded program is stuck: the one thread sits in that syscall doing nothing useful, holding the core hostage, and no other task can run. The sim shows this as 'blocked' core-ticks: time wasted holding a core while waiting.
Non-blocking (async) I/O fixes this. Instead of waiting, the task hands the I/O off and says 'wake me when it's done.' The single thread is now free to run other ready tasks during the wait. When the I/O completes, the event loop resumes the original task. The core stays busy with useful work the whole time — same one core, dramatically more throughput. That's the entire reason event-loop servers exist, and why the sim's async core finishes the identical workload sooner than the blocking one.
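A minimal sketch of that hand-off, using Python's `asyncio`. The `fetch` coroutine and its delays are made up for illustration, and `asyncio.sleep` stands in for a real network or disk wait. The log records the order in which things happen on the single event-loop thread.

```python
import asyncio

async def fetch(name, delay, log):
    log.append(f"start {name}")
    await asyncio.sleep(delay)      # stands in for non-blocking I/O: yields to the event loop
    log.append(f"done {name}")

async def main():
    log = []
    # Both coroutines share one thread; the loop runs B while A is still waiting.
    await asyncio.gather(fetch("A", 0.05, log), fetch("B", 0.01, log))
    return log

log = asyncio.run(main())
print(log)  # ['start A', 'start B', 'done B', 'done A']
```

B starts and finishes entirely inside A's wait. A blocking version would run A to completion before B could even start, so the total time would be the sum of the waits rather than roughly the longest one.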
The catch: async is concurrency, not parallelism. One event-loop thread still uses one core. To use all your cores you run one event loop per core (Node's cluster mode, Nginx workers). The modern high-performance recipe is both: async I/O to avoid wasting a core on waiting, times N processes to actually use all N cores.
What concurrency buys, and what it costs
Concurrency buys responsiveness and, for I/O-bound work, throughput; parallelism buys raw speed on divisible work. But both add a cost that sequential code never has: the tasks can interfere with each other.
- Strength — responsiveness: interleaving means no single slow task freezes everything. The UI stays alive while a download runs; a slow request doesn't block the other thousand.
- Strength — throughput and speed: with real parallelism, divisible work finishes faster in proportion to the cores you can keep busy.
- Cost — shared state is dangerous: when two tasks touch the same data, the order of their interleaving matters. A race condition is a bug that appears only on certain timings; a deadlock is two tasks each waiting forever for a lock the other holds. These are notoriously hard to reproduce because they depend on the scheduler.
- Cost — overhead and diminishing returns: context switches, coordination (locks), and the serial fraction all eat into the speedup. Past a point, adding concurrency makes things slower, not faster.
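One common way to avoid the deadlock described above is to make every task acquire locks in the same global order. The sketch below assumes a toy `Account` class with a numeric `ident` used only for ordering. These names are hypothetical, and lock ordering is one technique among several.

```python
import threading

class Account:
    def __init__(self, ident, balance):
        self.ident = ident
        self.balance = balance
        self.lock = threading.Lock()

def transfer(src, dst, amount):
    # Always lock the account with the smaller ident first, so two opposite
    # transfers can never each hold one lock while waiting for the other.
    first, second = sorted((src, dst), key=lambda acct: acct.ident)
    with first.lock, second.lock:
        src.balance -= amount
        dst.balance += amount

def repeat_transfer(src, dst, times):
    for _ in range(times):
        transfer(src, dst, 1)

a, b = Account(1, 100), Account(2, 100)
workers = [
    threading.Thread(target=repeat_transfer, args=(a, b, 1000)),
    threading.Thread(target=repeat_transfer, args=(b, a, 1000)),
]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(a.balance, b.balance)  # 100 100
```

If each transfer instead locked `src` first and `dst` second, the two threads could each grab one lock and wait forever for the other, which is exactly the deadlock pattern. The locks also pay the coordination cost mentioned in the last bullet: while one transfer holds them, the other thread waits.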
Comparisons at a glance
Two comparisons worth holding in your head: concurrency versus parallelism as concepts, and the two main ways to implement concurrency (threads versus async).
| | Concurrency | Parallelism |
|---|---|---|
| What it is | Dealing with many tasks by interleaving | Running many tasks in the same instant |
| Needs | Just one core (a scheduler taking turns) | Multiple cores (or machines) |
| Buys you | Responsiveness, overlapping I/O waits | Raw speed on divisible work |
| Speedup on 1 core | 1× (rearranges, doesn't shorten) | N/A — needs more cores |
| Analogy | One cook juggling several dishes | Several cooks, one dish each |

| | Threads | Async / event loop |
|---|---|---|
| Who switches | The OS, preemptively (time slices) | The program, cooperatively (at I/O waits) |
| Switch cost | Higher (kernel, cache effects) | Very low (user space) |
| Uses many cores? | Yes — real parallelism | No — one core per loop |
| Best at | CPU-bound parallel work | Massive I/O-bound concurrency |
| Examples | Java/Go threads, OS processes | Node.js, Redis, Nginx, Python asyncio |
Where this shows up
Once you can tell concurrency from parallelism, a lot of systems design vocabulary stops being mysterious.
- Redis is single-threaded — and fast. It's concurrent (an event loop juggling thousands of clients) but not parallel for command execution. Because it never blocks and never context-switches between OS threads, one core handles enormous load. This is the async story in production.
- Node.js uses one event loop per process; to use all cores you run cluster mode — N processes, one per core. Async concurrency times multiprocessing parallelism.
- Web servers (Nginx, Go services) combine both: a small pool of worker threads/processes (parallelism across cores) each handling many connections concurrently (so a slow client doesn't tie up a whole core).
- Data processing (MapReduce, Spark, GPU work) is the parallelism-heavy end: split a big divisible job across many cores or machines. Here Amdahl's law is the daily reality — the serial setup/shuffle steps cap how much the parallel part can help.
Common questions & gotchas
Is concurrency just slower parallelism?
No — they're different ideas. Concurrency is a way of structuring a program to handle many tasks by interleaving them; it works on a single core and buys responsiveness, not speed. Parallelism is physically running tasks at the same instant, which needs multiple cores and buys speed on divisible work. A single-core program can be highly concurrent (an event loop with 10,000 connections) and zero percent parallel.
If I have 8 cores, do I get an 8× speedup?
Almost never. Amdahl's law caps speedup by the serial (un-parallelizable) fraction of the work, and in practice uneven task sizes, coordination/locking, and I/O waits leave cores idle well before that. A realistic speedup might be 4–6× on 8 cores for a well-suited workload, and 1× for one that's inherently serial.
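To put numbers on that cap, here is a small sketch of the standard Amdahl's law formula. The function name `amdahl_speedup` and the 90% parallel fraction are illustrative assumptions, and the result is an upper bound that ignores uneven task sizes, locking, and I/O waits.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when `parallel_fraction` of the work scales perfectly."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

# How the bound grows with core count for a workload that is 90% parallelizable.
for cores in (1, 2, 4, 8, 64):
    print(cores, round(amdahl_speedup(0.9, cores), 2))
```

For this hypothetical workload the bound on 8 cores is roughly 4.7×, and it stays below 10× however many cores are added.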
Why is single-threaded Redis so fast if it can't use multiple cores?
Because its bottleneck is I/O and coordination, not CPU. By being a single-threaded event loop it never pays for thread context switches or locks, and it never blocks — so one core stays almost fully utilized doing useful work. For a workload dominated by waiting on the network, avoiding overhead beats adding cores. (For CPU-heavy work, Redis would lose to a parallel design — different bottleneck, different tool.)
What's a race condition, in one line?
A bug whose outcome depends on the timing of how concurrent tasks interleave — e.g. two threads both read a counter as 5, both add 1, both write 6, and one increment is lost. Because it depends on the scheduler, it may appear only rarely and under load, which is what makes it so hard to debug. Locks, atomic operations, or not sharing state at all are the fixes.
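The lost-update example can be replayed deterministically by modelling each increment as a generator that pauses between its read and its write. This is a teaching sketch, not real threading: the `increment` and `run` helpers are hypothetical, and the explicit `order` list plays the role of the scheduler.

```python
def increment(shared):
    """One increment split into its read and write halves, with a pause between them."""
    local = shared["counter"]      # read
    yield                          # a context switch may happen here
    shared["counter"] = local + 1  # write back a possibly stale value

def run(order, shared):
    """Advance two increment tasks in the given interleaving order."""
    tasks = {"T1": increment(shared), "T2": increment(shared)}
    for name in order:
        next(tasks[name], None)    # run that task up to its next pause (or to completion)
    return shared["counter"]

print(run(["T1", "T2", "T1", "T2"], {"counter": 5}))  # 6: both read 5, one update is lost
print(run(["T1", "T1", "T2", "T2"], {"counter": 5}))  # 7: each increment finishes before the next starts
```

The code is identical in both runs. Only the interleaving differs, which is why such bugs appear and disappear with timing.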
Quiz: A program spends 90% of its time on work that can be parallelized and 10% on an inherently serial step. With infinitely many cores, what is the maximum possible speedup?
- Unlimited — more cores always means more speed
- 10× — the serial 10% can never be sped up, so it dominates the limit
- 90× — proportional to the parallel fraction
- 2× — there's always a fixed cap of 2×
Answer: 10×. The serial 10% can never be sped up, so it dominates the limit. This is Amdahl's law: if 10% of the work is serial, that portion always takes the same time no matter how many cores you add. In the best case the parallel 90% shrinks to nearly zero, leaving just the serial 10%, so the whole job can run at most 1 ÷ 0.10 = 10× faster. The serial fraction, not the core count, sets the ceiling, which is why reducing the serial part often matters more than adding hardware.
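Written as a formula, the same reasoning looks like this. The symbols are introduced here for illustration and are not from the text above: p is the parallelizable fraction (0.9 in the quiz), N is the number of cores, and S(N) is the best-case speedup.

```latex
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p} = \frac{1}{1 - 0.9} = 10
```

As N grows, the p/N term vanishes and only the serial fraction 1 − p remains in the denominator, which is where the 10× ceiling comes from.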
In an interview
Lead with the definition that most people get wrong: concurrency is dealing with many tasks by interleaving them (a structure that works on a single core); parallelism is running them in the same instant (which needs multiple cores). State plainly that one core can be concurrent without ever being parallel.
Then show the cost model. Speedup is serial ÷ wall-clock, and it's bounded by Amdahl's law — the serial fraction caps how much more cores can help, and uneven work plus coordination make real speedup worse than the theoretical limit. Mentioning utilization (idle and blocked cores) shows you understand why adding cores so often underdelivers.
Close by matching the tool to the bottleneck: I/O-bound work wants async concurrency (one core, never blocking — Redis, Node) so no core is wasted waiting; CPU-bound divisible work wants real parallelism across cores or machines. The strongest answers note that high-performance servers do both: async I/O per core, times one process per core.