The big picture#
TL;DRthe 30-second version
- There is no shared 'now' across machines. Quartz clocks drift at different rates, so timestamps from different hosts are never directly comparable.
- Cristian's algorithm asks a trusted time server, measures round-trip time (RTT), and estimates true time as the server's reply plus RTT/2 — with error bounded by RTT/2.
- The Berkeley algorithm has no authority: a coordinator polls everyone, averages the readings (discarding outliers), and tells each node a delta to apply — converging the cluster to its own internal average, not to real time.
- NTP scales Cristian's idea to the internet: a hierarchy of strata syncing down from atomic/GPS reference clocks, with statistical offset and delay filtering. PTP, GPS, and TrueTime push accuracy from milliseconds to microseconds.
- You can bound clock error but never eliminate it. When ordering must be exact, logical clocks (Lamport, vector) order events causally without relying on synchronized physical time.
Read the core sections top to bottom for the full mental model. The collapsible "Go deeper" box on logical clocks holds the alternative paradigm — ordering events without trustworthy physical time — that you can skip on a first pass and return to later.
Start here: there is no global clock#
If every machine's clock were perfectly accurate, distributed systems would be much simpler — timestamps could be compared directly, TTLs would expire exactly on time, 'last write wins' would actually mean the most recent write won. None of that holds in practice. There is no central clock every machine reads from; each host keeps its own time using a quartz crystal oscillator, and those crystals are imperfect.
A quartz clock drifts. Its frequency depends on temperature, supply voltage, and physical aging, so a typical computer clock gains or loses on the order of tens of parts per million — roughly a few milliseconds every few seconds in the worst case, and seconds per day even in normal conditions. Two machines that agree exactly at noon will visibly disagree by evening. Crucially, they drift at different rates and in different directions, so the gap between any two clocks keeps changing.
- Event ordering breaks: if machine A timestamps an event 'after' an event on machine B, but A's clock runs slow, the timestamps can claim the wrong order — A's later event looks earlier.
- TTLs and cache expiry misfire: a 60-second TTL set by one server and checked by another expires early or late depending on the skew between them.
- Leases and locks become unsafe: a node may believe it still holds a lease that the grantor already considers expired, so two nodes act as leader at once.
- 'Last write wins' picks the wrong winner: LWW conflict resolution (Cassandra, DynamoDB) compares wall-clock timestamps, so a write from a fast-running clock can clobber a genuinely newer write from a slow one.
Cristian's algorithm: sync to an authority#
Cristian's algorithm (Flaviu Cristian, 1989) assumes one thing the Berkeley algorithm doesn't: a server whose clock is trusted — synced to a GPS receiver or an atomic clock, for instance. A client wanting the correct time talks to it directly and corrects for the round trip.
- The client records T0 — the time on its OWN (possibly wrong) clock — and sends a request to the server.
- The server stamps its reply with Ts, its own current time, and sends it back immediately.
- The client records T1 — again on its own clock — the moment the reply arrives.
- Round-trip time RTT = T1 − T0. Assuming the network delay was symmetric (half the RTT each way), the server's time at the moment the reply is processed is Ts + RTT/2 — the server's clock adjusted forward for the trip back. The client sets its clock to that.
Because the accuracy is bounded by RTT/2, real implementations send several requests and keep the reply with the smallest RTT — the one least perturbed by queuing and congestion. They also slew rather than jump: instead of slamming the clock to the new value (which could make time run backward), they speed up or slow down the local clock until it converges, preserving monotonicity for code that assumes time only moves forward.
The Berkeley algorithm: sync to each other#
Sometimes there's no authoritative clock at all — just a set of peer machines, all equally fallible. The Berkeley algorithm (Gusella & Zatti, 1989) designates one of them a coordinator (the 'time daemon') not because it's more accurate, but because something has to collect the readings and do the math.
- The coordinator polls every node (including itself) for its current time, using the same RTT/2 correction Cristian's algorithm uses for each poll.
- It computes a robust average — typically discarding any reading too far from the others as a fault — so a node whose clock is wildly off (or just plain broken) can't drag the result toward nonsense.
- It averages the remaining, trusted readings into a single reference value.
- Instead of telling each node to jump to that absolute value, it sends each node a delta — 'advance by X' or 'go back by Y' — which the node applies to its own clock (by slewing, never a backward jump). Sending relative adjustments rather than absolute times keeps every node's correction tailored to its own current skew.
NTP: Cristian's idea, scaled to the internet#
The Network Time Protocol (NTP, currently NTPv4 in RFC 5905) is the algorithm that actually keeps the world's computers roughly in sync. At its core it is Cristian's algorithm — query a server, measure the round trip, correct by half — wrapped in a hierarchy and a pile of statistical filtering to make it robust over the messy public internet.
- Strata form a tree. Stratum 0 is a reference clock (atomic clock, GPS receiver). Stratum 1 servers are directly attached to those. Stratum 2 servers sync from stratum 1, stratum 3 from stratum 2, and so on — each level one hop further from the source and slightly less accurate.
- Four timestamps per exchange. NTP records the client's send time, the server's receive time, the server's transmit time, and the client's receive time. From these it computes both the clock offset (how far off the client is) and the round-trip delay — the server's processing time is subtracted out, not blamed on the network.
- Filtering and selection. NTP collects many samples, keeps the ones with the lowest delay (least jitter), and runs intersection/clustering algorithms across multiple servers to reject 'falsetickers' (lying or broken servers) and combine the survivors.
- Discipline, not jumps. NTP normally slews the local clock — adjusting its rate via a feedback loop — so time stays monotonic. It only steps the clock for large offsets at startup.
The payoff: NTP typically holds clocks within a few milliseconds to tens of milliseconds over the open internet, and well under a millisecond on a quiet LAN — all from the same RTT/2 estimate, made trustworthy by sampling and filtering.
Accuracy is bounded by the network#
The fundamental limit on synchronizing physical clocks is not the algorithm — it's the network. Because a client can only ever measure the round trip, not the one-way delay, the uncertainty in 'what time is it really?' is bounded by the asymmetry and variability of the path. You cannot sync two clocks more precisely than your uncertainty about how long messages take.
- WAN / open internet: round trips of tens of milliseconds, variable queuing → typical accuracy of a few to tens of milliseconds with NTP.
- LAN / same datacenter: sub-millisecond RTTs and low jitter → sub-millisecond accuracy with well-tuned NTP.
- PTP (IEEE 1588) with hardware timestamps: timestamps taken at the network-interface hardware remove software/OS jitter → microsecond or even sub-microsecond accuracy.
- GPS / atomic reference clocks: a local stratum-0 source removes the network from the critical path almost entirely → tight, externally-accurate time.
The accuracy ladder — and an escape hatch#
Once NTP is understood, the rest of the landscape is mostly 'how do we shrink the uncertainty further?' — plus one genuinely different idea: stop trying to sync physical clocks at all.
- NTP — software timestamps over IP; milliseconds on the WAN, sub-ms on a LAN. The default everywhere.
- PTP / IEEE 1588 — hardware timestamping at the NIC, designed for LANs; microsecond to sub-microsecond accuracy. Used where time is money: finance, telecom, datacenters.
- GPS / atomic reference clocks — local stratum-0 sources (GPS-disciplined oscillators, rubidium/cesium clocks) that give tightly bounded, externally-accurate time without depending on a distant server.
- Google TrueTime (Spanner) — combines GPS and atomic clocks across datacenters and, crucially, exposes time as an interval [earliest, latest] with a bounded uncertainty ε, rather than pretending to know the exact instant.
Go deeperGo deeper: logical clocks — ordering events without synced time
Physical clock sync can be bounded but never made perfect, so for many problems we don't actually need the real time — we need to know the ORDER of events, specifically which event could have caused which. Logical clocks (Lamport, 1978) provide that ordering directly, with no reference to wall-clock time at all.
A Lamport clock is just a counter per process. Each process increments it before every local event. When it sends a message it attaches the counter; when it receives one, it sets its counter to max(local, received) + 1. The rule guarantees: if event A happened-before event B (A could have influenced B), then timestamp(A) < timestamp(B). The converse does not hold — a smaller timestamp does NOT prove causality — so Lamport clocks give a consistent total order but cannot tell true concurrency from causality.
Vector clocks fix that. Each process keeps a vector with one counter per process. It increments its own entry on each event and, on receive, takes the element-wise max then increments its own. Now you can compare two vectors: if every entry of A is ≤ B's (and at least one is strictly less), A happened-before B; if neither dominates, the events are genuinely concurrent. The cost is size — the vector grows with the number of participants. This is the machinery behind causal consistency and conflict detection in systems like Dynamo and Riak.
The trade is clear: physical clocks tell you when (real time, but only approximately); logical clocks tell you what-before-what (exactly, but with no notion of real elapsed time). Hybrid Logical Clocks (HLC) and TrueTime blend the two — keeping a bounded relationship to real time while preserving causal ordering — which is why they back externally-consistent databases.
The unavoidable trade-offs#
Every approach to time in distributed systems is a choice among the same tensions: how close to real time, how strong the ordering guarantee, and how much latency or hardware you're willing to spend to get there.
- Bounded, never perfect: physical clocks can be kept within a known error, but that error is never zero. Designs that assume exact agreement (comparing raw timestamps for ordering) are bugs waiting to fire.
- Physical accuracy vs logical correctness: NTP/PTP give you approximate real time but can still misorder near-simultaneous events; logical/vector clocks give exact causal ordering but no real time. Pick the one the problem actually needs.
- External accuracy vs no authority: Cristian's/NTP pull toward a trusted source (accurate, but you must have and trust one); Berkeley needs no authority (robust to a missing reference, but only internally consistent).
- Latency for correctness — commit-wait: TrueTime makes time honest by exposing uncertainty ε, but Spanner must then WAIT out that uncertainty (a few milliseconds) before committing a transaction, so that no later transaction can possibly be assigned an earlier timestamp. You buy external consistency with a small, deliberate delay on every commit.
How clock sync goes wrong#
- Asymmetric latency: the RTT/2 assumption fails whenever the request and reply paths differ, silently biasing the estimate by half the asymmetry — undetectable from the round trip alone.
- Drift between syncs: a clock is only corrected at each sync; between syncs it free-runs and drifts, so accuracy decays until the next correction. Longer sync intervals mean larger worst-case skew.
- Leap seconds: UTC occasionally inserts a leap second, creating a 23:59:60 that naive code can't represent — historically the cause of real outages. Many operators 'smear' the leap second over hours (AWS, Google) so the clock never has to repeat or skip a second.
- Clock-skew bugs in algorithms: lease and lock protocols that compare timestamps across machines can let two nodes both believe they hold a lease if their clocks disagree — a classic source of split-brain.
- Jumping vs slewing: stepping a clock to a new value can make time go backward, breaking monotonic assumptions (durations measured as negative, ordering inverted). Slewing — adjusting the rate so the clock catches up gradually — avoids this but converges slowly.
Cristian vs Berkeley vs NTP vs logical clocks#
| Cristian's | Berkeley | NTP | Logical clocks | |
|---|---|---|---|---|
| Authority | One trusted server | None — peers only | Hierarchy of strata to a reference clock | None — counters only |
| Goal | Match the authority | Internal agreement | Match real (UTC) time | Order events causally |
| External accuracy | As good as the source | None (averages itself) | ms (WAN) to sub-ms (LAN) | Not applicable — no real time |
| Correction | Set to Ts + RTT/2 | Apply per-node delta | Slew via feedback loop | max() on send/receive |
| Best for | Client with a time source | Leaderless cluster, no source | The internet, general use | Exact causal ordering |
Where clock sync runs in the wild#
- NTP — everywhere. Every laptop, phone, and server runs an NTP (or chrony) client syncing to pool.ntp.org or a vendor service. It's the invisible default that keeps log timestamps and TLS certificate checks roughly honest.
- PTP in finance and datacenters — exchanges and trading systems use PTP with hardware timestamps to order trades to the microsecond (regulations like MiFID II mandate tight clock accuracy), and large datacenters use it for high-precision telemetry.
- Google Spanner / TrueTime — TrueTime's bounded-uncertainty interval plus commit-wait is what lets Spanner offer externally-consistent, globally-distributed transactions: it assigns commit timestamps and waits out ε so the order is provably correct across continents.
- AWS Time Sync Service — Amazon runs satellite-connected and atomic reference clocks in every region, reachable from EC2, with a PTP hardware clock for microsecond accuracy and automatic leap-second smearing.
PredictSpanner uses GPS and atomic clocks but still deliberately makes every transaction wait a few milliseconds before committing. Why pay that latency if the clocks are so accurate?
Hint: TrueTime returns an interval, not an instant.
Because even great clocks have uncertainty, TrueTime returns an interval [earliest, latest] with width ε, not an exact instant. To guarantee that no future transaction can be assigned an earlier timestamp (external consistency), Spanner picks a commit timestamp and then WAITS until that timestamp is guaranteed to be in the past everywhere — i.e., it waits out ε. The commit-wait turns 'we don't know the exact time' into 'we are certain the order is correct,' at the cost of a few ms per commit.
Common misconceptions & gotchas#
Why can't we just sync clocks perfectly?
Because you can only measure round-trip time, never the one-way delay, and the path out may not equal the path back. The true time always lies within a window bounded by the network uncertainty (RTT/2 plus asymmetry). More samples shrink that window but never close it — and between syncs the clock drifts again. Bounded, not zero, is the best physics allows.
Cristian's vs Berkeley — what's the real difference?
Cristian's assumes a trusted authority and pulls clients toward it, so it can be externally accurate. Berkeley assumes no authority and averages the peers, so it's only internally consistent — if the whole cluster drifts, Berkeley happily converges on the wrong time. Cristian's needs a source; Berkeley survives without one.
Physical vs logical clocks — when do I need which?
Use physical clocks (NTP/PTP) when you need approximate real time: log timestamps, TTLs, scheduling. Use logical clocks (Lamport/vector) when you need exact ordering of causally related events and can't trust wall-clock comparisons — conflict resolution, causal consistency, distributed debugging. Many systems use both.
How does Spanner use clocks?
Spanner's TrueTime exposes time as a bounded interval with uncertainty ε (backed by GPS + atomic clocks). It assigns each transaction a commit timestamp and then waits out ε (commit-wait) before releasing locks, guaranteeing that any transaction starting later gets a strictly later timestamp. That's how it achieves external consistency across datacenters.
QuizA client uses Cristian's algorithm. The request to the server consistently takes 2 ms but every reply takes 18 ms due to asymmetric routing. RTT is 20 ms. How far off is the client's clock estimate?
- 0 ms — RTT/2 cancels the error out
- About 8 ms — it assumes 10 ms each way but the reply took 18 ms
- 20 ms — the full round trip is the error
- Impossible to say without the server's stratum
Show answer
About 8 ms — it assumes 10 ms each way but the reply took 18 ms — Cristian's sets the clock to Ts + RTT/2 = Ts + 10 ms, assuming the reply took 10 ms. The reply actually took 18 ms, so the client's clock ends up ~8 ms behind the truth (half the asymmetry: (18 − 2)/2 = 8). Nothing in the round-trip measurement reveals this — it's the asymmetry trap, and it's why low-jitter, symmetric links matter.
In an interview#
Open with the headline: there is no global clock — quartz oscillators drift, so timestamps from different machines are never directly comparable, and that breaks ordering, TTLs, leases, and last-write-wins. Then name the fork in the road: is there a trusted time source? If yes, Cristian's algorithm (or NTP, its real-world descendant). If every node is a peer, Berkeley's averaging — internally consistent but not externally accurate.
Be ready to derive RTT/2 unprompted: it's the mechanism every physical-sync algorithm depends on, and the error bound (±RTT/2) is the whole point. The most common follow-up is 'what if the delay isn't symmetric?' — the estimate is off by half the asymmetry, undetectable from the round trip. Mention the accuracy ladder (NTP ms → PTP µs → TrueTime bounded interval) to show range.
Land the key insight: you can bound clock error but never eliminate it, so when ordering must be exact, reach for logical clocks (Lamport for total order, vector clocks for true concurrency detection) instead of wall-clock timestamps. If the conversation turns to databases, explain TrueTime + commit-wait as the way Spanner buys external consistency with a few milliseconds of latency. Then open the simulator: let drift build up, sync, and watch the skew close; switch to Berkeley and watch an outlier node get excluded from the average.
References & further reading#
- Cristian — Probabilistic Clock Synchronization (Distributed Computing, 1989) — the original paper behind 'Cristian's algorithm' and the RTT/2 estimate
- RFC 5905 — Network Time Protocol Version 4: Protocol and Algorithms — the authoritative NTPv4 spec: strata, four-timestamp exchange, filtering
- Lamport — Time, Clocks, and the Ordering of Events in a Distributed System (1978) — introduces happened-before and logical (Lamport) clocks
- Corbett et al. — Spanner: Google's Globally-Distributed Database (OSDI 2012) — TrueTime, bounded uncertainty ε, and commit-wait for external consistency
- Amazon Time Sync Service — EC2 precision clock & time synchronization — reference clocks per region, PTP hardware clock, leap-second smearing
- Kleppmann — Designing Data-Intensive Applications, Ch. 8 (Unreliable Clocks) — book-length treatment of clock skew, LWW hazards, and logical time
Ready to try it?
The simulator is a real, deterministic implementation — pick a scenario and step through it, scrubbing the timeline forward and backward through every change.