Chapter 8: The Trouble with Distributed Systems
Summary: This chapter explores the fundamental challenges of building reliable systems on top of unreliable components, focusing on partial failures, network delays, clock drift, and process pauses.
Sources: chapter8
Last updated: 2026-04-17
Key Themes
Faults and Partial Failures
In a single-node system, software usually either works or crashes. In a distributed system, partial-failures are common: some parts of the system are broken while others work fine. These failures are non-deterministic and can be difficult to detect (source: chapter8, p. 275).
Unreliable Networks
Distributed systems communicate over asynchronous, unreliable-networks: packets can be lost, delayed, reordered, or duplicated. The only way to detect a failure is through a timeout, and a timeout cannot distinguish between a crashed node, a network fault, and a slow response (source: chapter8, p. 278).
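The ambiguity of timeouts can be sketched in a few lines. This is a hypothetical illustration, not code from the chapter: call_remote stands in for a network request, and probe reports "suspect" on timeout because we cannot tell a dead node from a slow one.

```python
import concurrent.futures
import time

def call_remote(delay_s: float) -> str:
    """Stand-in for an RPC that takes delay_s seconds to answer."""
    time.sleep(delay_s)
    return "pong"

def probe(delay_s: float, timeout_s: float) -> str:
    """Return 'ok' on a timely reply, 'suspect' on timeout.

    'suspect' does NOT mean the node is dead: it may have crashed,
    the network may have dropped a packet, or the reply may simply
    still be in flight.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(call_remote, delay_s)
        try:
            future.result(timeout=timeout_s)
            return "ok"
        except concurrent.futures.TimeoutError:
            return "suspect"

print(probe(0.01, 0.5))   # fast reply -> ok
print(probe(0.5, 0.05))   # slow reply -> suspect, yet the node is alive
```

The second call shows the core problem: the remote call eventually succeeds, but the caller has already given up and must treat the node's state as unknown.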
Unreliable Clocks
Nodes in a distributed system have their own local unreliable-clocks (quartz oscillators) which drift at different rates. Time-of-day clocks can jump backward (e.g., due to NTP synchronization), making them dangerous for ordering events across nodes. logical-clocks are often a safer alternative for ordering (source: chapter8, p. 291).
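A minimal sketch of a Lamport clock, the classic logical-clocks construction, illustrates why it is safer for ordering than a time-of-day clock: the counter only moves forward, even when a node receives a message from a peer whose counter is far ahead. The class and method names here are illustrative, not from the chapter.

```python
class LamportClock:
    """A logical clock: counts events, never jumps backward."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Local event: advance the counter by one."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Attach the current timestamp to an outgoing message."""
        return self.tick()

    def receive(self, msg_time: int) -> int:
        """On receipt, jump past the sender's timestamp if it is ahead."""
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t1 = a.send()       # a's counter becomes 1
t2 = b.receive(t1)  # b's counter becomes max(0, 1) + 1 = 2
```

Because every receive takes the maximum of the two counters, any message is guaranteed a larger timestamp than its send event, which is exactly the causal ordering a drifting quartz clock cannot provide.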
Knowledge, Truth, and Lies
In a distributed system, a node cannot know anything for sure; it can only make inferences based on the messages it receives. Truth is often defined by a quorum—a majority of nodes must agree on a fact (source: chapter8, p. 300).
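The quorum rule reduces to a one-line majority check. A sketch (hypothetical helper, assuming simple majority voting over a fixed cluster size):

```python
def has_quorum(votes: int, cluster_size: int) -> bool:
    """A fact is decided only when a strict majority of nodes agree."""
    return votes > cluster_size // 2

print(has_quorum(2, 3))  # True: 2 of 3 is a majority
print(has_quorum(2, 4))  # False: 2 of 4 is only half, not a majority
```

Requiring a strict majority means two disjoint quorums cannot exist at the same time, which is why a quorum decision stands even if the minority disagrees or is unreachable.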
- fencing-tokens: Used to ensure that a node whose lease has expired cannot perform actions that interfere with its successor (source: chapter8, p. 303).
- byzantine-faults: Nodes that may lie or act maliciously, as opposed to simply crashing (source: chapter8, p. 304).
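The fencing-tokens idea can be sketched as a storage service that remembers the highest token it has seen and rejects anything older. This class is an illustrative assumption, not the chapter's code; the token values mirror the scenario of a lease holder that pauses and resumes after its lease has been given to a successor.

```python
class FencedStorage:
    """Storage that rejects writes carrying a stale fencing token."""

    def __init__(self) -> None:
        self.max_token = -1
        self.value = None

    def write(self, token: int, value: str) -> bool:
        """Accept the write only if the token is not older than the newest seen."""
        if token < self.max_token:
            return False  # stale lease holder: fenced off
        self.max_token = token
        self.value = value
        return True

s = FencedStorage()
assert s.write(33, "a")        # current lease holder writes
assert s.write(34, "b")        # successor with a newer token writes
assert not s.write(33, "c")    # old holder resumes after a pause: rejected
```

The key design point is that the check happens at the resource, not at the lease holder: a paused process cannot know its lease expired, but the storage service can refuse its stale token.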
System Models
To reason about distributed algorithms, we use system-models that formalize assumptions about timing (synchronous, partially synchronous, asynchronous) and failures (crash-stop, crash-recovery, Byzantine) (source: chapter8, p. 306).