Partial Failures

Summary: A defining characteristic of distributed systems, in which some components are broken while others remain functional.

Sources: chapter8

Last updated: 2026-04-17


In a single-node system, software typically behaves in a deterministic way: either it works or it crashes completely. Hardware components are designed to hide their physical reality and present an idealized system model of mathematical perfection (source: chapter8, p. 274).

In a distributed system, we must confront the messy reality of the physical world. A partial failure occurs when some parts of the system are broken in an unpredictable way while other parts are working fine. Partial failures are non-deterministic: an operation might succeed, fail, or be delayed indefinitely, and the caller has no way of knowing which (source: chapter8, p. 275).
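The caller's uncertainty can be made concrete with a small simulation. This is a minimal sketch, not taken from the source: the `Fate` enum, the `send_write` function, and the in-memory `store` list are all hypothetical names invented for illustration. It shows that several different things can happen to a remote write, yet the caller observes only a reply or silence.

```python
from enum import Enum


class Fate(Enum):
    """Hypothetical outcomes for one remote write (illustrative only)."""
    REQUEST_LOST = "request lost before reaching the server"
    SERVER_CRASHED = "server crashed before applying the write"
    RESPONSE_LOST = "write applied, but the reply was lost"
    DELIVERED = "write applied and reply delivered"


def send_write(fate, store):
    """Simulate one remote write and return what the caller observes."""
    if fate in (Fate.RESPONSE_LOST, Fate.DELIVERED):
        store.append("x=1")  # the side effect happened on the server
    # Only a delivered reply is visible; every other fate looks like silence.
    return "ok" if fate is Fate.DELIVERED else "timeout"


def observe_all():
    """Run every fate once; map it to (caller's view, was the write applied?)."""
    results = {}
    for fate in Fate:
        store = []
        seen = send_write(fate, store)
        results[fate] = (seen, bool(store))
    return results


for fate, (seen, applied) in observe_all().items():
    print(f"{fate.name:15} caller sees {seen!r:10} applied={applied}")
```

In this sketch three of the four fates look identical to the caller, and one of those three did apply the write. A caller that sees only a timeout cannot tell whether retrying is safe.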

Partial failures are difficult because, from the outside, a broken component is often indistinguishable from a slow network or a slow node: in each case the caller simply receives no response. This makes building reliable distributed systems much harder than building software for a single computer (source: chapter8, p. 276).
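The ambiguity between "slow" and "broken" shows up as soon as one tries to detect failures with a timeout. The following is a rough sketch under stated assumptions: the `probe` function, the node names, and the delay values are hypothetical, and reply delays are simulated numbers rather than real network calls.

```python
from typing import Optional


def probe(reply_delay_s: Optional[float], timeout_s: float) -> str:
    """Return what a timeout-based detector concludes about one node.

    reply_delay_s is None when the node has crashed and will never reply
    (an assumption of this simulation, not something a real caller can see).
    """
    if reply_delay_s is not None and reply_delay_s <= timeout_s:
        return "alive"
    return "suspected dead"


# Hypothetical nodes: simulated time in seconds until each one replies.
nodes = {"healthy": 0.05, "slow": 3.0, "crashed": None}

short = {name: probe(delay, timeout_s=1.0) for name, delay in nodes.items()}
long = {name: probe(delay, timeout_s=5.0) for name, delay in nodes.items()}

print("timeout 1s:", short)
print("timeout 5s:", long)
```

With the short timeout the slow node and the crashed node receive the same verdict. A longer timeout classifies the slow node correctly, but only by waiting longer before the crashed node is noticed.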