Reliability
Summary: A system’s ability to continue to work correctly even when things go wrong (adversity).
Sources: chapter1
Last updated: 2026-04-15
Reliability means the system continues to perform the function that the user expects, tolerates user mistakes and unexpected usage, and performs well enough for the required use case.
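Things that can go wrong are called faults, and systems that anticipate and can cope with faults are called fault-tolerant or resilient. Note that a fault is not the same as a failure. A fault is one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user. The goal is to design fault-tolerance mechanisms that keep faults from causing failures.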
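The fault/failure distinction can be sketched in a few lines of Python. This is a minimal illustration, not a production pattern: the names `ReplicaFault`, `SystemFailure`, `read_with_failover`, and the two toy replicas are all hypothetical and do not come from the source. A single replica raising an error is a fault, and the request only becomes a failure if every replica faults.

```python
class ReplicaFault(Exception):
    """One component deviating from its expected behaviour (a fault)."""


class SystemFailure(Exception):
    """The system as a whole cannot serve the request (a failure)."""


def read_with_failover(replicas, key):
    """Try each replica in order and return (value, faults_seen).

    Faults from individual replicas are recorded and tolerated.
    Only when every replica has faulted is a SystemFailure raised.
    """
    faults = []
    for replica in replicas:
        try:
            return replica(key), faults
        except ReplicaFault as exc:
            faults.append(exc)  # tolerated: one component is down, not the system
    raise SystemFailure(f"all {len(replicas)} replicas faulted for key {key!r}")


# Hypothetical replicas used only for this illustration.
def broken_replica(key):
    raise ReplicaFault("disk unreadable")


def healthy_replica(key):
    return {"user:1": "alice"}[key]


# The first replica faults, but the caller still gets an answer.
value, faults = read_with_failover([broken_replica, healthy_replica], "user:1")
```

Returning the list of tolerated faults alongside the value keeps them visible to the caller, so they can be logged or counted instead of disappearing silently.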
Types of Faults
- Hardware Faults: Hard disks crashing, RAM becoming faulty, power outages, etc. Traditionally addressed by adding redundancy to individual hardware components, so that a spare can take over when one component dies (source: chapter1).
- Software Errors: Systematic errors within the system (e.g., a software bug, a runaway process). These are harder to anticipate than hardware faults and, because they are often correlated across nodes, a single bug can take down many nodes at once (source: chapter1).
- Human Errors: Humans design and build software systems, and the operators who keep them running are also human. Even with the best intentions, humans are known to be unreliable (source: chapter1).
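A small calculation shows why redundancy helps with hardware faults but not with correlated software errors. The sketch below assumes that hardware faults on different machines are statistically independent, and the 1% fault probability is a made-up number for illustration only. The function name `prob_all_replicas_down` is likewise hypothetical.

```python
def prob_all_replicas_down(p_fault: float, replicas: int) -> float:
    """Probability that every replica is down at the same time.

    Assumes faults are independent, which is roughly reasonable for
    hardware faults but not for a bug shared by every node.
    """
    return p_fault ** replicas


# Assumed for illustration: each machine is down 1% of the time.
p = 0.01

# Independent hardware faults: three replicas must all be down together.
independent = prob_all_replicas_down(p, 3)

# Correlated software error: one bad input or bug hits every node at once,
# so extra replicas do not reduce the probability at all.
correlated = p
```

Under these assumptions, three independent replicas are all down about one time in a million, while a fully correlated fault remains at one in a hundred regardless of how many replicas are added.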
Dealing with Faults
To make systems reliable, we can:
- Design systems in a way that minimizes opportunities for error (good abstractions, APIs).
- Decouple the places where people make the most mistakes from the places where they can cause failures (sandboxes).
- Test thoroughly at all levels, from unit tests to whole-system integration tests.
- Allow quick and easy recovery from human errors (rollbacks).
- Set up detailed and clear monitoring (telemetry).
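Two of the measures above, quick recovery and monitoring, can be combined in a short sketch: apply a change, run a health check, and roll back if the check fails. Everything here is hypothetical and chosen for illustration (`ConfigStore`, `apply_with_check`, the `timeout_s` setting). A real system would check live metrics, not a single predicate.

```python
class ConfigStore:
    """Keeps every applied configuration so the previous one can be restored."""

    def __init__(self, initial):
        self._history = [dict(initial)]

    @property
    def current(self):
        return self._history[-1]

    def apply(self, changes):
        # Store a new version instead of mutating the old one in place.
        self._history.append({**self.current, **changes})

    def rollback(self):
        # Drop the latest version, but never the initial configuration.
        if len(self._history) > 1:
            self._history.pop()
        return self.current


def apply_with_check(store, changes, is_healthy):
    """Apply changes, then roll back if the health check rejects the result."""
    store.apply(changes)
    if is_healthy(store.current):
        return True
    store.rollback()
    return False


store = ConfigStore({"timeout_s": 30})

# An operator mistypes a value; the health check catches it and the
# store returns to the last known-good configuration.
ok = apply_with_check(store, {"timeout_s": -5}, lambda cfg: cfg["timeout_s"] > 0)
```

Keeping old versions immutable makes rollback a cheap pointer move, which is what allows recovery from a human error to be fast.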