Reliability

Summary: A system’s ability to continue to work correctly even when things go wrong (adversity).

Sources: chapter1

Last updated: 2026-04-15

Reliability means the system continues to perform the function the user expects, tolerates user mistakes and unexpected usage, and performs well enough for the required use case.

Things that can go wrong are called faults, and systems that anticipate and cope with faults are called fault-tolerant or resilient. Note that a fault is not the same as a failure: a fault is one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user.
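The distinction between a fault and a failure can be illustrated with a minimal sketch. It assumes a read served by several redundant replicas; the names ReplicaFault, ServiceFailure, read_with_failover, broken_replica and healthy_replica are hypothetical and chosen only for this illustration, not taken from chapter1.

```python
from typing import Callable, Sequence


class ReplicaFault(Exception):
    """One component (a single replica) deviating from its expected behaviour."""


class ServiceFailure(Exception):
    """The system as a whole can no longer provide the required service."""


def read_with_failover(replicas: Sequence[Callable[[str], str]], key: str) -> str:
    """Try each replica in turn; fail only if every replica is faulty."""
    for replica in replicas:
        try:
            return replica(key)
        except ReplicaFault:
            # The fault is tolerated: fall through to the next replica.
            continue
    raise ServiceFailure(f"all {len(replicas)} replicas faulted for key {key!r}")


def broken_replica(key: str) -> str:
    # Hypothetical replica that always faults (e.g. an unreadable disk).
    raise ReplicaFault("disk unreadable")


def healthy_replica(key: str) -> str:
    # Hypothetical replica that always answers.
    return f"value-for-{key}"
```

Calling read_with_failover([broken_replica, healthy_replica], "user:1") still returns a value, so the fault never becomes a failure. Only when every replica faults does the caller see ServiceFailure.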

Types of Faults

Hardware Faults: Hard disks crashing, RAM becoming faulty, power outages, etc. Traditionally solved by adding redundancy to individual hardware components (source: chapter1).
Software Errors: Systematic errors within the system (e.g., a software bug, a runaway process). These are harder to anticipate and are often correlated across nodes (source: chapter1).
Human Errors: Humans design and build software systems, and the operators who keep them running are also human. Even with the best intentions, humans are known to be unreliable (source: chapter1).
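A rough way to see why redundancy helps with hardware faults but not with correlated software errors is the sketch below. It rests on two simplifying assumptions that are not in chapter1: hardware faults on different replicas are fully independent, and a correlated software error hits every replica at once. The function name and parameters are hypothetical.

```python
def p_all_replicas_down(p_fault: float, replicas: int, independent: bool = True) -> float:
    """Probability that every replica is faulty at the same time.

    Assumption: independent faults multiply (p ** n); a fully correlated
    fault takes down all replicas together, so redundancy gives no benefit.
    """
    if replicas < 1:
        raise ValueError("need at least one replica")
    if independent:
        return p_fault ** replicas
    return p_fault
```

Under these assumptions, three replicas that each fault 1% of the time are all down together about one time in a million, while a bug shared by all three still takes the system down 1% of the time.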

Dealing with Faults

To make systems reliable, we can:

Design systems in a way that minimizes opportunities for error (good abstractions, APIs).
Decouple the places where people make the most mistakes from the places where they can cause failures (sandboxes).
Test thoroughly at all levels, from unit tests to whole-system integration tests.
Allow quick and easy recovery from human errors (rollbacks).
Set up detailed and clear monitoring (telemetry).
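As one illustration of quick recovery from human error, the sketch below applies a configuration change and rolls it back automatically if a health check rejects it. ConfigStore, apply and the health_check callback are hypothetical names for this example; a real deployment system would be considerably more involved.

```python
from typing import Callable, Dict, List

Config = Dict[str, str]


class ConfigStore:
    """Keeps a history of configurations so a bad change can be undone."""

    def __init__(self, initial: Config) -> None:
        self._history: List[Config] = [dict(initial)]

    @property
    def current(self) -> Config:
        return self._history[-1]

    def apply(self, change: Config, health_check: Callable[[Config], bool]) -> bool:
        """Apply a change; roll back and return False if the check fails."""
        candidate = {**self.current, **change}
        self._history.append(candidate)
        if health_check(candidate):
            return True
        self._history.pop()  # rollback: the previous config is current again
        return False
```

For example, with store = ConfigStore({"timeout_s": "30"}), applying {"timeout_s": "0"} under a check that requires a positive timeout returns False and leaves the original configuration in place.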
