Fault Tolerance

Summary: A system’s ability to anticipate and cope with faults, avoiding a total failure.

Sources: chapter1

Last updated: 2026-04-15


Systems that are fault-tolerant or resilient can continue providing their service even when one or more components deviate from their spec (a fault). A fault is not the same as a failure: a failure is when the system as a whole stops providing the required service. It is usually best to design fault-tolerance mechanisms that prevent faults from causing failures (source: chapter1).

Types of Fault Tolerance

  • Hardware Redundancy: RAID configurations, dual power supplies, hot-swappable CPUs, etc. (source: chapter1).
  • Software Fault Tolerance: Systematic error handling, process isolation, and testing (source: chapter1).
  • Replication: Keeping copies of data on multiple nodes so the system can continue working even if some parts fail (source: chapter5, p. 151).
  • Chaos Engineering: Deliberately triggering faults to ensure fault-tolerance machinery is continually exercised and tested (e.g., Netflix’s Chaos Monkey) (source: chapter1).
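
As a rough illustration of how replication and deliberate fault injection fit together, the sketch below keeps a value on several in-memory replicas, marks one of them as down, and checks that reads still succeed. The names Replica, ReplicatedStore, and inject_fault are hypothetical and do not come from the source. The sketch also assumes a crashed replica never comes back, so catch-up after recovery is not modeled.

```python
import random


class Replica:
    """Hypothetical in-memory node holding one copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True

    def write(self, key, value):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        self.data[key] = value

    def read(self, key):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.data[key]


class ReplicatedStore:
    """Sends writes to every replica and reads from the first one that answers."""

    def __init__(self, replicas):
        self.replicas = replicas

    def write(self, key, value):
        acks = 0
        for replica in self.replicas:
            try:
                replica.write(key, value)
                acks += 1
            except ConnectionError:
                pass  # a faulty replica must not fail the whole write
        if acks == 0:
            raise ConnectionError("no replica available")
        return acks

    def read(self, key):
        for replica in self.replicas:
            try:
                return replica.read(key)
            except ConnectionError:
                continue  # fall through to the next copy
        raise ConnectionError("no replica available")


def inject_fault(replicas, rng):
    """Chaos-style fault injection: take down one randomly chosen replica."""
    victim = rng.choice(replicas)
    victim.alive = False
    return victim


if __name__ == "__main__":
    nodes = [Replica("a"), Replica("b"), Replica("c")]
    store = ReplicatedStore(nodes)
    store.write("user:1", "alice")
    down = inject_fault(nodes, random.Random())
    print(f"killed {down.name}; read still returns {store.read('user:1')}")
```

The fault (one replica down) is masked by the remaining copies, so it never becomes a failure visible to the caller. Only when every replica is down does read raise an error.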

Fault Tolerance in Batch Processing

Batch systems like MapReduce and dataflow engines handle faults by retrying failed tasks.

  • Determinism: For retries to be safe, operators must be deterministic (producing the same output for the same input) (source: chapter10, p. 422).
  • Materialization: MapReduce writes intermediate state to disk, allowing a failed task to be retried without restarting the entire job (source: chapter10, p. 413).
  • Lineage: Dataflow engines track the ancestry of data so that only lost partitions need to be recomputed (source: chapter10, p. 422).
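
The points above can be sketched in a few lines, under some loud assumptions: the dictionary named materialized stands in for durable intermediate output, make_flaky simulates task crashes, and run_job and run_with_retries are hypothetical helpers, not the API of any real batch framework. Because word_count is deterministic, a retried task produces the same result as an attempt that never crashed, and partitions whose output already exists are skipped rather than recomputed.

```python
def word_count(lines):
    """Deterministic task: the same input lines always give the same counts."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def make_flaky(task, failures):
    """Wrap a task so that its first `failures` calls raise (simulated crashes)."""
    remaining = {"n": failures}

    def wrapper(partition):
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise RuntimeError("simulated task crash")
        return task(partition)

    return wrapper


def run_with_retries(task, partition, max_attempts=3):
    """Retry a single task; safe only because the task is deterministic."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return task(partition)
        except RuntimeError as err:
            last_error = err
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


def run_job(task, partitions, materialized):
    """Run the task per partition, skipping partitions already materialized."""
    for pid, partition in partitions.items():
        if pid in materialized:
            continue  # output survived an earlier run, no recomputation needed
        materialized[pid] = run_with_retries(task, partition)
    return materialized


if __name__ == "__main__":
    parts = {0: ["a b a"], 1: ["b c"]}
    print(run_job(make_flaky(word_count, 2), parts, {}))
```

If the task were nondeterministic (for example, if it depended on the current time or on random numbers), a retry could produce output that disagrees with what downstream tasks had already consumed, which is why determinism is listed as a precondition for safe retries.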