Fault Tolerance
Summary: A system’s ability to anticipate and cope with faults, avoiding a total failure.
Sources: chapter1
Last updated: 2026-04-15
Systems that are fault-tolerant or resilient can continue providing their service even when one or more components (faults) are deviating from their spec. It is usually best to design fault-tolerance mechanisms that prevent faults from causing failures (source: chapter1).
Types of Fault Tolerance
- Hardware Redundancy: RAID configurations, dual power supplies, hot-swappable CPUs, etc. (source: chapter1).
- Software Fault Tolerance: Systematic error handling, process isolation, and testing (source: chapter1).
- replication: Keeping copies of data on multiple nodes so the system can continue working even if some parts fail (source: chapter5, p. 151).
- Chaos Engineering: Deliberately triggering faults to ensure fault-tolerance machinery is continually exercised and tested (e.g., Netflix’s Chaos Monkey) (source: chapter1).
Fault Tolerance in Batch Processing
Batch systems like mapreduce and dataflow-engines handle faults by retrying failed tasks.
- Determinism: For retries to be safe, operators must be deterministic (producing the same output for the same input) (source: chapter10, p. 422).
- Materialization: MapReduce writes intermediate state to disk, allowing a failed task to be retried without restarting the entire job (source: chapter10, p. 413).
- Lineage: Dataflow engines track the ancestry of data so that only lost partitions need to be recomputed (source: chapter10, p. 422).