Fault Tolerance

Summary: A system’s ability to anticipate and cope with faults, avoiding a total failure.

Sources: chapter1

Last updated: 2026-04-15

Systems that are fault-tolerant or resilient can continue providing their service even when one or more components deviate from their spec (such deviations are called faults). A failure, by contrast, is when the system as a whole stops providing the required service. It is usually best to design fault-tolerance mechanisms that prevent faults from causing failures (source: chapter1).

Types of Fault Tolerance

Hardware Redundancy: RAID configurations, dual power supplies, hot-swappable CPUs, etc. (source: chapter1).
Software Fault Tolerance: Systematic error handling, process isolation, and testing (source: chapter1).
Replication: Keeping copies of data on multiple nodes so the system can continue working even if some parts fail (source: chapter5, p. 151).
Chaos Engineering: Deliberately triggering faults to ensure fault-tolerance machinery is continually exercised and tested (e.g., Netflix’s Chaos Monkey) (source: chapter1).
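
To make the fault/failure distinction concrete, here is a minimal sketch of replication combined with chaos-style fault injection. All names (`Replica`, `read_with_failover`, `kill_random_replica`) are hypothetical and chosen for illustration; this is not how any particular database or Chaos Monkey itself works.

```python
import random


class ReplicaDown(Exception):
    """Raised when a replica cannot serve a request (a fault)."""


class Replica:
    """Hypothetical in-memory replica holding a full copy of the data."""

    def __init__(self, name, data):
        self.name = name
        self.data = dict(data)
        self.alive = True

    def read(self, key):
        if not self.alive:
            raise ReplicaDown(self.name)
        return self.data[key]


def read_with_failover(replicas, key):
    """Try each replica in turn; fail only if every replica is down."""
    for replica in replicas:
        try:
            return replica.read(key)
        except ReplicaDown:
            continue  # a fault in one replica, not yet a failure of the service
    raise RuntimeError("all replicas are down")


def kill_random_replica(replicas, rng):
    """Chaos-style fault injection: deliberately take one live replica down."""
    victim = rng.choice([r for r in replicas if r.alive])
    victim.alive = False
    return victim.name


replicas = [Replica(f"node{i}", {"user:1": "alice"}) for i in range(3)]
rng = random.Random(0)

# Inject two faults; one replica survives, so the read still succeeds.
kill_random_replica(replicas, rng)
kill_random_replica(replicas, rng)
value = read_with_failover(replicas, "user:1")
```

With three replicas the service tolerates two faults; only when the last replica is also down does the fault become a failure.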

Fault Tolerance in Batch Processing

Batch systems like MapReduce and dataflow engines handle faults by retrying failed tasks.

Determinism: For retries to be safe, operators must be deterministic (producing the same output for the same input) (source: chapter10, p. 422).
Materialization: MapReduce writes intermediate state to disk, allowing a failed task to be retried without restarting the entire job (source: chapter10, p. 413).
Lineage: Dataflow engines track the ancestry of data so that only lost partitions need to be recomputed (source: chapter10, p. 422).
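
The interplay of determinism, materialization, and retries can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual MapReduce or Spark mechanism: the helper names (`run_with_retries`, `FlakyOnce`, `stage2`) and the JSON file standing in for materialized intermediate state are all hypothetical.

```python
import json
import os
import tempfile


def run_with_retries(task, task_input, max_attempts=3):
    """Re-run a failed task; safe only because the task is deterministic."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return task(task_input)
        except Exception as exc:
            last_error = exc
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


def word_count(lines):
    """Deterministic operator: the same input always yields the same output."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


class FlakyOnce:
    """Wraps a task so that its first invocation crashes (simulated fault)."""

    def __init__(self, task):
        self.task = task
        self.calls = 0

    def __call__(self, task_input):
        self.calls += 1
        if self.calls == 1:
            raise IOError("simulated crash")
        return self.task(task_input)


# Stage 1 materializes its output to disk, so a crash in stage 2
# does not require re-running stage 1.
workdir = tempfile.mkdtemp()
stage1_path = os.path.join(workdir, "stage1.json")
with open(stage1_path, "w") as f:
    json.dump(["a b a", "b c"], f)


def stage2(path):
    """Reads the materialized stage 1 output and counts words."""
    with open(path) as f:
        return word_count(json.load(f))


flaky_stage2 = FlakyOnce(stage2)
result = run_with_retries(flaky_stage2, stage1_path)
```

Because `word_count` is deterministic and its input is read back from disk unchanged, the retried run produces exactly the result the crashed run would have produced. A nondeterministic operator (for example, one using unseeded random numbers) would break this guarantee.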
