Materialization

Summary: The process of eagerly computing the result of a sub-operation and writing it to storage (typically disk), allowing it to be reused by subsequent stages or jobs.

Sources: raw/chapter10

Last updated: 2026-04-18


In the context of batch processing, materialization is a key technique for ensuring durability and fault tolerance. In MapReduce, the output of every job is materialized to the distributed filesystem (HDFS) before it can be used as input for the next job (source: chapter10, p. 419).
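As a minimal sketch of this pattern (the stage names and file path are illustrative, not part of any MapReduce API), the first stage below writes its entire result to a file before the second stage reads it back, using a local temp file where a real deployment would use HDFS:

```python
import json
import os
import tempfile

# Hypothetical two-stage pipeline: stage 1 materializes its full output
# to storage, and stage 2 consumes that file as its input -- mirroring
# how each MapReduce job's output is persisted before the next job runs.

def stage1_word_count(lines, out_path):
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    with open(out_path, "w") as f:      # eagerly write the result to disk
        json.dump(counts, f)

def stage2_top_word(in_path):
    with open(in_path) as f:            # re-read the materialized state
        counts = json.load(f)
    return max(counts, key=counts.get)

path = os.path.join(tempfile.mkdtemp(), "stage1_output.json")
stage1_word_count(["to be or not to be"], path)
print(stage2_top_word(path))  # -> to
```

Because stage 1's output survives on disk, stage 2 can be rerun (or run by a different team's code) without recomputing stage 1.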

Benefits

  • Fault Tolerance: If a node fails, its output can be re-read from the materialized state on disk (source: chapter10, p. 419).
  • Separation of Concerns: Different teams can implement different stages of a workflow, communicating only through materialized files (source: chapter10, p. 396).

Drawbacks

  • Performance Overhead: Writing and reading from disk (and replicating over the network in HDFS) is much slower than in-memory communication (source: chapter10, p. 420).
  • Redundant Reads: The mappers at the start of the next job often do little more than read back the file that the previous job's reducers just wrote, adding work without transforming the data.

Alternatives

Dataflow engines (such as Spark) avoid materialization between operators by keeping intermediate state in memory or on local disk, recomputing it from lineage only if a failure occurs (source: chapter10, p. 421).
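The lineage idea can be sketched in a few lines of plain Python (the `Dataset` class and its methods are illustrative, not Spark's actual API): each derived dataset remembers its parent and the function that produced it, so a lost in-memory result can be recomputed rather than re-read from durable storage.

```python
# Minimal lineage sketch: intermediate results live only in memory, and
# recovery means re-running the recorded transformations, not reloading
# a materialized file.

class Dataset:
    def __init__(self, source=None, parent=None, fn=None):
        self.source = source        # base data (for root datasets)
        self.parent = parent        # lineage: where this dataset came from
        self.fn = fn                # lineage: how it was derived
        self.cache = None           # in-memory result, not durable

    def map(self, fn):
        return Dataset(parent=self, fn=fn)

    def compute(self):
        if self.cache is None:      # never computed, or lost to a failure
            if self.parent is None:
                self.cache = list(self.source)
            else:                   # recompute from lineage
                self.cache = [self.fn(x) for x in self.parent.compute()]
        return self.cache

base = Dataset(source=[1, 2, 3])
squared = base.map(lambda x: x * x)
print(squared.compute())    # [1, 4, 9]
squared.cache = None        # simulate losing the in-memory state
print(squared.compute())    # recomputed from lineage: [1, 4, 9]
```

The trade-off relative to full materialization is visible here: nothing is written to disk between operators, but a failure costs recomputation time instead of a simple re-read.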