MapReduce

Summary: A programming model for processing large amounts of data in bulk across many machines, sitting somewhere between a declarative query language and an imperative API.

Sources: raw/chapter2, raw/chapter10

Last updated: 2026-04-18


MapReduce is based on the map (collect) and reduce (fold/inject) functions found in many functional programming languages (source: chapter2, p. 46). It is the foundational model for distributed batch processing (source: chapter10, p. 389).
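The functional-programming roots are easy to see in plain Python, where the built-in map and functools.reduce play the same conceptual roles (a minimal single-process sketch, not tied to any MapReduce framework):

```python
from functools import reduce

# map applies a function to every element independently
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))   # [1, 4, 9, 16]

# reduce (a fold) combines the elements into a single result
total = reduce(lambda acc, x: acc + x, squares, 0)   # 30
```

In the distributed setting, the framework applies the map function to each input record on many machines in parallel, and the reduce function to each group of values sharing a key.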

Characteristics

  • Pure Functions: The map and reduce functions must be pure, meaning they only use the data passed as input, have no side effects, and cannot perform additional database queries (source: chapter2, p. 48).
  • Distributed Execution: This restriction allows the framework to run functions anywhere, in any order, and rerun them on failure (source: chapter2, p. 48).
  • Low-Level: It is a fairly low-level programming model; higher-level languages like SQL can be implemented as a pipeline of MapReduce operations (source: chapter2, p. 48).
  • Storage: It typically reads from and writes to a distributed filesystem like HDFS (source: chapter10, p. 397).
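The purity restriction can be illustrated with a hypothetical mapper: its output depends only on the record it is given, so the framework may run it on any node, in any order, and rerun it after a failure.

```python
def pure_mapper(record: str):
    # Pure: output is determined entirely by the input record,
    # with no side effects and no external lookups.
    for word in record.split():
        yield (word.lower(), 1)

# By contrast, a mapper that queried a database or mutated a global
# counter could not be safely relocated or rerun by the framework.
```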

Job Execution Process (source: chapter10)

  1. Read Input: The framework breaks input files into blocks and calls the Mapper for each record (p. 399).
  2. Map Phase: Mappers extract a key and value from each record.
  3. Shuffle Phase: The framework sorts the mapper output by key and partitions it so all records with the same key go to the same Reducer (p. 401).
  4. Reduce Phase: Reducers process the sorted key-value pairs and produce the final output.
  5. Write Output: Results are materialized back to HDFS (p. 402).
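The five steps above can be sketched in miniature as a single-process word count (hypothetical helper names; a real framework distributes each phase across machines and materializes the output to HDFS):

```python
from itertools import groupby
from operator import itemgetter

def mapper(record):
    # Map phase: extract (key, value) pairs from each record
    for word in record.split():
        yield (word.lower(), 1)

def reducer(key, values):
    # Reduce phase: combine all values belonging to one key
    return (key, sum(values))

def map_reduce(records):
    # Read input + map: call the mapper for every record
    pairs = [kv for record in records for kv in mapper(record)]
    # Shuffle: sort by key so all equal keys become adjacent
    pairs.sort(key=itemgetter(0))
    # Reduce: group adjacent pairs by key and reduce each group
    return [reducer(key, (v for _, v in group))
            for key, group in groupby(pairs, key=itemgetter(0))]

# map_reduce(["the cat", "the dog"]) → [("cat", 1), ("dog", 1), ("the", 2)]
```

The in-memory sort stands in for the framework's disk-based shuffle; the essential property is the same — every record with a given key reaches exactly one reducer call.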

Fault Tolerance through Materialization

MapReduce provides a clean all-or-nothing guarantee. If a task fails, it is simply retried on another node. This is possible because intermediate state is materialized to disk, which acts as a checkpoint (source: chapter10, p. 413).
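Because tasks are pure and their input is already materialized on disk, retrying a failed task is trivially safe: rerunning it from the same input produces the same output. A toy sketch of that retry logic (assumed names, not any framework's actual API):

```python
def run_with_retries(task, input_path, attempts=3):
    # Safe to rerun: the task is pure and reads only its materialized
    # input, so every attempt yields the same result.
    last_error = None
    for attempt in range(attempts):
        try:
            return task(input_path)          # e.g. rerun on another node
        except Exception as exc:
            last_error = exc                 # node failed; try again
    raise last_error                         # give up: job fails cleanly
```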