MapReduce
Summary: A programming model for processing large amounts of data in bulk across many machines, sitting somewhere between a declarative query language and an imperative API.
Sources: raw/chapter2, raw/chapter10
Last updated: 2026-04-18
MapReduce is based on the map (collect) and reduce (fold/inject) functions found in functional programming (source: chapter2, p. 46). It is the foundational model for distributed batch processing (source: chapter10, p. 389).
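To ground the functional-programming roots mentioned above, here is a minimal sketch in Python of the same two building blocks: `map` applies a function to every element, and `reduce` (a fold) combines the results into a single value. The word list is just illustrative data.

```python
from functools import reduce

# map: apply a function to each element independently.
words = ["the", "quick", "brown", "fox", "the"]
lengths = list(map(len, words))

# reduce (fold/inject): combine elements into one accumulated value.
total = reduce(lambda acc, n: acc + n, lengths, 0)

print(lengths)  # [3, 5, 5, 3, 3]
print(total)    # 19
```

MapReduce lifts exactly this pair of operations to a cluster: the per-element step runs in parallel across machines, and the combining step runs per key.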
Characteristics
- Pure Functions: The map and reduce functions must be pure, meaning they only use the data passed as input, have no side effects, and cannot perform additional database queries (source: chapter2, p. 48).
- Distributed Execution: This restriction allows the framework to run the functions anywhere, in any order, and rerun them on failure (source: chapter2, p. 48).
- Low-Level: It is a fairly low-level programming model; higher-level languages like SQL can be implemented as a pipeline of MapReduce operations (source: chapter2, p. 48).
- Storage: It typically reads from and writes to a distributed filesystem like HDFS (source: chapter10, p. 397).
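The purity requirement above can be illustrated with a hypothetical word-count job (the function names and signatures here are illustrative, not any framework's actual API). Both functions use only their arguments, perform no I/O, and hold no state, so a scheduler could run or rerun them on any node.

```python
def mapper(record):
    # Pure: derives output only from the input record.
    # No database lookups, no shared mutable state, no side effects.
    for word in record.split():
        yield (word, 1)

def reducer(key, values):
    # Pure: sees the key and all values collected for it, nothing else.
    return (key, sum(values))
```

Because rerunning either function with the same input always yields the same output, a retry after a node failure is indistinguishable from a first run.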
Job Execution Process (source: chapter10)
- Read Input: The framework breaks input files into blocks and calls the Mapper for each record (p. 399).
- Map Phase: Mappers extract a key and value from each record.
- Shuffle Phase: The framework sorts the mapper output by key and partitions it so all records with the same key go to the same Reducer (p. 401).
- Reduce Phase: Reducers process the sorted key-value pairs and produce the final output.
- Write Output: Results are materialized back to HDFS (p. 402).
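The five phases above can be simulated in a single process; this is a sketch of the dataflow only (real frameworks partition work across machines and spill to disk). The word-count mapper and reducer are illustrative stand-ins.

```python
from itertools import groupby

def mapper(line):
    # Map phase: extract a key (word) and value (count of 1) per record.
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    # Reduce phase: combine all values that share one key.
    return (word, sum(counts))

# Read input: each line is one record.
lines = ["hello world", "hello mapreduce"]
mapped = [kv for line in lines for kv in mapper(line)]

# Shuffle phase: sort by key so all records with the same key are
# adjacent, mimicking the framework's sort-and-partition step.
mapped.sort(key=lambda kv: kv[0])

# Reduce and write output: one reducer call per distinct key.
output = [reducer(key, (v for _, v in group))
          for key, group in groupby(mapped, key=lambda kv: kv[0])]
print(output)  # [('hello', 2), ('mapreduce', 1), ('world', 1)]
```

The in-memory sort stands in for the shuffle; in a real job that step moves data between machines so each reducer receives the full sorted run for its key range.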
Fault Tolerance through Materialization
MapReduce provides a clean all-or-nothing guarantee. If a task fails, it is simply retried on another node. This is possible because intermediate state is materialized to disk, which acts as a checkpoint (source: chapter10, p. 413).
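A toy sketch of this retry behavior (illustrative only, not Hadoop's scheduler): because the task's input is materialized on disk, a crashed task can simply be run again from that checkpoint, and a pure task produces the same result on the retry.

```python
def run_with_retries(task, task_input, max_attempts=3):
    # The materialized input survives the failure, so each retry
    # starts from the same checkpointed data.
    for _ in range(max_attempts):
        try:
            return task(task_input)
        except RuntimeError:
            continue  # e.g. node died; reschedule elsewhere
    raise RuntimeError("task failed after all retries")

# Simulate a task whose first attempt hits a node failure.
attempts = iter([True, False])

def flaky_reduce(values):
    if next(attempts):
        raise RuntimeError("node crashed")
    return sum(values)

result = run_with_retries(flaky_reduce, [1, 2, 3])
print(result)  # 6
```

The all-or-nothing guarantee falls out of this: downstream consumers only ever see output from attempts that ran to completion.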