MapReduce

Summary: A programming model for processing large amounts of data in bulk across many machines, sitting somewhere between a declarative query language and an imperative API.

Sources: raw/chapter2, raw/chapter10

Last updated: 2026-04-18


MapReduce is based on the map (collect) and reduce (fold/inject) functions found in many functional programming languages (source: chapter2, p. 46). It is the foundational model for distributed batch processing (source: chapter10, p. 389).
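The functional-programming roots are easy to see in plain Python, where the built-in map and functools.reduce play the same conceptual roles (a minimal single-process sketch, not tied to any MapReduce framework):

```python
from functools import reduce

# map applies a function to every element independently
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))   # [1, 4, 9, 16]

# reduce (a fold) combines the elements into a single result
total = reduce(lambda acc, x: acc + x, squares, 0)   # 30
```

In the distributed setting, the framework applies the map function to each input record on many machines in parallel, and the reduce function to each group of values sharing a key.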

Characteristics

  • Pure Functions: The map and reduce functions must be pure, meaning they only use the data passed as input, have no side effects, and cannot perform additional database queries (source: chapter2, p. 48).
  • Distributed Execution: This restriction allows the framework to run functions anywhere, in any order, and rerun them on failure (source: chapter2, p. 48).
  • Low-Level: It is a fairly low-level programming model; higher-level languages like SQL can be implemented as a pipeline of MapReduce operations (source: chapter2, p. 48).
  • Storage: It typically reads from and writes to a distributed filesystem like HDFS (source: chapter10, p. 397).
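The purity restriction can be illustrated with a hypothetical mapper: its output depends only on the record it is given, so the framework may run it on any node, in any order, and rerun it after a failure.

```python
def pure_mapper(record: str):
    # Pure: output is determined entirely by the input record,
    # with no side effects and no external lookups.
    for word in record.split():
        yield (word.lower(), 1)

# By contrast, a mapper that queried a database or mutated a global
# counter could not be safely relocated or rerun by the framework.
```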

Job Execution Process (source: chapter10)

  1. Read Input: The framework breaks input files into blocks and calls the Mapper for each record (p. 399).
  2. Map Phase: Mappers extract a key and value from each record.
  3. Shuffle Phase: The framework sorts the mapper output by key and partitions it so all records with the same key go to the same Reducer (p. 401).
  4. Reduce Phase: Reducers process the sorted key-value pairs and produce the final output.
  5. Write Output: Results are materialized back to HDFS (p. 402).
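The five steps above can be sketched in miniature as a single-process word count (hypothetical helper names; a real framework distributes each phase across machines and materializes the output to HDFS):

```python
from itertools import groupby
from operator import itemgetter

def mapper(record):
    # Map phase: extract (key, value) pairs from each record
    for word in record.split():
        yield (word.lower(), 1)

def reducer(key, values):
    # Reduce phase: combine all values belonging to one key
    return (key, sum(values))

def map_reduce(records):
    # Read input + map: call the mapper for every record
    pairs = [kv for record in records for kv in mapper(record)]
    # Shuffle: sort by key so all equal keys become adjacent
    pairs.sort(key=itemgetter(0))
    # Reduce: group adjacent pairs by key and reduce each group
    return [reducer(key, (v for _, v in group))
            for key, group in groupby(pairs, key=itemgetter(0))]

# map_reduce(["the cat", "the dog"]) → [("cat", 1), ("dog", 1), ("the", 2)]
```

The in-memory sort stands in for the framework's disk-based shuffle; the essential property is the same — every record with a given key reaches exactly one reducer call.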

Fault Tolerance through Materialization

MapReduce provides a clean all-or-nothing guarantee. If a task fails, it is simply retried on another node. This is possible because intermediate state is materialized to disk, which acts as a checkpoint (source: chapter10, p. 413).
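Because tasks are pure and their input is already materialized on disk, retrying a failed task is trivially safe: rerunning it from the same input produces the same output. A toy sketch of that retry logic (assumed names, not any framework's actual API):

```python
def run_with_retries(task, input_path, attempts=3):
    # Safe to rerun: the task is pure and reads only its materialized
    # input, so every attempt yields the same result.
    last_error = None
    for attempt in range(attempts):
        try:
            return task(input_path)          # e.g. rerun on another node
        except Exception as exc:
            last_error = exc                 # node failed; try again
    raise last_error                         # give up: job fails cleanly
```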