Chapter 10: Batch Processing

Summary: This chapter explores batch processing, an offline data processing model that operates on large, bounded datasets. It traces the lineage from Unix tools to MapReduce and modern dataflow engines.

Sources: raw/chapter10

Last updated: 2026-04-18


Batch processing systems take a large amount of input data, run a job to process it, and produce some output data. These systems are typically offline, meaning they don’t provide immediate responses to users but instead focus on high throughput (source: chapter10, p. 389).

Key Themes

The Unix Philosophy

Modern batch processing inherits the Unix philosophy: build small, composable tools that communicate via a uniform interface (files and pipes) and do one thing well (source: chapter10, p. 394).
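The uniform interface is what makes composition work: every tool reads a stream of lines and writes a stream of lines. A minimal sketch of that idea in Python, with generator stages standing in for pipes (the stage names mimic the Unix tools they imitate; the log data is made up for illustration):

```python
from itertools import groupby

# Each stage consumes an iterable of lines and yields lines — the uniform
# interface that lets the stages be composed in any order, like Unix pipes.

def grep(pattern, lines):
    """Keep only lines containing `pattern` (like `grep`)."""
    return (line for line in lines if pattern in line)

def sort_lines(lines):
    """Sort all lines (like `sort`); note it must buffer its whole input."""
    return iter(sorted(lines))

def uniq_c(lines):
    """Collapse adjacent duplicate lines into (count, line) (like `uniq -c`)."""
    return ((len(list(group)), key) for key, group in groupby(lines))

log = ["GET /home", "GET /about", "GET /home"]
# Equivalent of: cat log | grep GET | sort | uniq -c
result = list(uniq_c(sort_lines(grep("GET", log))))
```

As in a real pipeline, `sort` is the only stage that needs to see its entire input; `grep` and `uniq -c` process one line (or one run of lines) at a time.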

MapReduce and Distributed Filesystems

MapReduce scaled Unix-style processing to thousands of machines. It relies on a distributed filesystem such as HDFS for storage, and it achieves fault tolerance by materializing intermediate state to disk between jobs (source: chapter10, p. 397).
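The programming model itself is small: a mapper emits key-value pairs, the framework groups them by key (the shuffle), and a reducer aggregates each group. A single-process sketch of that model using the classic word-count example (this imitates the model, not the Hadoop API; the `shuffle` step is where a real framework would materialize intermediate state so failed tasks can be retried):

```python
from collections import defaultdict

def map_phase(records):
    # Mapper: emit a (word, 1) pair for every word in every input record.
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between the
    # map and reduce phases (and writes to disk for fault tolerance).
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reducer: sum the counts for each word.
    return {key: sum(values) for key, values in grouped.items()}

docs = ["the quick fox", "the lazy dog"]
counts = reduce_phase(shuffle(map_phase(docs)))
```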

Distributed Joins

Joining large datasets in a batch context requires specialized algorithms such as reduce-side joins (sort-merge joins) and optimizations such as map-side joins (broadcast or partitioned hash joins) (source: chapter10, p. 403).
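In a reduce-side join, both datasets are mapped to (join key, record) pairs and sorted by key, so a reducer sees all records for one key together. A toy sketch of that flow, joining users to their activity events (the datasets and field names are invented for illustration):

```python
from itertools import groupby
from operator import itemgetter

users = [("u1", {"name": "Ada"}), ("u2", {"name": "Bob"})]
events = [("u1", {"page": "/home"}), ("u1", {"page": "/about"})]

# "Map" side: tag each record with its source, keyed by user id.
tagged = ([(k, ("user", v)) for k, v in users] +
          [(k, ("event", v)) for k, v in events])

# "Shuffle" side: sort by key so all records for one key are adjacent,
# as the framework's sort-merge machinery would arrange.
tagged.sort(key=itemgetter(0))

# "Reduce" side: for each key, pair the user record with its events.
joined = []
for key, group in groupby(tagged, key=itemgetter(0)):
    user, evts = None, []
    for _, (tag, record) in group:
        if tag == "user":
            user = record
        else:
            evts.append(record)
    for evt in evts:
        joined.append((key, user["name"], evt["page"]))
```

A map-side join skips the sort and shuffle entirely when one dataset is small enough to broadcast into a hash table on every mapper.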

Evolution to Dataflow Engines

Modern dataflow engines (such as Spark and Flink) improve on MapReduce by modeling entire workflows as directed acyclic graphs (DAGs), reducing the need for explicit materialization between stages (source: chapter10, p. 421).
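The payoff of the DAG model is operator chaining: operators that need no shuffle between them can be fused into one stage, so each record flows through several functions without an intermediate dataset being written out. A toy illustration with chained Python generators (this mimics the pipelining idea, not the Spark or Flink API):

```python
def map_op(fn, stream):
    # Narrow transformation: no repartitioning needed, so it can be
    # pipelined with the operators around it.
    return (fn(x) for x in stream)

def filter_op(pred, stream):
    return (x for x in stream if pred(x))

# Two chained operators form one pipelined stage: each element is squared,
# then filtered, with nothing materialized in between — where two separate
# MapReduce jobs would write the squared dataset to the filesystem first.
stage = filter_op(lambda x: x > 5, map_op(lambda x: x * x, range(1, 6)))
result = list(stage)
```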

Graph Processing

Iterative algorithms on graphs (such as PageRank) are often handled by specialized systems like Pregel, which use a “think like a vertex” model (source: chapter10, p. 424).
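In the vertex-centric model, computation proceeds in supersteps: each vertex receives the messages sent in the previous superstep, updates its own state, and sends messages along its outgoing edges. A toy PageRank sketch in that spirit (this imitates the superstep structure, not the Pregel API; the three-vertex graph, damping factor, and iteration count are chosen for illustration):

```python
from collections import defaultdict

edges = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = {v: 1.0 / len(edges) for v in edges}  # uniform initial rank

DAMPING = 0.85
for superstep in range(20):
    # Message passing: each vertex sends rank/out_degree to its neighbours.
    inbox = defaultdict(float)
    for v, out in edges.items():
        for dest in out:
            inbox[dest] += rank[v] / len(out)
    # Vertex update: combine received messages into the new rank.
    rank = {v: (1 - DAMPING) / len(edges) + DAMPING * inbox[v]
            for v in edges}
```

Because vertex state persists across supersteps, only the messages cross the network between iterations, which is what makes this model a better fit for iterative graph algorithms than chaining full MapReduce jobs.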