Dataflow Engines
Summary: A generation of distributed processing frameworks that model an entire workflow as a single graph of operations, improving upon the rigid Map-Shuffle-Reduce model.
Sources: raw/chapter10
Last updated: 2026-04-18
Dataflow engines like Spark, Tez, and Flink were developed to overcome the limitations of MapReduce. They treat an entire workflow as a single job, rather than a chain of independent MapReduce jobs (source: chapter10, p. 421).
Advantages over MapReduce
- No Rigid Stages: Instead of strictly alternating between map and reduce, operators can be assembled in flexible ways. The entire workflow is modeled as a Directed Acyclic Graph (DAG) (source: chapter10, p. 421).
- Reduced Materialization: Intermediate state between operators is often kept in memory or written to local disk, rather than being replicated across the distributed filesystem. This significantly improves performance (source: chapter10, p. 421).
- Pipelining: An operator can start processing data as soon as its input is ready, without waiting for the preceding stage to finish entirely (source: chapter10, p. 423).
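The flexible operator assembly and pipelining described above can be sketched in plain Python using generators (an illustrative toy, not any real engine's API): each generator is an operator, and a downstream operator pulls records as soon as its upstream yields them, with no stage barrier or intermediate materialization.

```python
# Toy dataflow sketch: operators as generators. Downstream operators consume
# records as they are produced upstream, so stages pipeline naturally.

def read(records):
    for r in records:
        yield r

def map_op(upstream, fn):
    for r in upstream:
        yield fn(r)

def filter_op(upstream, pred):
    for r in upstream:
        if pred(r):
            yield r

# Operators assembled in a flexible chain, not a forced map/reduce alternation:
result = list(map_op(
    filter_op(read(range(10)), lambda x: x % 2 == 0),
    lambda x: x * x,
))
print(result)  # squares of the even numbers: [0, 4, 16, 36, 64]
```

Because nothing is materialized between operators, no record is buffered longer than needed; this mirrors how dataflow engines keep intermediate state in memory or on local disk rather than in the distributed filesystem.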
Fault Tolerance
Since dataflow engines avoid materializing intermediate state to HDFS, they need a different mechanism for recovering from failures. They track the lineage of each dataset (the operators and inputs used to compute it), so if a partition is lost, it can be recomputed from its ancestors (source: chapter10, p. 422).
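A minimal sketch of lineage-based recovery (hypothetical class and method names, not Spark's or Flink's actual API): each dataset records its parent and the operator that derived it, so a lost partition is recomputed from its ancestor rather than restored from a replica.

```python
# Toy lineage-based recovery: a dataset remembers how it was derived,
# so a lost partition can be recomputed instead of replicated.

class Dataset:
    def __init__(self, partitions=None, parent=None, op=None):
        self._partitions = partitions or {}  # partition id -> records
        self.parent = parent                 # lineage: source dataset
        self.op = op                         # lineage: deriving operator

    def map(self, fn):
        child = Dataset(parent=self, op=fn)
        child._partitions = {pid: [fn(x) for x in data]
                             for pid, data in self._partitions.items()}
        return child

    def lose(self, pid):
        del self._partitions[pid]            # simulate a node failure

    def get(self, pid):
        if pid not in self._partitions:      # recompute from lineage on demand
            self._partitions[pid] = [self.op(x) for x in self.parent.get(pid)]
        return self._partitions[pid]

base = Dataset(partitions={0: [1, 2], 1: [3, 4]})
squared = base.map(lambda x: x * x)
squared.lose(1)
print(squared.get(1))  # recomputed from base via lineage: [9, 16]
```

Recomputation is cheap when operators are deterministic; nondeterministic operators require extra care (e.g. making them deterministic or falling back to materialization), which is part of why real engines constrain operator semantics.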