Dataflow Engines

Summary: A generation of distributed processing frameworks that model entire workflows as a single graph of operations, improving upon the rigid Map-Shuffle-Reduce model.

Sources: raw/chapter10

Last updated: 2026-04-18


Dataflow engines like Spark, Tez, and Flink were developed to overcome the limitations of MapReduce. They treat an entire workflow as a single job, rather than as a chain of independent MapReduce jobs (source: chapter10, p. 421).

Advantages over MapReduce

  • No Rigid Stages: Instead of strictly alternating between map and reduce, operators can be assembled in flexible ways. The entire workflow is modeled as a Directed Acyclic Graph (DAG) (source: chapter10, p. 421).
  • Reduced Materialization: Intermediate state between operators is often kept in memory or written to local disk, rather than being replicated across the distributed filesystem. This significantly improves performance (source: chapter10, p. 421).
  • Pipelining: An operator can start processing data as soon as its input is ready, without waiting for the preceding stage to finish entirely (source: chapter10, p. 423).
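The first and third points can be sketched with a toy lazy dataset in plain Python (this is an illustrative model, not Spark's actual API): transformations merely append operator nodes to a DAG, and only a final action walks the chain, streaming each record through every operator via generators so no intermediate stage is materialized.

```python
class Dataset:
    """Toy lazy dataset: map/filter build DAG nodes; nothing runs until
    collect() streams records through the whole operator chain."""
    def __init__(self, source=None, parent=None, op=None):
        self.source = source   # base iterable (root nodes only)
        self.parent = parent   # upstream operator in the DAG
        self.op = op           # transform applied to the parent's stream

    def map(self, f):
        return Dataset(parent=self, op=lambda it: (f(x) for x in it))

    def filter(self, pred):
        return Dataset(parent=self, op=lambda it: (x for x in it if pred(x)))

    def collect(self):
        # Walk up to the root, then pipe elements down through every
        # operator -- records are pipelined, no stage is fully buffered.
        chain, node = [], self
        while node.parent is not None:
            chain.append(node.op)
            node = node.parent
        stream = iter(node.source)
        for op in reversed(chain):
            stream = op(stream)
        return list(stream)

nums = Dataset(source=range(10))
result = nums.map(lambda x: x * x).filter(lambda x: x % 2 == 0).collect()
print(result)  # [0, 4, 16, 36, 64]
```

Because each operator wraps the previous one in a generator, a downstream operator consumes records as soon as the upstream one emits them, mirroring how dataflow engines avoid waiting for a preceding stage to finish.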

Fault Tolerance

Since dataflow engines avoid fully materializing intermediate state to HDFS, they need a different mechanism for recovering from failures. They track the lineage of each dataset (which operators and inputs were used to create it). If a partition is lost, it can be recomputed from its ancestors (source: chapter10, p. 422).
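Lineage-based recovery can be sketched as follows (a minimal model, with hypothetical `Partition`/`compute` names; real engines track lineage at much finer granularity): each partition records the operator and parent partitions that produced it, so a lost result is rebuilt by recursively recomputing its ancestors rather than restored from replicated storage.

```python
class Partition:
    """Toy lineage tracking: a partition remembers the operator and
    parents used to build it, so lost data can be recomputed."""
    def __init__(self, op=None, parents=(), base=None):
        self.op = op           # function producing this partition's data
        self.parents = parents # upstream partitions (the lineage)
        self.base = base       # raw input, for source partitions only
        self.data = None       # cached result; None means lost/uncomputed

    def compute(self):
        if self.data is None:  # lost or never computed: rebuild from lineage
            if self.parents:
                inputs = [p.compute() for p in self.parents]
                self.data = self.op(*inputs)
            else:
                self.data = list(self.base)
        return self.data

source = Partition(base=[3, 1, 2])
mapped = Partition(op=lambda xs: [x * 10 for x in xs], parents=(source,))
result = Partition(op=lambda xs: sorted(xs), parents=(mapped,))

first = result.compute()   # [10, 20, 30]

# Simulate a node failure wiping the in-memory partitions:
result.data = None
mapped.data = None
recovered = result.compute()  # recomputed from ancestors: [10, 20, 30]
```

The trade-off is that recovery costs recomputation time instead of replication bandwidth, which is why engines may still choose to checkpoint partitions whose lineage is long or expensive to replay.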