Joins in Batch Processing

Summary: Algorithms for joining large datasets in a distributed batch environment, categorized by whether the join happens on the map side or the reduce side.

Sources: raw/chapter10

Last updated: 2026-04-18


In a distributed batch system, joins are more complex because the data is partitioned across many machines. Unlike a relational database that uses an index to perform a lookup, a batch job typically scans the entire input dataset (source: chapter10, p. 403).

Strategies

Reduce-Side Joins

The most general approach, where the join key is extracted by mappers and the join itself happens in the reducers.

  • reduce-side-joins: Also known as sort-merge joins. The mapper extracts the join key, and the shuffle phase sorts the records so that all records with the same key end up at the same reducer (source: chapter10, p. 405).
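The sort-merge flow above can be sketched in plain Python. This is a minimal illustration, not real MapReduce code: the record shapes (user/event tuples), the "user"/"event" tags, and the helper names are all invented for the example; `shuffle` stands in for the framework's sort-and-shuffle phase.

```python
from itertools import groupby
from operator import itemgetter

def map_users(record):
    # Mapper for the users input: emit (join_key, tagged_value).
    user_id, name = record
    yield (user_id, ("user", name))

def map_events(record):
    # Mapper for the events input: same join key, different tag.
    user_id, url = record
    yield (user_id, ("event", url))

def shuffle(mapped):
    # Stand-in for the shuffle phase: sort by join key so all records
    # with the same key arrive contiguously at one reducer.
    return sorted(mapped, key=itemgetter(0))

def reduce_join(shuffled):
    # Sort-merge join: walk each key's group once, pairing user
    # records with event records for that key.
    for key, group in groupby(shuffled, key=itemgetter(0)):
        users, events = [], []
        for _, (tag, value) in group:
            (users if tag == "user" else events).append(value)
        for name in users:
            for url in events:
                yield (key, name, url)

users = [(1, "alice"), (2, "bob")]
events = [(2, "/home"), (1, "/cart"), (1, "/pay")]

mapped = [kv for r in users for kv in map_users(r)] + \
         [kv for r in events for kv in map_events(r)]
result = list(reduce_join(shuffle(mapped)))
# result: [(1, "alice", "/cart"), (1, "alice", "/pay"), (2, "bob", "/home")]
```

The key point is that the mappers never see both inputs together; the shuffle's sorting is what brings matching records side by side.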

Map-Side Joins

Optimizations that perform the join in the mapper, avoiding the cost of sorting and shuffling.

  • map-side-joins: Require specific assumptions about the input: either one dataset is small enough to fit in memory (broadcast hash join), both datasets are partitioned in the same way (partitioned hash join), or both are partitioned and sorted in the same way (map-side merge join) (source: chapter10, p. 408).
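A broadcast hash join is the simplest of these to sketch. The following is an illustrative Python sketch under the stated assumption that the small input fits in memory; the tuple shapes and dataset names are made up for the example.

```python
def broadcast_hash_join(small, large):
    # Build an in-memory hash table from the small input; in a real
    # job this table is loaded by (broadcast to) every mapper.
    lookup = {}
    for user_id, name in small:
        lookup.setdefault(user_id, []).append(name)
    # Stream over the large input with no shuffle or sort: each
    # record is joined by a plain hash lookup.
    for user_id, url in large:
        for name in lookup.get(user_id, []):
            yield (user_id, name, url)

users = [(1, "alice"), (2, "bob")]                 # small input
events = [(2, "/home"), (1, "/cart"), (3, "/x")]   # large input
joined = list(broadcast_hash_join(users, events))
# joined: [(2, "bob", "/home"), (1, "alice", "/cart")]
```

Because no shuffle is needed, the large input can be scanned in a single pass per mapper, which is exactly the cost saving the map-side strategy buys.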

Handling Skew

When a single join key has a disproportionately large number of records (a hot key or hot spot), it can produce a “straggler” reducer that slows down the entire job. Skewed-join techniques mitigate this: a sampling pass identifies the hot keys, then records for those keys are spread across several reducers while the matching records from the other input are replicated to each of them (source: chapter10, p. 407).
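The sample-then-scatter idea can be illustrated with two small helpers. This is a hypothetical sketch: the threshold-based hot-key detection, the `route` function, and the "skewed"/"other" side labels are invented for illustration and are not an API from the source.

```python
import random

def sample_hot_keys(records, threshold):
    # First pass (a small sampling job): count records per key and
    # flag any key whose count exceeds the threshold as "hot".
    counts = {}
    for key, _ in records:
        counts[key] = counts.get(key, 0) + 1
    return {k for k, count in counts.items() if count > threshold}

def route(key, side, hot_keys, n_partitions, rng=random):
    # Return the reducer sub-partition(s) a record is sent to.
    if key not in hot_keys:
        return [0]                      # normal keys: one partition per key
    if side == "skewed":
        # Scatter the hot key's many records randomly over n reducers.
        return [rng.randrange(n_partitions)]
    # Replicate the other input's matching records to all n reducers,
    # so every sub-partition can still complete the join locally.
    return list(range(n_partitions))

hot = sample_hot_keys([("k", i) for i in range(100)] + [("x", 0)],
                      threshold=10)
# hot == {"k"}; records for "k" scatter, records for "x" do not
```

The trade-off is extra replication of the other input in exchange for spreading the hot key's work, trading some duplicated I/O for better parallelism.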