Joins in Batch Processing
Summary: Algorithms for joining large datasets in a distributed batch environment, categorized by whether the join happens on the map side or the reduce side.
Sources: raw/chapter10
Last updated: 2026-04-18
In a distributed batch system, joins are more complex because the data is partitioned across many machines. Unlike a relational database that uses an index to perform a lookup, a batch job typically scans the entire input dataset (source: chapter10, p. 403).
Strategies
Reduce-Side Joins
The most general approach, where the join key is extracted by mappers and the join itself happens in the reducers.
- reduce-side-joins: The mapper extracts the join key from each record, and the shuffle phase sorts the records by key so that all records with the same key are brought together at the same reducer; because the reducer merges these sorted runs, this pattern is known as a sort-merge join (source: chapter10, p. 405).
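A minimal sketch of this dataflow, using hypothetical user and click-event records (the dataset names and the in-memory `shuffle` stand in for the framework's real shuffle machinery):

```python
from collections import defaultdict

# Illustrative inputs: users and their click events, joined on user ID.
users = [(1, "alice"), (2, "bob")]
clicks = [(1, "/home"), (1, "/about"), (2, "/home")]

def mapper(records, tag):
    """Extract the join key and tag each record with its source dataset."""
    for key, value in records:
        yield key, (tag, value)

def shuffle(mapped):
    """Group mapper output by key, as the shuffle phase would."""
    groups = defaultdict(list)
    for key, tagged in mapped:
        groups[key].append(tagged)
    return groups

def reducer(key, tagged_values):
    """All records sharing a key arrive together; emit the joined pairs."""
    left = [v for t, v in tagged_values if t == "user"]
    right = [v for t, v in tagged_values if t == "click"]
    for u in left:
        for c in right:
            yield key, u, c

mapped = list(mapper(users, "user")) + list(mapper(clicks, "click"))
joined = [row for key, vals in sorted(shuffle(mapped).items())
          for row in reducer(key, vals)]
# joined == [(1, 'alice', '/home'), (1, 'alice', '/about'), (2, 'bob', '/home')]
```

The key point is that neither input needs to fit in memory: the framework's sort does the work of co-locating matching records.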
Map-Side Joins
Optimizations that perform the join in the mapper, avoiding the cost of sorting and shuffling.
- map-side-joins: Require assumptions about the input data: one dataset small enough to fit in memory on each mapper (broadcast hash join), both datasets partitioned in the same way (partitioned hash join), or both datasets partitioned and sorted in the same way (map-side merge join) (source: chapter10, p. 408).
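A sketch of the simplest variant, the broadcast hash join, again with hypothetical records: the small dataset becomes an in-memory hash table on every mapper, and the large dataset is streamed past it with no sort or shuffle at all.

```python
# Broadcast hash join sketch (illustrative data, not from the source).
small = {1: "alice", 2: "bob"}           # small input, held in memory on each mapper
large = [(1, "/home"), (3, "/about"), (2, "/home")]

def map_side_join(large_records, small_table):
    """Stream the large input; probe the in-memory table for each record."""
    for key, value in large_records:
        if key in small_table:           # inner join: unmatched keys are dropped
            yield key, small_table[key], value

result = list(map_side_join(large, small))
# result == [(1, 'alice', '/home'), (2, 'bob', '/home')]
```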
Handling Skew
When a single join key has a disproportionately large number of records (a hot key), the reducer that handles it becomes a “straggler” that slows down the entire job. Techniques such as the skewed join, which samples the input to identify hot keys and then spreads a hot key's records across several reducers while replicating the matching records from the other input, can mitigate this (source: chapter10, p. 407).
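The replication step can be sketched by "salting" the join key. In this sketch the hot-key set and salt count are assumed inputs (the hot keys would come from a prior sampling pass): the skewed side scatters each hot key across several partitions at random, and the other side must replicate its hot-key records to every one of those partitions so no pair is missed.

```python
import random

HOT_KEYS = {42}      # assumed output of a prior sampling pass
NUM_SALTS = 3        # number of reducers a hot key is spread across

def salt_left(records):
    """Large, skewed side: scatter hot-key records across NUM_SALTS partitions."""
    for key, value in records:
        salt = random.randrange(NUM_SALTS) if key in HOT_KEYS else 0
        yield (key, salt), value

def salt_right(records):
    """Other side: replicate hot-key records to every salted partition."""
    for key, value in records:
        salts = range(NUM_SALTS) if key in HOT_KEYS else [0]
        for salt in salts:
            yield (key, salt), value
```

Joining on the salted key `(key, salt)` then produces each matching pair exactly once, at the cost of duplicating the non-skewed side's hot-key records.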