Map-Side Joins

Summary: Join optimizations that perform the join within the mapper, avoiding the network and disk overhead of sorting and shuffling.

Sources: raw/chapter10

Last updated: 2026-04-18


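Map-side joins are generally faster than reduce-side joins because they need no sort, shuffle, or reduce phase: the join completes entirely in the mappers. In exchange, they make stricter assumptions about the size, partitioning, and sort order of the input data (source: chapter10, p. 408).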

Types of Map-Side Joins

Broadcast Hash Join

Used when one dataset is small enough to fit in memory on every mapper. The mapper loads the small dataset into a hash table and then scans the large dataset, performing a lookup for each record (source: chapter10, p. 409).
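The build-then-probe pattern described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a MapReduce implementation: the function name `broadcast_hash_join`, the `(key, value)` record shape, and the sample data are assumptions made for the example, and it performs an inner join only.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[int, str]  # hypothetical (join_key, payload) record shape


def broadcast_hash_join(
    small: Iterable[Record], large: Iterable[Record]
) -> Iterator[Tuple[int, str, str]]:
    """Inner-join `large` against `small` by loading `small` into a hash table."""
    # Build phase: every mapper would load the whole small dataset like this.
    table: Dict[int, List[str]] = {}
    for key, value in small:
        table.setdefault(key, []).append(value)

    # Probe phase: stream the large dataset, one hash lookup per record.
    for key, value in large:
        for small_value in table.get(key, ()):
            yield key, value, small_value


# Illustrative data (assumed): a small user table and a larger event log.
users = [(1, "alice"), (2, "bob")]
events = [(1, "click"), (3, "view"), (2, "purchase"), (1, "scroll")]

joined = list(broadcast_hash_join(users, events))
for row in joined:
    print(row)
```

The large input is never held in memory as a whole; only the small side is, which is why its size is the limiting assumption.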

Partitioned Hash Join

Used when both join inputs are partitioned in the same way (using the same key, the same hash function, and the same number of partitions). Each mapper only needs to load the partition of the small dataset that corresponds to the partition of the large dataset it is processing, so the hash table it builds is a fraction of the size of the full small dataset. In Hive, this is known as a bucketed map join (source: chapter10, p. 409).
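As a rough sketch of the idea, the Python below partitions two inputs with the same function and then joins them one partition pair at a time. The names `partition_of`, `partition_records`, and `partitioned_hash_join`, the partition count, and the use of CRC32 as the shared hash function are all assumptions chosen for illustration; in a real system the partitioning would already have been done by the jobs that produced the inputs.

```python
import zlib
from typing import Dict, List, Tuple

Record = Tuple[int, str]  # hypothetical (join_key, payload) record shape
NUM_PARTITIONS = 4        # assumed; must be identical for both inputs


def partition_of(key: int, num_partitions: int = NUM_PARTITIONS) -> int:
    # Deterministic hash so both datasets agree on where each key lives.
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def partition_records(records: List[Record]) -> List[List[Record]]:
    parts: List[List[Record]] = [[] for _ in range(NUM_PARTITIONS)]
    for key, value in records:
        parts[partition_of(key)].append((key, value))
    return parts


def join_partition(small_part: List[Record], large_part: List[Record]):
    # Each "mapper" builds a hash table from one small partition only.
    table: Dict[int, List[str]] = {}
    for key, value in small_part:
        table.setdefault(key, []).append(value)
    return [(k, v, sv) for k, v in large_part for sv in table.get(k, ())]


def partitioned_hash_join(small_parts, large_parts):
    if len(small_parts) != len(large_parts):
        raise ValueError("inputs must have the same number of partitions")
    result = []
    for small_part, large_part in zip(small_parts, large_parts):
        result.extend(join_partition(small_part, large_part))
    return result


users = [(1, "alice"), (2, "bob"), (3, "carol")]
events = [(1, "click"), (2, "purchase"), (4, "view"), (1, "scroll")]

joined = partitioned_hash_join(partition_records(users), partition_records(events))
print(sorted(joined))
```

Because matching keys always land in partitions with the same index, no mapper ever needs to look at a partition other than its own pair.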

Map-Side Merge Join

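Used when both inputs are not only partitioned in the same way but also sorted by the join key. Neither input needs to fit in memory: each mapper reads both input partitions incrementally, in ascending key order, and matches records with the same key, which is the same merge a reducer would otherwise perform (source: chapter10, p. 410).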
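The merge step can be illustrated with a small Python generator that walks two key-sorted inputs in lockstep. This is a sketch under stated assumptions: `merge_join` is a hypothetical name, records are `(key, value)` tuples, both inputs are already sorted by key in ascending order, and only the records for one key on the right-hand side are buffered at a time.

```python
from itertools import groupby
from operator import itemgetter
from typing import Iterable, Iterator, Tuple

Record = Tuple[int, str]  # hypothetical (join_key, payload) record shape


def merge_join(
    left: Iterable[Record], right: Iterable[Record]
) -> Iterator[Tuple[int, str, str]]:
    """Inner-join two inputs that are both sorted by key (ascending)."""
    left_groups = groupby(left, key=itemgetter(0))
    right_groups = groupby(right, key=itemgetter(0))
    lg = next(left_groups, None)
    rg = next(right_groups, None)

    while lg is not None and rg is not None:
        left_key, right_key = lg[0], rg[0]
        if left_key < right_key:
            lg = next(left_groups, None)   # no match possible, advance left
        elif left_key > right_key:
            rg = next(right_groups, None)  # no match possible, advance right
        else:
            # Same key on both sides: buffer one side's group, stream the other.
            right_values = [value for _, value in rg[1]]
            for _, left_value in lg[1]:
                for right_value in right_values:
                    yield left_key, left_value, right_value
            lg = next(left_groups, None)
            rg = next(right_groups, None)


# Illustrative, pre-sorted inputs (assumed).
users = [(1, "alice"), (2, "bob"), (4, "dave")]
events = [(1, "click"), (1, "scroll"), (2, "purchase"), (3, "view")]

joined = list(merge_join(users, events))
for row in joined:
    print(row)
```

Each input is read exactly once and in order, so the same function works on file-backed iterators as well as on the in-memory lists used here.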