Reduce-Side Joins
Summary: The default join strategy in MapReduce, where mappers extract join keys and the shuffle phase groups related records for the reducer to join.
Sources: raw/chapter10
Last updated: 2026-04-18
Reduce-side joins, often implemented as sort-merge joins, are robust and handle datasets of any size. They do not require any prior knowledge of the data’s partitioning or sorting (source: chapter10, p. 405).
Process
- Mapper: For each input record, the mapper extracts a join key and outputs it along with the record’s values.
- Shuffle: The MapReduce framework partitions the mapper output by key and sorts each partition, so that all records with the same key end up at the same reducer, adjacent to one another.
- Reducer: The reducer receives all records for a given key. It can then perform the join logic, such as iterating over a list of events and joining them with a user profile (source: chapter10, p. 405).
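The three steps above can be sketched as a small in-memory simulation. This is an illustration, not an actual MapReduce job: the sample records, the field names (user_id, birth_year, url), the "user"/"event" tags, and the functions map_users, map_events, shuffle, reduce_join, and run_join are all assumptions made for the example, and it performs an inner join (events without a profile are dropped).

```python
from collections import defaultdict

# Hypothetical sample inputs; field names are assumptions for illustration.
USERS = [
    {"user_id": 1, "birth_year": 1989},
    {"user_id": 2, "birth_year": 1975},
]
EVENTS = [
    {"user_id": 2, "url": "/cart"},
    {"user_id": 1, "url": "/home"},
    {"user_id": 1, "url": "/about"},
]


def map_users(record):
    # Mapper for the user table: emit (join key, tagged record).
    yield record["user_id"], ("user", record)


def map_events(record):
    # Mapper for the activity log: same join key, different tag.
    yield record["user_id"], ("event", record)


def shuffle(pairs, num_reducers=2):
    # Partition by key, then sort each partition by key, so every
    # record for a given key lands in exactly one reducer's input.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[hash(key) % num_reducers][key].append(value)
    return [sorted(p.items()) for p in partitions]


def reduce_join(key, values):
    # Reducer: separate the profile from the events, then join them.
    profile = None
    events = []
    for tag, record in values:
        if tag == "user":
            profile = record
        else:
            events.append(record)
    if profile is None:
        return  # inner join: skip events that have no matching profile
    for event in events:
        yield {
            "user_id": key,
            "birth_year": profile["birth_year"],
            "url": event["url"],
        }


def run_join(users, events):
    pairs = [kv for u in users for kv in map_users(u)]
    pairs += [kv for e in events for kv in map_events(e)]
    output = []
    for partition in shuffle(pairs):
        for key, values in partition:
            output.extend(reduce_join(key, values))
    return output
```

Calling run_join(USERS, EVENTS) yields one joined row per event. Note that this reducer buffers all events for a key in memory because it cannot assume the profile arrives first.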
Secondary Sort
To ensure the reducer receives the “system of record” (e.g., the user profile) before the “activity events” for a given key, the framework can be configured to sort the values for each key as well. This is known as secondary sort (source: chapter10, p. 406).
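One common way to get this ordering is a composite key: records are sorted by the full key but grouped by the join key alone. The sketch below simulates that idea in memory. The key layout (user_id, kind, timestamp), the PROFILE and EVENT constants, the timestamps, and the function names are assumptions chosen for illustration and are not taken from the source.

```python
from itertools import groupby

# Hypothetical tags; PROFILE sorts before EVENT within the same user_id.
PROFILE, EVENT = 0, 1


def secondary_sort(pairs):
    # Sort by the full composite key (user_id, kind, timestamp), but
    # group by user_id only, so each group starts with the profile.
    ordered = sorted(pairs, key=lambda kv: kv[0])
    for user_id, group in groupby(ordered, key=lambda kv: kv[0][0]):
        yield user_id, ((key[1], value) for key, value in group)


def reduce_streaming(user_id, tagged_values):
    # Because the profile comes first, events can be joined one at a
    # time without buffering them in memory.
    values = iter(tagged_values)
    first = next(values, None)
    if first is None or first[0] != PROFILE:
        return  # no profile for this key: inner join emits nothing
    profile = first[1]
    for _kind, event in values:
        yield (user_id, profile["birth_year"], event["url"])


def run(pairs):
    output = []
    for user_id, values in secondary_sort(pairs):
        output.extend(reduce_streaming(user_id, values))
    return output


# Hypothetical mapper output, deliberately out of order.
PAIRS = [
    ((1, EVENT, 20), {"url": "/about"}),
    ((2, EVENT, 5), {"url": "/cart"}),
    ((1, PROFILE, 0), {"birth_year": 1989}),
    ((1, EVENT, 10), {"url": "/home"}),
    ((2, PROFILE, 0), {"birth_year": 1975}),
]
```

Here run(PAIRS) returns the joined rows with each user's events in timestamp order. The reducer holds only one profile at a time, which is the main practical benefit of receiving the system of record first.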