Reduce-Side Joins

Summary: The default join strategy in MapReduce, where mappers extract join keys and the shuffle phase groups related records for the reducer to join.

Sources: raw/chapter10

Last updated: 2026-04-18


Reduce-side joins, often implemented as sort-merge joins, are robust: they handle datasets of any size and require no prior knowledge of how the input data is partitioned or sorted (source: chapter10, p. 405).

Process

  1. Mapper: For each input record, the mapper extracts a join key and outputs it along with the record’s values.
  2. Shuffle: The MapReduce framework partitions the mapper output by key and sorts each partition, so that all records with the same key end up at the same reducer, adjacent to one another.
  3. Reducer: The reducer receives all records for a given key. It can then perform the join logic, such as iterating over a list of events and joining them with a user profile (source: chapter10, p. 405).
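
The three steps above can be sketched in plain Python as an in-memory simulation. This is a minimal illustration, not Hadoop code: the sample records, the field names (`user_id`, `birth_year`, `url`), the `"user"`/`"event"` tags, and every function name are assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample inputs: a user database and an activity log.
USERS = [
    {"user_id": 1, "birth_year": 1985},
    {"user_id": 2, "birth_year": 1992},
]
EVENTS = [
    {"user_id": 2, "url": "/home"},
    {"user_id": 1, "url": "/cart"},
    {"user_id": 1, "url": "/checkout"},
]


def map_users(record):
    # Mapper for the user database: emit (join key, tagged record).
    yield record["user_id"], ("user", record)


def map_events(record):
    # Mapper for the activity log: same join key, different tag.
    yield record["user_id"], ("event", record)


def shuffle(pairs, num_reducers=2):
    # Partition by key so that equal keys land in the same partition,
    # then sort each partition and group adjacent equal keys.
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[hash(key) % num_reducers].append((key, value))
    for partition in partitions:
        partition.sort(key=itemgetter(0))
        for key, group in groupby(partition, key=itemgetter(0)):
            yield key, [value for _, value in group]


def reduce_join(key, values):
    # Without secondary sort the profile may arrive in any position,
    # so this reducer buffers the events before joining.
    profile = None
    events = []
    for tag, record in values:
        if tag == "user":
            profile = record
        else:
            events.append(record)
    for event in events:
        birth_year = profile["birth_year"] if profile else None
        yield {"user_id": key, "url": event["url"], "birth_year": birth_year}


def run_join(users, events):
    mapped = [kv for u in users for kv in map_users(u)]
    mapped += [kv for e in events for kv in map_events(e)]
    return [row for key, values in shuffle(mapped)
            for row in reduce_join(key, values)]
```

Calling `run_join(USERS, EVENTS)` returns one joined row per activity event. Note that this reducer has to buffer every event for a key, because nothing guarantees where the profile appears among the values.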

Secondary Sort

To ensure the reducer receives the “system of record” (e.g., the user profile) before the “activity events” for a given key, the framework can be configured to sort the values for each key as well. This is known as secondary sort (source: chapter10, p. 406).
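
One common way to get this ordering is a composite key: sort on (join key, record type) but group on the join key alone. The sketch below simulates that in plain Python under stated assumptions; the tag constants, field names, and function names are hypothetical, and a real MapReduce job would configure the equivalent behaviour through the framework rather than call `sorted` directly.

```python
from itertools import groupby

# Hypothetical tags: the lower value sorts first, so the profile precedes events.
USER_TAG, EVENT_TAG = 0, 1


def map_user(record):
    # Composite key: (join key, tag). The tag drives the secondary sort.
    yield (record["user_id"], USER_TAG), record


def map_event(record):
    yield (record["user_id"], EVENT_TAG), record


def shuffle_with_secondary_sort(pairs):
    # Sort on the full composite key, but group on the join key only.
    ordered = sorted(pairs, key=lambda kv: kv[0])
    for join_key, group in groupby(ordered, key=lambda kv: kv[0][0]):
        yield join_key, ((ckey[1], value) for ckey, value in group)


def reduce_join(join_key, tagged_values):
    # The profile is guaranteed to arrive first, so events can be
    # joined as they stream past instead of being buffered.
    profile = None
    for tag, record in tagged_values:
        if tag == USER_TAG:
            profile = record
        else:
            birth_year = profile["birth_year"] if profile else None
            yield (join_key, record["url"], birth_year)


def run_join(users, events):
    # Events are deliberately mapped before users to show that the
    # input order does not matter once the secondary sort is applied.
    pairs = [kv for e in events for kv in map_event(e)]
    pairs += [kv for u in users for kv in map_user(u)]
    rows = []
    for join_key, values in shuffle_with_secondary_sort(pairs):
        rows.extend(reduce_join(join_key, values))
    return rows
```

With this ordering the reducer only needs to keep one profile in memory per key, regardless of how many events that key has.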