Chapter 6: Partitioning

Summary: This chapter explores different ways of breaking a large dataset into smaller subsets (partitions) to scale beyond a single machine.

Sources: chapter6

Last updated: 2026-04-15


Key Takeaways

Why Partition?

The main reason for wanting to partition data is scalability. Different partitions can be placed on different nodes in a shared-nothing cluster. This allows a large dataset to be distributed across many disks and the query load to be distributed across many processors. (source: chapter6)

Partitioning and Replication

Partitioning is usually combined with replication so that copies of each partition are stored on multiple nodes. A node may store more than one partition. (source: chapter6)
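As a minimal sketch of this combination (the node names, replica count, and consecutive-node placement are illustrative, not from the chapter), each partition can be assigned a leader plus followers on distinct nodes, so every node ends up storing several partitions:

```python
def place_replicas(num_partitions: int, nodes: list[str],
                   factor: int) -> dict[int, list[str]]:
    """Place `factor` replicas of each partition on consecutive distinct nodes.
    The first node in each list acts as the leader for that partition."""
    return {p: [nodes[(p + i) % len(nodes)] for i in range(factor)]
            for p in range(num_partitions)}

placement = place_replicas(6, ["n0", "n1", "n2"], factor=2)
# n0 leads partition 0; n1 holds a follower copy of it.
assert placement[0] == ["n0", "n1"]
# Every node stores multiple partitions, some as leader, some as follower.
```

Because replicas of the same partition land on different nodes, losing one node leaves every partition with at least one surviving copy.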

Partitioning Strategies

  • key-range-partitioning: Assigning a continuous range of keys to each partition. This supports efficient range queries but can lead to hot-spots if access patterns are skewed. (source: chapter6)
  • hash-partitioning: Using a hash function to determine the partition for a given key. This distributes keys more evenly and reduces the risk of skew and hot spots, but range queries become inefficient. (source: chapter6)
  • consistent-hashing: A specific approach to hash partitioning designed to minimize data movement when nodes are added or removed. (source: chapter6)
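The trade-off between the first two strategies can be sketched as follows (the boundary values, key names, and use of MD5 are illustrative assumptions, not from the chapter):

```python
import hashlib

def key_range_partition(key: str, boundaries: list[str]) -> int:
    """Assign the key to the first range whose upper boundary exceeds it.
    Adjacent keys land in the same partition, so range scans stay local."""
    for i, upper in enumerate(boundaries):
        if key < upper:
            return i
    return len(boundaries)  # last, open-ended range

def hash_partition(key: str, num_partitions: int) -> int:
    """Hash the key and take the remainder. Keys spread evenly,
    but adjacent keys scatter, so range scans must hit every partition."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_partitions

# Four key ranges: < "g", "g".."n", "n".."t", >= "t"
boundaries = ["g", "n", "t"]
assert key_range_partition("apple", boundaries) == 0
assert key_range_partition("zebra", boundaries) == 3

# Hash partitioning: deterministic, evenly spread, order destroyed.
p = hash_partition("user:1001", 8)
assert 0 <= p < 8
```

Note the skew risk in the range variant: if most writes target keys starting with "a", partition 0 becomes a hot spot while the others sit idle.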

Partitioning and Secondary Indexes

Secondary indexes add complexity to partitioning because they don’t map neatly to partitions:

  • local-index (Document-partitioned): Each partition maintains its own secondary index. Queries require “scatter/gather” across all partitions. (source: chapter6)
  • global-index (Term-partitioned): The secondary index is itself partitioned, allowing more targeted reads but making writes more complex and potentially involving multiple partitions. (source: chapter6)
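A document-partitioned (local) index and its scatter/gather read path can be sketched like this (the `color` attribute, modulo placement, and class names are illustrative assumptions):

```python
from collections import defaultdict

class Partition:
    """One partition holding documents plus its own local secondary index."""
    def __init__(self):
        self.docs = {}
        self.index = defaultdict(set)  # e.g. color -> set of doc ids

    def put(self, doc_id: int, doc: dict):
        self.docs[doc_id] = doc
        self.index[doc["color"]].add(doc_id)  # index write stays local

partitions = [Partition() for _ in range(3)]

def put(doc_id: int, doc: dict):
    # Primary-key placement; the index entry lives wherever the doc lives.
    partitions[doc_id % len(partitions)].put(doc_id, doc)

def find_by_color(color: str) -> set[int]:
    """Scatter/gather: ask every partition's local index, merge the answers."""
    result = set()
    for p in partitions:
        result |= p.index[color]
    return result

put(1, {"color": "red"})
put(2, {"color": "red"})
put(3, {"color": "blue"})
assert find_by_color("red") == {1, 2}
```

A term-partitioned (global) index would instead partition `index` by color, letting a read go to one partition only, at the cost of writes touching index partitions other than the one holding the document.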

Rebalancing


As the cluster grows or the dataset changes, partitions need to be moved between nodes. Strategies include:

  • Fixed number of partitions: Creating many more partitions than nodes.
  • Dynamic partitioning: Splitting a partition when it grows past a size threshold, and merging adjacent partitions when they shrink.
  • Partitioning proportionally to nodes: Number of partitions is a fixed multiple of the number of nodes. (source: chapter6)
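The fixed-number-of-partitions strategy can be sketched as follows (the node names, round-robin initial placement, and greedy steal-from-busiest rebalance are illustrative assumptions). The key property is that keys never change partition; only whole partitions move between nodes:

```python
def rebalance(assignment: dict[int, str], nodes: list[str]) -> dict[int, str]:
    """Move whole partitions onto underloaded nodes until load is even.
    Keys stay in their partition, so only entire partitions cross the network."""
    target = len(assignment) // len(nodes)
    load = {n: [p for p, owner in assignment.items() if owner == n]
            for n in nodes}
    new = dict(assignment)
    for n in nodes:
        while len(load[n]) < target:
            donor = max(load, key=lambda m: len(load[m]))  # busiest node
            p = load[donor].pop()
            load[n].append(p)
            new[p] = n
    return new

# 12 partitions spread over 3 nodes; then "node3" joins the cluster.
before = {p: f"node{p % 3}" for p in range(12)}
after = rebalance(before, ["node0", "node1", "node2", "node3"])
moved = [p for p in before if before[p] != after[p]]
# Only 12/4 = 3 partitions move, one stolen from each previously full node.
assert len(moved) == 3
```

With many more partitions than nodes, each node sheds only a small fraction of its data when a new node joins, and no existing partition needs to be split.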

Request Routing

When a client wants to read or write a key, it must find which partition the key belongs to and which node currently holds that partition. This routing problem is often handled by:

  • service-discovery: Using a separate coordination service like ZooKeeper to keep track of partition assignments.
  • gossip-protocol: Nodes communicating among themselves to disseminate cluster state changes (e.g., Cassandra, Riak). (source: chapter6)
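As a minimal sketch of the coordination-service approach (the class, the CRC32 key hashing, and the callback name are illustrative assumptions; in a real deployment the authoritative partition map would live in ZooKeeper, which notifies subscribers of changes):

```python
import zlib

class RoutingTier:
    """Routes a key to the node holding its partition, using a cached copy
    of the partition-to-node map maintained by a coordination service."""
    def __init__(self, assignment: dict[int, str]):
        self.assignment = assignment  # partition -> node

    def partition_for(self, key: str) -> int:
        # Deterministic hash so every router agrees on the partition.
        return zlib.crc32(key.encode()) % len(self.assignment)

    def node_for(self, key: str) -> str:
        return self.assignment[self.partition_for(key)]

    def on_assignment_change(self, partition: int, node: str):
        # Invoked when the coordination service reports a partition move.
        self.assignment[partition] = node

router = RoutingTier({0: "node0", 1: "node1", 2: "node2", 3: "node0"})
target = router.node_for("user:42")          # where to send the request
router.on_assignment_change(0, "node3")      # partition 0 moved elsewhere
```

In the gossip-protocol variant, there is no separate coordination service: every node keeps such a map and can forward a misdirected request to the right node itself.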