Chapter 6: Partitioning
Summary: This chapter explores different ways of breaking a large dataset into smaller subsets (partitions) to scale beyond a single machine.
Sources: chapter6
Last updated: 2026-04-15
Key Takeaways
Why Partition?
The main reason for wanting to partition data is scalability. Different partitions can be placed on different nodes in a shared-nothing cluster. This allows a large dataset to be distributed across many disks and the query load to be distributed across many processors. (source: chapter6)
Partitioning and Replication
Partitioning is usually combined with replication so that copies of each partition are stored on multiple nodes. A node may store more than one partition. (source: chapter6)
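As a minimal sketch of how partitioning and replication combine, the Python below places each partition's replicas on consecutive nodes round-robin. All names and sizes here (`NUM_PARTITIONS`, `REPLICATION_FACTOR`, the node list, the placement rule) are illustrative assumptions, not from the chapter:

```python
# Illustrative sketch: combining partitioning with replication. Each
# partition is stored on REPLICATION_FACTOR nodes, and every node ends
# up holding several partitions. All names/sizes here are assumptions.
REPLICATION_FACTOR = 3
NODES = ["node0", "node1", "node2", "node3"]
NUM_PARTITIONS = 8

def replica_nodes(partition: int) -> list[str]:
    """Place a partition's replicas on consecutive nodes (round-robin)."""
    return [NODES[(partition + i) % len(NODES)]
            for i in range(REPLICATION_FACTOR)]

# Full placement: partition -> list of nodes holding a copy of it.
assignment = {p: replica_nodes(p) for p in range(NUM_PARTITIONS)}
```

Note that each node stores more than one partition: with 8 partitions × 3 replicas spread over 4 nodes, every node holds 6 partition replicas.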
Partitioning Strategies
- key-range-partitioning: Assigning a continuous range of keys to each partition. This supports efficient range queries but can lead to hot spots if access patterns are skewed. (source: chapter6)
- hash-partitioning: Using a hash function to determine the partition for a given key. This distributes keys more evenly and reduces the risk of skew and hot spots, but range queries become inefficient. (source: chapter6)
- consistent-hashing: A specific approach to hash partitioning designed to minimize data movement when nodes are added or removed. (source: chapter6)
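The three strategies above can be sketched concretely in Python. The boundary values, the choice of MD5 as a stable hash, and the class/function names are all assumptions made for this sketch:

```python
import bisect
import hashlib

# --- Key-range partitioning ---------------------------------------------
# Boundaries split the key space into contiguous ranges (values are
# illustrative). Range queries touch only the partitions their range spans.
BOUNDARIES = ["g", "n", "t"]  # partitions: [..g), [g..n), [n..t), [t..]

def range_partition(key: str) -> int:
    """Partition index for a key under key-range partitioning."""
    return bisect.bisect_right(BOUNDARIES, key)

# --- Hash partitioning ---------------------------------------------------
def stable_hash(s: str) -> int:
    """A hash that is stable across processes (unlike Python's hash())."""
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

def hash_partition(key: str, num_partitions: int) -> int:
    return stable_hash(key) % num_partitions

# --- Consistent hashing --------------------------------------------------
class ConsistentHashRing:
    """Nodes and keys share one hash ring; a key belongs to the first node
    clockwise from it. Adding a node only moves keys onto that node."""
    def __init__(self, nodes: list[str]):
        self.points = sorted((stable_hash(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        pos = stable_hash(key)
        idx = bisect.bisect(self.points, (pos,)) % len(self.points)
        return self.points[idx][1]
```

The trade-off from the bullets shows up directly: `range_partition` keeps adjacent keys together (good for range scans, bad for skewed workloads), while `hash_partition` scatters them (even load, but a range query must hit every partition).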
Partitioning and Secondary Indexes
Secondary indexes add complexity to partitioning because index entries don’t map neatly to the primary-key partitioning scheme: a single secondary-index value can refer to records spread across many partitions.
- local-index (Document-partitioned): Each partition maintains its own secondary index. Queries require “scatter/gather” across all partitions. (source: chapter6)
- global-index (Term-partitioned): The secondary index is itself partitioned, allowing more targeted reads but making writes more complex and potentially involving multiple partitions. (source: chapter6)
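The scatter/gather pattern for document-partitioned (local) indexes can be sketched as follows; the partition contents and field names are made up for illustration:

```python
# Sketch of scatter/gather over document-partitioned (local) secondary
# indexes. Each partition indexes only its own documents, so a query on a
# secondary attribute must be sent to every partition and results merged.
# The data below is invented for illustration.
partitions = [
    {"color": {"red": [1, 5], "blue": [2]}},   # partition 0's local index
    {"color": {"red": [7], "green": [9]}},     # partition 1's local index
]

def scatter_gather(field: str, value: str) -> list[int]:
    results: list[int] = []
    for local_index in partitions:      # scatter: query every partition
        results.extend(local_index.get(field, {}).get(value, []))
    return sorted(results)              # gather: merge partial results
```

A term-partitioned (global) index would instead let the query go straight to the partition owning the term `("color", "red")`, at the cost of a write touching several index partitions.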
Rebalancing
As the cluster grows or the dataset changes, partitions need to be moved between nodes. Strategies include:
- Fixed number of partitions: Creating many more partitions than nodes.
- Dynamic partitioning: Splitting or merging partitions as they grow or shrink.
- Partitioning proportionally to nodes: Number of partitions is a fixed multiple of the number of nodes. (source: chapter6)
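The first strategy above can be sketched in Python: the key→partition mapping is fixed up front, and rebalancing moves only whole partitions. The counts and the "steal from the most-loaded node" rule are assumptions for this sketch:

```python
from collections import Counter

# Sketch of the "fixed number of partitions" strategy: create many more
# partitions than nodes up front, then rebalance by reassigning whole
# partitions. The key -> partition mapping never changes, so no keys are
# rehashed. Counts and the stealing rule are illustrative assumptions.
NUM_PARTITIONS = 12

def add_node(assignment: dict[int, str], new_node: str) -> dict[int, str]:
    """New node steals a fair share of partitions from existing nodes."""
    assignment = dict(assignment)
    num_nodes = len(set(assignment.values())) + 1
    share = NUM_PARTITIONS // num_nodes   # partitions new node should own
    while sum(1 for n in assignment.values() if n == new_node) < share:
        load = Counter(assignment.values())
        victim = load.most_common(1)[0][0]          # most-loaded node
        p = next(p for p, n in assignment.items() if n == victim)
        assignment[p] = new_node                    # move one whole partition
    return assignment

before = {p: f"n{p % 3}" for p in range(NUM_PARTITIONS)}  # 3 nodes, 4 each
after = add_node(before, "n3")
```

After the move, each of the 4 nodes owns 3 of the 12 partitions, and only the partitions that changed owner had to be copied over the network.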
Request Routing
Clients need to know which node holds which partition. This is often handled by:
- service-discovery: Using a separate coordination service like ZooKeeper to keep track of partition assignments.
- gossip-protocol: Nodes communicating among themselves to disseminate cluster state changes (e.g., Cassandra, Riak). (source: chapter6)
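A partition-aware client in the service-discovery style can be sketched like this; the routing table stands in for state a coordination service such as ZooKeeper would maintain, and the addresses and hash choice are invented for the sketch:

```python
import hashlib

# Sketch of request routing via a partition-aware client. In practice
# partition_map would be kept current by watching a coordination service
# (ZooKeeper-style); here it is a plain dict with made-up addresses.
partition_map = {0: "10.0.0.1:6379", 1: "10.0.0.2:6379", 2: "10.0.0.3:6379"}

def partition_for(key: str) -> int:
    """Stable hash of the key, modulo the number of partitions."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % len(partition_map)

def route(key: str) -> str:
    """Address of the node currently responsible for this key's partition."""
    return partition_map[partition_for(key)]
```

In the gossip-protocol alternative, the client could instead contact any node, which either serves the request or forwards it using cluster state learned from its peers.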