Partitioning
Summary: Partitioning (also known as sharding) breaks a large dataset into smaller subsets so that data and query load can be distributed across multiple nodes.
Sources: chapter6
Last updated: 2026-04-15
Partitioning is a fundamental technique for achieving scalability in data-intensive applications. It is usually combined with replication, so that copies of each partition are stored on multiple nodes for fault tolerance. (source: chapter6)
Key Concepts
- Sharding: Another term for partitioning; MongoDB, Elasticsearch, and SolrCloud call a partition a shard. (source: chapter6)
- hot-spots: Partitions with disproportionately high load due to skewed data or access patterns. (source: chapter6)
- rebalancing: The process of moving load (data and the requests for it) from one node in the cluster to another, for example when nodes are added or removed. (source: chapter6)
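To make the idea of a hot spot concrete, here is a minimal sketch that counts requests per partition and flags the overloaded ones. The function names, the `partition_for` callback, and the threshold `factor=2.0` are all assumptions made for illustration; they do not come from the source.

```python
from collections import Counter

def load_per_partition(accessed_keys, partition_for, num_partitions):
    """Count how many accesses landed on each partition.

    partition_for is a hypothetical callback mapping a key to a partition index.
    """
    counts = Counter(partition_for(key) for key in accessed_keys)
    return [counts.get(p, 0) for p in range(num_partitions)]

def hot_spots(loads, factor=2.0):
    """Return partitions whose load exceeds factor times the mean load.

    The threshold is an arbitrary choice for this sketch.
    """
    if not loads:
        return []
    mean = sum(loads) / len(loads)
    return [p for p, load in enumerate(loads) if load > factor * mean]

# Skewed access pattern: key "a" is requested far more often than the others.
keys = ["a"] * 9 + ["b", "c", "d"]
loads = load_per_partition(keys, lambda k: ord(k[0]) - ord("a"), 4)
print(loads)             # [9, 1, 1, 1]
print(hot_spots(loads))  # [0]
```

Partition 0 receives nine of the twelve requests, well above twice the mean of three, so it is reported as a hot spot.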
Strategies
key-range-partitioning
Assigns a continuous range of keys to each partition.
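- Pros: Efficient range queries, because adjacent keys are stored in the same partition.
- Cons: Risk of skew and hot spots when access follows key order (e.g., with timestamp keys, all writes for the current period land in the same partition). (source: chapter6)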
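Key-range lookup can be sketched as a binary search over sorted partition boundaries. The boundary values and function names below are hypothetical, chosen only to illustrate the technique.

```python
from bisect import bisect_right

# Hypothetical boundaries: partition 0 holds keys < "g", partition 1 holds
# "g" <= key < "n", partition 2 holds "n" <= key < "t", partition 3 the rest.
DEFAULT_BOUNDARIES = ["g", "n", "t"]

def range_partition_for(key, boundaries=DEFAULT_BOUNDARIES):
    """Return the index of the partition whose key range contains key."""
    return bisect_right(boundaries, key)

def partitions_for_range(start, end, boundaries=DEFAULT_BOUNDARIES):
    """Return the partitions a scan over [start, end] has to visit."""
    first = range_partition_for(start, boundaries)
    last = range_partition_for(end, boundaries)
    return list(range(first, last + 1))

print(range_partition_for("apple"))    # 0
print(partitions_for_range("h", "m"))  # [1]
print(partitions_for_range("e", "p"))  # [0, 1, 2]
```

A narrow range scan touches a single partition. The same property causes the hot spot: with boundaries such as `["2026-02", "2026-03", "2026-04"]`, every key starting with `2026-04-15` maps to the last partition.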
hash-partitioning
Uses a hash function on the key to determine the partition.
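- Pros: A good hash function spreads keys evenly across partitions, even when the keys themselves are skewed, which helps avoid hot spots.
- Cons: The sort order of keys is lost, so a range query must be sent to all partitions (scatter/gather). (source: chapter6)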
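A minimal sketch of hash partitioning follows. MD5 and the modulo mapping are assumptions of this example: MD5 is used only because it is deterministic across processes (Python's built-in `hash` for strings is not), and the modulo step is the simplest possible mapping rather than a scheme suited to rebalancing.

```python
import hashlib

NUM_PARTITIONS = 4  # hypothetical number of partitions

def hash_partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_partitions

def hash_partitions_for_range(start, end, num_partitions=NUM_PARTITIONS):
    """Return the partitions a range scan has to visit.

    Hashing destroys key order, so keys between start and end may live
    anywhere and every partition has to be queried.
    """
    return list(range(num_partitions))

# Adjacent timestamp keys no longer share a partition by construction.
for minute in range(3):
    key = f"2026-04-15T10:{minute:02d}"
    print(key, "->", hash_partition_for(key))
```

The trade-off shows up in `hash_partitions_for_range`: it ignores its bounds because no subset of partitions can be ruled out.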
Secondary Indexes in Partitioned Databases
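- local-index: Document-partitioned. Each partition maintains its own index covering only the documents it stores, so a write touches a single partition but a read by the indexed field must query all partitions (scatter/gather).
- global-index: Term-partitioned. One index covers the documents in all partitions and is itself partitioned by the indexed term, so a read goes to the partition holding that term but a single write may update several index partitions. (source: chapter6)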
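The two index layouts can be contrasted in a small in-memory sketch. Everything here is hypothetical: the sample documents, the use of `doc_id % N` to place documents, and CRC32 to place index terms are assumptions made for illustration only.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 2  # hypothetical number of partitions

def doc_partition(doc_id):
    """Partition that stores the document (assumed: modulo on the ID)."""
    return doc_id % NUM_PARTITIONS

def term_partition(term):
    """Partition that stores the index entry for a term (assumed: CRC32)."""
    return zlib.crc32(term.encode("utf-8")) % NUM_PARTITIONS

def build_local_indexes(docs):
    """One index per partition, covering only that partition's documents."""
    indexes = [defaultdict(set) for _ in range(NUM_PARTITIONS)]
    for doc_id, color in docs.items():
        indexes[doc_partition(doc_id)][color].add(doc_id)
    return indexes

def build_global_index(docs):
    """One logical index over all documents, split across partitions by term."""
    indexes = [defaultdict(set) for _ in range(NUM_PARTITIONS)]
    for doc_id, color in docs.items():
        indexes[term_partition(color)][color].add(doc_id)
    return indexes

def query_local(indexes, color):
    """Scatter the query to every partition and gather the matches."""
    result = set()
    for index in indexes:
        result |= index.get(color, set())
    return result

def query_global(indexes, color):
    """Ask only the index partition responsible for this term."""
    return set(indexes[term_partition(color)].get(color, set()))

docs = {191: "red", 214: "black", 306: "red", 515: "silver", 768: "red"}
print(sorted(query_local(build_local_indexes(docs), "red")))  # [191, 306, 768]
print(sorted(query_global(build_global_index(docs), "red")))  # [191, 306, 768]
```

Both queries return the same documents. The difference is in how many partitions each one contacts: `query_local` loops over all of them, while `query_global` reads exactly one.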