Partitioning

Summary: Partitioning, also known as sharding, is the process of breaking a large dataset into smaller subsets (partitions) so that data and query load can be distributed across multiple nodes.

Sources: chapter6

Last updated: 2026-04-15


Partitioning is a fundamental technique for achieving scalability in data-intensive applications. It is usually combined with replication, so that copies of each partition are stored on multiple nodes for fault tolerance. (source: chapter6)

Key Concepts

  • Sharding: Another term for partitioning, commonly used in MongoDB, Elasticsearch, and SolrCloud. (source: chapter6)
  • hot-spots: Partitions with disproportionately high load due to skewed data or access patterns. (source: chapter6)
  • rebalancing: The process of moving data and request load from one node in the cluster to another, for example when nodes are added or removed. (source: chapter6)

Strategies

key-range-partitioning

Assigns a continuous range of keys to each partition.

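  • Pros: Efficient range queries, because adjacent keys are stored together in sorted order.
  • Cons: Risk of skew and hot spots. For example, if keys are timestamps, all writes for the current time period go to the same partition. (source: chapter6)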
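
The key-range strategy can be sketched in a few lines of Python. This is only an illustrative sketch: the split points in BOUNDARIES and the function names are assumptions made for this example, not something taken from the source. It shows why a range query touches only a few partitions, and why keys with a shared prefix (such as timestamps from the same day) pile up in one partition.

```python
from bisect import bisect_right
from typing import List

# Hypothetical split points, chosen only for illustration.
# Partition 0 holds keys < "g", partition 1 holds "g" <= key < "n",
# partition 2 holds "n" <= key < "t", partition 3 holds keys >= "t".
BOUNDARIES = ["g", "n", "t"]


def partition_for_key(key: str, boundaries: List[str] = BOUNDARIES) -> int:
    """Return the index of the partition whose key range contains `key`."""
    return bisect_right(boundaries, key)


def partitions_for_range(start: str, end: str,
                         boundaries: List[str] = BOUNDARIES) -> List[int]:
    """Return the partitions that can hold keys in the inclusive range [start, end]."""
    first = partition_for_key(start, boundaries)
    last = partition_for_key(end, boundaries)
    return list(range(first, last + 1))


if __name__ == "__main__":
    print(partition_for_key("apple"))      # 0
    print(partitions_for_range("h", "p"))  # [1, 2]: only two partitions are read

    # Timestamp-like keys from the same day all sort into one partition,
    # which is the hot-spot risk described above.
    same_day = ["2026-04-15T09:00", "2026-04-15T09:01", "2026-04-15T09:02"]
    print({partition_for_key(k) for k in same_day})  # {0}
```

In a real system the boundaries would be chosen to match the data distribution, not fixed by hand as they are here.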

hash-partitioning

Uses a hash function on the key to determine the partition.

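  • Pros: Distributes keys evenly across partitions, which reduces skew and hot spots. It cannot remove them entirely: all requests for a single very popular key still go to one partition.
  • Cons: Key ordering is lost, so adjacent keys land in different partitions and a range query has to be sent to every partition (scatter/gather). (source: chapter6)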
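
A minimal Python sketch of hash partitioning follows. The partition count, the choice of MD5, and the function names are assumptions for illustration only. MD5 is used because it gives the same result in every process, whereas Python's built-in hash() for strings is salted per process. Each partition owns one contiguous range of hash values.

```python
import hashlib
from typing import List

NUM_PARTITIONS = 4     # hypothetical cluster size
HASH_SPACE = 2 ** 32   # key_hash() returns values in [0, HASH_SPACE)


def key_hash(key: str) -> int:
    """Stable 32-bit hash of the key (first 4 bytes of its MD5 digest)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def partition_for_key(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map the hash onto one of `num_partitions` equal, contiguous hash ranges."""
    return key_hash(key) * num_partitions // HASH_SPACE


def partitions_for_range(start: str, end: str,
                         num_partitions: int = NUM_PARTITIONS) -> List[int]:
    """Hashing destroys key order, so a key range may touch every partition."""
    return list(range(num_partitions))


if __name__ == "__main__":
    # Adjacent keys are spread over the partitions instead of clustering.
    for key in ["2026-04-15T09:00", "2026-04-15T09:01", "2026-04-15T09:02"]:
        print(key, "->", partition_for_key(key))
    print(partitions_for_range("2026-04-15", "2026-04-16"))  # [0, 1, 2, 3]
```

Note that partitions_for_range ignores its start and end arguments on purpose: without sorted keys there is no way to narrow the search, which is the cost listed under Cons.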

Secondary Indexes in Partitioned Databases

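  • local-index: Each partition maintains its own secondary index, covering only the documents in that partition (also called a document-partitioned index). A write touches a single partition, but a read by the indexed field must query every partition and merge the results (scatter/gather).
  • global-index: A single index covering the documents in all partitions. The index is itself partitioned, by the indexed term instead of by document (term-partitioned). A read goes only to the partition that holds the term, but a write to one document may update several index partitions. (source: chapter6)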
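
The difference between the two index styles can be sketched with in-memory dictionaries. Everything here is a hypothetical illustration: the documents, the modulo rule for placing documents, and the letter boundaries for placing terms are assumptions, not details from the source.

```python
from bisect import bisect_right
from collections import defaultdict

NUM_PARTITIONS = 3
TERM_BOUNDARIES = ["h", "q"]  # hypothetical: terms a-g -> 0, h-p -> 1, q-z -> 2

# Hypothetical documents: document ID -> value of the indexed field (color).
DOCS = {191: "red", 214: "black", 306: "red", 515: "silver", 768: "red", 893: "silver"}


def doc_partition(doc_id: int) -> int:
    """Partition that stores the document itself."""
    return doc_id % NUM_PARTITIONS


def term_partition(term: str) -> int:
    """Partition that stores the global index entry for this term."""
    return bisect_right(TERM_BOUNDARIES, term)


# One index per partition in both schemes; only the placement rule differs.
local_indexes = [defaultdict(set) for _ in range(NUM_PARTITIONS)]
global_index = [defaultdict(set) for _ in range(NUM_PARTITIONS)]

for doc_id, color in DOCS.items():
    # Local: the entry lives next to the document.
    local_indexes[doc_partition(doc_id)][color].add(doc_id)
    # Global: the entry lives wherever the term belongs.
    global_index[term_partition(color)][color].add(doc_id)


def query_local(color: str) -> set:
    """Scatter/gather: ask every partition, then merge the results."""
    result = set()
    for index in local_indexes:
        result |= index.get(color, set())
    return result


def query_global(color: str) -> set:
    """Read only the single index partition that owns the term."""
    return set(global_index[term_partition(color)].get(color, set()))


if __name__ == "__main__":
    print(sorted(query_local("red")))   # [191, 306, 768]
    print(sorted(query_global("red")))  # [191, 306, 768]
```

Both queries return the same documents. The local version reads all three partitions, while the global version reads one, at the price of index updates that may land on a different partition from the document being written.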