Chapter 3: Storage and Retrieval
Summary: An exploration of how databases store and retrieve data, focusing on the trade-offs between different storage engine architectures and the divergence between transactional (OLTP) and analytical (OLAP) workloads.
Sources: chapter3
Last updated: 2026-04-15
Overview
At its most fundamental level, a database performs two tasks: storing data and retrieving it when requested. Choosing an appropriate storage engine requires understanding how it handles these tasks under the hood (source: chapter3).
Key Storage Engine Families
The chapter identifies two main families of storage engines:
- Log-Structured Engines: These engines only permit appending to files and deleting obsolete files. They never update a file that has been written. Examples include hash-indexes, sstables, and lsm-trees. They rely on compaction to reclaim space (source: chapter3).
- Page-Oriented Engines: These engines treat the disk as a set of fixed-size pages that can be overwritten. The most common example is the b-trees architecture (source: chapter3).
OLTP vs. OLAP
There is a significant difference between storage engines optimized for transactional workloads and those optimized for analytics:
- oltp (Online Transaction Processing): User-facing, low-latency, small number of records per query. Disk seek time is often the bottleneck (source: chapter3).
- olap (Online Analytic Processing): Used by analysts, high-volume aggregates, demanding many millions of records to be scanned. Disk bandwidth is often the bottleneck, often employing column-oriented-storage (source: chapter3).
Other Concepts
- bloom-filters: Used to optimize lookups for non-existent keys in log-structured engines (source: chapter3).
- write-ahead-log: A mechanism for crash recovery in update-in-place engines (source: chapter3).
- data-warehousing: A separate database for analytics that extracts data from OLTP systems (source: chapter3).
- materialized-views: Cached aggregates used to speed up common analytical queries (source: chapter3).