Batch Processing
Summary: An offline data processing model where jobs run on large, bounded datasets to produce results, typically optimized for high throughput rather than low latency.
Sources: raw/chapter10
Last updated: 2026-04-18
Batch processing systems (offline systems) operate on a bounded input dataset of known, fixed size: it has a known start and end. This contrasts with online systems (services), which wait for incoming requests, and with stream processing systems (near-real-time), which handle unbounded data (source: chapter10, p. 389).
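As a minimal sketch of this distinction (the dataset here is invented), a batch job consumes its entire bounded input and then terminates, producing a result only once the whole input has been read; a service, by contrast, would loop forever waiting for requests:

```python
# A batch job reads a bounded input in full, then terminates.
# Contrast with an online service, which waits for requests indefinitely.

def batch_word_count(lines):
    """Process the entire (bounded) input and return a final result."""
    counts = {}
    for line in lines:  # the input has a known end, so this loop terminates
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts  # the result exists only when the whole job finishes

# Invented sample dataset standing in for a large input file:
dataset = ["the quick brown fox", "the lazy dog"]
result = batch_word_count(dataset)
```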
Characteristics
- Throughput over Latency: The primary performance metric is the time it takes to process a dataset of a given size, rather than the response time for a single request (source: chapter10, p. 390).
- Immutability: Input data is typically treated as immutable. A batch job reads input and produces new output without modifying the original (source: chapter10, p. 413).
- Materialization: Intermediate state is often written to disk (materialized) to provide fault tolerance; a failed stage can be restarted from its materialized input (source: chapter10, p. 419).
Use Cases
- Search Indexing: Google’s original use for MapReduce was building its search index (source: chapter10, p. 411).
- Analytics (OLAP): Aggregating logs to produce reports or business intelligence (source: chapter10, p. 411).
- Machine Learning: Training models on historical data.
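As a hedged sketch of the analytics use case (the log format and field positions here are invented), a batch job can aggregate a bounded access log into a ranked report of the most-requested URLs:

```python
from collections import Counter

def top_urls(log_lines, n=5):
    """Aggregate a bounded log into a ranked report (OLAP-style)."""
    # Assumed simplified log format: "<ip> <method> <url> <status>"
    counter = Counter(line.split()[2] for line in log_lines if line.strip())
    return counter.most_common(n)

# Invented sample log data:
log = [
    "10.0.0.1 GET /home 200",
    "10.0.0.2 GET /about 200",
    "10.0.0.1 GET /home 200",
]
report = top_urls(log)
```

This is the same shape as the classic Unix-tools log analysis: extract a field, group identical values, count, and rank.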