Batch Processing

Summary: An offline data processing model where jobs run on large, bounded datasets to produce results, typically optimized for high throughput rather than low latency.

Sources: raw/chapter10

Last updated: 2026-04-18


Batch processing systems (offline systems) operate on a bounded, fixed-size input dataset: one with a known beginning and end. This contrasts with online systems (services), which wait for incoming requests, and with stream processing systems (near-real-time), which consume unbounded data (source: chapter10, p. 389).

Characteristics

  • Throughput over Latency: The primary performance metric is the time it takes to process a dataset of a certain size, rather than the response time for a single request (source: chapter10, p. 390).
  • Immutability: Input data is typically treated as immutable. A batch job reads input and produces new output without modifying the original (source: chapter10, p. 413).
  • Materialization: Intermediate state is often written to disk (materialized), which provides fault tolerance: a failed stage can be re-run from its materialized input without restarting the whole workflow (source: chapter10, p. 419).
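The immutability characteristic can be sketched as a tiny batch job: the job reads the whole bounded input, never mutates it, and derives a fresh output dataset. This is a minimal single-process illustration, not a real framework; the record shape (`user` field) and function names are assumptions for the example.

```python
from collections import Counter

def run_batch(input_records):
    """A batch job over a bounded input: read everything, derive new output.

    The input is treated as immutable; the job only reads it and returns
    a freshly built output dataset (hypothetical record shape).
    """
    counts = Counter(rec["user"] for rec in input_records)
    # Produce a new dataset rather than modifying the original input.
    return [{"user": u, "count": c} for u, c in sorted(counts.items())]

events = [{"user": "a"}, {"user": "b"}, {"user": "a"}]
report = run_batch(events)
```

Because the input survives unchanged, the job can simply be re-run from scratch after a failure, which is the basic fault-tolerance argument behind immutable inputs.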

Use Cases

  • Search Indexing: Google’s original use for MapReduce was building its search index (source: chapter10, p. 411).
  • Analytics (OLAP): Aggregating logs to produce reports or business intelligence (source: chapter10, p. 411).
  • Machine Learning: Training models on historical data.
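The MapReduce model behind the search-indexing use case can be sketched with the classic word-count example: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. This is a toy single-process sketch of the pattern, not Google's or Hadoop's implementation; the document ids and function names are illustrative.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # Mapper: emit a (word, 1) pair for every word in one document.
    for word in text.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: bring all values for the same key together, as the
    # framework would between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, ones):
    # Reducer: aggregate all counts for a single word.
    return word, sum(ones)

docs = {"d1": "hello batch world", "d2": "hello world"}
pairs = [p for did, text in docs.items() for p in map_phase(did, text)]
counts = dict(reduce_phase(w, vs) for w, vs in shuffle(pairs).items())
```

In a real deployment the map and reduce tasks run in parallel across many machines, and the shuffle moves data over the network, but the dataflow is the same.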