Batch Processing

Summary: An offline data processing model where jobs run on large, bounded datasets to produce results, typically optimized for high throughput rather than low latency.

Sources: raw/chapter10

Last updated: 2026-04-18


Batch processing systems (offline systems) operate on a bounded, fixed-size input dataset: one with a known beginning and end. This contrasts with online systems (services), which wait for incoming requests, and with stream processing systems (near-real-time), which consume unbounded data (source: chapter10, p. 389).

Characteristics

  • Throughput over Latency: The primary performance metric is the time it takes to process a dataset of a certain size, rather than the response time for a single request (source: chapter10, p. 390).
  • Immutability: Input data is typically treated as immutable. A batch job reads input and produces new output without modifying the original (source: chapter10, p. 413).
  • Materialization: Intermediate state is often written to disk (materialized), which provides fault tolerance: a failed stage can be re-run from its materialized input without restarting the whole workflow (source: chapter10, p. 419).
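The immutability characteristic can be sketched as a tiny batch job: the job reads the whole bounded input, never mutates it, and derives a fresh output dataset. This is a minimal single-process illustration, not a real framework; the record shape (`user` field) and function names are assumptions for the example.

```python
from collections import Counter

def run_batch(input_records):
    """A batch job over a bounded input: read everything, derive new output.

    The input is treated as immutable; the job only reads it and returns
    a freshly built output dataset (hypothetical record shape).
    """
    counts = Counter(rec["user"] for rec in input_records)
    # Produce a new dataset rather than modifying the original input.
    return [{"user": u, "count": c} for u, c in sorted(counts.items())]

events = [{"user": "a"}, {"user": "b"}, {"user": "a"}]
report = run_batch(events)
```

Because the input survives unchanged, the job can simply be re-run from scratch after a failure, which is the basic fault-tolerance argument behind immutable inputs.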

Use Cases

  • Search Indexing: Google’s original use for MapReduce was building its search index (source: chapter10, p. 411).
  • Analytics (OLAP): Aggregating logs to produce reports or business intelligence (source: chapter10, p. 411).
  • Machine Learning: Training models on historical data.
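The MapReduce model behind the search-indexing use case can be sketched with the classic word-count example: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. This is a toy single-process sketch of the pattern, not Google's or Hadoop's implementation; the document ids and function names are illustrative.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # Mapper: emit a (word, 1) pair for every word in one document.
    for word in text.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: bring all values for the same key together, as the
    # framework would between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, ones):
    # Reducer: aggregate all counts for a single word.
    return word, sum(ones)

docs = {"d1": "hello batch world", "d2": "hello world"}
pairs = [p for did, text in docs.items() for p in map_phase(did, text)]
counts = dict(reduce_phase(w, vs) for w, vs in shuffle(pairs).items())
```

In a real deployment the map and reduce tasks run in parallel across many machines, and the shuffle moves data over the network, but the dataflow is the same.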