Lambda Architecture

Summary: A data-processing architecture designed to handle massive quantities of data by providing both a batch layer for historical accuracy and a speed layer for real-time updates.

Sources: chapter12

Last updated: 2026-04-18


The lambda architecture, popularized by Nathan Marz, is based on the idea that incoming data should be recorded by appending immutable events to an always-growing dataset, similar to event sourcing. Read-optimized views are then derived from these events by two systems running in parallel (source: chapter12):

  1. Batch Layer: Reprocesses the entire historical dataset (e.g., with Hadoop MapReduce) to produce accurate, read-optimized views. Batch processing is considered simpler and less prone to bugs, but it has high latency.
  2. Speed Layer: Processes only recent events (using a stream processor such as Storm) to provide low-latency updates to the views. It is considered more complex and harder to make fault-tolerant, so its results may be less accurate than those of the batch layer.

Queries are answered by merging the results from both the batch and speed layers (source: chapter12).
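
The flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Hadoop or Storm are actually programmed: the `PageView` event type, the `cutoff` timestamp marking what the last batch run covered, and all function names are hypothetical. The speed layer is modeled here by filtering the log, whereas a real stream processor would update its view incrementally as each event arrives.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PageView:
    """Immutable event; the field names are illustrative assumptions."""
    timestamp: int  # seconds since an arbitrary epoch
    url: str


def batch_view(log, cutoff):
    """Batch layer: recompute counts from the full log up to `cutoff`."""
    return Counter(e.url for e in log if e.timestamp < cutoff)


def speed_view(log, cutoff):
    """Speed layer: count only events the last batch run has not covered."""
    return Counter(e.url for e in log if e.timestamp >= cutoff)


def query(url, batch, speed):
    """Answer a query by merging both views (a plain sum for simple counts)."""
    return batch[url] + speed[url]


# Append-only log of events (hypothetical sample data).
log = [
    PageView(100, "/home"),
    PageView(150, "/docs"),
    PageView(220, "/home"),
    PageView(260, "/home"),
]
cutoff = 200  # assumption: the last batch run covered events before t=200

batch = batch_view(log, cutoff)
speed = speed_view(log, cutoff)
print(query("/home", batch, speed))  # 3
```

The merge is a simple addition only because the view is a count. For other kinds of derived views the merge step is not this straightforward.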

Critique and Unification

While influential, the lambda architecture has practical problems, including:

  • Maintenance Burden: Developers must maintain the same logic in two different systems (e.g., MapReduce for batch and Storm for stream processing), which is error-prone.
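  • Complexity of Merging: Because the two layers produce separate outputs, those outputs must be merged to answer queries. This is manageable for simple aggregations but becomes difficult for more complex operations such as joins.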

More recent work has focused on unifying batch and stream processing in a single system (e.g., Apache Flink or Google Cloud Dataflow). This approach allows the same processing logic to be written once and run over both historical and live data, which removes the burden of maintaining two implementations and the risk that they produce inconsistent results (source: chapter12).
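
The idea of writing the logic once can be illustrated with a minimal plain-Python sketch. It does not use the Flink or Dataflow APIs; `running_counts`, `history`, and `live_events` are hypothetical names, and a generator merely stands in for an unbounded event source. The point is only that one function serves both a bounded replay of history and an incrementally consumed stream.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def running_counts(events: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (key, count so far) after each event, for any iterable input."""
    counts = Counter()
    for key in events:
        counts[key] += 1
        yield key, counts[key]


# "Batch mode": a bounded, historical dataset is consumed in full,
# and only the final count per key is kept.
history = ["/home", "/docs", "/home"]
batch_result = dict(running_counts(history))


def live_events() -> Iterator[str]:
    """Stand-in for an unbounded source; a real one would never finish."""
    yield "/home"
    yield "/docs"
    yield "/home"


# "Stream mode": the same function is consumed one event at a time,
# so an up-to-date count is available after every event.
stream = running_counts(live_events())
first_update = next(stream)

print(batch_result)   # {'/home': 2, '/docs': 1}
print(first_update)   # ('/home', 1)
```

Because both modes run the same function, there is no second implementation to keep in sync and no separate pair of outputs to merge.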