Lambda Architecture
Summary: A data-processing architecture designed to handle massive quantities of data by providing both a batch layer for historical accuracy and a speed layer for real-time updates.
Sources: chapter12
Last updated: 2026-04-18
The lambda architecture, popularized by Nathan Marz, is based on the idea that incoming data should be recorded by appending immutable events to an always-growing dataset, similar to event sourcing. It consists of two different systems that run in parallel over the same events (source: chapter12):
- Batch Layer: Reprocesses the entire historical dataset (typically with Hadoop MapReduce) to produce accurate, read-optimized views. This layer is simpler and less prone to bugs, but has high latency.
- Speed Layer: Processes only recent updates (using a stream processor such as Storm) to provide low-latency results. It is more complex and may rely on approximate algorithms, so its results can be less accurate than the batch layer's until a later batch run corrects them.
Queries are answered by merging the results from both the batch and speed layers (source: chapter12).
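The interplay of the two layers can be sketched in a few lines of Python. This is a minimal, single-process illustration and not how Hadoop or Storm are actually programmed: the `Event` schema, the `LambdaCounter` class, and its method names are all hypothetical, and the batch run here is synchronous, whereas a real batch job runs alongside the speed layer.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """An immutable fact appended to the log (hypothetical schema)."""
    offset: int  # position in the append-only log
    page: str    # key being counted


class LambdaCounter:
    """Toy page-view counter with a batch view and a speed view."""

    def __init__(self):
        self.log = []                # append-only master dataset
        self.batch_view = Counter()  # accurate, recomputed from scratch
        self.batch_offset = 0        # length of the log prefix in batch_view
        self.speed_view = Counter()  # incremental, covers events after that

    def append(self, page):
        event = Event(offset=len(self.log), page=page)
        self.log.append(event)
        # Speed layer: update the low-latency view as each event arrives.
        self.speed_view[event.page] += 1

    def run_batch(self):
        # Batch layer: recompute the view from the entire history.
        self.batch_view = Counter(e.page for e in self.log)
        self.batch_offset = len(self.log)
        # Keep only events the batch view does not cover yet. Because this
        # toy batch run is synchronous, that is always an empty set here.
        self.speed_view = Counter(
            e.page for e in self.log[self.batch_offset:]
        )

    def query(self, page):
        # Serving: merge both views. For plain counts, merging is addition.
        return self.batch_view[page] + self.speed_view[page]
```

Calling `append` three times, then `run_batch`, then `append` once more leaves three events in the batch view and one in the speed view, and `query` adds the two.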
Critique and Unification
While influential, the lambda architecture has several practical problems:
- Maintenance Burden: Developers must maintain the same logic in two different systems (e.g., MapReduce for batch and Storm for stream processing), which is error-prone.
- Complexity of Merging: Merging results from the two layers is straightforward for simple aggregations such as counts, but becomes difficult for more complex operations such as joins.
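To give a feel for the merging problem, the sketch below uses a distinct-user count as a simpler stand-in for a join. It is an illustration under assumed data, not an example from the source: the view names and contents are hypothetical. The point is that per-layer results cannot always be combined by simple arithmetic, because the same entity may appear in both layers.

```python
# Hypothetical views: each layer tracks the users seen per page.
batch_users = {"/home": {"alice", "bob"}}  # derived from the historical dataset
speed_users = {"/home": {"bob", "carol"}}  # derived from recent events


def merged_count_naive(page):
    # Wrong: adds the per-layer distinct counts, so "bob" is counted twice.
    return len(batch_users.get(page, set())) + len(speed_users.get(page, set()))


def merged_count_exact(page):
    # Correct: union the underlying sets first, then count.
    return len(batch_users.get(page, set()) | speed_users.get(page, set()))
```

The exact merge needs the full user sets from both layers rather than just their counts, which is the kind of extra state that makes merging more involved than it first appears.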
More recent work has focused on unifying batch and stream processing in a single system (e.g., Apache Flink or Google Cloud Dataflow). This approach allows the same code to run in both modes, which removes the need to maintain the logic twice and helps keep the results of the two modes consistent (source: chapter12).
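The idea of one piece of logic serving both modes can be sketched in plain Python. This does not use the Flink or Dataflow APIs. The function names and the `live_feed` source are hypothetical, and a finite generator stands in for an unbounded stream so that the example terminates.

```python
from collections import Counter
from typing import Iterable, Iterator


def running_counts(events: Iterable[str]) -> Iterator[Counter]:
    """Shared logic: yield the cumulative counts after each event."""
    counts = Counter()
    for key in events:
        counts[key] += 1
        yield counts.copy()


def batch_result(history: Iterable[str]) -> Counter:
    # Batch mode: consume a bounded input and keep only the final state.
    final = Counter()
    for final in running_counts(history):
        pass
    return final


def live_feed() -> Iterator[str]:
    # Hypothetical stand-in for an unbounded source such as a message queue.
    yield from ["/home", "/about", "/home"]
```

In stream mode, a caller would iterate over `running_counts(live_feed())` and act on each snapshot as it arrives. In batch mode, `batch_result` replays the stored history through the same function. Since both paths share `running_counts`, there is only one implementation to maintain.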