Data Integration

Summary: The process of combining multiple specialized data systems (e.g., OLTP, full-text search, and analytics) into a cohesive architecture through the use of derived data and dataflow reasoning.

Sources: chapter12

Last updated: 2026-04-18


In complex applications, data often needs to be stored and used in multiple ways simultaneously. For example, a single piece of information might be stored in a relational database for transactional consistency, a search index for full-text search, and a data warehouse for analytics. Data integration is the practice of keeping these different representations in sync (source: chapter12).

Combining Specialized Tools

Since no single software package can satisfy all data access patterns (OLTP, OLAP, full-text search, etc.) equally well, developers must combine multiple specialized tools. This combination can be achieved through:

  • Derived Data: One system (the system of record) captures the primary data, and changes are propagated to the other systems (derived datasets) through mechanisms like change data capture (CDC) or event sourcing (source: chapter12).
  • Dataflow Reasoning: Architects must reason about the flow of data through the organization. This involves identifying the primary write path (where user input enters the system) and the read path (how data is served to users) (source: chapter12).
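The derived-data idea above can be sketched in a few lines. This is a toy illustration with made-up names (`SourceOfRecord`, `SearchIndex`), not a real CDC implementation: writes to the system of record append change events to an ordered changelog, and a derived search index catches up by applying those events in order.

```python
class SourceOfRecord:
    """Primary store; the write path ends here and each write emits a change event."""
    def __init__(self):
        self.rows = {}
        self.changelog = []  # ordered log of change events (toy stand-in for CDC)

    def write(self, key, value):
        self.rows[key] = value
        self.changelog.append(("upsert", key, value))


class SearchIndex:
    """Derived dataset: a toy inverted index rebuilt by replaying the changelog."""
    def __init__(self):
        self.index = {}    # word -> set of document keys
        self.position = 0  # how far into the changelog this consumer has read

    def catch_up(self, changelog):
        for op, key, value in changelog[self.position:]:
            if op == "upsert":
                for word in value.split():
                    self.index.setdefault(word, set()).add(key)
        self.position = len(changelog)


source = SourceOfRecord()
source.write("doc1", "data integration patterns")
source.write("doc2", "log based integration")

index = SearchIndex()
index.catch_up(source.changelog)
print(index.index["integration"])  # both documents contain the word
```

The key property is that the index never accepts direct writes; it is entirely derivable from the changelog, so it can be rebuilt from scratch or a second derived system (say, an analytics store) can be added later without touching the write path.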

Approaches to Integration

  • Distributed Transactions: Traditionally, two-phase commit (2PC) was used to keep different data systems in sync. However, atomic commit across heterogeneous systems is often slow and fragile in a distributed environment: if any participant or the coordinator is unavailable, writes stall (source: chapter12).
  • Log-Based Integration: Asynchronous logs (e.g., Kafka) provide an ordering of events that multiple consumers can use to update their respective states. This approach is more robust because it decouples the availability of the source system from the derived systems (source: chapter12).
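The decoupling property of log-based integration can be shown with a minimal sketch. No real Kafka here; `EventLog` and `Consumer` are assumed toy names. Producers append to an ordered log, and each consumer tracks its own offset, so a slow consumer, or one that starts late, simply replays the log from its own position without blocking the source or the other consumers.

```python
class EventLog:
    """Append-only ordered log; a toy stand-in for a partition in a system like Kafka."""
    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)

    def read_from(self, offset):
        return self.events[offset:]


class Consumer:
    """Each consumer keeps its own offset, decoupling it from the producer and peers."""
    def __init__(self, apply_fn):
        self.offset = 0
        self.apply = apply_fn

    def poll(self, log):
        for event in log.read_from(self.offset):
            self.apply(event)
            self.offset += 1


log = EventLog()
log.append({"type": "page_view", "page": "/home"})
log.append({"type": "page_view", "page": "/about"})

# First consumer: maintains per-page view counts.
views = {}
analytics = Consumer(lambda e: views.update({e["page"]: views.get(e["page"], 0) + 1}))
analytics.poll(log)

# Second consumer starts later but still sees the full ordered history.
pages_seen = []
replica = Consumer(lambda e: pages_seen.append(e["page"]))
replica.poll(log)

print(views)       # per-page counts derived from the log
print(pages_seen)  # same events, replayed independently
```

Because every consumer sees the same events in the same order, the derived states stay consistent with each other up to their respective offsets, which is what makes this approach more robust than synchronously updating each system on the write path.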