Change Data Capture

Summary: The process of observing all data changes written to a database and extracting them in a form that can be replicated to other systems.

Sources: chapter11

Last updated: 2026-04-18


Purpose

Change Data Capture (CDC) is primarily used to keep derived data systems—such as caches, search indexes, and data warehouses—in sync with a primary “system of record” database.

Implementation

  • Database Triggers: Registering triggers that observe all changes and append entries to a changelog table. This approach tends to be fragile and adds significant performance overhead.
  • Parsing Replication Logs: A more robust approach, in which a tool reads the database’s replication log (e.g., the MySQL binlog or MongoDB oplog) to extract changes. Tools include LinkedIn’s Databus, Facebook’s Wormhole, and Debezium.
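Whichever mechanism produces the change events, the consumer’s job is the same: replay them, in commit order, against a derived view. A minimal sketch (the event shape and field names are illustrative, not any particular tool’s format):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChangeEvent:
    """One entry read from the replication log (fields are illustrative)."""
    table: str
    key: str
    value: Optional[dict]  # None encodes a delete

def apply_to_derived(store: dict, event: ChangeEvent) -> None:
    """Apply a single change to a derived key-value view, e.g. a cache."""
    if event.value is None:
        store.pop(event.key, None)  # delete propagated to the derived view
    else:
        store[event.key] = event.value

# Replaying the log in commit order rebuilds the derived view.
log = [
    ChangeEvent("users", "1", {"name": "Ada"}),
    ChangeEvent("users", "1", {"name": "Ada Lovelace"}),
    ChangeEvent("users", "2", {"name": "Bob"}),
    ChangeEvent("users", "2", None),
]
cache: dict = {}
for event in log:
    apply_to_derived(cache, event)
# cache == {"1": {"name": "Ada Lovelace"}}
```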

Key Features

  • Asynchronous: CDC is usually asynchronous; the primary database does not wait for consumers to process the change.
  • Ordering: It is crucial that changes are applied to derived systems in the same order they occurred in the primary database to avoid inconsistency.
  • Log Compaction: If the CDC stream is stored in a log-based message broker (e.g., Apache Kafka), compaction can be used to keep only the latest value for each key, reducing storage requirements.
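The compaction idea can be sketched in a few lines: scan the log, keep only the most recent value per key, and treat a null value as a tombstone that deletes the key (this mirrors the semantics of compacted Kafka topics, though the code below is a toy, not Kafka’s implementation):

```python
def compact(log):
    """Keep only the most recent value for each key.

    A value of None is a tombstone: it marks the key as deleted,
    so compaction removes the key entirely.
    """
    latest = {}
    for key, value in log:
        if value is None:
            latest.pop(key, None)  # tombstone: drop the key
        else:
            latest[key] = value
    return list(latest.items())

events = [
    ("user:1", "Ada"),
    ("user:2", "Bob"),
    ("user:1", "Ada L."),  # supersedes the first user:1 entry
    ("user:2", None),      # tombstone for user:2
]
compacted = compact(events)
# compacted == [("user:1", "Ada L.")]
```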

Role in Unbundling Databases

As discussed in Chapter 12, CDC is a key technology for “unbundling the database.” By providing a stream of data changes, it allows the synchronization of disparate storage technologies (e.g., keeping an Elasticsearch index in sync with a Postgres database) without the need for distributed transactions. This enables a “database-inside-out” architecture where the application’s write path and various read paths are decoupled and can scale independently (source: chapter12).
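The decoupling described above can be illustrated with a toy append-only log that several consumers read independently, each maintaining its own position (a stand-in for a log-based message broker; the class and its methods are hypothetical, for illustration only):

```python
class ChangeLog:
    """An append-only change log that multiple consumers read
    independently, each from its own offset (a toy stand-in for a
    log-based message broker)."""

    def __init__(self):
        self.entries = []

    def append(self, event):
        self.entries.append(event)

    def read_from(self, offset):
        return self.entries[offset:]

log = ChangeLog()
log.append(("user:1", {"name": "Ada"}))
log.append(("user:1", {"name": "Ada Lovelace"}))

# Two derived systems consume the same ordered stream independently.
cache, search_index = {}, {}
for key, value in log.read_from(0):
    cache[key] = value                         # read path 1: key-value cache
for key, value in log.read_from(0):
    search_index[key] = value["name"].lower()  # read path 2: toy "index"
# Both views derive from the same write path, with no distributed transaction.
```

Because each consumer tracks its own offset, a new read path (say, a full-text index) can be added later and backfilled by replaying the log from offset 0, without touching the write path.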