Change Data Capture (CDC)

Core Concept

Level: intermediate
Estimated time: 25-30 minutes
Tags: cdc, replication, streaming, debezium, event-sourcing, real-time

Capturing database changes for real-time replication and streaming

Overview

Change Data Capture (CDC) is a technique for tracking and capturing changes in a database so that downstream systems can be updated in near real time. CDC enables reactive architectures in which changes propagate immediately to dependent systems, much like a notification system that instantly alerts every interested party when something important happens.

[Figure: CDC system architecture diagram]

CDC Approaches

Log-Based CDC

Log-based CDC reads the database's write-ahead log (WAL) to capture changes, similar to reading a detailed diary of everything that happened. The approach is non-intrusive because it requires no application changes; it simply reads the logs the database already maintains. It captures every committed change, and many implementations can also surface DDL (Data Definition Language) operations such as table creation or schema modifications, although support for this varies by database. Latency is very low, giving near real-time change detection because changes are read directly from the database's internal logs.
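
In practice the log is usually read by a dedicated connector, but the mechanics can be illustrated directly. The following is a minimal sketch using psycopg2's logical replication support against PostgreSQL; the connection string, slot name, and the test_decoding output plugin are illustrative assumptions, and the server must be configured with wal_level=logical.

```python
# Log-based CDC sketch: stream decoded changes from PostgreSQL's WAL
# via a logical replication slot.
import psycopg2
import psycopg2.extras

conn = psycopg2.connect(
    "dbname=app user=cdc_reader",  # hypothetical connection string
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()

# Create the slot once; 'test_decoding' is the built-in demo output plugin.
try:
    cur.create_replication_slot("cdc_demo", output_plugin="test_decoding")
except psycopg2.errors.DuplicateObject:
    pass  # slot already exists

cur.start_replication(slot_name="cdc_demo", decode=True)

def handle_change(msg):
    # msg.payload holds the decoded change (INSERT/UPDATE/DELETE as text)
    print(msg.payload)
    # Acknowledge progress so the server can recycle old WAL segments
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(handle_change)  # blocks, invoking handle_change per change
```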

Trigger-Based CDC

Trigger-based CDC uses database triggers that execute automatically on INSERT, UPDATE, and DELETE operations, like automatic alarms that go off whenever something changes. This approach requires database schema changes to install the triggers and, typically, change tables to hold the captured rows. It allows selective capture, since triggers can filter for exactly the changes you care about. However, it adds performance overhead to every transaction, because each write must also execute the trigger code.
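
The idea can be shown with a small, runnable example. This sketch uses Python's built-in sqlite3 purely so it runs anywhere; the table names, columns, and trigger are made up for illustration.

```python
# Trigger-based CDC sketch: an AFTER UPDATE trigger copies each change into
# a change-log table inside the same database.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT);
CREATE TABLE customer_changes (
    change_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    customer_id INTEGER,
    old_email   TEXT,
    new_email   TEXT,
    changed_at  TEXT DEFAULT CURRENT_TIMESTAMP
);
-- The trigger fires on every UPDATE, adding overhead to the transaction
CREATE TRIGGER capture_customer_update
AFTER UPDATE ON customers
BEGIN
    INSERT INTO customer_changes (customer_id, old_email, new_email)
    VALUES (OLD.id, OLD.email, NEW.email);
END;
""")

db.execute("INSERT INTO customers VALUES (1, 'a@example.com')")
db.execute("UPDATE customers SET email = 'b@example.com' WHERE id = 1")

print(db.execute("SELECT * FROM customer_changes").fetchall())
# A downstream process would poll customer_changes and publish the rows.
```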

Timestamp-Based CDC

Timestamp-based CDC tracks the last update time using modified timestamps, like checking when someone last updated their profile. This polling approach periodically checks for changes by comparing timestamps. It's simple to implement and easy to understand, making it a good starting point for CDC implementations. However, it has limitations - it cannot capture deletes because deleted records disappear, and there's inherent polling lag between when changes occur and when they're detected.
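
A runnable sketch of the polling approach, again using sqlite3; the table, the updated_at column, and the high-water-mark handling are illustrative, and in a real system the high-water mark would be persisted.

```python
# Timestamp-based CDC sketch: poll for rows whose updated_at is newer than the
# last high-water mark. Hard deletes are invisible to this approach.
import sqlite3
import time

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at TEXT)")

last_seen = "1970-01-01 00:00:00"  # high-water mark

def poll_changes():
    global last_seen
    rows = db.execute(
        "SELECT id, status, updated_at FROM orders WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    for order_id, status, updated_at in rows:
        print("changed:", order_id, status)
        last_seen = max(last_seen, updated_at)

db.execute("INSERT INTO orders VALUES (1, 'NEW', datetime('now'))")
poll_changes()   # picks up order 1
time.sleep(1)
db.execute("UPDATE orders SET status='PAID', updated_at=datetime('now') WHERE id=1")
poll_changes()   # picks up the update only on the next polling cycle
```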

Dual Writing

Dual writing involves application-level changes where you write to both the database and a message system simultaneously, like sending a letter to two different addresses at the same time. This provides immediate propagation with no additional latency since changes are sent directly to downstream systems. However, it presents consistency challenges because there's a risk of partial failures where one write succeeds and the other fails. This requires complex error handling and compensation logic to ensure data consistency.
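
The partial-failure window is the crux, so here is a deliberately failing sketch; the FlakyBroker class and the in-memory "database" list are stand-ins invented for illustration.

```python
# Dual-writing sketch: two independent writes, so a failure between them
# leaves the database and the message stream out of sync.
class FlakyBroker:
    def publish(self, topic, event):
        raise ConnectionError("broker unavailable")  # simulate a partial failure

def save_order(db, broker, order):
    db.append(order)                 # step 1: database write succeeds
    broker.publish("orders", order)  # step 2: publish fails -> inconsistency

db, broker = [], FlakyBroker()
try:
    save_order(db, broker, {"id": 1, "status": "NEW"})
except ConnectionError:
    # The order is stored but was never announced downstream; compensation
    # logic (or the outbox pattern described below) is needed to repair this.
    print("stored rows:", db)
```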

Implementation Patterns

Outbox Pattern

The Outbox pattern provides transactional safety by writing events to an outbox table in the same transaction as the business data, like putting a copy of an important letter in a special mailbox before sending the original. A separate process then relays events from the outbox to the message system, ensuring guaranteed delivery: events are eventually published even if the initial call to the message system fails. In practice this gives at-least-once delivery, so consumers typically deduplicate (for example by event ID) to achieve effectively exactly-once processing.
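
A minimal sketch of the pattern, assuming a relational store (sqlite3 here) and a hypothetical publish callback standing in for the message system.

```python
# Outbox pattern sketch: the business row and the outbox event are written in
# one transaction; a separate relay later publishes pending events.
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                     payload TEXT, published INTEGER DEFAULT 0);
""")

def place_order(order_id, status):
    with db:  # single transaction: both rows commit or neither does
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, status))
        db.execute("INSERT INTO outbox (payload) VALUES (?)",
                   (json.dumps({"order_id": order_id, "status": status}),))

def relay_once(publish):
    # Poll unpublished events, publish them, then mark them as sent.
    # A crash after publish() but before the UPDATE re-sends the event,
    # which is why consumers should deduplicate (at-least-once delivery).
    rows = db.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id").fetchall()
    for event_id, payload in rows:
        publish(payload)
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (event_id,))
    db.commit()

place_order(1, "NEW")
relay_once(lambda p: print("published:", p))
```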

Event Sourcing

Event sourcing treats events as the source of truth, storing all changes as events rather than just the current state, like keeping a detailed diary of everything that happened rather than just a summary. This creates an append-only log that provides an immutable event history. The system has replay capability, allowing you to rebuild the current state from events, similar to rewinding a movie to see how things got to where they are. This provides a complete audit trail with full change history.
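
A tiny in-memory sketch of the idea; the event names and the account domain are invented for illustration, and a real system would use a durable, append-only event store.

```python
# Event-sourcing sketch: an append-only list of events is the source of truth;
# current state is rebuilt by replaying the events in order.
events = []

def append(event_type, **data):
    events.append({"type": event_type, **data})

def rebuild_balance(account_events):
    balance = 0
    for e in account_events:
        if e["type"] == "Deposited":
            balance += e["amount"]
        elif e["type"] == "Withdrawn":
            balance -= e["amount"]
    return balance

append("Deposited", amount=100)
append("Withdrawn", amount=30)
append("Deposited", amount=5)

print(rebuild_balance(events))  # 75 -- replaying the log yields current state
# The events list doubles as an immutable audit trail of every change.
```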

Saga Pattern

The Saga pattern coordinates distributed transactions across multiple services, like having a conductor who ensures all musicians in an orchestra play in harmony. It uses compensating actions to handle failures with rollback logic, similar to having a backup plan for when things go wrong. The pattern accepts eventual consistency, recognizing that temporary inconsistencies are acceptable as long as the system eventually reaches a consistent state. However, this requires complex orchestration and careful design to handle all possible failure scenarios.
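
A minimal orchestration sketch; the step functions and the simulated payment failure are hypothetical, and real sagas would call remote services and persist their progress.

```python
# Saga sketch: each step has a compensating action; on failure the saga runs
# the compensations for the steps that already succeeded, in reverse order.
def reserve_inventory(order): print("inventory reserved")
def release_inventory(order): print("inventory released (compensation)")
def charge_payment(order):    raise RuntimeError("card declined")
def refund_payment(order):    print("payment refunded (compensation)")

SAGA = [
    (reserve_inventory, release_inventory),
    (charge_payment, refund_payment),
]

def run_saga(order):
    completed = []
    try:
        for action, compensation in SAGA:
            action(order)
            completed.append(compensation)
    except Exception as exc:
        print("step failed:", exc)
        for compensation in reversed(completed):
            compensation(order)  # undo earlier steps; system converges eventually

run_saga({"id": 1})
```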

CDC Tools

Debezium

Debezium is an open-source CDC platform built on Kafka Connect, providing source connectors for multiple databases including MySQL, PostgreSQL, MongoDB, and SQL Server. It performs log-based CDC by reading each database's transaction log directly. The connectors are fault-tolerant and handle failures gracefully, making Debezium suitable for production environments where reliability is critical.
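
A sketch of registering a Debezium PostgreSQL connector by POSTing its configuration to the Kafka Connect REST API. The hostnames, credentials, and table list are placeholders, and the property names follow Debezium 2.x conventions.

```python
# Register a Debezium PostgreSQL source connector with Kafka Connect.
import json
import urllib.request

connector = {
    "name": "inventory-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "db.example.internal",
        "database.port": "5432",
        "database.user": "cdc_reader",
        "database.password": "secret",
        "database.dbname": "inventory",
        "topic.prefix": "inventory",          # Debezium 2.x topic naming
        "table.include.list": "public.orders,public.customers",
    },
}

req = urllib.request.Request(
    "http://connect.example.internal:8083/connectors",
    data=json.dumps(connector).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)  # 201 once Kafka Connect accepts the connector
```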

Apache Kafka Connect

Apache Kafka Connect provides source connectors for extracting data from databases and sink connectors for loading data into target systems. It offers a scalable, distributed connector framework that can handle high-volume data processing. The system is pluggable with an extensible connector ecosystem, allowing you to add custom connectors for specialized use cases.
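
On the sink side, loading captured changes into a target system is also just a connector configuration. The sketch below shows an illustrative configuration for Confluent's Elasticsearch sink connector, registered through the same REST API as above; the property names and values are assumptions for illustration.

```python
# Illustrative sink-side Kafka Connect configuration (Elasticsearch sink).
elasticsearch_sink = {
    "name": "orders-search-sink",
    "config": {
        "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "topics": "inventory.public.orders",   # topic produced by the CDC source
        "connection.url": "http://search.example.internal:9200",
        "key.ignore": "false",                 # use the record key as the document id
        "schema.ignore": "true",
    },
}
```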

AWS DMS (Database Migration Service)

AWS DMS is a managed service that provides AWS-hosted CDC solutions for real-time replication. It offers continuous data replication with built-in schema conversion capabilities to transform data between different database types. The service includes comprehensive monitoring and alerting features to help you track the health and performance of your CDC processes.
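
A hedged boto3 sketch of creating an ongoing-replication task; the ARNs, region, and table mapping are placeholders, and a replication instance plus source and target endpoints must already exist.

```python
# Create a full-load-plus-CDC replication task with AWS DMS.
import json
import boto3

dms = boto3.client("dms", region_name="us-east-1")

table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-orders",
        "object-locator": {"schema-name": "public", "table-name": "orders"},
        "rule-action": "include",
    }]
}

task = dms.create_replication_task(
    ReplicationTaskIdentifier="orders-cdc",
    SourceEndpointArn="arn:aws:dms:...:endpoint:SOURCE",    # placeholder
    TargetEndpointArn="arn:aws:dms:...:endpoint:TARGET",    # placeholder
    ReplicationInstanceArn="arn:aws:dms:...:rep:INSTANCE",  # placeholder
    MigrationType="full-load-and-cdc",  # initial snapshot plus ongoing changes
    TableMappings=json.dumps(table_mappings),
)
print(task["ReplicationTask"]["Status"])
```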

MongoDB Change Streams

MongoDB Change Streams provide native CDC capabilities built directly into MongoDB. They let you watch a collection (or an entire database or deployment) for changes in real time, providing immediate notification when data is modified. Streams are resumable via resume tokens, so an application can pick up where it left off after a restart, and they accept aggregation pipelines to filter for only the change types that interest you.
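
A PyMongo sketch of consuming a change stream; the connection string and namespace are placeholders, and change streams require a replica set or sharded cluster.

```python
# Consume a MongoDB change stream and keep its resume token.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
orders = client.shop.orders

pipeline = [{"$match": {"operationType": {"$in": ["insert", "update"]}}}]  # optional filter
resume_token = None  # persist this somewhere durable in a real consumer

with orders.watch(pipeline, full_document="updateLookup",
                  resume_after=resume_token) as stream:
    for change in stream:
        print(change["operationType"], change["documentKey"])
        resume_token = stream.resume_token  # save to resume from this point later
```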

Challenges and Solutions

Schema Evolution

  • Problem: Database schema changes break downstream consumers
  • Solution: Schema registry and compatibility rules
  • Versioning: Maintain multiple schema versions
  • Migration: Gradual rollout of schema changes

Large Transactions

  • Problem: Large transactions can overwhelm message systems
  • Solution: Batch processing and message splitting
  • Buffering: Temporary storage for large changesets
  • Rate limiting: Control message production rate

Initial State Synchronization

  • Problem: Need complete data snapshot plus ongoing changes
  • Solution: Snapshot + CDC combination
  • Consistency: Ensure snapshot and CDC align properly
  • Performance: Minimize impact on source system

Ordering and Dependencies

  • Problem: Changes must be applied in correct order
  • Solution: Partition by entity ID or use a single partition (see the keyed-producer sketch after this list)
  • Dependencies: Handle foreign key relationships
  • Constraints: Ensure referential integrity
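
A sketch of the partition-by-entity-ID solution using kafka-python; the broker address, topic name, and event shape are placeholders.

```python
# Preserve per-entity ordering by keying each change event with the entity's
# primary key, so all changes for one entity land in the same partition.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka.example.internal:9092",
    key_serializer=lambda k: k.encode(),
    value_serializer=lambda v: json.dumps(v).encode(),
)

def publish_change(change):
    # Same key -> same partition -> changes for order 42 are consumed in order.
    producer.send("orders.changes", key=str(change["order_id"]), value=change)

publish_change({"order_id": 42, "op": "INSERT", "status": "NEW"})
publish_change({"order_id": 42, "op": "UPDATE", "status": "PAID"})
producer.flush()
```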

Use Cases

Real-Time Analytics

  • Stream to data warehouse: Keep analytics up-to-date
  • Dashboards: Real-time business metrics
  • Alerting: Immediate notification of changes
  • Machine learning: Fresh data for models

Microservices Data Sync

  • Service boundaries: Keep related data in sync
  • Eventual consistency: Accept temporary inconsistencies
  • Event-driven: React to changes in other services
  • Decoupling: Reduce direct service dependencies

Cache Invalidation

  • Cache consistency: Invalidate caches when data changes
  • Performance: Maintain fast read access
  • Automatic: No manual cache management
  • Selective: Invalidate only affected cache entries

Search Index Updates

  • Elasticsearch sync: Keep search indexes current
  • Document updates: Reflect database changes in search
  • Real-time search: Immediate searchability of new data
  • Incremental updates: Avoid full reindexing

Best Practices

Effective CDC implementation requires careful planning and attention to several key areas:

  • Choose the appropriate method: match the approach to your requirements; log-based CDC has minimal impact on the source system
  • Handle schema changes: plan for database evolution using schema registries and compatibility rules
  • Monitor lag: track replication delay so downstream systems stay current
  • Implement robust error handling: use retry mechanisms to ride out temporary failures
  • Secure access: protect transaction logs from unauthorized access to sensitive data
  • Test thoroughly: verify that the CDC pipeline behaves correctly under a variety of conditions
  • Document data flow: record dependencies to help with troubleshooting and maintenance

CDC enables building responsive, event-driven architectures but requires careful planning for schema evolution and error handling.

Related Concepts

message-brokers
replication-strategies
event-sourcing

Used By

debezium
confluent
mongodb
mysql
postgresql