Imagine this scenario: your analytics dashboard shows yesterday’s sales figures, your recommendation engine serves product suggestions based on last week’s clicks, and your fraud detection system flags a suspicious transaction four hours after the money has already moved. Welcome to the painful reality of batch ETL. For decades, the standard way to move data between systems was to run scheduled jobs at midnight, extract everything, transform it, and load it into the warehouse by breakfast. That worked fine when “data” meant monthly financial reports. It does not work when your microservices need to stay in sync, your search index must reflect inventory changes instantly, and your customers expect real-time personalization.
Change Data Capture, or CDC, flips the model. Instead of asking the database “what changed since yesterday?”, CDC taps directly into the database’s transaction log and streams every insert, update, and delete as it happens. Combine that with Apache Kafka as a durable event bus and Debezium as the connector that reads those logs, and you suddenly have a real-time nervous system for your entire data stack. This guide walks through CDC from first principles to production-grade Debezium deployments, including full Postgres and MySQL examples, schema evolution strategies, the outbox pattern, and the operational pitfalls nobody warns you about.
Why CDC Matters
Before we dive into Debezium specifics, it’s worth understanding what problem CDC actually solves. Three forces pushed the industry toward log-based change capture, and each one corresponds to a category of pain you may already be feeling.
The Latency Tax of Batch ETL
Traditional ETL pipelines run on schedules. A nightly job queries a source database with something like SELECT * FROM orders WHERE updated_at > :last_run, dumps the results to a file, transforms them, and loads them into the warehouse. This approach has three problems: it is slow (data is stale between runs), it is expensive (full scans of large tables hammer your primary), and it misses deletes entirely unless you add soft-delete columns or complicated reconciliation logic. If a row is deleted between two ETL runs, the warehouse never knows it existed. You end up with subtle data quality bugs that take weeks to track down.
The Dual-Write Problem
In a microservices world, a single business event often needs to update multiple systems. An order is placed, so you must save it to Postgres, publish an event to Kafka, update a cache, and send a notification. The naive solution writes to each system sequentially inside the application code. But what happens if the database write succeeds and the Kafka publish fails? You have an order in your database that no other service knows about. Retry logic helps, but now consumers might see duplicate events. This is the classic dual-write problem, and it has no clean solution at the application layer. CDC solves it by making the database the single source of truth: write once to Postgres, and Debezium guarantees the event gets to Kafka.
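The failure mode is easy to demonstrate with a toy sketch (all names here are hypothetical stand-ins, not a real producer API): when the publish step fails after the database commit, the two stores diverge, and no amount of local error handling fully repairs it.

```python
# Toy simulation of the dual-write problem (hypothetical stores).
class FlakyBus:
    """Stands in for a message broker whose publish can fail."""
    def __init__(self, fail=False):
        self.events, self.fail = [], fail

    def publish(self, event):
        if self.fail:
            raise ConnectionError("broker unreachable")
        self.events.append(event)

def place_order_naive(db, bus, order):
    db.append(order)    # write 1: database commit succeeds
    bus.publish(order)  # write 2: may fail independently of write 1

db, bus = [], FlakyBus(fail=True)
try:
    place_order_naive(db, bus, {"order_id": 1})
except ConnectionError:
    pass  # retrying here risks duplicates if the failure was transient

# The database has the order, but no event was ever published:
assert db == [{"order_id": 1}] and bus.events == []
```

Swapping the two writes just inverts the problem: a published event for an order that never committed.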
Keeping Microservices in Sync
When you split a monolith into services, each service owns its own data. But services still need information from each other. The order service needs product details from the catalog service. The shipping service needs addresses from the customer service. You can make synchronous REST calls, but that creates tight coupling and cascading failures. The better pattern is eventual consistency via events: the catalog service publishes product change events, and every other service maintains its own read model. CDC automates the publishing half of this pattern without requiring the catalog service to explicitly emit events.
How CDC Works Under the Hood
Every serious relational database writes a transaction log before it modifies the actual table files. This log goes by different names depending on the vendor. MySQL calls it the binary log or binlog. Postgres calls it the Write-Ahead Log or WAL. MongoDB has the oplog. SQL Server has the transaction log. Oracle has redo logs. The purpose is the same: if the database crashes halfway through a transaction, the log lets it recover by replaying or rolling back operations.
CDC tools piggyback on this infrastructure. They connect to the database using the same replication protocol a read replica would use, stream the log entries, parse them into row-level change events, and forward those events somewhere useful. Because the log is written synchronously as part of every transaction, nothing can slip past a CDC tool. Every insert, update, and delete shows up, in the order the database applied them, with before-and-after row values (how complete the before-image is depends on configuration, as we will see with REPLICA IDENTITY).
The key insight is that CDC is non-invasive from the database’s perspective. You are not adding triggers that fire on every write. You are not running queries that scan tables. You are reading a log that the database is writing anyway for its own recovery and replication purposes. The overhead is minimal because the work was already being done.
Log-Based vs Trigger-Based vs Query-Based CDC
There are three general approaches to capturing changes from a database, and understanding why log-based won is helpful context for everything that follows.
| Approach | How It Works | Pros | Cons |
|---|---|---|---|
| Query-based | Poll tables with WHERE updated_at > :cursor | Simple, no DB privileges needed | Misses deletes, high load, latency |
| Trigger-based | Database triggers write change records to an audit table | Captures all changes including deletes | Adds write overhead to every transaction, schema changes break triggers |
| Log-based | Read the transaction log directly | Low overhead, captures everything, preserves order | Requires DB configuration and privileges |
Query-based CDC is what the Kafka Connect JDBC source and Airbyte’s incremental sync mode do by default. It works, but it has fundamental limitations. Deletes are invisible unless you add a soft-delete column. Intermediate states are lost when a row changes several times between polls; you only ever see the latest version. And running SELECT * FROM big_table WHERE updated_at > ? every minute is punishing for the source database.
Trigger-based CDC was the dominant approach in the 2000s. You would write database triggers that copied changed rows into a shadow table, then an ETL job would drain the shadow table. It works, but the triggers add synchronous overhead to every write, they live inside the database schema (so they must be maintained alongside application migrations), and they can fail in ways that are hard to diagnose.
Log-based CDC is the modern standard because it has none of these drawbacks. The database is already writing the log. You are just reading it. Debezium, GoldenGate, AWS DMS, and most other professional CDC tools all use the log-based approach.
Debezium Architecture
Debezium is an open-source project originally created at Red Hat. It is not a standalone application but a set of source connectors that run inside Kafka Connect. If you have not worked with Kafka Connect before, think of it as a distributed framework specifically designed for moving data between Kafka and external systems. It handles the boring operational concerns (offset tracking, failure recovery, REST API, distributed workers) and lets connector developers focus on the protocol-specific logic for each source or sink.
A typical Debezium deployment has these components:
- Kafka cluster — durable event storage. See our guide to building a Kafka producer pipeline for the fundamentals of topic design and partitioning.
- Kafka Connect cluster — one or more worker processes running the Debezium connector JARs.
- Schema Registry (typically Confluent Schema Registry) — stores Avro or JSON Schema definitions for change events, enabling schema evolution.
- Source database — configured for logical replication with a dedicated CDC user.
- Downstream consumers — Flink jobs, ksqlDB queries, microservices, sink connectors to warehouses or search engines.
Debezium provides connectors for Postgres, MySQL, MongoDB, SQL Server, Oracle, Db2, Cassandra, Vitess, and Spanner. Each one translates the vendor-specific log format into a common event structure, so downstream consumers can treat events uniformly regardless of which database produced them.
Complete Postgres Setup Walkthrough
Let’s set up CDC from a Postgres database to Kafka end-to-end. I will use Docker Compose for the infrastructure because it is the fastest way to have a working cluster on your laptop. If containers are new to you, our Docker primer for development and production covers the basics.
Infrastructure with Docker Compose
# docker-compose.yml
version: '3.8'
services:
postgres:
image: postgres:15
environment:
POSTGRES_USER: postgres
POSTGRES_PASSWORD: postgres
POSTGRES_DB: inventory
command:
- "postgres"
- "-c"
- "wal_level=logical"
- "-c"
- "max_wal_senders=10"
- "-c"
- "max_replication_slots=10"
ports:
- "5432:5432"
volumes:
- ./init.sql:/docker-entrypoint-initdb.d/init.sql
zookeeper:
image: confluentinc/cp-zookeeper:7.5.0
environment:
ZOOKEEPER_CLIENT_PORT: 2181
kafka:
image: confluentinc/cp-kafka:7.5.0
depends_on: [zookeeper]
ports:
- "9092:9092"
environment:
KAFKA_BROKER_ID: 1
KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
schema-registry:
image: confluentinc/cp-schema-registry:7.5.0
depends_on: [kafka]
ports:
- "8081:8081"
environment:
SCHEMA_REGISTRY_HOST_NAME: schema-registry
SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: kafka:29092
connect:
image: debezium/connect:2.5
depends_on: [kafka, schema-registry]
ports:
- "8083:8083"
environment:
BOOTSTRAP_SERVERS: kafka:29092
GROUP_ID: connect-cluster
CONFIG_STORAGE_TOPIC: connect_configs
OFFSET_STORAGE_TOPIC: connect_offsets
STATUS_STORAGE_TOPIC: connect_statuses
KEY_CONVERTER: io.confluent.connect.avro.AvroConverter
VALUE_CONVERTER: io.confluent.connect.avro.AvroConverter
CONNECT_KEY_CONVERTER_SCHEMA_REGISTRY_URL: http://schema-registry:8081
CONNECT_VALUE_CONVERTER_SCHEMA_REGISTRY_URL: http://schema-registry:8081
The critical Postgres flag is wal_level=logical: without it, Debezium cannot decode individual row changes and would only see opaque binary blocks meant for physical replication. max_wal_senders and max_replication_slots already default to 10 in modern Postgres (version 10 and later), but setting them explicitly documents the intent and guards against restrictive base images.
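A preflight check for these settings can save a confusing connector failure later. This is a sketch under an assumption: `check_cdc_ready` is a hypothetical helper, and you would feed it values read from the server (e.g. via `SHOW wal_level` and friends), not a real Debezium utility.

```python
def check_cdc_ready(settings: dict) -> list[str]:
    """Return a list of problems with Postgres settings for logical CDC.

    `settings` is assumed to hold values as a SHOW query reports them,
    e.g. {"wal_level": "logical", "max_replication_slots": "10", ...}.
    """
    problems = []
    if settings.get("wal_level") != "logical":
        problems.append("wal_level must be 'logical'")
    if int(settings.get("max_replication_slots", 0)) < 1:
        problems.append("max_replication_slots must be >= 1")
    if int(settings.get("max_wal_senders", 0)) < 1:
        problems.append("max_wal_senders must be >= 1")
    return problems

# A correctly configured server produces no complaints:
assert check_cdc_ready({"wal_level": "logical",
                        "max_wal_senders": "10",
                        "max_replication_slots": "10"}) == []
# The default wal_level=replica is the most common misconfiguration:
assert "wal_level must be 'logical'" in check_cdc_ready({"wal_level": "replica"})
```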
Preparing the Database
-- init.sql: runs on first container start
CREATE SCHEMA inventory;
-- A dedicated replication user with minimal privileges
CREATE ROLE debezium WITH REPLICATION LOGIN PASSWORD 'dbz_secret';
GRANT CONNECT ON DATABASE inventory TO debezium;
GRANT USAGE ON SCHEMA inventory TO debezium;
GRANT SELECT ON ALL TABLES IN SCHEMA inventory TO debezium;
ALTER DEFAULT PRIVILEGES IN SCHEMA inventory
GRANT SELECT ON TABLES TO debezium;
-- Sample tables
CREATE TABLE inventory.customers (
id SERIAL PRIMARY KEY,
email TEXT UNIQUE NOT NULL,
full_name TEXT NOT NULL,
created_at TIMESTAMPTZ DEFAULT now()
);
CREATE TABLE inventory.orders (
id BIGSERIAL PRIMARY KEY,
customer_id INT REFERENCES inventory.customers(id),
total_cents BIGINT NOT NULL,
status TEXT NOT NULL DEFAULT 'pending',
updated_at TIMESTAMPTZ DEFAULT now()
);
-- Publication tells Postgres which tables to stream
CREATE PUBLICATION dbz_publication
FOR TABLE inventory.customers, inventory.orders;
-- REPLICA IDENTITY FULL ensures UPDATE/DELETE events include
-- the complete before-image, not just the primary key
ALTER TABLE inventory.customers REPLICA IDENTITY FULL;
ALTER TABLE inventory.orders REPLICA IDENTITY FULL;
Two things here deserve extra attention. First, the debezium role has REPLICATION privilege, which is required to attach to a replication slot. Second, REPLICA IDENTITY FULL tells Postgres to include every column’s previous value in the WAL when a row is updated or deleted. Without it, UPDATE events only carry the new values plus the primary key, which is often insufficient for downstream processing. The tradeoff is slightly larger WAL files.
Registering the Postgres Connector
With the infrastructure running, register the connector by POSTing its configuration to the Kafka Connect REST API:
curl -X POST http://localhost:8083/connectors \
-H "Content-Type: application/json" \
-d '{
"name": "inventory-postgres-connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "postgres",
"database.port": "5432",
"database.user": "debezium",
"database.password": "dbz_secret",
"database.dbname": "inventory",
"topic.prefix": "inv",
"plugin.name": "pgoutput",
"publication.name": "dbz_publication",
"slot.name": "debezium_slot",
"schema.include.list": "inventory",
"table.include.list": "inventory.customers,inventory.orders",
"snapshot.mode": "initial",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter.schema.registry.url": "http://schema-registry:8081",
"value.converter.schema.registry.url": "http://schema-registry:8081",
"transforms": "unwrap",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"transforms.unwrap.drop.tombstones": "false",
"transforms.unwrap.delete.handling.mode": "rewrite"
}
}'
A few parameters deserve explanation. The plugin.name is set to pgoutput, which is Postgres’s built-in logical decoding plugin (available since Postgres 10). The alternative is wal2json, which is a third-party extension. Use pgoutput unless you have a specific reason not to. The topic.prefix becomes the first part of every topic name, so events from inventory.customers will land in the topic inv.inventory.customers. The snapshot.mode set to initial means the connector will perform a consistent snapshot of existing data on first startup, then switch to streaming mode. The Single Message Transform (SMT) at the end unwraps the Debezium envelope to emit just the new row state, which is easier for downstream consumers that do not need the full change event metadata.
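The topic-naming rule is mechanical enough to capture in a one-line helper, which is handy when consumers need to subscribe programmatically. `debezium_topic` is a hypothetical convenience function, not part of any Debezium client library; it simply encodes the default `<topic.prefix>.<schema>.<table>` convention described above.

```python
def debezium_topic(prefix: str, schema: str, table: str) -> str:
    """Default Debezium topic name: <topic.prefix>.<schema>.<table>."""
    return f"{prefix}.{schema}.{table}"

# Matches the example from the connector config above:
assert debezium_topic("inv", "inventory", "customers") == "inv.inventory.customers"
```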
Verify the connector is running:
curl http://localhost:8083/connectors/inventory-postgres-connector/status | jq
# Expected output:
# {
# "name": "inventory-postgres-connector",
# "connector": {"state": "RUNNING", "worker_id": "..."},
# "tasks": [{"id": 0, "state": "RUNNING"}],
# "type": "source"
# }
MySQL Connector Configuration
MySQL follows the same pattern but with different prerequisites. You need binary logging enabled with binlog_format=ROW and binlog_row_image=FULL, and the CDC user needs REPLICATION SLAVE and REPLICATION CLIENT privileges.
-- MySQL preparation
CREATE USER 'debezium'@'%' IDENTIFIED BY 'dbz_secret';
GRANT SELECT, RELOAD, SHOW DATABASES,
REPLICATION SLAVE, REPLICATION CLIENT
ON *.* TO 'debezium'@'%';
FLUSH PRIVILEGES;
And the connector registration:
curl -X POST http://localhost:8083/connectors \
-H "Content-Type: application/json" \
-d '{
"name": "inventory-mysql-connector",
"config": {
"connector.class": "io.debezium.connector.mysql.MySqlConnector",
"database.hostname": "mysql",
"database.port": "3306",
"database.user": "debezium",
"database.password": "dbz_secret",
"database.server.id": "184054",
"topic.prefix": "inv_mysql",
"database.include.list": "inventory",
"table.include.list": "inventory.customers,inventory.orders",
"schema.history.internal.kafka.bootstrap.servers": "kafka:29092",
"schema.history.internal.kafka.topic": "schema-history.inventory",
"include.schema.changes": "true",
"snapshot.mode": "initial"
}
}'
The database.server.id must be unique across everything that reads the MySQL binlog, including replica servers. Pick any number that is not already in use. The schema.history.internal.kafka.topic is a Debezium-specific concept: because MySQL DDL statements are replicated through the binlog, Debezium maintains its own history of schema changes to correctly parse events for historical rows. You do not need this for Postgres because the pgoutput plugin sends fully-resolved column information with every event.
The Anatomy of a Debezium Event
Every Debezium event follows the same envelope structure regardless of database. Understanding this structure is essential because downstream consumers will process it, and mistakes at this layer cause subtle bugs that only appear during updates or deletes.
A concrete example. Suppose a customer with id=7 updates their email from alice@old.com to alice@new.com. The resulting Debezium event (JSON format, without the full schema envelope) looks like this:
{
"before": {
"id": 7,
"email": "alice@old.com",
"full_name": "Alice Johnson",
"created_at": "2024-01-15T09:23:11.000Z"
},
"after": {
"id": 7,
"email": "alice@new.com",
"full_name": "Alice Johnson",
"created_at": "2024-01-15T09:23:11.000Z"
},
"source": {
"version": "2.5.0.Final",
"connector": "postgresql",
"name": "inv",
"ts_ms": 1714212031000,
"snapshot": "false",
"db": "inventory",
"schema": "inventory",
"table": "customers",
"txId": 48291,
"lsn": 34298192,
"xmin": null
},
"op": "u",
"ts_ms": 1714212031142,
"transaction": null
}
Notice that consumers can detect exactly what changed by diffing before and after. They can also use the source.lsn or source.ts_ms to establish causal ordering across tables, which matters when you are maintaining a read model that depends on joins.
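Diffing the two images takes only a few lines of Python. This sketch assumes the full envelope is present (`changed_fields` is a hypothetical helper, not a Debezium API):

```python
def changed_fields(event: dict) -> dict:
    """Return {column: (old, new)} for every column whose value differs
    between the before and after images of a change event."""
    before, after = event.get("before") or {}, event.get("after") or {}
    return {
        col: (before.get(col), after.get(col))
        for col in set(before) | set(after)
        if before.get(col) != after.get(col)
    }

# With the update event above, only the email differs:
event = {
    "op": "u",
    "before": {"id": 7, "email": "alice@old.com", "full_name": "Alice Johnson"},
    "after":  {"id": 7, "email": "alice@new.com", "full_name": "Alice Johnson"},
}
assert changed_fields(event) == {"email": ("alice@old.com", "alice@new.com")}
```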
Here is a minimal Python consumer that processes these events. Note that it assumes the raw Debezium envelope: if you registered the connector with the ExtractNewRecordState unwrap transform shown earlier, messages arrive as flattened row images (plus a __deleted marker in rewrite mode) and the op/before/after fields will not be present. For a deeper dive into consumer patterns, see our Kafka consumer implementation guide.
from confluent_kafka import Consumer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroDeserializer
from confluent_kafka.serialization import SerializationContext, MessageField
sr_client = SchemaRegistryClient({"url": "http://localhost:8081"})
value_deser = AvroDeserializer(sr_client)
consumer = Consumer({
"bootstrap.servers": "localhost:9092",
"group.id": "customer-sync-service",
"auto.offset.reset": "earliest",
"enable.auto.commit": False,
})
consumer.subscribe(["inv.inventory.customers"])
try:
while True:
msg = consumer.poll(1.0)
if msg is None:
continue
if msg.error():
print(f"Consumer error: {msg.error()}")
continue
event = value_deser(
msg.value(),
SerializationContext(msg.topic(), MessageField.VALUE),
)
op = event["op"]
if op == "c":
insert_into_read_model(event["after"])
elif op == "u":
handle_update(event["before"], event["after"])
elif op == "d":
delete_from_read_model(event["before"])
elif op == "r":
# "r" = snapshot read; treat as upsert
upsert_read_model(event["after"])
consumer.commit(message=msg, asynchronous=False)
finally:
consumer.close()
Handling Schema Evolution
Production databases are not static. Columns get added, renamed, dropped, and retyped. A CDC pipeline that cannot handle schema evolution will break the first time a developer runs a migration. Debezium handles schema changes gracefully, but you need to understand the rules of the game.
When you add a nullable column, everything just works. Debezium notices the new column in the next log event, updates the schema in the Schema Registry (which validates compatibility), and consumers pick up the change. If the new column is non-nullable without a default, older events in the topic will not have a value for it, and compatibility rules will reject the schema update. The fix is to always add columns as nullable first, backfill values, then tighten constraints in a later migration.
Renaming a column is harder. From Debezium’s perspective, a rename looks like a drop followed by an add of a new column with the same values. Consumers that were using the old name will suddenly see nulls. The safest path for renames is a three-step dance: add the new column, update application code to write both old and new, migrate consumers, then drop the old column once nothing depends on it.
Schema Registry compatibility modes matter here. The default BACKWARD compatibility means new schemas can be used to read old data. That is what you want for consumers. If you need producers to also tolerate schema changes, use FULL compatibility, which requires both forward and backward compatibility. For CDC pipelines, BACKWARD is usually the right choice.
Common CDC Patterns
Now that you have a working Debezium pipeline, what do you actually do with the events? Here are the four patterns I see most often in production.
CDC to Data Warehouse
The classic use case. Instead of nightly batch loads, you stream database changes into Snowflake, BigQuery, or Redshift continuously. Your BI dashboards are never more than a few seconds behind production. The simplest implementation uses a Kafka sink connector: Confluent provides sink connectors for Snowflake and BigQuery, and the S3 sink connector is popular for landing events in a data lake where engines like Apache Iceberg can make them queryable. Our InfluxDB to Iceberg pipeline guide walks through a similar architecture.
The tricky part is reconstructing the current state from change events. A sink connector appends every event as a row, so a single customer with 100 updates becomes 100 rows in the warehouse. You typically resolve this with a MERGE statement that upserts into a “current state” table, or you use a tool like dbt to materialize the latest snapshot on a schedule. dbt’s snapshot feature handles this elegantly.
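If you perform the compaction yourself rather than in SQL, the reduction is a fold over events ordered by log position. This sketch assumes the full Debezium envelope, a single-column primary key named `id`, and Postgres-style `source.lsn` for ordering (all assumptions, not a warehouse API):

```python
def latest_state(events):
    """Collapse a stream of Debezium change events into current state,
    keyed by primary key. Events may arrive in any order; source.lsn
    restores the order the database applied them in."""
    state = {}
    for ev in sorted(events, key=lambda e: e["source"]["lsn"]):
        key = (ev["after"] or ev["before"])["id"]
        if ev["op"] == "d":
            state.pop(key, None)   # a delete removes the row entirely
        else:                      # "c", "u", and snapshot "r" all upsert
            state[key] = ev["after"]
    return state

events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "pending"},
     "source": {"lsn": 100}},
    {"op": "u", "before": {"id": 1}, "after": {"id": 1, "status": "shipped"},
     "source": {"lsn": 200}},
]
# A customer with two events becomes one current-state row:
assert latest_state(events) == {1: {"id": 1, "status": "shipped"}}
```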
CDC to Search Index
Keeping an Elasticsearch or OpenSearch index in sync with your primary database is a classic dual-write problem, and CDC solves it. A sink connector (or a custom consumer) reads change events from Kafka and indexes them into Elasticsearch, handling creates, updates, and deletes. New products appear in search results within seconds of being created in the primary catalog. For complex event-time logic that joins CDC streams with other data, consider running Flink complex event processing between Kafka and the search backend.
Microservice Event Sourcing
In event-sourced microservices, each service publishes domain events that other services consume. CDC automates the publishing step: you write your changes to your database as usual, and Debezium emits the corresponding events to Kafka. Consumer services maintain local read models optimized for their queries. The catalog service owns product data, but the order service keeps a denormalized copy so it can render order summaries without cross-service calls.
Cache Invalidation
Cache invalidation is famously hard because you must update the cache whenever the underlying data changes. CDC makes it trivial: a tiny consumer listens for change events and deletes (or refreshes) the corresponding cache keys. No more stale cache bugs from developers forgetting to invalidate after updates.
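The invalidation consumer really is tiny. A sketch under two assumptions: the cache is shown as a plain dict standing in for Redis or memcached, and the `table:id` key scheme is a hypothetical convention your application would define.

```python
def invalidate(cache: dict, event: dict) -> None:
    """Evict the cache entry for whichever row an event touched.
    Works for creates, updates, and deletes alike: any change means
    the cached copy can no longer be trusted."""
    row = event["after"] or event["before"]
    key = f"{event['source']['table']}:{row['id']}"
    cache.pop(key, None)

cache = {"customers:7": {"email": "alice@old.com"}}
invalidate(cache, {
    "op": "u",
    "before": {"id": 7},
    "after": {"id": 7, "email": "alice@new.com"},
    "source": {"table": "customers"},
})
assert "customers:7" not in cache
```

The next read misses the cache and repopulates it from the database, so the cache converges without any application code remembering to invalidate.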
The Outbox Pattern
CDC solves the dual-write problem for simple cases, but what if you need to publish domain events that are not just mirrors of database rows? For example, an OrderPlaced event might include computed fields, references to other aggregates, or data that does not live in any single table. Publishing a straight row-change event from the orders table loses that richness.
The outbox pattern solves this. Instead of publishing directly to Kafka from your application code, you write the event to an outbox table in the same transaction as your business data. Debezium captures the outbox inserts and publishes them to Kafka. You get transactional guarantees (the event is published if and only if the business data is committed) without any of the dual-write hazards.
CREATE TABLE outbox (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
aggregate_type TEXT NOT NULL,
aggregate_id TEXT NOT NULL,
event_type TEXT NOT NULL,
payload JSONB NOT NULL,
created_at TIMESTAMPTZ DEFAULT now()
);
ALTER TABLE outbox REPLICA IDENTITY FULL;
ALTER PUBLICATION dbz_publication ADD TABLE outbox;
In application code (here using FastAPI and SQLAlchemy; see our FastAPI REST API guide for the full stack):
async def place_order(session, customer_id: int, items: list[dict]):
async with session.begin():
order = Order(customer_id=customer_id, status="pending")
session.add(order)
await session.flush() # assigns order.id
for item in items:
session.add(OrderItem(order_id=order.id, **item))
# Outbox event in the SAME transaction
session.add(Outbox(
aggregate_type="order",
aggregate_id=str(order.id),
event_type="OrderPlaced",
payload={
"order_id": order.id,
"customer_id": customer_id,
"total_cents": sum(i["price_cents"] * i["quantity"] for i in items),
"items": items,
},
))
return order
Debezium’s EventRouter SMT can then route these outbox events to topics based on the aggregate_type column, extract the payload, and use aggregate_id as the Kafka message key for partitioning. Configuration:
"transforms": "outbox",
"transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
"transforms.outbox.route.by.field": "aggregate_type",
"transforms.outbox.route.topic.replacement": "events.${routedByValue}",
"transforms.outbox.table.field.event.key": "aggregate_id",
"transforms.outbox.table.field.event.payload": "payload"
To keep the outbox table from growing forever, run a periodic cleanup job that deletes rows older than your Kafka topic retention. Because consumers read from Kafka, not from the outbox, old rows are safe to remove.
Snapshots and Backfills
A question that comes up immediately in any real deployment: how does Debezium handle the data that existed before CDC was turned on? The answer is snapshots.
When you first start a connector with snapshot.mode=initial, Debezium takes a consistent snapshot by opening a transaction, reading every row from the included tables, and emitting them as events with op=r (for “read”). Once the snapshot completes, it switches to streaming mode and picks up from the log position it recorded at snapshot start. The result is a complete event stream covering both historical and new data, with no gaps or duplicates.
The problem with the initial snapshot mode is that it reads every row in a single long-running transaction. For a 500 GB table, this can take hours and hold replication slot state for the entire duration, causing WAL buildup on the source. Newer Debezium versions (1.6+) support incremental snapshots, which chunk the snapshot into small windows that run concurrently with log streaming. You can even trigger ad-hoc snapshots for specific tables by inserting into a signal table:
-- Create the signal table
CREATE TABLE debezium_signal (
id VARCHAR(42) PRIMARY KEY,
type VARCHAR(32) NOT NULL,
data VARCHAR(2048) NULL
);
-- In connector config:
-- "signal.data.collection": "inventory.debezium_signal",
-- "incremental.snapshot.chunk.size": "1024"
-- Trigger an incremental snapshot for a specific table
INSERT INTO debezium_signal (id, type, data) VALUES (
'snapshot-orders-2024-04',
'execute-snapshot',
'{"data-collections": ["inventory.orders"], "type": "incremental"}'
);
Incremental snapshots are the right choice for large tables or for re-snapshotting after schema changes. They hold no long transactions, can be paused and resumed, and do not block the log streaming pipeline.
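Because the signal's `data` column is just JSON, teams often generate the signal row from automation rather than hand-writing the INSERT. A sketch: `snapshot_signal` is a hypothetical helper that builds the three column values matching the execute-snapshot format shown above.

```python
import json

def snapshot_signal(signal_id: str, tables: list[str]) -> tuple[str, str, str]:
    """Build (id, type, data) values for a Debezium execute-snapshot
    signal row, ready to bind into an INSERT on the signal table."""
    data = json.dumps({"data-collections": tables, "type": "incremental"})
    return signal_id, "execute-snapshot", data

sig = snapshot_signal("snapshot-orders-2024-04", ["inventory.orders"])
assert sig[1] == "execute-snapshot"
assert json.loads(sig[2])["data-collections"] == ["inventory.orders"]
```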
Operational Concerns
Running Debezium in production means caring about a handful of operational details that do not matter in development. Here are the ones that have bitten teams I have worked with.
Replication Slot Buildup
This is the single most common production incident. In Postgres, a replication slot tells the server to retain WAL files until the consumer (Debezium) has acknowledged them. If the Debezium connector stops consuming, WAL accumulates on the primary. WAL files are stored on the primary’s data volume. If the volume fills, the database stops accepting writes. Outage.
The mitigations are layered. First, monitor the lag of every replication slot with a query like SELECT slot_name, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag FROM pg_replication_slots. Alert if lag exceeds a threshold (say, 10 GB). Second, configure max_slot_wal_keep_size in Postgres 13+ to cap how much WAL can be retained before the slot is invalidated. An invalidated slot requires re-snapshotting but is preferable to a full disk. Third, treat Debezium as a production-critical service: page on connector failures, run it with redundancy, and practice recovery drills.
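The alerting half of that first mitigation is simple enough to sketch. In production you would feed this from the pg_replication_slots query above (via pg_wal_lsn_diff); here the function takes pre-computed `(slot_name, lag_bytes)` pairs so the logic stands alone, and `slots_over_threshold` is a hypothetical name.

```python
def slots_over_threshold(slots, threshold_bytes=10 * 2**30):
    """Return the names of replication slots retaining more WAL than
    the threshold (default 10 GB). `slots` is an iterable of
    (slot_name, lag_bytes) pairs, e.g. rows from pg_replication_slots."""
    return [name for name, lag in slots if lag > threshold_bytes]

# A Debezium slot 11 GB behind should page; a 1 MB lag should not:
assert slots_over_threshold([("debezium_slot", 11 * 2**30),
                             ("standby_slot", 2**20)]) == ["debezium_slot"]
```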
Offset Management
Debezium stores its offsets (the log position it last processed) in a Kafka topic called connect_offsets by default. If you accidentally delete this topic, or if the offset gets corrupted, the connector will either restart from scratch (re-snapshotting and re-emitting everything) or fail to start. Back up the offsets topic and make it immune to casual deletion via ACLs. Confluent and Debezium both provide tooling to export and inspect offsets.
Transaction Log Retention
Set log retention high enough to tolerate the longest realistic Debezium downtime. If your primary only keeps 1 GB of WAL and Debezium goes down for 6 hours during a high-write period, the logs needed to resume will have been recycled. The connector will fail to restart, and you will need to re-snapshot. For production systems, 24-48 hours of log retention is a reasonable starting point.
Connector Scaling
A single Debezium Postgres connector can only run one task because logical replication is inherently sequential. You cannot shard log reading across multiple workers. If throughput becomes a bottleneck, the solutions are to scale the downstream (more Kafka partitions, more consumer parallelism) or to split the source database into multiple logical publications with separate connectors. MySQL has similar constraints. This is a real limit for very high-volume systems, and it is the main reason some teams eventually move to specialized CDC platforms.
For orchestrating the surrounding workflows (snapshot scheduling, DR drills, schema migration automation), many teams use Apache Airflow for pipeline orchestration.
Troubleshooting Real Problems
When things go wrong, they tend to go wrong in predictable ways. Here is a debugging checklist that covers 90% of the Debezium incidents I have seen.
| Symptom | Likely Cause | Fix |
|---|---|---|
| Connector status FAILED after restart | Source log position no longer exists | Re-snapshot or recover from older offset backup |
| Events missing for a table | Table not in publication or include.list | ALTER PUBLICATION … ADD TABLE, restart connector |
| UPDATE events missing before state | REPLICA IDENTITY not set to FULL | ALTER TABLE … REPLICA IDENTITY FULL |
| Kafka lag growing unbounded | Downstream consumer slower than source writes | Add partitions, scale consumers, batch writes |
| Postgres disk filling up | Inactive replication slot holding WAL | Drop unused slot, check Debezium health |
| Schema Registry rejects new schema | Non-backward-compatible change | Make column nullable first, or bump subject compatibility |
| Duplicate events in Kafka | Connector restart mid-batch | Consumer-side idempotency on primary key |
The “consumer-side idempotency” row deserves extra emphasis. Debezium provides at-least-once delivery, not exactly-once. A connector restart or network blip can cause events to be re-emitted. Any consumer that modifies external state must be idempotent, typically by using the primary key as the upsert key.
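An idempotent apply function is the standard defense. A sketch assuming the full Debezium envelope and a single-column `id` primary key (the read model is shown as a dict for clarity):

```python
def apply_event(read_model: dict, event: dict) -> None:
    """Idempotent apply: upsert keyed on the primary key, so a replayed
    event (at-least-once delivery) rewrites the same row to the same
    values instead of creating a duplicate."""
    if event["op"] == "d":
        read_model.pop(event["before"]["id"], None)  # delete is naturally idempotent
    else:
        row = event["after"]
        read_model[row["id"]] = row  # second delivery overwrites identically

model = {}
ev = {"op": "c", "after": {"id": 7, "email": "alice@new.com"}}
apply_event(model, ev)
apply_event(model, ev)  # duplicate delivery is harmless
assert model == {7: {"id": 7, "email": "alice@new.com"}}
```

In a real consumer the dict assignment becomes an INSERT ... ON CONFLICT DO UPDATE (or the equivalent upsert in your store), keyed the same way.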
Alternative Tools
Debezium is my default recommendation for self-hosted CDC, but it is not the only option. Here is a quick survey of alternatives and when each makes sense.
Fivetran is a managed SaaS that supports CDC for many sources and loads directly into cloud warehouses. It is fast to set up and handles operational concerns for you, but it is expensive (pricing is per monthly active row) and you give up fine-grained control. Good choice if you want warehouse sync and nothing else.
AWS DMS (Database Migration Service) offers CDC as part of its migration tooling. It is cheaper than Fivetran for large volumes and integrates with Kinesis and S3 rather than Kafka. Operational UX is less polished than Debezium, but if you are already in the AWS ecosystem, it is a reasonable default.
Airbyte is an open-source data integration platform that supports CDC for Postgres, MySQL, and SQL Server via Debezium under the hood. It adds a friendlier UI and connector marketplace on top. Good choice if you want a batteries-included platform without building Kafka infrastructure yourself.
Kafka Connect JDBC source is the query-based CDC option built into Kafka Connect. It polls the source on a schedule using an incrementing ID or timestamp column, which means it misses deletes and any intermediate updates between polls. Use it only for small, append-only tables where those limitations do not bite. For anything else, prefer Debezium.
If you are choosing a source database for a CDC-heavy workload, our database comparison guide evaluates CDC ergonomics across Postgres, MySQL, MongoDB, and specialty time-series engines.
Frequently Asked Questions
How does Debezium compare to Fivetran and AWS DMS?
Debezium is open-source and self-hosted, which gives you maximum flexibility and zero per-row costs but requires operating Kafka and Kafka Connect yourself. Fivetran is a fully managed SaaS with excellent warehouse connectors but pricing that scales with data volume and limited customization. AWS DMS is a middle ground: managed service, AWS-only integrations, cheaper than Fivetran for high volumes but less polished operationally. Pick Debezium if you have Kafka already or need CDC to feed multiple downstream systems. Pick Fivetran for warehouse-only sync when speed of setup matters more than cost. Pick AWS DMS for AWS-centric migrations and simple CDC into Kinesis or S3.
Does CDC work without Kafka?
Yes. Debezium has an embedded mode that lets a Java application read change events directly without a Kafka cluster. There is also Debezium Server, which can publish to Kinesis, Pulsar, Redis Streams, Google Pub/Sub, and other destinations. Most non-Debezium CDC tools (AWS DMS, Fivetran) do not use Kafka at all. That said, Kafka’s durability and fan-out semantics make it the most common pairing because it lets many consumers read the same change stream independently without hammering the source database.
How do you handle schema changes in the source database?
Additive changes (new nullable columns) work automatically: Debezium detects them and updates the Schema Registry. For renames, drops, or type changes, use a multi-step migration: add the new structure first, update application code to write both old and new, drain consumers onto the new structure, then remove the old. Schema Registry compatibility modes (typically BACKWARD) enforce the rules. For deeply incompatible changes, you may need to re-snapshot the table, which Debezium can do on demand via signal tables without restarting the connector.
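During the dual-write window of such a migration, consumers see a mix of old-shape and new-shape events. A sketch of the consumer side, using a hypothetical rename of a `name` column to `full_name` (field names are invented for illustration), reads the new field and falls back to the old one:

```python
# Hypothetical rename migration: column "name" becomes "full_name".
# During the dual-write window, events may carry either field (or both),
# so the consumer prefers the new field and falls back to the old.

def extract_full_name(after: dict) -> str:
    """Read the new column if present; fall back to the old during migration."""
    if after.get("full_name") is not None:
        return after["full_name"]
    return after.get("name", "")

old_event = {"id": 1, "name": "Ada Lovelace"}                  # pre-migration row
dual_event = {"id": 2, "name": "Alan Turing", "full_name": "Alan Turing"}
new_event = {"id": 3, "full_name": "Grace Hopper"}             # post-migration row

names = [extract_full_name(e) for e in (old_event, dual_event, new_event)]
print(names)  # ['Ada Lovelace', 'Alan Turing', 'Grace Hopper']
```

Once every consumer reads only the new field, the old column can be dropped and the fallback branch deleted, completing the expand/contract cycle.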
What is the performance impact of Debezium on the source database?
Low, but not zero. Debezium reads the transaction log that the database was already writing, so there is no extra query load for normal operation. The main overheads are: the replication slot holds some memory on the server, REPLICA IDENTITY FULL slightly increases WAL size because full row images are written, and the initial snapshot performs a long-running read transaction. In steady state on a well-tuned Postgres instance, I have seen Debezium add less than 5% CPU overhead on the primary. The big risk is the replication slot backing up during outages, which is an operational concern, not a steady-state performance issue.
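Because the replication-slot risk is the one that actually takes clusters down, it is worth monitoring retained WAL directly. A Postgres LSN like `16/B374D848` is two hex words encoding a 64-bit byte position, so lag is a simple subtraction. The sketch below assumes you fetch `restart_lsn` from `pg_replication_slots` and the current position from `pg_current_wal_lsn()`; the sample LSN values and the 1 GiB threshold are made up for illustration.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a Postgres LSN like '16/B374D848' to an absolute byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)

def retained_wal_bytes(current_lsn: str, restart_lsn: str) -> int:
    """WAL the server must retain because the slot has not consumed past restart_lsn."""
    return lsn_to_bytes(current_lsn) - lsn_to_bytes(restart_lsn)

# Hypothetical values, as returned by:
#   SELECT restart_lsn FROM pg_replication_slots WHERE slot_name = 'debezium';
#   SELECT pg_current_wal_lsn();
lag = retained_wal_bytes("16/B374D848", "16/3002C558")
ALERT_THRESHOLD = 1 * 1024**3  # page someone past 1 GiB of retained WAL
print(f"{lag / 1024**2:.0f} MiB retained", "ALERT" if lag > ALERT_THRESHOLD else "ok")
```

On the server itself, `pg_wal_lsn_diff()` computes the same subtraction in SQL; the Python version is useful when the check lives in an external monitoring job.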
How do you handle initial snapshots for huge tables?
Use incremental snapshots (Debezium 1.6+). Instead of one long transaction reading every row, incremental snapshots chunk the work into small windows that run concurrently with log streaming. This eliminates WAL buildup from long-running transactions and lets you pause and resume the snapshot without starting over. You can also pre-populate the target system from a database export (like pg_dump) and then start Debezium in never or schema_only snapshot mode to pick up only new changes, though you must carefully align the log position to avoid missing events during the cutover.
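The chunking idea is easy to see in miniature. The toy sketch below illustrates keyset-paginated chunks only; it is not Debezium's implementation, which additionally interleaves chunks with streamed log events and deduplicates them using low/high watermark signals. The table size and chunk size are arbitrary.

```python
# Toy illustration of incremental-snapshot chunking: read the table in
# small keyset-paginated windows instead of one long transaction.

TABLE = {i: f"row-{i}" for i in range(1, 26)}  # stand-in for a big source table

def snapshot_chunks(table: dict, chunk_size: int):
    """Yield rows one chunk at a time, ordered by primary key."""
    last_pk = 0
    while True:
        # Equivalent SQL per chunk:
        #   SELECT * FROM t WHERE pk > :last_pk ORDER BY pk LIMIT :chunk_size
        chunk = sorted((pk, v) for pk, v in table.items() if pk > last_pk)[:chunk_size]
        if not chunk:
            return
        yield chunk
        last_pk = chunk[-1][0]  # resume point: the snapshot can pause/restart here

chunks = list(snapshot_chunks(TABLE, chunk_size=10))
print(len(chunks), sum(len(c) for c in chunks))  # 3 chunks covering 25 rows
```

Because each chunk is a short transaction and the resume point is just the last primary key seen, a crash or pause mid-snapshot costs at most one chunk of rework rather than a full restart.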
Wrapping Up
Change Data Capture with Debezium and Kafka is one of those technologies that feels like infrastructure magic once you get it working. Batch ETL jobs that used to run for hours get replaced with real-time streams. Dual-write bugs that haunted your microservices architecture disappear because the database is now the single source of truth. Analytics dashboards that showed yesterday’s data update within seconds of a transaction. The tradeoff is operational complexity: you need to run Kafka, you need to understand replication slots, and you need consumers that are idempotent. That complexity pays off quickly for any organization with more than a handful of data consumers, and Debezium’s maturity means you are not pioneering on the rough edges.
If you are just starting, my advice is to set up the Docker Compose stack from this guide, point it at a test Postgres database, and watch events flow into Kafka as you insert and update rows. Then think about which of your current pain points (stale dashboards, dual writes, cache invalidation) would benefit most, and build one CDC consumer for that use case. Expand from there. You will be surprised how quickly it becomes a foundational piece of your data platform.