Technique: Outbox + CDC for Reliable Event Publishing
Quick facts
- Category: Backend & Distributed Systems
- Type: Technique
- Parent style: Event-Driven Architecture
- Also used by: CQRS with CDC-Driven Read Models, Saga Pattern
- Maturity: Adopt
- Typical team size: 1-2 engineers
- Typical timeline to MVP: 2-4 weeks (first working connector; 1 week per additional source table)
- Last reviewed: 2026-05-21 by Architecture Team
1. Context
Use this technique when:
- A service must write to its database and publish an event to a message broker as a single logical operation — losing either half is unacceptable
- Two-phase commit between the database and the message broker is ruled out due to availability, latency, or vendor incompatibility concerns
- The event source is a database you do not fully control (legacy core banking system, packaged software, third-party data store) — there is no application-level event hook to intercept
- You need a replayable, durable record of every state change that downstream services can consume at their own pace
Do NOT use this technique when:
- At-most-once delivery is acceptable for the use case — if dropping occasional events is tolerable, a direct Kafka produce call in the application is simpler
- The source database does not support Change Data Capture (no WAL access, no binary log, no CDC extension) — verify CDC support before committing to this approach
- The event volume is very low and the operational overhead of a CDC connector cannot be justified — a polling publisher scanning the outbox table on a cron schedule is a simpler alternative for low-throughput sources (see microservices.io/patterns/data/polling-publisher)
2. Problem it solves
Every service that publishes events faces the dual-write problem: writing to its own database and publishing to a message broker are two separate I/O operations, and there is no atomic way to do both. If the application writes to the database and then crashes before publishing, the event is silently lost — downstream services never learn that the order was placed, the payment posted, or the account opened. If the application publishes to Kafka first and then the database write fails and rolls back, a ghost event is emitted for a transaction that never committed — consumers act on data that does not exist in the source. The Outbox + CDC technique resolves this by making the event part of the database transaction: the application writes both its business data and an outbox event record in a single local commit, then a CDC connector captures that committed row from the database's replication log and publishes it to Kafka. The broker never receives an event unless the database has durably committed it.
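A minimal sketch of the single-commit write described above, using Python's built-in `sqlite3` as a stand-in for the source database so that it runs without infrastructure. The `orders` table, the `place_order` function, and the `fail_before_commit` flag are hypothetical names chosen for illustration; the outbox columns follow the container view in section 3. A production service would run the same two inserts inside one PostgreSQL or MySQL transaction.

```python
import json
import sqlite3
import uuid


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a business table and an outbox table in the same database."""
    conn.executescript(
        """
        CREATE TABLE orders (
            order_id    TEXT PRIMARY KEY,
            total_cents INTEGER NOT NULL
        );
        CREATE TABLE outbox (
            event_id       TEXT PRIMARY KEY,
            aggregate_type TEXT NOT NULL,
            aggregate_id   TEXT NOT NULL,
            payload        TEXT NOT NULL,
            created_at     TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        );
        """
    )


def place_order(conn: sqlite3.Connection, order_id: str, total_cents: int,
                fail_before_commit: bool = False) -> None:
    """Write the order and its outbox event in one local transaction."""
    payload = json.dumps(
        {"type": "OrderPlaced", "orderId": order_id, "totalCents": total_cents}
    )
    with conn:  # commits if the block succeeds, rolls back if it raises
        conn.execute(
            "INSERT INTO orders (order_id, total_cents) VALUES (?, ?)",
            (order_id, total_cents),
        )
        conn.execute(
            "INSERT INTO outbox (event_id, aggregate_type, aggregate_id, payload) "
            "VALUES (?, ?, ?, ?)",
            (str(uuid.uuid4()), "order", order_id, payload),
        )
        if fail_before_commit:  # hypothetical hook to simulate a crash
            raise RuntimeError("simulated crash before commit")


def counts(conn: sqlite3.Connection) -> tuple:
    """Return (number of orders, number of outbox events)."""
    orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    events = conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
    return orders, events
```

Because both inserts share one transaction, a failure before commit leaves neither the order nor the event behind, so the CDC connector can never see an event for an uncommitted order.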
3. Solution overview
System context (C4 Level 1)
flowchart LR
App([Producer Service]) -->|1 single local transaction| DB[(Source DB\nbusiness tables\n+ outbox table)]
DB -->|2 WAL / binlog\nreplication stream| CDC[CDC Connector\nDebezium]
CDC -->|3 publish outbox event| Kafka[Kafka\nEvent Bus]
Kafka -->|4 consume| ConsumerA[Consumer A]
Kafka -->|4 consume| ConsumerB[Consumer B]
Container view (C4 Level 2)
flowchart TB
subgraph Producer["Producer Service"]
AppCode[Application Code]
OutboxTable[(Outbox Table\nevent_id, aggregate_type\naggregate_id, payload, created_at)]
BizTable[(Business Tables\norders, accounts, payments)]
end
subgraph CDC["CDC Layer"]
RepSlot[Replication Slot\nwal_level = logical]
Debezium[Debezium Connector\nPostgres / MySQL]
SchemaReg[Schema Registry\nAvro / Protobuf]
end
subgraph Bus["Event Bus"]
KafkaTopic[Kafka Topic\nper aggregate type]
DLQ[Dead-letter topic\nfailed events]
end
subgraph Cleanup["Outbox Cleanup"]
CleanupJob[Cleanup Job\ndelete processed rows]
end
AppCode -->|single DB transaction| BizTable
AppCode -->|same transaction| OutboxTable
OutboxTable -->|WAL stream| RepSlot
RepSlot --> Debezium
Debezium --> SchemaReg
Debezium -->|INSERT events| KafkaTopic
Debezium -->|on failure| DLQ
CleanupJob -->|DELETE rows past retention window| OutboxTable
4. Technology stack
| Layer | Primary choice | Alternatives | Notes |
|---|---|---|---|
| CDC connector | Debezium | AWS DMS, Maxwell (MySQL only), PGLogical | Debezium is the open-source standard; runs as a Kafka Connect connector; supports PostgreSQL WAL, MySQL binlog, Oracle LogMiner, SQL Server CDC — deploy on the Kafka Connect cluster already present in the EDA stack |
| Source database | PostgreSQL (wal_level = logical) | MySQL (binlog enabled), Oracle (LogMiner), SQL Server (CDC feature) | PostgreSQL is the primary choice; verify wal_level = logical is permitted in your managed DB offering before starting. RDS, Cloud SQL, and Azure Database for PostgreSQL all support it |
| Outbox table | Application-managed table in source DB | Separate outbox schema | Co-locate the outbox table in the same database and same schema as the business tables — atomicity depends on a single local transaction |
| Message bus | Apache Kafka | AWS MSK, Confluent Cloud | Debezium is designed around Kafka; use one topic per aggregate type (e.g. order.events, payment.events) with the aggregate ID as the partition key for ordering |
| Schema format | Avro + Confluent Schema Registry | Protobuf | Debezium natively integrates with Schema Registry; the outbox event payload should use Avro for schema evolution enforcement |
| Outbox cleanup | Scheduled DELETE job (retain rows for 1–7 days after creation) | Kafka log compaction on outbox topic, TTL column | Delete rows only after confirming the connector LSN has advanced past them; never delete before the connector has confirmed delivery |
| Observability | Debezium connector status (Kafka Connect REST API) + replication slot lag (PostgreSQL pg_replication_slots) | Datadog Debezium integration | Replication slot lag and connector status are the two primary health signals; monitor both. A stopped connector with an active slot is a disk-fill risk |
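The slot-lag signal from the Observability row can be sketched as follows. The SQL uses the PostgreSQL functions `pg_current_wal_lsn()` and `pg_wal_lsn_diff()` (PostgreSQL 10 and later) together with the `confirmed_flush_lsn` and `active` columns of `pg_replication_slots`. The `SlotStatus` type, the `evaluate_slots` function, and the 1 GiB default threshold are assumptions for illustration, and running the query and wiring the alerts into a monitoring system are left out.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Query to run on the source PostgreSQL primary; returns one row per logical slot.
SLOT_LAG_SQL = """
SELECT slot_name,
       active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots
WHERE slot_type = 'logical'
"""


@dataclass(frozen=True)
class SlotStatus:
    """One row of the query above (hypothetical container type)."""
    slot_name: str
    active: bool
    lag_bytes: int


def evaluate_slots(slots: Iterable[SlotStatus],
                   max_lag_bytes: int = 1024 ** 3) -> List[str]:
    """Return alert messages for slots that are inactive or lagging too far."""
    alerts = []
    for slot in slots:
        if not slot.active:
            alerts.append(f"{slot.slot_name}: inactive, WAL is being retained")
        if slot.lag_bytes > max_lag_bytes:
            alerts.append(
                f"{slot.slot_name}: lag of {slot.lag_bytes} bytes "
                f"exceeds {max_lag_bytes}"
            )
    return alerts
```

Keeping the threshold logic separate from the query makes it straightforward to unit-test the alert rules without a database.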
5. Non-functional characteristics
| Concern | Profile |
|---|---|
| Delivery guarantee | Atomic write to the source database (the business row and the outbox row commit together or not at all); at-least-once publish to Kafka (CDC can re-deliver on connector restart). Every consumer of outbox-sourced events must be idempotent. |
| Event propagation latency | CDC capture from committed WAL record to Kafka message: p95 < 100 ms under normal load. Total end-to-end (application write to consumer processing): p95 < 500 ms. Spikes during rebalance or high-throughput batch writes can extend this to seconds. |
| Availability impact on producer | Zero — the CDC connector reads from the replication log asynchronously; it does not participate in the application's write path. A connector outage pauses event publishing but does not affect the producer service's ability to write to its database. |
| Ordering guarantee | Within a single partition (keyed by aggregate ID): strict order. Across partitions: no order guarantee. All events for the same aggregate (e.g. all events for orderId=123) arrive in commit order at the consumer if the aggregate ID is used as the Kafka partition key. |
| Security posture | The replication slot exposes the WAL stream to the CDC connector, and the WAL may contain data beyond the outbox table. The REPLICATION attribute itself cannot be scoped to a table, so use a dedicated PostgreSQL role that has REPLICATION plus SELECT on the outbox table only, and scope the publication to the outbox table (CREATE PUBLICATION outbox_pub FOR TABLE outbox) so that only outbox changes are streamed. The outbox payload may contain PII; apply field-level encryption or pseudonymisation before writing to the outbox table, not after. |
| Compliance fit | GDPR: outbox events are retained in Kafka; apply the same PII retention and erasure policy as transaction records. A crypto-shredding approach (encrypt PII fields with a per-entity key; delete the key on erasure) is a widely used mechanism. SOC 2: every committed outbox event is captured from the WAL in commit order and retained in Kafka, which supports an audit trail for as long as the topic retains the events. |
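Since the delivery guarantee is at-least-once, each consumer needs a way to recognise redeliveries. The sketch below shows one common approach: record the event ID in a dedup table within the same transaction as the side effect. It uses `sqlite3` as a stand-in for the consumer's own database, and the `processed_events` and `account_balance` tables and the `handle_payment_posted` function are hypothetical names, not part of the source system.

```python
import sqlite3


def create_consumer_schema(conn: sqlite3.Connection) -> None:
    """Create the dedup table and an example projection table."""
    conn.executescript(
        """
        CREATE TABLE processed_events (event_id TEXT PRIMARY KEY);
        CREATE TABLE account_balance (
            account_id    TEXT PRIMARY KEY,
            balance_cents INTEGER NOT NULL
        );
        """
    )


def handle_payment_posted(conn: sqlite3.Connection, event_id: str,
                          account_id: str, amount_cents: int) -> bool:
    """Apply the event once; return False when it is a redelivery."""
    with conn:  # dedup marker and side effect commit or roll back together
        try:
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event_id,),
            )
        except sqlite3.IntegrityError:
            return False  # event_id already recorded: skip the side effect
        updated = conn.execute(
            "UPDATE account_balance SET balance_cents = balance_cents + ? "
            "WHERE account_id = ?",
            (amount_cents, account_id),
        )
        if updated.rowcount == 0:  # first event for this account
            conn.execute(
                "INSERT INTO account_balance (account_id, balance_cents) "
                "VALUES (?, ?)",
                (account_id, amount_cents),
            )
    return True
```

The primary key on `event_id` is what makes a redelivered event a no-op, so the balance changes once per event regardless of how many times the connector publishes it.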
6. Cost ballpark
The CDC connector runs on the Kafka Connect cluster. Incremental cost above an existing EDA infrastructure is low.
| Scale | Events / day | Incremental monthly cost | Cost drivers |
|---|---|---|---|
| Small | < 500,000 | $0 - $100 | Debezium worker on existing Kafka Connect cluster; replication slot adds negligible DB load |
| Medium | 500k - 20M | $100 - $600 | Dedicated Kafka Connect worker(s) for outbox connectors; Schema Registry overhead; outbox table storage |
| Large | 20M+ | $600 - $3,000 | Multiple dedicated Kafka Connect workers, high-throughput Kafka topics, Schema Registry cluster, WAL storage overhead on primary DB |
7. LLM-assisted development fit
| Aspect | Rating | Notes |
|---|---|---|
| Outbox table DDL (schema, indexes) | ★★★★★ | Excellent — the standard outbox schema is well-represented and generated correctly |
| Debezium connector JSON configuration | ★★★★ | Good — gets the core config right; verify publication.autocreate.mode, slot.name, and snapshot.mode manually against your DB version |
| Application-side outbox insert (same transaction as business write) | ★★★★★ | Excellent — the pattern of inserting into the outbox table within the same DB transaction is well-understood |
| Outbox cleanup job | ★★★ | Gets the concept right; the safe deletion condition (only delete rows the connector has confirmed) requires manual implementation and testing |
| Replication slot management and monitoring | ★★ | Knows the concepts; pg_replication_slots queries and slot drop/recreate runbooks require human review and testing against your specific DB version |
| Architecture decisions | ★ | Don't outsource. The choices of snapshot mode, publication scope, and cleanup strategy have long-term operational consequences. |
Recommended workflow: Create the outbox table and write the application-side insert before configuring Debezium. Verify the application correctly inserts into the outbox within the same transaction as the business write by testing with a forced rollback. Only then configure the connector — starting with snapshot.mode = never on a non-production database to validate the connector configuration before touching the production WAL.
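For the connector step of this workflow, a configuration along the following lines is a reasonable starting point. It is a sketch, not a verified production config: the property names are taken from the Debezium PostgreSQL connector and the `EventRouter` transform, while the connector name, slot name, `topic.prefix` value, and the `build_outbox_connector_config` helper are assumptions. Debezium 1.x used `database.server.name` instead of `topic.prefix`, and newer releases rename the `never` snapshot mode, so check each value against your Debezium version.

```python
import json


def build_outbox_connector_config(db_host: str, db_name: str,
                                  db_user: str, db_password: str) -> dict:
    """Build a Kafka Connect registration body for an outbox connector."""
    return {
        "name": "outbox-connector",  # hypothetical connector name
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",
            "database.hostname": db_host,
            "database.port": "5432",
            "database.user": db_user,
            "database.password": db_password,  # prefer a secret provider in practice
            "database.dbname": db_name,
            "topic.prefix": "producer",  # assumed value
            "slot.name": "outbox_slot",  # assumed value
            "publication.name": "outbox_pub",
            "publication.autocreate.mode": "disabled",  # publication is created by hand
            "table.include.list": "public.outbox",
            "snapshot.mode": "never",  # as in the workflow above; verify per version
            # Route each outbox row to a topic named after its aggregate type.
            "transforms": "outbox",
            "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
            "transforms.outbox.table.field.event.id": "event_id",
            "transforms.outbox.table.field.event.key": "aggregate_id",
            "transforms.outbox.table.field.event.payload": "payload",
            "transforms.outbox.route.by.field": "aggregate_type",
            "transforms.outbox.route.topic.replacement": "${routedByValue}.events",
        },
    }


if __name__ == "__main__":
    body = build_outbox_connector_config("db.internal", "orders", "cdc_user", "change-me")
    print(json.dumps(body, indent=2))
```

The resulting JSON is what would be sent to the Kafka Connect REST API (`POST /connectors`); the field mappings assume the outbox columns shown in the container view.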
8. Reference implementations
- Public reference: microservices.io/patterns/data/transactional-outbox. Chris Richardson's authoritative pattern definition covering the dual-write problem, outbox mechanics, and comparison with the polling publisher alternative
- Public reference: github.com/debezium/debezium-examples (outbox). Debezium's own reference implementation of the outbox pattern; demonstrates the connector configuration, outbox table schema, and event routing transformer (`io.debezium.transforms.outbox.EventRouter`)
- Public reference: postgresql.org/docs (Logical Replication). PostgreSQL official documentation for the WAL logical replication mechanism that underpins Debezium's PostgreSQL connector; covers publications, replication slots, and `wal_level` configuration
- Internal case study: Payment Hub. See Event-Driven Architecture Section 8 for the banking context; the outbox pattern was applied to the core banking → Kafka path on every payment posting
9. Related decisions (ADRs)
- ADR-0001: Tenant isolation via PostgreSQL Row-Level Security. The outbox table is a tenant-scoped table and must carry a `tenant_id` column subject to the same RLS policy
- Candidate ADR: Outbox cleanup retention window. The trade-off between disk cost (shorter retention) and the ability to replay or audit recent events (longer retention); financial systems often require a 7-day minimum
- Candidate ADR: PostgreSQL publication scope. Whether to use `FOR TABLE outbox` (safer, scoped) or `FOR ALL TABLES` (simpler, broader WAL access); scoped is the recommended default
10. Known risks & gotchas
- Replication slot accumulates WAL during connector downtime: a stopped or crashed Debezium connector leaves its replication slot in place, so PostgreSQL retains all WAL segments since the slot's last confirmed LSN and the WAL directory fills. At high write volume, a 4-hour outage can consume tens of gigabytes. Mitigation: monitor slot lag (the distance between `pg_current_wal_lsn()` and `pg_replication_slots.confirmed_flush_lsn`) and `pg_replication_slots.active` as first-class database health metrics; alert when lag exceeds a threshold (e.g. 1 GB or 30 minutes); have a tested runbook for slot drop and full projection rebuild.
- `wal_level` not set to `logical` in the managed database: the Debezium connector requires `wal_level = logical` on the source database. Many managed database offerings (RDS, Cloud SQL) support it but require an explicit parameter group change and an instance restart. On some older managed tiers it is not available at all. Discovering this after the architecture decision is painful. Mitigation: verify `wal_level` support as a precondition before adopting this technique; run `SHOW wal_level;` and confirm `logical` before committing to the design.
- Outbox table grows unbounded without a cleanup job: if no process deletes processed outbox rows, the table accumulates indefinitely, degrading query performance and increasing storage costs. Mitigation: run a scheduled cleanup job that deletes outbox rows older than your retention window (typically 1–7 days); ensure the job only deletes rows that were committed before the connector's confirmed LSN, and never deletes rows the connector has not yet read.
- Outbox table not included in the PostgreSQL publication: Debezium uses a PostgreSQL publication (`CREATE PUBLICATION`) to define which tables are replicated. If the outbox table is not in the publication, no events are ever emitted; the connector runs and the slot advances, but the outbox changes are silently ignored. Mitigation: explicitly define the publication (`CREATE PUBLICATION outbox_pub FOR TABLE outbox`) and verify that the connector's `publication.name` config matches; add a smoke test that inserts a test row and confirms it arrives in Kafka within 5 seconds.
- Schema change on the source table breaks the CDC stream: a developer adds a NOT NULL column to the outbox table without updating the Avro schema in the Schema Registry; the connector fails to serialise the new row format and stops. Mitigation: treat outbox table schema changes as a coordinated deployment (update the Schema Registry schema, redeploy the connector configuration, and redeploy the application, in that order); enforce backward-compatible schema evolution (new fields must have defaults); add a connector health check to CI.
- Initial snapshot blocks the outbox table: Debezium's default `snapshot.mode = initial` reads every captured table during the initial snapshot and, depending on the connector and version, may take table locks while doing so. On a busy outbox table this can cause a brief write pause for the producer application. Mitigation: use `snapshot.mode = never` if the outbox table is empty at connector start (the normal case for a fresh deployment); or run the initial snapshot against a read replica rather than the primary.
- Cleanup job deletes rows before the connector processes them: a misconfigured or overly aggressive cleanup job deletes outbox rows that the connector has not yet read (e.g. the connector is lagging and the cleanup window is too short). While the slot is intact the connector still reads the original INSERT from the WAL, so the loss occurs when the connector has to fall back to a table snapshot (for example after a slot drop): rows that were already deleted are never published, and those events are permanently lost. Mitigation: make the cleanup condition time-based with a generous buffer (`DELETE WHERE created_at < NOW() - INTERVAL '7 days'`); never base deletion on a processed flag set by the application, as the application cannot know whether the connector has confirmed delivery.
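The time-based cleanup condition can be sketched as a small job. This version uses `sqlite3` as a stand-in for the source database and stores `created_at` as ISO-8601 UTC strings with second precision so that string comparison matches time order; both are assumptions made so the example runs on its own. The `cleanup_outbox` function and `OUTBOX_DDL` constant are hypothetical names, and on PostgreSQL the equivalent statement is the `DELETE ... INTERVAL '7 days'` form quoted in the list above.

```python
import sqlite3
from datetime import datetime, timedelta, timezone
from typing import Optional

# Outbox columns as shown in the container view; created_at holds ISO-8601 UTC text.
OUTBOX_DDL = """
CREATE TABLE outbox (
    event_id       TEXT PRIMARY KEY,
    aggregate_type TEXT NOT NULL,
    aggregate_id   TEXT NOT NULL,
    payload        TEXT NOT NULL,
    created_at     TEXT NOT NULL
)
"""


def cleanup_outbox(conn: sqlite3.Connection,
                   retention: timedelta = timedelta(days=7),
                   now: Optional[datetime] = None) -> int:
    """Delete outbox rows older than the retention window; return the count."""
    now = now or datetime.now(timezone.utc)
    cutoff = (now - retention).isoformat(timespec="seconds")
    with conn:  # the delete commits as a single transaction
        cur = conn.execute("DELETE FROM outbox WHERE created_at < ?", (cutoff,))
    return cur.rowcount
```

The job deliberately knows nothing about a processed flag: it relies only on row age, so the retention window should comfortably exceed the longest connector outage you are prepared to tolerate.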
11. Related patterns
Primary parent
Event-Driven Architecture: Outbox + CDC is the primary mechanism for solving the producer reliability problem in EDA. It guarantees that a committed database write always results in at least one Kafka event, with no possibility of loss at the publish step; duplicates caused by connector restarts are absorbed by idempotent consumers.
Also used by
| Style | Why it uses this technique |
|---|---|
| CQRS with CDC-Driven Read Models | CDC is the stream that populates read-model projections from the source database; the outbox table is used when the source DB does not support direct WAL replication of business tables, or when a cleaner event schema is required than raw row changes |
| Saga Pattern | Orchestrated saga steps write their result to the local database and must reliably emit a command or event to advance the saga; the outbox guarantees the saga orchestrator receives the step completion event even if the worker crashes immediately after the DB commit |
Related techniques
| Technique | Relationship |
|---|---|
| Schema Registry (TODO) | Governs the Avro schema of outbox event payloads; Schema Registry and Outbox + CDC are almost always deployed together — the connector writes Avro-encoded events, Schema Registry enforces backward compatibility across producer and consumer versions |
| Idempotent API | CDC delivers events at least once; every consumer of outbox-sourced events must be idempotent to handle redeliveries safely — the Idempotent API pattern provides the server-side key-store mechanism that consumers use to detect and suppress duplicate processing |
| Dead Letter Topic (TODO) | Events published from the outbox that fail consumer processing land on a dead letter topic; the two techniques together cover the full reliability chain — Outbox + CDC guarantees reliable publish, Dead Letter Topic handles reliable consume |