
Technique: Outbox + CDC for Reliable Event Publishing

Quick facts

  • Category: Backend & Distributed Systems
  • Type: Technique
  • Parent style: Event-Driven Architecture
  • Also used by: CQRS with CDC-Driven Read Models, Saga Pattern
  • Maturity: Adopt
  • Typical team size: 1-2 engineers
  • Typical timeline to MVP: 2-4 weeks (first working connector; 1 week per additional source table)
  • Last reviewed: 2026-05-21 by Architecture Team

1. Context

Use this technique when:

  • A service must write to its database and publish an event to a message broker as a single logical operation — losing either half is unacceptable
  • Two-phase commit between the database and the message broker is ruled out due to availability, latency, or vendor incompatibility concerns
  • The event source is a database you do not fully control (legacy core banking system, packaged software, third-party data store) — there is no application-level event hook to intercept
  • You need a replayable, durable record of every state change that downstream services can consume at their own pace

Do NOT use this technique when:

  • At-most-once delivery is acceptable for the use case — if dropping occasional events is tolerable, a direct Kafka produce call in the application is simpler
  • The source database does not support Change Data Capture (no WAL access, no binary log, no CDC extension) — verify CDC support before committing to this approach
  • The event volume is very low and the operational overhead of a CDC connector cannot be justified — a polling publisher scanning the outbox table on a cron schedule is a simpler alternative for low-throughput sources (see microservices.io/patterns/data/polling-publisher)

2. Problem it solves

Every service that publishes events faces the dual-write problem: writing to its own database and publishing to a message broker are two separate I/O operations, and there is no atomic way to do both. If the application writes to the database and then crashes before publishing, the event is silently lost — downstream services never learn that the order was placed, the payment posted, or the account opened. If the application publishes to Kafka first and then the database write fails and rolls back, a ghost event is emitted for a transaction that never committed — consumers act on data that does not exist in the source. The Outbox + CDC technique resolves this by making the event part of the database transaction: the application writes both its business data and an outbox event record in a single local commit, then a CDC connector captures that committed row from the database's replication log and publishes it to Kafka. The broker never receives an event unless the database has durably committed it.
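
The single-commit write described above can be sketched as follows. This is a minimal illustration, not a production implementation: it uses SQLite from the Python standard library as a stand-in for the source database, and the `orders` table, the `place_order` function, and the `OrderPlaced` event type are hypothetical names chosen for the example. The outbox columns follow the ones named in the container view.

```python
import json
import sqlite3
import uuid


def create_schema(conn):
    # Business table and outbox table live in the same database,
    # so one local transaction can cover both.
    conn.executescript("""
        CREATE TABLE orders (
            order_id     TEXT PRIMARY KEY,
            amount_cents INTEGER NOT NULL
        );
        CREATE TABLE outbox (
            event_id       TEXT PRIMARY KEY,
            aggregate_type TEXT NOT NULL,
            aggregate_id   TEXT NOT NULL,
            payload        TEXT NOT NULL,
            created_at     TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        );
    """)


def place_order(conn, order_id, amount_cents, fail_before_commit=False):
    # "with conn" commits if the block succeeds and rolls back if it raises,
    # so the business row and the outbox row are written together or not at all.
    with conn:
        conn.execute(
            "INSERT INTO orders (order_id, amount_cents) VALUES (?, ?)",
            (order_id, amount_cents),
        )
        payload = json.dumps({"type": "OrderPlaced", "amount_cents": amount_cents})
        conn.execute(
            "INSERT INTO outbox (event_id, aggregate_type, aggregate_id, payload) "
            "VALUES (?, ?, ?, ?)",
            (str(uuid.uuid4()), "order", order_id, payload),
        )
        if fail_before_commit:
            # Hypothetical switch used only to simulate a crash before commit.
            raise RuntimeError("simulated failure before commit")
```

Calling `place_order` with `fail_before_commit=True` leaves neither an order row nor an outbox row behind, which is the property the CDC connector relies on: it only ever sees events for transactions that committed.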

3. Solution overview

System context (C4 Level 1)

flowchart LR
    App([Producer Service]) -->|1 single local transaction| DB[(Source DB\nbusiness tables\n+ outbox table)]
    DB -->|2 WAL / binlog\nreplication stream| CDC[CDC Connector\nDebezium]
    CDC -->|3 publish outbox event| Kafka[Kafka\nEvent Bus]
    Kafka -->|4 consume| ConsumerA[Consumer A]
    Kafka -->|4 consume| ConsumerB[Consumer B]

Container view (C4 Level 2)

flowchart TB
    subgraph Producer["Producer Service"]
        AppCode[Application Code]
        OutboxTable[(Outbox Table\nevent_id, aggregate_type\naggregate_id, payload, created_at)]
        BizTable[(Business Tables\norders, accounts, payments)]
    end

    subgraph CDC["CDC Layer"]
        RepSlot[Replication Slot\nwal_level = logical]
        Debezium[Debezium Connector\nPostgres / MySQL]
        SchemaReg[Schema Registry\nAvro / Protobuf]
    end

    subgraph Bus["Event Bus"]
        KafkaTopic[Kafka Topic\nper aggregate type]
        DLQ[Dead-letter topic\nfailed events]
    end

    subgraph Cleanup["Outbox Cleanup"]
        CleanupJob[Cleanup Job\ndelete rows past retention window]
    end

    AppCode -->|single DB transaction| BizTable
    AppCode -->|same transaction| OutboxTable
    OutboxTable -->|WAL stream| RepSlot
    RepSlot --> Debezium
    Debezium --> SchemaReg
    Debezium -->|INSERT events| KafkaTopic
    Debezium -->|on failure| DLQ
    CleanupJob -->|DELETE rows older than retention window| OutboxTable

4. Technology stack

| Layer | Primary choice | Alternatives | Notes |
| --- | --- | --- | --- |
| CDC connector | Debezium | AWS DMS, Maxwell (MySQL only), pglogical | Debezium is the open-source standard; runs as a Kafka Connect connector; supports PostgreSQL WAL, MySQL binlog, Oracle LogMiner, SQL Server CDC. Deploy on the Kafka Connect cluster already present in the EDA stack |
| Source database | PostgreSQL (wal_level = logical) | MySQL (binlog enabled), Oracle (LogMiner), SQL Server (CDC feature) | PostgreSQL is the primary choice; verify wal_level = logical is permitted in your managed DB offering before starting. RDS, Cloud SQL, and Azure Database for PostgreSQL all support it |
| Outbox table | Application-managed table in source DB | Separate outbox schema | Co-locate the outbox table in the same database and same schema as the business tables; atomicity depends on a single local transaction |
| Message bus | Apache Kafka | AWS MSK, Confluent Cloud | Debezium is designed around Kafka; use one topic per aggregate type (e.g. order.events, payment.events) with the aggregate ID as the partition key for ordering |
| Schema format | Avro + Confluent Schema Registry | Protobuf | Debezium natively integrates with Schema Registry; the outbox event payload should use Avro for schema evolution enforcement |
| Outbox cleanup | Scheduled DELETE job (retain rows for 1–7 days after creation) | Kafka log compaction on outbox topic, TTL column | Delete rows only after confirming the connector LSN has advanced past them; never delete before the connector has confirmed delivery |
| Observability | Debezium connector status (Kafka Connect REST API) + replication slot lag (PostgreSQL pg_replication_slots) | Datadog Debezium integration | Replication slot lag and connector status are the two primary health signals; monitor both. A stopped connector with an active slot is a disk-fill risk |
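
As a rough sketch of the slot-lag signal from the Observability row: the query below computes lag in bytes from `pg_replication_slots` using the built-in `pg_wal_lsn_diff` and `pg_current_wal_lsn` functions (PostgreSQL 10 and later). The `evaluate_slots` helper, its message wording, and the 1 GiB default threshold are assumptions made for illustration; running the query and wiring the result into an alerting system is left to whatever database driver and monitoring stack you already use.

```python
# Query to run against the primary; returns one row per logical slot.
SLOT_LAG_SQL = """
SELECT slot_name,
       active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots
WHERE slot_type = 'logical'
"""


def evaluate_slots(rows, max_lag_bytes=1024 ** 3):
    """Turn (slot_name, active, lag_bytes) rows into a list of alert strings.

    An inactive slot and an over-threshold lag are reported separately,
    because either one alone can lead to WAL filling the disk.
    """
    alerts = []
    for slot_name, active, lag_bytes in rows:
        if not active:
            alerts.append(f"{slot_name}: slot is inactive, WAL is being retained")
        if lag_bytes is not None and lag_bytes > max_lag_bytes:
            alerts.append(f"{slot_name}: lag {lag_bytes} bytes exceeds {max_lag_bytes}")
    return alerts
```

Keeping the evaluation separate from the query makes the threshold logic easy to unit-test without a database.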

5. Non-functional characteristics

| Concern | Profile |
| --- | --- |
| Delivery guarantee | Exactly-once write to the source database; at-least-once publish to Kafka (CDC can re-deliver on connector restart). Every consumer of outbox-sourced events must be idempotent. |
| Event propagation latency | CDC capture from committed WAL record to Kafka message: p95 < 100 ms under normal load. Total end-to-end (application write to consumer processing): p95 < 500 ms. Spikes during rebalance or high-throughput batch writes can extend this to seconds. |
| Availability impact on producer | Zero. The CDC connector reads from the replication log asynchronously; it does not participate in the application's write path. A connector outage pauses event publishing but does not affect the producer service's ability to write to its database. |
| Ordering guarantee | Within a single partition (keyed by aggregate ID): strict order. Across partitions: no order guarantee. All events for the same aggregate (e.g. all events for orderId=123) arrive in commit order at the consumer if the aggregate ID is used as the Kafka partition key. |
| Security posture | The WAL contains changes to every table, not just the outbox, so limit what the connector can see. Use a dedicated PostgreSQL role with the REPLICATION attribute and restrict the publication to the outbox table only (CREATE PUBLICATION outbox_pub FOR TABLE outbox). The outbox payload may contain PII; apply field-level encryption or pseudonymisation before writing to the outbox table, not after. |
| Compliance fit | GDPR: outbox events are retained in Kafka; apply the same PII retention and erasure policy as transaction records. A crypto-shredding approach (encrypt PII fields with a per-entity key; delete the key on erasure) is the standard mechanism. SOC 2: every committed event is captured and timestamped by the WAL, providing an immutable audit trail. |
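
Because delivery is at-least-once, each consumer needs some way to suppress redeliveries. One common approach, sketched below under stated assumptions, is to record the `event_id` in the same local transaction as the side effect and let a primary-key violation signal a duplicate. SQLite stands in for the consumer's database, and the `processed_events` and `account_balance` tables, the `handle_event` function, and the event field names are hypothetical.

```python
import sqlite3


def create_consumer_schema(conn):
    conn.executescript("""
        CREATE TABLE processed_events (event_id TEXT PRIMARY KEY);
        CREATE TABLE account_balance (
            account_id    TEXT PRIMARY KEY,
            balance_cents INTEGER NOT NULL
        );
    """)


def handle_event(conn, event):
    """Apply the event once. Returns True if applied, False if it was a duplicate."""
    try:
        with conn:  # dedup marker and side effect commit or roll back together
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event["event_id"],),
            )
            cur = conn.execute(
                "UPDATE account_balance SET balance_cents = balance_cents + ? "
                "WHERE account_id = ?",
                (event["amount_cents"], event["account_id"]),
            )
            if cur.rowcount == 0:
                conn.execute(
                    "INSERT INTO account_balance (account_id, balance_cents) "
                    "VALUES (?, ?)",
                    (event["account_id"], event["amount_cents"]),
                )
        return True
    except sqlite3.IntegrityError:
        # event_id already recorded: this is a redelivery, so skip it.
        return False
```

Recording the marker and the side effect in one transaction matters: if they were separate commits, a crash between them would either lose the effect or apply it twice.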

6. Cost ballpark

The CDC connector runs on the Kafka Connect cluster. Incremental cost above an existing EDA infrastructure is low.

| Scale | Events / day | Incremental monthly cost | Cost drivers |
| --- | --- | --- | --- |
| Small | < 500,000 | $0 - $100 | Debezium worker on existing Kafka Connect cluster; replication slot adds negligible DB load |
| Medium | 500k - 20M | $100 - $600 | Dedicated Kafka Connect worker(s) for outbox connectors; Schema Registry overhead; outbox table storage |
| Large | 20M+ | $600 - $3,000 | Multiple dedicated Kafka Connect workers, high-throughput Kafka topics, Schema Registry cluster, WAL storage overhead on primary DB |

7. LLM-assisted development fit

| Aspect | Rating | Notes |
| --- | --- | --- |
| Outbox table DDL (schema, indexes) | ★★★★★ | Excellent: the standard outbox schema is well-represented and generated correctly |
| Debezium connector JSON configuration | ★★★★ | Good: gets the core config right; verify publication.autocreate.mode, slot.name, and snapshot.mode manually against your DB version |
| Application-side outbox insert (same transaction as business write) | ★★★★★ | Excellent: the pattern of inserting into the outbox table within the same DB transaction is well-understood |
| Outbox cleanup job | ★★★ | Gets the concept right; the safe deletion condition (only delete rows the connector has confirmed) requires manual implementation and testing |
| Replication slot management and monitoring | ★★ | Knows the concepts; pg_replication_slots queries and slot drop/recreate runbooks require human review and testing against your specific DB version |
| Architecture decisions | | Don't outsource. The choices of snapshot mode, publication scope, and cleanup strategy have long-term operational consequences. |

Recommended workflow: Create the outbox table and write the application-side insert before configuring Debezium. Verify the application correctly inserts into the outbox within the same transaction as the business write by testing with a forced rollback. Only then configure the connector — starting with snapshot.mode = never on a non-production database to validate the connector configuration before touching the production WAL.
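
For the connector step of that workflow, a configuration along these lines is a plausible starting point. It is a sketch, not a verified deployment: the connector name, slot name, topic prefix, and the `build_outbox_connector_config` helper are hypothetical, and the column mappings assume the outbox columns shown in the container view. Property names follow the Debezium PostgreSQL connector and the EventRouter transform; `topic.prefix` is the Debezium 2.x name (1.x used `database.server.name`), and the accepted `snapshot.mode` values differ between Debezium versions, so check both against the release you run.

```python
import json


def build_outbox_connector_config(db_host, db_name, db_user, db_password):
    """Build the JSON body for POST /connectors on the Kafka Connect REST API."""
    return {
        "name": "outbox-connector",  # hypothetical connector name
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",
            "database.hostname": db_host,
            "database.port": "5432",
            "database.user": db_user,
            "database.password": db_password,  # prefer a secrets provider in practice
            "database.dbname": db_name,
            "topic.prefix": "producer",  # hypothetical; Debezium 2.x property
            "table.include.list": "public.outbox",
            "slot.name": "outbox_slot",  # hypothetical slot name
            "publication.name": "outbox_pub",
            # The publication is created by hand, so the connector must not create one.
            "publication.autocreate.mode": "disabled",
            "snapshot.mode": "never",  # as recommended above; verify for your version
            "transforms": "outbox",
            "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
            # Map the transform's expected fields onto this document's column names.
            "transforms.outbox.table.field.event.id": "event_id",
            "transforms.outbox.table.field.event.key": "aggregate_id",
            "transforms.outbox.table.field.event.payload": "payload",
            "transforms.outbox.route.by.field": "aggregate_type",
            # Routes aggregate_type "order" to topic "order.events".
            "transforms.outbox.route.topic.replacement": "${routedByValue}.events",
        },
    }


def to_request_body(config):
    return json.dumps(config, indent=2)
```

Generating the configuration from code rather than hand-editing JSON makes it easier to keep the slot, publication, and column names consistent across environments.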

8. Reference implementations

  • Public reference: microservices.io/patterns/data/transactional-outbox. Chris Richardson's authoritative pattern definition covering the dual-write problem, outbox mechanics, and comparison with the polling publisher alternative
  • Public reference: github.com/debezium/debezium-examples (outbox). Debezium's own reference implementation of the outbox pattern; demonstrates the connector configuration, outbox table schema, and event routing transformer (io.debezium.transforms.outbox.EventRouter)
  • Public reference: postgresql.org/docs (Logical Replication). PostgreSQL official documentation for the WAL logical replication mechanism that underpins Debezium's PostgreSQL connector; covers publications, replication slots, and wal_level configuration
  • Internal case study: Payment Hub — see Event-Driven Architecture Section 8 for the banking context; the outbox pattern was applied to the core banking → Kafka path on every payment posting
  • ADR-0001: Tenant isolation via PostgreSQL Row-Level Security — the outbox table is a tenant-scoped table and must carry a tenant_id column subject to the same RLS policy
  • Candidate ADR: Outbox cleanup retention window — the trade-off between disk cost (shorter retention) and the ability to replay or audit recent events (longer retention); financial systems often require 7-day minimum
  • Candidate ADR: PostgreSQL publication scope — whether to use FOR TABLE outbox (safer, scoped) or FOR ALL TABLES (simpler, broader WAL access); scoped is the recommended default

10. Known risks & gotchas

  • Replication slot accumulates WAL during connector downtime: a stopped or crashed Debezium connector leaves its replication slot in place; PostgreSQL retains all WAL segments since the slot's last confirmed LSN, filling the WAL directory. At high write volume, a 4-hour outage can consume tens of gigabytes. Mitigation: treat slot lag and pg_replication_slots.active as first-class database health metrics. pg_replication_slots has no lag column, so compute lag as the distance between the current WAL position and the slot's confirmed_flush_lsn; alert when lag exceeds a threshold (e.g. 1 GB or 30 minutes); have a tested runbook for slot drop and full projection rebuild.
  • wal_level not set to logical in the managed database — the Debezium connector requires wal_level = logical on the source database. Many managed database offerings (RDS, Cloud SQL) support it but require an explicit parameter group change and an instance restart. On some older managed tiers it is not available at all. Discovering this post-architecture-decision is painful. Mitigation: verify wal_level support as a precondition before adopting this technique; run SHOW wal_level; and confirm logical before committing to the design.
  • Outbox table grows unbounded without a cleanup job: if no process deletes processed outbox rows, the table accumulates indefinitely, degrading query performance and increasing storage costs. Mitigation: run a scheduled cleanup job that deletes outbox rows older than your retention window (typically 1–7 days). A created_at timestamp cannot be compared directly with an LSN, so keep the retention window well above the largest connector lag you tolerate and confirm the slot's confirmed position is current before the job runs; never delete rows the connector has not yet read.
  • Outbox table not included in the PostgreSQL publication — Debezium uses a PostgreSQL publication (CREATE PUBLICATION) to define which tables are replicated. If the outbox table is not in the publication, no events are ever emitted — the connector runs, the slot advances, but the outbox changes are silently ignored. Mitigation: explicitly define the publication (CREATE PUBLICATION outbox_pub FOR TABLE outbox) and verify the connector's publication.name config matches; add a smoke test that inserts a test row and confirms it arrives in Kafka within 5 seconds.
  • Schema change on the source table breaks the CDC stream — a developer adds a NOT NULL column to the outbox table without updating the Avro schema in the Schema Registry; the connector fails to serialise the new row format and stops. Mitigation: treat outbox table schema changes as a coordinated deployment — update the Schema Registry schema, redeploy the connector configuration, and redeploy the application in that order; enforce backward-compatible schema evolution (new fields must have defaults); add a connector health check to CI.
  • Initial snapshot blocks the outbox table — Debezium's default snapshot.mode = initial takes a shared lock on each table during the initial snapshot. On a busy outbox table this causes a brief write pause for the producer application. Mitigation: use snapshot.mode = never if the outbox table is empty at connector start (the normal case for a fresh deployment); or run the initial snapshot against a read replica rather than the primary.
  • Cleanup job deletes rows before the connector processes them — a misconfigured or overly aggressive cleanup job deletes outbox rows that the connector has not yet read (e.g. the connector is lagging and the cleanup window is too short). The connector skips these rows — events are permanently lost. Mitigation: make the cleanup condition time-based with a generous buffer (DELETE WHERE created_at < NOW() - INTERVAL '7 days'); never base deletion on a processed flag set by the application, as the application cannot know whether the connector has confirmed delivery.
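
The time-based cleanup condition from the last item can be sketched as below. This is an illustration under stated assumptions: SQLite stands in for the source database, `delete_expired_outbox_rows` is a hypothetical helper, and `created_at` is assumed to be stored as a UTC ISO-8601 string with second precision so that string comparison matches chronological order. In PostgreSQL the same idea is the `DELETE WHERE created_at < NOW() - INTERVAL '7 days'` statement quoted above.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def delete_expired_outbox_rows(conn, retention=timedelta(days=7), now=None):
    """Delete outbox rows older than the retention window; return the row count.

    Deletion is based only on age, never on an application-set processed flag.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = (now - retention).isoformat(timespec="seconds")
    with conn:  # run the delete in its own transaction
        cur = conn.execute("DELETE FROM outbox WHERE created_at < ?", (cutoff,))
    return cur.rowcount
```

Passing `now` explicitly keeps the function deterministic in tests; a scheduler would call it with the default.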

Primary parent

Event-Driven Architecture: Outbox + CDC is the primary mechanism for solving the producer reliability problem in EDA. It guarantees that a committed database write always results in at least one Kafka event. Nothing is lost at the publish step, but a connector restart can redeliver events, so consumers must tolerate duplicates.

Also used by

| Style | Why it uses this technique |
| --- | --- |
| CQRS with CDC-Driven Read Models | CDC is the stream that populates read-model projections from the source database; the outbox table is used when the source DB does not support direct WAL replication of business tables, or when a cleaner event schema is required than raw row changes |
| Saga Pattern | Orchestrated saga steps write their result to the local database and must reliably emit a command or event to advance the saga; the outbox guarantees the saga orchestrator receives the step completion event even if the worker crashes immediately after the DB commit |

| Technique | Relationship |
| --- | --- |
| Schema Registry (TODO) | Governs the Avro schema of outbox event payloads; Schema Registry and Outbox + CDC are almost always deployed together. The connector writes Avro-encoded events, Schema Registry enforces backward compatibility across producer and consumer versions |
| Idempotent API | CDC delivers events at least once; every consumer of outbox-sourced events must be idempotent to handle redeliveries safely. The Idempotent API pattern provides the server-side key-store mechanism that consumers use to detect and suppress duplicate processing |
| Dead Letter Topic (TODO) | Events published from the outbox that fail consumer processing land on a dead letter topic; the two techniques together cover the full reliability chain. Outbox + CDC guarantees reliable publish, Dead Letter Topic handles reliable consume |