Pattern: Saga Pattern (Orchestrated & Choreographed)

Quick facts

  • Category: Backend & Distributed Systems
  • Maturity: Trial
  • Typical team size: 3-6 engineers
  • Typical timeline to MVP: 8-14 weeks
  • Last reviewed: 2026-05-19 by Architecture Team

1. Context

Use this pattern when:

  • A business transaction spans multiple services and must either complete fully or roll back cleanly — but two-phase commit (2PC) is unacceptable due to availability, latency, or vendor lock-in concerns
  • Each step of the transaction has a well-defined compensating action that can undo its effect
  • The requirement is specifically distributed transaction semantics: all-or-nothing completion across service boundaries, with explicit rollback on failure
  • Auditability of each step's outcome and its compensation is a first-class requirement (financial, compliance, regulated industries)

Saga vs Workflow Orchestration

The Saga pattern is a specific pattern for distributed transactions with compensating actions. Workflow Orchestration is the broader architectural style it is usually built on. If your requirement is a long-running process but rollback semantics are not the primary concern, reach for Workflow Orchestration directly. If rollback correctness across service boundaries is the core requirement, add the Saga pattern on top.

Do NOT use this pattern when:

  • The transaction touches a single database — use a local ACID transaction instead; sagas add complexity with no benefit
  • No compensating action is possible for a step (e.g. an SMS has been sent, a payment dispatched to an external network with no recall API) — design the process to make that step last, or accept the risk explicitly
  • The team has not yet adopted event-driven or workflow-orchestration patterns — start with Workflow Orchestration first; the Saga pattern builds on it

2. Problem it solves

A business operation such as customer onboarding touches KYC, sanctions screening, core account creation, card issuance, and channel entitlement — each owned by a different service with its own database. Running these as a single synchronous chain means the slowest or least-available service determines the latency and reliability of the whole flow. Using 2PC locks rows across multiple databases for the duration, which breaks at scale and conflicts with cloud-managed data stores. The Saga pattern runs each step as an independent local transaction and, on any failure, runs compensating transactions in reverse order to restore a consistent state — without any cross-service locking.
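
The reverse-order compensation described above can be sketched in a few lines. This is a minimal in-memory illustration, not a production orchestrator: `Step`, `run_saga`, and the step names are hypothetical, and a real implementation would persist progress durably and retry compensations.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]       # forward local transaction
    compensate: Callable[[], None]   # undoes the forward action


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # The failed step is assumed to have committed nothing,
            # so only the steps that completed before it are undone.
            for done in reversed(completed):
                done.compensate()
            return False
        completed.append(step)
    return True


log: List[str] = []


def fail_card() -> None:
    raise RuntimeError("card service unavailable")


steps = [
    Step("kyc", lambda: log.append("kyc"), lambda: log.append("undo kyc")),
    Step("account", lambda: log.append("account"), lambda: log.append("undo account")),
    Step("card", fail_card, lambda: log.append("undo card")),
]
ok = run_saga(steps)
# log is now: ["kyc", "account", "undo account", "undo kyc"]
```

Note that the failed step's own compensation is never called; only steps that committed are undone, and in the opposite order to which they ran.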

3. Solution overview

Sagas come in two flavours. Choose based on team size and coordination needs.

Orchestrated saga

A central coordinator (the orchestrator) issues commands to participants and reacts to their replies. State lives in the orchestrator. Easier to reason about, debug, and visualise; preferred for complex or regulated flows.

flowchart TD
    Client([Client]) -->|start saga| Orchestrator[Saga Orchestrator]

    Orchestrator -->|1 run KYC check| KYC[KYC Service]
    KYC -->|success / fail| Orchestrator

    Orchestrator -->|2 run sanctions screen| Sanctions[Sanctions Service]
    Sanctions -->|clear / hit| Orchestrator

    Orchestrator -->|3 create core account| Core[Core Banking]
    Core -->|account created / fail| Orchestrator

    Orchestrator -->|4 issue card| Card[Card Service]
    Card -->|card issued / fail| Orchestrator

    Orchestrator -->|5 grant channel access| Channel[Channel Service]
    Channel -->|entitled / fail| Orchestrator

    Orchestrator -->|compensate: close account| Core
    Orchestrator -->|compensate: void card| Card

Choreographed saga

Participants react to domain events published by the previous step. No central coordinator; each service knows only its own trigger event and what to publish next. Lower coupling; harder to trace end-to-end.

flowchart LR
    KYC[KYC Service] -->|kyc.passed| Sanctions[Sanctions Service]
    Sanctions -->|sanctions.cleared| Core[Core Banking]
    Core -->|account.created| Card[Card Service]
    Card -->|card.issued| Channel[Channel Service]
    Channel -->|onboarding.complete| Notify[Notification Service]

    Core -->|account.creation.failed| Compensate1[Compensate:\nCancel KYC reservation]
    Card -->|card.issuance.failed| Compensate2[Compensate:\nClose account]
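
The event chain in the diagram can be approximated with an in-memory bus. This is a sketch under stated assumptions: `EventBus` and `wire_services` are hypothetical, delivery is synchronous (a real deployment would use a durable broker), and `account.closed` is an invented name for the event Core Banking publishes after compensating.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class EventBus:
    """Synchronous in-memory stand-in for a durable event bus."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = defaultdict(list)
        self.published: List[str] = []   # topic trace, useful for debugging

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        self.published.append(topic)
        for handler in list(self._handlers[topic]):
            handler(event)


def wire_services(bus: EventBus, card_fails: bool) -> None:
    # Each participant knows only its trigger event and what it publishes next.
    bus.subscribe("kyc.passed", lambda e: bus.publish("sanctions.cleared", e))
    bus.subscribe("sanctions.cleared", lambda e: bus.publish("account.created", e))

    def issue_card(event: dict) -> None:
        topic = "card.issuance.failed" if card_fails else "card.issued"
        bus.publish(topic, event)

    bus.subscribe("account.created", issue_card)
    bus.subscribe("card.issued", lambda e: bus.publish("onboarding.complete", e))
    # Core Banking reacts to the failure event by closing the account.
    bus.subscribe("card.issuance.failed", lambda e: bus.publish("account.closed", e))


bus = EventBus()
wire_services(bus, card_fails=True)
bus.publish("kyc.passed", {"sagaId": "s-1"})
```

The `published` trace is the only end-to-end view of the saga here, which illustrates why choreographed flows need a shared `sagaId` and central event collection to be traceable.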

Choosing orchestrated vs choreographed

| Factor | Prefer orchestrated | Prefer choreographed |
|---|---|---|
| Number of steps | 4+ | 2-3 |
| Regulatory auditability | Required | Not required |
| Teams owning participants | Multiple | One or two |
| Debugging and observability needs | High | Low |
| Coupling tolerance | Central coordinator acceptable | Fully decoupled preferred |

For financial and regulated workflows, default to orchestrated. The visibility and explicit state machine outweigh the coupling cost.

4. Technology stack

| Layer | Primary choice | Alternatives | Notes |
|---|---|---|---|
| Orchestrator runtime | Temporal | AWS Step Functions, Conductor, Axon Server | Temporal provides durable execution, built-in retry, timeout, compensation, and full history; it is the production default for orchestrated sagas. Step Functions suits AWS-native teams who prefer managed infrastructure |
| Event bus (choreographed) | Apache Kafka | AWS EventBridge, NATS JetStream | Kafka guarantees durability and ordering per partition; use sagaId as the partition key so all events for one saga instance arrive in order |
| State store (orchestrated, if not Temporal) | PostgreSQL (saga state table) | Redis, DynamoDB | Persist the current saga step and all step outcomes; this is your audit log, so never keep it only in memory |
| Idempotency | Idempotency key on every command + database unique constraint | Redis SETNX | Every participant must be idempotent; the orchestrator will retry commands on timeout; a duplicate command must not double-post a charge or create a second account |
| Schema / contract | AsyncAPI (choreographed events) + Protobuf | Avro, JSON Schema | Contract-first; publish the AsyncAPI spec before the first consumer is built |
| Observability | OpenTelemetry trace propagation through saga context | Datadog, Grafana Tempo | Propagate sagaId and correlationId through every command and event; without this, debugging a failed saga in production is extremely painful |
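
For the non-Temporal state store row, one possible shape is an append-only step log. The sketch below uses SQLite in memory purely as a stand-in for PostgreSQL; the table name `saga_step_log`, its columns, and the helper functions are assumptions for illustration, not an established schema.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Optional, Tuple

conn = sqlite3.connect(":memory:")  # stand-in for a PostgreSQL connection
conn.execute(
    """
    CREATE TABLE saga_step_log (
        saga_id     TEXT    NOT NULL,
        seq         INTEGER NOT NULL,
        step        TEXT    NOT NULL,
        outcome     TEXT    NOT NULL,
        recorded_at TEXT    NOT NULL,
        PRIMARY KEY (saga_id, seq)
    )
    """
)


def record_step(saga_id: str, step: str, outcome: str) -> int:
    """Append one step outcome; rows are never updated, so the log doubles as an audit trail."""
    with conn:  # commits on success, rolls back on error
        (seq,) = conn.execute(
            "SELECT COALESCE(MAX(seq), 0) + 1 FROM saga_step_log WHERE saga_id = ?",
            (saga_id,),
        ).fetchone()
        conn.execute(
            "INSERT INTO saga_step_log VALUES (?, ?, ?, ?, ?)",
            (saga_id, seq, step, outcome, datetime.now(timezone.utc).isoformat()),
        )
    return seq


def current_state(saga_id: str) -> Optional[Tuple[str, str]]:
    """Return (step, outcome) of the most recent entry, or None for an unknown saga."""
    return conn.execute(
        "SELECT step, outcome FROM saga_step_log WHERE saga_id = ? ORDER BY seq DESC LIMIT 1",
        (saga_id,),
    ).fetchone()
```

On restart, an orchestrator could call `current_state` to decide whether to resume forward or begin compensating. With concurrent writers, the composite primary key would reject a duplicate `seq` rather than silently overwrite history.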

5. Non-functional characteristics

| Concern | Profile |
|---|---|
| Consistency model | Eventually consistent. Each step commits locally; the system converges to a consistent state only after all steps (or all compensations) complete. Design UIs and downstream consumers to handle intermediate states explicitly. |
| Scalability | Scales horizontally at each participant independently. The orchestrator (Temporal) scales via worker pools; Kafka scales via partition count. A single long-running saga instance does not block others. |
| Availability target | The saga framework (Temporal / Kafka) targets 99.9%+. A participant outage pauses in-flight sagas at that step; the orchestrator retries with backoff until the participant recovers or a timeout fires the compensation path. |
| Latency target | End-to-end saga duration is the sum of participant latencies plus queue/retry overhead. Design for p95 completion time based on the slowest participant. Happy-path SLAs should be set per saga type, not per step. |
| Security posture | Each participant validates the command source (mTLS or signed JWT from the orchestrator). Saga state tables contain financial / PII data: encrypt at rest, access-control by saga type. Compensating commands require the same auth as forward commands. |
| Compliance fit | GDPR: events may contain PII; apply the same retention and erasure policy as transaction records. SOC 2: every state transition is timestamped and immutable. PCI-DSS: scope each saga carefully; avoid flowing raw card data through the orchestrator. |
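
The availability row's "retries with backoff until the participant recovers or a timeout fires the compensation path" could look roughly like this. It is a sketch only: `call_with_backoff`, `StepTimedOut`, and the default delays are hypothetical, and a runtime such as Temporal would normally supply this behaviour through its own retry policies.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class StepTimedOut(Exception):
    """Raised when the step deadline is exhausted; the caller starts compensation."""


def call_with_backoff(
    command: Callable[[], T],
    deadline_s: float,
    base_delay_s: float = 0.5,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Retry a participant call with exponential backoff until it succeeds or the deadline passes."""
    start = clock()
    delay = base_delay_s
    attempts = 0
    while True:
        attempts += 1
        try:
            return command()
        except ConnectionError:
            # Stop if the next wait would run past the step deadline.
            if (clock() - start) + delay > deadline_s:
                raise StepTimedOut(f"gave up after {attempts} attempts")
            sleep(delay)
            delay = min(delay * 2, max_delay_s)
```

Only `ConnectionError` is treated as transient here; a business rejection (for example a sanctions hit) should surface immediately rather than be retried. The `sleep` and `clock` parameters exist so the behaviour can be tested without real waiting.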

6. Cost ballpark

Indicative monthly USD cost. Temporal cluster compute and Kafka are the dominant costs.

| Scale | Saga instances / day | Monthly cost (USD) | Cost drivers |
|---|---|---|---|
| Small | < 10,000 | $400 - $1,200 | Temporal cluster (3 nodes), Kafka (3 nodes), participant compute |
| Medium | 10k - 500k | $1,500 - $7,000 | Larger Temporal and Kafka clusters, dedicated worker pools, full observability stack |
| Large | 500k+ | $8,000 - $35,000 | Multi-region Temporal, high-throughput Kafka, Confluent Platform, dedicated SRE capacity |

7. LLM-assisted development fit

| Aspect | Rating | Notes |
|---|---|---|
| Temporal workflow and activity boilerplate | ★★★★ | Good: Temporal Go and Java SDKs are well-represented; verify timeout and retry policy values manually |
| Kafka consumer / producer for choreographed sagas | ★★★★★ | Excellent: standard Kafka patterns apply directly |
| Compensation logic design | ★★ | Understands the concept; correctness of compensating chains for domain-specific invariants requires human design and extensive testing |
| Idempotency key implementation | ★★★ | Gets the pattern right; subtle edge cases (retry within a DB transaction vs outside) need manual review |
| Architecture decisions | | Don't outsource. Use ADRs. Saga design decisions are expensive to reverse. |

Recommended workflow: Map the happy path first, then map every failure mode and its compensating action before writing a line of code. Run chaos/fault injection tests against the compensation path before launch — it is the path you rely on when things go wrong.

8. Reference implementations

  • Public reference: microservices.io/patterns/data/saga: Chris Richardson's authoritative pattern documentation covering both orchestration and choreography with sequence diagrams and implementation guidance
  • Public reference: github.com/microservices-patterns/ftgo-application: the reference implementation for Richardson's Microservices Patterns book; demonstrates orchestrated sagas, event sourcing, and CQRS in a food-delivery domain
  • Public reference: learn.microsoft.com, Saga design pattern: Microsoft Azure Architecture Center reference including orchestration vs choreography decision guidance and failure-mode catalogue
  • Public reference: github.com/eventuate-tram/eventuate-tram-sagas: Java saga framework from Chris Richardson; useful for implementation patterns even without adopting the framework directly
  • Internal case studies: Digital banking — customer onboarding and cross-border payment (see below)

Internal case study — Customer onboarding: KYC to account to card to channel

A retail banking onboarding flow creates a full banking relationship in one customer-initiated action: identity verification (KYC), sanctions screening, core account creation, card issuance, and digital channel entitlement. Each step is owned by a separate bounded context with its own database and team.

The original design called each service synchronously in a chain. A sanctions service degradation at 3% error rate caused 3% of onboardings to fail with no recovery path; customers had to restart from the beginning. Partial completions (account created, card not issued) created orphan records that required manual remediation.

What changed

An orchestrated saga (Temporal) replaced the synchronous chain. Each service call is wrapped in a Temporal Activity; the Temporal workflow holds the saga state machine and drives each step by invoking the corresponding activity.

flowchart TD
    App([Mobile App]) -->|POST /onboarding| API[API Gateway]
    API -->|start workflow| Temporal[Temporal\nOrchestrator]

    Temporal -->|RunKYC| KYC[KYC Service]
    Temporal -->|RunSanctions| Sanctions[Sanctions Service]
    Temporal -->|CreateAccount| Core[Core Banking]
    Temporal -->|IssueCard| Card[Card Issuance]
    Temporal -->|GrantEntitlement| Channel[Channel Service]

    Temporal --> DB[(Temporal DB\nPostgreSQL)]

    Temporal -->|compensate: CloseAccount| Core
    Temporal -->|compensate: VoidCard| Card

    Temporal -->|onboarding.completed / failed| Events[Kafka\nNotification bus]

Compensation map

| Step | Compensating action | Idempotency key |
|---|---|---|
| KYC check | Cancel KYC session | customerId + applicationId |
| Sanctions screen | Mark screening voided | customerId + applicationId |
| Core account creation | Close account (zero-balance) | accountId |
| Card issuance | Void unactivated card | cardId |
| Channel entitlement | Revoke entitlement | customerId + channelId |
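
One way to keep the compensation map close to the code is to express it as data and derive the compensation plan from it. The identifiers below (`COMPENSATION_MAP`, `compensation_plan`, the snake_case step and action names) are hypothetical, and prefixing each key with the action name is a design assumption made here so that forward and compensating commands never share a key.

```python
from typing import Dict, List, Tuple

# step -> (compensating action, context fields that form the idempotency key)
COMPENSATION_MAP: Dict[str, Tuple[str, Tuple[str, ...]]] = {
    "kyc_check": ("cancel_kyc_session", ("customerId", "applicationId")),
    "sanctions_screen": ("mark_screening_voided", ("customerId", "applicationId")),
    "core_account_creation": ("close_account", ("accountId",)),
    "card_issuance": ("void_unactivated_card", ("cardId",)),
    "channel_entitlement": ("revoke_entitlement", ("customerId", "channelId")),
}


def compensation_plan(
    completed_steps: List[str], context: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Return (action, idempotency_key) pairs in reverse order of completion."""
    plan: List[Tuple[str, str]] = []
    for step in reversed(completed_steps):
        action, key_fields = COMPENSATION_MAP[step]
        key = ":".join([action] + [context[field] for field in key_fields])
        plan.append((action, key))
    return plan


plan = compensation_plan(
    ["kyc_check", "sanctions_screen", "core_account_creation"],
    {"customerId": "c-1", "applicationId": "a-7", "accountId": "acc-9"},
)
```

A step missing from the map raises `KeyError`, which is deliberate in this sketch: a step without a declared compensation should fail loudly at design time rather than be skipped silently.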

Outcomes

| Metric | Before | After |
|---|---|---|
| Onboarding success rate | 94% (during sanctions degradation) | 99.6% (retries absorb transient failures) |
| Orphan records requiring manual fix | ~200 / month | 0 (compensation path handles all partial failures) |
| Median end-to-end onboarding time | 8 s | 6 s (parallel KYC + sanctions via Temporal async activities) |
| Audit trail completeness | Application logs only | Full step-by-step history in Temporal + event log |

Gotchas observed

  • Sanctions service had no idempotent cancel: the vendor API did not support cancel-by-reference-ID, so the compensating action is a no-op on the vendor side and only marks the screening as voided in our own DB. Acceptable because a voided screening has no downstream effect; document the gap explicitly in the compensation map.
  • Temporal worker cold-start delayed first retry — workers scaled to zero overnight; the first retry after an early-morning failure waited 40 s for a worker to start. Mitigated by keeping a minimum of one warm worker per saga type at all times.
  • History size limit hit on a long-running saga: Temporal caps workflow history at roughly 50,000 events (the documented hard limit is 51,200 events or 50 MB); a saga with many polling loops exhausted this. Solved by using ContinueAsNew to reset history at a clean checkpoint.

Internal case study — Cross-border payment: FX + sanctions + ledger + correspondent dispatch

A cross-border payment involves FX rate lock, sanctions screening of the beneficiary, ledger debit of the sender, and dispatch to a correspondent banking network. The final step is externally irreversible once the SWIFT message is sent.

Saga design rule: place the irreversible step last. Everything before it can be compensated; once the correspondent message is dispatched, compensation means raising a manual recall — a business process, not a system one.

flowchart LR
    Pay([Payment\nInstruction]) --> FX[1 Lock FX Rate\ncompensate: release lock]
    FX --> Sanctions[2 Screen Beneficiary\ncompensate: void screening]
    Sanctions --> Ledger[3 Debit Sender Ledger\ncompensate: credit back]
    Ledger --> Dispatch[4 Dispatch to Correspondent\nIRREVERSIBLE after ACK]
    Dispatch --> Complete([Payment\nDispatched])

Gotchas observed

  • FX rate lock expiry during downstream delay — sanctions screening occasionally took longer than the 60 s FX lock window. Resolved by extending the lock to 5 min and adding a lock-expiry check before the ledger debit step; if expired, re-lock at current rate and notify the customer of the revised rate.
  • Duplicate SWIFT message on orchestrator retry: a network timeout caused the orchestrator to retry the dispatch step, and the correspondent received two messages. Fixed by checking an idempotency table keyed on the payment reference (with the dispatch timestamp recorded alongside) before sending; duplicates are suppressed at the dispatch service, not the orchestrator.
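
The lock-expiry check from the first gotcha can be sketched as a small pure function that runs immediately before the ledger debit. `FxLock` and `ensure_valid_lock` are hypothetical names; the 300 s default mirrors the 5 min window mentioned above, and fetching the current rate and notifying the customer are left to the caller.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FxLock:
    rate: float
    locked_at: float      # epoch seconds when the rate was locked
    ttl_s: float = 300.0  # 5 min lock window


def ensure_valid_lock(lock: FxLock, now: float, current_rate: float) -> Tuple[FxLock, bool]:
    """Return (lock_to_use, relocked).

    If the lock is still inside its window it is returned unchanged.
    Otherwise a new lock at the current rate is returned and relocked is True,
    signalling that the customer must be told about the revised rate.
    """
    if now - lock.locked_at < lock.ttl_s:
        return lock, False
    return FxLock(rate=current_rate, locked_at=now, ttl_s=lock.ttl_s), True


original = FxLock(rate=1.10, locked_at=1000.0)
still_valid, relocked_early = ensure_valid_lock(original, now=1200.0, current_rate=1.12)
refreshed, relocked_late = ensure_valid_lock(original, now=1300.0, current_rate=1.12)
```

Passing `now` in explicitly keeps the check deterministic, which matters if the orchestrator replays the step after a crash.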

9. Related ADRs

  • ADR-0001: Tenant isolation via PostgreSQL Row-Level Security; saga state tables must follow the same RLS policy as all tenant-scoped tables
  • Candidate ADR: Temporal vs AWS Step Functions as the orchestrator — record when your organisation makes a committed decision
  • Candidate ADR: Orchestrated vs choreographed saga as the organisational default

10. Known risks & gotchas

  • Missing idempotency on participants breaks retries and compensation: the orchestrator retries timed-out commands; a non-idempotent participant creates a second account, double-charges a card, or issues two cards. Mitigation: every command handler must check an idempotency key (unique constraint, or in PostgreSQL INSERT ... ON CONFLICT DO NOTHING) before executing. Verify this with forced timeout injection before launch.
  • Irreversible steps placed too early in the chain — a notification email at step 2 tells the customer their account is open; step 4 fails and the account rolls back. The customer calls support. Mitigation: place externally visible or irreversible effects (emails, SMS, external network dispatches) as the last step; all preceding steps must be compensatable.
  • Saga state accumulates without archival: Temporal history, Kafka topics with long or unlimited retention, and saga state tables grow indefinitely at volume. Mitigation: archive completed and compensated sagas to cold storage after 90 days; set Temporal namespace retention to match your compliance requirement, not the default.
  • Choreographed sagas have no global visibility — when a choreographed saga stalls, no single service knows the full state; debugging requires correlating events across multiple topics by sagaId. Mitigation: emit a structured saga.step.completed / saga.step.failed event from every participant with a consistent sagaId; ingest into a single observability store for end-to-end tracing.
  • Temporal worker restart during an active workflow pauses it — in-flight activities are interrupted on worker shutdown; Temporal re-schedules them, but the delay adds latency to customer-visible flows. Mitigation: graceful shutdown (drain in-flight activities before stopping); set activity heartbeat timeouts short enough to detect worker loss quickly.
  • Compensation failures need a manual fallback — a compensating action can itself fail (card service is down when trying to void a card). Mitigation: compensating commands must be retried with the same durability as forward commands; define a dead-letter escalation path (ops alert, support ticket) for compensations that exhaust all retries.
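
The idempotency mitigation in the first risk above can be illustrated with a unique constraint and a conflict-tolerant insert. This sketch uses SQLite in memory as a stand-in for the participant's database (SQLite 3.24+ accepts the same `ON CONFLICT ... DO NOTHING` clause as PostgreSQL); the table names and `handle_create_account` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the participant's own database
conn.execute(
    "CREATE TABLE processed_commands ("
    " idempotency_key TEXT PRIMARY KEY,"
    " result TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE accounts ("
    " account_id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " customer_id TEXT NOT NULL)"
)


def handle_create_account(idempotency_key: str, customer_id: str) -> str:
    """Create an account at most once per idempotency key; replays return the first result."""
    with conn:  # key claim and business write commit (or roll back) together
        claimed = conn.execute(
            "INSERT INTO processed_commands (idempotency_key, result) VALUES (?, '') "
            "ON CONFLICT (idempotency_key) DO NOTHING",
            (idempotency_key,),
        )
        if claimed.rowcount == 0:
            # Duplicate command (for example an orchestrator retry): do no work.
            (result,) = conn.execute(
                "SELECT result FROM processed_commands WHERE idempotency_key = ?",
                (idempotency_key,),
            ).fetchone()
            return result
        account_id = conn.execute(
            "INSERT INTO accounts (customer_id) VALUES (?)", (customer_id,)
        ).lastrowid
        result = f"account:{account_id}"
        conn.execute(
            "UPDATE processed_commands SET result = ? WHERE idempotency_key = ?",
            (result, idempotency_key),
        )
        return result
```

Claiming the key and performing the business write inside the same local transaction is the important design choice: if the handler crashes midway, both are rolled back and the retry starts clean.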