Pattern: Saga Pattern (Orchestrated & Choreographed)

Quick facts

Category: Backend & Distributed Systems
Maturity: Trial
Typical team size: 3-6 engineers
Typical timeline to MVP: 8-14 weeks
Last reviewed: 2026-05-19 by Architecture Team

1. Context

Use this pattern when:

A business transaction spans multiple services and must either complete fully or roll back cleanly — but two-phase commit (2PC) is unacceptable due to availability, latency, or vendor lock-in concerns
Each step of the transaction has a well-defined compensating action that can undo its effect
The requirement is specifically distributed transaction semantics: all-or-nothing completion across service boundaries, with explicit rollback on failure
Auditability of each step's outcome and its compensation is a first-class requirement (financial, compliance, regulated industries)

Saga vs Workflow Orchestration

The Saga pattern is a specific pattern for distributed transactions with compensating actions. Workflow Orchestration is the broader architectural style it is usually built on. If your requirement is a long-running process but rollback semantics are not the primary concern, reach for Workflow Orchestration directly. If rollback correctness across service boundaries is the core requirement, add the Saga pattern on top.

Do NOT use this pattern when:

The transaction touches a single database — use a local ACID transaction instead; sagas add complexity with no benefit
No compensating action is possible for a step (e.g. an SMS has been sent, a payment dispatched to an external network with no recall API) — design the process to make that step last, or accept the risk explicitly
The team has not yet adopted event-driven or workflow-orchestration patterns. Adopt Workflow Orchestration first; the Saga pattern builds on it

2. Problem it solves

A business operation such as customer onboarding touches KYC, sanctions screening, core account creation, card issuance, and channel entitlement — each owned by a different service with its own database. Running these as a single synchronous chain means the slowest or least-available service determines the latency and reliability of the whole flow. Using 2PC locks rows across multiple databases for the duration, which breaks at scale and conflicts with cloud-managed data stores. The Saga pattern runs each step as an independent local transaction and, on any failure, runs compensating transactions in reverse order to restore a consistent state — without any cross-service locking.
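
The core mechanism described above (local steps, then compensations in reverse order on failure) can be sketched in a few lines. This is a minimal in-process illustration, not a production orchestrator: the `Step` type, `run_saga`, and the step names are hypothetical, and it assumes a failed step commits nothing, so only earlier steps need undoing.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]        # local transaction inside one service
    compensation: Callable[[], None]  # undoes the action's effect


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # Assumption: the failed step committed nothing, so only the
            # steps that completed before it are undone.
            for done in reversed(completed):
                done.compensation()
            return False
        completed.append(step)
    return True


log: List[str] = []


def fail_card() -> None:
    raise RuntimeError("card service unavailable")


steps = [
    Step("kyc", lambda: log.append("kyc"), lambda: log.append("undo kyc")),
    Step("account", lambda: log.append("account"), lambda: log.append("undo account")),
    Step("card", fail_card, lambda: log.append("undo card")),
]

ok = run_saga(steps)
```

No lock is held across services at any point; consistency is restored purely by running the compensations.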

3. Solution overview

Sagas come in two flavours. Choose based on team size and coordination needs.

Orchestrated saga

A central coordinator (the orchestrator) issues commands to participants and reacts to their replies. State lives in the orchestrator. Easier to reason about, debug, and visualise; preferred for complex or regulated flows.

flowchart TD
    Client([Client]) -->|start saga| Orchestrator[Saga Orchestrator]

    Orchestrator -->|1 run KYC check| KYC[KYC Service]
    KYC -->|success / fail| Orchestrator

    Orchestrator -->|2 run sanctions screen| Sanctions[Sanctions Service]
    Sanctions -->|clear / hit| Orchestrator

    Orchestrator -->|3 create core account| Core[Core Banking]
    Core -->|account created / fail| Orchestrator

    Orchestrator -->|4 issue card| Card[Card Service]
    Card -->|card issued / fail| Orchestrator

    Orchestrator -->|5 grant channel access| Channel[Channel Service]
    Channel -->|entitled / fail| Orchestrator

    Orchestrator -->|compensate: close account| Core
    Orchestrator -->|compensate: void card| Card

Choreographed saga

Participants react to domain events published by the previous step. No central coordinator; each service knows only its own trigger event and what to publish next. Lower coupling; harder to trace end-to-end.

flowchart LR
    KYC[KYC Service] -->|kyc.passed| Sanctions[Sanctions Service]
    Sanctions -->|sanctions.cleared| Core[Core Banking]
    Core -->|account.created| Card[Card Service]
    Card -->|card.issued| Channel[Channel Service]
    Channel -->|onboarding.complete| Notify[Notification Service]

    Core -->|account.creation.failed| Compensate1[Compensate:\nCancel KYC reservation]
    Card -->|card.issuance.failed| Compensate2[Compensate:\nClose account]
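
To make the event chain concrete, here is a small sketch of the choreography using an in-memory bus as a stand-in for a real broker. The `EventBus` class, the `wire` function, and the `account.closed` event are assumptions made for illustration; the other event names come from the diagram.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, List, Tuple

Handler = Callable[[str], None]  # a handler receives the saga id


class EventBus:
    """In-memory stand-in for a broker such as Kafka (assumption for this sketch)."""

    def __init__(self) -> None:
        self.handlers: Dict[str, List[Handler]] = defaultdict(list)
        self.queue: Deque[Tuple[str, str]] = deque()
        self.trace: List[str] = []  # every event published, in order

    def subscribe(self, event: str, handler: Handler) -> None:
        self.handlers[event].append(handler)

    def publish(self, event: str, saga_id: str) -> None:
        self.trace.append(event)
        self.queue.append((event, saga_id))

    def drain(self) -> None:
        # Deliver events until no handler publishes anything new.
        while self.queue:
            event, saga_id = self.queue.popleft()
            for handler in self.handlers[event]:
                handler(saga_id)


def wire(bus: EventBus, card_service_up: bool) -> None:
    # Each participant knows only its trigger event and what it publishes next.
    bus.subscribe("kyc.passed", lambda s: bus.publish("sanctions.cleared", s))
    bus.subscribe("sanctions.cleared", lambda s: bus.publish("account.created", s))
    bus.subscribe(
        "account.created",
        lambda s: bus.publish("card.issued" if card_service_up else "card.issuance.failed", s),
    )
    bus.subscribe("card.issued", lambda s: bus.publish("onboarding.complete", s))
    # Core Banking reacts to the failure event and compensates its own step.
    bus.subscribe("card.issuance.failed", lambda s: bus.publish("account.closed", s))


happy = EventBus()
wire(happy, card_service_up=True)
happy.publish("kyc.passed", "saga-1")
happy.drain()

failing = EventBus()
wire(failing, card_service_up=False)
failing.publish("kyc.passed", "saga-2")
failing.drain()
```

Note that the only end-to-end view of the saga is the `trace` list, which no single participant owns. This is the tracing difficulty mentioned above.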

Choosing orchestrated vs choreographed

Factor	Prefer Orchestrated	Prefer Choreographed
Number of steps	4+	2-3
Regulatory auditability	Required	Not required
Teams owning participants	Multiple	One or two
Debugging and observability needs	High	Low
Coupling tolerance	Central coordinator acceptable	Fully decoupled preferred

For financial and regulated workflows, default to orchestrated. The visibility and explicit state machine outweigh the coupling cost.

4. Technology stack

Layer	Primary choice	Alternatives	Notes
Orchestrator runtime	Temporal	AWS Step Functions, Conductor, Axon Server	Temporal provides durable execution, built-in retries and timeouts, and full event history; compensation logic is written in workflow code (the Java SDK ships a Saga helper for this). It is the production default for orchestrated sagas; Step Functions suits AWS-native teams who prefer managed infrastructure
Event bus (choreographed)	Apache Kafka	AWS EventBridge, NATS JetStream	Kafka guarantees durability and ordering per partition; use `sagaId` as the partition key so all events for one saga instance arrive in order
State store (orchestrated, if not Temporal)	PostgreSQL (saga state table)	Redis, DynamoDB	Persist the current saga step and all step outcomes; this is your audit log — never store it in memory
Idempotency	Idempotency key on every command + database unique constraint	Redis SETNX	Every participant must be idempotent; the orchestrator will retry commands on timeout; a duplicate command must not double-post a charge or create a second account
Schema / contract	AsyncAPI (choreographed events) + Protobuf	Avro, JSON Schema	Contract-first; publish the AsyncAPI spec before the first consumer is built
Observability	OpenTelemetry trace propagation through saga context	Datadog, Grafana Tempo	Propagate `sagaId` and `correlationId` through every command and event; without this, debugging a failed saga in production is extremely painful
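
For the state-store row, an append-only step log is one way to get both "current step" and "audit log" from the same table. The sketch below uses SQLite from the standard library as a stand-in for PostgreSQL; the table name `saga_step_log`, its columns, and the helper functions are hypothetical.

```python
import sqlite3
from typing import List, Optional, Tuple

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE saga_step_log (
        saga_id TEXT    NOT NULL,
        seq     INTEGER NOT NULL,  -- position of the transition within the saga
        step    TEXT    NOT NULL,
        outcome TEXT    NOT NULL,  -- e.g. 'succeeded', 'failed', 'compensated'
        PRIMARY KEY (saga_id, seq)
    )
    """
)


def record(saga_id: str, step: str, outcome: str) -> None:
    """Append one transition; rows are only inserted, never updated."""
    with conn:  # commits on success, rolls back on error
        (next_seq,) = conn.execute(
            "SELECT COALESCE(MAX(seq), 0) + 1 FROM saga_step_log WHERE saga_id = ?",
            (saga_id,),
        ).fetchone()
        conn.execute(
            "INSERT INTO saga_step_log VALUES (?, ?, ?, ?)",
            (saga_id, next_seq, step, outcome),
        )


def current_state(saga_id: str) -> Optional[Tuple[str, str]]:
    """Latest (step, outcome): where a restarted orchestrator would resume."""
    return conn.execute(
        "SELECT step, outcome FROM saga_step_log WHERE saga_id = ? ORDER BY seq DESC LIMIT 1",
        (saga_id,),
    ).fetchone()


def history(saga_id: str) -> List[Tuple[str, str]]:
    """Full ordered history of one saga instance, i.e. its audit trail."""
    return conn.execute(
        "SELECT step, outcome FROM saga_step_log WHERE saga_id = ? ORDER BY seq",
        (saga_id,),
    ).fetchall()


record("saga-1", "kyc", "succeeded")
record("saga-1", "create_account", "succeeded")
record("saga-1", "issue_card", "failed")
record("saga-1", "create_account", "compensated")
```

With several orchestrator instances writing concurrently, the `(saga_id, seq)` primary key would reject a conflicting append rather than silently overwrite it; handling that conflict is left out of this sketch.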

5. Non-functional characteristics

Concern	Profile
Consistency model	Eventually consistent. Each step commits locally; the system converges to a consistent state only after all steps (or all compensations) complete. Design UIs and downstream consumers to handle intermediate states explicitly.
Scalability	Scales horizontally at each participant independently. The orchestrator (Temporal) scales via worker pools; Kafka scales via partition count. A single long-running saga instance does not block others.
Availability target	The saga framework (Temporal / Kafka) targets 99.9%+. A participant outage pauses in-flight sagas at that step; the orchestrator retries with backoff until the participant recovers or a timeout fires the compensation path.
Latency target	End-to-end saga duration is the sum of participant latencies along the sequential path (steps that run in parallel contribute only the slowest of them) plus queue/retry overhead. Set a p95 completion-time target per saga type, sized around the slowest participant. Happy-path SLAs should be set per saga type, not per step.
Security posture	Each participant validates the command source (mTLS or signed JWT from the orchestrator). Saga state tables contain financial / PII data — encrypt at rest, access-control by saga type. Compensating commands require the same auth as forward commands.
Compliance fit	GDPR — events may contain PII; apply the same retention and erasure policy as transaction records. SOC 2 — every state transition is timestamped and immutable. PCI-DSS — scope each saga carefully; avoid flowing raw card data through the orchestrator.
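
The availability row describes retrying with backoff until the participant recovers or a timeout triggers compensation. A minimal sketch of that policy follows, assuming exponential backoff with a cap. The function name, the `ParticipantUnavailable` exception, and the default delays are illustrative choices, not values from this pattern; a runtime such as Temporal supplies its own retry policy.

```python
from typing import Callable, List


class ParticipantUnavailable(Exception):
    """Raised by a command when the participant cannot be reached."""


def call_with_backoff(
    command: Callable[[], str],
    deadline_s: float,
    base_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = lambda s: None,  # inject time.sleep in real use
) -> str:
    """Retry with exponential backoff; raise TimeoutError once the budget is spent."""
    waited = 0.0  # counts backoff time only, not time spent inside the command
    delay = base_delay_s
    while True:
        try:
            return command()
        except ParticipantUnavailable:
            if waited + delay > deadline_s:
                raise TimeoutError("participant did not recover in time")
            sleep(delay)
            waited += delay
            delay = min(delay * 2, max_delay_s)


attempts: List[int] = []


def flaky() -> str:
    # Fails twice, then succeeds: a transient outage.
    attempts.append(1)
    if len(attempts) < 3:
        raise ParticipantUnavailable()
    return "clear"


def always_down() -> str:
    raise ParticipantUnavailable()


result = call_with_backoff(flaky, deadline_s=10.0)

try:
    call_with_backoff(always_down, deadline_s=10.0)
    timed_out = False
except TimeoutError:
    timed_out = True  # the orchestrator would start the compensation path here
```

Because the command may be delivered more than once under this policy, it only works safely when every participant is idempotent.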

6. Cost ballpark

Indicative monthly USD cost. Temporal cluster compute and Kafka are the dominant costs.

Scale	Saga instances / day	Monthly cost	Cost drivers
Small	< 10,000	$400 - $1,200	Temporal cluster (3 nodes), Kafka (3 nodes), participant compute
Medium	10k - 500k	$1,500 - $7,000	Larger Temporal and Kafka clusters, dedicated worker pools, full observability stack
Large	500k+	$8,000 - $35,000	Multi-region Temporal, high-throughput Kafka, Confluent Platform, dedicated SRE capacity

7. LLM-assisted development fit

Aspect	Rating	Notes
Temporal workflow and activity boilerplate	★★★★	Good — Temporal Go and Java SDKs are well-represented; verify timeout and retry policy values manually
Kafka consumer / producer for choreographed sagas	★★★★★	Excellent — standard Kafka patterns apply directly
Compensation logic design	★★	Understands the concept; correctness of compensating chains for domain-specific invariants requires human design and extensive testing
Idempotency key implementation	★★★	Gets the pattern right; subtle edge cases (retry within a DB transaction vs outside) need manual review
Architecture decisions	★	Don't outsource. Use ADRs. Saga design decisions are expensive to reverse.

Recommended workflow: Map the happy path first, then map every failure mode and its compensating action before writing a line of code. Run chaos/fault injection tests against the compensation path before launch — it is the path you rely on when things go wrong.
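
As a starting point for testing the compensation path, a unit-level fault-injection check can fail the saga at every step in turn and assert that nothing is left behind. This is a sketch under simplifying assumptions: each step is modelled as a flag in a dictionary, and all names (`make_step`, `run`, `check_all_failure_points`) are hypothetical. It complements, and does not replace, chaos tests against real services.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, bool]
StepDef = Tuple[str, Callable[[State], None], Callable[[State], None]]


def make_step(name: str) -> StepDef:
    def do(state: State) -> None:
        state[name] = True        # the step creates a resource

    def undo(state: State) -> None:
        state.pop(name, None)     # idempotent: safe to run twice

    return (name, do, undo)


STEPS: List[StepDef] = [
    make_step(n) for n in ("kyc_session", "screening", "account", "card", "entitlement")
]


def run(state: State, fail_at: Optional[str]) -> bool:
    """Run the saga, injecting a failure at the named step (None = happy path)."""
    done: List[StepDef] = []
    for step in STEPS:
        name, do, _ = step
        if name == fail_at:  # injected fault
            for _, _, comp in reversed(done):
                comp(state)
            return False
        do(state)
        done.append(step)
    return True


def check_all_failure_points() -> List[str]:
    """Return the steps whose failure leaves residue behind (ideally none)."""
    leaky: List[str] = []
    for name, _, _ in STEPS:
        state: State = {}
        completed = run(state, fail_at=name)
        if completed or state:
            leaky.append(name)
    return leaky


leaky_steps = check_all_failure_points()
```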

8. Reference implementations

Public reference: microservices.io/patterns/data/saga (Chris Richardson's authoritative pattern documentation covering both orchestration and choreography with sequence diagrams and implementation guidance)
Public reference: github.com/microservices-patterns/ftgo-application (the reference implementation for Richardson's Microservices Patterns book; demonstrates orchestrated sagas, event sourcing, and CQRS in a food-delivery domain)
Public reference: learn.microsoft.com, Saga design pattern (Microsoft Azure Architecture Center reference including orchestration vs choreography decision guidance and failure-mode catalogue)
Public reference: github.com/eventuate-tram/eventuate-tram-sagas (Java saga framework from Chris Richardson; useful for implementation patterns even without adopting the framework directly)
Internal case studies: Digital banking — customer onboarding and cross-border payment (see below)

Internal case study: Customer onboarding (KYC to account to card to channel)

A retail banking onboarding flow creates a full banking relationship in one customer-initiated action: identity verification (KYC), sanctions screening, core account creation, card issuance, and digital channel entitlement. Each step is owned by a separate bounded context with its own database and team.

The original design called each service synchronously in a chain. A sanctions service degradation at 3% error rate caused 3% of onboardings to fail with no recovery path; customers had to restart from the beginning. Partial completions (account created, card not issued) created orphan records that required manual remediation.

What changed

An orchestrated saga (Temporal) replaced the synchronous chain. Each service became a Temporal Activity; the orchestrator holds the saga state machine and drives each step via command messages.

flowchart TD
    App([Mobile App]) -->|POST /onboarding| API[API Gateway]
    API -->|start workflow| Temporal[Temporal\nOrchestrator]

    Temporal -->|RunKYC| KYC[KYC Service]
    Temporal -->|RunSanctions| Sanctions[Sanctions Service]
    Temporal -->|CreateAccount| Core[Core Banking]
    Temporal -->|IssueCard| Card[Card Issuance]
    Temporal -->|GrantEntitlement| Channel[Channel Service]

    Temporal --> DB[(Temporal DB\nPostgreSQL)]

    Temporal -->|compensate: CloseAccount| Core
    Temporal -->|compensate: VoidCard| Card

    Temporal -->|onboarding.completed / failed| Events[Kafka\nNotification bus]

Compensation map

Step	Compensating action	Idempotency key
KYC check	Cancel KYC session	`customerId + applicationId`
Sanctions screen	Mark screening voided	`customerId + applicationId`
Core account creation	Close account (zero-balance)	`accountId`
Card issuance	Void unactivated card	`cardId`
Channel entitlement	Revoke entitlement	`customerId + channelId`
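
One way a participant might enforce these keys is a unique constraint on processed commands. The sketch below uses SQLite as a stand-in for the participant's database; the command names, payload field names, and the `processed_commands` table are assumptions. Because the KYC and sanctions rows share the same key material, the sketch scopes each key by command name so the two do not collide.

```python
import sqlite3
from typing import Callable, Dict, List, Tuple

# Key builders mirror the compensation map; command and field names are assumptions.
KEY_BUILDERS: Dict[str, Callable[[dict], str]] = {
    "cancel_kyc_session": lambda c: f"{c['customerId']}:{c['applicationId']}",
    "void_screening": lambda c: f"{c['customerId']}:{c['applicationId']}",
    "close_account": lambda c: c["accountId"],
    "void_card": lambda c: c["cardId"],
    "revoke_entitlement": lambda c: f"{c['customerId']}:{c['channelId']}",
}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE processed_commands ("
    " command TEXT NOT NULL,"
    " idempotency_key TEXT NOT NULL,"
    " PRIMARY KEY (command, idempotency_key))"
)

effects: List[Tuple[str, str]] = []  # stands in for the real side effect


def handle(command: str, payload: dict) -> bool:
    """Apply the command once; return False when it is a duplicate delivery."""
    key = KEY_BUILDERS[command](payload)
    try:
        with conn:  # rolls back the key insert if anything in the block raises
            conn.execute("INSERT INTO processed_commands VALUES (?, ?)", (command, key))
            # In a real participant the side effect would be written in this
            # same database transaction; here it is just recorded in a list.
            effects.append((command, key))
    except sqlite3.IntegrityError:
        return False  # key already seen: the retry is acknowledged, not re-applied
    return True


first = handle("close_account", {"accountId": "acc-42"})
retry = handle("close_account", {"accountId": "acc-42"})
kyc = handle("cancel_kyc_session", {"customerId": "c1", "applicationId": "a1"})
screen = handle("void_screening", {"customerId": "c1", "applicationId": "a1"})
```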

Outcomes

Metric	Before	After
Onboarding success rate	94% (during sanctions degradation)	99.6% (retries absorb transient failures)
Orphan records requiring manual fix	~200 / month	0 (compensation path handles all partial failures)
Median end-to-end onboarding time	8 s	6 s (parallel KYC + sanctions via Temporal async activities)
Audit trail completeness	Application logs only	Full step-by-step history in Temporal + event log

Gotchas observed

Sanctions service had no idempotent cancel: the vendor API did not support cancel-by-reference-ID, so the compensating action is a no-op on the vendor side and only marks the screening voided in our own DB. Acceptable because a voided screening has no downstream effect; document the gap explicitly in the compensation map.
Temporal worker cold-start delayed first retry — workers scaled to zero overnight; the first retry after an early-morning failure waited 40 s for a worker to start. Mitigated by keeping a minimum of one warm worker per saga type at all times.
History size limit hit on a long-running saga: Temporal terminates a workflow whose event history exceeds 51,200 events or 50 MB; a saga with many polling loops exhausted this. Solved by using ContinueAsNew to reset history at a clean checkpoint.
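
The idea behind the history-reset fix can be illustrated without the Temporal SDK. The plain-Python simulation below is not Temporal's API: `HISTORY_BUDGET`, `Carry`, and both functions are hypothetical, and the budget is tiny on purpose. Each run starts with an empty history and, when the budget is reached, hands a compact state object to a fresh run.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

HISTORY_BUDGET = 5  # tiny on purpose; stands in for the real history limit


@dataclass
class Carry:
    polls_done: int  # the only state the next run needs


def run_once(carry: Carry, polls_needed: int) -> Tuple[Optional[Carry], List[str]]:
    """One run with a fresh history. The returned carry is None when finished."""
    history: List[str] = []
    polls = carry.polls_done
    while polls < polls_needed:
        if len(history) >= HISTORY_BUDGET:
            # Clean checkpoint: hand over compact state and end this run.
            return Carry(polls), history
        history.append(f"poll#{polls}")
        polls += 1
    return None, history


def run_to_completion(polls_needed: int) -> Tuple[int, int]:
    """Return (number of runs, longest history seen in any single run)."""
    carry: Optional[Carry] = Carry(0)
    runs, longest = 0, 0
    while carry is not None:
        carry, history = run_once(carry, polls_needed)
        runs += 1
        longest = max(longest, len(history))
    return runs, longest


runs, longest = run_to_completion(polls_needed=12)
```

Twelve polls complete across three runs, and no single run's history exceeds the budget.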

Internal case study: Cross-border payment (FX + sanctions + ledger + correspondent dispatch)

A cross-border payment involves FX rate lock, sanctions screening of the beneficiary, ledger debit of the sender, and dispatch to a correspondent banking network. The final step is externally irreversible once the SWIFT message is sent.

Saga design rule: place the irreversible step last. Everything before it can be compensated; once the correspondent message is dispatched, compensation means raising a manual recall — a business process, not a system one.

flowchart LR
    Pay([Payment\nInstruction]) --> FX[1 Lock FX Rate\ncompensate: release lock]
    FX --> Sanctions[2 Screen Beneficiary\ncompensate: void screening]
    Sanctions --> Ledger[3 Debit Sender Ledger\ncompensate: credit back]
    Ledger --> Dispatch[4 Dispatch to Correspondent\nIRREVERSIBLE after ACK]
    Dispatch --> Complete([Payment\nDispatched])
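
The "irreversible step last" rule can be checked mechanically before a saga definition is ever run. In the sketch below, a step with no compensation is treated as irreversible; the `Step` type, `validate`, `run`, and the log messages are hypothetical. It assumes a dispatch failure happens before the correspondent acknowledges, so the earlier steps can still be compensated.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    name: str
    action: Callable[[], None]
    compensation: Optional[Callable[[], None]]  # None marks an irreversible step


def validate(steps: List[Step]) -> None:
    """Reject designs where an irreversible step is followed by another step."""
    for step in steps[:-1]:
        if step.compensation is None:
            raise ValueError(f"irreversible step '{step.name}' must be last")


def run(steps: List[Step]) -> str:
    validate(steps)
    done: List[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            for prior in reversed(done):
                prior.compensation()  # validate() guarantees these exist
            return "compensated"
        done.append(step)
    return "dispatched"


log: List[str] = []


def reject() -> None:
    # Assumption: the failure occurs before the correspondent's ACK.
    raise RuntimeError("correspondent rejected the message")


payment = [
    Step("lock_fx", lambda: log.append("fx locked"), lambda: log.append("fx released")),
    Step("screen", lambda: log.append("screened"), lambda: log.append("screening voided")),
    Step("debit", lambda: log.append("debited"), lambda: log.append("credited back")),
    Step("dispatch", reject, None),
]

outcome = run(payment)
```

A definition that places `dispatch` anywhere but last fails `validate` with a `ValueError` before any step executes.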