Pattern: Saga Pattern (Orchestrated & Choreographed)
Quick facts
- Category: Backend & Distributed Systems
- Maturity: Trial
- Typical team size: 3-6 engineers
- Typical timeline to MVP: 8-14 weeks
- Last reviewed: 2026-05-19 by Architecture Team
1. Context
Use this pattern when:
- A business transaction spans multiple services and must either complete fully or roll back cleanly — but two-phase commit (2PC) is unacceptable due to availability, latency, or vendor lock-in concerns
- Each step of the transaction has a well-defined compensating action that can undo its effect
- The requirement is specifically distributed transaction semantics: all-or-nothing completion across service boundaries, with explicit rollback on failure
- Auditability of each step's outcome and its compensation is a first-class requirement (financial, compliance, regulated industries)
Saga vs Workflow Orchestration
The Saga pattern is a specific pattern for distributed transactions with compensating actions. Workflow Orchestration is the broader architectural style it is usually built on. If your requirement is a long-running process but rollback semantics are not the primary concern, reach for Workflow Orchestration directly. If rollback correctness across service boundaries is the core requirement, add the Saga pattern on top.
Do NOT use this pattern when:
- The transaction touches a single database — use a local ACID transaction instead; sagas add complexity with no benefit
- No compensating action is possible for a step (e.g. an SMS has been sent, a payment dispatched to an external network with no recall API) — design the process to make that step last, or accept the risk explicitly
- The team has not yet adopted event-driven or workflow-orchestration patterns — start with Workflow Orchestration first; the Saga pattern builds on it
2. Problem it solves
A business operation such as customer onboarding touches KYC, sanctions screening, core account creation, card issuance, and channel entitlement, each owned by a different service with its own database. Running these as a single synchronous chain means the slowest or least-available service determines the latency and reliability of the whole flow. 2PC holds locks across multiple databases until every participant has voted and committed, which limits throughput at scale and is not supported by many cloud-managed data stores. The Saga pattern instead runs each step as an independent local transaction and, on any failure, runs compensating transactions in reverse order to restore a consistent state, without any cross-service locking.
3. Solution overview
Sagas come in two flavours. Choose based on team size and coordination needs.
Orchestrated saga
A central coordinator (the orchestrator) issues commands to participants and reacts to their replies. State lives in the orchestrator. Easier to reason about, debug, and visualise; preferred for complex or regulated flows.
```mermaid
flowchart TD
Client([Client]) -->|start saga| Orchestrator[Saga Orchestrator]
Orchestrator -->|1 run KYC check| KYC[KYC Service]
KYC -->|success / fail| Orchestrator
Orchestrator -->|2 run sanctions screen| Sanctions[Sanctions Service]
Sanctions -->|clear / hit| Orchestrator
Orchestrator -->|3 create core account| Core[Core Banking]
Core -->|account created / fail| Orchestrator
Orchestrator -->|4 issue card| Card[Card Service]
Card -->|card issued / fail| Orchestrator
Orchestrator -->|5 grant channel access| Channel[Channel Service]
Channel -->|entitled / fail| Orchestrator
Orchestrator -->|compensate: close account| Core
Orchestrator -->|compensate: void card| Card
```
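The orchestrated flow above can be sketched in a few lines of plain Python. This is a minimal illustration, not the Temporal API: `Step`, `run_saga`, and the lambda-based participants are hypothetical names, and state is held in memory only to keep the example short (a real orchestrator would persist it, as the technology stack section requires).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]      # forward local transaction
    compensate: Callable[[dict], None]  # undoes the effect of `action`


def run_saga(steps: list[Step], ctx: dict) -> tuple[bool, list[str]]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: list[str] = []
    completed: list[Step] = []
    for step in steps:
        try:
            step.action(ctx)
        except Exception:
            log.append(f"{step.name}:failed")
            # The failed step never committed, so only earlier steps are undone.
            for done in reversed(completed):
                done.compensate(ctx)
                log.append(f"{done.name}:compensated")
            return False, log
        completed.append(step)
        log.append(f"{step.name}:ok")
    return True, log


def card_unavailable(ctx: dict) -> None:
    """Hypothetical participant that always fails, to exercise compensation."""
    raise RuntimeError("card service unavailable")


onboarding = [
    Step("kyc", lambda c: c.update(kyc="passed"), lambda c: c.update(kyc="cancelled")),
    Step("core_account", lambda c: c.update(account="open"), lambda c: c.update(account="closed")),
    Step("card", card_unavailable, lambda c: c.update(card="voided")),
]
state: dict = {}
ok, log = run_saga(onboarding, state)
print(ok, log)
```

Because the card step fails, the account is closed and the KYC session cancelled, in that order. The sketch assumes compensations always succeed; retries and escalation for failing compensations are a separate concern.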
Choreographed saga
Participants react to domain events published by the previous step. No central coordinator; each service knows only its own trigger event and what to publish next. Lower coupling; harder to trace end-to-end.
```mermaid
flowchart LR
KYC[KYC Service] -->|kyc.passed| Sanctions[Sanctions Service]
Sanctions -->|sanctions.cleared| Core[Core Banking]
Core -->|account.created| Card[Card Service]
Card -->|card.issued| Channel[Channel Service]
Channel -->|onboarding.complete| Notify[Notification Service]
Core -->|account.creation.failed| Compensate1[Compensate:\nCancel KYC reservation]
Card -->|card.issuance.failed| Compensate2[Compensate:\nClose account]
```
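A choreographed chain can be sketched with an in-memory publish/subscribe bus standing in for Kafka. Everything here is illustrative: `EventBus`, `wire_onboarding`, and `trace` are hypothetical helpers, delivery is synchronous, and the `account.closed` event name is an assumption (the diagram only names the compensating action, not the event it emits).

```python
from collections import defaultdict


class EventBus:
    """In-memory stand-in for a durable event bus; not a Kafka client."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.history = []  # (topic, sagaId) pairs in publish order

    def subscribe(self, topic, handler):
        self.handlers[topic].append(handler)

    def publish(self, topic, event):
        self.history.append((topic, event["sagaId"]))
        for handler in self.handlers[topic]:
            handler(event)


def wire_onboarding(bus, card_fails=False):
    """Each service knows only its trigger event and what it publishes next."""
    # Sanctions Service reacts to kyc.passed.
    bus.subscribe("kyc.passed", lambda e: bus.publish("sanctions.cleared", e))
    # Core Banking reacts to sanctions.cleared.
    bus.subscribe("sanctions.cleared", lambda e: bus.publish("account.created", e))

    # Card Service reacts to account.created and may fail.
    def card_service(e):
        bus.publish("card.issuance.failed" if card_fails else "card.issued", e)

    bus.subscribe("account.created", card_service)
    # Channel Service reacts to card.issued.
    bus.subscribe("card.issued", lambda e: bus.publish("onboarding.complete", e))
    # Compensation: Core Banking closes the account when card issuance fails.
    bus.subscribe("card.issuance.failed", lambda e: bus.publish("account.closed", e))


def trace(bus, saga_id):
    """Reconstruct one saga's path by correlating events on sagaId."""
    return [topic for topic, sid in bus.history if sid == saga_id]


bus = EventBus()
wire_onboarding(bus, card_fails=True)
bus.publish("kyc.passed", {"sagaId": "saga-1"})
print(trace(bus, "saga-1"))
```

The `trace` helper only works because every event carries the same `sagaId`; without that shared identifier there is no way to stitch the path back together.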
Choosing orchestrated vs choreographed
| Factor | Prefer Orchestrated | Prefer Choreographed |
|---|---|---|
| Number of steps | 4+ | 2-3 |
| Regulatory auditability | Required | Not required |
| Teams owning participants | Multiple | One or two |
| Debugging and observability needs | High | Low |
| Coupling tolerance | Central coordinator acceptable | Fully decoupled preferred |
For financial and regulated workflows, default to orchestrated. The visibility and explicit state machine outweigh the coupling cost.
4. Technology stack
| Layer | Primary choice | Alternatives | Notes |
|---|---|---|---|
| Orchestrator runtime | Temporal | AWS Step Functions, Conductor, Axon Server | Temporal provides durable execution, built-in retry, timeout, compensation, and full history — the production default for orchestrated sagas; Step Functions for AWS-native teams who prefer managed infrastructure |
| Event bus (choreographed) | Apache Kafka | AWS EventBridge, NATS JetStream | Kafka preserves ordering within a partition and is durable when topics are replicated and producers use acks=all; use sagaId as the message key so all events for one saga instance land on the same partition and arrive in order |
| State store (orchestrated, if not Temporal) | PostgreSQL (saga state table) | Redis, DynamoDB | Persist the current saga step and all step outcomes; this is your audit log — never store it in memory |
| Idempotency | Idempotency key on every command + database unique constraint | Redis SETNX | Every participant must be idempotent; the orchestrator will retry commands on timeout; a duplicate command must not double-post a charge or create a second account |
| Schema / contract | AsyncAPI (choreographed events) + Protobuf | Avro, JSON Schema | Contract-first; publish the AsyncAPI spec before the first consumer is built |
| Observability | OpenTelemetry trace propagation through saga context | Datadog, Grafana Tempo | Propagate sagaId and correlationId through every command and event; without this, debugging a failed saga in production is extremely painful |
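The idempotency row can be illustrated with a unique constraint that claims the key in the same local transaction as the side effect. This is a sketch under assumptions: `sqlite3` stands in for PostgreSQL so the example runs anywhere, and the table names, `make_db`, and `create_account` are hypothetical.

```python
import sqlite3


def make_db():
    db = sqlite3.connect(":memory:")
    # The PRIMARY KEY on idempotency_key is the unique constraint doing the work.
    db.execute(
        "CREATE TABLE processed_commands ("
        "idempotency_key TEXT PRIMARY KEY, result TEXT NOT NULL)"
    )
    db.execute(
        "CREATE TABLE accounts ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, customer_id TEXT NOT NULL)"
    )
    return db


def create_account(db, idempotency_key, customer_id):
    """Create an account once per key; a replayed command returns the first result."""
    with db:  # one local transaction: key claim and side effect commit together
        try:
            db.execute(
                "INSERT INTO processed_commands (idempotency_key, result) VALUES (?, '')",
                (idempotency_key,),
            )
        except sqlite3.IntegrityError:
            # Duplicate command (e.g. an orchestrator retry): do not create again.
            row = db.execute(
                "SELECT result FROM processed_commands WHERE idempotency_key = ?",
                (idempotency_key,),
            ).fetchone()
            return row[0]
        cur = db.execute("INSERT INTO accounts (customer_id) VALUES (?)", (customer_id,))
        account_id = str(cur.lastrowid)
        db.execute(
            "UPDATE processed_commands SET result = ? WHERE idempotency_key = ?",
            (account_id, idempotency_key),
        )
        return account_id


db = make_db()
first = create_account(db, "cust-1:app-1", "cust-1")
replay = create_account(db, "cust-1:app-1", "cust-1")  # simulated retry after timeout
print(first, replay)
```

Storing the first result alongside the key lets a retried command receive the same answer as the original, which is what the orchestrator needs in order to move on to the next step.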
5. Non-functional characteristics
| Concern | Profile |
|---|---|
| Consistency model | Eventually consistent. Each step commits locally; the system converges to a consistent state only after all steps (or all compensations) complete. Design UIs and downstream consumers to handle intermediate states explicitly. |
| Scalability | Scales horizontally at each participant independently. The orchestrator (Temporal) scales via worker pools; Kafka scales via partition count. A single long-running saga instance does not block others. |
| Availability target | The saga framework (Temporal / Kafka) targets 99.9%+. A participant outage pauses in-flight sagas at that step; the orchestrator retries with backoff until the participant recovers or a timeout fires the compensation path. |
| Latency target | For sequential steps, end-to-end saga duration is the sum of participant latencies plus queue and retry overhead; steps run in parallel contribute only the latency of the slowest branch. Budget the p95 completion time around the slowest participant. Set happy-path SLAs per saga type, not per step. |
| Security posture | Each participant validates the command source (mTLS or signed JWT from the orchestrator). Saga state tables contain financial / PII data — encrypt at rest, access-control by saga type. Compensating commands require the same auth as forward commands. |
| Compliance fit | GDPR — events may contain PII; apply the same retention and erasure policy as transaction records. SOC 2 — every state transition is timestamped and immutable. PCI-DSS — scope each saga carefully; avoid flowing raw card data through the orchestrator. |
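As a rough aid for the latency row, the arithmetic can be sketched as follows, assuming steps that run one after another add up while steps that run in parallel cost only as much as the slowest branch. The function name and all millisecond figures are made up for illustration; they are not measurements.

```python
def saga_latency_ms(stages, overhead_per_stage_ms=0):
    """Estimate happy-path duration.

    `stages` is a list of stages executed one after another; each stage is a
    list of step latencies (ms) that run in parallel within that stage.
    """
    return sum(max(stage) + overhead_per_stage_ms for stage in stages)


# Hypothetical per-step latencies, all steps sequential.
sequential = saga_latency_ms([[800], [1200], [600], [400], [300]])
# Same steps, with the first two run in parallel.
parallel = saga_latency_ms([[800, 1200], [600], [400], [300]])
print(sequential, parallel)
```

The estimate covers nominal latencies only; percentiles do not add up this simply, so a p95 target still needs to be measured end to end.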
6. Cost ballpark
Indicative monthly USD cost. Temporal cluster compute and Kafka are the dominant costs.
| Scale | Saga instances / day | Monthly cost | Cost drivers |
|---|---|---|---|
| Small | < 10,000 | $400 - $1,200 | Temporal cluster (3 nodes), Kafka (3 nodes), participant compute |
| Medium | 10k - 500k | $1,500 - $7,000 | Larger Temporal and Kafka clusters, dedicated worker pools, full observability stack |
| Large | 500k+ | $8,000 - $35,000 | Multi-region Temporal, high-throughput Kafka, Confluent Platform, dedicated SRE capacity |
7. LLM-assisted development fit
| Aspect | Rating | Notes |
|---|---|---|
| Temporal workflow and activity boilerplate | ★★★★ | Good — Temporal Go and Java SDKs are well-represented; verify timeout and retry policy values manually |
| Kafka consumer / producer for choreographed sagas | ★★★★★ | Excellent — standard Kafka patterns apply directly |
| Compensation logic design | ★★ | Understands the concept; correctness of compensating chains for domain-specific invariants requires human design and extensive testing |
| Idempotency key implementation | ★★★ | Gets the pattern right; subtle edge cases (retry within a DB transaction vs outside) need manual review |
| Architecture decisions | ★ | Don't outsource. Use ADRs. Saga design decisions are expensive to reverse. |
Recommended workflow: Map the happy path first, then map every failure mode and its compensating action before writing a line of code. Run chaos/fault injection tests against the compensation path before launch — it is the path you rely on when things go wrong.
8. Reference implementations
- Public reference: microservices.io/patterns/data/saga (Chris Richardson's authoritative pattern documentation covering both orchestration and choreography, with sequence diagrams and implementation guidance)
- Public reference: github.com/microservices-patterns/ftgo-application (the reference implementation for Richardson's Microservices Patterns book; demonstrates orchestrated sagas, event sourcing, and CQRS in a food-delivery domain)
- Public reference: learn.microsoft.com, Saga design pattern (Microsoft Azure Architecture Center reference, including orchestration vs choreography decision guidance and a failure-mode catalogue)
- Public reference: github.com/eventuate-tram/eventuate-tram-sagas (Java saga framework from Chris Richardson; useful for implementation patterns even without adopting the framework directly)
- Internal case studies: Digital banking — customer onboarding and cross-border payment (see below)
Internal case study: Customer onboarding, KYC to account to card to channel
A retail banking onboarding flow creates a full banking relationship in one customer-initiated action: identity verification (KYC), sanctions screening, core account creation, card issuance, and digital channel entitlement. Each step is owned by a separate bounded context with its own database and team.
The original design called each service synchronously in a chain. A sanctions service degradation at 3% error rate caused 3% of onboardings to fail with no recovery path; customers had to restart from the beginning. Partial completions (account created, card not issued) created orphan records that required manual remediation.
What changed
An orchestrated saga (Temporal) replaced the synchronous chain. Each service became a Temporal Activity; the orchestrator holds the saga state machine and drives each step via command messages.
```mermaid
flowchart TD
App([Mobile App]) -->|POST /onboarding| API[API Gateway]
API -->|start workflow| Temporal[Temporal\nOrchestrator]
Temporal -->|RunKYC| KYC[KYC Service]
Temporal -->|RunSanctions| Sanctions[Sanctions Service]
Temporal -->|CreateAccount| Core[Core Banking]
Temporal -->|IssueCard| Card[Card Issuance]
Temporal -->|GrantEntitlement| Channel[Channel Service]
Temporal --> DB[(Temporal DB\nPostgreSQL)]
Temporal -->|compensate: CloseAccount| Core
Temporal -->|compensate: VoidCard| Card
Temporal -->|onboarding.completed / failed| Events[Kafka\nNotification bus]
```
Compensation map
| Step | Compensating action | Idempotency key |
|---|---|---|
| KYC check | Cancel KYC session | customerId + applicationId |
| Sanctions screen | Mark screening voided | customerId + applicationId |
| Core account creation | Close account (zero-balance) | accountId |
| Card issuance | Void unactivated card | cardId |
| Channel entitlement | Revoke entitlement | customerId + channelId |
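The compensation map lends itself to a data-driven representation. The sketch below is one possible encoding, not the case study's actual code: the snake_case step and action names are invented, and prefixing each key with the action name is an added assumption to keep keys from colliding where two steps share the same fields (KYC and sanctions both use customerId + applicationId).

```python
# step -> (compensating action, fields that make up the idempotency key)
COMPENSATION_MAP = {
    "kyc_check": ("cancel_kyc_session", ("customerId", "applicationId")),
    "sanctions_screen": ("mark_screening_voided", ("customerId", "applicationId")),
    "core_account_creation": ("close_account", ("accountId",)),
    "card_issuance": ("void_unactivated_card", ("cardId",)),
    "channel_entitlement": ("revoke_entitlement", ("customerId", "channelId")),
}


def compensation_plan(completed_steps, ctx):
    """Return (action, idempotency_key) pairs in reverse completion order."""
    plan = []
    for step in reversed(completed_steps):
        action, fields = COMPENSATION_MAP[step]
        key = ":".join([action] + [str(ctx[field]) for field in fields])
        plan.append((action, key))
    return plan


# Hypothetical saga that failed at card issuance after three completed steps.
plan = compensation_plan(
    ["kyc_check", "sanctions_screen", "core_account_creation"],
    {"customerId": "c-42", "applicationId": "app-7", "accountId": "acc-9"},
)
for action, key in plan:
    print(action, key)
```

Keeping the map as data makes it easy to review alongside the table and to assert in tests that every forward step has a compensating entry.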
Outcomes
| Metric | Before | After |
|---|---|---|
| Onboarding success rate | 94% (during sanctions degradation) | 99.6% (retries absorb transient failures) |
| Orphan records requiring manual fix | ~200 / month | 0 (compensation path handles all partial failures) |
| Median end-to-end onboarding time | 8 s | 6 s (parallel KYC + sanctions via Temporal async activities) |
| Audit trail completeness | Application logs only | Full step-by-step history in Temporal + event log |
Gotchas observed
- Sanctions service had no idempotent cancel: the vendor API did not support cancel-by-reference-ID, so the compensating action is a no-op on the vendor side (we only mark the screening voided in our own DB). Acceptable because a voided screening has no downstream effect; document the gap explicitly in the compensation map.
- Temporal worker cold-start delayed first retry: workers scaled to zero overnight; the first retry after an early-morning failure waited 40 s for a worker to start. Mitigated by keeping a minimum of one warm worker per saga type at all times.
- History size limit hit on a long-running saga: Temporal caps workflow history at 51,200 events (about 50K) or 50 MB; a saga with many polling loops exhausted this. Solved by using `ContinueAsNew` to reset history at a clean checkpoint.
Internal case study: Cross-border payment, FX + sanctions + ledger + correspondent dispatch
A cross-border payment involves FX rate lock, sanctions screening of the beneficiary, ledger debit of the sender, and dispatch to a correspondent banking network. The final step is externally irreversible once the SWIFT message is sent.
Saga design rule: place the irreversible step last. Everything before it can be compensated; once the correspondent message is dispatched, compensation means raising a manual recall — a business process, not a system one.
```mermaid
flowchart LR
Pay([Payment\nInstruction]) --> FX[1 Lock FX Rate\ncompensate: release lock]
FX --> Sanctions[2 Screen Beneficiary\ncompensate: void screening]
Sanctions --> Ledger[3 Debit Sender Ledger\ncompensate: credit back]
Ledger --> Dispatch[4 Dispatch to Correspondent\nIRREVERSIBLE after ACK]
Dispatch --> Complete([Payment\nDispatched])
```
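The "irreversible step last" rule can be checked mechanically when a saga is defined. A minimal sketch, assuming a hypothetical `StepSpec` type in which a missing compensation marks a step as irreversible:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StepSpec:
    name: str
    compensation: Optional[str]  # None marks an irreversible step


def misplaced_irreversible_steps(steps):
    """Return names of irreversible steps that are not in the final position."""
    return [s.name for s in steps[:-1] if s.compensation is None]


payment = [
    StepSpec("lock_fx_rate", "release lock"),
    StepSpec("screen_beneficiary", "void screening"),
    StepSpec("debit_sender_ledger", "credit back"),
    StepSpec("dispatch_to_correspondent", None),
]
# A deliberately wrong ordering: dispatch happens before the ledger debit.
bad_order = [payment[0], payment[1], payment[3], payment[2]]

print(misplaced_irreversible_steps(payment))
print(misplaced_irreversible_steps(bad_order))
```

Running a check like this in a unit test or at service start-up catches a reordered saga before it reaches production.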
Gotchas observed
- FX rate lock expiry during downstream delay: sanctions screening occasionally took longer than the 60 s FX lock window. Resolved by extending the lock to 5 min and adding a lock-expiry check before the ledger debit step; if the lock has expired, re-lock at the current rate and notify the customer of the revised rate.
- Duplicate SWIFT message on orchestrator retry: a network timeout caused the orchestrator to retry the dispatch step; the correspondent received two messages. Fixed by checking an idempotency table before sending, keyed on the payment reference and recording the dispatch timestamp, so a retry of an already-dispatched payment is detected; duplicates are suppressed at the dispatch service, not the orchestrator.
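The lock-expiry check described in the first gotcha might look like the following. This is a sketch with hypothetical names (`FxLock`, `ensure_valid_lock`); the 300 s default reflects the 5 min window mentioned above, and the clock is passed in explicitly so the behaviour is easy to test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FxLock:
    rate: float
    locked_at: float      # epoch seconds
    ttl_s: float = 300.0  # 5 min lock window


def ensure_valid_lock(lock, now, current_rate):
    """Return (lock, relocked); re-lock at the current rate if the lock expired.

    Intended to run immediately before the ledger debit step. When `relocked`
    is True the caller is expected to notify the customer of the revised rate.
    """
    if now - lock.locked_at < lock.ttl_s:
        return lock, False
    return FxLock(rate=current_rate, locked_at=now, ttl_s=lock.ttl_s), True


original = FxLock(rate=1.0850, locked_at=1_000.0)
still_valid, relocked_early = ensure_valid_lock(original, now=1_120.0, current_rate=1.0900)
refreshed, relocked_late = ensure_valid_lock(original, now=1_400.0, current_rate=1.0900)
print(relocked_early, relocked_late, refreshed.rate)
```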
9. Related decisions (ADRs)
- ADR-0001: Tenant isolation via PostgreSQL Row-Level Security — saga state tables must follow the same RLS policy as all tenant-scoped tables
- Candidate ADR: Temporal vs AWS Step Functions as the orchestrator — record when your organisation makes a committed decision
- Candidate ADR: Orchestrated vs choreographed saga as the organisational default
10. Known risks & gotchas
- Missing idempotency on participants breaks compensation: the orchestrator retries timed-out commands; a non-idempotent participant creates a second account, double-charges a card, or issues two cards. Mitigation: every command handler must check an idempotency key (unique constraint or `INSERT ... ON CONFLICT DO NOTHING`) before executing. Verify this with forced timeout injection before launch.
- Irreversible steps placed too early in the chain: a notification email at step 2 tells the customer their account is open; step 4 fails and the account rolls back. The customer calls support. Mitigation: place externally visible or irreversible effects (emails, SMS, external network dispatches) as the last step; all preceding steps must be compensatable.
- Saga state accumulates without archival: Temporal history, Kafka topics with long or unlimited retention, and saga state tables grow indefinitely at volume. Mitigation: archive completed and compensated sagas to cold storage after 90 days; set Temporal namespace retention to match your compliance requirement, not the default.
- Choreographed sagas have no global visibility: when a choreographed saga stalls, no single service knows the full state; debugging requires correlating events across multiple topics by `sagaId`. Mitigation: emit a structured `saga.step.completed` / `saga.step.failed` event from every participant with a consistent `sagaId`; ingest into a single observability store for end-to-end tracing.
- Temporal worker restart during an active workflow pauses it: in-flight activities are interrupted on worker shutdown; Temporal re-schedules them, but the delay adds latency to customer-visible flows. Mitigation: graceful shutdown (drain in-flight activities before stopping); set activity heartbeat timeouts short enough to detect worker loss quickly.
- Compensation failures need a manual fallback: a compensating action can itself fail (card service is down when trying to void a card). Mitigation: compensating commands must be retried with the same durability as forward commands; define a dead-letter escalation path (ops alert, support ticket) for compensations that exhaust all retries.
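The last mitigation, retrying a compensation and escalating once retries are exhausted, can be sketched as below. The function name, the exponential backoff schedule, and the list used as a dead-letter queue are all assumptions for illustration; in production the queue would feed an ops alert or support ticket, and the retry state would need to be durable.

```python
import time


def compensate_with_retry(action, max_attempts, dead_letter,
                          sleep=time.sleep, base_delay_s=1.0):
    """Retry a compensating action with exponential backoff.

    Returns True on success. After `max_attempts` failures the action is
    recorded in `dead_letter` for manual follow-up and False is returned.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            action()
            return True
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts:
                sleep(base_delay_s * 2 ** (attempt - 1))  # 1 s, 2 s, 4 s, ...
    dead_letter.append(
        {"action": action.__name__, "attempts": max_attempts, "error": str(last_error)}
    )
    return False


def close_account():
    """Hypothetical compensation that keeps failing."""
    raise ConnectionError("core banking down")


delays, dead_letter = [], []
# Inject a fake sleep so the example records delays instead of waiting.
succeeded = compensate_with_retry(close_account, 3, dead_letter, sleep=delays.append)
print(succeeded, delays, dead_letter)
```

Passing `sleep` as a parameter keeps the backoff testable; with the default `time.sleep` the same call would actually wait between attempts.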