Pattern: Workflow Orchestration¶
Quick facts
- Category: Backend & Distributed Systems
- Maturity: Trial
- Typical team size: 1-3 engineers
- Typical timeline to MVP: 4-8 weeks
- Last reviewed: 2026-05-03 by Architecture Team
1. Context¶
Use this pattern when:
- Business processes span multiple sequential or parallel steps, each of which can fail independently and needs retry semantics
- A process runs for minutes to days — payment processing, user onboarding, data migration, approval chains
- Human-in-the-loop steps or external callbacks (waiting for a webhook, a Stripe event, a human decision) are required mid-process
- You need a central, durable state machine as the architectural backbone — one place that owns process state, drives each step, and holds the full execution history
Workflow orchestration vs Saga
Workflow orchestration is the architectural style: a central coordinator owns state and drives steps. The Saga pattern is a specific pattern built on top of it — use Saga when the requirement is specifically a distributed transaction with compensating rollback. Most orchestrated sagas are implemented using a workflow orchestrator (Temporal, Step Functions), but not every workflow is a saga.
Do NOT use this pattern when:
- The process is a single, fast, synchronous API call — no orchestration layer needed
- Simple fire-and-forget background tasks (send an email, resize an image) — a job queue (BullMQ, Celery, SQS) is sufficient and simpler
- The process is purely data-pipeline work — use the ETL/ELT Job or ML Training Pipeline patterns instead
2. Problem it solves¶
A business process like "onboard a new enterprise customer" involves 12 steps across 6 services — create account, provision infrastructure, send welcome email, schedule an onboarding call, wait for the customer to complete setup, then enable billing. Implementing this as a chain of Kafka events or a cron-polled state machine is brittle: each step can fail silently, retries must be re-implemented from scratch for every flow, and debugging a stuck workflow requires trawling through logs across six services. Workflow orchestration gives each execution a durable, observable state machine with automatic retries, timeouts, and a complete execution history in a single UI.
3. Solution overview¶
System context (C4 Level 1)¶
flowchart LR
Trigger[Trigger\nAPI call / event / schedule] --> Temporal[Temporal\nWorkflow Engine]
Temporal --> Workers[Worker Pool\nworkflow + activity workers]
Workers --> EmailSvc[Email Service]
Workers --> PaymentAPI[Payment API]
Workers --> InfraSvc[Infrastructure\nProvisioning]
Workers --> HumanApproval[Human Approval\nwebhook callback]
Temporal --> WebUI[Temporal Web UI\nworkflow visibility]
Temporal --> DB[(Temporal DB\nPostgres)]
Container view (C4 Level 2)¶
flowchart TB
subgraph Temporal Server
Frontend[Frontend Service\nAPI entrypoint]
History[History Service\ndurable event sourcing]
Matching[Matching Service\ntask queue dispatch]
TemporalDB[(PostgreSQL\nworkflow history)]
end
subgraph Workers["Workers — user code"]
WfWorker[Workflow Worker\norchestration logic only]
ActWorker[Activity Worker\nside-effect implementations]
end
subgraph Activities
EmailAct[Email Activity]
PaymentAct[Payment Activity]
InfraAct[Infrastructure Activity]
HumanAct[Human Approval\nSignal listener]
end
subgraph Ops
WebUI[Temporal Web UI]
OTel[OpenTelemetry traces]
Alerts[Alerts\non workflow timeout / failure]
end
Trigger --> Frontend
Frontend --> Matching --> WfWorker
WfWorker --> History
WfWorker --> ActWorker
ActWorker --> EmailAct
ActWorker --> PaymentAct
ActWorker --> InfraAct
ActWorker --> HumanAct
History --> TemporalDB
Frontend --> WebUI
WfWorker --> OTel
History --> Alerts
4. Technology stack¶
| Layer | Primary choice | Alternatives | Notes |
|---|---|---|---|
| Workflow engine | Temporal | AWS Step Functions, Apache Airflow, Netflix Conductor | See ADR-0009; Temporal for code-first, durable, language-native workflows; Step Functions for AWS-native simple state machines; Airflow for data pipeline orchestration |
| Worker language | Python | Go, Java, TypeScript | Python for data-heavy or ML activities; Go for high-throughput, low-latency workers; all have official Temporal SDKs |
| Temporal database | PostgreSQL | Apache Cassandra | Postgres for most deployments (simpler to operate); Cassandra for global multi-cluster deployments with very high workflow volume (> 10k starts/second) |
| Temporal hosting | Temporal Cloud | Self-hosted on Kubernetes | Temporal Cloud eliminates the operational overhead of running Temporal Server; evaluate cost vs operational burden |
| Activity retries | Temporal RetryPolicy (exponential backoff, max attempts, non-retryable errors) |
Custom retry in activity code | Configure retry policy at the activity level; never implement retry logic inside an activity implementation — let Temporal handle it |
| Human-in-the-loop | Temporal Signal | Temporal Update (newer, for synchronous response) | Signal allows external systems (human approval UI, webhook receiver) to resume a waiting workflow asynchronously |
| Testing | Temporal TestWorkflowEnvironment with time-skipping |
End-to-end integration tests | Temporal's test environment lets you fast-forward time, making tests for "wait 7 days" steps run in milliseconds |
| Observability | Temporal Web UI + OpenTelemetry | Datadog APM | Temporal Web UI shows full execution history per workflow; add OpenTelemetry for cross-service traces |
5. Non-functional characteristics¶
| Concern | Profile |
|---|---|
| Scalability | Temporal workers scale horizontally — add worker pods to increase concurrent workflow executions. Temporal Server itself requires careful horizontal scaling for > 1,000 workflow starts/second (add History shards). Activity workers are stateless and scale independently. |
| Availability target | 99.9%+ with Temporal Server HA deployment (multiple frontend and history replicas). A Temporal Server outage delays new workflow starts; in-progress workflows resume from their last durable checkpoint automatically on recovery — no data is lost. |
| Latency target | Temporal overhead per activity dispatch: 10–50 ms. Total workflow duration is the sum of activity execution times plus queue delays. Design for the happy-path duration; add explicit timeout budgets at the activity and workflow level to bound the worst case. |
| Security posture | Temporal Server behind an internal load balancer — no public exposure. mTLS between workers and the Temporal Frontend service. Namespace-level isolation in Temporal for multi-tenant deployments. Workflow history containing PII uses a Data Converter to encrypt payloads at rest. |
| Data residency | Workflow history (including activity inputs and outputs) persists in Temporal's Postgres database. If activities exchange PII, use a custom Data Converter that encrypts payloads before storage. |
| Compliance fit | SOC 2 ✓ — Temporal provides a complete, tamper-evident execution history with timestamps for every step. GDPR: configure namespace history retention (the default 72 hours is too short for most business workflows; set to 30–90 days and encrypt PII payloads). HIPAA ✓ with encrypted Postgres + Data Converter + BAA on infrastructure. |
6. Cost ballpark¶
Indicative monthly USD cost. Temporal Cloud pricing is per "workflow action" (each activity schedule, start, completion, signal counts).
| Scale | Workflow executions / month | Monthly cost | Cost drivers |
|---|---|---|---|
| Small | < 10,000 | $200 - $700 | Self-hosted Temporal on 2 K8s pods + Postgres, or Temporal Cloud free tier |
| Medium | 10k - 1M | $700 - $5,000 | Temporal Cloud (~$0.002/action), worker compute, larger Postgres |
| Large | 1M+ | $5,000 - $20,000 | Temporal Cloud at scale, or self-hosted K8s fleet + Cassandra, dedicated worker fleet |
7. LLM-assisted development fit¶
| Aspect | Rating | Notes |
|---|---|---|
| Workflow definition scaffolding | ★★★★★ | Excellent — Temporal SDK patterns for Go, Python, and TypeScript are well-represented in training data. |
| Activity implementation boilerplate | ★★★★★ | Excellent — activity code is ordinary Go/Python; generate freely and review business logic. |
| RetryPolicy and timeout configuration | ★★★★ | Good; timeout values require domain knowledge (how long should a payment API call be allowed to take?). |
| Compensating transaction (saga) design | ★★★ | Understands the pattern; the correctness of multi-step rollback chains requires careful human design and explicit tests for every failure branch. |
| Architecture decisions | ★ | Don't outsource. Use ADRs. |
Recommended workflow: Start with a three-step workflow (start → one activity → complete) to validate the Temporal setup end-to-end before building complex flows. Write the compensating transaction path before the happy path — if you design the happy path first, the compensation is often an afterthought that breaks under real failures.
8. Reference implementations¶
- Public reference: temporalio/temporal — the Temporal server; architecture documentation and the Go SDK reference (200 OK ✓)
- Public reference: temporalio/samples-python — official Python SDK samples covering child workflows, signals, timers, human-in-the-loop, and saga patterns (200 OK ✓)
- Internal case study: Add your anonymised internal example here
9. Related decisions (ADRs)¶
10. Known risks & gotchas¶
- Non-deterministic workflow code breaks replay — Temporal reconstructs workflow state by replaying execution history; if workflow code contains
time.Now(), random numbers, or direct API calls (not wrapped as activities), replay produces a different result and the workflow panics with a non-determinism error. Rule: workflow functions must be deterministic; every side effect and every external call goes in an activity. - Activity timeouts not set cause permanent hangs — an activity calls an external API with no timeout; the API hangs indefinitely; the workflow waits forever with no alert. Mitigation: every activity must set a
StartToCloseTimeout(maximum time for a single attempt) and aScheduleToCloseTimeout(maximum total time including retries); treat timeout omission as a code review failure. - Workflow history grows too large for long-running processes — a monthly billing workflow that accumulates thousands of events across a year causes replay to slow to seconds. Mitigation: use
ContinueAsNewto reset workflow history periodically in any workflow that runs for more than a few days. - Human approval step with no timeout waits forever — a workflow waits for a Signal (human clicks "approve"); the human is on holiday; the workflow blocks the associated business object for weeks. Mitigation: always pair a Signal wait with a timer; automatically escalate to a secondary approver or cancel with a notification after a configured timeout.
- Worker deployment removes a registered activity — a rolling deploy removes an activity function that in-flight workflows are currently calling; those workflows get stuck with "activity type not registered" errors. Mitigation: use Temporal's versioning API (
workflow.GetVersionin Go,workflow.patchedin Python) to branch logic on version; keep old activity implementations deployed until all in-flight workflows using them complete.