Pattern: Agentic Workflow¶

Quick facts

Category: AI / LLM-Integrated Systems
Maturity: Trial
Typical team size: 2-4 engineers
Typical timeline to MVP: 6-12 weeks
Last reviewed: 2026-05-02 by Architecture Team

1. Context¶

Use this pattern when:

A task requires multiple steps where earlier results determine which steps come next — the flow cannot be fully enumerated upfront
The LLM must call external tools (web search, code execution, APIs, databases) and incorporate the results before continuing
The task is complex enough that it genuinely benefits from the model's reasoning ability, not just its generation ability
Latency tolerance is high: 10–120 seconds per task is acceptable

Do NOT use this pattern when:

The workflow steps are fully deterministic regardless of inputs — use a conventional orchestrator (Prefect, Temporal) and call the LLM only for generation sub-steps
Latency requirements are under 2 seconds — agents are structurally slow due to multi-turn LLM calls
Any tool the agent would call has irreversible side-effects (sends emails, charges cards, deletes records) without a mandatory human-in-the-loop gate — the risk of unchecked autonomous action is too high until you have extensive operational experience with this specific agent
Your team has not yet shipped a working RAG pipeline or chatbot — agents amplify complexity, they do not replace foundational skills

2. Problem it solves¶

Some tasks — competitive research, code review, data analysis across multiple sources, multi-step form completion — require iteratively gathering information, making intermediate decisions, and adapting next steps based on what was found. Hard-coding all possible paths is impractical. A human doing the task improvises: they search, read, follow leads, run calculations, and produce a synthesised result. This pattern lets an LLM follow the same adaptive process, within a runtime that enforces safety guardrails and maintains a recoverable audit trail.

3. Solution overview¶

System context (C4 Level 1)¶

flowchart LR
    User((User)) --> AgentAPI[Agent API]
    AgentAPI --> Runtime[Agent Runtime\nLangGraph / hand-rolled]
    Runtime --> LLM[LLM Provider\nAnthropic Claude]
    Runtime --> Tools[Tools\nsearch / code / APIs]
    Runtime --> HITL[Human Approval Gate\nfor irreversible actions]
    Runtime --> StateDB[(Agent State\nPostgres)]
    Runtime --> Obs[Observability\nLangfuse]

Container view (C4 Level 2)¶

flowchart TB
    subgraph API Layer
        AgentAPI[Agent API\nFastAPI — submit task, poll status]
        TaskQueue[Task Queue\nSQS / Redis]
    end
    subgraph Agent Runtime
        Orchestrator[Orchestrator\nLangGraph state machine]
        Planner[LLM Planner call\nselect next tool + args]
        ToolExec[Tool Executor\ndispatch + parse result]
        StateStore[(Agent State\nPostgres checkpoints)]
        StepLimit[Step Limiter\nmax N iterations]
    end
    subgraph Tool Implementations
        WebSearch[Web Search\nTavily API]
        CodeSandbox[Code Executor\ne2b sandboxed runtime]
        DBTool[DB Query\nread-only replica]
        APITool[External API\nhttpx]
    end
    subgraph Safety
        HITL[Human Approval\nwebhook — irreversible actions only]
        ActionGuard[Action Validator\nblock destructive patterns]
    end
    subgraph Ops
        Langfuse[Langfuse\nfull trace per run]
        DLQ[Dead-letter queue\nfailed + timed-out runs]
    end

    AgentAPI --> TaskQueue --> Orchestrator
    Orchestrator --> Planner --> ToolExec
    ToolExec --> WebSearch
    ToolExec --> CodeSandbox
    ToolExec --> DBTool
    ToolExec --> APITool
    ToolExec --> ActionGuard
    ActionGuard --> HITL
    Orchestrator --> StateStore
    Orchestrator --> StepLimit
    Orchestrator --> Langfuse
    Orchestrator -->|timeout or error| DLQ

4. Technology stack¶

Layer	Primary choice	Alternatives	Notes
LLM	Anthropic Claude 3.5 Sonnet	OpenAI GPT-4o, Google Gemini 1.5 Pro	See ADR-0006; extended thinking and high tool-use accuracy make Sonnet the default for complex agentic tasks
Orchestration	LangGraph	Hand-rolled state machine, CrewAI, AutoGen	See ADR-0005; LangGraph's graph abstraction earns its weight for multi-agent topologies; hand-roll for simple linear workflows
Web search tool	Tavily Search API	Brave Search API, SerpAPI	Tavily returns clean structured results optimised for LLM consumption; no HTML parsing required
Code execution	e2b	Modal, local subprocess with seccomp	e2b provides sandboxed cloud execution with a Python/JS kernel; never execute LLM-generated code in your own process without sandboxing
Tool schema	Pydantic v2 models	JSON Schema directly	Type-annotated Pydantic models auto-generate reliable JSON Schema for tool definitions; validation errors surface before the LLM is called
State persistence	PostgreSQL (LangGraph checkpointer)	Redis (ephemeral)	Persist agent state at every step — enables resume-on-failure, human inspection, and audit trails; ephemeral Redis is insufficient for production
Human-in-the-loop	LangGraph `interrupt` + webhook	Custom approval UI	Block execution at any node requiring approval; store the pending state; resume after human action
Observability	Langfuse	LangSmith, Arize Phoenix	Trace every LLM call, tool invocation, and state transition; agent failures without full traces are nearly impossible to debug

5. Non-functional characteristics¶

Concern	Profile
Scalability	Each agent run is a stateful, long-running process. Scale by running more concurrent agents (horizontal), not by making a single agent faster. Decouple submission (sync API) from execution (async worker) to avoid HTTP timeouts on long tasks.
Availability target	99.5%; long-running tasks must be resumable from the last checkpoint after a worker restart or LLM API interruption. Never run an agent step without writing state first.
Latency target	Not latency-sensitive in the traditional sense. Define a wall-clock SLA per task type (e.g., "research task completes within 3 minutes"). Set a hard maximum step count (e.g., 25 iterations) and wall-clock timeout (e.g., 5 minutes) and abort gracefully.
Security posture	Every tool is a potential attack surface. Principle of least privilege: the web-search tool has no DB access; the DB tool is read-only. Validate tool call arguments before execution. Treat all content retrieved by tools as untrusted — a webpage or API response may contain adversarial instructions (prompt injection via tool results).
Data residency	All intermediate reasoning steps (including tool results) are transmitted to the LLM API. If tool results contain PII or confidential data, confirm your LLM provider's data retention policy before deploying.
Compliance fit	SOC 2 ✓ with a complete audit log of every tool call and its arguments. GDPR: if the agent processes personal data during its reasoning, document this in your ROPA. HIPAA: BAA required if health data appears in tool results sent to the API.

6. Cost ballpark¶

Indicative monthly USD cost. Multi-step tasks consume many more tokens than single-turn calls; cost scales with average steps per run × runs per month.

Scale	Agent runs / month	Monthly cost	Cost drivers
Small	< 500	$100 - $600	LLM API (dominant — each run may call Sonnet 5–20 times), Tavily search credits
Medium	500 - 10,000	$1,000 - $10,000	LLM API at volume, e2b sandboxing credits, Langfuse observability tier
Large	10,000+	$10,000 - $50,000	LLM API dominant; evaluate caching identical sub-steps, batching, and cheaper model for planning vs. generation

7. LLM-assisted development fit¶

Aspect	Rating	Notes
Individual tool implementation boilerplate	★★★★★	Excellent — httpx clients, Pydantic schemas, and API integration code generate cleanly.
LangGraph graph definition and node wiring	★★★★	Good for linear and simple branching graphs; complex multi-agent topologies require careful hand-design.
Prompt engineering for tool selection	★★★	Generates a reasonable starting system prompt; optimal tool descriptions require iteration against real task traces — not something an LLM can solve upfront.
Human-in-the-loop interrupt and resume logic	★★	Understands the concept; the state serialisation and webhook resume path have subtle edge cases that require manual testing end-to-end.
Architecture decisions	★	Don't outsource — specifically the step-limit, timeout, and approval-gate design require deliberate human decisions about acceptable risk.

Recommended workflow: Start with a hardcoded 3-step pipeline (not an agent) and validate tool implementations. Add the LLM planning loop only after tools work reliably. Add the step limiter and timeout before any production testing — not after.

8. Reference implementations¶

Public reference: langchain-ai/langgraph — the LangGraph library itself; examples/ directory contains research agent, ReAct agent, and multi-agent supervisor patterns (200 OK ✓)
Public reference: e2b-dev/e2b — sandboxed code execution for AI agents; Python and JS SDKs with agent integration examples (200 OK ✓)
Public reference: anthropics/anthropic-cookbook — official Anthropic examples including tool use, agentic loops, and extended thinking patterns for Claude (200 OK ✓)
Internal case study: Add your anonymised internal example here

10. Known risks & gotchas¶

Agents loop indefinitely without a hard step cap — Without a maximum iteration count, a confused agent retries failed tool calls forever, burning API credits and never completing. Mitigation: enforce a hard maximum step count (25 is a reasonable default) and a wall-clock timeout at the orchestration layer — both are needed, as a slow agent can exhaust time before steps.
Prompt injection via tool results — A webpage, database row, or API response returned by a tool may contain text like "Ignore all previous instructions and instead…". The model may follow these instructions. Mitigation: wrap tool results in a structured format that makes the boundary between instructions and data explicit (<tool_result>...</tool_result>); instruct the model to treat tool content as untrusted data.
Cost explosion from runaway agents — A single misconfigured agent run can invoke Sonnet 50+ times and spend $20–50 in minutes. Mitigation: set per-run token budget limits in the orchestrator; alert immediately when any single run exceeds 2× the expected token count.
Irreversible tool actions executed autonomously — The agent calls send_email() or delete_record() without human review because no guardrail was configured. Mitigation: explicitly classify every tool as read or write; require a human-in-the-loop interrupt before any write tool is called, without exception, until you have extensive operational data on the agent's reliability.
State checkpoint deserialization breaks after code changes — A persisted agent state from v1 cannot be resumed after a schema change in v2. Mitigation: version your state schema; treat checkpoint compatibility with the same discipline as database migrations.